Data!

This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at arb@cs.cornell.edu.

Temporal higher-order networks (hypergraphs)
Each of these datasets is a timestamped sequence of simplices, where a simplex is a set of k nodes from some vertex set. The datasets also contain weighted projected graphs, where the weight is the number of times that two nodes co-appear in a simplex. These datasets were used in the following paper:
  • Simplicial closure and higher-order link prediction.
    Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
    Proceedings of the National Academy of Sciences (PNAS), 2018.
    Code available at github.com/arbenson/ScHoLP-Tutorial.
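As a rough illustration of the weighted projected graph described above, the following sketch counts how often each pair of nodes co-appears in a simplex. The in-memory format (a list of `(timestamp, nodes)` pairs), the function name `project`, and the toy data are assumptions made for this example; they are not the on-disk format of the datasets or the code from the paper.

```python
from collections import Counter
from itertools import combinations


def project(simplices):
    """Return {(u, v): weight} with u < v, where weight is the number
    of simplices in which u and v co-appear."""
    weights = Counter()
    for _timestamp, nodes in simplices:
        # Deduplicate so a node listed twice in one simplex counts once.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Toy, made-up data: (timestamp, nodes in the simplex).
toy_simplices = [
    (10, [1, 2, 3]),
    (20, [2, 3]),
    (30, [1, 4]),
]
projected = project(toy_simplices)
print(projected)
```

Timestamps are ignored here because the projection, as described, only counts co-appearances.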
Dataset pages:
Hypergraphs with labeled nodes
Each of these datasets is a hypergraph where the nodes are labeled into discrete classes. These can be used for community detection or node prediction experiments. We used them in the following papers:
  • Generative hypergraph clustering: from blockmodels to modularity.
    Philip S. Chodrow, Nate Veldt, and Austin R. Benson.
    Science Advances, 2021.
    Code available at github.com/PhilChodrow/HypergraphModularity.
  • Minimizing Localized Ratio Cut Objectives in Hypergraphs.
    Nate Veldt, Austin R. Benson, and Jon Kleinberg.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
    Code available at github.com/nveldt/HypergraphFlowClustering.
  • Clustering in graphs and hypergraphs with categorical edge labels.
    Ilya Amburg, Nate Veldt, and Austin R. Benson.
    Proceedings of the Web Conference (WWW), 2020.
    Code available at github.com/nveldt/CategoricalEdgeClustering.
Dataset pages:
  • stackoverflow-answers: sets of questions answered by users on Stack Overflow, where labels are question tags.
  • mathoverflow-answers: sets of questions answered by users on Math Overflow, where labels are question tags.
  • walmart-trips: sets of products bought on Walmart shopping trips, where labels are departments of products.
  • amazon-reviews: sets of products reviewed by users on Amazon, where labels are product categories.
  • trivago-clicks: sets of hotels clicked on in a Web browsing session, where labels are the countries of the accommodation.
  • contact-primary-school: sets of students in proximity, where labels are classrooms.
  • contact-high-school: sets of students in proximity, where labels are classrooms.
  • senate-bills: bill cosponsorship in the US Senate, where labels are political affiliation.
  • house-bills: bill cosponsorship in the US House of Representatives, where labels are political affiliation.
  • senate-committees: committee membership in the US Senate, where labels are political affiliation.
  • house-committees: committee membership in the US House of Representatives, where labels are political affiliation.
US county networks for node regression
These are networks of US counties, where edges come from physical adjacency or Facebook connectedness. The nodes are accompanied by various covariates, such as demographic features, climate measurements, and election statistics, depending on the dataset. We used these for transductive node regression experiments. Some of the datasets have demographic features and election statistics from both 2012 and 2016, and we used these for inductive learning experiments. The data was used in the following papers:
  • A Unifying Generative Model for Graph Learning Algorithms: Label Propagation, Graph Convolutions, and Combinations.
    Junteng Jia and Austin R. Benson.
    arXiv:2101.07730, 2021.
    Code available at github.com/000Justin000/GaussianMRF.
  • Residual Correlation in Graph Neural Network Regression.
    Junteng Jia and Austin R. Benson.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020.
    Code available at github.com/000Justin000/gnn-residual-correlation.
Dataset pages:
Hypergraphs with categorical edge labels
Each of these datasets is a hypergraph or graph whose edges each carry a discrete (categorical) label. These datasets were collected for and analyzed in the following papers:
Dataset pages:
  • cat-edge-Cooking: sets of ingredients in recipes labeled by cuisine type.
  • cat-edge-DAWN: sets of drugs used by patients recorded in emergency room visits labeled by the most common patient disposition for that combination of drugs.
  • cat-edge-Walmart-Trips: sets of products bought on Walmart shopping trips categorized by a trip type.
  • cat-edge-MAG-10: co-authorship with publication venue labels.
  • cat-edge-Brain: set of brain region coactivation scores based on two categories of measurement.
  • cat-edge-music-blues-reviews: sets of reviewers categorized by product review type, with set membership given by timestamp similarity.
  • cat-edge-madison-restaurant-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
  • cat-edge-vegas-bars-reviews: sets of reviewers categorized by establishment review type, with set membership given by timestamp similarity.
  • cat-edge-algebra-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.
  • cat-edge-geometry-questions: sets of users categorized by question tag type, with set membership given by timestamp similarity.
Largeish weighted graphs
Each of these datasets is an undirected weighted graph of nontrivial size. These datasets were used in the following paper:
Dataset pages:
Graphs and hypergraphs with core-fringe structure
Each of these datasets is a graph or hypergraph, where the nodes are labeled as "core" or "fringe" according to the data collection process. Specifically, all of the graphs measured communication involving a set of nodes, and this set of nodes serves as the core. This induces what we call "core-fringe" structure in the network. We studied how well one can recover the core-labeled nodes from the network structure. In this setup, the core nodes form a "planted vertex cover" in the graph case and a "planted hitting set" in the hypergraph case. We studied these datasets in the following papers:
  • Found Graph Data and Planted Vertex Covers.
    Austin R. Benson and Jon Kleinberg.
    Advances in Neural Information Processing Systems (NeurIPS), 2018.
    Code available at github.com/arbenson/FGDnPVC.
  • Planted Hitting Set Recovery in Hypergraphs.
    Ilya Amburg, Jon Kleinberg, and Austin R. Benson.
    Journal of Physics: Complexity (Special Issue on Higher-Order Structures in Networks and Network Dynamical Systems), 2021.
    Code available at github.com/ilyaamburg/Hypergraph-Planted-Hitting-Set-Recovery.
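The "planted vertex cover" and "planted hitting set" conditions mentioned above amount to the same check: every edge (or hyperedge) must contain at least one core node. Here is a minimal sketch of that check. The function name `is_hitting_set` and the toy data are hypothetical, and the sketch only verifies the property; it is not the recovery method studied in the papers.

```python
def is_hitting_set(core, hyperedges):
    """Return True if every hyperedge contains at least one core node.

    For an ordinary graph, each edge is a hyperedge of size two, so the
    same check tests whether `core` is a vertex cover.
    """
    core = set(core)
    return all(core & set(edge) for edge in hyperedges)


# Toy, made-up example: nodes 1 and 2 are the core.
toy_core = {1, 2}
toy_edges = [(1, 3), (2, 4), (1, 2), (2, 5, 6)]

print(is_hitting_set(toy_core, toy_edges))             # every edge touches the core
print(is_hitting_set(toy_core, toy_edges + [(3, 4)]))  # the fringe-only edge (3, 4) breaks it
```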
Dataset pages:
Stack exchange co-tagging networks
These are 168 weighted co-tagging networks from Stack Exchange communities, where the weight of edge (i, j) is the number of questions that were annotated with both tags i and j. The data was analyzed in the following paper:
  • Modeling and Analysis of Tagging Networks in Stack Exchange Communities.
    Xiang Fu*, Shangdi Yu*, and Austin R. Benson (*equal contribution).
    Journal of Complex Networks, 2019.
    Code available at github.com/yushangdi/stack-exchange-cotagging.
Dataset page:
Temporal networks
These are temporal networks where (i, j, t) signifies a directed edge from i to j at time t. The networks were used in the following paper:
  • A sampling framework for counting temporal motifs.
    Paul Liu, Austin R. Benson, and Moses Charikar.
    Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
    Code available at gitlab.com/paul.liu.ubc/sampling-temporal-motifs.
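To make the (i, j, t) representation concrete, the sketch below counts one very simple temporal pattern by brute force: a directed edge u -> v followed by an edge v -> w within a time window `delta`. The pattern, the function name `count_two_edge_paths`, the parameter `delta`, and the toy edges are all assumptions chosen for illustration. This is exhaustive enumeration, not the sampling framework from the paper.

```python
def count_two_edge_paths(edges, delta):
    """Count ordered pairs of temporal edges (u, v, t1), (v, w, t2)
    with t1 < t2 <= t1 + delta."""
    edges = sorted(edges, key=lambda e: e[2])  # order by timestamp
    count = 0
    for idx, (_u, v, t1) in enumerate(edges):
        for x, _w, t2 in edges[idx + 1:]:
            if t2 - t1 > delta:
                break  # later edges are even further away in time
            if t2 > t1 and x == v:
                count += 1
    return count


# Toy, made-up temporal edges: (source, target, time).
toy_temporal_edges = [(1, 2, 0), (2, 3, 5), (2, 4, 20), (3, 1, 7)]
print(count_two_edge_paths(toy_temporal_edges, delta=10))
```

The double loop is quadratic in the worst case, which is acceptable for a toy example but motivates faster approaches on large networks.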
Dataset pages:
Spatial networks
Each of these datasets is a network whose nodes have spatial coordinates. These datasets were used in the following paper:
  • Detecting Core-Periphery Structure in Spatial Networks.
    Junteng Jia and Austin R. Benson.
    Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
    Code available at github.com/000Justin000/spatial_core_periphery.
Dataset pages:
Sequences of Sets
These datasets are sequences of sets. Formally, a dataset consists of a collection of sequences, where each sequence is a time-ordered list of subsets of some universal set. These datasets were used in the following paper:
  • Sequences of Sets.
    Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
    Code available at github.com/arbenson/Sequences-of-Sets.
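One way to hold such data in memory is a list of sequences, each of which is a time-ordered list of sets. The sketch below uses that layout and computes a simple statistic, the fraction of sets that exactly repeat an earlier set in the same sequence. The layout, the function name `repeat_fraction`, and the toy data are assumptions for illustration and do not reflect the dataset file format or the models in the paper.

```python
def repeat_fraction(sequence):
    """Fraction of sets after the first that are exact copies of an
    earlier set in the same time-ordered sequence."""
    seen = set()
    repeats = 0
    for subset in sequence:
        key = frozenset(subset)  # hashable, order-independent
        if key in seen:
            repeats += 1
        seen.add(key)
    later = len(sequence) - 1
    return repeats / later if later > 0 else 0.0


# Toy, made-up dataset: a collection of sequences of sets.
toy_dataset = [
    [{"a", "b"}, {"c"}, {"b", "a"}],  # third set repeats the first
    [{"x"}],                          # single set, nothing to repeat
]
fractions = [repeat_fraction(seq) for seq in toy_dataset]
print(fractions)
```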
Dataset pages:
Discrete subset choices
These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the following paper:
  • A Discrete Choice Model for Subset Selection.
    Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
    Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2018.
    Code available at github.com/arbenson/discrete-subset-choice.
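The distinction between universal and variable choice sets can be checked mechanically. In the sketch below, each record is a hypothetical `(alternatives, chosen)` pair, and a dataset has a universal choice set when every record offers the same alternatives. The record layout, the function names, and the toy data are assumptions for this example only.

```python
def has_universal_choice_set(records):
    """Return True if every record offers the same set of alternatives."""
    choice_sets = {frozenset(alternatives) for alternatives, _chosen in records}
    return len(choice_sets) <= 1


def selections_are_valid(records):
    """Return True if each chosen subset is drawn from its alternatives."""
    return all(set(chosen) <= set(alternatives) for alternatives, chosen in records)


# Toy, made-up records: (alternatives offered, subset chosen).
universal_records = [
    ({"a", "b", "c"}, {"a"}),
    ({"a", "b", "c"}, {"b", "c"}),
]
variable_records = [
    ({"a", "b", "c"}, {"a"}),
    ({"b", "d"}, {"d"}),
]

print(has_universal_choice_set(universal_records))
print(has_universal_choice_set(variable_records))
```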
Universal choice dataset pages:
Variable choice dataset pages:
  • vchoice-Yc-Items: sets of items purchased from the items viewed in a browsing session on an e-commerce web site.
  • vchoice-Yc-Cats: sets of product categories from which purchases were made from a browsing session on an e-commerce web site.
Genius.com data
This is a curated dataset of users, songs, and lyric annotations on the web site Genius.com. The dataset was used in the following paper:
  • Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform.
    Derek Lim and Austin R. Benson.
    Proceedings of the International Conference on Web and Social Media (ICWSM), 2021.
    Code available at github.com/cptq/genius-expertise.
Dataset page:
Manhattan taxi cab trajectories
This dataset contains 1,000 sequences of neighborhoods of Manhattan visited by taxi cabs over a one-year period. The dataset was used in the following paper:
  • The spacey random walk: a stochastic process for higher-order data.
    Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
    SIAM Review (Research Spotlights), 2017.
    Code available at github.com/arbenson/spacey-random-walks.
Dataset page:
Flow cytometry
This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the following paper:
  • Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
    Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
    Advances in Neural Information Processing Systems (NeurIPS), 2014.
    Code available at github.com/arbenson/mrnmf.
Dataset page: