Data!

This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at arb@cs.cornell.edu.

Temporal higher-order networks

Each of these datasets is a timestamped sequence of simplices, where a simplex is a set of k nodes from some vertex set. The datasets also contain weighted projected graphs, where the weight is the number of times that two nodes co-appear in a simplex. These datasets were used in the following paper:
  • Simplicial closure and higher-order link prediction.
    Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
    arXiv:1802.06916, 2018.
    Code available at github.com/arbenson/ScHoLP-Tutorial.
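As a minimal sketch of the projected-graph definition above, the following Python computes co-appearance weights from a list of timestamped simplices. The in-memory layout (a list of `(timestamp, nodes)` pairs), the function name `projected_graph`, and the toy data are assumptions made for illustration; they do not describe the format of the released files or the code in the repository.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Return {(u, v): weight}, where weight is the number of simplices
    that contain both u and v (with u < v)."""
    weights = Counter()
    for _timestamp, nodes in simplices:
        # Deduplicate nodes so a repeated node does not inflate the count.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical toy data: (timestamp, nodes) pairs.
toy_simplices = [
    (10, [1, 2, 3]),
    (20, [2, 3]),
    (30, [1, 4]),
]
toy_weights = projected_graph(toy_simplices)
```

Here nodes 2 and 3 co-appear in two simplices, so the projected edge (2, 3) gets weight 2, while every other edge gets weight 1.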
Dataset pages:

Graphs with planted vertex covers

Each of these datasets is a timestamped set of edges in a graph, where the graph has some "planted" vertex cover coming from the data collection process. Specifically, all of the graphs measured communication involving a set of nodes, and this set of nodes serves as the planted vertex cover. These datasets were used in the following paper:
  • Found Graph Data and Planted Vertex Covers.
    Austin R. Benson and Jon Kleinberg.
    arXiv:1805.01209, 2018.
    Code available at github.com/arbenson/FGDnPVC.
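To make the vertex cover property concrete, here is a small sketch that checks whether a candidate set of nodes touches every edge. The edge layout `(u, v, timestamp)`, the function name `is_vertex_cover`, and the toy data are hypothetical and chosen only for illustration; this is a definition check, not the method from the paper.

```python
def is_vertex_cover(edges, cover):
    """Return True if every edge has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v, _timestamp in edges)


# Hypothetical toy data: communication is only recorded when it involves
# one of the monitored nodes "a" or "b".
toy_edges = [
    ("a", "x", 1),
    ("b", "y", 2),
    ("a", "b", 3),
]
covered_by_monitored = is_vertex_cover(toy_edges, {"a", "b"})
covered_by_a_alone = is_vertex_cover(toy_edges, {"a"})
```

The monitored set `{"a", "b"}` covers every edge, while `{"a"}` alone misses the edge between "b" and "y".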
Dataset pages:

Sequences of Sets

These datasets are sequences of sets. Formally, a dataset consists of a collection of sequences, where each sequence is a time-ordered list of subsets of some universal set. These datasets were used in the following paper:
  • Sequences of Sets.
    Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
    In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
    Code available at github.com/arbenson/Sequences-of-Sets.
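One way to hold such data in memory is as a list of sequences, each a time-ordered list of sets. The sketch below assumes that layout and computes, for each set in one sequence, which of its elements already appeared earlier in the same sequence. The function name `repeated_elements` and the toy data are hypothetical, and this statistic is only an example, not necessarily one used in the paper.

```python
def repeated_elements(sequence):
    """For each set in a time-ordered sequence, return the elements that
    already appeared in an earlier set of the same sequence."""
    seen = set()
    repeats = []
    for current in sequence:
        current = set(current)
        repeats.append(current & seen)
        seen |= current
    return repeats


# Hypothetical toy sequence over the universal set {"a", "b", "c", "d"}.
toy_sequence = [{"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
toy_repeats = repeated_elements(toy_sequence)
```

In this toy sequence the first set has no repeats, the second repeats "b", and the third repeats "a" and "c".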
Dataset pages:

Discrete subset choices

These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the following paper:
  • A Discrete Choice Model for Subset Selection.
    Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
    In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018.
    Code available at github.com/arbenson/discrete-subset-choice.
Universal choice dataset pages:
Variable choice dataset pages:
  • vchoice-Yc-Items: sets of items purchased from the items viewed in a browsing session on an e-commerce web site.
  • vchoice-Yc-Cats: sets of product categories from which purchases were made from a browsing session on an e-commerce web site.
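The distinction between universal and variable choice sets can be sketched in a few lines. The code below assumes each observation is a pair `(alternatives, selected)`, where `selected` is a subset of `alternatives`; that layout, the function name `has_universal_choice_set`, and the toy data are hypothetical and do not reflect the released file format.

```python
def has_universal_choice_set(observations):
    """Return True if every observation offers the same set of alternatives."""
    choice_sets = set()
    for alternatives, selected in observations:
        # A selection must be drawn from the alternatives that were offered.
        if not set(selected) <= set(alternatives):
            raise ValueError("selection is not a subset of the choice set")
        choice_sets.add(frozenset(alternatives))
    return len(choice_sets) <= 1


# Hypothetical toy data: the same alternatives are offered every time.
toy_universal = [
    ({"a", "b", "c"}, {"a"}),
    ({"a", "b", "c"}, {"b", "c"}),
]
# Hypothetical toy data: the alternatives differ between choices, as with
# the items viewed in different browsing sessions.
toy_variable = [
    ({"a", "b"}, {"a"}),
    ({"b", "c", "d"}, {"c", "d"}),
]
universal_flag = has_universal_choice_set(toy_universal)
variable_flag = has_universal_choice_set(toy_variable)
```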

Manhattan taxi cab trajectories

This dataset contains 1,000 sequences of neighborhoods of Manhattan visited by taxi cabs over a one-year period. The dataset was used in the following paper:
  • The spacey random walk: a stochastic process for higher-order data.
    Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
    SIAM Review (Research Spotlights) 59:2, 321–345, 2017.
    Code available at github.com/arbenson/spacey-random-walks.
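As an illustration of the kind of higher-order summary that trajectory data supports, the sketch below counts second-order transitions, that is, how often a trip from one neighborhood to a second is followed by a third. The list-of-lists layout, the function name `second_order_counts`, and the neighborhood labels in the toy data are assumptions for illustration; this is a simple counting example and not the spacey random walk itself.

```python
from collections import Counter


def second_order_counts(trajectories):
    """Count consecutive triples (previous, current, next) over all trajectories."""
    counts = Counter()
    for trajectory in trajectories:
        for prev, curr, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
            counts[(prev, curr, nxt)] += 1
    return counts


# Hypothetical toy trajectories; the labels are placeholders.
toy_trajectories = [
    ["Chelsea", "Midtown", "Chelsea", "Midtown"],
    ["Midtown", "Chelsea", "Midtown"],
]
toy_counts = second_order_counts(toy_trajectories)
```

Trajectories with fewer than three stops contribute no triples, because `zip` stops at the shortest input.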
Dataset page:

Flow cytometry

This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the following paper:
  • Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
    Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
    In Proceedings of Neural Information Processing Systems (NIPS), 2014.
    Code available at github.com/arbenson/mrnmf.
Dataset page: