This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at email@example.com.
Temporal higher-order networks
Each of these datasets is a timestamped sequence of simplices, where a simplex is a set of k nodes from some vertex set. The datasets also contain weighted projected graphs, where the weight is the number of times that two nodes co-appear in a simplex. These datasets were used in the following paper:
- Simplicial closure and higher-order link prediction.
Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
Code available at github.com/arbenson/ScHoLP-Tutorial.
- coauth-DBLP: co-authorship on DBLP papers.
- coauth-MAG-Geology: co-authorship on Geology papers.
- coauth-MAG-History: co-authorship on History papers.
- tags-stack-overflow: sets of tags applied to questions on stackoverflow.com.
- tags-math-sx: sets of tags applied to questions on math.stackexchange.com.
- tags-ask-ubuntu: sets of tags applied to questions on askubuntu.com.
- threads-stack-overflow: sets of users asking and answering questions on threads on stackoverflow.com.
- threads-math-sx: sets of users asking and answering questions on threads on math.stackexchange.com.
- threads-ask-ubuntu: sets of users asking and answering questions on threads on askubuntu.com.
- NDC-substances: sets of substances making up drugs.
- NDC-classes: sets of classifications applied to drugs.
- DAWN: sets of drugs used by patients recorded in emergency room visits.
- congress-bills: sets of congresspersons cosponsoring bills.
- email-Eu: sets of email addresses on emails.
- email-Enron: sets of email addresses on emails.
- contact-high-school: groups of people in contact at a high school.
- contact-primary-school: groups of people in contact at a primary school.
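As an illustration of the weighted projected graph described above, here is a minimal sketch that counts how many simplices each pair of nodes co-appears in. The function name `projected_graph`, the in-memory `(timestamp, nodes)` layout, and the toy data are assumptions made for this example; they do not reflect the file format of the datasets.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Count, for each unordered node pair, how many simplices contain both nodes."""
    weights = Counter()
    for simplex in simplices:
        # Deduplicate and sort so each pair gets one canonical key (u, v) with u < v.
        nodes = sorted(set(simplex))
        for u, v in combinations(nodes, 2):
            weights[(u, v)] += 1
    return weights


# Toy timestamped simplices (hypothetical data, not taken from any dataset above).
timestamped_simplices = [
    (10, [1, 2, 3]),
    (11, [2, 3]),
    (12, [1, 4]),
]
weights = projected_graph(nodes for _, nodes in timestamped_simplices)
```

In this toy input, nodes 2 and 3 co-appear in two simplices, so the edge (2, 3) has weight 2 and every other edge has weight 1.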
Graphs with planted vertex covers
Each of these datasets is a timestamped set of edges in a graph, where the graph has some "planted" vertex cover coming from the data collection process. Specifically, each graph records communication involving a particular set of nodes, so every edge has at least one endpoint in that set, and the set serves as the planted vertex cover.
- Found Graph Data and Planted Vertex Covers.
Austin R. Benson and Jon Kleinberg.
In Proceedings of Neural Information Processing Systems (NIPS), 2018.
Code available at github.com/arbenson/FGDnPVC.
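To make the vertex cover property concrete, the following sketch checks whether a candidate node set touches every edge of a timestamped edge list. The `(u, v, t)` tuple layout, the function name `is_vertex_cover`, and the toy edges are assumptions for illustration, not the format of the released files or the method of the paper.

```python
def is_vertex_cover(edges, cover):
    """Return True if every edge (u, v, t) has at least one endpoint in `cover`."""
    cover = set(cover)
    return all(u in cover or v in cover for u, v, _ in edges)


# Toy timestamped edges (hypothetical): all communication involves node 1 or node 2.
edges = [(1, 5, 100), (2, 1, 101), (2, 6, 105)]

covers_all = is_vertex_cover(edges, {1, 2})  # both monitored nodes
misses_one = is_vertex_cover(edges, {1})     # edge (2, 6) is not covered
```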
This is a temporal network where (i, j, t) signifies a directed edge from i to j at time t. The network was used in the following paper:
- A sampling framework for counting temporal motifs.
Paul Liu, Austin R. Benson, and Moses Charikar.
In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2019.
- temporal-reddit-reply: timestamped comment interactions on reddit.
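As a small illustration of working with (i, j, t) triples, the sketch below counts one simple temporal pattern by brute force: an edge from i to j followed by an edge from j back to i within a time window `delta`. This is not the sampling framework from the paper; the function name, the pattern chosen, and the toy edges are assumptions for this example.

```python
def count_reciprocated(edges, delta):
    """Count edge pairs (i, j, t1), (j, i, t2) with t1 < t2 <= t1 + delta."""
    edges = sorted(edges, key=lambda e: e[2])  # process edges in time order
    count = 0
    for a, (i, j, t1) in enumerate(edges):
        for k, l, t2 in edges[a + 1:]:
            if t2 - t1 > delta:
                break  # later edges are even further away in time
            if t2 > t1 and k == j and l == i:
                count += 1
    return count


# Toy temporal edges (hypothetical): (source, target, timestamp).
temporal_edges = [(1, 2, 0), (2, 1, 3), (1, 2, 4), (2, 1, 20), (3, 1, 5)]
n_reciprocated = count_reciprocated(temporal_edges, delta=5)
```

The nested loop is quadratic in the worst case, which is acceptable for a toy example but not for a network the size of the Reddit reply data.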
Sequences of Sets
These datasets are sequences of sets. Formally, a dataset consists of a collection of sequences, where each sequence is a time-ordered list of subsets of some universal set. These datasets were used in the following paper:
- Sequences of Sets.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2018.
Code available at github.com/arbenson/Sequences-of-Sets.
- sos-email-Enron-core: sets of recipients on emails from email addresses.
- sos-email-Eu-core: sets of recipients on emails from email addresses.
- sos-coauth-Business: sets of co-authors on publications from researchers.
- sos-coauth-Geology: sets of co-authors on publications from researchers.
- sos-tags-mathoverflow: sets of tags on MathOverflow questions from users.
- sos-tags-math-sx: sets of tags on Mathematics Stack Exchange questions from users.
- sos-contact-high-school: sets of proximity-based contacts from individuals at a high school.
- sos-contact-prim-school: sets of proximity-based contacts from individuals at a primary school.
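The "sequence of sets" structure defined above can be sketched as a time-ordered list of Python sets. The example computes, for each set, the fraction of its elements that already appeared earlier in the same sequence. The function name `repeat_fractions`, the statistic itself, and the toy sequence are assumptions chosen for illustration, not necessarily a quantity from the paper.

```python
def repeat_fractions(sequence):
    """For each non-empty set, return the fraction of its elements seen earlier."""
    seen = set()
    fractions = []
    for current in sequence:
        current = set(current)
        if current:
            fractions.append(len(current & seen) / len(current))
        seen |= current  # everything in this set counts as "seen" from now on
    return fractions


# One toy sequence (hypothetical), e.g. recipient sets on successive emails.
sequence = [{1, 2}, {2, 3}, {1, 2, 3}, {4}]
fractions = repeat_fractions(sequence)
```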
Discrete subset choices
These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the following paper:
- A Discrete Choice Model for Subset Selection.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018.
Code available at github.com/arbenson/discrete-subset-choice.
- uchoice-Bakery: sets of items purchased at a bakery.
- uchoice-Walmart-Items: sets of items purchased at Walmart.
- uchoice-Walmart-Depts: sets of departments from which items were purchased at Walmart.
- uchoice-Kosarak: sets of web pages viewed in a browsing session.
- uchoice-Instacart: sets of items purchased from Instacart.
- uchoice-Lastfm-Genres: sets of genres of music played by users in listening sessions.
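The distinction between universal and variable choice sets can be sketched as follows, assuming each record is a hypothetical `(alternatives, chosen)` pair. The record layout, the function name, and the toy data are assumptions for this example and do not describe the released file format.

```python
def has_universal_choice_set(records):
    """Return True if every choice was made from the same set of alternatives."""
    distinct_alternatives = {frozenset(alts) for alts, _ in records}
    return len(distinct_alternatives) <= 1


# Toy records (hypothetical): (available alternatives, chosen subset).
universal = [
    ({"bread", "cake", "pie"}, {"bread"}),
    ({"bread", "cake", "pie"}, {"cake", "pie"}),
]
variable = [
    ({"bread", "cake", "pie"}, {"bread"}),
    ({"cake", "pie"}, {"pie"}),
]

universal_flag = has_universal_choice_set(universal)
variable_flag = has_universal_choice_set(variable)
```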
Spatial networks with core-periphery structure
Each of these datasets is a network whose nodes have spatial coordinates. These datasets were used in the following paper:
- Detecting Core-Periphery Structure in Spatial Networks.
Junteng Jia and Austin R. Benson.
Code available at github.com/000Justin000/spatial_core_periphery.
- spatial-Celegans: C. elegans neural network.
- spatial-underground-London: Tube transportation network in London.
- spatial-fungi: Fungal networks constructed from experimental data.
- spatial-OpenFlights: World airline network from openflights.org.
- spatial-Brightkite: Brightkite location-based social network.
Manhattan taxi cab trajectories
This dataset contains 1,000 sequences of neighborhoods of Manhattan visited by taxi cabs over a one-year period. The dataset was used in the following paper:
- The spacey random walk: a stochastic process for higher-order data.
Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
SIAM Review (Research Spotlights) 59:2, 321–345, 2017.
Code available at github.com/arbenson/spacey-random-walks.
This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the following paper:
- Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
In Proceedings of Neural Information Processing Systems (NIPS), 2014.
Code available at github.com/arbenson/mrnmf.