This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at firstname.lastname@example.org.
Temporal higher-order networks
Each of these datasets is a timestamped sequence of simplices, where a simplex is a set of k nodes from some vertex set. The datasets also contain weighted projected graphs, where the weight is the number of times that two nodes co-appear in a simplex. These datasets were used in the paper
- Simplicial closure and higher-order link prediction.
Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
- coauth-DBLP: co-authorship on DBLP papers.
- coauth-MAG-Geology: co-authorship on Geology papers.
- coauth-MAG-History: co-authorship on History papers.
- tags-stack-overflow: sets of tags applied to questions on stackoverflow.com.
- tags-math-sx: sets of tags applied to questions on math.stackexchange.com.
- tags-ask-ubuntu: sets of tags applied to questions on askubuntu.com.
- threads-stack-overflow: sets of users asking and answering questions in threads on stackoverflow.com.
- threads-math-sx: sets of users asking and answering questions in threads on math.stackexchange.com.
- threads-ask-ubuntu: sets of users asking and answering questions in threads on askubuntu.com.
- NDC-substances: sets of substances making up drugs.
- NDC-classes: sets of classifications applied to drugs.
- DAWN: sets of drugs used by patients recorded in emergency room visits.
- congress-bills: sets of congresspersons cosponsoring bills.
- email-Eu: sets of email addresses on emails.
- email-Enron: sets of email addresses on emails.
- contact-high-school: groups of people in contact at a high school.
- contact-primary-school: groups of people in contact at a primary school.
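As a rough illustration of the weighted projected graph described above, the following Python sketch counts how many simplices each pair of nodes shares. The in-memory representation (a list of `(timestamp, nodes)` pairs), the function name `projected_graph`, and the toy data are assumptions made for this example; they do not reflect the file layout of the datasets themselves.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Return {(u, v): weight}, where weight is the number of simplices
    in which nodes u and v co-appear (with u < v)."""
    weights = Counter()
    for _timestamp, nodes in simplices:
        # Deduplicate and sort so that every pair has one canonical order.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical toy data, not drawn from any of the datasets listed above.
toy_simplices = [
    (1, [1, 2, 3]),  # at time 1, nodes 1, 2, and 3 appear together
    (2, [2, 3]),     # at time 2, nodes 2 and 3 appear together again
    (3, [1, 4]),
]

graph = projected_graph(toy_simplices)
```

Timestamps are ignored here because the projection only counts co-appearances; a temporal analysis would keep them.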
Discrete subset choices
These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the paper
- A Discrete Choice Model for Subset Selection.
Austin R. Benson, Ravi Kumar, and Andrew Tomkins.
In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018.
Code available at github.com/arbenson/discrete-subset-choice.
- uchoice-Bakery: sets of items purchased at a bakery.
- uchoice-Walmart-Items: sets of items purchased at Walmart.
- uchoice-Walmart-Depts: sets of departments from which items were purchased at Walmart.
- uchoice-Kosarak: sets of web pages viewed in a browsing session.
- uchoice-Instacart: sets of items purchased from Instacart.
- uchoice-Lastfm-Genres: sets of genres of music played by users in listening sessions.
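To make the distinction between universal and variable choice sets concrete, here is a minimal Python sketch. The `SubsetChoice` class, the helper `has_universal_choice_set`, and the item names are all hypothetical and chosen only for illustration; they are not the data format or the code from the repository mentioned above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class SubsetChoice:
    """One observed choice: the alternatives offered and the subset selected."""
    choice_set: FrozenSet[str]
    selection: FrozenSet[str]

    def __post_init__(self):
        # A selection must be drawn from the alternatives that were offered.
        if not self.selection <= self.choice_set:
            raise ValueError("selection must be a subset of choice_set")


def has_universal_choice_set(choices: Iterable[SubsetChoice]) -> bool:
    """True if every choice was made from the same set of alternatives."""
    return len({c.choice_set for c in choices}) <= 1


# Hypothetical alternatives, for illustration only.
menu = frozenset({"bread", "cake", "coffee", "tea"})

universal = [
    SubsetChoice(menu, frozenset({"bread", "coffee"})),
    SubsetChoice(menu, frozenset({"cake"})),
]
variable = [
    SubsetChoice(frozenset({"bread", "cake"}), frozenset({"bread"})),
    SubsetChoice(frozenset({"coffee", "tea", "cake"}), frozenset({"tea", "cake"})),
]

universal_flag = has_universal_choice_set(universal)
variable_flag = has_universal_choice_set(variable)
```

With a universal choice set, only the selections need to be stored per observation; with variable choice sets, each observation must also record which alternatives were available.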
Manhattan taxi cab trajectories
This dataset contains 1,000 sequences of Manhattan neighborhoods visited by taxi cabs over a one-year period. The dataset was used in the paper
- The spacey random walk: a stochastic process for higher-order data.
Austin R. Benson, David F. Gleich, and Lek-Heng Lim.
SIAM Review (Research Spotlights) 59:2, 321–345, 2017.
Code available at github.com/arbenson/spacey-random-walks.
Flow cytometry
This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the paper
- Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.
Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.
In Proceedings of Neural Information Processing Systems (NIPS), 2014.
Code available at github.com/arbenson/mrnmf.