## Data!

This is a collection of datasets from my research projects.
I strive to make the data used in my research easily accessible.
If you encounter problems, please email me at arb@cs.cornell.edu.

#### Temporal higher-order networks

Each of these datasets is a timestamped sequence of simplices, where
a simplex is a set of k nodes from some vertex set. The datasets
also contain weighted projected graphs, where the weight is the
number of times that two nodes co-appear in a simplex. These datasets
were used in the following paper:

- Simplicial closure and higher-order link prediction.

Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.

*arXiv:1802.06916*, 2018.

Code available at github.com/arbenson/ScHoLP-Tutorial.

- coauth-DBLP: co-authorship on DBLP papers.
- coauth-MAG-Geology: co-authorship on Geology papers.
- coauth-MAG-History: co-authorship on History papers.
- tags-stack-overflow: sets of tags applied to questions on stackoverflow.com.
- tags-math-sx: sets of tags applied to questions on math.stackexchange.com.
- tags-ask-ubuntu: sets of tags applied to questions on askubuntu.com.
- threads-stack-overflow: sets of users asking and answering questions on threads on stackoverflow.com.
- threads-math-sx: sets of users asking and answering questions on threads on math.stackexchange.com.
- threads-ask-ubuntu: sets of users asking and answering questions on threads on askubuntu.com.
- NDC-substances: sets of substances making up drugs.
- NDC-classes: sets of classifications applied to drugs.
- DAWN: sets of drugs used by patients recorded in emergency room visits.
- congress-bills: sets of congresspersons cosponsoring bills.
- email-Eu: sets of email addresses on emails.
- email-Enron: sets of email addresses on emails.
- contact-high-school: groups of people in contact at a high school.
- contact-primary-school: groups of people in contact at a primary school.
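
As a rough illustration of the weighted projected graph described above, here is a minimal sketch in Python. It assumes the simplices have already been loaded into memory as `(timestamp, nodes)` pairs; that in-memory layout, the function name `projected_graph`, and the toy data are all hypothetical and say nothing about the on-disk file format.

```python
from collections import Counter
from itertools import combinations


def projected_graph(simplices):
    """Count how many simplices each unordered node pair co-appears in.

    `simplices` is an iterable of (timestamp, nodes) pairs. This layout is
    an assumption made for illustration only.
    """
    weights = Counter()
    for _timestamp, nodes in simplices:
        # Deduplicate and sort so each pair gets one canonical key (u, v), u < v.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return weights


# Hypothetical toy data: four timestamped simplices on nodes 1..4.
toy_simplices = [
    (1, [1, 2, 3]),
    (2, [2, 3]),
    (3, [1, 4]),
    (4, [3]),
]
toy_weights = projected_graph(toy_simplices)
```

Simplices with a single node contribute no edges, and the timestamp is ignored because the projection aggregates over all time.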

#### Graphs with planted vertex covers

Each of these datasets is a timestamped set of edges in a graph, where the
graph has some "planted" vertex cover coming from the data collection process.
Specifically, each graph records communication involving a fixed set of nodes,
and this set of nodes serves as the planted vertex cover. These datasets were
used in the following paper:

- Found Graph Data and Planted Vertex Covers.

Austin R. Benson and Jon Kleinberg.

*arXiv:1805.01209*, 2018.

Code available at github.com/arbenson/FGDnPVC.

- pvc-email-W3C: email on W3C mailing lists.
- pvc-email-Enron: email involving Enron employees.
- pvc-call-Reality: phone calls made and received by participants in the Reality Mining project.
- pvc-text-Reality: SMS texts sent and received by participants in the Reality Mining project.
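
To make the vertex cover property concrete: a set of nodes is a vertex cover if every edge has at least one endpoint in the set. The sketch below checks this for a list of timestamped edges. The `(timestamp, u, v)` tuple layout, the function name `is_vertex_cover`, and the toy data are assumptions for illustration, not the format of the files.

```python
def is_vertex_cover(edges, cover):
    """Return True if every edge has at least one endpoint in `cover`.

    `edges` is an iterable of (timestamp, u, v) tuples. This layout is an
    assumption made for illustration only.
    """
    cover = set(cover)
    return all(u in cover or v in cover for _timestamp, u, v in edges)


# Hypothetical toy data: "a" and "b" play the role of the monitored nodes.
toy_edges = [(10, "a", "x"), (11, "a", "b"), (12, "b", "y")]
covered = is_vertex_cover(toy_edges, {"a", "b"})
not_covered = is_vertex_cover(toy_edges, {"a"})
```

In the toy data, `{"a"}` alone fails because the edge between `"b"` and `"y"` has no endpoint in it.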

#### Sequences of Sets

These datasets are sequences of sets. Formally, a dataset consists of a collection
of sequences, where each sequence is a time-ordered list of subsets of some universal
set. These datasets were used in the following paper:

- Sequences of Sets.

Austin R. Benson, Ravi Kumar, and Andrew Tomkins.

In *Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)*, 2018.

Code available at github.com/arbenson/Sequences-of-Sets.

- sos-email-Enron-core: sets of recipients on emails from email addresses.
- sos-email-Eu-core: sets of recipients on emails from email addresses.
- sos-coauth-Business: sets of co-authors on publications from researchers.
- sos-coauth-Geology: sets of co-authors on publications from researchers.
- sos-tags-mathoverflow: sets of tags on MathOverflow questions from users.
- sos-tags-math-sx: sets of tags on Mathematics Stack Exchange questions from users.
- sos-contact-high-school: sets of proximity-based contacts from individuals at a high school.
- sos-contact-prim-school: sets of proximity-based contacts from individuals at a primary school.

#### Discrete subset choices

These datasets are from people making choices from a discrete set of
alternatives. In datasets with "universal choice sets," the set of
alternatives is the same for every choice that is made. In datasets
with "variable choice sets," the set of alternatives changes with each
subset selection. These datasets were used in the following paper:

- A Discrete Choice Model for Subset Selection.

Austin R. Benson, Ravi Kumar, and Andrew Tomkins.

In *Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM)*, 2018.

Code available at github.com/arbenson/discrete-subset-choice.

- uchoice-Bakery: sets of items purchased at a bakery.
- uchoice-Walmart-Items: sets of items purchased at Walmart.
- uchoice-Walmart-Depts: sets of departments from which items were purchased at Walmart.
- uchoice-Kosarak: sets of web pages viewed in a browsing session.
- uchoice-Instacart: sets of items purchased from Instacart.
- uchoice-Lastfm-Genres: sets of genres of music played by users in listening sessions.

- vchoice-Yc-Items: sets of items purchased from the items viewed in a browsing session on an e-commerce web site.
- vchoice-Yc-Cats: sets of product categories from which purchases were made from a browsing session on an e-commerce web site.
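
The distinction between universal and variable choice sets can be sketched with a small data structure. The `Choice` class, its field names, and the toy records below are hypothetical and only illustrate the definitions above; they do not describe how the files are laid out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    """One observed choice (hypothetical structure, for illustration only)."""
    alternatives: frozenset  # items that were available
    selected: frozenset      # subset of alternatives that was chosen


def has_universal_choice_set(choices):
    """True if every choice was made from the same set of alternatives."""
    return len({c.alternatives for c in choices}) <= 1


# Hypothetical toy data: a bakery-style menu that never changes...
menu = frozenset({"bread", "cake", "coffee"})
universal = [
    Choice(menu, frozenset({"bread"})),
    Choice(menu, frozenset({"cake", "coffee"})),
]
# ...and browsing sessions where the viewed items differ each time.
variable = [
    Choice(frozenset({"p1", "p2"}), frozenset({"p1"})),
    Choice(frozenset({"p2", "p3", "p4"}), frozenset({"p3", "p4"})),
]
```

Under these assumptions, `has_universal_choice_set(universal)` holds and `has_universal_choice_set(variable)` does not.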

#### Manhattan taxi cab trajectories

This dataset contains 1,000 sequences of neighborhoods of Manhattan visited
by taxi cabs over a one-year period. The dataset was used in the following paper:

- The spacey random walk: a stochastic process for higher-order data.

Austin R. Benson, David F. Gleich, and Lek-Heng Lim.

*SIAM Review (Research Spotlights)* 59:2, 321–345, 2017.

Code available at github.com/arbenson/spacey-random-walks.

#### Flow cytometry

This flow cytometry dataset represents abundances of fluorescent
molecules labeling antibodies that bind to specific targets on the surface
of blood cells. The dataset was used in the following paper:

- Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices.

Austin R. Benson, Jason D. Lee, Bartek Rajwa, and David F. Gleich.

In *Proceedings of Neural Information Processing Systems (NIPS)*, 2014.

Code available at github.com/arbenson/mrnmf.