threads-stack-overflow dataset
This is a temporal higher-order network dataset, which here means a sequence of timestamped simplices where each simplex is a set of nodes. In this dataset, nodes are users on stackoverflow.com, and each simplex is the set of users participating in a thread that lasts for at most 24 hours. Each timestamp is the time of the post in milliseconds, normalized so that the earliest post is at time 0. The projected graph is a weighted undirected graph in which the weight of an edge is the number of times its two nodes co-appear in a simplex. We restrict the dataset to simplices that consist of at most 25 nodes. Some basic statistics of this dataset are:
  • number of nodes: 2,675,955
  • number of timestamped simplices: 11,305,343
  • number of unique simplices: 9,705,709
  • number of edges in projected graph: 20,999,838
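The definitions above (size restriction, timestamp normalization, unique simplices, and the projected graph) can be illustrated with a minimal Python sketch. It runs on a small in-memory toy example rather than the real files, and the names `project`, `normalize_times`, and `MAX_SIZE` are hypothetical helpers introduced here for illustration. It also assumes that edge weights count co-appearances over all timestamped simplices, repeats included.

```python
from collections import Counter
from itertools import combinations

MAX_SIZE = 25  # size restriction described above


def project(simplices):
    """Map each unordered node pair (u, v), u < v, to its co-appearance count."""
    weights = Counter()
    for nodes in simplices:
        # Deduplicate and sort so every pair has one canonical orientation.
        for u, v in combinations(sorted(set(nodes)), 2):
            weights[(u, v)] += 1
    return weights


def normalize_times(times):
    """Shift timestamps so that the earliest one is 0."""
    t0 = min(times)
    return [t - t0 for t in times]


# Toy data (not taken from the dataset): four timestamped simplices.
simplices = [(1, 2, 3), (2, 3), (4,), (1, 2, 3)]
times = [1500, 1000, 4000, 2500]

# Keep only simplices with at most MAX_SIZE nodes.
kept = [s for s in simplices if len(s) <= MAX_SIZE]

weights = project(kept)
shifted = normalize_times(times)
unique = {frozenset(s) for s in kept}

print(len(kept), "timestamped simplices")
print(len(unique), "unique simplices")
print(len(weights), "edges in projected graph")
```

A simplex with a single node, such as `(4,)` here, contributes no edges to the projected graph but still counts as a timestamped simplex.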
Data restricted to simplices with at most 25 nodes:
Full data without restriction on simplex size:
If you use this data, please cite the following paper:
  • Simplicial closure and higher-order link prediction.
    Austin R. Benson, Rediet Abebe, Michael T. Schaub, Ali Jadbabaie, and Jon Kleinberg.
    Proceedings of the National Academy of Sciences (PNAS), 2018. [bibtex]