## What's in here?

This is version 1.1 of the data to accompany:

```
@inproceedings{hessel-lee-2019-somethings,
  title = "Something{'}s Brewing! Early Prediction of Controversy-causing Posts from Discussion Features",
  author = "Hessel, Jack and Lee, Lillian",
  booktitle = "Proceedings of NAACL, Volume 1 (Long and Short Papers)",
  year = "2019",
  url = "https://www.aclweb.org/anthology/N19-1166",
  pages = "1648--1659"
}
```

In particular, we provide:

**15_comment_filtered.zip:** posts and comment trees in the communities we consider with at least 15 comments. Each line of the jsonlist files represents a separate post that can be loaded with standard JSON readers. Comments are recursively nested in the 'children' entry of these JSON objects (see the loading sketch at the end of this section).

**posts_from_paper.zip (new in v1.1):** the posts/labels/training splits used in the paper. The posts contained in these jsonlists are a subset of the posts in 15_comment_filtered.zip: these have 30+ comments, and have also undergone the sort-and-filter process described in the paper:

> We assign binary controversy labels (i.e., relatively controversial vs. relatively non-controversial) to posts according to the following process: first, we discard posts where the observed variability across 10 API queries for percent-upvoted exceeds 5%; in these cases, we assume that there are too few total votes for a stable estimate. Next, we discard posts where neither the observed upvote ratio nor the observed score vary at all; in these cases, we cannot be sure that the upvote ratio is insensitive to the vote fuzzing function. Finally, we sort each community's surviving posts by upvote percentage, and discard the small number of posts with percent-upvoted below 50%. The top quartile of posts according to this ranking (i.e., posts with mostly only upvotes) are labeled "non-controversial." The bottom quartile of posts, where the number of downvotes cannot exceed but may approach the number of upvotes, are labeled as "controversial." For each community, this process yields a balanced, labeled set of controversial/non-controversial posts.

The label used in the paper is stored in the "controversy_label" field. For each post, we also provide a list of 15 integers indicating which of the train/val/test sets the post belongs to in each of the 15 cross-validation splits (0/1/2 = train/val/test). For example, if "train_val_test" is [1, 0, 0, 2, 0, ...] for a given post, then the post was in the dev set for the first cross-val split, the train set for the second, the train set for the third, the test set for the fourth, the train set for the fifth, and so on. Comment trees can be filled in by joining the data in this file with the data in 15_comment_filtered.zip according to the id of the post (see the join sketch below).

**vote_scrapes:** ten directories representing the ten rounds of vote scraping we did. Each vote_info file has three meaningful columns: the first is the post id, the second is the percent upvoted, and the third is the post score (the fourth column is unused). The second and third columns are subject to vote noising, which is why we scraped the vote statistics 10 times (see the paper for more details). There may be slightly more vote scrapes than necessary.

The paper provides detail about how we used and further filtered this data for our study; e.g., we only considered posts that eventually received at least 30 comments (but we provide posts with 15+ comments).
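To make the jsonlist format concrete, below is a minimal loading sketch. The filename is a placeholder, and only the 'children' field comes from the description above; treat it as an illustration rather than part of the release.

```python
import json

def count_comments(node):
    # Comments are recursively nested in the 'children' entry of each
    # post (and of each comment below it).
    children = node.get('children', [])
    return len(children) + sum(count_comments(c) for c in children)

# Placeholder filename: one post per line, loadable with a standard JSON reader.
with open('some_community.jsonlist') as f:
    for line in f:
        post = json.loads(line)
        print(count_comments(post))
```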
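In the same spirit, here is a hedged sketch of decoding the split codes and joining posts_from_paper with the comment trees. The file paths and the 'id' key are assumptions based on the description above; "controversy_label" and "train_val_test" are the documented fields.

```python
import json

SPLIT_NAMES = {0: 'train', 1: 'val', 2: 'test'}

# Index the full comment trees by post id ('id' as the key name is an
# assumption; this README only says the join is "according to the id").
trees_by_id = {}
with open('15_comment_filtered/some_community.jsonlist') as f:  # placeholder path
    for line in f:
        post = json.loads(line)
        trees_by_id[post['id']] = post

with open('posts_from_paper/some_community.jsonlist') as f:  # placeholder path
    for line in f:
        post = json.loads(line)
        label = post['controversy_label']                 # documented field
        split_0 = SPLIT_NAMES[post['train_val_test'][0]]  # role in the first cross-val split
        full_tree = trees_by_id.get(post['id'])           # fill in the comment tree
```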
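For the vote scrapes, the sketch below gathers percent-upvoted across the ten rounds and applies a stability filter like the one quoted above. The directory layout, the tab delimiter, the [0, 1] scale of the upvote ratio, and max-minus-min as the variability measure are all assumptions.

```python
import csv
import glob
from collections import defaultdict

# Collect the (noised) percent-upvoted for each post across scrape rounds;
# the glob pattern is a guess at the release's layout.
upvote_pcts = defaultdict(list)
for path in glob.glob('vote_scrapes/*/vote_info*'):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            post_id, pct = row[0], float(row[1])  # col 3: score; col 4: unused
            upvote_pcts[post_id].append(pct)

# Echo of the paper's filter: drop posts whose observed variability across
# queries exceeds 5% (assuming ratios in [0, 1], so 5% == 0.05).
stable = {pid: pcts for pid, pcts in upvote_pcts.items()
          if max(pcts) - min(pcts) <= 0.05}
```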
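Finally, a rough reconstruction of the quartile labeling step, for illustration only: the released "controversy_label" field already encodes the result, and how the ten noisy measurements are aggregated into one representative percent-upvoted per post is our guess.

```python
def quartile_labels(post_pcts):
    """post_pcts: {post_id: representative percent-upvoted}, assumed to be
    already filtered for stability and for percent-upvoted >= 50%."""
    ranked = sorted(post_pcts, key=post_pcts.get)  # ascending percent-upvoted
    k = len(ranked) // 4
    if k == 0:
        return {}
    labels = {}
    for pid in ranked[:k]:    # bottom quartile: downvotes approach upvotes
        labels[pid] = 'controversial'
    for pid in ranked[-k:]:   # top quartile: mostly only upvotes
        labels[pid] = 'non-controversial'
    return labels             # the middle half goes unlabeled
```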
## Where is this data from?

This data is a combination of several data scrapes, principally those released by Jason Baumgartner of pushshift.io, and our own scraping efforts. If you use this data, we encourage you to cite both this work and pushshift.io.

If you would like more detail, you can read about this dataset here: http://www.cs.cornell.edu/~jhessel/reddit/gaps.html

If you would like access to the full post sets (or the post sets for any of the communities for which we have reconstructed comment trees), please feel free to get in touch.

## More Details

If you have any questions/comments/thoughts, feel free to contact us!