## What's in here?

This is version 1.1 of the data to accompany:

```
@inproceedings{hessel-lee-2019-somethings,
  title = "Something{'}s Brewing! Early Prediction of Controversy-causing Posts from Discussion Features",
  author = "Hessel, Jack and Lee, Lillian",
  booktitle = "Proceedings of NAACL, Volume 1 (Long and Short Papers)",
  year = "2019",
  url = "https://www.aclweb.org/anthology/N19-1166",
  pages = "1648--1659"
}
```

In particular, we provide:

**15_comment_filtered.zip:** posts and comment trees in the communities we consider with at least 15 comments. Each line of the jsonlist files represents a separate post that can be loaded with standard JSON readers. Comments are recursively nested in the 'children' entry of these JSON objects (see the loading sketch at the end of this section).

**posts_from_paper.zip (new in v1.1):** the posts/labels/training splits used in the paper. The posts contained in these jsonlists are a subset of the posts in 15_comment_filtered.zip: these have 30+ comments, and have also undergone the sort-and-filter process described in the paper:

> We assign binary controversy labels (i.e., relatively controversial vs. relatively non-controversial) to posts according to the following process: first, we discard posts where the observed variability across 10 API queries for percent-upvoted exceeds 5%; in these cases, we assume that there are too few total votes for a stable estimate. Next, we discard posts where neither the observed upvote ratio nor the observed score vary at all; in these cases, we cannot be sure that the upvote ratio is insensitive to the vote fuzzing function. Finally, we sort each community's surviving posts by upvote percentage, and discard the small number of posts with percent-upvoted below 50%. The top quartile of posts according to this ranking (i.e., posts with mostly only upvotes) are labeled "non-controversial." The bottom quartile of posts, where the number of downvotes cannot exceed but may approach the number of upvotes, are labeled as "controversial." For each community, this process yields a balanced, labeled set of controversial/non-controversial posts.

The label used in the paper is stored in the "controversy_label" field. For each post, we also provide a list of 15 integers indicating which of the train/val/test sets the post belongs to in each of the 15 cross-validation splits (0/1/2 = train/val/test). For example, if "train_val_test" is [1, 0, 0, 2, 0, ...] for a given post, then the post was in the dev set for the first cross-val split, the train set for the second, the train set for the third, the test set for the fourth, the train set for the fifth, and so on. Comment trees can be filled in by joining the data in this file with the data in 15_comment_filtered.zip according to the id of the post (see the join sketch below).

**vote_scrapes:** ten directories representing the ten rounds of vote scraping we did. Each vote_info file has three meaningful columns: the first is the post id, the second is the percent upvoted, and the third is the post score (the fourth column is unused). The second and third columns are subject to vote noising, which is why we scraped the vote statistics 10 times (see the paper for more details). There may be slightly more vote scrapes than necessary.

The paper provides detail about how we used and further filtered this data for our study; e.g., we only considered posts that eventually received at least 30 comments (but we provide posts with 15+ comments).
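To make the jsonlist format concrete, below is a minimal loading sketch. The filename is a placeholder, and only the 'children' field comes from the description above; treat it as an illustration rather than part of the release.

```python
import json

def count_comments(node):
    # Comments are recursively nested in the 'children' entry of each
    # post (and of each comment below it).
    children = node.get('children', [])
    return len(children) + sum(count_comments(c) for c in children)

# Placeholder filename: one post per line, loadable with a standard JSON reader.
with open('some_community.jsonlist') as f:
    for line in f:
        post = json.loads(line)
        print(count_comments(post))
```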
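In the same spirit, here is a hedged sketch of decoding the split codes and joining posts_from_paper with the comment trees. The file paths and the 'id' key are assumptions based on the description above; "controversy_label" and "train_val_test" are the documented fields.

```python
import json

SPLIT_NAMES = {0: 'train', 1: 'val', 2: 'test'}

# Index the full comment trees by post id ('id' as the key name is an
# assumption; this README only says the join is "according to the id").
trees_by_id = {}
with open('15_comment_filtered/some_community.jsonlist') as f:  # placeholder path
    for line in f:
        post = json.loads(line)
        trees_by_id[post['id']] = post

with open('posts_from_paper/some_community.jsonlist') as f:  # placeholder path
    for line in f:
        post = json.loads(line)
        label = post['controversy_label']                 # documented field
        split_0 = SPLIT_NAMES[post['train_val_test'][0]]  # role in the first cross-val split
        full_tree = trees_by_id.get(post['id'])           # fill in the comment tree
```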
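For the vote scrapes, the sketch below gathers percent-upvoted across the ten rounds and applies a stability filter like the one quoted above. The directory layout, the tab delimiter, the [0, 1] scale of the upvote ratio, and max-minus-min as the variability measure are all assumptions.

```python
import csv
import glob
from collections import defaultdict

# Collect the (noised) percent-upvoted for each post across scrape rounds;
# the glob pattern is a guess at the release's layout.
upvote_pcts = defaultdict(list)
for path in glob.glob('vote_scrapes/*/vote_info*'):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            post_id, pct = row[0], float(row[1])  # col 3: score; col 4: unused
            upvote_pcts[post_id].append(pct)

# Echo of the paper's filter: drop posts whose observed variability across
# queries exceeds 5% (assuming ratios in [0, 1], so 5% == 0.05).
stable = {pid: pcts for pid, pcts in upvote_pcts.items()
          if max(pcts) - min(pcts) <= 0.05}
```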
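Finally, a rough reconstruction of the quartile labeling step, for illustration only: the released "controversy_label" field already encodes the result, and how the ten noisy measurements are aggregated into one representative percent-upvoted per post is our guess.

```python
def quartile_labels(post_pcts):
    """post_pcts: {post_id: representative percent-upvoted}, assumed to be
    already filtered for stability and for percent-upvoted >= 50%."""
    ranked = sorted(post_pcts, key=post_pcts.get)  # ascending percent-upvoted
    k = len(ranked) // 4
    if k == 0:
        return {}
    labels = {}
    for pid in ranked[:k]:    # bottom quartile: downvotes approach upvotes
        labels[pid] = 'controversial'
    for pid in ranked[-k:]:   # top quartile: mostly only upvotes
        labels[pid] = 'non-controversial'
    return labels             # the middle half goes unlabeled
```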
## Where is this data from?

This data is a combination of several data scrapes, principally those released by Jason Baumgartner of pushshift.io, and our own scraping efforts. If you use this data, we encourage you to cite both this work and pushshift.io.

If you would like more detail, you can read about this dataset here: http://www.cs.cornell.edu/~jhessel/reddit/gaps.html

If you would like access to the full post sets (or the post sets for any of the communities for which we have reconstructed comment trees), please feel free to get in touch.

## More Details

If you have any questions/comments/thoughts, feel free to contact us!