Datasets

Reddit data
- Argument trees, "successful persuasion" metadata, and related data from the subreddit ChangeMyView (first release 2016; 321MB)
- Multimodal posts for popularity prediction (first release 2017; 3.3GB)
- 1B comments and posts for highly-related ("affix pair") subreddits. The starting point was data collected and released by Jason Baumgartner, with additional processing done for the dataset below. (first release 2016; 13GB)
- Multi-community engagement (users posting, or not posting, in different subreddits since Reddit's inception). Data includes the texts of 76.7M posts made and associated metadata, such as the subreddit, the "number" of upvotes, and the time stamp. The starting point was data collected and released by Jason Baumgartner. (first release 2015; 24GB)
Multimodal datasets for quantifying visual concreteness (first release 2018; Wikipedia dataset: 4.9GB; British Library dataset: 38GB)
1977-2008 FOMC transcripts: multiple many-hour meetings where very consequential decisions (what the US Federal Interest Rate will be) are made between participants who know each other very well. (first release 2016; 231MB when unzipped)
Cornell natural-experiment tweet pairs: data for investigating whether phrasing affects message propagation, controlling for user and topic. Note that, in compliance with Twitter policy, we cannot distribute the tweets themselves, only the tweet IDs. The zip file can be retrieved from the given URL. (first release 2014)
Sentential revisions in academic writing, with a focus on changes in strength of assertion.
Cornell movie-quotes corpus: paired memorable and non-memorable movie quotes, controlling for speaker, scene, and length (first release 2012)
Supreme Court dialogs corpus: conversations and metadata (such as vote outcomes) from oral arguments before the US Supreme Court (first release 2012)
Wikipedia editor conversations corpus (first release 2012)
GMOHedging: data for studying hedging and framing in GMO debates and in professional- vs. pop-science discourse (first release 2012)
Cornell movie-dialogs corpus: conversations and metadata (IMDB rating, genre, character gender, etc.) from movie scripts (first release 2011)
Files associated with extracting lexical-level simplifications from Simple Wikipedia (first release 2010)
Data related to sentiment analysis, broadly construed
- Cornell movie-review corpus: Sentiment-classified movie reviews (positive/negative or number of stars), subjective/objective sentences, etc. (released in 2002/2004/2005)
- Convote: Congressional floor-debate transcripts, with support/oppose labels (first release 2006)
- Search-set results for review-oriented queries, with subjective/objective labels (first release 2008)

Some older datasets:

AP88 data for some similarity-based pseudoword disambiguation experiments
Multi-parallel proof/verbalization data for a project on verbalizing NuPrl mathematical proofs using multiple-sequence alignment

And here are some results from experiments.

Automatically discovered downward-entailing operators for English. Also, automatically discovered Romanian downward-entailing operators.
Extracted paraphrases together with human evaluation judgments, from a project using multiple-sequence alignment to learn paraphrases from comparable corpora.

The work described in the publications above was supported in part by the National Science Foundation under several grants (to see which grants supported a particular dataset, please consult the acknowledgments of the associated publication). Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Lillian Lee's home page.
Cornell NLP homepage