Some older datasets:
- Reddit data
- 1977-2008 FOMC transcripts: multiple many-hour meetings where very consequential decisions (what the US Federal Interest Rate will be) are made between participants who know each other very well. (First release 2016; 231M when unzipped)
- Cornell natural-experiment tweet pairs: data for investigating whether phrasing affects message propagation, controlling for user and topic. Note that, in compliance with Twitter policy, we cannot distribute the tweets themselves, only tweet IDs. The zip file can be retrieved from the given URL (first release 2014)
- Sentential revisions in academic writing, with a focus on changes in
strength of assertion.
- Cornell movie-quotes corpus: paired memorable and non-memorable movie quotes, controlling for speaker, scene, and length (first release 2012)
- Supreme Court dialogs corpus: conversations and metadata (such as vote outcomes) from oral arguments before the US Supreme Court (first release 2012)
- Wikipedia editor conversations corpus (first release 2012)
- Data for studying hedging and framing in GMO debates and in professional- vs. pop-science discourse (first release 2012)
- Cornell movie-dialogs corpus: conversations and metadata (IMDB rating, genre, character gender, etc.) from movie scripts (first release 2011)
- Files associated with extracting lexical-level simplifications from
Simple Wikipedia (first release 2010)
- Data related to sentiment analysis, broadly construed
And here are some results from experiments.
The work described in the publications above was supported in part by
the National Science Foundation under several grants (to see which
grants supported a particular dataset, please consult the
acknowledgments of the associated publication). Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.