Datasets from transcripts of US Congressional floor debates

Congressional speech data

This page is a distribution site for a congressional-speech corpus and related extracted information. This data includes speeches as individual documents, together with:

- automatically derived labels indicating whether the speaker supported or opposed the legislation discussed in the debate in which the speech appears, allowing for experiments with this kind of sentiment analysis
  - We also maintain and distribute another corpus suitable for work in sentiment analysis, the Cornell movie-review data set.
- indications of which "debate" each speech comes from, allowing for consideration of conversational structure
- indications of by-name references between speakers, together with the scores that our agreement/disagreement classifier(s) automatically assigned to such references; these allow for experiments on agreement classification if one derives "true" labels from the support/oppose labels assigned to the pair of speakers in question
- the edge weights and other information we derived to create the graphs used in our experiments on this data, facilitating the implementation of alternative graph-based methods on the graphs we constructed
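
As an illustration of the "true" agreement labels mentioned above, the following minimal Python sketch marks a by-name reference as agreement when the two speakers share a support/oppose label within a debate, and as disagreement otherwise. The function name, the speaker IDs, and the label strings are hypothetical and chosen only for this example; the actual file formats are described in the README distributed with the data.

```python
def derive_agreement_labels(stance_by_speaker, references):
    """Label each by-name reference as agreement or disagreement.

    stance_by_speaker: dict mapping a speaker ID to "support" or "oppose"
        (hypothetical label strings) for one debate.
    references: iterable of (referring_speaker, mentioned_speaker) pairs.
    References involving a speaker without a stance label are skipped.
    """
    labeled = []
    for referrer, mentioned in references:
        if referrer not in stance_by_speaker or mentioned not in stance_by_speaker:
            continue  # no basis for a "true" label
        same_side = stance_by_speaker[referrer] == stance_by_speaker[mentioned]
        labeled.append((referrer, mentioned, "agree" if same_side else "disagree"))
    return labeled


# Toy, made-up inputs purely for illustration.
stances = {"speaker_a": "support", "speaker_b": "oppose", "speaker_c": "support"}
refs = [
    ("speaker_a", "speaker_b"),
    ("speaker_a", "speaker_c"),
    ("speaker_b", "speaker_d"),  # speaker_d has no stance label, so it is skipped
]
print(derive_agreement_labels(stances, refs))
```

Labels derived this way can then be compared against the classifier scores attached to the same references.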

If you have used this data, we would appreciate hearing about it (Lillian Lee is our designated contact person); a list of those papers we know about can be found below.

References

This data was introduced in Matt Thomas, Bo Pang, and Lillian Lee, Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. The original version of the paper appeared in the Proceedings of EMNLP, 2006, pp. 327–335. However, the paper has been updated since then; the link provided is to the most current version.

Data download

convote dataset v1.1 (9.8 MB, tar.gz format), including README.v1.1.txt, January 2008. The only difference from v1.0 is that a typo in the first line of graph_edge_data/edges_individual_document.v1.0.csv has been corrected. (This affects just a single file and our calculations used the correct value.)

convote dataset v1.0 was released in December 2006. Please use the newer version, v1.1, which differs from it by a single line.
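
For readers who want to inspect the download before unpacking it, here is a minimal Python sketch that lists the CSV files inside a tar.gz archive and previews the first rows of one of them. The archive file name in the usage comment and both helper names are assumptions made for this example; the column layout of each file is not assumed here and should be taken from the README.

```python
import csv
import io
import tarfile


def list_csv_members(archive_path):
    """Return the sorted names of all regular .csv files in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".csv")
        )


def read_csv_member(archive_path, member_name, max_rows=5):
    """Return up to max_rows parsed rows from one CSV member, without extracting it."""
    with tarfile.open(archive_path, "r:gz") as archive:
        handle = archive.extractfile(member_name)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        text = io.TextIOWrapper(handle, encoding="utf-8", errors="replace", newline="")
        reader = csv.reader(text)
        return [row for _, row in zip(range(max_rows), reader)]


# Hypothetical usage; substitute the path of the file you actually downloaded.
# for name in list_csv_members("convote_v1.1.tar.gz"):
#     print(name, read_csv_member("convote_v1.1.tar.gz", name, max_rows=2))
```

Reading members in place avoids writing anything to disk until the contents have been checked.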

Other papers using this data

Listed in chronological order, then alphabetically within a given year.

Stephan Greene. Spin: Lexical Semantics, Transitivity, and the Identification of Implicit Sentiment. Ph.D. thesis, University of Maryland, 2007.
Bei Yu, Stefan Kaufmann, and Daniel Diermeier. Ideology classifiers for political speech. Available at SSRN: http://ssrn.com/abstract=1026925 (click on “download” link), working paper dated November 1, 2007.
Ben Allison. Sentiment Detection Using Lexically-Based Classifiers. Proceedings of TSD '08.
Mohit Bansal, Claire Cardie and Lillian Lee. The power of negative thinking: Exploiting label disagreement in the min-cut classification framework. Proceedings of COLING: Companion volume: Posters, pp. 13–16, 2008.
Clint Burfoot. Using multiple sources of agreement information for sentiment classification of political transcripts. Australasian Language Technology Workshop (ALTA) 2008.
Marina Sokolova and Guy Lapalme. Verbs Speak Loud: Verb Categories in Learning Polarity and Strength of Opinions. Proceedings of the 20th Canadian Conference on Artificial Intelligence (AI 2008), Lecture Notes in Artificial Intelligence, vol. 5032, pp. 320–331, 2008.
Marina Sokolova and Guy Lapalme. Verbs as the most "affective" words. Affective language in human and machine, 2008.
Bei Yu, Stefan Kaufmann, and Daniel Diermeier. 2008. Classifying party affiliation from political speech. Journal of Information Technology & Politics 5 (1): 33-48.
Alexandra Balahur, Zornitsa Kozareva, and Andrés Montoyo. Determining the Polarity and Source of Opinions Expressed in Political Debates. CICLing 2009, pp. 468–480.
Eric Gilbert, Tony Bergstrom, and Karrie Karahalios. Blogs Are echo chambers: Blogs Are echo chambers. HICSS 2009.
Justin Martineau and Tim Finin. Delta TFIDF: An improved feature space for sentiment analysis. ICWSM 2009.
Justin Martineau, Tim Finin, Anupam Joshi, and Shamit Patel. Improving binary classification on text problems using differential word features. CIKM 2009.
Daniel Hopkins and Gary King. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247 (2010).
Dong Nguyen, Elijah Mayfield, and Carolyn Penstein Rosé. An analysis of perspectives in interactive settings. Workshop on Social Media Analytics, KDD 2010.
Fernanda S. Pimenta, Darko Obradović, Rafael Schirru, Stephan Baumann and Andreas Dengel. Automatic sentiment monitoring of specific topics in the blogosphere. ECML PKDD Workshop on Dynamic Networks and Knowledge Discovery, 2010.
Robert West, Doina Precup, and Joelle Pineau. Automatically suggesting topics for augmenting text documents. CIKM 2010.
Ainur Yessenalina, Yisong Yue, and Claire Cardie. Multi-level structured models for document-level sentiment classification. EMNLP 2010.
Clinton Burfoot, Steven Bird, and Timothy Baldwin. Collective classification of Congressional floor-debate transcripts. ACL poster, 2011.
Christopher Potts. On the negativity of negation. SALT, 2011.
Jordan Bates, Jennifer Neville, and Jim Tyler. Using latent communication styles to predict individual characteristics. 3rd SOMA workshop, KDD, 2012.
Mahesh Joshi, Mark Dredze, William W. Cohen and Carolyn P. Rosé. Multi-Domain Learning: When Do Domains Matter? EMNLP 2012.
Veselin Stoyanov and Jason Eisner. Minimum-risk training of approximate CRF-based NLP systems. NAACL 2012.
Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. Political Ideology Detection Using Recursive Neural Networks. ACL 2014.
Robert West, Hristo S. Paskov, Jure Leskovec, and Christopher Potts. Exploiting Social Network Structure for Person-to-Person Sentiment Analysis. TACL 2, 2014.
Vesile Evrim and Aliyu Awwal. Effect of Personality Traits on Classification of Political Orientation. International Journal of Social, Behavioral, Educational, Economic and Management Engineering 9(6), 2015.

The creation of this website is based upon work supported in part by the National Science Foundation (NSF) under grant no. IIS-0329064, an Alfred P. Sloan Research Fellowship, and Google Anita Borg Memorial Scholarship funds. Any opinions, findings, and conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of the National Science Foundation or Sloan Foundation and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.
