Congressional speech data
This page is a distribution site for a congressional-speech corpus and
related extracted information. This data includes
speeches as individual documents, together with:
- automatically-derived labels for whether the speaker supported or opposed the
legislation discussed in the debate the speech appears in, allowing for experiments
with this kind of sentiment analysis
- indications of which "debate" each speech comes from,
allowing for consideration of conversational structure
- indications of by-name references between speakers, and the
scores that our agreement/disagreement classifier(s) automatically
assigned to such references, allowing for experiments on agreement
classification if one assigns "true" labels from
the support/oppose labels assigned to the pair of speakers in question
- the edge weights and other information we derived to
create the graphs we used for our experiments upon this data, facilitating
implementation of alternative graph-based methods upon the graphs we constructed
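As a rough illustration of the agreement-classification setup mentioned in the list above, one could derive a "true" agreement label for each by-name reference by checking whether the two speakers' support/oppose labels match. The sketch below is only an assumption about how this might be done: the function name, the speaker identifiers, and the in-memory dictionary and list are hypothetical and do not reflect the file formats in the distribution (see the README for those).

```python
from typing import Dict, List, Tuple


def agreement_labels(
    stance: Dict[str, str],
    references: List[Tuple[str, str]],
) -> List[Tuple[str, str, bool]]:
    """Label each (referring speaker, referenced speaker) pair.

    A pair is labeled True ("agree") when both speakers carry the same
    support/oppose label, and False ("disagree") otherwise. Pairs in
    which either speaker has no label are skipped.
    """
    labeled = []
    for source, target in references:
        if source not in stance or target not in stance:
            continue  # no support/oppose label available for this pair
        labeled.append((source, target, stance[source] == stance[target]))
    return labeled


# Hypothetical toy data; real identifiers and labels come from the corpus.
stance = {"speaker_a": "support", "speaker_b": "oppose", "speaker_c": "support"}
references = [
    ("speaker_a", "speaker_b"),
    ("speaker_a", "speaker_c"),
    ("speaker_b", "speaker_d"),  # speaker_d has no label, so this pair is skipped
]

for source, target, agrees in agreement_labels(stance, references):
    print(source, target, "agree" if agrees else "disagree")
```

Labels derived this way can then be compared against the classifier scores supplied with the references.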
If you have used this data, we would appreciate hearing about it (Lillian Lee is our
designated contact person); a list of those papers we know about can
be found below.
References
This data was introduced in Matt Thomas, Bo Pang, and Lillian Lee, Get out the vote: Determining support or opposition from
Congressional floor-debate transcripts. The original version of
the paper appeared in the Proceedings of EMNLP, 2006,
pp. 327–335. However, the paper has been updated since then; the
link provided is to the most current version.
Data download
convote dataset v1.1 (9.8 Mb, tar.gz format), including
README.v1.1.txt,
January 2008. The only difference from v1.0 is that a typo in the first line of
graph_edge_data/edges_individual_document.v1.0.csv has been
corrected. (This affects just a single file and our calculations used
the correct value.)
convote dataset v1.0 was released in December 2006. Please use the
newer version, v1.1, which differs from it by a single line.
Other papers using this data
Listed in chronological order, then
alphabetically within a given year.
- Stephan Greene. Spin:
Lexical Semantics, Transitivity, and the Identification of Implicit
Sentiment. Ph.D. thesis, University of Maryland, 2007.
- Bei Yu, Stefan Kaufmann, and Daniel Diermeier. Ideology classifiers for political speech.
Available at SSRN: http://ssrn.com/abstract=1026925 (click on
“download” link), working paper
dated November 1, 2007.
- Ben Allison. Sentiment
Detection Using Lexically-Based Classifiers. Proceedings of TSD '08.
- Mohit Bansal, Claire Cardie and Lillian Lee. The
power of negative thinking: Exploiting label disagreement in the
min-cut classification framework. Proceedings of COLING: Companion volume: Posters, pp. 13–16, 2008.
- Clint Burfoot. Using multiple sources of agreement information
for sentiment classification of political transcripts. Australasian
Language Technology Workshop (ALTA) 2008.
- Marina Sokolova and Guy Lapalme. Verbs
Speak Loud: Verb Categories in Learning Polarity and Strength of
Opinions. Proceedings of the 20th Canadian Conference on
Artificial Intelligence (AI 2008), Lecture Notes
in Artificial Intelligence, vol. 5032, pp. 320–331, 2008.
- Alexandra Balahur, Zornitsa Kozareva, and Andrés Montoyo. Determining the Polarity and Source of Opinions Expressed in Political Debates. CICLing 2009, pp. 468–480.
The creation of this website is based upon work supported in part by
the National Science Foundation (NSF) under grant no. IIS-0329064, an
Alfred P. Sloan Research Fellowship, and Google Anita Borg Memorial
Scholarship funds. Any opinions, findings, and conclusions or
recommendations expressed above are those of the authors and do not
necessarily reflect the views of the National Science Foundation or
Sloan Foundation and should not be interpreted as representing the
official policies, either expressed or implied, of any sponsoring
institution, the U.S. government or any other entity.