Movie Review Data
This page is a distribution site for movie-review data for use in
sentiment-analysis experiments. Available are collections of
movie-review documents labeled with respect to their overall sentiment polarity
(positive or negative) or subjective rating (e.g., "two and a
half stars"), and sentences labeled with respect to their
subjectivity status (subjective or objective) or polarity. These data sets were
introduced in the following papers:
We also have available an additional sentiment-analysis dataset,
Congressional floor-debate transcripts, with support/oppose labels.
If you have results to report on these corpora,
please send email to Bo Pang and/or Lillian Lee so we can add you to our list of other papers using this data. Thanks!
Please cite the version number of the
dataset you used in any publications, in order to facilitate
comparison of results. Thank you.
Sentiment polarity datasets
- polarity dataset v2.0 (3.0Mb) (includes README v2.0): 1000 positive and 1000 negative processed reviews.
Introduced in Pang/Lee ACL 2004. Released June 2004.
- Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived.
(This file is identical to movie.zip from data release v1.0.)
- sentence polarity dataset v1.0 (includes sentence polarity dataset README v1.0):
5331 positive and 5331 negative processed sentences / snippets.
Introduced in Pang/Lee ACL 2005. Released July 2005.
- archive:
- polarity dataset v1.0 (2.8Mb) (includes README): 700 positive and 700 negative processed reviews.
Released July 2002.
- polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed
reviews. Released November 2002. This alternative version was
created by Nathan Treloar, who removed a few non-English/incomplete reviews and
changed some of the labels (judging some polarities to be different
from the original author's rating). The complete list of changes made to
v1.1 can be found in diff.txt.
- polarity dataset v0.9 (2.8Mb) (includes a README): 700 positive and 700 negative processed
reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002.
Please read the "Rating Information - WARNING" section of the README.
- movie.zip (81.1Mb): all html files we collected from the IMDb archive.
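For readers who want to check a downloaded polarity dataset against the counts listed above, the following is a minimal sketch, not an official loader. It assumes the unpacked archive contains one subdirectory per label, here called `pos` and `neg`, each holding one plain-text review per `.txt` file, and that the files decode as Latin-1. Those directory names, the file extension, the encoding, and the function names `load_polarity_dir` and `label_counts` are all assumptions made for illustration; the README shipped with each release is the authority on the actual layout.

```python
from collections import Counter
from pathlib import Path


def load_polarity_dir(root, labels=("pos", "neg"), encoding="latin-1"):
    """Read every .txt file under root/<label>/ and return (text, label) pairs.

    The label directory names and the encoding are assumptions; adjust them
    to match the README of the release you downloaded.
    """
    root = Path(root)
    examples = []
    for label in labels:
        # Sort so the order is deterministic across runs and platforms.
        for path in sorted((root / label).glob("*.txt")):
            examples.append((path.read_text(encoding=encoding), label))
    return examples


def label_counts(examples):
    """Count documents per label, e.g. to compare against the listed sizes."""
    return Counter(label for _, label in examples)
```

If the layout assumptions hold, `label_counts(load_polarity_dir(path_to_unpacked_archive))` gives the number of documents per label, which can be compared with the figures stated for the release in use (for example, 1000 positive and 1000 negative reviews for v2.0).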
Sentiment scale datasets
Subjectivity datasets
The creation of this website is based upon work supported in part by
the National Science Foundation (NSF) under grant nos. ITR/IM
IIS-0081334, IIS-0329064, CCR-0122581, and BES-0329549; SRI
International under subcontract no. 03-000211 on their project funded
by the Department of the Interior, National Business Center; a Cornell
Graduate Fellowship in Cognitive Studies; and an Alfred P. Sloan
Research Fellowship. Any opinions, findings, and conclusions or
recommendations expressed above are those of the authors and do not
necessarily reflect the views of the National Science Foundation or
Sloan Foundation and should
not be interpreted as representing the official policies, either expressed
or implied, of any sponsoring institution, the U.S. government or any other
entity.
If you have any questions or comments regarding this site, please send
email to Bo Pang.
NLP at Cornell