
=======

Introduction

This README v1.0 (June, 2005) for the v0.9 and v1.0 scale datasets comes
from the URL
http://www.cs.cornell.edu/people/pabo/movie-review-data .

=======

Citation Info 

This data was first used in Bo Pang and Lillian Lee,
``Seeing stars: Exploiting class relationships for sentiment categorization
with respect to rating scales,'' Proceedings of the ACL, 2005.
  
@InProceedings{Pang+Lee:05a,
  author =       {Bo Pang and Lillian Lee},
  title =        {Seeing stars: Exploiting class relationships for sentiment
                  categorization with respect to rating scales},
  booktitle =    {Proceedings of the ACL},
  year =         2005
}

=======

Data Format Summary 

There are two tar files, roughly corresponding to (1) the reviews
after pre-processing, including subjectivity extraction (i.e., the
data we used in our experiments) and (2) the reviews after very light
pre-processing (provided in case these prove convenient to others; to
date we have not experimented directly with them).


(1) scale_data.tar.gz (scale dataset v1.0): contains this readme and
  data files that were used in the experiments described in Pang/Lee
  ACL 2005.

  Specifically: 

  Each sub-directory $author contains data extracted from reviews written 
  by some single author; altogether, there are four author sub-directories.

  In each such sub-directory, each line in the file subj.$author
  corresponds to the subjective extract of one review.  The
  corresponding line in the file id.$author specifies the source html
  file for the review from which the extract was created; these source
  files can be found in polarity_html.zip, available from
  http://www.cs.cornell.edu/people/pabo/movie-review-data ("Pool of
  27886 unprocessed html files").

  We automatically tokenized the reviews and applied pattern-matching 
  techniques to remove explicit rating indications.  Subjective 
  sentences were automatically identified using the system described in 
  our 2004 ACL paper
  (http://www.cs.cornell.edu/home/llee/papers/cutsent.home.html).  We
  did not apply any feature selection algorithms in our experiments;
  we simply used all unigrams as features, and used feature
  presence/absence to create feature vectors.
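
  The feature-construction step described above (all unigrams, binary
  presence/absence, no feature selection) can be sketched in Python as
  follows.  This is not the authors' original code; whitespace splitting
  stands in for their tokenizer, and the example texts are made up.

  ```python
  # Sketch: binary unigram presence/absence feature vectors, with no
  # feature selection -- every unigram in the corpus becomes a feature.

  def unigram_presence_vectors(extracts):
      """Map each (pre-tokenized, whitespace-separated) extract to a
      0/1 vector indexed by the sorted corpus vocabulary."""
      vocab = sorted({tok for text in extracts for tok in text.split()})
      index = {tok: i for i, tok in enumerate(vocab)}
      vectors = []
      for text in extracts:
          vec = [0] * len(vocab)
          for tok in set(text.split()):   # presence, not counts
              vec[index[tok]] = 1
          vectors.append(vec)
      return vocab, vectors

  vocab, vecs = unigram_presence_vectors(["a fine film", "a dull film film"])
  # repeated "film" in the second extract still contributes a single 1
  ```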
  
  The class label for each extract is given in the corresponding line
  of the file label.3class.$author (for the {0,1,2} three-category
  classification task) or label.4class.$author (for the {0,1,2,3}
  four-category classification task).  

  For those who wish to experiment with more fine-grained labels, we
  also provide normalized ratings (in the range [0,1] with stepsize
  0.1 or smaller, depending on the smallest unit used by the author)
  in the file rating.$author.

  EXAMPLE: consider the information corresponding to the extract
  represented by the first line of Steve+Rhodes/subj.Steve+Rhodes:
 
  % paste Steve+Rhodes/label.3class.Steve+Rhodes \
  Steve+Rhodes/label.4class.Steve+Rhodes Steve+Rhodes/id.Steve+Rhodes \
  Steve+Rhodes/rating.Steve+Rhodes | head -1
  0       0       11790   0.1
  
  The class labels for both the three-class and four-class tasks are 0.
  The original review was written by Steve Rhodes and extracted from
  11790.html (see above for location of original reviews).
  The numerical rating, converted from the four-star system used by the
  author (1/2 star was the smallest unit he employed), is 0.1 (see the
  section "Label Decision" below for more on rating normalization).
  
  	 
(2) scale_whole_review.tar.gz (scale dataset v0.9): contains this
  README and the review files in their entireties before passing
  through tokenization, sentence separation, and subjectivity
  extraction.
  
  Specifically:

  The entire review for each subjective extract in $author/subj.$author 
  (of scale dataset v1.0) can be identified by the id number given in
  the corresponding line of $author/id.$author and located as the file
  $author/txt.parag/$id.txt,
  where each line of $id.txt corresponds to one paragraph of the review.
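
  The lookup just described can be expressed in a few lines of Python
  (the function name full_review_path and its root arguments are our
  own, hypothetical):

  ```python
  # Sketch: map the i-th extract of an author (scale dataset v1.0) to
  # the path of its full-review file in scale dataset v0.9.

  def full_review_path(author, i, v1_root=".", v09_root="."):
      """Return the path of the full review whose subjective extract
      is line i (0-based) of subj.$author."""
      with open(f"{v1_root}/{author}/id.{author}") as f:
          review_id = f.read().splitlines()[i]
      return f"{v09_root}/{author}/txt.parag/{review_id}.txt"
  ```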
   
=======

Label Decision 

The numerical ratings were derived from texts in the original html
files.  Note that with our particular conversion scheme, 0-to-4 stars
within a four star system translates into 0.1-to-0.9 in our normalized
numerical ratings, whereas 0-to-5 stars within a five star system
translates into 0-to-1.  (The reasoning was that in a four-star
system, an author is more likely to assign "endpoint" scores because
the dynamic range of the rating scheme is smaller.)
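
A linear mapping consistent with the endpoints stated above would look
as follows.  Note this is a plausible reconstruction, not the authors'
published formula; the README specifies only the endpoint behavior.

```python
# Hypothetical linear star-to-[0,1] conversion matching the stated
# endpoints: 0..4 stars -> 0.1..0.9, and 0..5 stars -> 0..1.

def normalize(stars, max_stars):
    if max_stars == 4:
        # endpoints compressed inward for the smaller dynamic range
        return 0.1 + 0.8 * stars / 4
    elif max_stars == 5:
        return stars / 5
    raise ValueError("unsupported star system")
```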
 
The class labels were then derived from the normalized numerical ratings.
  * for the three-class task:
        0: rating <= 0.4
        1: 0.4 < rating < 0.7
        2: rating >= 0.7

  * for the four-class task:
        0: rating <= 0.3
        1: 0.4 <= rating <= 0.5
        2: 0.6 <= rating <= 0.7
        3: 0.8 <= rating
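
The threshold rules above can be written directly as functions.  This is
a sketch: the four-class rules leave ratings strictly between the listed
bands (possible when an author's stepsize is finer than 0.1, e.g. 0.35)
unassigned, and this sketch folds such values into the next class up.

```python
# The label-decision thresholds from this section, as functions of the
# normalized rating found in rating.$author.

def label_3class(rating):
    if rating <= 0.4:
        return 0
    elif rating < 0.7:
        return 1
    else:
        return 2

def label_4class(rating):
    if rating <= 0.3:
        return 0
    elif rating <= 0.5:   # covers the unspecified gap (0.3, 0.4)
        return 1
    elif rating <= 0.7:   # covers the unspecified gap (0.5, 0.6)
        return 2
    else:
        return 3
```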

