=======

Introduction

This README v1.0 (September, 2008) for the v1.0 search-set dataset comes from the URL http://www.cs.cornell.edu/home/llee/data/search-subj.html.

=======

Citation Info

This data was first used in Bo Pang and Lillian Lee, ``Using very simple statistics for review search: An exploration'', Proceedings of COLING: Companion Volume: Posters, pp. 73--76, 2008.

@InProceedings{Pang+Lee:08a,
  author    = {Bo Pang and Lillian Lee},
  title     = {Using very simple statistics for review search: An exploration},
  booktitle = {Proceedings of COLING: Companion Volume: Posters},
  year      = {2008},
  pages     = {73--76}
}

=======

Data Format Summary

This dataset consists of hand-annotated documents drawn from webpages retrieved by the Yahoo! search engine in response to real and publicly-available user queries. More specifically, the file ss_data.tar.gz contains this README, a file with human annotations (ss-annotation.txt), and copies of the content of the search results used in our experiments.

- ss-annotation.txt: judgments on 1346 documents (69 search-sets).

  Each line in this file,

      [$q.$i, a_$k] $exp_label ($ann_label) $rel $comment

  encodes the judgments of annotator a_$k on the $i-th search result for query $q (see the parsing sketch at the end of this item).

  $ann_label: a number between 0 and 3. This is the original label assigned by the annotator, following the convention below (see Appendix A for more details on the four-way classification scheme):

      * 0. single review
      * 1. multiple reviews
      * 2. mixture of review and objective information
      * 3. objective document

  $exp_label: {subj|obj}. For experimental purposes, we subsequently collapsed the four classes into the two categories subj (``subjective'') and obj (``objective''), where the ``subjective'' category covered the first three of the four original annotation labels.

  $rel: {on|off}-topic. The annotator was also asked to judge whether a search result was ``on-topic'' or ``off-topic'' given the query $q (i.e., is it about the subject matter on which the user was looking for reviews?). We did not use this information in our experiments, since the main focus of the paper was not on improving topic-based relevancy; we include it in this release for researchers interested in analyzing topic-sentiment interactions.

  $comment: optionally, annotators could enter a short note to provide further explanation. For instance, an annotator might put "sales pitch" in the comment field to indicate that this was the basis for assigning an "objective" label to a given entry.

  As an example, the line

      [age_reform_book_review.4, a_3] subj (2) on-topic

  encodes the labels assigned by annotator a_3 to the 4th search result for the query ``age reform book review''. The annotator labeled it as a mixture of review and objective information (2), which was converted to a ``subj'' label in our experiments. In addition, the annotator considered this document to be an on-topic search result for the query. No additional comments were recorded for this entry.

  * Lines starting with "### ": as noted in more detail in Appendix A, a number of search-sets were annotated by two annotators (for agreement studies). In our experiments, we randomly selected one annotator whose labels were used as the experimental gold standard. The labels provided by the other annotator are also included in this release; these lines are marked with "### " at the beginning to indicate that they were not used in our experiments.
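  The following is a minimal parsing sketch for ss-annotation.txt (ours, not part of the original release); it assumes the line format described above, and the regular expression, field names, and the Judgment record are our own additions:

      import re
      from typing import NamedTuple, Optional

      # Matches lines such as
      #   [age_reform_book_review.4, a_3] subj (2) on-topic
      # and the "### "-prefixed second-annotator lines; the exact whitespace
      # handling is an assumption on our part.
      LINE_RE = re.compile(
          r'^(?P<extra>### )?'
          r'\[(?P<query>.+)\.(?P<rank>\d+), (?P<annotator>a_\d+)\] '
          r'(?P<exp_label>subj|obj) '
          r'\((?P<ann_label>[0-3])\) '
          r'(?P<rel>on|off)-topic'
          r'\s*(?P<comment>.*)$')

      class Judgment(NamedTuple):
          query: str       # e.g. "age_reform_book_review"
          rank: int        # position of the result within its search-set
          annotator: str   # e.g. "a_3"
          ann_label: int   # original four-way label, 0..3
          exp_label: str   # collapsed label, "subj" or "obj"
          on_topic: bool   # relevance judgment
          comment: str     # optional free-text note, possibly empty
          gold: bool       # False for "### " lines (second annotator)

      def parse_line(line: str) -> Optional[Judgment]:
          """Parse one line of ss-annotation.txt; return None if it does not match."""
          m = LINE_RE.match(line.strip())
          if m is None:
              return None
          return Judgment(query=m['query'],
                          rank=int(m['rank']),
                          annotator=m['annotator'],
                          ann_label=int(m['ann_label']),
                          exp_label=m['exp_label'],
                          on_topic=(m['rel'] == 'on'),
                          comment=m['comment'].strip(),
                          gold=(m['extra'] is None))

      # Example usage:
      #   with open('ss-annotation.txt') as f:
      #       judgments = [j for j in map(parse_line, f) if j is not None]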
- ss-ann-instruction.html: the instruction page shown to the annotators, with links to example pages for the different categories.

- search_results/$q/: files related to the content of the search-set for query $q (in most cases, the top 20 search results returned by Yahoo! Web Search).

  Within each search-set directory search_results/$q, the files related to the $i-th search result for query $q can be found as $q.$i.*:

  * $q.$i.info: contains the URL for that search result, as well as other meta-information provided by the Yahoo! Search API.

  * $q.$i.html: a copy of the webpage in its original html format as cached by the search engine (to ensure that the webpages we retrieved were identical to the pages used by the search engine as the basis for its ranking), or a copy of the (then) current content of the given URL if a cached version was not available.

  * $q.$i.txt: a copy of the content of the webpage in textual format (created by the lynx program). Note that we conducted further cleaning of this text (e.g., removing the list of URLs at the end of the file) before running our algorithms over it. Unfortunately, we no longer have access to the post-processed versions of the .txt files; the versions provided here are the files generated by the lynx program. (A small sketch of how these files can be located is given after this list.)
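  The following sketch (ours, not part of the original release) shows how the files for one search result might be located and read, assuming the tarball has been unpacked so that search_results/ sits under some root directory; the function names are hypothetical:

      from pathlib import Path

      def result_paths(root: str, query: str, rank: int):
          """Return the .info, .html, and .txt paths for the rank-th search
          result of the given query, following the search_results/$q/$q.$i.*
          layout described above."""
          d = Path(root) / 'search_results' / query
          stem = f'{query}.{rank}'
          return d / f'{stem}.info', d / f'{stem}.html', d / f'{stem}.txt'

      def load_lynx_text(root: str, query: str, rank: int) -> str:
          """Read the lynx-generated text of one search result.  These are the
          raw lynx dumps, not the further-cleaned versions used in the paper."""
          _info, _html, txt = result_paths(root, query, rank)
          # Cached webpages may use various encodings; replace undecodable bytes.
          return txt.read_text(encoding='utf-8', errors='replace')

      # Example usage:
      #   print(load_lynx_text('.', 'age_reform_book_review', 4)[:200])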
=======

Annotation Procedure (Appendix A)

- Annotation scheme

We relied on a pool of more than a dozen annotators, most of whom were Cornell students, and all of whom were tech-savvy. The mode of interaction consisted of self-scheduled interactions with a web-based system that allowed participation to be broken up over multiple sessions. The annotators were first presented with detailed instructions. Each then classified three or more search-sets on a query-by-query basis, where the documents in the search-set for a given query were presented in random order. For almost every annotator, at least two of his or her search-sets were also labeled by another person, so that we could measure pair-wise agreement.

The annotators labeled the search-set documents according to the following four-way classification:

* single review: the main purpose of the page is to present a single coherent review on the subject defined by the query.

* multiple reviews: the main purpose of the page is to present multiple reviews on the given subject (for instance, a typical page from tripadvisor.com consisting mostly of an aggregation of snippets of reviews).

* mixture of review and objective information: the page contains a significant amount of objective information that is not a coherent part of a review, and the main purpose of the page is not just to provide reviews. An example would be a page containing product specifications as well as reviews.

* objective: no textual subjective opinions are expressed on the page (graphical depictions of stars or links to reviews do not count), or the page consists of ``sales pitches'' (biased reviews, such as might be written by the manufacturers of the product in question). The motivation behind calling such texts ``objective'' is that users are probably only interested in unbiased reviews, so a re-ranking system should ideally place biased reviews low on the list of retrieved results. Naturally, automatically identifying biased reviews is an interesting task in and of itself.

Also, we believe that non-NLP methods can be used to handle graphical depictions of stars and links to reviews, so our focus was on the downstream task of testing whether methods could be employed to analyze the text of a document for subjectivity, rather than on, say, trying to detect stars.

Our intent in designing the four-way classification scheme just described was to capture the following intuition: among pages containing subjective content, some (e.g., those whose main purpose is to provide subjective information) may be more desirable than others (e.g., those with minimal subjective information) in our setting. However, although we provided detailed instructions and accompanying example pages to the annotators, the distinction proved too subtle in some cases, and agreement studies on the four-way labeling were not satisfactory. For experimental purposes, therefore, we collapsed the four classes into the two categories ``subjective'' and ``objective'', where the ``subjective'' category covered the first three of the four original annotation labels: single review, multiple reviews, and mixture of review and objective information.

We next discarded search-sets in which all the documents shared the same label, as these are useless for comparing algorithms that engage in subjectivity re-ranking. This left 1346 hand-labeled documents distributed across 69 search-sets, which are released with this dataset. Sixty-four of these search-sets contain 20 valid documents; the other five contain from 7 to 19 documents, mostly because the query was too narrow for the search engine to return 20 results.

- Agreement study

The queries we worked with exhibited a range of apparent topics and specificity. Examples include:

* ``civic 2000 review''

* ``acedemy [sic] awards reviews''

* ``reviver ade reviews'' [Note that ``ade'' appears to be a typo for ``AED'', ``automated external defibrillator''. One of our annotators explicitly mentioned being unable to decrypt this query, which illustrates how difficult it can sometimes be for another party to determine the intent of a particular user's query.]

* ``telephones read reviews compare prices yahoo''

We suspect that this variety affects the ease of the annotation decisions for particular search-sets; it also makes it difficult to consider any single query representative of the queries as a whole. Therefore, rather than have all the annotators label a single search-set in common and compute multi-party agreement on that one set of documents (thus putting all our eggs in one uncertain basket), we arranged matters so that a dozen search-sets were each examined by a different pair of annotators, as mentioned above.

On average, the annotators agreed on 88.2% of the documents per search-set, with the average Kappa coefficient being an acceptable 0.73, reflecting in part the difficulty of the judgment. The lowest Kappa coefficient occurs on a search-set with a 75% agreement rate.
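For reference (the README does not spell out the agreement computation, so treat this as our own illustration), the following sketch computes pairwise observed agreement and the standard Cohen's kappa for one doubly-annotated search-set; whether the original study used exactly this variant of kappa is an assumption:

    from collections import Counter
    from typing import Sequence, Tuple

    def agreement_and_kappa(labels_a: Sequence[str],
                            labels_b: Sequence[str]) -> Tuple[float, float]:
        """Observed agreement and Cohen's kappa for two annotators' labels over
        the same documents (e.g. the collapsed subj/obj labels of one
        doubly-annotated search-set)."""
        assert labels_a and len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of documents given identical labels.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each annotator's marginal label distribution.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(counts_a[c] * counts_b[c]
                  for c in set(counts_a) | set(counts_b)) / (n * n)
        if p_e == 1.0:  # both annotators used one and the same label throughout
            return p_o, 1.0
        return p_o, (p_o - p_e) / (1.0 - p_e)

    # Example usage:
    #   p_o, kappa = agreement_and_kappa(['subj', 'subj', 'obj'],
    #                                    ['subj', 'obj', 'obj'])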
As is to be expected in situations where humans are asked to provide labels that are not necessarily obvious to determine, some disagreements center on items that are genuinely difficult to classify, whereas others may be due to clerical errors. But one source of disagreement that stems from the specifics of our design is that we instructed annotators to mark ``sales pitch'' documents as non-reviews, on the premise that although such texts are subjective, they are not subjective in a way that is valuable to a user searching for unbiased reviews. (Note that this policy presumably makes the dataset more challenging for automated algorithms.) There are several cases where only one annotator identified this type of bias, which is not surprising, since the authors of sales pitches may actively try to fool readers into believing the text to be unbiased.

=======

Acknowledgments

We are very grateful to our annotators: Mohit Bansal, Eric Breck, Yejin Choi, Matt Connelly, Tom Finley, Effi Georgala, Asif-ul Haque, Kersing Huang, Evie Kleinberg, Art Munson, Ben Pu, Ari Rabkin, Benyah Shaparenko, Ves Stoyanov, and Yisong Yue.