=======

Introduction

This README v1.0 (September, 2008) for the v1.0 search-set dataset comes from the URL http://www.cs.cornell.edu/home/llee/data/search-subj.html.

=======

Citation Info

This data was first used in Bo Pang and Lillian Lee, ``Using very simple statistics for review search: An exploration'', Proceedings of COLING: Companion Volume: Posters, pp. 73--76, 2008.

@InProceedings{Pang+Lee:08a,
  author    = {Bo Pang and Lillian Lee},
  title     = {Using very simple statistics for review search: An exploration},
  booktitle = {Proceedings of COLING: Companion Volume: Posters},
  year      = {2008},
  pages     = {73--76}
}

=======

Data Format Summary

This dataset consists of hand-annotated documents drawn from webpages retrieved by the Yahoo! search engine in response to real and publicly-available user queries. More specifically, the file ss_data.tar.gz contains this README, a file with human annotations (ss-annotation.txt), and copies of the content of the search results used in our experiments.

- ss-annotation.txt: judgments on 1346 documents (69 search-sets).

  Each line in this file,

      [$q.$i, a_$k] $exp_label ($ann_label) $rel $comment

  encodes the judgments of annotator a_$k on the $i-th search result for query $q (see the parsing sketch at the end of this item).

  $ann_label: a number between 0 and 3. This is the original label assigned by the annotator, following the convention below (see Appendix A for more details on the four-way classification scheme):

      * 0. single review
      * 1. multiple reviews
      * 2. mixture of review and objective information
      * 3. objective document

  $exp_label: {subj|obj}. For experimental purposes, we subsequently collapsed the four classes into the two categories subj (``subjective'') and obj (``objective''), where the ``subjective'' category covered the first three of the four original annotation labels.

  $rel: {on|off}-topic. The annotator was also asked to judge whether a search result was ``on-topic'' or ``off-topic'' given the query $q (i.e., is it about the subject matter on which the user was looking for reviews?). We did not use this information in our experiments, since the main focus of the paper was not on improving topic-based relevancy; we include it in this release for researchers interested in analyzing topic-sentiment interactions.

  $comment: optionally, annotators could enter a short note to provide further explanation. For instance, an annotator might put "sales pitch" in the comment field to indicate that this was the basis for assigning an "objective" label to a given entry.

  As an example, the line

      [age_reform_book_review.4, a_3] subj (2) on-topic

  encodes the labels assigned by annotator a_3 to the 4th search result for the query ``age reform book review''. The annotator labeled it as a mixture of review and objective information (2), which was converted to a ``subj'' label in our experiments. In addition, the annotator considered this document to be an on-topic search result for the query. No additional comments were recorded for this entry.

  * Lines starting with "### ": as noted in more detail in Appendix A, a number of search-sets were annotated by two annotators (for agreement studies). In our experiments, we randomly selected one annotator whose labels were used as the experimental gold standard. The labels provided by the other annotator are also included in this release; these lines are marked with "### " at the beginning to indicate that they were not used in our experiments.
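  The following is a minimal parsing sketch for ss-annotation.txt (ours, not part of the original release); it assumes the line format described above, and the regular expression, field names, and the Judgment record are our own additions:

      import re
      from typing import NamedTuple, Optional

      # Matches lines such as
      #   [age_reform_book_review.4, a_3] subj (2) on-topic
      # and the "### "-prefixed second-annotator lines; the exact whitespace
      # handling is an assumption on our part.
      LINE_RE = re.compile(
          r'^(?P<extra>### )?'
          r'\[(?P<query>.+)\.(?P<rank>\d+), (?P<annotator>a_\d+)\] '
          r'(?P<exp_label>subj|obj) '
          r'\((?P<ann_label>[0-3])\) '
          r'(?P<rel>on|off)-topic'
          r'\s*(?P<comment>.*)$')

      class Judgment(NamedTuple):
          query: str       # e.g. "age_reform_book_review"
          rank: int        # position of the result within its search-set
          annotator: str   # e.g. "a_3"
          ann_label: int   # original four-way label, 0..3
          exp_label: str   # collapsed label, "subj" or "obj"
          on_topic: bool   # relevance judgment
          comment: str     # optional free-text note, possibly empty
          gold: bool       # False for "### " lines (second annotator)

      def parse_line(line: str) -> Optional[Judgment]:
          """Parse one line of ss-annotation.txt; return None if it does not match."""
          m = LINE_RE.match(line.strip())
          if m is None:
              return None
          return Judgment(query=m['query'],
                          rank=int(m['rank']),
                          annotator=m['annotator'],
                          ann_label=int(m['ann_label']),
                          exp_label=m['exp_label'],
                          on_topic=(m['rel'] == 'on'),
                          comment=m['comment'].strip(),
                          gold=(m['extra'] is None))

      # Example usage:
      #   with open('ss-annotation.txt') as f:
      #       judgments = [j for j in map(parse_line, f) if j is not None]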
- ss-ann-instruction.html: the instruction page shown to the annotators, with links to example pages for the different categories.

- search_results/$q/: files related to the content of the search-set for query $q (in most cases, the top 20 search results returned by Yahoo! Web Search).

  Within each search-set directory search_results/$q, the files related to the $i-th search result for query $q can be found as $q.$i.*:

  * $q.$i.info: contains the URL for that search result, as well as other meta-information provided by the Yahoo! Search API.

  * $q.$i.html: a copy of the webpage in its original html format as cached by the search engine (to ensure that the webpages we retrieved were identical to the pages used by the search engine as the basis for its ranking), or a copy of the (then) current content of the given URL if a cached version was not available.

  * $q.$i.txt: a copy of the content of the webpage in textual format (created by the lynx program). Note that we conducted further cleaning of this text (e.g., removing the list of URLs at the end of the file) before running our algorithms over it. Unfortunately, we no longer have access to the post-processed versions of the .txt files; the versions provided here are the files generated by the lynx program. (A small sketch of how these files can be located is given after this list.)
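  The following sketch (ours, not part of the original release) shows how the files for one search result might be located and read, assuming the tarball has been unpacked so that search_results/ sits under some root directory; the function names are hypothetical:

      from pathlib import Path

      def result_paths(root: str, query: str, rank: int):
          """Return the .info, .html, and .txt paths for the rank-th search
          result of the given query, following the search_results/$q/$q.$i.*
          layout described above."""
          d = Path(root) / 'search_results' / query
          stem = f'{query}.{rank}'
          return d / f'{stem}.info', d / f'{stem}.html', d / f'{stem}.txt'

      def load_lynx_text(root: str, query: str, rank: int) -> str:
          """Read the lynx-generated text of one search result.  These are the
          raw lynx dumps, not the further-cleaned versions used in the paper."""
          _info, _html, txt = result_paths(root, query, rank)
          # Cached webpages may use various encodings; replace undecodable bytes.
          return txt.read_text(encoding='utf-8', errors='replace')

      # Example usage:
      #   print(load_lynx_text('.', 'age_reform_book_review', 4)[:200])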
=======

Annotation Procedure (Appendix A)

- Annotation scheme

We relied on a pool of more than a dozen annotators, most of whom were Cornell students, and all of whom were tech-savvy. The mode of interaction consisted of self-scheduled interactions with a web-based system that allowed participation to be broken up over multiple sessions. The annotators were first presented with detailed instructions. Each then classified three or more search-sets on a query-by-query basis, where the documents in the search-set for a given query were presented in random order. For almost every annotator, at least two of his or her search-sets were also labeled by another person, so that we could measure pair-wise agreement.

The annotators labeled the search-set documents according to the following four-way classification:

* single review: the main purpose of the page is to present a single coherent review on the subject defined by the query.

* multiple reviews: the main purpose of the page is to present multiple reviews on the given subject (for instance, a typical page from tripadvisor.com consisting mostly of an aggregation of snippets of reviews).

* mixture of review and objective information: the page contains a significant amount of objective information that is not a coherent part of a review, and the main purpose of the page is not just to provide reviews. An example would be a page containing product specifications as well as reviews.

* objective: no textual subjective opinions are expressed on the page (graphical depictions of stars or links to reviews do not count), or the page consists of ``sales pitches'' (biased reviews, such as might be written by the manufacturers of the product in question). The motivation behind calling such texts ``objective'' is that users are probably only interested in unbiased reviews, so a re-ranking system should ideally place biased reviews low on the list of retrieved results. Naturally, automatically identifying biased reviews is an interesting task in and of itself.

Also, we believe that non-NLP methods can be used to handle graphical depictions of stars and links to reviews, so our focus was on the downstream task of testing whether methods could be employed to analyze the text of a document for subjectivity, rather than on, say, trying to detect stars.

Our intent in designing the four-way classification scheme just described was to capture the following intuition: among pages containing subjective content, some (e.g., those whose main purpose is to provide subjective information) may be more desirable than others (e.g., those with minimal subjective information) in our setting. However, although we provided detailed instructions and accompanying example pages to the annotators, the distinction proved too subtle in some cases, and agreement studies on the four-way labeling were not satisfactory. For experimental purposes, therefore, we collapsed the four classes into the two categories ``subjective'' and ``objective'', where the ``subjective'' category covered the first three of the four original annotation labels: single review, multiple reviews, and mixture of review and objective information.

We next discarded search-sets in which all the documents shared the same label, as these are useless for comparing algorithms that engage in subjectivity re-ranking. This left 1346 hand-labeled documents distributed across 69 search-sets, which are released with this dataset. Sixty-four of these search-sets contain 20 valid documents; the other five contain from 7 to 19 documents, mostly because the query was too narrow for the search engine to return 20 results.

- Agreement study

The queries we worked with exhibited a range of apparent topics and specificity. Examples include:

* ``civic 2000 review''

* ``acedemy [sic] awards reviews''

* ``reviver ade reviews'' [Note that ``ade'' appears to be a typo for ``AED'', ``automated external defibrillator''. One of our annotators explicitly mentioned being unable to decrypt this query, which illustrates how difficult it can sometimes be for another party to determine the intent of a particular user's query.]

* ``telephones read reviews compare prices yahoo''

We suspect that this variety affects the ease of the annotation decisions for particular search-sets; it also makes it difficult to consider any single query representative of the queries as a whole. Therefore, rather than have all the annotators label a single search-set in common and compute multi-party agreement on that one set of documents (thus putting all our eggs in one uncertain basket), we arranged matters so that a dozen search-sets were each examined by a different pair of annotators, as mentioned above.

On average, the annotators agreed on 88.2% of the documents per search-set, with the average Kappa coefficient being an acceptable 0.73, reflecting in part the difficulty of the judgment. The lowest Kappa coefficient occurs on a search-set with a 75% agreement rate.
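For reference (the README does not spell out the agreement computation, so treat this as our own illustration), the following sketch computes pairwise observed agreement and the standard Cohen's kappa for one doubly-annotated search-set; whether the original study used exactly this variant of kappa is an assumption:

    from collections import Counter
    from typing import Sequence, Tuple

    def agreement_and_kappa(labels_a: Sequence[str],
                            labels_b: Sequence[str]) -> Tuple[float, float]:
        """Observed agreement and Cohen's kappa for two annotators' labels over
        the same documents (e.g. the collapsed subj/obj labels of one
        doubly-annotated search-set)."""
        assert labels_a and len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of documents given identical labels.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each annotator's marginal label distribution.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(counts_a[c] * counts_b[c]
                  for c in set(counts_a) | set(counts_b)) / (n * n)
        if p_e == 1.0:  # both annotators used one and the same label throughout
            return p_o, 1.0
        return p_o, (p_o - p_e) / (1.0 - p_e)

    # Example usage:
    #   p_o, kappa = agreement_and_kappa(['subj', 'subj', 'obj'],
    #                                    ['subj', 'obj', 'obj'])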
As is to be expected in situations where humans are asked to provide labels that are not necessarily obvious to determine, some disagreements center on items that are genuinely difficult to classify, whereas others may be due to clerical errors. But one source of disagreement that stems from the specifics of our design is that we instructed annotators to mark ``sales pitch'' documents as non-reviews, on the premise that although such texts are subjective, they are not subjective in a way that is valuable to a user searching for unbiased reviews. (Note that this policy presumably makes the dataset more challenging for automated algorithms.) There are several cases where only one annotator identified this type of bias, which is not surprising, since the authors of sales pitches may actively try to fool readers into believing the text to be unbiased.

=======

Acknowledgments

We are very grateful to our annotators: Mohit Bansal, Eric Breck, Yejin Choi, Matt Connelly, Tom Finley, Effi Georgala, Asif-ul Haque, Kersing Huang, Evie Kleinberg, Art Munson, Ben Pu, Ari Rabkin, Benyah Shaparenko, Ves Stoyanov, and Yisong Yue.