Sentiment Analysis Datasets with Latent Explanation Initializations

Version 1.0

Contact: Ainur Yessenalina
ainur at cs dot cornell dot edu


Sentiment Analysis Datasets with Latent Explanation Initializations contains two standard sentiment-analysis datasets, preprocessed for use with the SVMsle package: the Movie Reviews dataset and the U.S. Congressional floor-debates dataset, together with initializations of the latent explanations.

The Movie Reviews data comes with two kinds of latent explanation initializations: OpinionFinder and Annotator. The experimental setup for the Movie Reviews data is the same as described in [1] and [2]. OpinionFinder initializations were derived using the contextual word-level polarity classifier OpinionFinder [4]: if the majority polarity vote over the polar words OpinionFinder found in a sentence matched the document-level polarity, that sentence was added to the latent explanation set. Annotator initializations are the sentences that contain a human "annotator rationale" from [2]. The initial latent explanations are formed as follows:
  • Select L sentences from the set of sentences chosen by the initialization method.
  • If fewer than L sentences are available, add sentences starting from the end of the document.
  • If more than L sentences are available, remove sentences starting from the beginning of the document.
Here L = 0.3*|x|, where |x| is the length of the document in sentences (i.e., 30% of the sentences in the document).
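The selection procedure above can be sketched as follows. This is a minimal illustration, not code from the SVMsle package; the function name, data layout, and the use of truncation when computing L are assumptions.

```python
def init_latent_explanations(doc_sentences, selected, frac=0.3):
    """Form the initial latent-explanation set for one document.

    doc_sentences: all sentences of the document, in order.
    selected: indices of the sentences chosen by the initialization
              method (OpinionFinder or Annotator rationales).
    """
    L = int(frac * len(doc_sentences))  # target size: 30% of the document
    chosen = sorted(selected)
    if len(chosen) < L:
        # Too few: pad with sentences starting from the end of the document.
        for i in range(len(doc_sentences) - 1, -1, -1):
            if len(chosen) >= L:
                break
            if i not in chosen:
                chosen.append(i)
        chosen.sort()
    elif len(chosen) > L:
        # Too many: drop sentences starting from the beginning of the document.
        chosen = chosen[len(chosen) - L:]
    return [doc_sentences[i] for i in chosen]
```

For a 10-sentence document, L = 3: an initialization that selected only sentence 1 is padded with the last two sentences, while one that selected the first five sentences is trimmed to the final three of them.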

The Floor Debates data contains only OpinionFinder initializations. Since the experiments with latent explanations in [1] were done in the speaker-based, speech-segment setting, we applied the procedure described above to each speech segment and then merged the latent explanations of all speeches by the same speaker.
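The per-speaker merging step can be sketched as below. This is a hypothetical illustration of the merge only (it assumes the per-segment explanations have already been computed); the function name and input layout are not from the package.

```python
from collections import defaultdict

def merge_by_speaker(segment_explanations):
    """Merge per-segment latent explanations by speaker.

    segment_explanations: list of (speaker_id, explanation_sentences)
    pairs, one per speech segment, where explanation_sentences were
    computed per segment by the initialization procedure.
    """
    merged = defaultdict(list)
    for speaker, sentences in segment_explanations:
        # All segments spoken by the same speaker contribute to one set.
        merged[speaker].extend(sentences)
    return dict(merged)
```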

Data & Source Code

These are ready-to-run archives for the SVMsle package.

You will have to download the source code of SVMsle from the following location:

Sentence Splits

These are the sentence splits for both datasets. The sentence splits for the Movie Reviews data were obtained using a component of the OpinionFinder tool [4]; the sentence splits for the Floor Debates data are the same as those used in [3].

Note: all the experiments in [1] were done with L = 0.3*|x|.


[1] A. Yessenalina, Y. Yue, and C. Cardie. Multi-Level Structured Models for Document-Level Sentiment Classification. In Proceedings of EMNLP, 2010 (to appear).
[2] O. Zaidan, J. Eisner, and C. Piatko. Using "Annotator Rationales" to Improve Machine Learning for Text Categorization. In Proceedings of NAACL, 2007.
[3] M. Thomas, B. Pang, and L. Lee. Get Out the Vote: Determining Support or Opposition from Congressional Floor-Debate Transcripts. In Proceedings of EMNLP, 2006.
[4] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of HLT/EMNLP, 2005.
[5] B. Pang and L. Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of ACL, 2004.