This talk addresses issues in document
classification, which we construe broadly to mean the grouping together of
texts that have similar content. While this task is presumably easier than
explicitly determining document content, it has great utility in practice
and is still plenty hard.
One problem currently attracting a great deal of attention is that of
classifying documents by their overall *sentiment*: for example, one might
want to determine whether a movie review is "thumbs up" or "thumbs down".
Sentiment analysis has empirically been shown to be resistant to traditional
text-categorization approaches, and involves more subtlety than one might at
first imagine. We demonstrate that new learning techniques based on finding
minimum cuts in graphs yield state-of-the-art results even when no explicit
linguistic information is used.
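To make the minimum-cut idea concrete, here is a minimal sketch of the general graph-cut formulation for binary labeling (e.g. positive vs. negative), not the talk's actual model: each item gets individual class scores, pairs of items get association scores encouraging agreement, and a minimum s-t cut yields the lowest-cost joint labeling. The function and score names (`ind_pos`, `assoc`) and all weights are illustrative assumptions.

```python
# Minimal sketch of a minimum-cut formulation for binary labeling.
# All names, scores, and weights are illustrative assumptions.
import networkx as nx

def min_cut_labels(ind_pos, assoc):
    """ind_pos[i]: individual score that item i is 'positive' (1 - score for 'negative');
    assoc[(i, j)]: non-negative association strength encouraging i and j to agree."""
    G = nx.DiGraph()
    source, sink = "POS", "NEG"
    for i, p in ind_pos.items():
        # Capacities from source/sink encode each item's individual preference.
        G.add_edge(source, i, capacity=p)
        G.add_edge(i, sink, capacity=1.0 - p)
    for (i, j), w in assoc.items():
        # Symmetric association edges penalize separating i from j.
        G.add_edge(i, j, capacity=w)
        G.add_edge(j, i, capacity=w)
    # The minimum s-t cut gives the lowest-cost joint assignment.
    _, (pos_side, _neg_side) = nx.minimum_cut(G, source, sink)
    return {i: ("POS" if i in pos_side else "NEG") for i in ind_pos}

labels = min_cut_labels(
    ind_pos={"s1": 0.9, "s2": 0.6, "s3": 0.2},
    assoc={("s1", "s2"): 0.5, ("s2", "s3"): 0.5},
)
print(labels)
```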
We also discuss the long-standing problem of representing topical content.
In particular, we present an analysis of the widely used SVD-based Latent
Semantic Indexing algorithm; this analysis motivates an intuitive
generalization that yields striking empirical improvements over LSI.
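For reference, the following is a minimal sketch of standard LSI (not the generalization presented in the talk): take the rank-k truncated SVD of a term-document matrix and compare documents in the reduced latent space. The toy matrix and the choice of k are illustrative assumptions only.

```python
# Minimal sketch of standard LSI: rank-k truncated SVD of a term-document matrix.
# The toy data and k are illustrative assumptions only.
import numpy as np

def lsi(term_doc, k):
    """Return k-dimensional document representations via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Each column of term_doc (a document) maps to a k-dimensional vector.
    return (np.diag(s[:k]) @ Vt[:k, :]).T

docs = np.array([[2., 0., 1.],
                 [1., 0., 1.],
                 [0., 3., 0.],
                 [0., 2., 1.]])   # rows: terms, columns: documents
reduced = lsi(docs, k=2)
# Cosine similarity between the first and third documents in the latent space.
a, b = reduced[0], reduced[2]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```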
Portions of this talk describe joint work with Rie Kubota Ando, Carmel
Domshlak, Oren Kurland, and Bo Pang.