Mining Associations from Web Text

Search engine providers such as Google regularly receive queries that contain subjective predicates; users search, for instance, for "big cities" or "cute animals". To answer such queries from structured data, search engine providers need to understand which subjective properties the average user associates with which entities. This was the motivation for the "Subjective Property Mining Project", a collaboration with researchers from Google Mountain View. During this project, we developed the Surveyor system, which mines the entire Web to find billions of subjective associations.

We use natural language analysis to identify statements in Web text that express an opinion about whether or not a specific property applies to a specific entity. Because the properties we consider are subjective, user opinions diverge, and we need to resolve conflicting opinions in order to correctly infer the majority opinion. This is surprisingly difficult: simple resolution strategies (e.g., taking the majority vote among conflicting opinions) lead to poor results (i.e., a poor match with opinions collected from test users). The reason is that various types of skew influence the probability that users express a certain opinion on the Web. For instance, users who think that a certain city is big are more likely to write about it on the Web than users who think it is not, so they are overrepresented in a sample of opinions collected from the Web. We overcome these challenges by learning user behavior models that are specific to a property and an entity type, using unsupervised machine learning based on expectation-maximization. A large user study shows that these models let us infer the opinions of the average user quite reliably.
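
To make the effect of skew concrete, the following Python sketch contrasts a plain majority vote with a small expectation-maximization fit. It is not the Surveyor model: it assumes, purely for illustration, that each entity has a hidden yes/no majority opinion and that the numbers of positive and negative statements about it follow Poisson distributions whose rates depend on that hidden opinion. The function names, the initialization, and the statement counts are all hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def majority_vote(counts):
    # Baseline: the property applies if positive statements outnumber negative ones.
    return [pos > neg for pos, neg in counts]


def em_posteriors(counts, iterations=50, eps=1e-9):
    """Fit a two-component Poisson mixture with EM (illustrative assumption).

    counts: list of (positive, negative) statement counts, one pair per entity.
    Returns, per entity, the posterior probability that the property applies.
    """
    n = len(counts)
    mean_pos = sum(p for p, _ in counts) / n
    mean_neg = sum(q for _, q in counts) / n
    prior = 0.5
    # Index 0: property does not apply; index 1: property applies.
    # Assumption: entities with the property attract more positive statements,
    # which also fixes which mixture component means "applies".
    rate_pos = [0.5 * mean_pos + eps, 1.5 * mean_pos + eps]
    rate_neg = [mean_neg + eps, mean_neg + eps]
    post = [0.5] * n
    for _ in range(iterations):
        # E-step: posterior log-odds per entity (factorial terms cancel).
        post = []
        for pos, neg in counts:
            log_odds = (
                math.log(prior / (1.0 - prior))
                + pos * math.log(rate_pos[1] / rate_pos[0])
                - (rate_pos[1] - rate_pos[0])
                + neg * math.log(rate_neg[1] / rate_neg[0])
                - (rate_neg[1] - rate_neg[0])
            )
            post.append(sigmoid(log_odds))
        # M-step: re-estimate the prior and the four Poisson rates.
        w1 = sum(post)
        w0 = n - w1
        prior = min(max(w1 / n, eps), 1.0 - eps)
        rate_pos = [
            sum((1.0 - r) * p for r, (p, _) in zip(post, counts)) / (w0 + eps) + eps,
            sum(r * p for r, (p, _) in zip(post, counts)) / (w1 + eps) + eps,
        ]
        rate_neg = [
            sum((1.0 - r) * q for r, (_, q) in zip(post, counts)) / (w0 + eps) + eps,
            sum(r * q for r, (_, q) in zip(post, counts)) / (w1 + eps) + eps,
        ]
    return post


# Made-up (positive, negative) counts for statements such as "X is a big city".
counts = [(40, 2), (35, 1), (50, 3), (6, 3), (5, 2), (4, 4), (7, 3)]
print(majority_vote(counts))
print([round(p, 3) for p in em_posteriors(counts)])
```

On these toy counts, the majority vote marks six of the seven entities as having the property, because positive statements dominate almost everywhere. The mixture fit instead separates the three entities with many positive statements from the four with only a few, since it learns that a handful of positive statements is typical even when the property does not apply.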


Immanuel Trummer, Alon Halevy, Hongrae Lee, Sunita Sarawagi, Rahul Gupta. "Mining Subjective Properties on the Web". SIGMOD 2015.