Speaker: Thorsten Joachims
Affiliation: GMD
Date: 4/17/01
Time and Location: 4:15 PM, B11 Kimball Hall
Title: The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms

Text classification, the task of automatically assigning semantic categories to natural language text, has become one of the key methods for organizing online information. Since hand-coding classification rules is costly or even impractical, seemingly every machine learning algorithm ever invented has been applied to the problem of learning text classifiers from examples, many with success on some tasks. This has led to a wealth of empirical results, but no theory that explains them. Nor has any conventional method been found to be both efficient and effective without additional, difficult-to-control heuristics. Based on ideas from Support Vector Machines (SVMs), my dissertation presents the first approach to learning text classifiers that is highly effective without heuristic components, that is computationally efficient, and that comes with a learning theory operational enough to guide applications.

In this talk I will give an overview of the approach, summarizing results on support vector methods for induction and transduction, efficient performance estimation, model and parameter selection, and large-scale training algorithms for SVMs. In more detail, I will present the statistical learning model of text classification with SVMs. It formalizes how SVMs suit the statistical properties of text, leading to constructive and intuitive bounds on the expected generalization performance.
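To make the maximum-margin idea concrete, the sketch below trains a linear SVM on hand-made toy "documents" represented as bag-of-words vectors, using stochastic subgradient descent on the regularized hinge loss (a Pegasos-style update). This is an illustrative sketch only: the toy data, vocabulary, and training schedule are assumptions for demonstration, not the methods or datasets from the dissertation.

```python
import random

# Toy labeled "documents" (+1 / -1); purely illustrative, not real data.
docs = [("cheap pills buy now", 1), ("meeting agenda attached", -1),
        ("buy cheap now", 1), ("project meeting notes", -1)]

# Fixed vocabulary and bag-of-words representation.
vocab = sorted({w for text, _ in docs for w in text.split()})

def vec(text):
    words = text.split()
    return [words.count(w) for w in vocab]

data = [(vec(t), y) for t, y in docs]

def train(data, lam=0.01, epochs=200, seed=0):
    """Minimize (lam/2)||w||^2 + average hinge loss via stochastic
    subgradient steps; minimizing ||w|| maximizes the margin."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(vocab)
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            score = sum(wi * xi for wi, xi in zip(w, x))
            # Shrink w (regularization), then step only on margin violations.
            w = [(1 - eta * lam) * wi for wi in w]
            if y * score < 1:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

w = train(data)

def predict(text):
    return 1 if sum(wi * xi for wi, xi in zip(w, vec(text))) > 0 else -1
```

On this trivially separable toy set, the learned weights are positive for words appearing only in positive examples and negative for the rest, so unseen mixtures of those words are classified by sign of the score.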