Department of Computer Science Colloquium
Tuesday March 2nd, 2002 at 4:15pm 
Upson Hall B17

Statistical Methods for Natural Language Processing

Mike Collins
AT&T Research Labs

http://www.research.att.com/~mcollins

Sequential data is everywhere: obvious examples include text (the web, or digital libraries), speech, and biological sequences. Algorithms which recover the structure underlying this data are becoming increasingly important. This leads to an interesting class of learning problems: how can we learn functions that map strings to other discrete structures such as trees, segmentations, or underlying state sequences?

In the first part of the talk I will review recent work on probabilistic approaches to natural language parsing. Many of these methods fall into the general category of history-based models, where a tree is represented as a derivation (a sequence of decisions) and the probability of the tree is calculated as a product of decision probabilities. While these approaches have many advantages, it can be awkward to encode some constraints within this framework. It is often easy to think of features which might be useful in discriminating between candidate trees for a sentence, but much more difficult to alter the derivation to take these features into account.
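To make the product of decision probabilities concrete: if a tree T is built by a derivation of decisions d_1, ..., d_n, its probability factors as (the notation here is added for illustration, not taken from the talk)

    P(T) = \prod_{i=1}^{n} P(d_i \mid d_1, \ldots, d_{i-1})

Each decision is conditioned on the history of decisions made so far, which is why features of the tree that do not align with the derivation order are awkward to incorporate.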

As an alternative to these methods, I will present new algorithms for these problems, derived from Freund and Schapire's voted perceptron algorithm for classification tasks. A first motivation for the new algorithms concerns *representation*: in comparison to hidden Markov models or probabilistic context-free grammars, the methods are considerably more flexible in the features that can be used to discriminate between competing structures. A second theme is *discriminative training*: the parameter estimation methods make relatively weak assumptions about the distribution underlying the examples. During the talk I will present experiments with the methods, showing their utility on a number of natural language problems.
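A minimal Python sketch of a perceptron-style update for structured outputs, in the spirit of the algorithms above (the feature map, candidate enumeration, and toy data are my own illustrative assumptions; refinements such as vote averaging and dynamic-programming search are omitted):

    from collections import defaultdict
    from itertools import product

    def features(x, y):
        """Toy feature map: counts of (word, tag) and (prev_tag, tag) pairs.
        Real systems would use a much richer, task-specific feature set."""
        f = defaultdict(float)
        prev = "<s>"
        for word, tag in zip(x, y):
            f[("word-tag", word, tag)] += 1.0
            f[("tag-bigram", prev, tag)] += 1.0
            prev = tag
        return f

    def score(w, f):
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    def predict(w, x, candidates):
        """Return the highest-scoring candidate structure (here: a tag sequence).
        In practice the argmax is found by dynamic programming, not enumeration."""
        return max(candidates, key=lambda y: score(w, features(x, y)))

    def train(data, candidates_fn, epochs=5):
        """When the model's best-scoring structure differs from the gold one,
        move the weights toward the gold features and away from the predicted ones."""
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y_gold in data:
                y_hat = predict(w, x, candidates_fn(x))
                if y_hat != y_gold:
                    for k, v in features(x, y_gold).items():
                        w[k] += v
                    for k, v in features(x, y_hat).items():
                        w[k] -= v
        return w

    if __name__ == "__main__":
        # Hypothetical two-tag example, for illustration only.
        data = [(("the", "dog"), ("D", "N")), (("a", "cat"), ("D", "N"))]
        tags = ["D", "N"]
        def candidates_fn(x):
            # Enumerate all tag sequences -- feasible only for toy inputs.
            return [tuple(t) for t in product(tags, repeat=len(x))]
        w = train(data, candidates_fn)
        print(predict(w, ("the", "cat"), candidates_fn(("the", "cat"))))

Because the update only needs a feature vector for each candidate structure, arbitrary features of the input and output can be used, which is the representational flexibility contrasted above with hidden Markov models and probabilistic context-free grammars.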