Research Area: Natural Language Processing
Advisor: Claire Cardie
Interests: Syntax and semantics of natural and artificial languages; Corpus-based natural language processing; Information extraction; Question answering.

Our research uses machine learning techniques to build components for understanding natural language. This methodology typically requires a corpus of examples describing the task one wishes to accomplish. For example, the corpus might consist of sentences with their corresponding parse trees. The component -- in this case a parser -- is built by learning from the example parses.

At this time our chief goal is to lower the cost of corpus-based NLP by reducing the amount of training data required. This should allow language learning to be more widely applied, especially by non-experts in computational linguistics.

Partial Parsing Framework

Partial parsing is a simplified version of the parsing task in which the goal is to identify major constituents and relationships, such as NPs and predicate-argument structure, while disregarding difficult ambiguities, such as prepositional phrase attachment. We designed a simple corpus-based framework for partial parsing that uses sequences of part-of-speech (syntactic category) tags as rules. We instantiated this framework for NPs and for subject-verb-object relationships. The framework provides a testbed for experimenting with techniques for improving parser performance.
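The framework's use of tag sequences as rules can be sketched as follows; this is a minimal illustration in Python, assuming rules are tuples of POS tags applied longest-match-first, and is not the framework's actual implementation.

```python
def bracket_nps(tagged_sentence, rules):
    """Greedily bracket base NPs using longest-match POS tag sequences.

    tagged_sentence: list of (word, tag) pairs
    rules: set of POS tag tuples, e.g. {("DT", "NN"), ("DT", "JJ", "NN")}
    Returns the tokens with matched NPs wrapped in brackets.
    """
    max_len = max((len(r) for r in rules), default=0)
    out, i = [], 0
    while i < len(tagged_sentence):
        # Try the longest rule first at position i.
        for n in range(min(max_len, len(tagged_sentence) - i), 0, -1):
            window = tuple(tag for _, tag in tagged_sentence[i:i + n])
            if window in rules:
                words = " ".join(w for w, _ in tagged_sentence[i:i + n])
                out.append("[" + words + "]")
                i += n
                break
        else:
            out.append(tagged_sentence[i][0])
            i += 1
    return out

rules = {("DT", "NN"), ("DT", "JJ", "NN"), ("NNP",)}
sent = [("John", "NNP"), ("ate", "VBD"),
        ("the", "DT"), ("red", "JJ"), ("apple", "NN")]
print(bracket_nps(sent, rules))  # ['[John]', 'ate', '[the red apple]']
```

In practice the rule set is extracted from a treebank rather than written by hand, which is what makes the approach corpus-based.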
Grammar Pruning
We tracked the mistakes made by the grammar when we used it to reparse the training data. Rules that made more mistakes than correct identifications were discarded to improve the grammar.
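The pruning step amounts to a per-rule benefit score. As a hedged sketch (the counting procedure and threshold here are illustrative stand-ins, not the exact published method):

```python
def prune_grammar(rules, rule_stats):
    """Error-driven pruning sketch.

    rule_stats maps rule -> (correct_uses, errors), tallied by reparsing
    the training data. Keep only rules whose correct uses outnumber
    their errors.
    """
    return {r for r in rules if rule_stats[r][0] > rule_stats[r][1]}

rules = {("DT", "NN"), ("NN", "NN"), ("JJ",)}
stats = {("DT", "NN"): (120, 3),   # mostly correct: keep
         ("NN", "NN"): (40, 55),   # net-negative: discard
         ("JJ",): (2, 20)}         # net-negative: discard
print(prune_grammar(rules, stats))  # {('DT', 'NN')}
```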

Lexical Information
Lexical information was added as a second knowledge source by piggy-backing a "probabilistic" model on top of the grammar to score constituents and choose among alternative combinations of constituents.
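The scoring idea can be sketched as follows; the probability table and log-probability scoring below are illustrative assumptions, not the model actually used.

```python
import math

def score(bracketing, p_np_given_head):
    """Sum of log-probabilities over a candidate set of constituents.

    Each constituent is (start, end, head_word); unseen heads get a
    small floor probability.
    """
    return sum(math.log(p_np_given_head.get(head, 0.01))
               for (_, _, head) in bracketing)

def best_bracketing(alternatives, p_np_given_head):
    """Choose the highest-scoring combination of constituents."""
    return max(alternatives, key=lambda b: score(b, p_np_given_head))

# Toy lexical model: how likely each word is to head an NP.
p = {"apple": 0.9, "ate": 0.05, "red": 0.2}
alt1 = [(2, 5, "apple")]                  # [the red apple]
alt2 = [(2, 4, "red"), (4, 5, "apple")]   # [the red] [apple]
print(best_bracketing([alt1, alt2], p))   # [(2, 5, 'apple')]
```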

Error-driven pruning of treebank grammars for base noun phrase identification. Claire Cardie and David Pierce. In Proceedings of the 36th Annual Meeting of the ACL and COLING-98, pages 218-224, 1998. Available as cmp-lg/9808015.

The role of lexicalization and pruning for base noun phrase grammars. Claire Cardie and David Pierce. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), pages 423-430, 1999.

Combining error-driven pruning and classification for partial parsing. Claire Cardie, Scott Mardis, and David Pierce. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), pages 87-96, 1999.

Information Extraction

Information extraction (IE) is the identification of domain-specific structured information in natural language text. We are implementing an IE system, and we hope to explore a new paradigm of IE in which the component learner interacts tightly with a human user, so that the learner helps the human quickly identify appropriate training events.
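One way the learner can help is by ranking candidate events so the human labels only the most informative ones. This is a hedged sketch of that selection step; the confidence scores and function names are hypothetical, not the proposed system's interface.

```python
def select_queries(candidates, confidence, k=2):
    """Pick the k candidates the learner is least confident about,
    so the human annotates where labels help most."""
    return sorted(candidates, key=confidence)[:k]

# Toy confidence scores for candidate extraction events.
scores = {"event_A": 0.95, "event_B": 0.40,
          "event_C": 0.55, "event_D": 0.88}
queries = select_queries(scores, scores.get)
print(queries)  # ['event_B', 'event_C']
```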

Proposal for an interactive environment for information extraction. Claire Cardie and David Pierce. Technical Report TR98-1702, Cornell University Computer Science, September 1998. Available as ncstrl.cornell/TR98-1702.

Question Answering

Question answering is a more fine-grained form of information retrieval (IR). Standard IR retrieves documents based on natural language queries. But users typically want shorter responses. Our question answering system uses the Smart retrieval engine and then attempts to locate chunks of text in the top-ranked documents that specifically answer the query. Currently we can retrieve NPs to answer who, what, when, where, and which questions. Features of the system include some simple semantic type checking between the question and answer and use of text summarization to narrow the search.
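The semantic type check between question and answer can be sketched as below; the type inventory and pre-typed candidates are simplified stand-ins for the system described above, which uses the Smart engine and additional knowledge sources.

```python
# Map question words to the semantic type of the expected answer.
QUESTION_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

def type_check(question_word, candidates):
    """candidates: list of (np_text, semantic_type) pairs.

    Keep candidate NPs whose semantic type matches the type the
    question word asks for; pass everything through for question
    words with no type constraint.
    """
    wanted = QUESTION_TYPE.get(question_word.lower())
    return [np for np, t in candidates if wanted is None or t == wanted]

cands = [("Cornell University", "ORGANIZATION"),
         ("Ithaca", "LOCATION"),
         ("1998", "DATE")]
print(type_check("Where", cands))  # ['Ithaca']
```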

Examining the role of statistical and linguistic knowledge sources in a general-knowledge question-answering system. Claire Cardie, Vincent Ng, David Pierce, and Chris Buckley. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), pages 180-187, 2000.

Unsupervised and Weakly-Supervised Language Learning

We are currently applying new machine learning algorithms within our partial parsing framework to reduce the training data required for bracketer learning. Cotraining is one such algorithm; it leverages the inherent redundancy in language. For example, NPs can often be detected either by their context (e.g. John ate ____) or by their content (e.g. the ____). If separate learners are built to use these separate feature sets, they can bootstrap each other starting with very little data.
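The bootstrap loop can be sketched as follows. This is a deliberately toy version: the two "learners" are memorizing lookup tables over context and content features, whereas real cotraining uses confidence-thresholded classifiers over each view.

```python
def cotrain(labeled, unlabeled, rounds=2):
    """Cotraining sketch over two redundant views.

    labeled: list of ((context_feature, content_feature), label).
    Each view labels the unlabeled instances it recognizes, and the
    other view learns from those newly labeled instances.
    """
    context_model = {ctx: y for (ctx, _), y in labeled}
    content_model = {cnt: y for (_, cnt), y in labeled}
    for _ in range(rounds):
        for ctx, cnt in list(unlabeled):
            if ctx in context_model and cnt not in content_model:
                content_model[cnt] = context_model[ctx]
                unlabeled.remove((ctx, cnt))
            elif cnt in content_model and ctx not in context_model:
                context_model[ctx] = content_model[cnt]
                unlabeled.remove((ctx, cnt))
    return context_model, content_model

# One seed example: "the ___" after "ate" is an NP.
labeled = [(("ate_OBJ", "the_X"), "NP")]
unlabeled = [("ate_OBJ", "a_X"), ("saw_OBJ", "a_X")]
ctx_m, cnt_m = cotrain(labeled, unlabeled)
# The context view labels "a_X", which then lets the content view
# label "saw_OBJ".
print(cnt_m.get("a_X"), ctx_m.get("saw_OBJ"))  # NP NP
```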
David Pierce