The research in the Natural Language Processing (NLP) group at Cornell focuses on developing corpus-based techniques for understanding and extracting information from natural language texts. In particular, my group investigates the use of machine learning techniques as tools for guiding natural language system development and for exploring the mechanisms that underlie language acquisition. Our work encompasses three related areas: (1) machine learning of natural language, (2) the use of corpus-based NLP techniques to aid information retrieval (IR) systems, and (3) the design of user-trainable NLP systems that can efficiently and reliably extract the important information from a document.
In the past year or so we have made progress on both the natural language processing and machine learning aspects of our research. First, we have developed a new approach to partial parsing of natural language texts that relies on machine learning methods. The approach combines corpus-based grammar induction with a very simple pattern-matching algorithm and an optional constituent verification step. In evaluations on a number of large-scale partial parsing tasks involving on-line text, the approach produces partial parsers that are both fast and accurate. In addition, we have developed a new approach to the problem of noun phrase coreference resolution that combines supervised and unsupervised learning techniques. The method differs from existing techniques in that it views coreference resolution as a clustering task.
The techniques described above can, in turn, be used to support a number of practical, end-to-end text-processing tasks. In particular, we are using the corpus-based partial-parsing techniques as the primary linguistic component in a new system for general-knowledge question answering. The system combines techniques for standard information retrieval, query-dependent text summarization, and shallow syntactic and semantic sentence analysis. In a series of experiments, we examined the role of each statistical and linguistic knowledge source in the question-answering system and found that even very weak linguistic knowledge can offer substantial improvements over purely IR-based techniques for question answering, especially when smoothly integrated with statistical preferences computed by the IR subsystems.
NSF Faculty Early Career Development (CAREER) Award (1996–2000).
Selection Committee: Cognitive Studies Summer Fellowships; Cognitive Studies Continuing Fellowships; Cognitive Studies Incoming Fellowship.
Member: Independent Major Advisory Board; Cognitive Studies Steering Committee; Engineering College Lecture Series Committee; Engineering College Fall Faculty Reception Committee.
Secretary: North American Chapter of the Association for Computational Linguistics, 2000–2001; and SIGNLL, Association for Computational Linguistics Special Interest Group on Natural Language Learning, 1999–2001.
Editorial Board: Machine Learning, 1999–2001.
Program committees: 17th International Conference on Machine Learning (area chair); 1st Meeting of the North American Chapter of the Association for Computational Linguistics.
Executive Board: SIGDAT, Special Interest Group of ACL for Linguistic Data and Corpus-based approaches to NLP.
NSF Review Panel: Knowledge and Cognitive Systems, 2000.
Panelist: 2000 North American Chapter of the Association for Computational Linguistics Student Research Workshop, a mentoring workshop for Ph.D. students in Natural Language Processing.
New Directions in Machine Learning for Information Extraction, Invited talk. Workshop on Machine Learning for Information Extraction, 16th National Conference on Artificial Intelligence, July 1999.
Noun Phrase Coreference for Information Extraction. Purdue Univ., West Lafayette, IN, October 1999; Univ. of Montreal, Montreal, Quebec, March 2000.
Noun Phrase Coreference for Information Extraction and Text Summarization, Washington Univ., St. Louis, MO, November 1999.
A Clustering Approach to Noun Phrase Coreference, Williams College, Williamstown, MA, December 1999.
Rapidly Portable Translingual Information Extraction and Interactive Multidocument Summarization, DARPA TIDES Kickoff Meeting, Santa Monica, CA, March 2000.
“The Role of Lexicalization and Pruning for Base Noun Phrase Grammars.” Proceedings of the 16th National Conference on Artificial Intelligence, AAAI Press/MIT Press (1999), 423–430 (with D. Pierce).
“The Smart/Empire TIPSTER IR System.” TIPSTER Phase III Proceedings, Morgan Kaufmann (1999), 107–121 (with C. Buckley, S. Mardis, M. Mitra, D. Pierce, K. Wagstaff, and J. Walz).
“Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System.” Proceedings of the 6th Applied Natural Language Processing Conference (ANLP-2000), Association for Computational Linguistics/Morgan Kaufmann (2000), 180–187 (with V. Ng, D. Pierce, and C. Buckley).
“Using Clustering and SuperConcepts within SMART: TREC 6.” Information Processing and Management, 36(1) (2000), 109–131 (with C. Buckley, M. Mitra, and J. Walz).
“Towards Translingual Information Access Using Portable Information Extraction.” Proceedings of the ANLP/NAACL Workshop on Embedded Machine Translation Systems (2000), 31–37 (with M. White, C. Han, N. Kim, B. Lavoie, M. Palmer, O. Rambow, and J. Yoon).