PhD Univ. of Mass., Amherst, 1994
My research focuses on developing corpus-based techniques for
understanding and extracting information from natural language texts. We are investigating
the use of machine learning techniques as tools for guiding natural language system
development and for exploring the mechanisms that underlie language acquisition. Our work
encompasses three related areas: (1) machine learning of natural language, (2) the use of
corpus-based NLP techniques to aid information retrieval systems, and (3) the design of
user-trainable systems that can efficiently and reliably extract the important information
from a document.
In the Kenmore Project, we are developing techniques
to automate the knowledge acquisition tasks involved in building any NLP system.
We want to determine the conditions under which machine learning techniques can be
expected to offer a cost-effective approach to knowledge acquisition. Very generally,
Kenmore acquires linguistic knowledge using a combination of symbolic machine learning
techniques and robust sentence analysis. It has been used with corpora from two real-world
domains to perform part-of-speech tagging, semantic feature tagging, and concept
activation, and to find the antecedents of relative pronouns. We are extending Kenmore to
handle larger text corpora and additional disambiguation tasks, including pronoun
resolution, syntactic parsing, and gap-filling phenomena.
In work with PhD student D. Pierce, we developed a
fast, accurate algorithm for identifying base noun phrases. We are also working with SaBIR
Research and Cornell's Information Retrieval group to develop a unified approach to
improving the end-user efficiency of state-of-the-art text retrieval systems. Our
underlying technology is a new combination of statistical and linguistic approaches to
text analysis in which a trainable, high-precision, partial parser is used to recognize
linguistic relationships that are most important for the larger IR system. We are applying
the approach to three distinct IR tasks: near-duplicate document detection, high-precision
text retrieval, and query-dependent text summarization.
NSF Career Award (1996-2000)
Selection Committee: Engineering College Teaching
Awards, Cognitive Studies Summer Fellowships, Cognitive Studies Dissertation Fellowships,
Cognitive Studies Entering Fellowships
Program chair: Second Conf. on Empirical Methods
in Natural Language Processing, Aug. 1997
Editor: Special Issue of Machine Learning J. on
Natural Language Learning; Proc. Second Conf. Empirical Methods in Natural Language Processing
Program committees: Fifteenth Nat. Conf.
Artificial Intelligence; 36th Ann. Meeting Assn. for Computational Linguistics; Fifteenth
Int. Conf. Machine Learning; 17th Int. Conf. Computational Linguistics (COLING); Sixth
Workshop on Very Large Corpora; AAAI-98 Workshop on Case-Based Reasoning Integrations;
Fourth Int. Colloq. Grammatical Inference; Third Int. Conf. on New Methods in Language
Processing; Int. Conf. on Computational Natural Language Learning
Executive Board: SIGDAT, Special Interest Group of
ACL for Linguistic Data and Corpus-based approaches to NLP
Reviewer: J. Artificial Intelligence Research
Empire and SMART working together. TIPSTER
Text-Processing Initiative 18-Month Workshop, May 1998.
A corpus-based method for finding base noun
phrases. Johns Hopkins, April 1998.
Combining multiple knowledge sources for NLP via
symbolic machine learning techniques. Interfacing Models of Language Workshop, NIPS, Dec.
Case-based learning of natural language. Univ.
Rochester, Nov. 1997.
Using Empire and SMART for high-precision IR and
summarization. TIPSTER Text-Processing Initiative 12-Month Workshop, Oct. 1997.
Improving minority class prediction using
case-specific feature weights. Univ. of Utah, Oct. 1997.
Knowledge acquisition for natural language
understanding using machine learning techniques. Harvard, April 1997.
Improving end-user efficiency in a
TIPSTER-compliant IR system. TIPSTER Text-Processing Initiative 6-Month Meeting, April
Error-driven pruning of treebank grammars for base
noun phrase identification. Proc. Ann. Conf. Assoc. Computational Linguistics and
COLING-98, Assoc. for Computational Linguistics (1998) (with D. Pierce).
Using clustering and superconcepts within SMART:
TREC 6. Proc. Sixth Text Retrieval Conf. (TREC-6), Morgan Kaufmann, 1998 (with C.
Buckley, M. Mitra, and J. Waltz).
Empirical methods in information extraction. AI
Magazine 18, 4 (1997), 65-79.
Improving minority class prediction using
case-specific feature weights. Proc. Fourteenth Int. Conf. Machine Learning, D.
Fisher (editor), Morgan Kaufmann (1997), 57-65 (with N. Howe).
Examining locally varying weights for nearest
neighbor algorithms. Case-Based Reasoning Research and Development: Second Inter. Conf.
Case-Based Reasoning. D. Leake and E. Plaza (eds.), Lecture Notes in Artificial
Intelligence, Springer-Verlag, (1997), 455-466 (with N. Howe).
An analysis of statistical and syntactic phrases. 5th
RIAO Conf.: Computer-Assisted Information Searching On the Internet, 1997 (with M.
Mitra, C. Buckley, and A. Singhal).