Claire Cardie

Assistant Professor

PhD Univ. of Mass., Amherst, 1994

My research focuses on developing corpus-based techniques for understanding and extracting information from natural language texts. We are investigating the use of machine learning techniques as tools for guiding natural language system development and for exploring the mechanisms that underlie language acquisition. Our work encompasses three related areas: (1) machine learning of natural language, (2) the use of corpus-based NLP techniques to aid information retrieval systems, and (3) the design of user-trainable systems that can efficiently and reliably extract the important information from a document.

In the Kenmore Project, we are developing techniques to automate the knowledge acquisition tasks involved in building any NLP system. We want to determine the conditions under which machine learning techniques can be expected to offer a cost-effective approach to knowledge acquisition. Very generally, Kenmore acquires linguistic knowledge using a combination of symbolic machine learning techniques and robust sentence analysis. It has been used with corpora from two real-world domains to perform part-of-speech tagging, semantic feature tagging, and concept activation, and to find the antecedents of relative pronouns. We are extending Kenmore to handle larger text corpora and additional disambiguation tasks, including pronoun resolution, syntactic parsing, and gap-filling phenomena.

In work with PhD student D. Pierce, we developed a fast, accurate algorithm for identifying base noun phrases. We are also working with SaBIR Research and Cornell's Information Retrieval group to develop a unified approach to improving the end-user efficiency of state-of-the-art text retrieval systems. Our underlying technology is a new combination of statistical and linguistic approaches to text analysis in which a trainable, high-precision, partial parser is used to recognize linguistic relationships that are most important for the larger IR system. We are applying the approach to three distinct IR tasks: near-duplicate document detection, high-precision text retrieval, and query-dependent text summarization.


Awards

NSF CAREER Award (1996-2000)

University Activities

Selection Committee: Engineering College Teaching Awards, Cognitive Studies Summer Fellowships, Cognitive Studies Dissertation Fellowships, Cognitive Studies Entering Fellowships

Professional Activities

  • Program chair: Second Conf. on Empirical Methods in Natural Language Processing, Aug. 1997

  • Editor: Special Issue of Machine Learning J. on Natural Language Learning; Proc. Second Conf. Empirical Methods in Natural Language Processing

  • Program committees: Fifteenth Nat. Conf. Artificial Intelligence; 36th Ann. Meeting Assn. for Computational Linguistics; Fifteenth Int. Conf. Machine Learning; 17th Int. Conf. Computational Linguistics (COLING); Sixth Workshop on Very Large Corpora; AAAI-98 Workshop on Case-Based Reasoning Integrations; Fourth Int. Colloq. Grammatical Inference; Third Int. Conf. on New Methods in Language Processing; Int. Conf. on Computational Natural Language Learning

  • Executive Board: SIGDAT, Special Interest Group of ACL for Linguistic Data and Corpus-based approaches to NLP

  • Reviewer: J. Artificial Intelligence Research


Lectures

  • Empire and SMART working together. TIPSTER Text-Processing Initiative 18-Month Workshop, May 1998.

  • A corpus-based method for finding base noun phrases. Johns Hopkins, April 1998.

  • Combining multiple knowledge sources for NLP via symbolic machine learning techniques. Interfacing Models of Language Workshop, NIPS, Dec. 1997.

  • Case-based learning of natural language. Univ. Rochester, Nov. 1997.

  • Using Empire and SMART for high-precision IR and summarization. TIPSTER Text-Processing Initiative 12-Month Workshop, Oct. 1997.

  • Improving minority class prediction using case-specific feature weights. Univ. of Utah, Oct. 1997.

  • Knowledge acquisition for natural language understanding using machine learning techniques. Harvard, April 1997.

  • Improving end-user efficiency in a TIPSTER-compliant IR system. TIPSTER Text-Processing Initiative 6-Month Meeting, April 1997.


Publications

  • Error-driven pruning of treebank grammars for base noun phrase identification. Proc. Ann. Conf. Assoc. Computational Linguistics and COLING-98, Assoc. for Computational Linguistics (1998) (with D. Pierce).

  • Using clustering and superconcepts within SMART: TREC 6. Proc. Sixth Text Retrieval Conf. (TREC-6), Morgan Kaufmann, 1998 (with C. Buckley, M. Mitra, and J. Walz).

  • Empirical methods in information extraction. AI Magazine 18, 4 (1997), 65-79.

  • Improving minority class prediction using case-specific feature weights. Proc. Fourteenth Int. Conf. Machine Learning, D. Fisher (editor), Morgan Kaufmann (1997), 57-65 (with N. Howe).

  • Examining locally varying weights for nearest neighbor algorithms. Case-Based Reasoning Research and Development: Second Int. Conf. Case-Based Reasoning. D. Leake and E. Plaza (eds.), Lecture Notes in Artificial Intelligence, Springer-Verlag (1997), 455-466 (with N. Howe).

  • An analysis of statistical and syntactic phrases. 5th RIAO Conf.: Computer-Assisted Information Searching on the Internet, 1997 (with M. Mitra, C. Buckley, and A. Singhal).