Lillian Lee: Research Interests

Now, a few words on looking for things. When you go looking for something specific, your chances of finding it are very bad. Because of all the things in the world, you're only looking for one of them. When you go looking for anything at all, your chances of finding it are very good. Because of all the things in the world, you're sure to find some of them.
--The Zero Effect, written and directed by Jake Kasdan

The goal of natural language processing is to enable computers to use human language as a communication medium accurately, robustly, and gracefully. My research interests are in the empirical and theoretical problems that arise in the pursuit of this goal.

Because of the subtleties of human language, high-performance language-capable systems cannot be developed without access to high-quality linguistic and domain knowledge. Unfortunately, the process of gathering and encoding such information by hand is typically tedious and time-consuming; furthermore, the task often requires the aid of human experts. These factors give rise to the knowledge acquisition bottleneck, widely recognized to be one of the biggest obstacles to building sound, general-purpose natural language processing systems.

A major focus of my work has been to create knowledge-lean methods (the term is due to Rebecca Bruce and Ted Pedersen) for overcoming the knowledge acquisition bottleneck. I have developed algorithms that allow a system to automatically acquire linguistic and domain knowledge directly from text samples; in keeping with the ``knowledge-lean'' idea (and in contrast to supervised machine learning techniques), I have concentrated on approaches that work on essentially raw text rather than human-annotated data.

Applications I have considered range from automatically constructing thesauruses to finding word boundaries in streams of Japanese text to creating English versions of computer-generated mathematical proofs. Some of the data from previous experiments is available.


Since it was proving impossible for me to keep this page up to date, what is below is now simply a *sampling* of projects I've been involved in, in the hope that these prove illustrative.

Information Retrieval and Sentiment Analysis (IR papers and sentiment papers)

Jointly with Rie Kubota Ando, I have given a new analysis of, and a powerful yet intuitive extension to, the well-known technique of latent semantic indexing. Joint work with Oren Kurland has explored the integration of clustering techniques with the language-modeling approach to information retrieval. Bo Pang and I have been looking into applying text classification techniques to problems involving the "opinion orientation", or sentiment, of documents.

Segmentation (papers)

In joint work with Rie Kubota Ando, I've been considering the question of how one can recover word boundaries in non-word-delimited languages (e.g., Japanese and Chinese) using as little presegmented data and as few pre-compiled knowledge sources as possible. The crucial pre-processing task of recovering these boundaries is usually accomplished by relying on large hand-crafted grammars or by training a segmentation algorithm on thousands of hand-segmented sentences. In contrast, our work employs simple statistics drawn from raw (unsegmented) data. The experimental results on Japanese have been striking: the performance of our algorithm rivals that of state-of-the-art grammar-based segmenters. We intend to apply our method to Chinese as well; the fact that our algorithm does not explicitly incorporate language-specific information should greatly aid its portability to this new domain.
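
As a rough illustration of how "simple statistics drawn from raw data" can suggest word boundaries, here is a toy sketch. It is a simplified heuristic and not the published algorithm: the function names, the choice of character bigrams (n=2), and the 0.5 threshold are all assumptions made for this example. The idea is that a gap between two characters looks like a boundary when the n-grams ending or starting at the gap are more frequent in the raw text than the n-grams that straddle it.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every character n-gram in an unsegmented string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def boundary_scores(text, counts, n=2):
    """Score each gap between adjacent characters with a value in [0, 1].

    The score for the gap before position k is the fraction of comparisons
    in which an n-gram adjacent to the gap (ending or starting there) is
    strictly more frequent than an n-gram straddling the gap.
    """
    scores = []
    for k in range(1, len(text)):
        adjacent = []
        if k - n >= 0:
            adjacent.append(text[k - n:k])      # n-gram ending at the gap
        if k + n <= len(text):
            adjacent.append(text[k:k + n])      # n-gram starting at the gap
        straddling = [
            text[s:s + n]
            for s in range(k - n + 1, k)
            if s >= 0 and s + n <= len(text)
        ]
        pairs = [(a, b) for a in adjacent for b in straddling]
        if not pairs:
            scores.append(0.0)                  # not enough context to vote
            continue
        wins = sum(counts[a] > counts[b] for a, b in pairs)
        scores.append(wins / len(pairs))
    return scores


def segment(text, counts, n=2, threshold=0.5):
    """Cut the text at every gap whose score exceeds the (assumed) threshold."""
    pieces, start = [], 0
    for k, score in enumerate(boundary_scores(text, counts, n), start=1):
        if score > threshold:
            pieces.append(text[start:k])
            start = k
    pieces.append(text[start:])
    return pieces


# Toy "raw corpus": English with the spaces removed stands in for
# unsegmented text. No segmented training data is used anywhere.
RAW = "thecatsatthedogsattheratsat"
COUNTS = ngram_counts(RAW, 2)
PIECES = segment("thecatsat", COUNTS)
```

On a corpus this small the proposed cuts are not meaningful; the point is only that the procedure consults nothing beyond counts gathered from unsegmented text.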

Distributional Similarity (papers, examples)

Distributional similarity is a powerful and effective tool for ameliorating the knowledge acquisition bottleneck, with applications not only in natural language processing but potentially other areas as well. The idea is simple and intuitive: two objects can be considered similar if they tend to appear in the same contexts (for instance, one can infer that ``coffee'' and ``espresso'' are similar because they both occur after the words ``drink'' and ``sip''). But despite its simplicity, the notion of distributional similarity forms the basis of a very powerful principle for bootstrapping the knowledge acquisition process: information about an object can be gleaned from the objects that are similar to it, where similarity can be computed from unannotated data alone.
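
The coffee/espresso intuition can be made concrete with a minimal sketch. Everything here is an assumption chosen for illustration: the toy corpus, the one-word context window, and the use of cosine similarity, which is just one of many possible similarity functions and not necessarily the one used in the papers.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=1):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Unannotated toy text: no labels, only raw sentences.
CORPUS = [
    "we drink coffee daily",
    "we drink espresso daily",
    "they sip coffee slowly",
    "they sip espresso slowly",
    "we read books daily",
]

vectors = context_vectors(CORPUS)
sim_coffee_espresso = cosine(vectors["coffee"], vectors["espresso"])
sim_coffee_books = cosine(vectors["coffee"], vectors["books"])
```

In this toy corpus "coffee" and "espresso" occur in identical contexts, so they score as more similar to each other than "coffee" is to "books", which shares only the context word "daily".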

In joint work, we have studied two complementary, highly effective paradigms --- distributional clustering and nearest neighbors --- for incorporating distributional similarity information. Our experiments show that similarity-based estimation yields significant reductions in speech-recognition error rates, and outperforms standard techniques by over 40% in providing statistical estimates in sparse data situations. An interesting direction of further work is to apply these techniques to more semantically-oriented problems, such as the automatic construction of thesauri, which are a valuable resource in many language processing tasks.

Data-driven Methods for Natural-Language Generation (papers)

Recently, Regina Barzilay and I have been working on applying corpus-based techniques to tasks in generation. We have applied multiple-sequence alignment techniques to the problem of learning how to create natural-language versions of computer-generated data and to paraphrase extraction from comparable corpora. We've also studied using techniques based on hidden Markov models for sentence ordering and summarization.

Parsing (papers)

I am also interested in more formal issues in natural language processing. I have established a fundamental relationship between context-free grammar (CFG) parsing (which uncovers the syntactic structure of sentences) and Boolean matrix multiplication. This result shows why there has been so little success in developing practical general CFG parsers with sub-cubic running times, and furthermore, naturally complements the converse relationship discovered by L. Valiant in 1975.
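
For readers unfamiliar with where the cubic bound comes from, the sketch below shows the classic CYK recognizer for a grammar in Chomsky normal form. It is the textbook cubic-time baseline, not the reduction from the paper. The grammar encoding (a `lexical` dictionary from words to nonterminals and a `binary` dictionary from nonterminal pairs to nonterminals) and the tiny grammar itself are assumptions made for this example.

```python
def cyk_recognize(tokens, lexical, binary, start="S"):
    """Return True if `tokens` is derivable from `start` under a CNF grammar.

    lexical: dict mapping a word to the set of nonterminals A with A -> word
    binary:  dict mapping a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive tokens[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, token in enumerate(tokens):
        table[i][i + 1] = set(lexical.get(token, ()))
    # Three nested loops over span length, start position, and split point
    # give the O(n^3) dependence on sentence length.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= binary.get((left, right), set())
    return start in table[0][n]


# Hypothetical toy grammar: S -> NP VP, VP -> V NP, plus three lexical rules.
LEXICAL = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

accepted = cyk_recognize("she eats fish".split(), LEXICAL, BINARY)
rejected = cyk_recognize("eats she fish".split(), LEXICAL, BINARY)
```

The innermost step, which combines entries `table[i][k]` and `table[k][j]` over all split points `k`, has the same shape as a matrix product. That resemblance is the intuition behind relating parsing to Boolean matrix multiplication.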

Papers on these topics.