Lillian Lee: Research Interests
Now, a few words on looking for things. When you go
looking for something specific, your chances of finding it are very
bad. Because of all the things in the world, you're only looking for
one of them. When you go looking for anything at all, your chances of
finding it are very good. Because of all the things in the world,
you're sure to find some of them.
--The Zero Effect, written and directed by Jake Kasdan
The goal of natural language processing is to enable computers
to use human language as a communication medium accurately, robustly,
and gracefully. My research interests are in the empirical and
theoretical problems that arise in the pursuit of this goal.
Because of the subtleties of human language, high-performance
language-capable systems cannot be developed without access to
high-quality linguistic and domain knowledge. Unfortunately, the
process of gathering and encoding such information by hand is
typically tedious and time-consuming; furthermore, the task often
requires the aid of human experts. These factors give rise to the
knowledge acquisition bottleneck, widely recognized to be one of
the biggest obstacles to building sound, general-purpose natural
language processing systems.
A major focus of my work has been to create
knowledge-lean methods (the term is due to Rebecca Bruce and Ted
Pedersen) for overcoming the knowledge acquisition bottleneck. I have
developed algorithms that allow a system to automatically
acquire linguistic and domain knowledge directly from text samples; in
keeping with the ``knowledge-lean'' idea (and in contrast to
supervised machine learning techniques), I have concentrated
on approaches that work on essentially raw text rather than
human-annotated data.
Applications I have considered range from
automatically constructing thesauruses to finding word
boundaries in streams of Japanese text to creating English versions
of computer-generated mathematical proofs. Some of the data from previous experiments is available.
Since it was proving impossible for me to keep this page up to
date, what is below is now simply a *sampling* of projects I've been
involved in, in the hope that these prove illustrative.
Information Retrieval and Sentiment Analysis (IR papers and sentiment papers)
Jointly with Rie Kubota Ando, I have given a new analysis of, and a
powerful yet intuitive extension to, the well-known technique of latent
semantic indexing. Joint work with Oren Kurland has explored the integration
of clustering techniques with the language-modeling approach to
information retrieval. Bo Pang and I have
been looking into applying text classification techniques to problems
involving the "opinion orientation", or sentiment, of documents.
Segmentation (papers)
In joint work with Rie Kubota Ando,
I have been considering the question of how one can recover word boundaries
from non-word-delimited languages (e.g. Japanese and Chinese) using as
little presegmented data or pre-compiled knowledge sources as
possible. The crucial pre-processing task of recovering these
boundaries is usually accomplished by relying on large hand-crafted
grammars or by training a segmentation algorithm on thousands of
hand-segmented sentences. In contrast, our work employs simple
statistics drawn from raw (unsegmented) data. The experimental
results on Japanese have been striking: the performance of our
algorithm rivals that of state-of-the-art grammar-based segmenters.
We intend to apply our method to Chinese as well; the fact that our
algorithm does not explicitly incorporate language-specific
information should greatly aid its portability to this new domain.
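To make "simple statistics drawn from raw (unsegmented) data" concrete, here is a minimal sketch of one way character n-gram counts can vote on word boundaries. It is an illustration under stated assumptions, not necessarily the algorithm from the papers: the function names, the toy Latin-letter corpus, the bigram order, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count every character n-gram in a list of unsegmented strings."""
    counts = Counter()
    for text in corpus:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def boundary_score(text, k, counts, n):
    """Fraction of comparisons in which an n-gram ending or starting at
    position k is more frequent than an n-gram straddling position k."""
    left = text[k - n:k] if k - n >= 0 else None
    right = text[k:k + n] if k + n <= len(text) else None
    adjacent = [g for g in (left, right) if g is not None]
    straddling = [text[j:j + n] for j in range(k - n + 1, k)
                  if j >= 0 and j + n <= len(text)]
    comparisons = [(a, s) for a in adjacent for s in straddling]
    if not comparisons:
        return 0.0
    wins = sum(counts[a] > counts[s] for a, s in comparisons)
    return wins / len(comparisons)

def segment(text, counts, n=2, threshold=0.5):
    """Place a boundary wherever the score exceeds the (assumed) threshold."""
    pieces, start = [], 0
    for k in range(1, len(text)):
        if boundary_score(text, k, counts, n) > threshold:
            pieces.append(text[start:k])
            start = k
    pieces.append(text[start:])
    return pieces

# Hypothetical raw corpus: no boundaries are marked anywhere.
corpus = ["abcd", "abef", "cdab", "efab", "cdef", "efcd"]
counts = ngram_counts(corpus, 2)
pieces = segment("abcdef", counts)
```

The sketch uses no dictionary, grammar, or hand-segmented training data. A boundary is proposed only where the character sequences on either side are more common in the raw text than the sequences that cross it.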
Distributional Similarity (papers, examples)
Distributional similarity is a powerful and effective tool for
ameliorating the knowledge acquisition bottleneck, with applications
not only in natural language processing but potentially in other areas as
well. The idea is simple and intuitive: two objects can be considered
similar if they tend to appear in the same contexts (for instance, one
can infer that ``coffee'' and ``espresso'' are similar because they
both occur after the words ``drink'' and ``sip''). But despite its
simplicity, the notion of distributional similarity forms the basis of
a very powerful principle for bootstrapping the knowledge
acquisition process: information about an object can be gleaned from
the objects that are similar to it, where similarity can be computed
from unannotated data alone.
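The "coffee"/"espresso" intuition above can be sketched in a few lines. This is only an illustration: the cosine measure is an assumed stand-in (the text does not commit to a particular similarity function), and the observation list and function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def context_vectors(pairs):
    """Map each target word to a Counter of the context words seen with it."""
    vectors = {}
    for context, target in pairs:
        vectors.setdefault(target, Counter())[context] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical (preceding word, target word) observations from raw text.
pairs = [
    ("drink", "coffee"), ("sip", "coffee"),
    ("drink", "espresso"), ("sip", "espresso"),
    ("drive", "car"), ("park", "car"),
]
vectors = context_vectors(pairs)
sim_close = cosine(vectors["coffee"], vectors["espresso"])  # shared contexts
sim_far = cosine(vectors["coffee"], vectors["car"])         # no shared contexts
```

Nothing here requires annotation. The only input is which words were observed next to which, so the same computation applies to any raw corpus.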
In joint work, we have studied two complementary, highly effective paradigms ---
distributional clustering and
nearest neighbors --- for
incorporating distributional similarity information. Our experiments
show that similarity-based estimation yields significant reductions in
speech-recognition error rates, and outperforms standard techniques by
over 40% in providing statistical estimates in sparse data
situations. An interesting direction of further work is to apply these
techniques to more semantically-oriented problems, such as the
automatic construction of thesauri, which are a valuable resource in
many language processing tasks.
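As a rough sketch of the nearest-neighbors paradigm in a sparse-data setting, the code below estimates the probability of an unseen word pair by borrowing evidence from the most similar words. The weighting scheme, the Jaccard similarity, the value of k, and the toy counts are assumptions made for illustration and should not be read as the exact estimator used in the experiments.

```python
from collections import Counter

def conditional(counts, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from co-occurrence counts."""
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

def jaccard(counts, a, b):
    """Assumed similarity: overlap between the sets of words following a and b."""
    sa, sb = set(counts[a]), set(counts[b])
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def similarity_based_estimate(counts, w1, w2, k=2):
    """Similarity-weighted average of P(w2 | w) over the k nearest neighbors of w1."""
    neighbors = sorted((w for w in counts if w != w1),
                       key=lambda w: jaccard(counts, w1, w),
                       reverse=True)[:k]
    weights = {w: jaccard(counts, w1, w) for w in neighbors}
    norm = sum(weights.values())
    if norm == 0:
        return 0.0
    return sum(weights[w] * conditional(counts, w, w2) for w in neighbors) / norm

# Hypothetical counts: counts[w1][w2] = times w2 was seen right after w1.
counts = {
    "sip": Counter({"coffee": 2, "tea": 2}),
    "drink": Counter({"coffee": 3, "tea": 2, "espresso": 1}),
    "park": Counter({"car": 4}),
}
direct = conditional(counts, "sip", "espresso")                  # pair never seen
smoothed = similarity_based_estimate(counts, "sip", "espresso")  # borrowed evidence
```

The pair ("sip", "espresso") never occurs in the toy data, so the direct estimate is zero. The similarity-based estimate is nonzero because "drink", the nearest neighbor of "sip", has been seen with "espresso".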
Data-driven Methods for Natural-Language Generation (papers)
Recently, Regina Barzilay and I have been applying corpus-based
techniques to tasks in generation. We have applied multiple-sequence
alignment techniques to the problem of learning how to create
natural-language versions of computer-generated data and to paraphrase
extraction from comparable corpora. We have also studied the use of
techniques based on hidden Markov models for sentence ordering and summarization.
I am also interested in more formal issues in natural language
processing. I have established a fundamental relationship between
context-free grammar (CFG) parsing (which uncovers the syntactic
structure of sentences) and Boolean matrix multiplication. This
result shows why there has been so little success in developing
practical general CFG parsers with sub-cubic running times, and
furthermore, naturally complements the converse relationship
discovered by L. Valiant in 1975.
Papers on these topics.
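For readers unfamiliar with the cubic bound mentioned above, here is a minimal sketch of the textbook CYK recognizer for a grammar in Chomsky normal form. It is not the reduction between parsing and Boolean matrix multiplication itself. It only shows where the cubic cost comes from: the three nested loops over span width, start position, and split point. The toy grammar and all names are hypothetical.

```python
def cyk_recognize(words, lexical, binary, start="S"):
    """Return True if the grammar derives the word sequence.

    lexical maps a word to the nonterminals that rewrite to it;
    binary maps a pair (B, C) to the nonterminals A with a rule A -> B C.
    """
    n = len(words)
    # chart[i][j] holds the nonterminals that derive words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical.get(word, ()))
    for width in range(2, n + 1):            # span length
        for i in range(n - width + 1):       # span start
            j = i + width
            for k in range(i + 1, j):        # split point
                for (left, right), parents in binary.items():
                    if left in chart[i][k] and right in chart[k][j]:
                        chart[i][j].update(parents)
    return n > 0 and start in chart[0][n]

# Hypothetical toy grammar in Chomsky normal form.
lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

accepted = cyk_recognize("she eats fish".split(), lexical, binary)
rejected = cyk_recognize("eats she fish".split(), lexical, binary)
```

Filling the cell for each span by combining the cells of its two halves resembles computing one entry of a Boolean matrix product. That resemblance gives some intuition for why the two problems are connected.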
Lillian Lee's home page
Cornell Natural Language Processing Group