PhD Harvard, 1997
I am interested in sparse data issues, especially in the realm of
natural language processing. Statistical methods for automatically extracting information
about language from large samples would have considerable impact in a number of areas,
such as information retrieval. However, even huge collections of texts yield highly
unreliable estimates of the probability of not-so-uncommon events; this pervasive
phenomenon is known as the sparse data problem. My work is focused on finding
similarity-based methods for overcoming this problem.
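To make the sparse data problem concrete, here is a minimal sketch in Python. It is a toy illustration only, not code from the work described here: the corpus is invented and the helper name `mle_bigram_prob` is hypothetical. It computes maximum-likelihood estimates of bigram probabilities from raw counts and shows that a perfectly plausible word pair that happens not to occur in the sample is assigned probability zero.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "the cat eats fish . the dog eats meat . the cat drinks milk .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from raw counts."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

seen = mle_bigram_prob("cat", "eats")      # pair observed in the sample
unseen = mle_bigram_prob("dog", "drinks")  # plausible pair, never observed
```

The estimate for the unobserved pair is zero no matter how reasonable the pair is, which is the kind of unreliable estimate that similarity-based methods aim to repair.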
I am pursuing two complementary approaches. First, in joint
work with F. Pereira and N. Tishby, I am developing techniques for distributional
probabilistic clustering of data, in which each item belongs to each cluster with some
probability (most traditional clustering techniques create Boolean, all-or-nothing
classes). Our methods are based on information-theoretic principles. A recent paper by
Baker and McCallum (to appear in SIGIR '98) describes experiments showing that our
distributional clustering framework can work significantly better than latent semantic
indexing (LSI) and other algorithms at feature selection for document classification.
I am also looking at similar-neighbor algorithms (akin to
nearest-neighbor approaches). In recent work with I. Dagan and F. Pereira, I have shown
that information-theoretic similarity functions outperform several other
commonly used functions, even in a parameter-free setting.
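The general idea of a similar-neighbor estimate can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method from the paper: it assumes the Jensen-Shannon divergence as the information-theoretic similarity function, uses a simple unweighted average over the `k` closest words (`k` is a hypothetical parameter), and runs on made-up distributions. The function names are likewise invented for this example.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def similarity_based_estimate(word, cond_probs, k=2):
    """Average the conditional distributions of the k words closest to `word`."""
    others = [w for w in cond_probs if w != word]
    neighbors = sorted(
        others, key=lambda w: js_divergence(cond_probs[word], cond_probs[w])
    )[:k]
    estimate = {}
    for w in neighbors:
        for x, px in cond_probs[w].items():
            estimate[x] = estimate.get(x, 0.0) + px / len(neighbors)
    return neighbors, estimate

# Made-up conditional distributions P(object | verb), for illustration only.
cond_probs = {
    "eat":    {"fish": 0.5, "meat": 0.5},
    "devour": {"fish": 0.4, "meat": 0.4, "bread": 0.2},
    "drink":  {"milk": 0.7, "water": 0.3},
}
neighbors, estimate = similarity_based_estimate("eat", cond_probs, k=1)
```

In this toy data "bread" never occurs with "eat", yet the estimate gives it nonzero probability because the most similar word, "devour", has been seen with it.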
I also have an interest in formal language issues related
to natural language processing. Recent work has yielded a hardness result for
context-free grammar parsing: fast parsing requires fast Boolean matrix multiplication.
Planning Meeting for Johns Hopkins 1998 Workshop
on Language Engineering, Oct. 1997.
Program Committee: Sixth Workshop on Very Large
Reviewer: Computational Linguistics, Machine
Similarity-based methods for word sense
disambiguation. Proc. 35th Ann. Meeting Assoc. Computational Linguistics (1997),
56-63 (with I. Dagan and F. Pereira).
Fast context-free parsing requires fast boolean
matrix multiplication. Ibid., 9-15.