
Lillian Lee

Assistant Professor

PhD Harvard, 1997

I am interested in sparse data issues, especially in the realm of natural language processing. Statistical methods for automatically extracting information about language from large samples would have considerable impact in a number of areas, such as information retrieval. However, even huge collections of texts yield highly unreliable estimates of the probability of not-so-uncommon events; this pervasive phenomenon is known as the sparse data problem. My work is focused on finding similarity-based methods for overcoming this problem.

I am pursuing two complementary approaches. First, in joint work with F. Pereira and N. Tishby, I am developing techniques for distributional probabilistic clustering of data (in contrast, most traditional clustering techniques create boolean classes). Our methods are based on information-theoretic principles. A recent paper by Baker and McCallum (to appear in SIGIR '98) describes experiments showing that our distributional clustering framework can work significantly better than latent semantic indexing (LSI) and other algorithms at feature selection for document classification.

Second, I am looking at similar-neighbor algorithms (akin to nearest-neighbor approaches). In recent work with I. Dagan and F. Pereira, I have shown that information-theoretic similarity functions outperform several other commonly used functions, even in a parameter-free setting.
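
As a rough illustration of what an information-theoretic similarity function can look like, the sketch below compares word distributions with the Jensen-Shannon divergence and picks the nearest neighbor. The choice of this particular divergence, the function names, and the toy numbers are all assumptions made for illustration; they are not taken from the work cited above.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def most_similar(target, distributions):
    """Return the word whose distribution is closest (smallest JS divergence) to target's."""
    candidates = (w for w in distributions if w != target)
    return min(
        candidates,
        key=lambda w: js_divergence(distributions[target], distributions[w]),
    )

# Hypothetical conditional distributions P(object | verb); the numbers are made up.
toy = {
    "eat": {"apple": 0.6, "bread": 0.4},
    "devour": {"apple": 0.5, "bread": 0.5},
    "drive": {"car": 0.9, "truck": 0.1},
}
nearest = most_similar("eat", toy)
```

Because the divergence is computed against the midpoint of the two distributions, it stays finite even when one word has never been seen with a given object, which is the situation that sparse data makes common.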

I also have an interest in formal language issues related to natural language processing. Recent work has yielded results on the hardness of context-free grammar parsing.

University Activities

  • Field of Cognitive Studies

  • Engineering Faculty phonathon

Professional Activities

  • Planning Meeting for Johns Hopkins 1998 Workshop on Language Engineering, Oct. 1997.

  • Program Committee: Sixth Workshop on Very Large Corpora, 1998.

  • Reviewer: Computational Linguistics, Machine Learning J.


Publications

  • Similarity-based methods for word sense disambiguation. Proc. 35th Ann. Meeting Assoc. Computational Linguistics (1997), 56-63 (with I. Dagan and F. Pereira).

  • Fast context-free parsing requires fast boolean matrix multiplication. Ibid., 9-15.