Similarity-Based Approaches to Natural Language Processing.
Lillian Lee
Ph.D. thesis.
Harvard University Technical Report TR-11-97.

Abstract: This thesis presents two similarity-based approaches to sparse data problems. The first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. The second is a nearest-neighbor approach: instead of calculating a centroid for each class, as in the hierarchical clustering approach, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that, as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that estimation techniques based on the nearest-neighbor model achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, as well as a statistically significant reduction in speech recognition error rate.

BibTeX entry:

@PhdThesis{Lee:thesis,
  author =	 {Lillian Lee},
  title =	 {Similarity-Based Approaches to Natural Language
                  Processing},
  school =	 {Harvard University},
  year =	 1997,
  address =	 {Cambridge, MA},
  annote =	 {Harvard University Technical Report TR-11-97.}
}

