Similarity-Based Estimation of Word Cooccurrence Probabilities.
Ido Dagan, Fernando Pereira, and Lillian Lee
Proceedings of the 32nd ACL, pp. 272--278, 1994.

Abstract: In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations ``eat a peach'' and ``eat a beach'' is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on ``most similar'' words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
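
The core idea in the abstract, estimating an unseen bigram from the behaviour of words similar to its first word, can be illustrated with a minimal sketch. This is not the paper's actual model. The function names, the toy bigrams, the neighbourhood size k, and the use of cosine similarity as the similarity measure are all assumptions made for illustration, since the abstract does not specify them. The integration with Katz's back-off model is omitted.

```python
from collections import Counter, defaultdict
import math


def conditional_probs(bigrams):
    """Maximum-likelihood P(w2 | w1) from a list of (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in bigrams:
        counts[w1][w2] += 1
    return {
        w1: {w2: c / sum(ctr.values()) for w2, c in ctr.items()}
        for w1, ctr in counts.items()
    }


def cosine(p, q):
    """Cosine similarity of two sparse distributions (an assumed stand-in measure)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def similarity_estimate(w1, w2, probs, k=2):
    """Similarity-weighted average of P(w2 | w1') over the k words most similar to w1."""
    own = probs.get(w1, {})
    sims = {w: cosine(own, dist) for w, dist in probs.items() if w != w1}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    total = sum(sims[w] for w in neighbours)
    if total == 0:
        return 0.0
    return sum(sims[w] * probs[w].get(w2, 0.0) for w in neighbours) / total


# Hypothetical toy data: neither ("eat", "peach") nor ("eat", "beach") is observed.
toy_bigrams = [
    ("eat", "apple"), ("eat", "bread"),
    ("devour", "apple"), ("devour", "bread"), ("devour", "peach"),
    ("visit", "beach"), ("visit", "museum"),
]
probs = conditional_probs(toy_bigrams)
p_peach = similarity_estimate("eat", "peach", probs)
p_beach = similarity_estimate("eat", "beach", probs)
print(p_peach > p_beach)
```

In this toy setting "devour" shares its observed successors with "eat" while "visit" does not, so the unseen pair ("eat", "peach") receives a higher estimate than ("eat", "beach").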

BibTeX entry:

@inproceedings{Dagan+Pereira+Lee:94a,
  author =    {Ido Dagan and Fernando Pereira and Lillian Lee},
  title =     {Similarity-based estimation of word cooccurrence
               probabilities},
  booktitle = {32nd Annual Meeting of the ACL},
  pages =     {272--278},
  year =      1994
}

