Measures of Distributional Similarity

The goal is to estimate the co-occurrence probability of a word pair: for example, given that the word "company" occurs, the probability that a certain verb occurs with it. The difficulty is that most possible word pairs never appear in the corpus, so their estimated probabilities are zero. Several methods have been proposed to make these estimates nonzero. One is to use the noun closest to the given noun, which requires a way to measure how close two nouns are. The nouns' probability distributions are often compared to see how similar the two nouns are. The ideal measure is the KL divergence, but it can be undefined when the probability of a certain pair is zero. This research tries to solve that problem.
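
The zero-probability problem can be illustrated with a minimal sketch. It is not the project's method. It assumes each noun is represented by a conditional distribution over verbs, and the function name kl_divergence, the example nouns, and all probability values are hypothetical. The sketch reports the undefined case as infinity, which is one common convention.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over verbs of p(v) * log(p(v) / q(v))."""
    total = 0.0
    for verb, p_v in p.items():
        if p_v == 0.0:
            continue  # terms with p(v) = 0 contribute 0 by convention
        q_v = q.get(verb, 0.0)
        if q_v == 0.0:
            # p gives mass to a verb that q never saw: log(p/0) is not finite
            return math.inf
        total += p_v * math.log(p_v / q_v)
    return total

# Hypothetical verb distributions for three nouns (illustrative values only)
p_company = {"hire": 0.5, "merge": 0.5}
q_firm = {"hire": 0.5, "merge": 0.25, "sue": 0.25}
q_sparse = {"hire": 1.0}  # "merge" was never observed with this noun

d_ok = kl_divergence(p_company, q_firm)     # finite: q_firm covers every verb in p_company
d_bad = kl_divergence(p_company, q_sparse)  # infinite: q_sparse assigns zero to "merge"
```

Because unseen pairs are so common in a corpus, the second case is the typical one, which is why the raw KL divergence is hard to use directly as a similarity measure.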

Project Advisor: Lillian Lee