Datasets from Some Distributional Similarity Experiments

This page, http://www.cs.cornell.edu/home/llee/data/sim.html, contains the data first introduced in Dagan, Lee, and Pereira ACL '97 and subsequently used in Dagan, Lee, and Pereira MLJ '99, Lee ACL '99, and Lee AISTATS '01. Please be sure to read this page, especially the notes, before using this data.

If you have any questions or comments regarding this site, want to be notified of updates, or have downloaded and used this data, please contact Lillian Lee.


Data description

The data was derived from verb-object cooccurrence pairs in the 1988 Associated Press newswire involving the 1000 most frequent nouns, extracted via Church's (1988) and Yarowsky's processing tools. We split this corpus into an 80% portion and a 20% portion. The 80% portion (587,833 pairs) served as a training set from which base probability distributions (and hence similarities) were computed. We then prepared test sets for the pseudoword disambiguation task, as follows.

Important notes

File format and download

All files are ASCII. The training set, train, consists of lines of the form count noun verb. The tuning/test sets, test1, ..., test5, consist of lines of the form noun verb alt-verb count, where count is the number of times (noun,verb) occurred in that partition.
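
The two line formats above can be read with a few lines of Python. What follows is only a minimal sketch, not code shipped with the data: it assumes fields are whitespace-separated and counts are integers, and every name in it (TrainPair, EvalInstance, parse_train_line, parse_test_line, verb_distributions) is hypothetical. The relative-frequency estimate of verbs given a noun is likewise an illustrative assumption, not necessarily the base distribution used in the papers.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class TrainPair(NamedTuple):
    """One line of the training file: count noun verb."""
    count: int
    noun: str
    verb: str


class EvalInstance(NamedTuple):
    """One line of a tuning/test file: noun verb alt-verb count."""
    noun: str
    verb: str
    alt_verb: str
    count: int


def parse_train_line(line: str) -> TrainPair:
    # Assumption: fields are separated by whitespace and count is an integer.
    count, noun, verb = line.split()
    return TrainPair(int(count), noun, verb)


def parse_test_line(line: str) -> EvalInstance:
    # Same assumptions as above; count comes last in the test files.
    noun, verb, alt_verb, count = line.split()
    return EvalInstance(noun, verb, alt_verb, int(count))


def verb_distributions(pairs: Iterable[TrainPair]) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each verb given a noun (illustrative only)."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for pair in pairs:
        counts[pair.noun][pair.verb] += pair.count
    result: Dict[str, Dict[str, float]] = {}
    for noun, verb_counts in counts.items():
        total = sum(verb_counts.values())
        result[noun] = {verb: c / total for verb, c in verb_counts.items()}
    return result


if __name__ == "__main__":
    # Made-up sample lines, for illustration only; not taken from the data.
    sample_train = ["3 wine drink", "1 wine pour"]
    pairs = [parse_train_line(line) for line in sample_train]
    print(verb_distributions(pairs))
    print(parse_test_line("wine drink sign 2"))
```

To use this on the real files, one would open each unpacked file and apply the matching parse function to every non-empty line.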

V1.1 (August 5, 2002): gzipped tarball simdata.tar.gz (740K), which unpacks to a directory containing 7 files. The largest file is train.gz (3M unzipped); the smallest is README (a text version of this webpage).

References

@inproceedings{Church:88a,
  author =	 {Kenneth Church},
  title =	 {A Stochastic Parts Program and Noun Phrase Parser
                  for Unrestricted Text},
  booktitle =	 {Proceedings of the Second Conference on Applied
                  Natural Language Processing},
  pages =	 {136--143},
  year =	 {1988}
}

@inproceedings{Dagan+Lee+Pereira:97a,
  author =       "Ido Dagan and Lillian Lee and Fernando Pereira",
  title =        "Similarity-Based Methods for Word Sense Disambiguation",
  booktitle =    "35th Annual Meeting of the ACL",
  year =         1997,
  pages =        {56--63}
}

@InProceedings{Lee:99a,
  author =       {Lillian Lee},
  title =        {Measures of Distributional Similarity},
  booktitle =    "37th Annual Meeting of the Association for Computational Linguistics",
  pages={25--32},
  year =         1999,
}

@InProceedings{Lee+Pereira:99a,
  author =       {Lillian Lee and Fernando Pereira},
  title =        {Distributional similarity models: Clustering vs. nearest neighbors},
  booktitle =    "37th Annual Meeting of the Association for Computational Linguistics",
  pages =        {33--40},
  year =         1999
}

@Article{Dagan+Lee+Pereira:99a,
  author =       {Ido Dagan and Lillian Lee and Fernando Pereira},
  title =        {Similarity-Based Models of Cooccurrence Probabilities},
  journal =      {Machine Learning},
  year =         1999,
  volume={34},
  number={1-3},
  pages={43--69}
}

@InProceedings{Lee:01a,
  author =       {Lillian Lee},
  title =        {On the Effectiveness of the Skew Divergence for
  Statistical Language Analysis},
  booktitle =    "Artificial Intelligence and Statistics 2001",
  pages =        {65--72},
  year =         2001
}

