Datasets from Some Distributional Similarity Experiments

This page, http://www.cs.cornell.edu/home/llee/data/sim.html, contains the
data introduced in Dagan, Lee, and Pereira ACL '97 and subsequently used in
Dagan, Lee, and Pereira MLJ '99, Lee ACL '99, and Lee AISTATS '01. Please be
sure to read this page, especially the notes, before using this data.

If you have any questions or comments regarding this site, want to be
notified of updates, or have downloaded and used this data, please contact
Lillian Lee.
  ------------------------------------------------------------------------

Data description

The data was derived from verb-object cooccurrence pairs involving the 1000
most frequent nouns in the 1988 Associated Press newswire, extracted via
Church's (1988) and Yarowsky's processing tools. We split this corpus into
an 80% portion and a 20% portion. The 80% portion (587,833 pairs)
served as a training set from which base probability distributions (and
hence similarities) were computed. We then prepared test sets for the
pseudoword disambiguation task, as follows.

   * We first needed to determine which verb pairs would serve as confusion
     sets. We simply sorted the verbs by frequency and created confusion
     sets by going down the list two words at a time; hence, the two words
     in each confusion set would generally be close to the same frequency.
     (A code sketch of the whole construction appears after this list.)

   * Next came the creation of the test sets. From the 20% portion of the
     original data, we discarded the noun-verb pairs that also appeared in
     the 80% training portion (our work focused on estimation for unseen
     events). Then, we split the remaining pairs into five partitions, and
     replaced each noun-verb pair (N,V) with a noun-verb-verb triple
     (N,V,V'), where {V,V'} was one of our confusion sets. The task for the
     language model under evaluation was to determine which of (N,V) or
     (N,V') was the original cooccurrence. Observe that by construction the
     first verb was always the correct answer with respect to this task
     (see the first note below), and the two alternatives would generally
     have similar frequencies.
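
A minimal Python sketch of this construction (illustrative only: the helper
names are hypothetical, the actual extraction used Church's and Yarowsky's
tools, and the five-way partitioning of the remaining pairs is omitted):

    from collections import Counter

    def confusion_sets(pairs):
        # pairs: iterable of (noun, verb) cooccurrences from the training
        # portion. Sorting verbs by frequency and pairing them two at a
        # time ensures the two members of each confusion set have roughly
        # equal frequency.
        freq = Counter(v for _, v in pairs)
        ranked = [v for v, _ in freq.most_common()]
        alt = {}
        for i in range(0, len(ranked) - 1, 2):
            alt[ranked[i]] = ranked[i + 1]
            alt[ranked[i + 1]] = ranked[i]
        return alt

    def make_triples(held_out_pairs, training_pairs, alt):
        # Discard held-out pairs already seen in training (the experiments
        # target unseen events), then replace each (N, V) with (N, V, V').
        # By construction, the first verb of each triple is correct.
        seen = set(training_pairs)
        return [(n, v, alt[v])
                for (n, v) in held_out_pairs
                if (n, v) not in seen and v in alt]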

Important notes

   * In the test sets, there is no guarantee that (N,V') did not occur in
     the data (either in the 80% or the 20%), and in fact it is possible
     that (N,V') occurred more times than (N,V). That is, the evaluation
     task measured whether the method could recover the original verb, not
     whether the method could select the more likely verb to take the noun
     as object.

   * The results of the cited papers are averages over the five test sets,
     where for each set the other four served as parameter-tuning data. But
     the 80% training set was the same for all five test runs, so this was
     not standard cross-validation (see the sketch following these notes).
     The reason is that we view the computation of similarity as potentially
     divorced from the task of estimating probabilities based on the
     similarities (see Lee ACL '99 for further discussion). So the
     similarity training data, consisting of noun-verb pairs, was kept
     separate from the task-specific parameter-tuning data, consisting of
     noun-verb-alt_verb triples.

   * The datasets for the experiments of Lee and Pereira ACL '99, which were
     stored at AT&T, are not currently available. For the record, though, we
     recommend the experimental setup described in that paper over the one
     used here. The differences include a dataset-creation technique that
     more closely mirrors standard cross-validation, no restriction to the
     most frequent nouns, and a guarantee that in the test data the
     alternative verb was indeed (empirically) less likely to co-occur with
     the noun than the original verb.
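
Schematically, the tuning regime described in the second note above looks
like the sketch below, where tune() and evaluate() are hypothetical
stand-ins for whatever method is under evaluation, and test1, ..., test5
are the files described in the next section:

    def tune(tuning_files):
        # Hypothetical: search for task parameters on the four tuning sets.
        return {}

    def evaluate(test_file, params):
        # Hypothetical: disambiguation accuracy on the held-out test set.
        # The similarity model itself is always trained on the single 80%
        # train set, which is what makes this non-standard.
        return 0.0

    scores = []
    for k in range(1, 6):
        tuning = [f"test{j}" for j in range(1, 6) if j != k]
        scores.append(evaluate(f"test{k}", tune(tuning)))
    print(sum(scores) / 5)  # the cited papers report this five-way average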

File format and download

All files are ASCII. The training set, train, consists of lines of the form
"count noun verb". The tuning/test sets, test1, ..., test5, consist of lines
of the form "noun verb alt-verb count", where count is the number of times
(noun,verb) occurred in that partition.
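
For concreteness, here is a minimal Python sketch that reads these files
(assumed already unzipped into the working directory) and scores a model on
the disambiguation task. The pick() heuristic is only an illustrative
placeholder, not a method from the cited papers; since by construction
every test (noun,verb) pair is unseen in train, it mainly serves to
illustrate the first note above:

    from collections import defaultdict

    # train: lines of the form "count noun verb".
    count = defaultdict(int)
    with open("train") as f:
        for line in f:
            c, noun, verb = line.split()
            count[(noun, verb)] += int(c)

    def pick(noun, verb, alt_verb):
        # Placeholder model: prefer the verb seen more often with this
        # noun in training (ties go to the first verb). A real model would
        # compare smoothed estimates of P(verb | noun), e.g., via
        # distributional similarity.
        return (verb if count[(noun, verb)] >= count[(noun, alt_verb)]
                else alt_verb)

    # test1: lines of the form "noun verb alt-verb count"; the first verb
    # is always the correct answer.
    right = total = 0
    with open("test1") as f:
        for line in f:
            noun, verb, alt_verb, c = line.split()
            right += int(c) * (pick(noun, verb, alt_verb) == verb)
            total += int(c)
    print("accuracy:", right / total)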

V1.1, August 5, 2002: gzipped tarball simdata.tar.gz (740K), a directory
containing 7 files; the largest is train.gz (3M unzipped), and the smallest
is README (a text version of this webpage).

References

@inproceedings{Church:88a,
  author =    {Kenneth Church},
  title =     {A Stochastic Parts Program and Noun Phrase Parser for
               Unrestricted Text},
  booktitle = {Proceedings of the Second Conference on Applied Natural
               Language Processing},
  pages =     {136--143},
  year =      1988
}

@inproceedings{Dagan+Lee+Pereira:97a,
  author =    {Ido Dagan and Lillian Lee and Fernando Pereira},
  title =     {Similarity-Based Methods for Word Sense Disambiguation},
  booktitle = {35th Annual Meeting of the ACL},
  pages =     {56--63},
  year =      1997
}

@inproceedings{Lee:99a,
  author =    {Lillian Lee},
  title =     {Measures of Distributional Similarity},
  booktitle = {37th Annual Meeting of the Association for Computational
               Linguistics},
  pages =     {25--32},
  year =      1999
}

@inproceedings{Lee+Pereira:99a,
  author =    {Lillian Lee and Fernando Pereira},
  title =     {Distributional Similarity Models: Clustering vs. Nearest
               Neighbors},
  booktitle = {37th Annual Meeting of the Association for Computational
               Linguistics},
  pages =     {33--40},
  year =      1999
}

@article{Dagan+Lee+Pereira:99a,
  author =    {Ido Dagan and Lillian Lee and Fernando Pereira},
  title =     {Similarity-Based Models of Cooccurrence Probabilities},
  journal =   {Machine Learning},
  volume =    {34},
  number =    {1--3},
  pages =     {43--69},
  year =      1999
}

@inproceedings{Lee:01a,
  author =    {Lillian Lee},
  title =     {On the Effectiveness of the Skew Divergence for Statistical
               Language Analysis},
  booktitle = {Artificial Intelligence and Statistics 2001},
  pages =     {65--72},
  year =      2001
}

