Data for lexical simplification experiments

This site accompanies the paper
For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia (the linked title leads to the paper and a poster, which contains more examples of results)
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee
Proceedings of NAACL, 2010 (short paper).

On June 3, 2010, we released the output of various stages of our processing pipeline. These files are: enwiki.tar.gz (1.67 GB), simplewiki.tar.gz (88.9 MB), basefiles.tar.gz (135 MB), fullwiki_files.tar.gz (1.66 GB), and simplewiki_files.tar.gz (92.7 MB). The file simple.ids.titles maps Simple English Wikipedia article IDs to their titles. The file full.ids.titles.sid maps "complex" English article IDs to their titles and to the corresponding Simple English article IDs.
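For readers who want to work with the two map files programmatically, below is a minimal Python sketch for loading them into dictionaries. The one-record-per-line, tab-separated layout, the field order, and the function names are assumptions made for illustration (the draft README documents the actual formats), so adjust the field handling to match the real files.

    # Hypothetical loaders for the ID/title map files.
    # ASSUMPTION: one record per line, tab-separated fields;
    # check the draft README for the actual layout.

    def load_simple_map(path="simple.ids.titles"):
        """Map Simple English Wikipedia article IDs to titles."""
        id_to_title = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    article_id, title = fields[0], fields[1]
                    id_to_title[article_id] = title
        return id_to_title

    def load_full_map(path="full.ids.titles.sid"):
        """Map "complex" English article IDs to (title, simple article ID)."""
        id_to_record = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3:
                    en_id, title, simple_id = fields[0], fields[1], fields[2]
                    id_to_record[en_id] = (title, simple_id)
        return id_to_record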

The file outputs.zip contains the top 100 outputs of the five "systems" we compared in our paper. The files output.thresholded and output.translation, released April 6, 2011, are the full output of our system.

We are preparing a final README for these files. In the meantime, please see our draft README; it is an evolving document, but it should contain enough detail for interested parties to start working with the data.

The file labels.v1.0.tar.gz (11KB), released April 1, 2010, consists of the output of our human annotators. Please see the file README.v1.0.txt (also included in the tarball) for more details, including the instructions we provided to the annotators.

This material is based upon work supported in part by the National Science Foundation under grant IIS-0910664. Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Last edited: June 4, 2012, 11:18 AM