This is a collection of data from the NIPS conference, covering 14 years
(1987-2000). Each year contains approximately 150 papers, except for the first
two years, which have about 100 papers each.

citations/
This is a manually constructed set of citations between NIPS documents in this
corpus.

clean/
This directory has the most recent version of the NIPS corpus. Note: the
document vectors here still contain the reference sections of the publications,
so only the metadata files from this directory should be used. Some of the
scripts used to generate these files are found in ~/doc_collections/datautil.
The metadata files in this directory are the up-to-date versions, not the ones
in the code/ directory.

  clean/corpusTf.dat
  This is the overall TF counts for the entire corpus.

  clean/distribution.dat
  This file holds the number of documents per year in the corpus.

  clean/doclist.txt
  This is the list of documents; document numbers follow the order of this
  file.

  clean/lexicon.txt
  This is a lexicon, with a list of words mapped to term IDs.
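
  The lexicon makes it possible to translate between words and the term IDs
  used in the tokenized files. A minimal sketch of loading such a mapping,
  assuming each line holds a word and its numeric term ID separated by
  whitespace (the actual layout of clean/lexicon.txt may differ, e.g. the ID
  may instead be implied by the line number):

  ```python
  def load_lexicon(lines):
      """Build word -> term-ID and term-ID -> word maps.

      Assumes each line is '<word> <termId>'; adjust if the real
      lexicon.txt uses a different layout.
      """
      word2id, id2word = {}, {}
      for line in lines:
          word, tid = line.split()
          word2id[word] = int(tid)
          id2word[int(tid)] = word
      return word2id, id2word

  # Hypothetical sample lines, not taken from the corpus:
  w2i, i2w = load_lexicon(["network 0", "bayesian 1"])
  ```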

  clean/tf.arff
  clean/tf.dat
  clean/tf.svmlight
  These files hold the TF vectors of the documents in the corpus, which are
  stored in Weka's ARFF format, the Matlab sparse matrix format, and the SVM
  light format, respectively.
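
  The SVMlight format is the easiest of the three to read by hand: each line
  is a label followed by sparse `index:value` pairs. A small parsing sketch
  (the sample line below is hypothetical, not taken from the corpus):

  ```python
  def parse_svmlight_line(line):
      """Parse one SVMlight-format line: '<label> <idx>:<val> ...'.

      Returns (label, {term_id: value}), keeping the indices exactly
      as written in the file.
      """
      parts = line.split()
      label = float(parts[0])
      feats = {}
      for tok in parts[1:]:
          idx, val = tok.split(":")
          feats[int(idx)] = float(val)
      return label, feats

  # Hypothetical example line:
  label, tf = parse_svmlight_line("1 3:2 17:1 42:5")
  ```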

  clean/tfIdf.arff
  clean/tfIdf.dat
  clean/tfIdf.svmlight
  Document TFIDF vectors stored in their respective formats.

  clean/tfPercents.dat
  This file lists, for several percentages, how many of the most frequent
  terms (by corpus-wide TF) are needed to account for that fraction of the
  total probability mass of the corpus.
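
  The quantity tabulated here can be recomputed from the corpus-wide TF
  counts. A sketch, assuming the counts are available as a list of integers
  (the exact on-disk format of corpusTf.dat is not specified here):

  ```python
  def terms_for_mass(tf_counts, fraction):
      """Number of top terms (by corpus-wide TF) whose counts sum to
      at least `fraction` of the total term count."""
      total = sum(tf_counts)
      target = fraction * total
      running, k = 0, 0
      for count in sorted(tf_counts, reverse=True):
          running += count
          k += 1
          if running >= target:
              return k
      return k

  # Toy counts: the single most frequent term covers half the mass.
  k = terms_for_mass([50, 30, 10, 10], 0.5)  # -> 1
  ```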

  clean/tokenized/
  This directory contains a file for each document, sorted by year, where the
  terms of the documents are replaced by their term IDs.

  clean/tokenized_lxl/
  This directory contains a file for each document, with terms replaced by term
  IDs.  The difference is that these files preserve the original line breaks
  of the data, which makes them more readable for presentation.

clean_noref/
This is the same as the cleaned data, except that the reference sections of the
documents have been removed. This is the version used in the KDD-07 paper.
Files missing from this directory are identical to the versions available in
the clean/ directory.

meta_authors/
The author metadata extracted from the NIPS proceedings (http://books.nips.cc)
online.  The script to extract these files is code/metainfo.pl.

nips-txt.tar.gz
nips-xml.tar.gz
These are the original downloaded data files.

The directories tokenized and tokenized_notitle hold the documents with the
content reduced to term IDs (notitle signifies the dataset with the title
documents removed).
The directories code and code_notitle hold various files:
distribution.dat  The breakdown of files into years
firstLines.txt    The "titles" (actually first two lines) of the docs
doclist.txt       A list of the doc files one-by-one in temporal order
lexicon.txt       Mapping of words to term IDs

