Sept 4, 2003: The datasets available for public download have been
finalized.
I. Citation Prediction Task
Available for contestants:

The LaTeX sources of all papers in the hepth portion of the arXiv until May 1,
2003 are available for download. Each paper is identified by a unique
arXiv id.
There are approximately 29,000 hepth papers totaling 1.7 GB of data. The papers
have been compressed to about 500 MB and divided into separate years for
downloading.

The abstracts for all the hepth papers as a hepth
abstracts tarball.

The SLAC dates for each hepth paper as a
hepth slacdates tarball.

The SLAC dates file is a sorted two-column list: the left column is the
paper's arXiv id and the right column is its SLAC date:
[arxiv id] [date in YYYYMMDD format]
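As a rough illustration of reading this format, here is a minimal Python sketch. It assumes the two columns are separated by whitespace and that ids should be kept as strings so that any leading zeros survive. The function name parse_slac_dates and the sample ids and dates are made up for the example and are not taken from the dataset.

```python
from datetime import date, datetime


def parse_slac_dates(lines):
    """Map arXiv id -> datetime.date from '[arxiv id] [YYYYMMDD]' lines.

    Assumes whitespace-separated columns (an assumption, not confirmed
    by the dataset description).
    """
    dates = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        arxiv_id, raw = fields
        # Keep the id as a string so leading zeros are not lost.
        dates[arxiv_id] = datetime.strptime(raw, "%Y%m%d").date()
    return dates


# Hypothetical sample rows, for illustration only.
sample = ["0001001 20000103", "0001002 20000104"]
slac = parse_slac_dates(sample)
```

With a real file, the same function could be called on an open file handle, since it only iterates over lines.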

The citation graph of the hepth portion of the arXiv as a
hepth citations tarball.

The citations file is a sorted two-column list: the left column is the arXiv
id of the citing paper and the right column is the arXiv id of the cited
paper:
[paper cited from] [paper cited to]
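The edge list above can be loaded into a simple in-memory graph. The following Python sketch is one possible way to do it, assuming whitespace-separated columns. The function name load_citations and the sample edges are hypothetical and not part of the dataset.

```python
from collections import defaultdict


def load_citations(lines):
    """Build forward and reverse adjacency sets from
    '[paper cited from] [paper cited to]' lines.

    Assumes whitespace-separated columns (an assumption).
    """
    cites = defaultdict(set)     # citing paper -> papers it cites
    cited_by = defaultdict(set)  # cited paper  -> papers citing it
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        src, dst = fields
        cites[src].add(dst)
        cited_by[dst].add(src)
    return cites, cited_by


# Hypothetical sample edges, for illustration only.
sample_edges = ["0001002 0001001", "0001003 0001001", "0001003 0001002"]
cites, cited_by = load_citations(sample_edges)

# Number of citations each paper has received within the graph.
in_degree = {paper: len(srcs) for paper, srcs in cited_by.items()}
```

Using sets means a citation pair that appears twice in the file is counted once. Whether the real file contains duplicates is not stated in the description.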
II. Data Cleaning Task
For this task the LaTeX sources of the hepph papers as of March 1, 2003 are
available for download. A random paper id between 1 and 100,000 has been
assigned to each paper. Also, a small subset of papers was converted from
pdf/ps and appears only as plain text.
There are over 35,000 hepph papers totaling 1.8 GB of data, so the download has
been broken into 10 separate gzipped tarballs of 50 MB each, plus 1 extra
tarball with the plain text papers.
Sept 4, 2003: The corresponding citation graph for hepph used as the
evaluation criterion is now available here.
III. Download Estimation Task
Available for this task are the same datasets as for Task 1, plus:

For each paper that was published in one of the listed six months (2/2000,
3/2000, 2/2001, 4/2001, 3/2002, 4/2002), the download logs from its first 60
days in the arXiv are provided.
Update Sept 4, 2003: The download data is no longer publicly available.
IV. Open Task
Contestants can use any of the hepth data from Tasks 1 or 3.
