KDD Cup 2003 - Datasets

Sept 4, 2003: The datasets available for public download have been finalized.

I. Citation Prediction Task

Available for contestants:

The LaTeX sources of all papers in the hep-th portion of the arXiv until May 1, 2003 are available for download. Each paper is identified by a unique arXiv id.

There are approximately 29,000 hep-th papers, totaling 1.7 GB of data. The papers have been compressed to about 500 MB and split by year for downloading.

hep-th 1992 (22M)	hep-th 1996 (41M)	hep-th 2000 (53M)
hep-th 1993 (31M)	hep-th 1997 (44M)	hep-th 2001 (56M)
hep-th 1994 (39M)	hep-th 1998 (45M)	hep-th 2002 (59M)
hep-th 1995 (36M)	hep-th 1999 (48M)	hep-th 2003 (17M)

The abstracts for all the hep-th papers are available as a hep-th abstracts tarball.

The SLAC dates for each hep-th paper are available as a hep-th slacdates tarball.
- The SLAC dates file is a sorted two-column list: the left column is the paper's arXiv id and the right column is its SLAC date:
  [arxiv id] [date in YYYY-MM-DD format]
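A minimal sketch of reading this two-column format in Python, assuming whitespace-separated columns and one paper per line. The function name `parse_slac_dates` and the sample rows are hypothetical and are not taken from the dataset. Ids are kept as strings so that any leading zeros survive.

```python
from datetime import date


def parse_slac_dates(lines):
    """Map each arXiv id (kept as a string) to its SLAC date."""
    dates = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        arxiv_id, raw_date = parts
        # The right column is documented as YYYY-MM-DD.
        dates[arxiv_id] = date.fromisoformat(raw_date)
    return dates


# Made-up rows in the documented layout, for illustration only.
sample_rows = ["0001001 2000-01-04", "9201001 1992-01-03"]
slac_dates = parse_slac_dates(sample_rows)
```

With a real file, the same function can be called on an open file handle, since iterating over the handle yields one line at a time.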

The citation graph of the hep-th portion of the arXiv is available as a hep-th citations tarball.
- The citations file is a sorted two-column list: the left column is the arXiv id of the citing paper and the right column is the arXiv id of the cited paper:
  [paper cited from] [paper cited to]
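The edge list above can be loaded into a simple directed graph. The following is a sketch only, assuming whitespace-separated columns. The names `parse_citations`, `cites`, and `cited_by` are hypothetical, and the sample edges are made up rather than drawn from the dataset.

```python
from collections import defaultdict


def parse_citations(lines):
    """Build forward and reverse adjacency sets from 'from to' pairs."""
    cites = defaultdict(set)     # citing paper -> papers it cites
    cited_by = defaultdict(set)  # cited paper -> papers that cite it
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        cited_from, cited_to = parts
        cites[cited_from].add(cited_to)
        cited_by[cited_to].add(cited_from)
    return cites, cited_by


# Made-up edges in the documented layout, for illustration only.
sample_edges = ["0001001 9201001", "0001002 9201001", "0001002 0001001"]
cites, cited_by = parse_citations(sample_edges)

# Number of citations each paper receives within the sample.
citation_counts = {paper: len(sources) for paper, sources in cited_by.items()}
```

Keeping both directions makes it cheap to ask either "what does this paper cite?" or "who cites this paper?" without rescanning the file.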

II. Data Cleaning Task

For this task, the LaTeX sources of the hep-ph papers as of March 1, 2003 are available for download. Each paper has been assigned a random paper id between 1 and 100,000. A small subset of papers was converted from PDF/PS and appears only as plain text.

There are over 35,000 hep-ph papers, totaling 1.8 GB of data, so the download has been split into 10 gzipped tarballs of 50 MB each, plus 1 extra tarball containing the plain-text papers.

hep-ph part 0	hep-ph part 4	hep-ph part 8
hep-ph part 1	hep-ph part 5	hep-ph part 9
hep-ph part 2	hep-ph part 6	hep-ph part 10 (plain text papers)
hep-ph part 3	hep-ph part 7

Sept 4, 2003: The corresponding citation graph for hep-ph, used as the evaluation criterion, is now available here.

III. Download Estimation Task

Available for this task are the same datasets as for Task I, plus:

For each paper that was published in one of the listed six months (2/2000, 3/2000, 2/2001, 4/2001, 3/2002, 4/2002), the download logs from its first 60 days in the arXiv are provided.
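The selection rule above can be sketched in Python under two assumptions that the page does not state: that hep-th ids follow the old-style seven-digit arXiv form YYMMNNN (so the month can be read off the id), and that "first 60 days" means day offsets 0 through 59 from the paper's first day in the arXiv. All function names here are hypothetical.

```python
from datetime import date

# The six months listed for the download estimation task, as (year, month).
LISTED_MONTHS = {
    (2000, 2), (2000, 3),
    (2001, 2), (2001, 4),
    (2002, 3), (2002, 4),
}


def id_month(arxiv_id):
    """Read (year, month) from an id, ASSUMING the YYMMNNN form."""
    yy, mm = int(arxiv_id[:2]), int(arxiv_id[2:4])
    # The arXiv started in 1991, so two-digit years from 91 up are 19xx.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm


def in_listed_month(arxiv_id):
    """True if the id's month is one of the six listed months."""
    return id_month(arxiv_id) in LISTED_MONTHS


def in_first_60_days(first_day, access_day):
    """True if access_day falls within the assumed 60-day window."""
    return 0 <= (access_day - first_day).days < 60
```

For example, `in_listed_month("0002001")` is true because the id encodes February 2000, while `in_listed_month("0001001")` is false.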

Update Sept 4, 2003: The download data is no longer publicly available.

IV. Open Task

Contestants can use any of the hep-th data from Task I or Task III.
