I. Citation Prediction

Goal

The goal of this task is to predict changes in the number of citations to individual papers over time.

Timeline

The task and data will be available on April 6, 2003. Submissions are due by May 16 at 11:59pm EST. Submission instructions are online.

Input

Contestants will be given:
  1. The LaTeX source of all papers in the hep-th portion of the arXiv through March 1, 2003. For each paper, this includes the main .tex file but not separate include files or figures. It also includes the hep-th arXiv number as a unique ID.
  2. The abstracts for all of the hep-th papers in the arXiv. For each paper the abstract file contains:
    • arXiv submission date
    • revised date(s)
    • title
    • authors
    • abstract
  3. The SLAC/SPIRES dates for all hep-th papers. Some older papers were uploaded years after their initial publication, so the arXiv submission date from the abstracts may not correspond to the publication date. An alternative date from SLAC/SPIRES has been provided that may be a better estimate of the initial publication date of these old papers.
  4. The complete citation graph for the hep-th papers, obtained from SLAC/SPIRES. Each node will be labeled by its unique ID from (1). Note that revised papers may have updated citations. As such, citations may refer to future papers, i.e. a paper may cite another paper that was published after the first paper.
Update May 12, 2003: An updated version of the data for March and April of 2003 has been provided.
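
As a rough illustration of the note in item 4, the sketch below flags citations that point "forward in time", where the citing paper is dated before the paper it cites. The data layout is an assumption, not the format of the distributed files: edges are taken to be (citing, cited) pairs, and `submitted` is a hypothetical mapping from paper ID to a date. The IDs "A", "B" and "C" are placeholders.

```python
from datetime import date

def forward_citations(edges, submitted):
    """Return the edges whose citing paper is dated before the cited paper."""
    return [
        (citing, cited)
        for citing, cited in edges
        # Skip edges for which either date is unknown.
        if citing in submitted and cited in submitted
        and submitted[citing] < submitted[cited]
    ]

# Hypothetical toy data: placeholder IDs and dates, not taken from the data set.
submitted = {
    "A": date(2001, 1, 15),
    "B": date(2002, 6, 1),
    "C": date(2002, 9, 30),
}
edges = [("B", "A"), ("A", "C"), ("C", "B")]  # (citing, cited)

print(forward_citations(edges, submitted))
```

Whether such edges should be kept, dropped, or re-dated using the revision dates is a modelling choice left to the contestant.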

Output

For each paper P in the collection, contestants should report the predicted difference between
  • the number of citations P will receive from hep-th papers submitted during the period May 1, 2003 - July 31, 2003, and
  • the number of citations P will receive from hep-th papers submitted during the period February 1, 2003 - April 30, 2003. (So if there were more citations during the period May 1, 2003 - July 31, 2003, then the prediction should be a positive number.)
The submission format is a simple two-column list of [arxiv id] [difference] rows, sorted by arXiv id.

Update May 6, 2003: This difference does not need to be an integer; floating point numbers are valid predictions.
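
The quantity described above and the submission layout can be sketched as follows. This is a minimal sketch under stated assumptions: edges are (citing, cited) pairs, `submitted` is a hypothetical mapping from citing-paper ID to submission date, the two columns are separated by a single space, and IDs are sorted as plain strings. None of these details are fixed by the text above; the official submission instructions take precedence.

```python
from collections import Counter
from datetime import date

# The two periods named in the task description (inclusive bounds).
EARLY = (date(2003, 2, 1), date(2003, 4, 30))
LATE = (date(2003, 5, 1), date(2003, 7, 31))

def count_citations(edges, submitted, period):
    """Count citations each paper receives from papers submitted in `period`."""
    start, end = period
    counts = Counter()
    for citing, cited in edges:
        d = submitted.get(citing)
        if d is not None and start <= d <= end:
            counts[cited] += 1
    return counts

def citation_difference(edges, submitted, papers):
    """Later-period count minus earlier-period count, for each paper."""
    early = count_citations(edges, submitted, EARLY)
    late = count_citations(edges, submitted, LATE)
    return {p: late[p] - early[p] for p in papers}

def format_submission(predictions):
    """One '[arxiv id] [difference]' row per paper, sorted by ID (as strings)."""
    return "\n".join(f"{pid} {predictions[pid]}" for pid in sorted(predictions))

# Hypothetical toy data with placeholder IDs.
submitted = {
    "X1": date(2003, 2, 10), "X2": date(2003, 3, 5),
    "X3": date(2003, 5, 20), "X4": date(2003, 6, 1), "X5": date(2003, 7, 31),
}
edges = [("X1", "T"), ("X2", "T"), ("X3", "T"), ("X4", "T"), ("X5", "T"),
         ("X1", "U")]
diff = citation_difference(edges, submitted, ["T", "U"])
print(format_submission(diff))
```

In the toy data, paper T is cited twice in the earlier period and three times in the later one, so its difference is positive. Paper U is cited only in the earlier period, so its difference is negative.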

Evaluation

The target result is a vector V with one coordinate for each paper in the initial collection (1) that receives at least 6 citations during the period February 1, 2003 - April 30, 2003. The P-th coordinate of V will consist of the true difference in number of citations for paper P.

Based on a contestant's predictions, a vector W will be constructed over the same set of papers; the P-th coordinate of W will consist of the predicted difference in number of citations for paper P.

The score of a prediction vector W will be equal to the L_1 distance between the vectors V and W, i.e. the sum over all papers P of |V_P - W_P|.
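
The evaluation can be sketched as follows. This is an illustrative reading of the rules above, not the organizers' scoring script: the dictionary inputs, the function name `l1_score`, and the choice to treat a missing prediction as 0 are all assumptions.

```python
def l1_score(true_diff, predicted_diff, early_counts, min_citations=6):
    """Sum of |V_P - W_P| over papers with enough earlier-period citations.

    true_diff      -- paper ID -> true citation difference (vector V)
    predicted_diff -- paper ID -> predicted difference (vector W)
    early_counts   -- paper ID -> citations received February 1 - April 30, 2003
    """
    # Only papers with at least `min_citations` earlier-period citations count.
    scored = [p for p, n in early_counts.items() if n >= min_citations]
    # Assumption: a paper without a prediction is scored as if 0 were predicted.
    return sum(abs(true_diff[p] - predicted_diff.get(p, 0.0)) for p in scored)

# Hypothetical toy data with placeholder IDs.
early_counts = {"P1": 8, "P2": 6, "P3": 2}
true_diff = {"P1": 3, "P2": -4, "P3": 1}
predicted = {"P1": 1.5, "P2": -2.0, "P3": 10.0}

score = l1_score(true_diff, predicted, early_counts)
print(score)  # |3 - 1.5| + |-4 - (-2.0)| = 3.5; P3 is excluded
```

Because the score is a distance, a smaller value means the predictions are closer to the true differences. Papers below the six-citation threshold, such as P3 here, do not affect the score however far off their predictions are.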