Goal
It is often estimated that data cleaning is the most expensive and arduous
task of the knowledge discovery process. Whether in industry or government
databases, linking records and identifying identical objects in dirty data
is a hard and costly problem.
The goal of this task is to clean a very large set of real-life data: We would
like to re-create the citation graph of about 35,000 papers in the hep-ph
portion of the arXiv.
Timeline
The task and data will be available April 6, 2003. Submissions must be
completed by July 21, 2003.
Input
Contestants will be given the LaTeX sources of all papers in the hep-ph portion
of the arXiv on April 1, 2003. For each paper, this includes the main .tex
file, but not separate include files or figures. The references in each paper
have been "sanitized" through a script by removing all unique identifiers such
as arXiv codes or other code numbers. No attempts have been made to repair any
damages from the sanitization process. Each paper has been assigned a unique
number.
Output
For each paper P in the collection, a list of other papers {P1, ..., Pk} in the
collection such that P cites P1, ..., Pk. Note that P might cite papers that
are not in the collection.
The format for submission is a plain ASCII file with 2 columns: the left column
is the id of the citing paper and the right column is the id of the cited
paper. The file should also be sorted.
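As a minimal sketch, the submission text could be produced as follows. The function name, the single-space column separator, and the lexicographic sort are assumptions; the task statement does not fix a delimiter or a precise sort key.

```python
def format_submission(edges):
    """Render (citing_id, cited_id) pairs as a sorted,
    two-column ASCII submission file.

    `edges` is any iterable of integer id pairs; duplicates
    are dropped before sorting.
    """
    # Sort by citing id, then cited id, one edge per line.
    return "\n".join(f"{a} {b}" for a, b in sorted(set(edges))) + "\n"
```

For example, `format_submission([(3, 1), (1, 2)])` yields the two lines `1 2` and `3 1`.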
Evaluation
The target is a graph G=(V,E) with each paper P a node in the graph, and each
citation a directed edge in the graph. Assuming that a contestant submits a
graph G'=(V,E'), the score is the size of the symmetric difference between E
and E': |E-E'| + |E'-E|. A lower score is better.
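The scoring rule above can be sketched as a set operation, assuming citations are represented as (citing_id, cited_id) pairs:

```python
def score(true_edges, submitted_edges):
    """Size of the symmetric difference between the true edge set E
    and the submitted edge set E'; lower is better."""
    E, E_prime = set(true_edges), set(submitted_edges)
    # Missed edges (in E but not E') plus spurious edges (in E' but not E).
    return len(E ^ E_prime)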