II. Data Cleaning


It is often estimated that data cleaning is the most expensive and arduous task of the knowledge discovery process. Whether for industry or government databases, linking records and identifying identical objects in dirty data is hard and costly.

The goal of this task is to clean a very large set of real-life data: we would like to re-create the citation graph of the roughly 35,000 papers in the hep-ph portion of the arXiv.


The task and data will be available April 6, 2003. Submissions must be completed by July 21, 2003.


Contestants will be given the LaTeX sources of all papers in the hep-ph portion of the arXiv on April 1, 2003. For each paper, this includes the main .tex file, but not separate include files or figures. The references in each paper have been "sanitized" through a script by removing all unique identifiers such as arXiv codes or other code numbers. No attempts have been made to repair any damages from the sanitization process. Each paper has been assigned a unique number.
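Since the sanitized sources retain their bibliographies as raw LaTeX, a natural first step is to split each paper's thebibliography environment into individual references. The sketch below is a rough first pass under that assumption; real arXiv sources vary widely (custom bibliography macros, BibTeX-generated files) and need more robust handling. The sample input is invented for illustration.

```python
import re

def extract_bibitems(tex_source: str) -> list[str]:
    """Split the thebibliography environment of a LaTeX source into
    one raw string per \\bibitem entry (a rough first pass; real
    arXiv sources need more robust handling)."""
    match = re.search(
        r"\\begin\{thebibliography\}\{[^}]*\}(.*?)\\end\{thebibliography\}",
        tex_source,
        re.DOTALL,
    )
    if match is None:
        return []
    body = match.group(1)
    # Each \bibitem starts a new reference; drop any text before the first.
    parts = re.split(r"\\bibitem(?:\[[^\]]*\])?\{[^}]*\}", body)
    return [p.strip() for p in parts[1:]]

# Invented sample input for illustration.
sample = r"""
\begin{thebibliography}{9}
\bibitem{a} A. Author, Some Paper, Phys. Rev. D.
\bibitem{b} B. Author, Another Paper, Nucl. Phys. B.
\end{thebibliography}
"""
print(extract_bibitems(sample))
```

Matching each extracted reference string back to a paper id in the collection is the hard part of the task; this split only produces the candidate strings to match.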


For each paper P in the collection, contestants must produce the list of other papers {P1, ..., Pk} in the collection such that P cites P1, ..., Pk. Note that P might also cite papers that are not in the collection.

The format for submission is a plain ASCII file with two columns: the left column is the id of the citing paper, and the right column is the id of the cited paper. The file must be sorted.
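The format above can be produced with a few lines of code. This is a minimal sketch; the column separator and the exact sort order (here: numeric, by citing id then cited id) are assumptions, since the task statement does not fix them, and the edge set shown is invented for illustration.

```python
def write_submission(edges, path):
    """Write citation pairs in the two-column plain-ASCII format
    described above: citing paper id, then cited paper id, sorted.
    Whitespace separator and numeric sort order are assumptions."""
    with open(path, "w") as f:
        for src, dst in sorted(edges):
            f.write(f"{src} {dst}\n")

# Invented example edges: paper 9811001 cites two papers, and is itself cited.
edges = {(9811001, 9705002), (9811001, 9611103), (9905111, 9811001)}
write_submission(edges, "submission.txt")
```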


The target is a graph G=(V,E) with each paper P a node in the graph, and each citation a directed edge in the graph. Assuming that a contestant submits a graph G'=(V,E'), the score is the size of the symmetric difference between E and E': |E - E'| + |E' - E|.
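The scoring rule is a one-liner over edge sets: it counts the true edges a submission misses plus the edges it invents. A minimal sketch, with invented edge sets for illustration:

```python
def score(true_edges, submitted_edges):
    """Size of the symmetric difference between the true edge set E
    and the submitted edge set E': missed edges plus invented edges.
    A perfect reconstruction scores 0."""
    E, Ep = set(true_edges), set(submitted_edges)
    return len(E ^ Ep)  # == len(E - Ep) + len(Ep - E)

# Invented example: the submission misses (1, 3) and invents (2, 4).
E  = {(1, 2), (1, 3), (2, 3)}
Ep = {(1, 2), (2, 3), (2, 4)}
print(score(E, Ep))  # → 2
```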