Welcome to KDD Cup 2003, a knowledge discovery and data mining competition held
in conjunction with the Ninth Annual
ACM SIGKDD Conference. This year's competition focuses on problems
motivated by network mining and the analysis of usage logs. Complex networks
have emerged as a central theme in data mining applications, appearing in
domains that range from communication networks and the Web, to biological
interaction networks, to social networks and homeland security. At the same
time, the difficulty in obtaining complete and accurate representations of
large networks has been an obstacle to research in this area.
This KDD Cup is based on a very large archive of research papers that provides
an unusually comprehensive snapshot of a particular social network in action;
in addition to the full text of research papers, it includes both explicit
citation structure and (partial) data on the downloading of papers by users. It
provides a framework for testing general network and usage mining techniques,
which will be explored via four varied and interesting tasks. Each task is a
separate competition with its own specific goals.
The first task involves predicting the future; contestants predict how many
citations each paper will receive during the three months leading up to the KDD
2003 conference. For the second task, contestants must build a citation graph
of a large subset of the archive from only the LaTeX sources. In the third
task, each paper's popularity will be estimated based on partial download logs.
The last task is open: given the large amount of data, contestants can
devise their own questions, and the most interesting result wins.
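
As a rough illustration of the data structure behind the first two tasks, a citation graph can be held as a mapping from each citing paper to the set of papers it cites, and the number of citations a paper receives is then its in-degree. The sketch below is only an assumption about one convenient representation, not the competition's data format; the paper IDs, the edge list, and the helper names `build_citation_graph` and `citation_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical citation edges (citing paper -> cited paper); the IDs are
# made-up placeholders, not real archive identifiers.
edges = [
    ("paper-A", "paper-B"),
    ("paper-A", "paper-C"),
    ("paper-B", "paper-C"),
    ("paper-D", "paper-C"),
    ("paper-D", "paper-B"),
]


def build_citation_graph(edge_list):
    """Map each citing paper to the set of papers it cites."""
    graph = {}
    for citing, cited in edge_list:
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # keep papers that cite nothing
    return graph


def citation_counts(graph):
    """Count how many distinct papers cite each paper (its in-degree)."""
    counts = Counter({paper: 0 for paper in graph})
    for cited_set in graph.values():
        counts.update(cited_set)
    return counts


graph = build_citation_graph(edges)
counts = citation_counts(graph)
```

Using sets for the outgoing edges means a paper that cites the same reference twice is counted once, which may or may not match how a given task scores citations.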
The e-print arXiv, initiated in August 1991, has
become the primary mode of research communication in multiple fields of
physics and some related disciplines. It currently contains over 225,000 full
text articles and is growing at a rate of 40,000 new submissions per year. It
provides nearly comprehensive coverage of large areas of physics, and serves as
an on-line seminar system for those areas. It serves 10 million requests per
month, including tens of thousands of search queries per day. Its collections
are a unique resource for algorithmic experiments and model building. Usage
data has been collected since 1991, including Web usage logs beginning in 1993.
On average, the full text of each paper has been downloaded over 300 times since
1996, and some have been downloaded tens of thousands of times.
The Stanford Linear Accelerator
Center SPIRES-HEP database has been comprehensively cataloguing the
High Energy Particle Physics (HEP) literature online since 1974, and indexes
more than 500,000 high-energy physics related articles including their full