FAQs

General Questions

Question (June 19, 2003): Is the use of external data allowed for Tasks II and III?

Revised Answer: For Tasks II and III, our initial policy was to prohibit the use of external data. The intent was to prevent KDD Cup participants from designing solutions that explicitly make use of on-line resources that might be construed as containing partial solutions to the specified tasks.

However, after considerable communication with Cup participants, we feel it is necessary to refine the policy in a way that still preserves its initial intent:

(1) The use of any bibliographic data, task-specific external data, or any other external resources specific to the task of indexing scientific literature, is prohibited.

(2) However, the use of generic lexical resources -- that is, general resources about the English language, such as WordNet, general-purpose dictionaries and thesauri, and lists of stop-words -- is permitted.

(3) If you are making use of external resources other than the examples specifically mentioned in (2), you must verify their eligibility with the KDD Cup chairs as soon as possible, and in no case later than July 1.

Task III (Download Estimation)

Question (June 9, 2003): Are we allowed to use all the data available in Task 1 or just the LaTeX sources?

Answer: All of the datasets from Task 1 are available for use for Task 3. The task description for Task 3 has been updated to clarify this.

Question: Is it acceptable to produce a vector of floating point numbers (i.e., representing the number of downloads for each paper), or are we required to output a vector of integers?

Answer: A vector of floating point numbers is acceptable. This has also been updated in the statement of the task.

Question: The naming convention on the files that seems to correspond with the month and year of most submissions is the "hep-th arxiv number." It has no meaning for the purpose of this problem, aside from being a unique ID. Let me know if this represents a publication date or something key like that.

Answer: The hep-th/yymmnnn number is simply a sequential accession number nnn within the year and month yymm in which the paper was submitted to the hep-th arXiv. It is assigned at the time of hep-th submission.
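As an illustration of the yymmnnn layout described above, the following is a minimal sketch of how an identifier might be split into its parts. The function name parse_hepth_id and the century pivot (two-digit years of 91 and above are read as 19yy, the rest as 20yy) are assumptions made for this example and are not part of the task definition.

```python
import re

def parse_hepth_id(identifier):
    """Split a hep-th identifier of the form yymmnnn into (year, month, sequence).

    Accepts the number with or without the "hep-th/" prefix.
    """
    match = re.fullmatch(r"(?:hep-th/)?(\d{2})(\d{2})(\d{3})", identifier)
    if match is None:
        raise ValueError(f"not a hep-th yymmnnn identifier: {identifier!r}")
    yy, mm, nnn = (int(group) for group in match.groups())
    if not 1 <= mm <= 12:
        raise ValueError(f"month out of range in {identifier!r}")
    # Assumption: two-digit years 91-99 are 1991-1999, everything else is 20yy.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return year, mm, nnn
```

For example, parse_hepth_id("hep-th/9901001") yields (1999, 1, 1): the first paper accepted by hep-th in January 1999.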

Question: The date that appears in all the abstracts after "Date: " is the "arXiv submission date." Is this the date that the paper was submitted for publication?

Answer: No, this is the date it was deposited in the arXiv. It may have been submitted for journal publication before or after that date, though typically articles are submitted for publication shortly afterwards.

Question: Revised dates. These only appear in some articles. I didn't see any connection to how we need to use them.

Answer: Some articles are later replaced with (a series of) revised versions, some are not. Revisions sometimes involve added references. The relevant date is usually the earliest date associated with any submission.

Question: SLAC/SPIRES date. Our best estimate for when a paper was published. I am guessing it is included because it is the date we should use for determining the date of any given citation. The file containing these dates should list the arxiv numbers of every article we downloaded.

Answer: The SLAC/SPIRES date is sometimes a mysterious notion. Most often, it is a date shortly after the above arXiv received date, corresponding to when SLAC/SPIRES downloaded the metadata. Sometimes it is a date long before the arXiv received date, which means that it is a pre-existing record corresponding to a back submission, e.g. a paper published in the 1980s that someone chose to submit to hep-th during the 1990s for historical or other purposes. In general, the earliest date associated with any given submission is typically the relevant one.
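The "earliest date" rule of thumb can be sketched as follows. This is only an illustration: the function name earliest_date and its parameters are hypothetical, and it assumes the dates have already been parsed from the data files into datetime.date values.

```python
from datetime import date

def earliest_date(arxiv_date, revised_dates=(), spires_date=None):
    """Return the earliest of the dates known for one submission.

    arxiv_date    -- date the paper was deposited in the arXiv
    revised_dates -- dates of any revised versions (may be empty)
    spires_date   -- SLAC/SPIRES date, or None if no record exists
    """
    candidates = [arxiv_date, *revised_dates]
    if spires_date is not None:
        candidates.append(spires_date)
    return min(candidates)
```

For a back submission, where the SLAC/SPIRES date precedes the arXiv received date, this picks the SLAC/SPIRES date; otherwise it normally picks the arXiv received date.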

Question: What is the L_1 difference between two vectors X and Y?

Answer: By the L_1 difference we mean the L_1 norm of X - Y, that is, the sum of the absolute values of the differences between corresponding entries: |X_1 - Y_1| + |X_2 - Y_2| + ... + |X_n - Y_n|.
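A minimal sketch of this computation, assuming the two vectors are given as equal-length sequences of numbers (the function name l1_difference is chosen for this example only):

```python
def l1_difference(x, y):
    """Return the L_1 norm of x - y: the sum of |x_i - y_i| over all entries."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))
```

For example, l1_difference([1.0, 2.5, 3.0], [2.0, 2.0, 5.0]) is 1.0 + 0.5 + 2.0 = 3.5.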

Task I (Citation Prediction)

Question (May 6, 2003): Is it acceptable to produce a vector of floating point numbers (i.e., representing partial differences in the number of citations between time periods), or are we required to output a vector of integers?

Answer: A vector of floating point numbers is acceptable. This has also been updated in the statement of the task.