Project Number IIS-9907892
June 30, 2000
The overall project has been named Open Citations, or OpCit for short [1,5].
While researchers at Southampton undertook the linking of the LANL archives using tools they developed for the Open Journal[6] project, we at Cornell took a more general view of the reference linking problem as it applies to online scholarly material. Our aim was to determine a good framework for linking the online literature and to develop tools that would encourage the linking of electronic publications.
It should be noted that a large group of journal publishers has its own project underway, called CrossRef[11]. This is a very worthwhile project and, when successful, will be of great assistance to scientists and other technical workers. However, it relies on Digital Object Identifiers (DOIs), which might not be available to all online publications. At Cornell we are focusing on more informal materials, including born-digital materials such as D-Lib, grey literature such as articles published online by individuals, and repositories of technical reports, such as NCSTRL. We are interested in learning more about the extent to which reference linking information can be extracted automatically from online documents in a variety of formats (plain text, PostScript, HTML, and so on).
To get a solid grounding for further research in reference linking, we started in October with a survey of the reference linking tools and services then available. This survey was completed by the end of November, and the results were presented to fellow researchers, along with short online demonstrations. The most useful reference linking efforts we found were CiteSeer (now ResearchIndex)[8], S-LinkS (and now Link Baton[4]), SFX[12], Cora[9], and Catch Word[2].
With the intuition gained from this survey, we were able to see which link information is most important and how that information is put to use. With this background, we developed an API for reference linking. The motivating question was: "What would be the ideal behavior of a digital object that supported reference linking (both incoming and outgoing)?" Answering this question led to an API that included four principal methods:
The API is realized by a Surrogate class, with one Surrogate object per archive item (an item is a document held in an online archive). The methods of the Surrogate object reflect the reference linking API just described. A Java interface was then written that specified the method signatures for the class; finally a "null implementation" was developed which added fields to the class and null implementations for the interface methods. The null implementation was working by the end of 1999.
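The interface-plus-null-implementation pattern described above can be illustrated with a minimal sketch. It is written in Python rather than the Java of the actual implementation, and the four method names (get_linked_text, get_reference_list, get_citation_list, get_my_data) and the item_url field are hypothetical stand-ins chosen for illustration, not the names used in the project.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Surrogate(ABC):
    """Reference linking interface for one archive item.

    All four method names are hypothetical; they only illustrate the idea
    of fixing the method signatures before any real analysis code exists.
    """

    @abstractmethod
    def get_linked_text(self) -> Optional[str]:
        """Return the item's text with reference links marked up, as XML."""

    @abstractmethod
    def get_reference_list(self) -> Optional[str]:
        """Return the works this item cites (outgoing links), as XML."""

    @abstractmethod
    def get_citation_list(self) -> Optional[str]:
        """Return the works that cite this item (incoming links), as XML."""

    @abstractmethod
    def get_my_data(self) -> Optional[str]:
        """Return this item's own bibliographic data, as XML."""


class NullSurrogate(Surrogate):
    """Null implementation: has the fields, but every method returns None."""

    def __init__(self, item_url: str) -> None:
        # One surrogate object per archive item (hypothetical field name).
        self.item_url = item_url

    def get_linked_text(self) -> Optional[str]:
        return None

    def get_reference_list(self) -> Optional[str]:
        return None

    def get_citation_list(self) -> Optional[str]:
        return None

    def get_my_data(self) -> Optional[str]:
        return None
```

A null implementation of this kind lets client code be written and compiled against the interface before the analysis code behind each method is complete.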
A complete implementation is now underway. The output of each of the four methods given above is an XML file that can be used for further analysis, or for rendering a linked document, or for many other purposes.
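To make the idea of XML output concrete, the following is a minimal sketch of how a list of extracted references might be serialized. The element and attribute names (reference-list, reference, item, author, title, year, url) and the function name are assumptions made for illustration; the project's actual XML format is not specified here.

```python
import xml.etree.ElementTree as ET


def reference_list_to_xml(item_url, references):
    """Serialize extracted references as an XML string.

    `references` is a list of dicts; the element names used here are
    hypothetical and do not reflect the project's real schema.
    """
    root = ET.Element("reference-list", {"item": item_url})
    for ref in references:
        node = ET.SubElement(root, "reference")
        for field in ("author", "title", "year", "url"):
            value = ref.get(field)
            if value:  # skip fields the analysis could not extract
                ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")
```

Because the result is ordinary XML, a later stage can parse it again for further analysis or use it to render a linked version of the document.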
Currently we are almost done coding the first two methods, using HTML input from the D-Lib collection as a testbed. We are making use of tools from other XML projects and from the Distributed Link System at Southampton. Cornell's software can directly call Southampton's citation analysis code to analyze the references contained in the Reference section of an online document. We will use some tools from the ResearchIndex project to handle PostScript documents, and other Southampton tools to handle PDF.
As part of this project, the question arose of whether the links normally found in an online journal such as D-Lib are sufficiently valid to be used in reference linking, or whether we must rely on the DOI method that is available to publishers in CrossRef. We did a study and found that, over D-Lib's five years online, more than 86% of the links in D-Lib papers are still valid [3].
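A link validity figure of this kind is the share of extracted URLs that still resolve. The sketch below shows one way such a percentage could be computed; it is not the method used in the study [3]. The function names and the choice of an HTTP HEAD request as the validity test are assumptions for illustration.

```python
import urllib.error
import urllib.request


def url_resolves(url, timeout=10.0):
    """Return True if a HEAD request to `url` succeeds with a non-error status.

    Assumption: "valid" is taken to mean "the server still answers for this
    URL"; the study may have applied a different criterion.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, unreachable hosts, timeouts, malformed URLs.
        return False


def validity_rate(urls, check=url_resolves):
    """Return the percentage of `urls` for which `check` returns True."""
    urls = list(urls)
    if not urls:
        return 0.0
    valid = sum(1 for url in urls if check(url))
    return 100.0 * valid / len(urls)
```

Passing the checker as a parameter keeps the arithmetic testable without network access, for example validity_rate(links, check=lambda u: u in known_good) for some hypothetical set known_good.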
This year we have been concentrating entirely on the analysis side of a two-stage process of creating a reference linking application:
This first stage, the analysis of an online document to obtain link data, is followed by presentation/rendering of the marked-up document. In the second year we plan to look at using XSLT to transform one rendering of the document into another.
We are also awaiting the arrival of Herbert van de Sompel as a visiting professor. He is well known in the field of reference linking, and we anticipate his involvement in the reference linking project.
Finally, as in 1999-2000, we anticipate getting together with our colleagues at Southampton at least twice during 2000-2001 to trade experiences and technical advice.