Annual Report to the NSF:
Integrating and Navigating Eprint Archives through Citation Linking

Project Number IIS-9907892

June 30, 2000

Introduction

NSF funded the Cornell Digital Library Research Group (part of the Computer Science Department at Cornell) to work with researchers at Southampton University to investigate the linking of electronic archives through Citation Linking (known more widely in the United States as Reference Linking). A good example of an electronic archive is the E-Print archive at Los Alamos National Laboratory, a third partner in this project.

The overall project has been named Open Citations, or OpCit for short [1,5].

While researchers at Southampton undertook the linking of the LANL archives using tools they developed for the Open Journal[6] project, we at Cornell took a more general view of the reference linking problem as it applied to online scholarly material, in order to determine a good framework for linking the online literature and to come up with tools that would encourage the linking of electronic publications.

It should be noted that a large group of journal publishers have their own project underway, called CrossRef[11]; this is a very worthwhile project, and when successful will be of great assistance to scientists and other technical workers. However, it relies on Digital Object Identifiers (DOIs), which might not be available to all online publications. At Cornell we are focusing on more informal materials, including born-digital materials such as D-Lib, grey literature such as articles published online by individuals, and repositories of technical reports, such as NCSTRL. We are interested in learning more about the extent to which reference linking information can be extracted automatically from online documents in a variety of formats (plain text, PostScript, HTML, ...).

Accomplishments in First Year

To launch the collaboration between Cornell and Southampton, we hosted an informal visit in Ithaca for the Southampton researchers, in early November 1999. During this meeting we formulated our first year group work plan and also exchanged experiences with various tools and interfaces. This meeting involved Carl Lagoze and Donna Bergmark (from Cornell), Steve Hitchcock and Zhuoan Jiao (of Southampton) and others. We invited Steve Lawrence of NEC and Eric Hellman of Openly Informatics to Ithaca to talk about their reference linking work. A return visit was made by Bergmark to Southampton during May 2000 to exchange software and ideas.

To get a solid grounding for further research in reference linking, we started in October with a survey of the reference linking tools and services that were currently available. This survey was completed by the end of November, and the results were presented to fellow researchers, along with short online demonstrations. The most useful reference linking efforts we found were CiteSeer (now ResearchIndex)[8], S-LinkS (and now LinkBaton[4]), SFX[12], Cora[9], and CatchWord[2].

With the intuition gained from this survey, we were able to see what link information is most important, and how that information is put to use. With this background, an API for Reference Linking was developed. The motivation for this work was to answer the question, "What would be the ideal behavior of a digital object that supported reference linking (both incoming and outgoing)?" Answering this question led to an API that included four principal methods:

1. getMyData() - the digital object should emit standard metadata describing that object, e.g. title, authors, and year of publication.
2. getReferenceList() - the digital object should say what its list of references is (this is the fixed set of references contained in the online document).
3. getCitationList() - the object can say what other works it knows have cited it. (This list grows as more and more items are analyzed.)
4. getLinkedText() - returns the original content of the digital object but with link information added to it so that each reference can be used to go directly to an online copy of the referenced work, if an online copy is available.

The API is realized by a Surrogate class, with one Surrogate object per archive item (an item is a document held in an online archive). The methods of the Surrogate object reflect the reference linking API just described. A Java interface was then written that specified the method signatures for the class; finally a "null implementation" was developed which added fields to the class and null implementations for the interface methods. The null implementation was working by the end of 1999.
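As a rough illustration of the interface and "null implementation" described above, the following is a minimal sketch in present-day Java. The interface name ReferenceLinking, the class name NullSurrogate, the itemUrl field, and the choice of String return types are all assumptions made for this example; the report does not give the actual signatures.

```java
// Hypothetical sketch only: type names, the field, and the String return
// types are assumptions, not the project's actual code.
interface ReferenceLinking {
    /** Standard metadata describing this item (title, authors, year, ...). */
    String getMyData();

    /** The fixed list of references contained in the item. */
    String getReferenceList();

    /** Works known to have cited this item; grows as more items are analyzed. */
    String getCitationList();

    /** The item's original content with link information added to its references. */
    String getLinkedText();
}

/** A "null implementation": it has fields and satisfies the interface, but does no analysis. */
class NullSurrogate implements ReferenceLinking {
    // Hypothetical field: where the archive item lives.
    private final String itemUrl;

    NullSurrogate(String itemUrl) {
        this.itemUrl = itemUrl;
    }

    String getItemUrl() {
        return itemUrl;
    }

    @Override
    public String getMyData() {
        return "";   // nothing extracted yet
    }

    @Override
    public String getReferenceList() {
        return "";   // nothing extracted yet
    }

    @Override
    public String getCitationList() {
        return "";   // no citing works known yet
    }

    @Override
    public String getLinkedText() {
        return "";   // no linked rendering yet
    }
}
```

A stub of this kind lets client code be written and compiled against the four methods before any extraction logic exists; a fuller implementation would replace the empty strings with real output.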

A complete implementation is now underway. The output of each of the four methods given above is an XML file that can be used for further analysis, or for rendering a linked document, or for many other purposes.

Currently we are almost done coding the first two methods, using HTML input from the D-Lib collection as a testbed. We are making use of tools from other XML projects, and from the Distributed Link System at Southampton. Cornell's software can directly call Southampton's citation analysis code to analyze the references contained in the Reference section of an online document. We will use some tools from the ResearchIndex project to handle PostScript documents, and other Southampton tools to handle PDF.

As part of this project, the question arose of whether the links that are normally found in an online journal such as D-Lib are sufficiently valid to be used in reference linking, or whether we must rely on the DOI method that is available to publishers in CrossRef. We conducted a study and found that, over D-Lib's 5 years online, more than 86% of the links in D-Lib papers are still valid [3].
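To make the notion of a "percentage of valid links" concrete, here is a small sketch of how such a figure could be computed. It is not the method used in the study [3], whose procedure is not described in this report; the class name, the HTTP HEAD check, the timeouts, and the rule that 2xx and 3xx responses count as valid are all assumptions for illustration.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: not the procedure used in the link accessibility study.
class LinkValiditySketch {

    /** Assumed validity test: the link answers an HTTP HEAD request with a 2xx or 3xx status. */
    static boolean respondsOk(String link) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(link).toURL().openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            // Malformed URL, non-HTTP scheme, timeout, or network failure: count as invalid.
            return false;
        }
    }

    /** Percentage of links accepted by the given validity test (0.0 for an empty list). */
    static double percentValid(List<String> links, Predicate<String> isValid) {
        if (links.isEmpty()) {
            return 0.0;
        }
        long valid = links.stream().filter(isValid).count();
        return 100.0 * valid / links.size();
    }
}
```

Passing the validity test in as a parameter keeps the arithmetic separate from the network check, so a stricter or looser definition of "valid" can be substituted; for live links one would call percentValid(links, LinkValiditySketch::respondsOk).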

Plans for Year 2

Our work in reference linking made good progress during the first year, 1999-2000. We need to finish implementing the reference linking API in Java and move the Surrogate objects into a persistent database (such as MySQL or FEDORA[10]). Another goal is to fold the reference linking API into the widely used Dienst[7] architecture and protocol. Not only would this provide us with persistent storage for the Surrogate objects, but it would also simplify the task of interlinking various archives.

This year we have been concentrating entirely on the analysis side of a two-stage process of creating a reference linking application:


[Figure: Overall reference linking architecture]

This first stage, the analysis of an online document to get link data, is followed by presentation/rendering of the marked-up document. In the second year we will look at using XSLT to transform one rendering of the document into another.
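As an illustration of what the rendering stage might look like, the sketch below applies an XSLT stylesheet to an XML reference list using the standard javax.xml.transform API. The XML element names (referenceList, reference), the url attribute, and the stylesheet itself are hypothetical; the report does not specify the schema of the XML that the API methods emit.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: the XML shape and stylesheet are assumptions for illustration.
class RenderingSketch {

    // Assumed analysis output: one reference has an online copy, one does not.
    static final String REFERENCES_XML =
        "<referenceList>"
      + "<reference url=\"http://www.crossref.org/\">CrossRef</reference>"
      + "<reference>An offline work</reference>"
      + "</referenceList>";

    // Stylesheet that renders the list as HTML, linking only references with a url.
    static final String TO_HTML_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/referenceList\">"
      + "<ol><xsl:apply-templates select=\"reference\"/></ol>"
      + "</xsl:template>"
      + "<xsl:template match=\"reference[@url]\">"
      + "<li><a href=\"{@url}\"><xsl:value-of select=\".\"/></a></li>"
      + "</xsl:template>"
      + "<xsl:template match=\"reference\">"
      + "<li><xsl:value-of select=\".\"/></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML and returns the transformed text. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Calling RenderingSketch.transform(RenderingSketch.REFERENCES_XML, RenderingSketch.TO_HTML_XSLT) yields an HTML list in which only the reference with an online copy becomes a hyperlink. Keeping the link data in XML and the presentation in a stylesheet means the same analysis output could be rendered in several ways without re-analyzing the document.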

We also are awaiting the arrival of Herbert van de Sompel as a visiting professor. He is well known in the field of reference linking, and we anticipate his involvement in the reference linking project.

Finally, as in 1999-2000, we anticipate getting together with our colleagues at Southampton at least twice during 2000-2001 to trade experiences and technical advice.

Acknowledgements

Additional support for this Reference Linking work came from the DARPA/CNRI grant, #2057/57-02, particularly to support work regarding D-Lib Magazine, an online journal about electronic publishing.

Bibliography

1
The open citation project.
<http://opcit.eprints.org>.

2
Active reference linking, 2000.
<http://www.catchword.co.uk/index.htm>.

3
D. Bergmark.
Link accessibility in electronic journal articles.
Technical Report TR 2000-1793, Cornell Computer Science Department, March 2000.
<http://www.cs.cornell.edu/bergmark/LinkAnalysis.ps>.

4
Eric Hellman.
LinkBaton: hyperlink personalization and localization, 2000.
<http://www.openly.com/linkbaton/>.

5
S. Hitchcock, L. Carr, Z. Jiao, D. Bergmark, W. Hall, C. Lagoze, and S. Harnad.
Developing services for open eprint archives: globalisation, integration and the impact of links.
In 5th ACM Conference on Digital Libraries, San Antonio, Texas, June 2 - June 7, 2000.

6
Steve Hitchcock, Les Carr, Wendy Hall, Stephen Harris, S. Probets, D. Evans, and D. Brailsford.
Linking electronic journals: Lessons from the Open Journal project.
D-Lib Magazine: The Magazine of Digital Library Research, December 1998. <http://www.dlib.org/dlib/december98/12hitchcock.html>.

7
C. Lagoze and J. Davis.
Dienst: An architecture for distributed document libraries.
Communications of the ACM, 38(4):47, April 1995.

8
Steve Lawrence, C. Lee Giles, and Kurt Bollacker.
Digital libraries and autonomous citation indexing.
IEEE Computer, 32(6):67-71, 1999.
<http://www.researchindex.com>.

9
Andrew McCallum.
About the Cora search engine.
<http://cora.whizbang.com/about.html>.

10
S. Payette and C. Lagoze.
Flexible and extensible digital object and repository architecture (FEDORA).
In Second European Conference on Research and Advanced Technology for Digital Libraries, Heraklion, Crete, 1998.

11
Pila, Inc.
CrossRef: The central source for reference linking.
<http://www.crossref.org/>.

12
Herbert Van de Sompel and Patrick Hochstenbach.
Reference linking in a hybrid library environment, part 2: SFX, a generic linking solution.
D-Lib Magazine: The Magazine of Digital Library Research, 5(4), April 1999. <http://www.dlib.org/dlib/april99/van_de_sompel/04van_de_sompel-pt2.html>.