API for Linkable References

(Donna Bergmark and Carl Lagoze, Final Version) (previous version)

A linkable reference is a work that has at least one copy somewhere on the network and that we know how to find. Here is a list of term definitions, the most important of which are "reference" and "citation". Both are "creations" in the IFLA sense, a term that covers both works and manifestations. Creations are referenceable; items are linkable because they have URLs. One Creation may have many Items. When we say "copy of a paper" we really mean "Item of a Creation"; the two phrases are interchangeable.

Design for a Reference Linking API

This section is an attempt to answer the question, "In the ideal world, what would be the operational semantics to link in and out of an object? What methods would you want the object's linking surrogate to have?" In the ideal world we should like to think of objects linking to each other, but for practical reasons we use proxies, or surrogates, to perform the linking semantics on behalf of the items they represent.

In the perfect world there would be only one surrogate per creation, but in the real world we may not always be able to determine when two different sets of bibliographic data in fact represent a single creation. Instead we assume that each item in a repository has a surrogate. Two copies of the same paper that exist in two different repositories will have two different surrogates, even though they belong to the same Creation.

Note that we define the API by listing the methods that we would like linking proxies to support. Each method description says what sort of information is to be returned; the precise format of input parameters and return values is left to a later, more detailed specification of the API.

Here is the final cut of the API:

getLinkedText()
Disseminates the paper's contents, with live reference links embedded or accompanying it. This is the overall goal of the reference linking project. If the original item is in a modifiable format (HTML, PDF), this method returns a data stream with the original item contents, but with linkable references within the paper demarcated by <reflinks> tags. For items in a format that cannot be easily linked (e.g. TIFF), if the item is analyzable at all (doubtful), the returned data stream could be a combination of the original item's contents and a separate entity containing references, including some with live links.
The following methods support this goal, but may serve other purposes as well. In the context of Open Archives, we are interested mainly in those records that include full text. In other words, it does not make much sense to link abstracts or Metadata files together.
getReferenceList()
The list of references contained in this item. The information for each reference includes as much "official" metadata as we have been able to find: missing data is filled in, authors' names are canonicalized, and errors are corrected, on a best-effort basis. The data could be returned as XML. An example of such an XML stream can be viewed here.
Each reference contains its original form in this item as a string of parseable and processable characters (e.g. UTF-8/ASCII). This is essentially the reference as it is spelled out in the document. By looking at this original form, it should be possible to determine what type of reference this is (see the glossary). One of these types is a linkable reference. [This was new, April 5. How does one tell from parsing the original text whether the reference is linkable? - Donna. This whole issue of types of references is still unresolved.]
It would be desirable for each reference to carry the literal context in which it was used, because citation contexts have proven to be of great value to users.
Each reference has an associated identifier (see below) which is unique in the space of scholarly online works.
getCurrentCitationList()
Returns the list of known citations of this document, each citation (i.e. work) being in some canonical form, e.g. an XML stream. From the unique identifier of the citing document, one can retrieve the context of the reference. Note that the citing document should exist online, because we would have had to process it already in order to find the citation to this paper in the first place. But it has been noted that this is not the only way to get citations. One could ask the SCI, for example. Therefore this method also returns what type of citation this is. Only some types will have context(s).
The method returns as much information about each citation as possible. A client can decide how much or how little to use or display, e.g. just XML, or just a list of document IDs. Each citation has an associated unique document identifier (see below).
getMyData()
For this repository item, return whatever information there is relating to this document, such as title, author, and year published. There would be an internal, private method that could be called to look up or generate the unique document identifier that corresponds to this data (see below). As in the previous methods, all available metadata (excluding references and citations) is returned. Clients can choose how to display this data, such as original text fragments, XML metadata, a canonical reference string (suitable for cutting and pasting as a reference into another paper), BibTeX (as several reference services do already), etc.
getRefID()
Given some bibliographic data as input, return true if this looks like one of the references in this paper. This answers Carl's question, "Is this one of your references?" (Actually, rather than true/false, the complete reference is returned, or null is returned.) This method also answers Carl's question, "How do you reference me?" if the complete reference includes its textual contexts.
This has not been implemented.

getCitationID()
Given some bibliographic data as input, return true if this data corresponds to one of the known citations of this document. (Actually, rather than true/false, full data regarding the citation is returned, or null is returned.) This answers Carl's question, "Am I one of your citations?"
[ There may be issues here, Carl says. I agree. It is kind of a strange method. Why would an Item want to know if it is in another work's citation list?]

This has not been implemented.

getRelatedPapers()
This needs to be defined further, but is a placeholder for things like co-cited, co-referenced, other papers by the same author, etc.
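
As a rough illustration only, the methods listed above could be gathered into a Java interface along the following lines. This is a sketch, not the project's actual interface: the Reference and Citation types, their fields, the Map-based parameters, and the StubSurrogate class with its title-matching rule are all assumptions made for the example, since the text deliberately leaves parameter and return formats unspecified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical value type: one reference found in an item. */
final class Reference {
    final String did;           // document id of the referenced work
    final String originalText;  // the reference as spelled out in the item
    final String title;         // cleaned-up title, used here for matching
    Reference(String did, String originalText, String title) {
        this.did = did;
        this.originalText = originalText;
        this.title = title;
    }
}

/** Hypothetical value type: one known citation of an item. */
final class Citation {
    final String did;   // document id of the citing work
    final String type;  // how the citation was found (assumed label)
    Citation(String did, String type) {
        this.did = did;
        this.type = type;
    }
}

/** The seven methods named in the text; all signatures are assumed. */
interface LinkingSurrogate {
    String getLinkedText();
    List<Reference> getReferenceList();
    List<Citation> getCurrentCitationList();
    Map<String, String> getMyData();
    Reference getRefID(Map<String, String> bibData);     // null if no match
    Citation getCitationID(Map<String, String> bibData); // null if no match
    List<String> getRelatedPapers();
}

/** Minimal in-memory stand-in, only to show how a client might call the API. */
final class StubSurrogate implements LinkingSurrogate {
    private final Map<String, String> myData;
    private final List<Reference> references;

    StubSurrogate(Map<String, String> myData, List<Reference> references) {
        this.myData = myData;
        this.references = new ArrayList<>(references);
    }

    public String getLinkedText() {
        throw new UnsupportedOperationException("not part of this sketch");
    }
    public List<Reference> getReferenceList() { return new ArrayList<>(references); }
    public List<Citation> getCurrentCitationList() { return new ArrayList<>(); }
    public Map<String, String> getMyData() { return myData; }

    // Placeholder matching rule (an assumption): case-insensitive title equality.
    public Reference getRefID(Map<String, String> bibData) {
        String wanted = bibData.get("title");
        if (wanted == null) {
            return null;
        }
        for (Reference r : references) {
            if (r.title.equalsIgnoreCase(wanted.trim())) {
                return r;
            }
        }
        return null;
    }

    public Citation getCitationID(Map<String, String> bibData) { return null; }
    public List<String> getRelatedPapers() { return new ArrayList<>(); }
}
```

Returning null from getRefID() and getCitationID() follows the description above; a real matcher would need something more forgiving than exact title equality.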

One thing that needed resolution is what to use as a unique document identifier, and how best to distinguish between a paper (in the abstract) and a document (an online copy of the paper). Initially I proposed that the unique document identifier be like ResearchIndex's DID, while the online copy simply be identified by its URL. Carl, however, says to assume an ideal world, and that the handle system will never return a URL to the paper itself, but to the digital object representing that paper. The paper, the abstract entity that corresponds to this object, should be identified by its URN. For archives complying with the Open Archive Initiative, this can be the oai-identifier or RepositoryName. For other repositories, it could be a DOI.
In the end, we decided that the API would use a document id (DID) to identify works in the abstract. This DID is constructed from the first author's last name plus the year plus some of the title (following CiteSeer practice).
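
To make the DID construction concrete, here is a minimal sketch in Java. The text only says that the DID combines the first author's last name, the year, and some of the title; the exact layout used here (lowercased name, four-digit year, first two title words, all punctuation dropped) is an assumption for illustration and is not claimed to match CiteSeer's format. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper for building document ids (DIDs). */
final class DocumentIds {
    private DocumentIds() {}

    /**
     * Builds a DID from first-author last name, year, and the first two
     * non-empty title words. The layout is an assumed one, for illustration.
     */
    static String makeDid(String firstAuthorLastName, int year, String title) {
        StringBuilder titlePart = new StringBuilder();
        int words = 0;
        for (String word : title.split("\\s+")) {
            String w = normalize(word);
            if (w.isEmpty()) {
                continue; // skip empty tokens and tokens that were pure punctuation
            }
            titlePart.append(w);
            if (++words == 2) {
                break;
            }
        }
        return normalize(firstAuthorLastName) + year + titlePart;
    }

    // Lowercase and keep only ASCII letters and digits. Accented letters are
    // dropped, which a real implementation would probably want to transliterate.
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }
}
```

For example, DocumentIds.makeDid("Bergmark", 2000, "API for Linkable References") yields "bergmark2000apifor". Two different papers can collide under such a scheme, so a lookup database would still have to disambiguate.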
The API described here is formalized in this Java interface. The Linkable.API package, written in Java, is a draft implementation of this API.
Proposed Implementation in FEDORA
FEDORA is a digital object and repository architecture. It could be used to build the bibliographic data proxies described in the previous section, but also to store these permanently for reuse by many different reference linking applications.
The methods in the previous section become methods in a FEDORA behavior. For example, one could well imagine that having bibliographic data and disseminating it would be a good behavior for objects that are to behave like linkable references. Each document in an archive will have a FEDORA object which acts as a linking surrogate to disseminate bibliographic data about that document. Much of this bibliographic data is stored right within the FEDORA object as Internal DataStreams. For example, after first being extracted, the reference list could be stored within the digital object as a DataStream. The document itself is simply a ReferenceStream. It remains as part of the archive and is referenced from the repository of surrogates.

If more than one copy of the same creation exists in the set of archives being processed, then there will be one distinct FEDORA object per copy, all of which contain pretty much the same data since they correspond to the same creation.
Originally there was the question of whether we should have just one "abstract" generic linking object that can disseminate bibliographic data no matter what paper it is pointing at, or one digital object per paper in an archive, or one digital object per archive. The main advantage to having one digital object per archive item is that the reference data, once extracted, would be permanently accessible. So, we have settled on this as being the best approach.
The collection of FEDORA objects then provides a uniform interface to various documents in various archives. With this uniform interface, inter-archive linking should be possible. We would keep a database of creation-related data. Each FEDORA object has a pointer into this database, to the entry that contains the creation metadata for this FEDORA object (in the March 13th version of this API, we kept this metadata in the object itself). All works that have appeared as items in the archive or as references within items in the archive will appear in this database. Such a database facilitates document id lookup.
In addition, we will keep a database containing cite-refs (links). This database makes updating of citation data more efficient. It can also be used standalone to analyze reference and citation patterns. Each record would have at least the following fields: target document id, source document id. The database is indexed by the target id.
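
The cite-ref database can be pictured with a small in-memory sketch in Java. It assumes nothing beyond the two fields named above (target document id, source document id) and the index on the target id; the class name CiteRefIndex and its methods are hypothetical, and a real deployment would presumably use a persistent store rather than a HashMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the cite-ref (link) database. */
final class CiteRefIndex {
    // Indexed by target id: target DID -> DIDs of the documents that cite it.
    private final Map<String, List<String>> sourcesByTarget = new HashMap<>();

    /** Records that the source document references the target document. */
    void addLink(String sourceDid, String targetDid) {
        List<String> sources =
            sourcesByTarget.computeIfAbsent(targetDid, k -> new ArrayList<>());
        if (!sources.contains(sourceDid)) {
            sources.add(sourceDid); // ignore duplicate links
        }
    }

    /** Returns the known citing documents of the target, in insertion order. */
    List<String> citationsOf(String targetDid) {
        List<String> sources = sourcesByTarget.get(targetDid);
        if (sources == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(sources);
    }
}
```

Because the index is keyed by target, answering a citation-list request for one document is a single lookup, and processing a new paper only appends to the entries for the works it references.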
There might also be a name authority for names of authors, as well as for names of journals.

The link from the FEDORA digital object to these external databases will be programmatic.

Question: since the surrogate is a FEDORA object, what would its URL look like? We are assuming that a name service, given a URN, would return a list of surrogate URLs. But FEDORA objects have URNs (e.g. cornell.dli2/unique name of object) and not URLs. I think it would be better if the handle service mapped the creation's URN to the list of surrogate URNs, each of which then held a URL to the archive item the surrogate is representing.
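
The two-step resolution suggested in this question can be sketched as follows in Java. This is only an illustration of the idea (creation URN to surrogate URNs, then each surrogate to the URL of the item it represents); the SurrogateResolver class, its methods, and every identifier used with it are hypothetical and do not reflect any actual handle-service or FEDORA interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical resolver: creation URN -> surrogate URNs -> item URLs. */
final class SurrogateResolver {
    private final Map<String, List<String>> surrogatesByCreation = new HashMap<>();
    private final Map<String, String> itemUrlBySurrogate = new HashMap<>();

    /** Registers one surrogate for a creation, with the URL of its archive item. */
    void register(String creationUrn, String surrogateUrn, String itemUrl) {
        surrogatesByCreation
            .computeIfAbsent(creationUrn, k -> new ArrayList<>())
            .add(surrogateUrn);
        itemUrlBySurrogate.put(surrogateUrn, itemUrl);
    }

    /** Step 1: the name service maps a creation URN to its surrogate URNs. */
    List<String> surrogatesOf(String creationUrn) {
        List<String> found = surrogatesByCreation.get(creationUrn);
        return found == null ? new ArrayList<>() : new ArrayList<>(found);
    }

    /** Step 2: each surrogate holds the URL of the item it represents. */
    List<String> itemUrlsOf(String creationUrn) {
        List<String> urls = new ArrayList<>();
        for (String surrogateUrn : surrogatesOf(creationUrn)) {
            urls.add(itemUrlBySurrogate.get(surrogateUrn));
        }
        return urls;
    }
}
```

Keeping the item URL inside the surrogate means the name service never has to hand out archive URLs directly, which is the property the question is after.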

The Linking Service

This section addresses the question, "What would a reference linking service look like?"

One can think of the linking service within the context of something like Dienst or another retrieval service. The group at the University of Southampton is doing a lot of good work in this area. Once you have retrieved a paper and are viewing its full text, you ought to be able to access the references while reading the original paper. At the very least, a JavaScript function will bring up the reference in a separate window, as it appears at the end of or within the document you are reading. If the reference is linkable, then the user should be asked whether to retrieve the reference. If the answer is yes, then the retrieval of the referenced object should occur asynchronously, while the user continues to read the original paper.

Another function of the linking service is that when you retrieve a paper (full text or not) you should be able to ask the service for citations of this paper, related papers, contexts of the citations, and so on. The linking API described here should support that goal.

The API described here might become part of a larger service; for example, it could be merged with collection services. Or it could stand on its own as a reference linking service.

Using the API

One test of the API is to see if it can be used to construct a collection of surrogate objects in the first place. This page details the use of the API to construct such a collection.

Source: $HOME/private/DLRG/ReferenceLinking/API.html 2000/03/27 Updated 2000/05/04-7-18 based on comments from Carl Lagoze. Published: http://www.cs.cornell.edu/cdlrg/Reference%20Linking/APIforLinkableReferences/API.html