API for Linkable References

(Donna Bergmark, Draft 5, April 18, 2000)

One of the goals in the work plan for the Reference Linking project was to define the properties of a linkable object. Email from Carl Lagoze to project partners provides useful background for the plan, the first step of which is to think a bit about what linkability actually means. Basically, we concluded that linkable means to be on the network and be findable. We also came up with a list of term definitions, the most important of which are "reference" and "citation". These are "creations" in the IFLA sense, and include both works and manifestations. Creations are referenceable; items are linkable because they have URLs. One Creation may have many Items. Furthermore, when I say "copy of a paper" I am really saying "Item of a Creation." The two phrases are interchangeable.

Design for a Reference Linking API

This section is an attempt to answer the question, "In the ideal world, what would be the operational semantics to link in and out of an object? What methods would you want the object's linking surrogate to have?" In the ideal world we should like to think of objects linking to each other, but for practical reasons we use proxies, or surrogates, to perform the linking semantics on behalf of the items they represent.

The surrogate will rely on a separate service, such as a handle server, to resolve a reference (i.e. a creation) to its n items. (We will have one of these servers by mid-April or so.) In particular, the server will be populated with URNs that resolve to locations of digital surrogate objects. The surrogate object will know the URL of its associated item.

There would be (in the perfect world) only one surrogate per creation, but in the real world we may not always be able to determine when two different sets of bibliographic data in fact represent a single creation. Instead we assume that each item in an archive has its own surrogate. Two different copies of the same paper will have two different surrogates, even if they belong to the same Creation.

Note that we define the API by listing the methods that we would like linking proxies to support. Each method comes with a description of what sort of information is to be returned; however, the precise format of input parameters and return values is left to a later, more detailed specification of the API.

Here is the fourth cut of the API, as of April 7.

getLinkedText() - Description significantly modified as of April 13
Disseminates the paper's contents, with live reference links embedded or accompanying it. This is the overall goal of the reference linking project. If the original item is in a modifiable format (HTML, PDF) then this method returns a data stream with the original item contents, but with linkable references within the paper demarcated by XLINKS (or OpenURLs). For items in a format that cannot be easily linked (e.g. TIFF), if the item is analyzable at all (doubtful) then the returned data stream could be a combination of the original item's contents and a separate entity containing references, including some with live links.
The following methods support this goal, but may serve other purposes as well. In the context of Open Archives, we are interested mainly in those records that include full text. In other words, it does not make much sense to link abstracts or Metadata files together.
getReferences() - Dropped, as of March 23
Returns the list of references contained in this document, as a string of parseable and processable characters (e.g. ASCII). This is essentially the references as they were spelled out in the document. First cut will be getting the Reference Section at the end of the paper, but other kinds of references need to be considered as well.
In any case, the references will be extracted, as is, from the parseable (e.g. ASCII) version of the document.
Note: Southampton feels this would be useful in the API and thinks it should be re-instated. The reason Carl originally suggested dropping it was because it implied the existence of a reference section at the end of the paper. This need not be the case, though, since references appearing as footnotes, e.g., could be collected together into a series of text fragments that together represents all the references in a text, as they appeared in the text.
Note: getReferences() could be reconstituted from the results of the getReferenceList() call.
getReferenceList()
The list of references contained in this item. The information for each reference includes as much "official" metadata as we have been able to find: missing data is filled in, authors' names are canonicalized, and errors are corrected, on a best effort basis. The data could be returned as XML, where the tags include the ones defined by the Open Archives Santa Fe Convention. An example of such an XML stream can be viewed here.
Each reference contains its original form in this item as a string of parseable and processable characters (e.g. ASCII). This is essentially the reference as it is spelled out in the document. By looking at this original form, it should be possible to determine what type of reference this is (see the glossary). One of these types is a linkable reference. [This was new, April 5. How does one tell from parsing the original text whether the reference is linkable? - Donna]
It would be desirable for each reference to carry the literal context in which it was used, because citation contexts have proven to be of great value to users.
Each reference has an associated unique document identifier (see below).
getCurrentCitationList()
Returns the list of known citations of this document, each citation (i.e. work) being in some canonical form, e.g. an XML stream. From the unique identifier of the citing document, one can retrieve the context of the reference. Note that the citing document should exist online, because we would have had to process it already in order to find the citation to this paper in the first place. But it has been noted that this is not the only way to get citations. One could ask the SCI, for example. Therefore this method also returns what type of citation this is. Only some types will have context(s).
The method returns as much information about each citation as possible. A client can decide how much or how little to display, e.g. just XML, just a list of document IDs, etc. Each citation has an associated unique document identifier (see below).
getMyData()
For this archive item, return whatever information there is relating to this document, such as title, author, and year published. There would be an internal, private method that could be called to look up or generate the unique document identifier that corresponds to this data (see below). As in the previous methods, all available metadata (excluding references and citations) is returned. Clients can choose how to display this data, such as original text fragments, XML metadata, canonical reference string (suitable for cutting and pasting as a reference into another paper), BibTeX (as several reference services do already), etc.
getID() - dropped, as of April 6
Returns the identifier of the creation that corresponds to a set of bibliographic data. It maps the bibliographic data to a URN which can be fed to a name server in order to get the list of URLs of the digital object surrogates that correspond to this creation. This method answers Carl's "is this you" question. The empty string is returned if the answer to the question is no or undecidable.
Note: This was dropped because it does not really relate to this object. This method would not be in the API, but instead would appear as a utility routine.

getRefID()
Given some bibliographic data as input, return true if this looks like one of the references in this paper. This answers Carl's question, "Is this one of your references?" (Actually, rather than true/false, the complete reference is returned, or null is returned.) This method also answers Carl's question, "How do you reference me?" if the complete reference includes its textual contexts.
getCitationID()
Given some bibliographic data as input, return true if this data corresponds to one of the known citations of this document. (Actually, rather than true/false, full data regarding the citation is returned, or null is returned.) This answers Carl's question, "am I one of your citations?"
[ There may be issues here, Carl says. I agree. It is kind of a strange method. Why would an Item want to know if it is in another work's citation list?]

getRelatedPapers()
This needs to be defined further, but is a placeholder for things like co-cited, co-referenced, other papers by the same author, etc.
getLocationList() - Dropped, March 13.
This one is now off the list. Instead we will use a handle system to resolve a set of bibliographic data to a list of locations of surrogates for copies of the paper corresponding to this bibliographic data.
Original semantics here were that given a document id, return all known locations of the paper as URIs.

One thing that needed resolution is what to use as a unique document identifier, and how best to distinguish between a paper (in the abstract) and a document (an online copy of the paper). Initially I proposed that the unique document identifier be like ResearchIndex's DID, while the online copy simply be identified by its URL. Carl, however, says to assume an ideal world, and that the handle system will never return a URL to the paper itself, but to the digital object representing that paper. The paper, the abstract entity that corresponds to this object, should be identified by its URN. For archives complying with the Open Archive Convention, this can be the "Full ID" field. Alternatively it could be the display ID.
In the end, I decided that the API would use a document id (DID) to identify works in the abstract. If the work has a DOI, then the DID is the DOI. Other DIDs are constructed from the first author's last name plus the year plus some of the title (following CiteSeer practice).
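The construction rule above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name DocumentIds, the method makeDid, the lowercase normalization, and the choice of the first title word as "some of the title" are all hypothetical, since the text does not fix the exact format of a DID.

```java
import java.util.Locale;

/** Hypothetical helper; names and output format are assumptions, not part of the API. */
final class DocumentIds {
    private DocumentIds() {}

    /**
     * Returns the DOI when one is known; otherwise builds an identifier
     * from the first author's last name, the year, and the first word
     * of the title (an assumed reading of "some of the title").
     */
    static String makeDid(String doi, String firstAuthorLastName, int year, String title) {
        if (doi != null && !doi.trim().isEmpty()) {
            return doi.trim();
        }
        String name = normalize(firstAuthorLastName);
        String[] words = title.trim().split("\\s+");
        String firstWord = normalize(words[0]);
        return name + year + firstWord;
    }

    /** Lowercases and strips everything except ASCII letters and digits. */
    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }
}
```

Under these assumptions, a work with no DOI by a first author named Bergmark, dated 2000 and titled "API for Linkable References", would get the DID bergmark2000api, while a work with a DOI would simply keep that DOI.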
The API described here is formalized in this Java interface. The Linkable.API package, written in Java, is a draft implementation of this API.
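The actual interface is not reproduced in this document. As a rough sketch only, the methods that remain active in the list above might be declared as follows. The interface name LinkableSurrogate, the use of String and List<String> as return types, and every parameter name are assumptions, since the text leaves the precise formats to a later specification.

```java
import java.util.List;

/** Hypothetical sketch; not the project's actual Java interface. */
interface LinkableSurrogate {
    /** Item contents with live reference links embedded or accompanying them. */
    String getLinkedText();

    /** References contained in this item, e.g. one XML record per reference. */
    List<String> getReferenceList();

    /** Known citations of this document, each in some canonical form. */
    List<String> getCurrentCitationList();

    /** Available metadata for this item, excluding references and citations. */
    String getMyData();

    /** The matching reference for the given bibliographic data, or null if none. */
    String getRefID(String bibliographicData);

    /** Full data on the matching citation, or null if none. */
    String getCitationID(String bibliographicData);

    /** Placeholder: co-cited, co-referenced, same-author papers, and so on. */
    List<String> getRelatedPapers();
}
```

The dropped methods (getReferences(), getID(), getLocationList()) are left out of the sketch on purpose.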
Proposed Implementation in FEDORA
FEDORA is a digital object and repository architecture. It could be used to build the bibliographic data proxies described in the previous section, but also to store these permanently for reuse by many different reference linking applications.
The methods in the previous section become methods in a FEDORA behavior. For example, one could well imagine that having bibliographic data and disseminating it would be a good behavior for objects that are to behave like linkable references. Each document in an archive will have a FEDORA object which acts as a linking surrogate to disseminate bibliographic data about that document. Much of this bibliographic data is stored right within the FEDORA object as Internal DataStreams. For example, after first being extracted, the reference list could be stored within the digital object as a DataStream. The document itself is simply a ReferenceStream. It remains as part of the archive and is referenced from the repository of surrogates.

If more than one copy of the same creation exists in the set of archives being processed, then there will be one distinct FEDORA object per copy, all of which contain pretty much the same data since they correspond to the same creation.
Originally there was the question of whether we should have just one "abstract" generic linking object that can disseminate bibliographic data no matter what paper it is pointing at, or one digital object per paper in an archive, or one digital object per archive. The main advantage to having one digital object per archive item is that the reference data, once extracted, would be permanently accessible. So, we have settled on this as being the best approach.
The collection of FEDORA objects then provides a uniform interface to various documents in various archives. With this uniform interface, inter-archive linking should be possible. We would keep a database of creation-related data. Each FEDORA object has a pointer into this database, to the entry that contains the creation metadata for this FEDORA object (in the March 13th version of this API, we kept this metadata in the object itself). All works that have appeared as items in the archive or as references within items in the archive will appear in this database. Such a database facilitates document id lookup.
In addition, we will keep a database containing cite-refs (links). This database makes updating of citation data more efficient. It can also be used standalone to analyze reference and citation patterns. Each record would have the following fields (at least): target document id, source document id. The database is indexed by the target id.
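The record layout and index described above could be modeled in memory as follows. This is a sketch for illustration, not the project's database: the class name CiteRefIndex and its method names are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the cite-refs database. */
final class CiteRefIndex {
    // Keyed by target document id, mirroring "indexed by the target id".
    private final Map<String, List<String>> sourcesByTarget = new HashMap<>();

    /** Records that the source document references the target document. */
    void addLink(String targetDid, String sourceDid) {
        sourcesByTarget.computeIfAbsent(targetDid, k -> new ArrayList<>()).add(sourceDid);
    }

    /** Returns the ids of documents known to cite the target (possibly empty). */
    List<String> citationsOf(String targetDid) {
        List<String> sources = sourcesByTarget.get(targetDid);
        return sources == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(sources);
    }
}
```

Because the index is keyed by target, finding all known citations of a document is a single lookup, which is the access pattern a method like getCurrentCitationList() would need.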
There might also be a name authority for names of authors, as well as for names of journals.

The link from the FEDORA digital object to these external databases will be programmatic.

Question: since the surrogate is a FEDORA object, what would its URL look like? We are assuming that a name service, given a URN, would return a list of surrogate URLs. But FEDORA objects have URNs (e.g. cornell.dli2/unique name of object) and not URLs. I think it would be better if the handle service mapped the creation's URN to the list of surrogate URNs, each of which then held a URL to the archive item the surrogate is representing.
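The two-step resolution proposed here can be sketched as follows. Everything in the example is hypothetical: the class name TwoStepResolver, its method names, and the in-memory maps that stand in for the handle service and for the surrogates themselves.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a handle service plus surrogate lookup. */
final class TwoStepResolver {
    // Step 1 data: creation URN -> URNs of its surrogates.
    private final Map<String, List<String>> surrogatesByCreation = new HashMap<>();
    // Step 2 data: surrogate URN -> URL of the archive item it represents.
    private final Map<String, String> itemUrlBySurrogate = new HashMap<>();

    /** Registers one surrogate of a creation together with its item's URL. */
    void register(String creationUrn, String surrogateUrn, String itemUrl) {
        surrogatesByCreation.computeIfAbsent(creationUrn, k -> new ArrayList<>()).add(surrogateUrn);
        itemUrlBySurrogate.put(surrogateUrn, itemUrl);
    }

    /** Resolves a creation URN to surrogate URNs, then each surrogate to its item URL. */
    List<String> resolveItemUrls(String creationUrn) {
        List<String> urls = new ArrayList<>();
        List<String> surrogates =
                surrogatesByCreation.getOrDefault(creationUrn, Collections.emptyList());
        for (String surrogateUrn : surrogates) {
            urls.add(itemUrlBySurrogate.get(surrogateUrn));
        }
        return urls;
    }
}
```

In this arrangement the name service never hands out item URLs directly; a client always passes through a surrogate, which is the point of the suggestion above.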

The Linking Service

This section addresses the question, "What would a reference linking service look like?"

One can think of the linking service within the context of something like Dienst or another retrieval service. The group at the University of Southampton is doing a lot of good work in this area. Once you have retrieved a paper and are viewing its full text, you ought to be able to access the references while reading the original paper. At the very least, a piece of JavaScript will bring up the reference in a separate window, as it appears at the end of or within the document you are reading. If the reference is linkable, then the user should be asked whether to retrieve the reference. If the answer is yes, then the retrieval of the referenced object should occur asynchronously, while the user continues to read the original paper.

Another function of the linking service is that when you retrieve a paper (full text or not) you should be able to ask the service for citations of this paper, related papers, contexts of the citations, and so on. The linking API described here should support that goal.

The API described here might become part of a larger service; for example, it could be merged with collection services. Or it could stand on its own as a reference linking service.

Using the API

One test of the API is to see if it can be used to construct a collection of surrogate objects in the first place. This page details the use of the API to construct such a collection.

Source: $HOME/private/DLRG/ReferenceLinking/API.html 2000/03/27 Updated 2000/05/04-7-18 based on comments from Carl Lagoze. Published: http://www.cs.cornell.edu/cdlrg/Reference%20Linking/APIforLinkableReferences/API.html