04/04/2000

Processing an Item in the Archive (Pseudo Code)

One test of the API is to see if it can be used to construct a collection of surrogate objects in the first place. This page details the use of the API to construct such a collection.

We assume two databases are available. The first holds creation-related data (URN, OAMS metadata). (Note: for Open Archives, this database can be loaded with existing metadata records.) The second is the citation database, which consists of source and target pairs of document ids. A document id could be the index into the first database at which the metadata for the citing or cited creation can be found.
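
A minimal sketch of these two databases, assuming in-memory lists stand in for real storage. The names TwoDatabases, CreationRecord, CreationDb, CitationDb, add, get, and sourcesCiting are hypothetical and not part of the API; the URNs and metadata strings are placeholders. The document id is taken to be the index into the first database, as suggested above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-ins for the two databases; not part of the API.
class TwoDatabases {

    // One record in the creation database: a URN plus its OAMS metadata.
    static final class CreationRecord {
        final String urn;
        final String oamsMetadata;

        CreationRecord(String urn, String oamsMetadata) {
            this.urn = urn;
            this.oamsMetadata = oamsMetadata;
        }
    }

    // First database: the document id is the index of the record (an assumption).
    static final class CreationDb {
        private final List<CreationRecord> records = new ArrayList<>();

        int add(String urn, String oamsMetadata) {
            records.add(new CreationRecord(urn, oamsMetadata));
            return records.size() - 1; // the new document id
        }

        CreationRecord get(int docId) {
            return records.get(docId);
        }
    }

    // Second database: (source, target) pairs of document ids.
    static final class CitationDb {
        private final List<int[]> pairs = new ArrayList<>();

        void add(int sourceId, int targetId) {
            pairs.add(new int[] {sourceId, targetId});
        }

        // Document ids of all creations that cite the given target.
        List<Integer> sourcesCiting(int targetId) {
            List<Integer> sources = new ArrayList<>();
            for (int[] pair : pairs) {
                if (pair[1] == targetId) {
                    sources.add(pair[0]);
                }
            }
            return sources;
        }
    }

    public static void main(String[] args) {
        CreationDb creations = new CreationDb();
        CitationDb citations = new CitationDb();

        int citing = creations.add("urn:example:a", "placeholder metadata for paper A");
        int cited = creations.add("urn:example:b", "placeholder metadata for paper B");
        citations.add(citing, cited);

        for (int source : citations.sourcesCiting(cited)) {
            System.out.println(creations.get(source).urn + " cites " + creations.get(cited).urn);
        }
    }
}
```

Running main prints "urn:example:a cites urn:example:b". A real citation database would index the pairs by target id rather than scan them; the linear scan only keeps the sketch short.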

  1. Get the next paper in the archive.
  2. Convert the full text of the paper to text/plain. If this can't be done, give up on this paper: without plain text we cannot determine its linking relationships.
  3. Instantiate a new Surrogate object, passing it the addresses of this item in the archive.
  4. During the instantiation, we extract the available metadata from the ASCII version of the document. Extracted information includes titles, authors, etc. The original text fragments are saved in the MIMEfile localMetaData. If there is a Dienst harvesting interface for this archive, use it to get the OAMS metadata, and try to find the matching text fragments.
  5. See if this paper is already in the database, using any available data to conduct the search. (NOTE: this may be a tough step, depending on the quality of the available data.) If it is, save its document id. If not, construct an OAMS metadata MIMEfile, add it to the database, and save the new document id. A new BibData is then constructed from the saved document id.

    We now have the following private fields defined:

    BibData   myData         // our document id
    String    myURL          // Network address of our item
    MIMEfile  localMetaData  // Original text fragments
    
  6. Next collect the references from the raw text. For each reference, save the original string, set its reference type (it may or may not be linkable), and save the contexts of the reference. If it makes sense, save the ordinal number. Parse the reference text to collect some metadata.
  7. For each reference, use collected metadata to look up the reference in the database. (Again, this step might be very tricky.)  If it is already there, save the document id. At this point, you might also have a chance to correct/add some more metadata to the database for this document, if the reference contains more information than was in the database.

    If the reference was in the database already, then it has a URN.  There are two main reasons why a reference might already be in the database: 1) it is an archive item that has already been analyzed, or 2) it already appeared as a reference in some other item analysis.

    In case 1), construct a new Citation out of this reference by giving it the context[] and the type of citation (REFERENCE). Use the reference's URN to locate the surrogates for copies of this creation. (This involves a call to a handle system.) For each surrogate on the list, invoke its addCitation method, handing it the new Citation object.

    In case 2), or if the reference is not already in the database, we need to construct an OAMS metadata MIMEfile and add it to the database. Save the newly generated document id.

    Finally, construct a BibData from the reference's document id and store it in referenceData.

    For each reference, construct a new CiteRef from this document's id and the reference id, and add it to the citation database. (ResearchIndex would also generate a unique CID for this citation.)

    At this point, we have a completed Reference object:

    BibData  referenceData   // pointer into the creation database; a doc id
    int      ordinalNumber   // which reference this is in this item
    String   origRef         // how the reference was spelled in the text
    String   context[]       // context strings from the text for this reference
    RefEnum  refType         // NATURAL, AMBIGUOUS, CLEAR, or LINKABLE
    
    Process each reference in the same way until the Surrogate's refList[] is complete.
  8. Finally, we do the citations. Go to the citation database. For each record where our document id is the target, use the source document id to retrieve the associated creation from the first database. This is the creation that cites us.

    If this citation is already in our knownCitations we are done with this CiteRef. (We know it's in our list by matching up document ids.)

    If it is not on our list, then we must construct a new Citation object and add it to our knownCitations. Constructing a new Citation object requires a document id, a set of context strings, and a citation type. We have the document id. Use it to access the surrogate corresponding to the citing creation.

    How do we do this access? First, we feed the citing creation's URN to our name server, which gives us URLs for all the surrogates for the creation that cites us. Pick one of the surrogates. Invoking its getRefID(MIMEfile citation-BibData) will return the complete Reference in the citing creation for which we were the target.

    Turn that Reference into a Citation by invoking the static Surrogate.buildCitation( Reference ) method.

  9. Take this Citation and add it to our knownCitations. We are now done with this CiteRef. Repeat until all CiteRefs for which we are the target have been handled. At this point, our knownCitations is complete. It may grow as other surrogate constructors invoke our addCitation method.
  10. Done building the Surrogate object for this item. Store the FEDORA object in the "repository" and go to step 1.
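
Steps 8 and 9 can be sketched as follows. This is a simplified illustration under stated assumptions, not the API itself: the name server and URN lookup are collapsed into a plain map from document id to a single surrogate, getRefID takes a document id instead of a MIMEfile, buildCitation takes the citing document id as an extra argument, and the citation type is omitted. The class name KnownCitationsSketch and the helpers alreadyKnown and completeKnownCitations are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-ins for the API classes; signatures are assumptions.
class KnownCitationsSketch {

    // One record of the citation database: source cites target.
    static final class CiteRef {
        final int sourceId;
        final int targetId;
        CiteRef(int sourceId, int targetId) { this.sourceId = sourceId; this.targetId = targetId; }
    }

    // A reference as seen from the citing item.
    static final class Reference {
        final int referenceId;   // document id of the cited creation
        final String[] context;  // context strings from the citing text
        Reference(int referenceId, String... context) { this.referenceId = referenceId; this.context = context; }
    }

    // A citation as seen from the cited item.
    static final class Citation {
        final int citingId;
        final String[] context;
        Citation(int citingId, String[] context) { this.citingId = citingId; this.context = context; }
    }

    static final class Surrogate {
        final int myId;
        final List<Reference> refList = new ArrayList<>();
        final List<Citation> knownCitations = new ArrayList<>();

        Surrogate(int myId) { this.myId = myId; }

        // Simplified getRefID: the Reference in this item that targets the given id, or null.
        Reference getRefID(int targetId) {
            for (Reference ref : refList) {
                if (ref.referenceId == targetId) {
                    return ref;
                }
            }
            return null;
        }

        // Simplified buildCitation; the explicit citing id is an assumption of this sketch.
        static Citation buildCitation(int citingId, Reference ref) {
            return new Citation(citingId, ref.context);
        }

        // Matches citations by document id, as described in step 8.
        private boolean alreadyKnown(int citingId) {
            for (Citation citation : knownCitations) {
                if (citation.citingId == citingId) {
                    return true;
                }
            }
            return false;
        }

        // Steps 8 and 9: add a Citation for every CiteRef that targets us and is not yet known.
        void completeKnownCitations(List<CiteRef> citationDb, Map<Integer, Surrogate> resolver) {
            for (CiteRef citeRef : citationDb) {
                if (citeRef.targetId != myId || alreadyKnown(citeRef.sourceId)) {
                    continue;
                }
                Surrogate citing = resolver.get(citeRef.sourceId);
                Reference ref = (citing == null) ? null : citing.getRefID(myId);
                if (ref != null) {
                    knownCitations.add(buildCitation(citeRef.sourceId, ref));
                }
            }
        }
    }

    public static void main(String[] args) {
        Surrogate citing = new Surrogate(0);
        Surrogate cited = new Surrogate(1);
        citing.refList.add(new Reference(1, "as shown in [3]"));

        Map<Integer, Surrogate> resolver = new HashMap<>();
        resolver.put(0, citing);
        resolver.put(1, cited);

        List<CiteRef> citationDb = new ArrayList<>();
        citationDb.add(new CiteRef(0, 1));

        cited.completeKnownCitations(citationDb, resolver);
        cited.completeKnownCitations(citationDb, resolver); // second pass adds nothing
        System.out.println(cited.knownCitations.size()); // prints 1
    }
}
```

Because known citations are matched by document id, running the pass a second time leaves knownCitations unchanged, which is the behavior step 8 asks for.
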
It looks as though the API, with two databases and a document id resolver, will let us build a collection of surrogates. Here are the particular methods in the API that were used:
Constructors for Surrogate, Reference, Citation, BibData, Creation, CiteRef
getID()
getRefID()
Protected methods used:
addCitation()
buildCitation()
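
As an illustration of where addCitation comes into play (step 7, case 1), here is a minimal sketch. Only the names Surrogate, Citation, addCitation, and REFERENCE come from the walkthrough above; AddCitationSketch, CitationType, HandleSystem, register, resolve, citationCount, and notifySurrogates are hypothetical, and the handle system is replaced by an in-memory map from URN to surrogates. The URN shown is a placeholder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; signatures are assumptions, not the actual API.
class AddCitationSketch {

    enum CitationType { REFERENCE }

    static final class Citation {
        final int citingId;
        final String[] context;
        final CitationType type;

        Citation(int citingId, String[] context, CitationType type) {
            this.citingId = citingId;
            this.context = context;
            this.type = type;
        }
    }

    static final class Surrogate {
        private final List<Citation> knownCitations = new ArrayList<>();

        // Protected in the API; package-private here so the sketch fits in one file.
        void addCitation(Citation citation) {
            knownCitations.add(citation);
        }

        int citationCount() {
            return knownCitations.size();
        }
    }

    // Stand-in for the handle system: maps a URN to the surrogates of all copies.
    static final class HandleSystem {
        private final Map<String, List<Surrogate>> surrogatesByUrn = new HashMap<>();

        void register(String urn, Surrogate surrogate) {
            surrogatesByUrn.computeIfAbsent(urn, key -> new ArrayList<>()).add(surrogate);
        }

        List<Surrogate> resolve(String urn) {
            return surrogatesByUrn.getOrDefault(urn, Collections.<Surrogate>emptyList());
        }
    }

    // Step 7, case 1: build one Citation and hand it to every surrogate of the cited creation.
    // Returns the number of surrogates that were notified.
    static int notifySurrogates(HandleSystem handles, String citedUrn, int citingId, String[] context) {
        Citation citation = new Citation(citingId, context, CitationType.REFERENCE);
        List<Surrogate> surrogates = handles.resolve(citedUrn);
        for (Surrogate surrogate : surrogates) {
            surrogate.addCitation(citation);
        }
        return surrogates.size();
    }

    public static void main(String[] args) {
        HandleSystem handles = new HandleSystem();
        Surrogate copyA = new Surrogate();
        Surrogate copyB = new Surrogate();
        handles.register("urn:example:cited", copyA);
        handles.register("urn:example:cited", copyB);

        int notified = notifySurrogates(handles, "urn:example:cited", 7, new String[] {"see [2]"});
        System.out.println(notified + " surrogates notified"); // prints "2 surrogates notified"
    }
}
```

Every surrogate registered under the cited creation's URN receives the same Citation object, so all copies of a creation learn about the new citation at once.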

bergmark/private/DLRG/ReferenceLinking/API.html 2000-03-27 Updated 2000-04-04 based on Zhuoan's comments.