uk.ac.soton.harvester
Class Deciter

java.lang.Object
  |
  +--uk.ac.soton.harvester.Deciter

public class Deciter
extends java.lang.Object

The Deciter class does all the significant work in decoding a set of citations.


Field Summary
(package private) static int AUTHORS
          AUTHORS is the index of the object in the AttributeMarkers array that recognises the position of the authors in the citation string.
(package private) static int DATE
          DATE is the index of the object in the AttributeMarkers array that recognises the position of the date in the citation string.
(package private) static int EXTRA
          EXTRA is the index of the object in the AttributeMarkers array that recognises the position of any extra features (e.g. xxxid) in the citation string.
(package private) static int N_AMS
          N_AMS is the number of AttributeMarkers that are used.
(package private) static int NUMBERING
          NUMBERING is the index of the object in the AttributeMarkers array that recognises any initial preprocessing before the recognition proper gets underway.
(package private) static int PAGERANGE
          PAGERANGE is the index of the object in the AttributeMarkers array that recognises the position of the pagerange in the citation string.
(package private) static int PLACE
          PLACE is the index of the object in the AttributeMarkers array that recognises the position of the place of publication in the citation string.
(package private) static int POSTPROCESS
          POSTPROCESS is the index of the object in the AttributeMarkers array that performs any subsequent postprocessing and rationalisation of the marker values.
(package private) static int PREPROCESS
          PREPROCESS is the index of the object in the AttributeMarkers array that performs any initial preprocessing before the recognition proper gets underway.
(package private) static int PUBLICATION
          PUBLICATION is the index of the object in the AttributeMarkers array that recognises the position of the journal title in the citation string.
(package private) static int PUBLISH
          PUBLISH is the index of the object in the AttributeMarkers array that recognises the position of the publisher in the citation string.
(package private) static int TITLE
          TITLE is the index of the object in the AttributeMarkers array that recognises the position of the title in the citation string.
(package private) static int VOLUMEISSUE
          VOLUMEISSUE is the index of the object in the AttributeMarkers array that recognises the position of the volume and issue in the citation string.
 
Constructor Summary
(package private) Deciter(java.lang.String id, java.lang.String[] opts)
          Constructor sets the value of the article ID and extracts the hints and flags from the array of options passed on the command line.
 
Method Summary
protected  void dodecite_simple(java.lang.String line, java.lang.String pr, java.lang.String wr, java.io.PrintWriter Output)
          dodecite_simple handles the whole deciting process for a single citation (sub)entry.
protected  void dodecite(java.lang.String line, java.lang.String pr, java.lang.String wr, java.io.PrintWriter Output)
          dodecite handles the whole deciting process for a single citation entry.
 int doit(java.io.BufferedReader inp, java.io.PrintWriter outp)
          doit initialises the citation harvesting process by setting up the debugging stream, storing the document id, creating an entity encoder if necessary and calling doReadLoop to process all the citations.
protected  void doReadLoop(java.io.BufferedReader inp, java.io.PrintWriter Output)
          doReadLoop performs a read loop, reading a line from the input, and processing and printing it to the output.
 void setAttributeMarker(int which, AttributeMarker a)
          setAttributeMarker allows the recogniser for a particular attribute to be changed.
 void setAttributeMarker(int which, java.lang.String amName)
          setAttributeMarker allows the recogniser for a particular attribute to be changed.
 void setAttributeMarker(java.lang.String which, java.lang.String amName)
          A version of setAttributeMarker that takes both arguments as strings, which is useful for argv.
 void setCitationOutput(CitationOutput co)
          setCitationOutput specifies the citation output object.
 void setCitationOutput(java.lang.String coName)
          setCitationOutput specifies the citation output object.
protected  void split_multiCitation(java.lang.String rest, java.lang.String pr, java.lang.String wr, java.io.PrintWriter Output)
          split_multiCitation handles leftover citation material: if significant citation material is found to be left over with a multiCite hint in operation, it may be assumed that another citation occurrence has been found and dodecite may be called recursively.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PREPROCESS

static final int PREPROCESS
PREPROCESS is the index of the object in the AttributeMarkers array that performs any initial preprocessing before the recognition proper gets underway.

NUMBERING

static final int NUMBERING
NUMBERING is the index of the object in the AttributeMarkers array that recognises any initial preprocessing before the recognition proper gets underway.

DATE

static final int DATE
DATE is the index of the object in the AttributeMarkers array that recognises the position of the date in the citation string.

AUTHORS

static final int AUTHORS
AUTHORS is the index of the object in the AttributeMarkers array that recognises the position of the authors in the citation string.

TITLE

static final int TITLE
TITLE is the index of the object in the AttributeMarkers array that recognises the position of the title in the citation string.

PAGERANGE

static final int PAGERANGE
PAGERANGE is the index of the object in the AttributeMarkers array that recognises the position of the pagerange in the citation string.

PUBLICATION

static final int PUBLICATION
PUBLICATION is the index of the object in the AttributeMarkers array that recognises the position of the journal title in the citation string.

VOLUMEISSUE

static final int VOLUMEISSUE
VOLUMEISSUE is the index of the object in the AttributeMarkers array that recognises the position of the volume and issue in the citation string.

PUBLISH

static final int PUBLISH
PUBLISH is the index of the object in the AttributeMarkers array that recognises the position of the publisher in the citation string.

PLACE

static final int PLACE
PLACE is the index of the object in the AttributeMarkers array that recognises the position of the place of publication in the citation string.

EXTRA

static final int EXTRA
EXTRA is the index of the object in the AttributeMarkers array that recognises the position of any extra features (e.g. xxxid) in the citation string.

POSTPROCESS

static final int POSTPROCESS
POSTPROCESS is the index of the object in the AttributeMarkers array that performs any subsequent postprocessing and rationalisation of the marker values.

N_AMS

static final int N_AMS
N_AMS is the number of AttributeMarkers that are used.
Constructor Detail

Deciter

Deciter(java.lang.String id,
        java.lang.String[] opts)
Constructor sets the value of the article ID and extracts the hints and flags from the array of options passed on the command line. Then it sets up the default values for the AttributeMarker objects and the citation outputter.
Method Detail

setAttributeMarker

public void setAttributeMarker(int which,
                               AttributeMarker a)
setAttributeMarker allows the recogniser for a particular attribute to be changed. The anticipated use is setAttributeMarker(DATE, new MyDateRecogniserClass()). The request is ignored if the attribute code is not valid.
Parameters:
which - one of the values PREPROCESS, NUMBERING, DATE, AUTHORS, TITLE, PAGERANGE, VOLUMEISSUE, EXTRA, POSTPROCESS
a - an object which implements the AttributeMarker interface
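
The guarded replacement described above can be sketched as follows. This is a minimal illustration, not the class's actual source: MarkerSketch is a hypothetical stand-in for the AttributeMarker interface (whose methods are not documented here), and the constant values are assumed, since this page does not give them.

```java
/** Hypothetical stand-in for the real AttributeMarker interface. */
interface MarkerSketch {
    String name();
}

/** Sketch of a fixed array of recogniser slots with a guarded setter. */
class MarkerTableSketch {
    // Assumed values: the real constants' values are not given in this documentation.
    static final int PREPROCESS = 0;
    static final int DATE = 2;
    static final int N_AMS = 12;

    private final MarkerSketch[] markers = new MarkerSketch[N_AMS];

    /** Replaces the recogniser in slot 'which'; an invalid code is silently ignored. */
    void setAttributeMarker(int which, MarkerSketch a) {
        if (which < 0 || which >= N_AMS) {
            return; // invalid attribute code: ignore the request
        }
        markers[which] = a;
    }

    /** Returns the recogniser in slot 'which', or null if none has been set. */
    MarkerSketch get(int which) {
        return markers[which];
    }

    public static void main(String[] args) {
        MarkerTableSketch table = new MarkerTableSketch();
        table.setAttributeMarker(DATE, () -> "MyDateRecogniserClass");
        table.setAttributeMarker(99, () -> "ignored"); // out of range, so no effect
        System.out.println(table.get(DATE).name());
    }
}
```

Ignoring an invalid code rather than throwing matches the behaviour stated above; a caller that needs to detect the failure would have to check the slot afterwards.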

setAttributeMarker

public void setAttributeMarker(int which,
                               java.lang.String amName)
setAttributeMarker allows the recogniser for a particular attribute to be changed. The anticipated use is setAttributeMarker(DATE, "MyDateRecogniserClass"). This version of the method is provided so that the class name can be given as data, for example as a command line argument or in a configuration file. The request is ignored if the attribute code is not valid. If the name given doesn't correspond to a findable class, if the class is badly constructed or if it is not actually an AttributeMarker, then an error message is printed and the request is ignored.
Parameters:
which - one of the values PREPROCESS, NUMBERING, DATE, AUTHORS, TITLE, PAGERANGE, VOLUMEISSUE, EXTRA, POSTPROCESS
amName - a String which gives the name of a class which implements the AttributeMarker interface. A new instance of this class will be created.
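
Creating a recogniser from a class name is typically done with reflection. The sketch below shows one way the three failure cases listed above (class not found, class cannot be constructed, class of the wrong type) might be handled. NamedMarker and SampleDateMarker are hypothetical names used only for this illustration, and the real method may be implemented differently.

```java
/** Hypothetical stand-in for the real AttributeMarker interface. */
interface NamedMarker { }

/** Example implementation used only by this sketch. */
class SampleDateMarker implements NamedMarker { }

class MarkerLoaderSketch {
    /**
     * Returns a new instance of the named class, or null after printing
     * an error message if the class is missing, unusable or of the wrong type.
     */
    static NamedMarker load(String className) {
        try {
            Class<?> c = Class.forName(className);
            if (!NamedMarker.class.isAssignableFrom(c)) {
                System.err.println(className + " is not a NamedMarker");
                return null;
            }
            // Requires an accessible no-argument constructor.
            return (NamedMarker) c.getDeclaredConstructor().newInstance();
        } catch (ClassNotFoundException e) {
            System.err.println("Cannot find class " + className);
        } catch (ReflectiveOperationException e) {
            System.err.println("Cannot construct " + className + ": " + e);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(load(SampleDateMarker.class.getName()) != null);
        System.out.println(load("no.such.Recogniser") != null);
    }
}
```

A caller would pass the returned object to the int/AttributeMarker overload only when it is non-null, so that a bad name leaves the existing recogniser in place.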

setAttributeMarker

public void setAttributeMarker(java.lang.String which,
                               java.lang.String amName)
A version of setAttributeMarker that takes both arguments as strings, which is useful for argv.

setCitationOutput

public void setCitationOutput(CitationOutput co)
setCitationOutput specifies the citation output object. The standard choices are the XML, HTML and plain text outputter objects.
Parameters:
co - an object from the CitationOutput-derived class which will be used for printing the citation data.

setCitationOutput

public void setCitationOutput(java.lang.String coName)
setCitationOutput specifies the citation output object. The standard choices are the XML, HTML and plain text outputter objects. This version of the method is provided so that the class name can be given as data, for example as a command line argument or in a configuration file. If the name given doesn't correspond to a findable class, if the class is badly constructed or if it is not actually a CitationOutput, then an error message is printed and the request is ignored.
Parameters:
coName - the name of a CitationOutput-derived class which will be used for printing the citation data.

dodecite

protected void dodecite(java.lang.String line,
                        java.lang.String pr,
                        java.lang.String wr,
                        java.io.PrintWriter Output)
dodecite handles the whole deciting process for a single citation entry. If a multiCite hint is seen and a citation separator is spotted, then the citation is split at the separator and dodecite is called recursively on the fragment. If no multiCite separator is seen, then dodecite_simple is invoked.
Parameters:
line - the string containing the citation under scrutiny
pr - the page number of the article which contained this citation
wr - the word number at which this citation started on the page
Output - the PrintWriter to which all output must be sent
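
The split-and-recurse control flow can be sketched as below. This is an illustration under stated assumptions, not the actual algorithm: the separator string "; " is invented for the example (the real multiCite separator is not specified on this page), the page and word parameters are omitted, and deciteSimple merely records the entry instead of recognising its attributes.

```java
import java.util.ArrayList;
import java.util.List;

class MultiCiteSketch {
    // Assumed separator; the real multiCite separator is not documented here.
    static final String SEPARATOR = "; ";

    /** Splits at the first separator when multiCite is set, then recurses on the rest. */
    static void decite(String line, boolean multiCite, List<String> out) {
        int cut = multiCite ? line.indexOf(SEPARATOR) : -1;
        if (cut < 0) {
            deciteSimple(line, out); // no separator seen: treat as one citation
            return;
        }
        deciteSimple(line.substring(0, cut), out);
        decite(line.substring(cut + SEPARATOR.length()), multiCite, out);
    }

    /** Placeholder for the single-citation recogniser: just records the entry. */
    static void deciteSimple(String entry, List<String> out) {
        out.add(entry.trim());
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        decite("Smith 1999; Jones 2001; Brown 2003", true, out);
        out.forEach(System.out::println);
    }
}
```

Without the multiCite hint the whole line is passed through as a single entry, which mirrors the fallback to dodecite_simple.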

dodecite_simple

protected void dodecite_simple(java.lang.String line,
                               java.lang.String pr,
                               java.lang.String wr,
                               java.io.PrintWriter Output)
dodecite_simple handles the whole deciting process for a single citation (sub)entry. The various citation attributes are searched for in the following order: numbering, whitespace'n'tags, date, authors, title, page range, volume and issue, xxxid. The recognised data is copied into strings and then output (using splitAuthors and splitPageRange for the structured elements). If significant citation material is found to be left over with a multiCite hint in operation, it may be assumed that another citation occurrence has been found and dodecite may be called recursively.
Parameters:
line - the string containing the citation under scrutiny
pr - the page number of the article which contained this citation
wr - the word number at which this citation started on the page
Output - the PrintWriter to which all output must be sent

split_multiCitation

protected void split_multiCitation(java.lang.String rest,
                                   java.lang.String pr,
                                   java.lang.String wr,
                                   java.io.PrintWriter Output)
split_multiCitation handles leftover citation material: if significant citation material is found to be left over with a multiCite hint in operation, it may be assumed that another citation occurrence has been found and dodecite may be called recursively.
Parameters:
rest - the remaining part of the line containing the citation under scrutiny
pr - the page number of the article which contained this citation
wr - the word number at which this citation started on the page
Output - the PrintWriter to which all output must be sent

doReadLoop

protected void doReadLoop(java.io.BufferedReader inp,
                          java.io.PrintWriter Output)
                   throws java.io.IOException
doReadLoop performs a read loop, reading a line from the input, processing it and printing the result to the output. It handles the simple (unspecified) XML input lines from the C/PDF handler phase, joining continuation lines back together to handle citations split over page boundaries in a PDF file.
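
A read loop that rejoins continuation lines might look like the sketch below. The input format is not specified on this page, so the convention used here (a line beginning with "+" continues the previous citation) is purely an assumption for illustration, and the processing step is reduced to printing the joined line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;

class ReadLoopSketch {
    // Assumed convention: a line starting with this prefix continues the previous citation.
    static final String CONTINUATION = "+";

    /** Reads lines, joins continuation lines to their predecessor, and prints each joined citation. */
    static void readLoop(BufferedReader in, PrintWriter out) throws IOException {
        StringBuilder pending = null; // citation being assembled, if any
        String line;
        while ((line = in.readLine()) != null) {
            if (pending != null && line.startsWith(CONTINUATION)) {
                pending.append(' ').append(line.substring(CONTINUATION.length()).trim());
            } else {
                if (pending != null) {
                    out.println(pending); // previous citation is complete
                }
                pending = new StringBuilder(line.trim());
            }
        }
        if (pending != null) {
            out.println(pending); // flush the last citation
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String input = "Smith, J. (1999) A long title\n+split over a page\nJones, K. (2001) Short\n";
        readLoop(new BufferedReader(new StringReader(input)), new PrintWriter(System.out));
    }
}
```

Holding each citation until the next non-continuation line arrives is what lets the loop emit one complete entry per citation, at the cost of always being one line behind the input.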

doit

public int doit(java.io.BufferedReader inp,
                java.io.PrintWriter outp)
         throws java.io.IOException
doit initialises the citation harvesting process by setting up the debugging stream, storing the document id, creating an entity encoder if necessary and calling doReadLoop to process all the citations.
Parameters:
inp - the (de-entitied) input stream containing citations in a primitive XML format
id - the unique id corresponding to this article
outp - the (re-entitying) output stream to which the citation entries will be written.