uk.ac.soton.harvester
Class Utils

java.lang.Object
  |
  +--uk.ac.soton.harvester.Utils

public class Utils
extends java.lang.Object

Utils is a place for miscellaneous utility methods to try to control class bloat!


Field Summary
static EntityEncoder ee
          ee is an entity encoder object which contains the mapping from (non-)ASCII to ISO-Latin1 entity names.
 
Constructor Summary
Utils()
           
 
Method Summary
static void DEBUG(java.lang.String s)
          DEBUG is a convenience method for producing debugging output.
static java.lang.String detag(java.lang.String s)
          detag removes tags from an HTML-style string.
static boolean iciSWe(java.lang.String s1, java.lang.String s2)
          iciSWe is an "ignore case of initial" version of startsWith, used to make "Del " and "del " match.
static boolean iciSWp(java.lang.String s1, java.lang.String s2)
          iciSWp is the same as iciSWe except it looks for punctuation instead of a space.
static boolean isBook(DeciterState ds)
          isBook is a utility method that encapsulates a naive heuristic (oh, alright then, hack) for determining whether the citation was to a book/thesis or not.
static boolean isDash(char ch)
          isDash recognises the characters from all the character sets which could correspond to a "dash".
static boolean isInitial(java.lang.String s)
          isInitial checks to see whether the current word is in fact an initial / a set of initials as opposed to a surname.
static boolean isProceedings(DeciterState ds)
          isProceedings is a utility method that encapsulates a naive heuristic (oh, alright then, hack) for determining whether the citation was to a conf/workshop proceedings or not.
static boolean lowerCaseNameComponent(java.lang.String s)
          lowerCaseNameComponent recognises those words which start with a lowercase letter but are in fact parts of names.
static boolean lowercaseOrHyphen(java.lang.String s, int i)
          lowercaseOrHyphen is a utility method that recognises valid characters (ie [a-z-]) within an XXX eprint article identifier.
static java.lang.String PCDATA(java.lang.String s)
          PCDATA is a convenience method to access the entity encoder.
static void setDebugging(boolean b)
          setDebugging controls whether DEBUG messages are printed or not.
static java.lang.String substring(java.lang.String line, int a, int b)
          substring is just a safe version of String.substring.
static java.lang.String toInitials(java.lang.String s)
          toInitials turns a set of "forenames" to an appropriate set of separated, correctly delimited initials.
static boolean xxxId(java.lang.String s)
          xxxId recognises strings which are XXX citation ids.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ee

public static EntityEncoder ee
ee is an entity encoder object which contains the mapping from (non-)ASCII to ISO-Latin1 entity names. Most of this task should be performed invisibly by the output PrintWriter. However, the deciter needs explicit control of the encoding process because it emits tags which must not be transformed (i.e. <author> should not appear as &lt;author&gt;).
Constructor Detail

Utils

public Utils()
Method Detail

setDebugging

public static void setDebugging(boolean b)
setDebugging controls whether DEBUG messages are printed or not.

DEBUG

public static void DEBUG(java.lang.String s)
DEBUG is a convenience method for producing debugging output. The first time it is called it opens the debugging output file (called "deciter.err") if necessary.
Parameters:
s - the String to be written to the debugging file (a newline is added).

PCDATA

public static java.lang.String PCDATA(java.lang.String s)
PCDATA is a convenience method to access the entity encoder. It is used to explicitly transform the output of any data string that may contain non-ASCII characters or less-than, greater-than or ampersand symbols. These must all be turned into XML entities by the EntityEncoder object.
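
The split between encoded data and unencoded tags can be illustrated with a minimal sketch. PcdataSketch and pcdata are hypothetical names, not the real EntityEncoder, and the use of numeric character references for non-ASCII characters is an assumption (the real encoder maps to ISO-Latin1 entity names).

```java
// Hypothetical stand-in for the EntityEncoder behind Utils.PCDATA.
class PcdataSketch {

    /** Escapes markup characters and non-ASCII characters in a data string. */
    static String pcdata(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:
                    if (cp < 128) {
                        out.append((char) cp);
                    } else {
                        // Assumption: numeric reference instead of a named entity.
                        out.append("&#").append(cp).append(';');
                    }
            }
        }
        return out.toString();
    }
}
```

Only the data is passed through the encoder; the tags are written as they are, so "<author>" + PcdataSketch.pcdata("Smith & Jones") + "</author>" keeps its angle brackets while the ampersand becomes &amp;.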

iciSWe

public static boolean iciSWe(java.lang.String s1,
                             java.lang.String s2)
iciSWe is an "ignore case of initial" version of startsWith, used to make "Del " and "del " match. It also expects the matched prefix either to end the string or to be followed by a space.
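
A minimal sketch of the matching rule follows. It assumes that s1 is the text being examined and s2 is the prefix (given without its trailing space); the page does not say which argument is which, and IciSweSketch is a hypothetical class name.

```java
// Hypothetical sketch of the rule described for Utils.iciSWe.
class IciSweSketch {

    static boolean iciSWe(String s1, String s2) {
        if (s2.isEmpty() || s1.length() < s2.length()) {
            return false;
        }
        // Only the initial character is compared without regard to case.
        if (Character.toLowerCase(s1.charAt(0)) != Character.toLowerCase(s2.charAt(0))) {
            return false;
        }
        // The rest of the prefix must match exactly.
        if (!s1.regionMatches(1, s2, 1, s2.length() - 1)) {
            return false;
        }
        // The prefix must end the string or be followed by a space.
        return s1.length() == s2.length() || s1.charAt(s2.length()) == ' ';
    }
}
```

Under these assumptions "Del Rio" and "del Rio" both match the prefix "del", while "Delaware" does not because the prefix is not followed by a space.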

iciSWp

public static boolean iciSWp(java.lang.String s1,
                             java.lang.String s2)
iciSWp is the same as iciSWe except it looks for punctuation instead of a space.
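
The punctuation variant might look like the sketch below. The argument order (s1 is the text, s2 the prefix), the definition of punctuation as any character that is neither a letter, a digit nor whitespace, and the decision to accept the end of the string are all assumptions; IciSwpSketch is a hypothetical class name.

```java
// Hypothetical sketch of the rule described for Utils.iciSWp.
class IciSwpSketch {

    static boolean iciSWp(String s1, String s2) {
        if (s2.isEmpty() || s1.length() < s2.length()) {
            return false;
        }
        // Only the initial character is compared without regard to case.
        if (Character.toLowerCase(s1.charAt(0)) != Character.toLowerCase(s2.charAt(0))) {
            return false;
        }
        // The rest of the prefix must match exactly.
        if (!s1.regionMatches(1, s2, 1, s2.length() - 1)) {
            return false;
        }
        if (s1.length() == s2.length()) {
            return true; // assumption: end of string is accepted, as in iciSWe
        }
        char next = s1.charAt(s2.length());
        // Assumption: "punctuation" means not a letter, digit or whitespace.
        return !Character.isLetterOrDigit(next) && !Character.isWhitespace(next);
    }
}
```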

xxxId

public static boolean xxxId(java.lang.String s)
xxxId recognises strings which are XXX citation ids. Such an id consists of one of a limited (but growing?) set of archive names, a slash and a seven-digit number of the form YYMMNNN.
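
The id format can be checked along the lines of the sketch below. The archive names listed are an illustrative subset chosen for this example, not the list the harvester actually uses, and XxxIdSketch is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the check described for Utils.xxxId.
class XxxIdSketch {

    // Assumption: a small illustrative subset of archive names.
    private static final Set<String> ARCHIVES = new HashSet<String>(
            Arrays.asList("hep-th", "hep-ph", "astro-ph", "cond-mat", "quant-ph"));

    static boolean xxxId(String s) {
        int slash = s.indexOf('/');
        if (slash < 0 || !ARCHIVES.contains(s.substring(0, slash))) {
            return false;
        }
        // The part after the slash must be exactly seven digits (YYMMNNN).
        String number = s.substring(slash + 1);
        if (number.length() != 7) {
            return false;
        }
        for (int i = 0; i < number.length(); i++) {
            char c = number.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```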

lowerCaseNameComponent

public static boolean lowerCaseNameComponent(java.lang.String s)
lowerCaseNameComponent recognises those words which start with a lowercase letter but are in fact parts of names.
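
One plausible way to recognise such words is a lookup in a fixed word list, sketched below. The list of particles is an assumption made for illustration; the words the real method accepts are not given on this page, and NameComponentSketch is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the test described for Utils.lowerCaseNameComponent.
class NameComponentSketch {

    // Assumption: an illustrative list of lowercase surname particles.
    private static final Set<String> PARTICLES = new HashSet<String>(
            Arrays.asList("de", "del", "della", "di", "da", "la", "le",
                          "van", "von", "der", "den"));

    static boolean lowerCaseNameComponent(String s) {
        return PARTICLES.contains(s);
    }
}
```

With this list "van" and "del" count as name components, while an ordinary lowercase word such as "and" does not.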

isDash

public static boolean isDash(char ch)
isDash recognises the characters from all the character sets which could correspond to a "dash". This is a crucial part of recognising a page range: a dash with numeric strings adjacent is easily recognisable as a page range.
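
The sketch below shows how such a test can feed page-range recognition. The set of characters treated as dashes is an assumption (the real method may accept more or fewer), and DashSketch and looksLikePageRange are hypothetical names.

```java
// Hypothetical sketch of Utils.isDash and of its use for page ranges.
class DashSketch {

    static boolean isDash(char ch) {
        switch (ch) {
            case '-':       // ASCII hyphen-minus
            case '\u2010':  // hyphen
            case '\u2011':  // non-breaking hyphen
            case '\u2012':  // figure dash
            case '\u2013':  // en dash
            case '\u2014':  // em dash
            case '\u2015':  // horizontal bar
            case '\u2212':  // minus sign
            case '\u0096':  // Windows-1252 en dash read as ISO-Latin1
            case '\u0097':  // Windows-1252 em dash read as ISO-Latin1
                return true;
            default:
                return false;
        }
    }

    /** True for digits, a single dash, then digits, e.g. "123-130". */
    static boolean looksLikePageRange(String s) {
        int i = 0;
        while (i < s.length() && Character.isDigit(s.charAt(i))) {
            i++;
        }
        if (i == 0 || i >= s.length() || !isDash(s.charAt(i))) {
            return false;
        }
        int j = i + 1;
        while (j < s.length() && Character.isDigit(s.charAt(j))) {
            j++;
        }
        return j > i + 1 && j == s.length();
    }
}
```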

isInitial

public static boolean isInitial(java.lang.String s)
isInitial checks to see whether the current word is in fact an initial / a set of initials as opposed to a surname.
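
The page does not describe the test itself, so the following is only a sketch under an assumed rule: a word counts as initials if it is a single uppercase letter, or a run of uppercase letters each followed by a full stop and optionally joined by hyphens. InitialSketch is a hypothetical class name.

```java
// Hypothetical sketch of a test like Utils.isInitial, under an assumed rule.
class InitialSketch {

    static boolean isInitial(String s) {
        // Either a bare uppercase letter ("J"), or letters each followed by
        // a full stop and optionally joined by hyphens ("J.", "J.R.", "J.-P.").
        return s.matches("\\p{Lu}|\\p{Lu}\\.(?:-?\\p{Lu}\\.)*");
    }
}
```

Under this rule "J.", "J.R." and "J.-P." are initials, while "Smith" and "Jr." are not.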

toInitials

public static java.lang.String toInitials(java.lang.String s)
toInitials turns a set of "forenames" to an appropriate set of separated, correctly delimited initials.
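
A sketch of such a conversion is given below. The output format (a full stop after each initial, a space between forenames, a hyphen kept inside double-barrelled forenames) is an assumption, since the page does not state which delimiters the real method uses. InitialsSketch is a hypothetical class name.

```java
// Hypothetical sketch of a conversion like Utils.toInitials.
class InitialsSketch {

    static String toInitials(String forenames) {
        StringBuilder out = new StringBuilder();
        for (String word : forenames.trim().split("\\s+")) {
            // Build the initials for one forename, keeping internal hyphens.
            StringBuilder w = new StringBuilder();
            for (String part : word.split("-")) {
                if (part.isEmpty()) {
                    continue;
                }
                if (w.length() > 0) {
                    w.append('-');
                }
                w.append(Character.toUpperCase(part.charAt(0))).append('.');
            }
            if (w.length() == 0) {
                continue;
            }
            // Assumption: forenames are separated by a single space.
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(w);
        }
        return out.toString();
    }
}
```

With these choices "John Ronald" becomes "J. R." and "Jean-Paul" becomes "J.-P.".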

detag

public static java.lang.String detag(java.lang.String s)
detag removes tags from an HTML-style string. These tags are in practice just the font-change tags <b> and <i>. It is used as a final-stage filter after all the sections have been recognised in the original string, and just prior to their final output.
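
Tag removal of this kind can be sketched with a regular expression. The pattern below is an assumption about what counts as a tag (an angle bracket, an optional slash, a letter, then anything up to the closing bracket); DetagSketch is a hypothetical class name.

```java
// Hypothetical sketch of a filter like Utils.detag.
class DetagSketch {

    static String detag(String s) {
        // Remove opening and closing tags such as <b>, </b>, <i> and </i>.
        // A '<' that is not followed by a tag name is left alone.
        return s.replaceAll("</?[A-Za-z][^>]*>", "");
    }
}
```

For example, "<i>Nature</i> <b>401</b>, 12" becomes "Nature 401, 12".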

lowercaseOrHyphen

public static boolean lowercaseOrHyphen(java.lang.String s,
                                        int i)
lowercaseOrHyphen is a utility method that recognises valid characters (ie [a-z-]) within an XXX eprint article identifier.
Parameters:
s - the string containing the character to check
i - the character offset within the string to check
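
The character test amounts to the sketch below. Returning false for an offset outside the string is an assumption; the real method may behave differently in that case. CharClassSketch is a hypothetical class name.

```java
// Hypothetical sketch of the test described for Utils.lowercaseOrHyphen.
class CharClassSketch {

    static boolean lowercaseOrHyphen(String s, int i) {
        if (i < 0 || i >= s.length()) {
            return false; // assumption: out-of-range offsets are rejected
        }
        char c = s.charAt(i);
        return (c >= 'a' && c <= 'z') || c == '-';
    }
}
```

In the id "hep-th/9901001", offsets 0 to 5 satisfy the test and offset 6 (the slash) does not.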

isProceedings

public static boolean isProceedings(DeciterState ds)
isProceedings is a utility method that encapsulates a naive heuristic (oh, alright then, hack) for determining whether the citation was to a conf/workshop proceedings or not.

isBook

public static boolean isBook(DeciterState ds)
isBook is a utility method that encapsulates a naive heuristic (oh, alright then, hack) for determining whether the citation was to a book/thesis or not.

substring

public static java.lang.String substring(java.lang.String line,
                                         int a,
                                         int b)
substring is just a safe version of String.substring.
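
What "safe" means is not spelled out here. The sketch below assumes it means that out-of-range offsets are clamped to the string bounds and that an empty string is returned instead of an exception being thrown. SafeSubstringSketch is a hypothetical class name.

```java
// Hypothetical sketch of a safe substring in the spirit of Utils.substring.
class SafeSubstringSketch {

    static String substring(String line, int a, int b) {
        if (line == null) {
            return ""; // assumption: null is treated as empty
        }
        // Clamp both offsets into the valid range [0, line.length()].
        int start = Math.max(0, a);
        int end = Math.min(line.length(), b);
        return start < end ? line.substring(start, end) : "";
    }
}
```

Under these assumptions substring("citation", 4, 100) gives "tion" and substring("citation", 5, 2) gives the empty string, where String.substring would throw StringIndexOutOfBoundsException.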