Kristen Summers

PhD Student, Cornell University
summers@cs.cornell.edu
5132 Upson Hall
607-255-5577

Research Interests

I work on document analysis. My long-term goal is to support sophisticated tools for manipulating electronic documents, such as tools for indexing, browsing, and linking.

My primary interest is in discovering logical structure in arbitrary electronic documents. The goal is to take an electronic document representation as input and return a hierarchy of the document's logical pieces as output. For example, given a scanned or PostScript version of a technical report, I would like to divide it into sections, paragraphs, etc. Similarly, in a business letter, the address headings, body, and closing should be identifiable.

This problem has two primary components: segmentation (dividing the document into logical pieces) and classification (categorizing the pieces). It also raises the questions of evaluation (previous work differs in descriptions of the correct hierarchy), types of logical structures, and theoretical limitations.
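As a rough illustration of these two components, the sketch below segments plain text into blocks and classifies each block, then assembles the result into a small hierarchy. It is not LABLER's method: the names (Node, segment, classify, build_hierarchy), the blank-line segmentation, and the "short block without final punctuation is a heading" rule are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One logical piece of a document; children form the hierarchy."""
    label: str                      # e.g. "document", "section", "paragraph"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def segment(raw: str) -> List[str]:
    """Segmentation: split on blank lines (a stand-in for real layout cues)."""
    blocks, current = [], []
    for line in raw.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks


def classify(block: str) -> str:
    """Classification: toy rule, assumed here purely for illustration."""
    is_short = len(block.split()) <= 6
    ends_sentence = block.endswith((".", "?", "!", ":"))
    return "heading" if is_short and not ends_sentence else "paragraph"


def build_hierarchy(raw: str) -> Node:
    """Combine both steps into a document -> section -> paragraph tree."""
    root = Node("document")
    section = None
    for block in segment(raw):
        if classify(block) == "heading":
            section = Node("section", block)
            root.children.append(section)
        else:
            # Paragraphs before any heading attach directly to the root.
            parent = section if section is not None else root
            parent.children.append(Node("paragraph", block))
    return root
```

Calling build_hierarchy on a text with two short headings, each followed by body text, yields a root "document" node with two "section" children, each holding its paragraphs. A real system working from OCR output would rely on much richer layout information than blank lines and block length.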

The task is relevant to two of Bruce Croft's top 10 research issues for information retrieval (in the November 1995 issue of D-Lib Magazine): number 5, "interfaces and browsing," and number 3, "efficient, flexible, indexing and retrieval." Determining logical structure enables flexible, hierarchical browsing; doing so in a general way supports system flexibility and handling of multiple document types.

As my thesis project, I have implemented a system called LABLER (LAyout-Based Logical Entity Recognizer), which takes as input the (slightly cleaned) results of OCR and finds a logical structure hierarchy for the given document.

Papers

Résumé