This chapter describes high-level models for document structure and the extraction of such structure from electronic markup. Our recognizer, a recursive descent parser written in Lisp, handles documents encoded in the (La)TeX family of markup languages: TeX, LaTeX and AMSTeX.
We present the recognizer as follows. s:high-level-models describes the high-level models used to capture general document content. s:quasi-prefix presents the models used to capture written mathematics. s:parse gives a brief overview of the techniques used to extract structure from documents conforming to our model. (La)TeX allows the author of a document to extend the markup language by introducing user-defined macros. We model each such macro as introducing a new object type into the logical structure. Using this model, s:macro-objects describes a flexible method for extending the recognizer to handle (La)TeX macros. s:document-encoding formulates a few guidelines for unambiguous document encodings, based on our experience in extracting structure from current-day markup documents. a:recognize documents the external interface to the recognizer.