Introduction to Computational Linguistics

LING/COGST 424 COM S 324

Spring 2002


  1. Mats Rooth

  2. Morrill 203A (enter through Linguistics main office)

  3. Office hour: Wednesday 3-4


Steady progress in formalisms, algorithms, linguistic knowledge, and computer technology is bringing computational mastery of the syntax, morphology, and phonology of natural languages within reach. The course introduces methods for "doing a language" computationally, with an emphasis on approaches that combine linguistic knowledge with powerful computational formalisms. Topics include computational grammars; parse forest representation of syntactic analyses; finite state morphology; weighted grammars; feature constraint formalisms for syntax; treebank and other markup methodology; and experimental-modeling methodology using large data samples.

The class involves working with computational tools and formalisms, but no programming.


Books

  1. Finite State Morphology, by Lauri Karttunen and Ken Beesley, forthcoming from Cambridge University Press. By arrangement with the authors, you will get printouts of the pre-publication version. Please observe their request not to distribute this any further (don't let anyone copy your copy).

  2. Head-driven Phrase Structure Grammar, by Carl Pollard and Ivan Sag. University of Chicago Press. This is in the book store.

  3. (Unix Introduction) In the book store.


Class requirements


Problem sets involve lab experiments and grammar hacking, and are based on in-class labs. About five sets will be assigned; each is due two weeks after you get it.

Other in-class labs introduce some computational tool or technique, and include a small problem or experiment. For these, turn in a one- or two-page lab report describing your results.

Term projects involve grammar development and/or computational linguistic experiments. Turn in a paper (about ten pages is ideal).


Lab and software


The computational linguistics lab is in Morrill 203A, in the basement of Morrill Hall. This is the location for in-class labs, and you may work there on your own when there is no class or meeting in the lab.

The Sun Ultra 10 machines in the lab run Solaris 8, which is a Unix operating system. You will need to learn how to work with the Unix shell and edit text files (usually with emacs or vi). Useful Unix programs such as sort and egrep will be introduced in the class.

You will receive a login and home directory. Most course software is installed in /usr/local/bin. Material for labs is under /fsys/blue/a/Lab. Manual pages are in /usr/local/man.

The program lopar, from Helmut Schmid at the University of Stuttgart, is a parser and parameter estimator for probabilistic context free grammars. In addition to Solaris, it is available for Linux. You can get your own copy from his web page.
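
As a rough illustration of what a probabilistic context free grammar assigns to a parse, the Python sketch below multiplies the probabilities of the rules used in a tree. The toy grammar, its probabilities, and the function name tree_prob are invented for this example; it does not reflect lopar's grammar format or interface.

```python
# Toy PCFG: each rule (lhs, rhs) carries a probability, and the
# probabilities of all rules with the same left-hand side sum to 1.
# The grammar and the numbers are hypothetical.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("dogs",)): 0.5,
    ("NP", ("cats",)): 0.5,
    ("VP", ("V", "NP")): 0.75,
    ("VP", ("V",)): 0.25,
    ("V", ("chase",)): 1.0,
}


def tree_prob(tree):
    """Return the product of the probabilities of the rules in a tree.

    A tree is a tuple (label, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    # The right-hand side is the sequence of child labels (or words).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p


tree = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
print(tree_prob(tree))  # 0.5 * 0.75 * 0.5 = 0.1875
```

Parameter estimation, in this picture, means choosing the rule probabilities from data rather than writing them by hand.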

The system xfst from Xerox is an implementation of the calculus of regular relations. They license it to universities for educational use. Reportedly, Windows, Linux, and Solaris versions will be included with Finite State Morphology.
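
To give a feel for operations on relations, the Python sketch below treats a relation as a finite set of (upper, lower) string pairs and defines composition, inversion, and lookup. This is only an analogy under simplifying assumptions: regular relations are in general infinite and are represented as finite-state transducers, and the names LEXICON, SPELLING, compose, invert, and apply_down are invented here and are not xfst syntax.

```python
# Finite relations as sets of (upper, lower) string pairs.
# The word forms and the "^" boundary symbol are hypothetical examples.
LEXICON = {("cat+Pl", "cat^s"), ("fox+Pl", "fox^s")}
SPELLING = {("cat^s", "cats"), ("fox^s", "foxes")}


def compose(r1, r2):
    """Pair a with d whenever r1 relates a to b and r2 relates b to d."""
    return {(a, d) for (a, b) in r1 for (c, d) in r2 if b == c}


def invert(r):
    """Swap the upper and lower sides of every pair."""
    return {(b, a) for (a, b) in r}


def apply_down(r, upper):
    """Return all lower strings related to the given upper string."""
    return sorted(low for (up, low) in r if up == upper)


MORPH = compose(LEXICON, SPELLING)
print(apply_down(MORPH, "fox+Pl"))        # generation: ['foxes']
print(apply_down(invert(MORPH), "cats"))  # analysis: ['cat+Pl']
```

The point of the sketch is that one relation serves both directions: applying it downward generates surface forms, and applying its inverse analyzes them.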

The parser yap is also from Helmut Schmid at the University of Stuttgart. It is a feature constraint parser which is compatible in certain ways with lopar. You can get your own copy from his page, but only for Solaris.

The North American News Paper Corpus, which is licensed from the Linguistic Data Consortium (LDC), includes material from AP and the New York Times.

The Penn Treebank is a database of syntactic trees, and is also licensed from the LDC. The program tgrep performs searches in a treebank.

The programs xkwic and cqp perform searches in an indexed text corpus, which may include linguistic markup. They are useful for finding linguistic examples, especially ones which are keyed to some lexical item or low-level pattern.


Main topics


  1. Probabilistic grammars

  2. Finite state morphology

  3. Feature constraint grammars