Suggested Project Topics
Named entity recognition
Create a specialized grammar of one or more semantically
defined classes of noun phrases, such as corporation
names, person names or city names. Incorporate the specialized
grammar in English97. Evaluate identification of
the specialized chunk categories (as in Problem
Set 3).
In addition to markup, the method can be evaluated with
reference to the lexicon: does it learn that Microsoft
is a corporation, while Hannover is a city, as recorded
in lexicon frequencies?
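The lexicon-based evaluation might look like this in outline; the learned frequency table, class labels, and gold entries below are hypothetical stand-ins for re-estimated lexicon frequencies:

```python
# Sketch: evaluate learned name-class assignments against a gold lexicon.
# All frequencies and class labels here are made up for illustration.

def predict_class(freqs, word):
    """Pick the highest-frequency class recorded for a word."""
    candidates = {c: f for (w, c), f in freqs.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else None

learned_freqs = {
    ("microsoft", "CORP-NAME"): 41.2,
    ("microsoft", "CITY-NAME"): 0.3,
    ("hannover", "CITY-NAME"): 12.7,
    ("hannover", "CORP-NAME"): 1.1,
}
gold = {"microsoft": "CORP-NAME", "hannover": "CITY-NAME"}

accuracy = sum(predict_class(learned_freqs, w) == c
               for w, c in gold.items()) / len(gold)
print(accuracy)
```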
Treebank construction by selection of partial analyses
In construction of treebank databases, a parser is often used
to find a unique preliminary analysis, which is then corrected by
a human expert. For instance, Donald Hindle's parser Fidditch
was used as a preprocessor for the Penn Treebank.
Consider the following alternative strategy.
1. Lopar is used to construct a parse forest with frequency
   annotation, which is read into an interface similar to
   charge.
2. The expert selects one or more parse forest categories or rules
   which are correct.
3. Analyses which are not in the complete tree set for the
   selected items are eliminated, thus reducing the tree set
   represented by the parse forest. Frequencies are re-computed
   over this reduced complete tree set.
4. Go to 2.
At some point, the expert decides to write out the parse forest,
which might at that point represent one or more trees.
Implement the interface, either starting from lopar-charge (in
which case you should probably consult the author), or starting
from scratch.
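A minimal sketch of the selection loop, representing each analysis as a weight plus a set of rule/category labels rather than as a packed parse forest (the rules and weights are toy data):

```python
# Sketch of steps 2-3: filter analyses by expert-selected items and
# re-normalize weights over the reduced tree set.

def filter_trees(trees, selected):
    """Keep analyses whose label set contains every selected item."""
    kept = [(w, rules) for w, rules in trees if selected <= rules]
    total = sum(w for w, _ in kept)
    # re-compute (here: re-normalize) frequencies over the reduced set
    return [(w / total, rules) for w, rules in kept]

trees = [
    (0.5, {"S->NP VP", "NP->Det N"}),
    (0.3, {"S->NP VP", "NP->N"}),
    (0.2, {"S->VP", "NP->Det N"}),
]
trees = filter_trees(trees, {"S->NP VP"})  # expert marks S->NP VP correct
print(trees)
```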
Perplexity of Word Bigram Models
Obtain the toolkit for word n-gram models
which is available from Carnegie Mellon.
- As one baseline, construct a word bigram model using
the toolkit.
- As another baseline, construct the same word bigram model in lopar,
using right branching rules. Compare its perplexity
on the test set to that of the standard model.
- Construct a lexicalized part of speech bigram model
in lopar, and measure its perplexity.
- Adapt English97 to the training and test set, by
restricting and/or augmenting the lexicon. Train
a syntax-based language model, and measure its perplexity.
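As a toy illustration of the perplexity measurements above, here is a word-bigram model with add-one smoothing over a hand-made corpus; real experiments would use the CMU toolkit and proper held-out data:

```python
import math
from collections import Counter

# Toy word-bigram model with add-one smoothing; train/test are hand-made.
train = "the dog ran . the cat ran . the dog sat .".split()
test = "the cat sat .".split()

bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)
V = len(set(train))  # vocabulary size for add-one smoothing

def prob(w1, w2):
    """Add-one smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

log_p = sum(math.log2(prob(w1, w2)) for w1, w2 in zip(test, test[1:]))
ppl = 2 ** (-log_p / (len(test) - 1))  # perplexity over test bigrams
print(round(ppl, 2))
```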
Optimal chunking and tagging
The value of a tag or chunk sequence is the
sum of the weights of the complete trees that are compatible
with that sequence. Consider the problem of finding
a maximal-value sequence in a parse forest. Implement an
algorithm (whether or not it is efficient).
Consider the possibility of coding an NP-complete problem
in a grammar and input sentence.
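A brute-force sketch of the value computation, enumerating trees rather than working on the packed forest (the weights and tag sequences are toy data); the blow-up of this enumeration is exactly where the complexity question bites:

```python
from collections import defaultdict

# Value of a tag sequence = sum of weights of trees yielding that sequence;
# take the argmax over sequences. Weights and tags are illustrative.
trees = [
    (0.4, ("DT", "NN", "VBZ")),
    (0.3, ("DT", "NN", "VBZ")),
    (0.3, ("DT", "VB", "NNS")),
]

value = defaultdict(float)
for w, seq in trees:
    value[seq] += w

best = max(value, key=value.get)
print(best, round(value[best], 2))
```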
Learning of open-class lexicon
Hypothesis: possible parts of speech for open class
words can be learned by the EM algorithm.
Working with the English97 grammar, modify the lexicon
to give all open class words beginning with "b" all
open class parts of speech. Re-estimate lexical
frequencies using the EM algorithm. By imposing a
frequency cutoff, or by other means, derive a part
of speech lexicon for the words starting with "b".
Evaluate against a standard, such as the original
lexicon.
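One way to impose the frequency cutoff, sketched with a hypothetical table of EM-estimated lexical frequencies (the words, tags, numbers, and the 10% relative-frequency threshold are all illustrative):

```python
# Derive a part-of-speech lexicon from estimated lexical frequencies by
# keeping tags whose relative frequency exceeds a cutoff. Toy data.

est = {
    "bake": {"VB": 18.4, "NN": 0.2, "JJ": 0.1},
    "blue": {"JJ": 9.6, "NN": 1.8, "VB": 0.05},
}

def derive_lexicon(est, cutoff=0.1):
    lex = {}
    for word, tags in est.items():
        total = sum(tags.values())
        lex[word] = {t for t, f in tags.items() if f / total >= cutoff}
    return lex

print(derive_lexicon(est))
```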
Past participle test suite
Develop a parser evaluation test suite based on Problem Set 3.
Find one or more parsers in addition to lopar+english97,
and evaluate their performance. In addition to a dozen
specific verbs, include a random sample of -ed form
verbs.
Compare the lexicalized English97 model to the unlexicalized
one.
Capitalization modeling in a PCFG
Develop a strategy for dealing with capitalization
in a PCFG. Here is one possibility.
- In preprocessing, each word is replaced by a sequence
of a capitalization token and a lower-case word. For instance
"IBM" is mapped to "CAP-ALL ibm" and "Johnson" is mapped to
"CAP-1 johnson".
- Capitalization tokens are terminals in the grammar.
- Constructs such as proper names and titles introduce
appropriate capitalization tokens in grammar rules.
- Something special happens at the beginning of
sentences, either in grammar rules or in the capitalization
tokens introduced, so that capitalized words at the
beginning of the sentence may either have CAP-1 as
part of their local phrase, or not.
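The preprocessing step can be sketched as a token-rewriting function; the treatment of single capital letters, mixed-case words, and sentence-initial position is left open here:

```python
# Map a word to a capitalization token plus its lower-cased form, as in the
# preprocessing step above. Single letters like "I" fall to CAP-1 here.

def cap_tokens(word):
    if word.isupper() and len(word) > 1:
        return ["CAP-ALL", word.lower()]
    if word[:1].isupper():
        return ["CAP-1", word.lower()]
    return [word]

print(cap_tokens("IBM"))
print(cap_tokens("Johnson"))
print(cap_tokens("computer"))
```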
Develop, train and evaluate a grammar. Try to demonstrate improvement
in the treatment of initial words or titles.
Ranking of search engine hits
If I type "sort a text file" into Google,
the first hit which has "file" as an object of
"sort" is in position 30. There are a lot of irrelevant
hits, e.g. ones involving "sort of".
Design and implement a method of using syntactic similarity
to the query to sort hits. Use the robust parser to characterize
syntactic similarity. Develop and evaluate the method using
real data.
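One possible scoring scheme, reducing each text to a set of head-relation-dependent triples and ranking hits by overlap with the query's triples (the triples below are hand-written stand-ins for robust-parser output):

```python
# Re-rank hits by syntactic overlap with the query. Triples are toy data
# imitating what a dependency-style analysis might produce.

query = {("sort", "obj", "file")}

hits = [
    ("page A", {("sort", "mod", "of")}),      # the "sort of" false hit
    ("page B", {("sort", "obj", "file"), ("save", "obj", "file")}),
    ("page C", set()),
]

def score(triples):
    """Count query triples matched by the hit."""
    return len(query & triples)

ranked = sorted(hits, key=lambda h: score(h[1]), reverse=True)
print([name for name, _ in ranked])
```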
Optimization methods for PCFGs
Investigate whether there are probabilistically
and/or linguistically better models which are not found by
the EM search.
Possible methods: partially bracketed input, other optimization
methods.
Thinning of parse forests
Investigate the following questions.
- What proportion of probability mass in the parse forest
is contributed by the best k parses (or top N/k parses, where
N is the total number of trees)?
- How much can a parse forest be thinned (by removing vertices)
while retaining almost all the probability mass?
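The first question reduces to a one-liner once tree weights are extracted from the forest; the weights here are toy data:

```python
# Fraction of probability mass covered by the best k parses, given
# (unnormalized) weights for all trees in the forest. Toy weights.

def mass_of_best_k(weights, k):
    total = sum(weights)
    return sum(sorted(weights, reverse=True)[:k]) / total

weights = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]
print(mass_of_best_k(weights, 3))
```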
Chunk analysis of web pages
Create a facility for syntactic chunk analysis of HTML
texts, involving suitable pre-processing and/or grammar
features. Create a model application, e.g. a color map
based on chunk category.
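The pre-processing side might start from something like this, using Python's standard HTML parser to pull out the text runs that would then be chunk-parsed:

```python
from html.parser import HTMLParser

# Extract plain-text runs from HTML as a first pre-processing step before
# chunk analysis. Markup-aware features (offsets, tags) are left out.

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_runs = []

    def handle_data(self, data):
        if data.strip():
            self.text_runs.append(data.strip())

p = TextExtractor()
p.feed("<html><body><h1>Sorting</h1><p>Sort a text file.</p></body></html>")
print(p.text_runs)
```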
Lexicalized EM estimation
Implement lexicalized EM estimation, taking the
lexicalized parse forests printed by lopar as
a starting point. Include some improvement,
such as:
- Parameter tying for lexicalized rules.
- Class smoothing of word pair distributions.
A class is listed with a lemma, such as
oil:liquid. P(oil:liquid | NP VP pour) is smoothed
e.g. with reference to P(liquid | NP VP pour).
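The class-smoothing idea can be sketched as a simple linear interpolation of the word-level and class-level estimates; the probabilities, the interpolation weight, and the way class mass is redistributed via P(word | class) are all hypothetical choices here:

```python
# Class smoothing sketch: interpolate P(word:class | context) with
# P(class | context) * P(word | class). All numbers are illustrative.

def smoothed(p_word_given_ctx, p_class_given_ctx, p_word_given_class,
             lam=0.7):
    return (lam * p_word_given_ctx
            + (1 - lam) * p_class_given_ctx * p_word_given_class)

# e.g. P(oil:liquid | NP VP pour) smoothed via P(liquid | NP VP pour)
print(smoothed(0.02, 0.3, 0.1))
```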