Suggested Project Topics
Named entity recognition
Create a specialized grammar of one or more semantically
defined classes of noun phrases, such as corporation
names, person names or city names. Incorporate the specialized
grammar in English97. Evaluate identification of
the specialized chunk categories (as in Problem
Set 3).
In addition to markup, the method can be evaluated with
reference to the lexicon: does it learn that Microsoft
is a corporation, while Hannover is a city, as recorded
in lexicon frequencies?
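The lexicon-based evaluation might look like this in outline; the learned frequency table, class labels, and gold entries below are hypothetical stand-ins for re-estimated lexicon frequencies:

```python
# Sketch: evaluate learned name-class assignments against a gold lexicon.
# All frequencies and class labels here are made up for illustration.

def predict_class(freqs, word):
    """Pick the highest-frequency class recorded for a word."""
    candidates = {c: f for (w, c), f in freqs.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else None

learned_freqs = {
    ("microsoft", "CORP-NAME"): 41.2,
    ("microsoft", "CITY-NAME"): 0.3,
    ("hannover", "CITY-NAME"): 12.7,
    ("hannover", "CORP-NAME"): 1.1,
}
gold = {"microsoft": "CORP-NAME", "hannover": "CITY-NAME"}

accuracy = sum(predict_class(learned_freqs, w) == c
               for w, c in gold.items()) / len(gold)
print(accuracy)
```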
Treebank construction by selection of partial analyses
In construction of treebank databases, a parser is often used
to find a unique preliminary analysis, which is then corrected by
a human expert. For instance, Donald Hindle's parser Fidditch
was used as a preprocessor for the Penn Treebank.
Consider the following alternative strategy.
1. Lopar is used to construct a parse forest with frequency
   annotation, which is read into an interface similar to
   charge.
2. The expert selects one or more parse forest categories or rules
   which are correct.
3. Analyses which are not in the complete tree set for the
   selected items are eliminated, thus reducing the tree set
   represented by the parse forest. Frequencies are re-computed
   over this reduced complete tree set.
4. Go to 2.
At some point, the expert decides to write out the parse forest,
which might at that point represent one or more trees.
Implement the interface, either starting from lopar-charge (in
which case you should probably consult the author), or starting
from scratch.
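A minimal sketch of the selection loop, representing each analysis as a weight plus a set of rule/category labels rather than as a packed parse forest (the rules and weights are toy data):

```python
# Sketch of steps 2-3: filter analyses by expert-selected items and
# re-normalize weights over the reduced tree set.

def filter_trees(trees, selected):
    """Keep analyses whose label set contains every selected item."""
    kept = [(w, rules) for w, rules in trees if selected <= rules]
    total = sum(w for w, _ in kept)
    # re-compute (here: re-normalize) frequencies over the reduced set
    return [(w / total, rules) for w, rules in kept]

trees = [
    (0.5, {"S->NP VP", "NP->Det N"}),
    (0.3, {"S->NP VP", "NP->N"}),
    (0.2, {"S->VP", "NP->Det N"}),
]
trees = filter_trees(trees, {"S->NP VP"})  # expert marks S->NP VP correct
print(trees)
```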
Perplexity of Word Bigram Models
Obtain the toolkit for word n-gram models
which is available from Carnegie Mellon.
- As one baseline, construct a word bigram model using
the toolkit.
- As another baseline, construct the same word bigram model in lopar,
using right branching rules. Compare its perplexity
on the test set to that of the standard model.
- Construct a lexicalized part of speech bigram model
in lopar, and measure its perplexity.
- Adapt English97 to the training and test set, by
restricting and/or augmenting the lexicon. Train
a syntax-based language model, and measure its perplexity.
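As a toy illustration of the perplexity measurements above, here is a word-bigram model with add-one smoothing over a hand-made corpus; real experiments would use the CMU toolkit and proper held-out data:

```python
import math
from collections import Counter

# Toy word-bigram model with add-one smoothing; train/test are hand-made.
train = "the dog ran . the cat ran . the dog sat .".split()
test = "the cat sat .".split()

bigrams = Counter(zip(train, train[1:]))
unigrams = Counter(train)
V = len(set(train))  # vocabulary size for add-one smoothing

def prob(w1, w2):
    """Add-one smoothed bigram probability P(w2 | w1)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

log_p = sum(math.log2(prob(w1, w2)) for w1, w2 in zip(test, test[1:]))
ppl = 2 ** (-log_p / (len(test) - 1))  # perplexity over test bigrams
print(round(ppl, 2))
```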
Optimal chunking and tagging
The value of a tag or chunk sequence is the
sum of the weights of the complete trees that are compatible
with that sequence. Consider the problem of finding
a maximal-value sequence in a parse forest. Implement an
algorithm (whether or not it is efficient).
Consider the possibility of coding an NP-complete problem
in a grammar and input sentence.
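A brute-force sketch of the value computation, enumerating trees rather than working on the packed forest (the weights and tag sequences are toy data); the blow-up of this enumeration is exactly where the complexity question bites:

```python
from collections import defaultdict

# Value of a tag sequence = sum of weights of trees yielding that sequence;
# take the argmax over sequences. Weights and tags are illustrative.
trees = [
    (0.4, ("DT", "NN", "VBZ")),
    (0.3, ("DT", "NN", "VBZ")),
    (0.3, ("DT", "VB", "NNS")),
]

value = defaultdict(float)
for w, seq in trees:
    value[seq] += w

best = max(value, key=value.get)
print(best, round(value[best], 2))
```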
Learning of open-class lexicon
Hypothesis: possible parts of speech for open class
words can be learned by the EM algorithm.
Working with the English97 grammar, modify the lexicon
to give all open class words beginning with "b" all
open class parts of speech. Re-estimate lexical
frequencies using the EM algorithm. By imposing a
frequency cutoff, or by other means, derive a part
of speech lexicon for the words starting with "b".
Evaluate against a standard, such as the original
lexicon.
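One way to impose the frequency cutoff, sketched with a hypothetical table of EM-estimated lexical frequencies (the words, tags, numbers, and the 10% relative-frequency threshold are all illustrative):

```python
# Derive a part-of-speech lexicon from estimated lexical frequencies by
# keeping tags whose relative frequency exceeds a cutoff. Toy data.

est = {
    "bake": {"VB": 18.4, "NN": 0.2, "JJ": 0.1},
    "blue": {"JJ": 9.6, "NN": 1.8, "VB": 0.05},
}

def derive_lexicon(est, cutoff=0.1):
    lex = {}
    for word, tags in est.items():
        total = sum(tags.values())
        lex[word] = {t for t, f in tags.items() if f / total >= cutoff}
    return lex

print(derive_lexicon(est))
```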
Past participle test suite
Develop a parser evaluation test suite based on Problem Set 3.
Find one or more parsers in addition to lopar+english97,
and evaluate their performance. In addition to a dozen
specific verbs, include a random sample of -ed form
verbs.
Compare the lexicalized English97 model to the unlexicalized
one.
Capitalization modeling in a PCFG
Develop a strategy for dealing with capitalization
in a PCFG. Here is one possibility.
- In preprocessing, each word is replaced by a sequence
of a capitalization token and a lower-case word. For instance
"IBM" is mapped to "CAP-ALL ibm" and "Johnson" is mapped to
"CAP-1 johnson".
- Capitalization tokens are terminals in the grammar.
- Constructs such as proper names and titles introduce
appropriate capitalization tokens in grammar rules.
- Something special happens at the beginning of
sentences, either in grammar rules or in the capitalization
tokens introduced, so that capitalized words at the
beginning of the sentence may either have CAP-1 as
part of their local phrase, or not.
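The preprocessing step can be sketched as a token-rewriting function; the treatment of single capital letters, mixed-case words, and sentence-initial position is left open here:

```python
# Map a word to a capitalization token plus its lower-cased form, as in the
# preprocessing step above. Single letters like "I" fall to CAP-1 here.

def cap_tokens(word):
    if word.isupper() and len(word) > 1:
        return ["CAP-ALL", word.lower()]
    if word[:1].isupper():
        return ["CAP-1", word.lower()]
    return [word]

print(cap_tokens("IBM"))
print(cap_tokens("Johnson"))
print(cap_tokens("computer"))
```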
Develop, train and evaluate a grammar. Try to demonstrate improvement
in the treatment of initial words or titles.
Ranking of search engine hits
If I type "sort a text file" into Google,
the first hit which has "file" as an object of
"sort" is in position 30. There are a lot of irrelevant
hits, e.g. ones involving "sort of".
Design and implement a method of using syntactic similarity
to the query to sort hits. Use the robust parser to characterize
syntactic similarity. Develop and evaluate the method using
real data.
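One possible scoring scheme, reducing each text to a set of head-relation-dependent triples and ranking hits by overlap with the query's triples (the triples below are hand-written stand-ins for robust-parser output):

```python
# Re-rank hits by syntactic overlap with the query. Triples are toy data
# imitating what a dependency-style analysis might produce.

query = {("sort", "obj", "file")}

hits = [
    ("page A", {("sort", "mod", "of")}),      # the "sort of" false hit
    ("page B", {("sort", "obj", "file"), ("save", "obj", "file")}),
    ("page C", set()),
]

def score(triples):
    """Count query triples matched by the hit."""
    return len(query & triples)

ranked = sorted(hits, key=lambda h: score(h[1]), reverse=True)
print([name for name, _ in ranked])
```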
Optimization methods for PCFGs
Investigate whether there are probabilistically
and/or linguistically better models which are not found by
the EM search.
Possible methods: partially bracketed input, other optimization
methods.
Thinning of parse forests
Investigate the following questions.
- What proportion of probability mass in the parse forest
is contributed by the best k parses (or top N/k parses, where
N is the total number of trees)?
- How much can a parse forest be thinned (by removing vertices)
while retaining almost all the probability mass?
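The first question reduces to a one-liner once tree weights are extracted from the forest; the weights here are toy data:

```python
# Fraction of probability mass covered by the best k parses, given
# (unnormalized) weights for all trees in the forest. Toy weights.

def mass_of_best_k(weights, k):
    total = sum(weights)
    return sum(sorted(weights, reverse=True)[:k]) / total

weights = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]
print(mass_of_best_k(weights, 3))
```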
Chunk analysis of web pages
Create a facility for syntactic chunk analysis of HTML
texts, involving suitable pre-processing and/or grammar
features. Create a model application, e.g. a color map
based on chunk category.
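The pre-processing side might start from something like this, using Python's standard HTML parser to pull out the text runs that would then be chunk-parsed:

```python
from html.parser import HTMLParser

# Extract plain-text runs from HTML as a first pre-processing step before
# chunk analysis. Markup-aware features (offsets, tags) are left out.

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_runs = []

    def handle_data(self, data):
        if data.strip():
            self.text_runs.append(data.strip())

p = TextExtractor()
p.feed("<html><body><h1>Sorting</h1><p>Sort a text file.</p></body></html>")
print(p.text_runs)
```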
Lexicalized EM estimation
Implement lexicalized EM estimation, taking the
lexicalized parse forests printed by lopar as
a starting point. Include some improvement,
such as:
- Parameter tying for lexicalized rules.
- Class smoothing of word pair distributions.
A class is listed with a lemma, such as
oil:liquid. P(oil:liquid | NP VP pour) is smoothed
e.g. with reference to P(liquid | NP VP pour).
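The class-smoothing idea can be sketched as a simple linear interpolation of the word-level and class-level estimates; the probabilities, the interpolation weight, and the way class mass is redistributed via P(word | class) are all hypothetical choices here:

```python
# Class smoothing sketch: interpolate P(word:class | context) with
# P(class | context) * P(word | class). All numbers are illustrative.

def smoothed(p_word_given_ctx, p_class_given_ctx, p_word_given_class,
             lam=0.7):
    return (lam * p_word_given_ctx
            + (1 - lam) * p_class_given_ctx * p_word_given_class)

# e.g. P(oil:liquid | NP VP pour) smoothed via P(liquid | NP VP pour)
print(smoothed(0.02, 0.3, 0.1))
```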