Bootstrapping Lexical Choice via Multiple-Sequence Alignment
Regina Barzilay and Lillian Lee.
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 164--171, 2002.

Abstract: An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora: datasets that supply several verbalizations of the corresponding semantics rather than just one.

We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and faithfulness to the semantic input rivaled those of a traditional generation system.


Data: http://www.cs.cornell.edu/Info/Projects/NuPrl/html/nlp/

BibTeX entry:

@InProceedings{Barzilay+Lee:02a,
  author =       {Regina Barzilay and Lillian Lee},
  title =        {Bootstrapping Lexical Choice via Multiple-Sequence Alignment},
  booktitle =    {Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages =        {164--171},
  year =         2002
}

