Abstract: We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from unannotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complementary tasks: information ordering and extractive summarization. Our experiments show that incorporating content models in these applications yields substantial improvement over previously proposed methods.
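As a rough illustration of the information-ordering idea only, and not the authors' implementation, the sketch below assumes a tiny hand-specified Hidden Markov Model whose hidden states are topics. Each "sentence" is reduced to a single keyword, and every name and probability (STATES, START, TRANS, EMIT, log_likelihood, best_ordering) is hypothetical. The paper learns its model from unannotated documents; here the parameters are simply written down so that the scoring step can be shown in isolation.

```python
import itertools
import math

# Hypothetical toy model: hidden states are topics; all numbers are made up.
STATES = ["location", "magnitude", "casualties"]
START = {"location": 0.8, "magnitude": 0.1, "casualties": 0.1}
TRANS = {
    "location": {"location": 0.1, "magnitude": 0.7, "casualties": 0.2},
    "magnitude": {"location": 0.1, "magnitude": 0.1, "casualties": 0.8},
    "casualties": {"location": 0.2, "magnitude": 0.2, "casualties": 0.6},
}
# Each "sentence" is reduced to one keyword for brevity (an assumption).
EMIT = {
    "location": {"struck": 0.8, "richter": 0.1, "injured": 0.1},
    "magnitude": {"struck": 0.1, "richter": 0.8, "injured": 0.1},
    "casualties": {"struck": 0.1, "richter": 0.1, "injured": 0.8},
}


def log_likelihood(sentences):
    """Forward algorithm: log P(sentences), summed over all topic paths."""
    alpha = {s: START[s] * EMIT[s][sentences[0]] for s in STATES}
    for obs in sentences[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return math.log(sum(alpha.values()))


def best_ordering(sentences):
    """Return the permutation of the sentences that the model scores highest."""
    return max(itertools.permutations(sentences), key=log_likelihood)


if __name__ == "__main__":
    shuffled = ["injured", "struck", "richter"]
    print(best_ordering(shuffled))
```

Exhaustive search over permutations is only feasible for a handful of sentences; it is used here purely to keep the sketch short.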
Data: http://www.sls.csail.mit.edu/~regina/struct
Code: http://people.csail.mit.edu/regina/code/struct.tgz
BibTeX entry:
@InProceedings{Barzilay+Lee:04a,
  author    = {Regina Barzilay and Lillian Lee},
  title     = {Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization},
  booktitle = {HLT-NAACL 2004: Proceedings of the Main Conference},
  pages     = {113--120},
  year      = {2004},
  note      = {Best paper award}
}