Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
Regina Barzilay and Lillian Lee.
HLT-NAACL 2004: Proceedings of the Main Conference, pp. 113--120.
Best paper award.

Abstract: We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from un-annotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complementary tasks: information ordering and extractive summarization. Our experiments show that incorporating content models in these applications yields substantial improvement over previously-proposed methods.

Data: http://www.sls.csail.mit.edu/~regina/struct
Code: http://people.csail.mit.edu/regina/code/struct.tgz

BibTeX entry:

@InProceedings{Barzilay+Lee:04a,
  author =       {Regina Barzilay and Lillian Lee},
  title =        {Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization},
  booktitle =    {HLT-NAACL 2004: Proceedings of the Main Conference},
  pages =        {113--120},
  year =         2004,
  note =         {Best paper award}
}

