Cornell Movie-Dialogs Corpus

Distributed together with: Chameleons in Imagined Conversations.

Data and code are available in ConvoKit, a toolkit for analyzing conversations.

Related corpus: Cornell Movie-Quotes Corpus



This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:

- 220,579 conversational exchanges between 10,292 pairs of movie characters

- 9,035 characters from 617 movies

- 304,713 utterances in total

- movie metadata included:

    - genres

    - release year

    - IMDB rating

    - number of IMDB votes

- character metadata included:

    - gender (for 3,774 characters)

    - position on movie credits (for 3,321 characters)

- see the documentation for details
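
The metadata listed above could be modeled roughly as follows. This is a minimal sketch, not the corpus's actual file format or the ConvoKit API: the class names, field names, `movie_id`, and the `coverage` helper are all hypothetical, and the toy records are invented for illustration. Gender and credits position are optional because the corpus provides them for only a subset of characters.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MovieMeta:
    """Movie-level metadata from the list above (names are hypothetical)."""
    movie_id: str  # hypothetical identifier, not specified in this description
    genres: List[str] = field(default_factory=list)
    release_year: Optional[int] = None
    imdb_rating: Optional[float] = None
    imdb_votes: Optional[int] = None


@dataclass
class CharacterMeta:
    """Character-level metadata; both attributes may be missing."""
    name: str
    gender: Optional[str] = None
    credits_position: Optional[int] = None


def coverage(characters, attr):
    """Return the fraction of characters whose `attr` is not None."""
    if not characters:
        return 0.0
    known = sum(1 for c in characters if getattr(c, attr) is not None)
    return known / len(characters)


# Toy records, invented for illustration only (not taken from the corpus).
toy_characters = [
    CharacterMeta("ALICE", gender="f", credits_position=1),
    CharacterMeta("BOB", gender="m"),
    CharacterMeta("CLERK"),
    CharacterMeta("DRIVER"),
]

gender_cov = coverage(toy_characters, "gender")
credits_cov = coverage(toy_characters, "credits_position")
print(f"toy gender coverage:  {gender_cov:.2f}")
print(f"toy credits coverage: {credits_cov:.2f}")

# Coverage implied by the counts stated in the list above.
TOTAL_CHARACTERS = 9035
corpus_gender_cov = 3774 / TOTAL_CHARACTERS
corpus_credits_cov = 3321 / TOTAL_CHARACTERS
print(f"corpus gender coverage:  {corpus_gender_cov:.1%}")
print(f"corpus credits coverage: {corpus_credits_cov:.1%}")
```

Going by the stated counts, well under half of the characters carry gender or credits-position labels, so any analysis that conditions on those attributes should handle missing values explicitly.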




  author={Cristian Danescu-Niculescu-Mizil and Lillian Lee},
  title={Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.},
  booktitle={Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011},



This material is based upon work supported in part by the National Science Foundation under grant IIS-0910664.

Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.