Cornell Movie-Quotes Corpus v1.0 (released July 2012)

Distributed together with:

"You had me at Hello: How phrasing affects memorability"
Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg and Lillian Lee
ACL 2012

RELATED CORPUS: Cornell Movie-Dialogs Corpus, available at
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

NOTE: If you have results to report on these corpora, please send email to cristian@cs.cornell.edu so we can add you to our list of people using this data. Thanks!

Contents of this README:

    A) Brief description
    B) Memorability annotation
    C) Files description
    D) Contact

A) Brief description:

This corpus contains a collection of movie lines together with memorability annotations.

- 894014 movie script lines
- from 1068 movie scripts
- 6282 one-line memorable quotes that are automatically matched with the script lines that contain them
- 2197 one-sentence memorable quotes paired with surrounding non-memorable quotes from the same movie, spoken by the same character and containing the same number of words

B) Memorability annotation

<> The memorability annotations are based on IMDb’s Memorable Quotes pages. It is important to note that the form of an IMDb memorable quote often differs from the form in which the quote appears in the movie script. For example, an IMDb quote for the movie Psycho is:

    "Oh, someone has seen her, all right. Someone always sees a girl with $400,000."

However, the quote appears in the movie script as:

    "Someone has seen her. Someone always sees a girl with forty thousand dollars."

which is actually part of a longer line:

    "Someone has seen her. Someone always sees a girl with forty thousand dollars. She is your girl friend, isn't she?"

In order to automatically match IMDb memorable quotes with script lines, we used a matching mechanism based on edit distance, lexical overlap and a few simple heuristics.
Keep in mind that although this matching mechanism works with very high precision, we did not manually check every single match, so a few matching errors may have occurred.

<> Note that this corpus does not contain all IMDb memorable quotes: we discarded quotes that have fewer than 20 characters and quotes that could not be confidently matched with any line in the movie script. Also, we did not consider multi-line IMDb quotes (blocks of lines involving multiple characters).

C) Files description:

<> moviequotes.memorable_nonmemorable_pairs.txt

    - this is the data used in the "You had me at hello" paper referenced above.
    - it contains 2197 pairs of the form (M, N) where:
        - M is a memorable one-sentence quote
        - N is a non-memorable quote selected from the same movie such that it is as close in the script as possible to M (either before or after it), subject to the conditions that:
            (i) M and N are uttered by the same speaker,
            (ii) M and N have the same number of words, and
            (iii) N does not occur in the IMDb list of memorable quotes for the movie (either as a single line or as part of a larger block).
    - the pairs are separated by blank lines
    - each pair is represented by 4 lines with the following format:

        MOVIE_TITLE
        MEMORABLE_QUOTE
        LINE_ID_MEMORABLE MATCHED_QUOTE
        LINE_ID_NON_MEMORABLE NON_MEMORABLE_QUOTE

      where:
        MEMORABLE_QUOTE is the memorable one-sentence quote as it appeared on IMDb
        LINE_ID_MEMORABLE is the LINE_ID in moviequotes.scripts.txt (described below) of the script line with which the MEMORABLE_QUOTE was matched (see B above for a description of the matching process)
        MATCHED_QUOTE is the script line with which the MEMORABLE_QUOTE was matched
        LINE_ID_NON_MEMORABLE is the LINE_ID in moviequotes.scripts.txt (described below) of the NON_MEMORABLE_QUOTE

    - Example pair:

        star wars
        The Force is strong with this one.
        736048 The Force is strong with this one!
        735122 Send a detachment down to retrieve them.
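The 4-line pair format above can be read with a few lines of Python. The following is only a minimal sketch, not part of the corpus distribution: the names QuotePair and parse_pairs are hypothetical, and it assumes that each LINE_ID is separated from the quote that follows it by whitespace, as in the example pair. Reading the file from disk (and choosing its text encoding) is left to the caller.

```python
from typing import Iterator, NamedTuple


class QuotePair(NamedTuple):
    """One (M, N) pair; a hypothetical container, not defined by the corpus."""
    movie_title: str
    memorable_quote: str        # quote as it appeared on IMDb
    line_id_memorable: str
    matched_quote: str          # script line matched to the IMDb quote
    line_id_non_memorable: str
    non_memorable_quote: str


def parse_pairs(text: str) -> Iterator[QuotePair]:
    """Yield one QuotePair per blank-line-separated block of 4 lines.

    Blocks that do not have exactly 4 non-empty lines are skipped.
    """
    block = []
    for raw in text.splitlines() + [""]:   # trailing "" flushes the last block
        line = raw.strip()
        if line:
            block.append(line)
            continue
        if len(block) == 4:
            title, memorable, matched_line, non_memorable_line = block
            # Assumption: the line ID and the quote are separated by whitespace.
            mem_id, matched = matched_line.split(None, 1)
            non_id, non_mem = non_memorable_line.split(None, 1)
            yield QuotePair(title, memorable, mem_id, matched, non_id, non_mem)
        block = []


# Usage on the example pair from this README.
sample = """star wars
The Force is strong with this one.
736048 The Force is strong with this one!
735122 Send a detachment down to retrieve them.
"""
pairs = list(parse_pairs(sample))
```

The line IDs are kept as strings so that they can be compared directly with the LINE_ID field of moviequotes.scripts.txt.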
<> moviequotes.memorable_quotes.txt

    - contains 6282 memorable one-line quotes (each may span multiple sentences)
    - items are separated by blank lines
    - each item contains 3 lines with the following format:

        MOVIE_TITLE
        MEMORABLE_QUOTE
        LINE_ID_MEMORABLE MATCHED_QUOTE

      where:
        MEMORABLE_QUOTE is the memorable one-line quote as it appeared on IMDb
        LINE_ID_MEMORABLE is the LINE_ID in moviequotes.scripts.txt (described below) of the script line with which the MEMORABLE_QUOTE was matched
        MATCHED_QUOTE is the script line with which the MEMORABLE_QUOTE was matched

    - Example item:

        psycho
        Oh, someone has seen her, all right. Someone always sees a girl with $400,000.
        621762 Someone has seen her. Someone always sees a girl with forty thousand dollars. She is your girl friend, isn't she?

<> moviequotes.scripts.txt

    - this file contains lines from 1068 movie scripts
    - one item per line, with the following fields (the field separator is: "+++$+++")

        LINE_ID
        MOVIE_TITLE
        MOVIE_LINE_NR
        CHARACTER
        REPLY_TO_LINE_ID
        TEXT

      where:
        LINE_ID is a unique ID of the script line
        MOVIE_TITLE is the title of the movie script
        MOVIE_LINE_NR is the number of the line within that movie script
        CHARACTER is the name of the character uttering that line
        REPLY_TO_LINE_ID is the ID of the line that this line follows in a conversation; the field is empty if the line is the beginning of a new conversation (where a conversation is defined as a group of lines not interrupted by stage directions)

    - Example item:

        751519 +++$+++ strangers on a train +++$+++ 524 +++$+++ hammond +++$+++ 751518 +++$+++ You look worried. What's the matter?

    - CAVEAT: in some cases stage directions could not be automatically identified due to the variety of formats in which the original scripts were retrieved; as a consequence, some stage directions are still present in this file.
      Note that this does not affect the data used in the paper (moviequotes.memorable_nonmemorable_pairs.txt), which was both automatically and manually cleaned of stage directions.

D) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)

This material is based upon work supported in part by the National Science Foundation under grant IIS-0910664. Any opinions, findings, and conclusions or recommendations expressed above are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
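
To illustrate the moviequotes.scripts.txt format described in C), the following is a minimal Python sketch of how one item could be split into its six fields. It is not part of the corpus distribution: the names ScriptLine and parse_script_line are hypothetical, and the code assumes that surrounding spaces around the "+++$+++" separator carry no meaning, as the example item suggests.

```python
from typing import NamedTuple, Optional

FIELD_SEPARATOR = "+++$+++"


class ScriptLine(NamedTuple):
    """One item of moviequotes.scripts.txt; a hypothetical container."""
    line_id: str
    movie_title: str
    movie_line_nr: str                # kept as text; not assumed to be numeric
    character: str
    reply_to_line_id: Optional[str]   # None when the line starts a conversation
    text: str


def parse_script_line(raw: str) -> ScriptLine:
    """Split one item into its six fields."""
    # maxsplit=5 leaves TEXT intact even if it happened to contain the separator.
    fields = [field.strip() for field in raw.split(FIELD_SEPARATOR, 5)]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    line_id, title, line_nr, character, reply_to, text = fields
    # An empty REPLY_TO_LINE_ID marks the beginning of a new conversation.
    return ScriptLine(line_id, title, line_nr, character, reply_to or None, text)


# Usage on the example item from this README.
example = (
    "751519 +++$+++ strangers on a train +++$+++ 524 +++$+++ hammond "
    "+++$+++ 751518 +++$+++ You look worried. What's the matter?"
)
parsed = parse_script_line(example)
```

Because stage directions may remain in this file (see the CAVEAT in C), downstream code should not assume that every parsed TEXT field is spoken dialogue.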