CS 6742

Course homepage http://www.cs.cornell.edu/courses/cs6742/2014fa. Main site for course info, assignments, readings, lecture references, etc.; updated frequently.
Course CMS page http://cms.csuglab.cornell.edu. Site for submitting assignments, unless otherwise noted.
Course Piazza page http://piazza.com/cornell/Fall2014/cs6742 Course announcements and Q&A/discussion site. Social interaction and all that, you know.
Instructor Professor Lillian Lee. For contact info, see http://www.cs.cornell.edu/home/llee
Time and place Tuesdays and Thursdays, 10:10-11:25, ~~Hollister 401 (since this room has reconfigurable seating)~~ Gates Hall 344 breakout room (quietly enter through 344, since students are working there, and go to the room on the right).

This page last modified Thu August 20, 2015 4:24 AM.

Brief course description More and more of life is now manifested online, and many of the digital traces that are left by human activity are increasingly recorded in natural-language format. This research-oriented course examines the opportunities for natural language processing to contribute to the analysis and facilitation of socially embedded processes. Possible topics include sentiment analysis, learning social-network structure, analysis of text in political or legal domains, review aggregation systems, analysis of online conversations, and text categorization with respect to psychological categories.

Prerequisites As previously announced in the 2014-2015 Courses of Study, enrollment is limited to PhD students except by permission of instructor. August 14 addition: given the number of PhD students who have registered for credit, permission will not be granted to non-PhD students, and auditing will not be allowed. Required background: CS 2110 or equivalent programming experience, and at least one course in artificial intelligence or any relevant subfield (e.g., NLP, information retrieval, machine learning).

Lectures

QUICK LINKS into the lecture table below
Tuesdays	Thursdays
8/26, lec 1: overview	8/28, lec 2: reviews, helpfulness, social interaction
9/2, lec 3: reviews and social, cont.	9/4, lec 4: what do conversations "look" like?
9/9, lec 5: discourse	9/11, lec 6: discourse, cont.
9/16, lec 7: A1 presentations	9/18, lec 8: A1 presentations
9/23, lec 9: discourse, cont.	9/25, lec 10: adaptation
9/30, lec 11: Unspeakable/Kickstarter	10/2, lec 12: Meme mutation/gendered um and uh
10/7, lec 13: hiphopgangstaghettorapper/stackoverflow vs. email. Plus the IRB; finding data samples on Twitter	10/9, lec 14: donating and collaborating
(Fall break)	10/16, lec 15: checkup appointments
10/21, lec 16: scraping	10/23, lec 17: project coordination, conference submission deadlines
10/28, lec 18: Bayesian ID of fightin' words	10/30, lec 19: fightin' words, cont.
11/4, lec 20: checkup appointments	11/6, lec 21: checkup appointments
11/11, lec 22: no meeting - Veterans Day	11/13, lec 23: features case study: great writing; grammars
11/18, lec 24: language models, mostly non-ngram ones	11/20, lec 25: checkup appointments
11/25, lec 26: language models and comparing language models	(Thanksgiving break)
12/2, lec 27: checkup appointments	12/4, lec 28: in-class project presentations

Lecture	Date	Agenda and references	Assignments and other handouts
#1	Aug 26	Course overview: scope, course goals, course design Image source: http://en.wikipedia.org/wiki/The_School_of_Athens. Some people are speaking to each other; some are reading and perhaps being influenced by that text; some are writing text, perhaps hoping to have an effect on others; some texts are being read by several people simultaneously. Scan of lecture notes Images and webpages displayed in class: An annotated Wikipedian vote page with comment threads (comments can be seen by clicking on the "1" speech balloons) and notabilia.net visualization of vote dynamics on such pages A Slashdot page (try out the various filtering/visualizations) and an interactive visualization of the corresponding conversation tree by the ConVis project, Hoque, Enamul, Giuseppe Carenini, and Shafiq Joty. 2014. Interactive exploration of asynchronous conversations: Applying a user-centered approach to design a visual text analytic system. Proceedings of the ACL Workshop on Interactive Language Learning, Visualization, and Interfaces (ILLVI). A screenshot of another Slashdot conversation-tree visualizer taken from Pascual-Cid, Victor and Andreas Kaltenbrunner. 2009. Exploring asynchronous online discussions through hierarchical visualisation. Information Visualisation. References Bryan, Christopher J, Gregory M Walton, Todd Rogers, and Carol S Dweck. 2 August 2011. Motivating voter turnout by invoking the self. Proceedings of the National Academy of Sciences 108 (31): 12653-12656. Chong, Dennis and James N. Druckman. 2007. Framing theory. Annual Review of Political Science 10:103--126.	Assignment 1 (A1) officially released
#2	28	To what extent is there social interaction on review sites? Image source: Dorothy Gambrel, Cat and Girl: http://catandgirl.com/archive/2001-05-21-cg0043drive.gif. Permission policy here. Scan of lecture notes Images and webpages displayed in class: Review with 42/42 helpfulness score Saved Amazon's "Straight Man" page with text/stars divergence highlighted; and annotated page of comments on the most critical helpful review (note also that one of those comments is hidden because "Customers don't think this post adds to the discussion") References Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. Proceedings of WWW, pp. 307--318. Gilbert, Eric and Karrie Karahalios. 2010. Understanding deja reviewers. Proceedings of CSCW, pp. 225--228. [ACM link] Jurafsky, Dan, Victor Chahuneau, Bryan R. Routledge and Noah A. Smith. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19(4). Michael, Loizos and Jahna Otterbacher. 2014. Write like I write: Herding in the language of online reviews. Proceedings of ICWSM. Mimno, David. Data carpentry. 2014. Pinch, Trevor and Filip Kesler. 2011. How Aunt Ammy gets her free lunch: A study of the top-thousand customer reviewers at Amazon.com.
#3	Sep 2	Review "quality" and "helpfulness": a lens for studying social influence Image source: Randall Munroe, xkcd (click on image for original link). Expletive obscured in this presentation. Scan of lecture notes Images and handouts from class outline handout helpfulness features handout SGJ, WWW'14, fig 3a MAT, Science '13, fig 1c Slides on DNM/K/K/L, WWW'09 C/DNM/L, ICWSM '14, fig 4 References on lecture topics Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2014. How community feedback shapes user behavior. Proceedings of ICWSM. Danescu-Niculescu-Mizil, Cristian, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. Proceedings of WWW: 141--150. [alt link] Ghose, Anindya and Panagiotis Ipeirotis. 2011. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23(10): 1498--1512. Official link can be found through Worldcat, e.g., here. Muchnik, Lev, Sinan Aral, and Sean Taylor. 2013. Social influence bias: A randomized experiment. Science 341. Otterbacher, Jahna. 2009. 'Helpfulness' in online communities: a measure of message quality. Proceedings of CHI, 955-964. Sipos, Ruben, Arpita Ghosh, and Thorsten Joachims. 2014. Was this review helpful to you? It depends! Context and voting patterns in online content. Proceedings of WWW. Wang, R.Y. and Strong, D.M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5-34. Representative additional references on "unconventional" text classification, by popular demand Davidov, Dmitry, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 107--116.
http://aclweb.org/anthology/W10-2914 Kiddon, Chloé and Yuriy Brun. That's what she said: Double entendre classification. Proceedings of the ACL (short papers), 89--94. Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy. 2014. Towards a general rule for identifying deceptive opinion spam. Proceedings of the ACL. The paper showing a learned classifier outperforming humans on Tripadvisor-style reviews is Ott, M, Y Choi, C Cardie, and J T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. Proceedings of the ACL, pp. 309--319. Mihalcea, Rada and Carlo Strapparava. 2006. Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence 22(2).
#4	4	What do conversations "look" like? Scan of lecture notes Aside: email corpora The Enron corpus: see this data site. MIT Tech Review '13 article: The immortal life of the Enron e-mails. The W3C Enterprise Track TREC data The LDC's Avocado Research Email Collection The paper “Topic-Specific Communication Patterns from Email Data,” Bruce Desmarais, Peter Krafft, Hanna Wallach, James ben-Aaron, and Juston Moore, presented at the 2012 Text as Data conference, mentions an email dataset consisting of "the entire inboxes and outboxes of the county managers of New Hanover County, North Carolina from the month of February, 2011. " References Backstrom, Lars, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. Proceedings of WSDM, pp. 13–22. [alt link] Elsner, Micha and Eugene Charniak. September 2010. Disentangling chat. Computational Linguistics 36(3): 389-409. [data and code] Gonzalez-Bailon, Sandra, Andreas Kaltenbrunner, and Rafael E Banchs. 2010. The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology 25(2): 230-243. Kumar, Ravi, Mohammad Mahdian, and Mary McGlohon. 2010. Dynamics of conversations. Proceedings of KDD, pp. 553--562. Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry, and Yuanxin Wang. 2014. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning 95:381--421. [alt link]. [The talk slides we looked at in class] Prabhakaran, Vinodkumar, Ashima Arora, and Owen Rambow. 2014. Power of confidence: How poll scores impact topic dynamics in political debates. ACL joint workshop on social dynamics and personal attributes. Prabhakaran, Vinodkumar and Owen Rambow. 2014. Predicting power relations between participants in written dialog from a single thread. Proceedings of the ACL (short papers). 
Seo, Jangwon, W. Bruce Croft, and David A. Smith. 2009. Online community search using thread structure. Proceedings of CIKM, pp. 1907--1910. Siersdorfer, Stefan, Sergiu Chelaru, Jose San Pedro, Ismail Sengor Altingovde, and Wolfgang Nejdl. July 2014. Analyzing and mining comments and comment ratings on the social web. ACM Trans. Web 8 (3): 17:1-17:39. [alt link] Wang, Yi-Chia, Mahesh Joshi, and Carolyn Penstein Rosé. 2008. Investigating the effect of discussion forum interface affordances on patterns of conversational interactions. Proceedings of CSCW, pp. 555–558.
#5	9	Checkpoints of A1 projects; Discourse phenomena: clues regarding structure Image source: http://www.metmuseum.org/toah/works-of-art/49.70.33. "The image is one for which Picasso did a number of variations in Paris during the autumn–winter of 1912; in each version, a tall bottle and goblet are set out on a small round table." Scan of lecture notes and the handout References related to the A1 project discussions On code switching: Auer, Peter. 2013. Code-switching in Conversation: Language, Interaction and Identity. Routledge. Sample "ACL-style" paper: Elfardy, Heba and Mona Diab. 2012. Token level identification of linguistic code switching, Proceedings of COLING. The LIWC lexicon (categorized word lists): Tausczik, Y R and J W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1). Technically, single licenses are available for purchase from this site. A freely available similar type of lexicon is the Harvard General Inquirer lexicon. Duplicate detection: Sorokina, Daria, Johannes Gehrke, Simeon Warner, and Paul Ginsparg. 2006. Plagiarism detection in arxiv. In Proceedings of the Sixth International Conference on Data Mining, 1070-1075. The code may be available upon request; there's a Java port of the original C code, but available only as a web service, I think, here. Humorous reviews subreddit, Funny reviews on Amazon subreddit, Funniest Amazon review I've ever read thread Subcommunity review sites: DPRreview, RateMyProfessors References from discourse lecture Grice, H.P. 1975. Logic and Conversation. In Syntax and semantics 3: Speech Acts, pp. 41-58. Jurafsky, Dan, and Martin, James H. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition Second edition. Chapter 21 covers discourse. Moser, Megan and Johanna Moore. Toward a synthesis of two accounts of discourse structure. 
Computational Linguistics 22(3):409--419. Rogers, Todd and Michael I Norton. June 2011. The artful dodger: Answering the wrong question the right way. Journal of Experimental Psychology: Applied 17 (2). References for the examples on the handout: Jordan Boyd-Graber Google+ post Allen, James. 1995. Natural Language Understanding. Benjamin/Cummings Pub Co. Second ed. Hirst, Graeme. 1981. Anaphora in Natural Language Understanding: A Survey. Lecture Notes in Computer Science. Springer, Berlin. Sidner, Candace Lee. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. MIT AITR-537. Wilks, Yorick. 1975. An intelligent analyzer and understander of English. Communications of the ACM 18 (5): 264-274.
#6	11	Attention, intentions, and discourse structure: the Grosz and Sidner theory Scan of lecture notes References: Grosz, Barbara J., and Sidner, Candace L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204. Mann, William C., and Thompson, Sandra A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text: Interdisciplinary Journal for the Study of Discourse 8, no. 3: 243-281. Pinker, Steven and the Royal Society for the Encouragement of Arts, Manufactures and Commerce (RSA) Animate, posted to YouTube on Feb 10, 2011. Language as a Window into Human Nature	A2 out (deadline subsequently extended to Sept. 22)
#7	16	A1 presentations, part one
#8	18	A1 presentations, part two
#9	23	Discussion of application of Grosz/Sidner theory in A2 "Stacking", by Alastair Hesletine. Image source: http://thumbpress.com/the-art-of-stacking-wood/ Scan of discussion notes References see also the previous discourse lectures Wikipedia entry on Deep Blue vs. Garry Kasparov (pronunciation) Stolcke, Andreas, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics 26(3): 339--373. Taboada, Maite and William C. Mann. 2006. Rhetorical structure theory: Looking back and moving ahead. Discourse Studies 8(3): 423-459. Gives an overview of many issues in analyzing discourse structure. Walker, Marilyn A. 1996. Limited attention and discourse structure. Computational Linguistics 22(2): 255-264.	Read one (your choice) of the readings for Tu Sep 30 (lecture 11) and post a project proposal inspired by it to Piazza by 3pm Mon the 29th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account. And, read each other's proposals, commenting as you see fit, before class on the 30th.
#10	25	Language adaptation, power and within-group lifespan Scan of lecture notes Danescu-Niculescu-Mizil, Cristian, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. Proceedings of WWW, pp. 699--708. Link includes access to datasets, talk slides, etc. ACM link is here. Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. Proceedings of WWW, pp. 307--318. Link includes access to datasets, talk slides, etc. ACM link is here. References http://minimalmovieposters.tumblr.com/archive Beňuš, Štefan, Rivka Levitan, and Julia Hirschberg. 2012. Entrainment in spontaneous speech: The case of filled pauses in supreme court hearings. Proceedings of the 3rd IEEE Conference on Cognitive Infocommunications. Bramsen, Philip, Martha Escobar-Molana, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. Proceedings of ACL HLT. Choudhury, Tanzeem and Alex Pentland. 2004. Characterizing social networks using the sociometer. Proceedings of the North American Association of Computational Social and Organizational Science (NAACSOS) Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. Proceedings of the ACL. Real-life application and links to data and code. Diehl, Christopher P., Galileo Namata, and Lise Getoor. 2007. Relationship identification for social network discovery. Proceedings of the AAAI Workshop on Enhanced Messaging, pp. 546--552. Gilbert, Eric. 2012. Phrases that signal workplace hierarchy. Proceedings of CSCW. Leber, Jessica. 2013. The immortal life of the Enron e-mails. Business News. Ng, Sik Hung and James J Bradac. 1993. Power in Language: Verbal Communication and Social Influence. 
Sage Publications, Inc. Vinod Prabhakaran and Owen Rambow's work on inferring power relationships
#11	30	Project-possibilities discussion Image source: http://xkcd.com/1055/ The assigned reading: one of: Glasgow, Kimberly, Clayton Fink, and Jordan Boyd-Graber. 2014. Our grief is unspeakable: Automatically measuring the community impact of a tragedy. Proceedings of ICWSM. Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on Kickstarter. Proceedings of CSCW. Sites examined or mentioned during class A sample www.debate.org "duel"; notice the plethora of evaluative explicit or implicit annotations. Other debate sites include createdebate, www.forandagainst.com, www.convinceme.net, idebate.org See also the Internet Argument Corpus, described in A corpus for research on deliberation and debate. Marilyn A. Walker, Pranav Anand, Jean E. Fox Tree, Rob Abbott, Joseph King. LREC 2012 Scrape of 87K Kickstarter projects. Discovered by looking at the Greenberg et al. 2013 paper mentioned below, who linked to thekickbackmachine.com by neight-allen. That site itself is no longer maintained, and there may be a problem with the file, but "dm" at thekickbackmachine.com recommends Walter Haas' Kickspy. Potato-salad kickstarter project. Data on 45K Kickstarter projects, by "Jeanne" Memetracker website, including data: http://www.memetracker.org/index.html. "Palling around with terrorists" quote graph (links = share many words). "Heartbeat" figure for when blogs vs. the mainstream media discuss an event. See publication info below. Note also the ICWSM 2011 Spinn3r dataset: over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset's time period). A sample Quirky appeal for help in pricing, tagline selection, etc. randomactsofpizza.com. 
Note the need for guarding against scam attempts, described in the "Best Practices" section. The actual subreddit is here. Example of reciprocity offer. Note annotation of who actually received pizza. See publication info below. TREC 2011 microblog-track tweet dataset and link to twitter-tools by Jimmy Lin ("lintool"), which "provides support for removing deleted tweets from your copy of the corpus", as is important to be compliant with Twitter's policies. References (including some that came up during class) Althoff, Tim, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. Proceedings of ICWSM. Bailey, Michael, Daniel J Hopkins, and Todd Rogers. 2013. Unresponsive and unpersuaded: The unintended consequences of voter persuasion efforts. Working paper on SSRN. Bamman, David, Brendan O'Connor, and Noah Smith. 2012. Censorship and deletion practices in Chinese social media. First Monday 17(3). Bell, Brad E and Elizabeth F Loftus. May 1989. Trivial persuasion in the courtroom: The power of (a few) minor details. Journal of Personality and Social Psychology 56(5):669-679. Danescu-Niculescu-Mizil, Cristian, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. Proceedings of the ACL, pp. 892--901. Gayo-Avello, Daniel. December 2013. A meta-analysis of state-of-the-art electoral prediction from Twitter data. Social Science Computer Review 31(6): 649-679. Hat tip to Brendan O'Connor; I saw this on his 2013 blog post Some analysis of tweet shares and “predicting” election outcomes. Also of interest, for the title alone: Gayo-Avello's "I wanted to predict elections with twitter and all I got was this lousy paper" -- A balanced survey on election prediction using twitter data, Eprint ArXiv:1204.6441 and On Twitter and Elections, catchy paper titles, press releases and telling scientist's opinions from facts: A brief comment to DiGrazia et al.
2013 and to Fabio Rojas Op-Ed in Washington Post. Greenberg, Michael D, Bryan Pardo, Karthic Hariharan, and Elizabeth Gerber. 2013. Crowdfunding support tools: Predicting success & failure. Proceedings of CHI: Extended Abstracts, pp. 1815--1820. Guerini, Marco, Carlo Strapparava, and Oliverio Stock. 2010. Evaluation metrics for persuasive NLP with Google adwords. Proceedings of LREC. Hannak, Aniko, Drew Margolin, Brian Keegan, and Ingmar Weber. 2014. Get back! You don't know me like that: The social mediation of fact checking interventions in Twitter conversations. Proceedings of ICWSM. King, Gary, Jennifer Pan, and Margaret E Roberts. 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review 107(02): 326-343. Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD, 497-506. Petrovic, Sasa, Miles Osborne, and Victor Lavrenko. 2013. I wish I didn't say that! Analyzing and predicting deleted messages in Twitter. eprint arXiv:1305.3107. Qazvinian, Vahed, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. Proceedings of EMNLP, 1589--1599. Thelwall, Mike, Kevan Buckley, and Georgios Paltoglou. 2011. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology 62(2): 406-418.	Read one — your choice — of the readings for Tu Oct 7 (lecture 13) and post a project proposal inspired by it to Piazza by 3pm Mon the 6th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account. And, read each other's proposals, commenting as you see fit, before the in-class discussion.
#12	Oct 2	Project-possibilities discussion Image source: http://qwantz.com/index.php?comic=1317 Class is at 3:30 - let's say the Theory Lab. Papers to be presented: Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. Proceedings of ICWSM, pp. 353--360. Acton, Eric K. 2011. On gender differences in the distribution of um and uh. Penn working papers in Linguistics: Selected papers from NWAV 17. Some conversation/transcript corpora Cornell is a member of the LDC and so has access to LDC corpora Always a good idea to check the corpora mailing list AMI Meeting Corpus (many annotations): "a multi-modal data set consisting of 100 hours of meeting recordings...Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains" British Columbia Conversation Corpus (40 email threads) Enron email dataset London-Lund corpus of spoken English: mainly face-to-face conversations but also telephone conversations, interviews, radio discussions, sports commentaries, political speeches, court proceedings, etc., and totals nearly half a million words Penn Discourse Treebank Saarbrücken Corpus of Spoken English Santa Barbara Corpus of Spoken American English: based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds.
The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more. Supreme Court dialogs corpus Switchboard corpus (also in NLTK) Talkbank looks like a rich set of different domains, including child-directed speech (do check the rules) "Unshared" task in poliInformatics. Includes meeting transcripts of the Federal Open Market Committee (keep scrolling down the page to also get to tools). A Wall Street Journal article asking for help analyzing the scheduled and emergency meeting transcripts References Attempt to reframe (reclaim?) "fracking". Ditto for Obamacare Centrality measures - here's a presentation by Peter Dodds Choi, Eunsol, Chenhao Tan, Lillian Lee, Cristian Danescu-Niculescu-Mizil, and Jennifer Spindel. June 2012. Hedge detection as a lens on framing in the GMO debates: A position paper. Proceedings of the ACL Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics Clark, Herbert H., and Fox Tree, Jean E. 2002. Using uh and um in spontaneous speaking. Cognition 84, no. 1: 73--111. Gonzales, Amy L., Jeffrey T. Hancock, and James W. Pennebaker. 2010. Language style matching as a predictor of social dynamics in small groups. Communication Research 37(1): 3-19. Greene, Stephan and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. NAACL, pp. 503--511. Ireland, Molly E., Richard B. Slatcher, Paul W. Eastwick, Lauren E. Scissors, Eli J. Finkel, and James W. Pennebaker. 2011. Language style matching predicts relationship initiation and stability. Psychological Science 22(1): 39-44. Kleinberg, Jon. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (5): 604–632. Liberman, Mark. 2014.
Language Log post on all sorts of aspects of the uh/um divide. Omodei, Elisa, Thierry Poibeau, and Jean-Philippe Cointet. 2012. Multi-level modeling of quotation families morphogenesis. Proceedings of ASE/IEEE SocialCom. Ranganath, Rajesh, Dan Jurafsky, and Dan McFarland. 2009. It's not you, it's me: Detecting flirting and its misperception in speed-dates. Proceedings of EMNLP. Schneider, Nathan, Rebecca Hwa, Philip Gianfortoni, Dipanjan Das, Michael Heilman, Alan W. Black, Frederick L. Crabbe, and Noah A. Smith. 2010. Visualizing Topical Quotations Over Time to Understand News Discourse. CMU-LTI-01-103, CMU. Tagliamonte, Sali. 2005. So who? Like how? Just what? Discourse markers in the conversations of young Canadians. Journal of Pragmatics 37(11): 1896-1915.
#13	7	Project-possibilities discussion Image source: www.catandgirl.com/?p=2105 Garley, Matt and Julia Hockenmaier. 2012. Beefmoves: Dissemination, diversity, and dynamics of English borrowings in a German hip hop forum. Proceedings of ACL. Vasilescu, Bogdan, Alexander Serebrenik, Prem Devanbu, and Vladimir Filkov. 2014. How social Q&A sites are changing knowledge sharing in open source software communities. Proceedings of CSCW, pp. 342--354. Some things discussed or tried in class About the Cornell Institutional Review Board for Human Participants. Flowchart: how to decide if your activity is covered by Cornell's Human Research Protection Program. Link generated for sharing a StackOverflow question for Twitter: http://stackoverflow.com/q/23639039?stw=2 ; for email: http://stackoverflow.com/q/23639039?sem=2 , etc. Note that we can search for such things; for example: https://twitter.com/search?q=stackoverflow.com%20stw%3D2 (this is a query for stackoverflow.com and stw=2). About collecting data from Twitter: Tutorial on using twitteR (R) to mine Twitter attitudes towards airlines REST API documentation and Streaming API Cristian Danescu-Niculescu-Mizil's advice: (a) Rate limitations: http://dev.twitter.com/pages/rate-limiting Not complying with the rate limitations will result in your IP getting blacklisted. For the REST API there is a clear limit and an easy way to track your limit status: https://api.twitter.com/1.1/application/rate_limit_status.json (which should be called frequently by your code). (b) When interacting with the API, use exception clauses. Many things can go wrong. When handling the exceptions, keep (a) in mind. (c) Design your data gathering so that it can be easily restarted, without losing what was already collected. 
(d) For the most popular programming languages there are many Twitter Libraries that can be used to send requests to the API: https://dev.twitter.com/overview/api/twitter-libraries Query for code-switching examples in Thai: https://twitter.com/search?q=%22code-switching%22%20lang%3Ath . 55555, or, How to laugh online in other languages. Megan Garber, The Atlantic, 2012. References Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. Proceedings of WWW. Farshad Kooti, Haeryun Yang, Meeyoung Cha, Krishna Gummadi, and Winter Mason. The emergence of conventions in online social networks. Proceedings of ICWSM. Best paper award. Vasilescu, Bogdan, Andrea Capiluppi, and Alexander Serebrenik. 2012. Gender, representation and online participation: A quantitative study of stackoverflow. Proceedings of Social Informatics, pp. 332--338.	Assignments: Sign up by 3pm Wed the 15th for a check-up appointment, which will be held on Thursday the 16th. Here is the sign-up link. By 3pm on Tue the 21st, post your two+ paragraph informal term-project proposal draft on Piazza. If you've already decided to team up, just one person on the team posts. At a minimum, your proposal should give the main idea, why you think this is interesting, the dataset you plan to use, and as precise an indication as you can give of what precisely you intend to investigate as the minimum criteria for completion. Beyond the minimum, the more thought you put into proving that your project is feasible, the more useful this step will be. Between 3pm on Tuesday the 21st and 9am on Thursday the 23rd, you should on Piazza do the best you can to provide helpful comments to each other, perhaps decide to team up, etc.
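Aside on the lecture 13 data-gathering advice: points (a) through (c) (respect rate limits, wrap API calls in exception handling, and make collection restartable) can be illustrated with a small sketch. This is not code from the course and it does not call the Twitter API. The names `fetch_page`, `RateLimitError`, `collect`, and the checkpoint file layout are all hypothetical, chosen only for illustration; a real collector would plug in a client library from point (d) and handle more kinds of failure.

```python
import json
import os
import time


class RateLimitError(Exception):
    """Hypothetical error a fetcher raises when the API asks us to slow down."""


def load_checkpoint(path):
    """Return previously saved state, or a fresh state if no file exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"next_cursor": 0, "items": []}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def collect(fetch_page, path, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Collect pages until fetch_page reports a next cursor of None.

    fetch_page(cursor) -> (items, next_cursor) is an assumed stand-in for a
    real API request; it is not part of any actual library.
    """
    state = load_checkpoint(path)  # (c): resume from what was already saved
    while state["next_cursor"] is not None:
        for attempt in range(max_retries):
            try:
                items, next_cursor = fetch_page(state["next_cursor"])
                break
            except RateLimitError:
                # (a) and (b): back off exponentially instead of retrying at once
                sleep(base_delay * 2 ** attempt)
        else:
            # Retries exhausted; everything gathered so far is already on disk.
            raise RuntimeError("gave up after repeated rate-limit errors")
        state["items"].extend(items)
        state["next_cursor"] = next_cursor
        save_checkpoint(path, state)  # persist after every page
    return state["items"]
```

Saving after every page costs some I/O, but it means an interrupted run loses at most one request's worth of data, and calling `collect` again with the same path picks up at the saved cursor.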
#14	9	Project-possibilities discussion Sites (in addition to the last time we talked about kickstarter) Transcripts of the meetings of the Federal Open Market Committee (FOMC). I believe I've been told that very important decisions get made in the meetings; pre-1993, the meeting transcripts were not meant to be made public, which means we can perhaps assume that the problem-solving is genuine. Incidentally, here's an Economist article summarizing a paper (see citation below) saying that the change in transcript privacy policy corresponds to a clear change in language behavior. Gentoo's Bugzilla http://www.gofundme.com http://kickingitforward.org http://www.kiva.org and kiva data snapshots MathOverflow and stackexchange API and some older mathoverflow data dumps https://www.patreon.com Polymath wiki References An, Jisun, Daniele Quercia, and Jon Crowcroft. 2014. Recommending investors for crowdfunding projects. Proceedings of the 23rd International Conference on World Wide Web, pp. 261--270. Barany, Michael J. 2010. '[B]ut this is blog maths and we're free to make up conventions as we go along': Polymath1 and the modalities of 'massively collaborative mathematics'. Proceedings of the 6th International Symposium on Wikis and Open Collaboration, pp. 10:1--10:9. Barron, Brigid. November 2009. Achieving coordination in collaborative problem-solving groups. The Journal of the Learning Sciences 9(4): 403–436. Cranshaw, Justin and Aniket Kittur. 2011. The polymath project: Lessons from a successful online collaboration in mathematics. Proceedings of CHI, pp. 1865--1874. Fogarty, Mignon (Grammar Girl), 2014. What new research on the brain says every writer should do. Fort, Karën, Gilles Adda, and K Bretonnel Cohen. 2011. Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics 37(2): 413-420. Hansen, Stephen, Michael McMahon, and Andrea Prat. 2014. Transparency and deliberation within the FOMC: A computational linguistics approach.
Centre for Economic Policy Research, paper no 9994. Roschelle, Jeremy and Stephanie D Teasley. 1995. The construction of shared knowledge in collaborative problem solving. Proceedings of the NATO Advanced Research Workshop on Computer Supported Collaborative Learning, pp. 69--97. Willemyns, Michael, Cynthia Gallois, and Victor J Callan. 2006. Conversations between postgraduate students and their supervisors: Intergroup communication and accommodation. Proceedings of the World Congress on the Power of Language: Theory, Practice and Performance. Xu, Anbang, Xiao Yang, Huaming Rao, Wai-Tat Fu, Shih-Wen Huang, and Brian P Bailey. 2014. Show me the money!: An analysis of project updates during crowdfunding campaigns. Proceedings of CHI, pp. 591--600.
Oct 14	Fall Break
#15	16	No class meeting — instead, individual meetings throughout the day for performance (and, if desired, potential project) feedback.
#16	21	Tales from the trenches: data scraping Slides by Amit Sharma and Chenhao Tan Notes taken during class Resources mentioned in discussion. All descriptions below taken from the linked webpages Beautiful Soup, Python library designed for quick turnaround projects like screen-scraping curl man page. Transfer a URL cron pandas, Python Data Analysis Library /r/redditdev, subreddit for discussion of reddit API clients and the reddit source code. redis, open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server social-integrator by Amit Sharma. A project to provide convenient API access to programmers and researchers for downloading data. Currently supports lastfm. Tweepy, An easy-to-use Python library for accessing the Twitter API. Twitter social graph 2009 Results of a full crawl of the entire Twitter site. 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. This data was collected for the paper, What is Twitter, a Social Network or a News Media? by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. wget Wikipedia page, retrieves content from web servers WordNet and SentiWordNet
#17	23	Class discussion of project proposals and feedback Upcoming submission deadlines: NAACL long & short papers: sub Dec 4, notification Feb 20, multiple submissions OK. ICWSM full and poster: abstracts Jan 18, papers Jan 23, notification March 9. WWW "Web Science track": abstracts Jan 19, papers Jan 23, notification Feb 27.	Post to Piazza as a followup to your proposal what you commit to doing in the next week/1.5 weeks or so.
#18	28	Bayesian identification of features distinguishing two sub-languages Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-never-tell-me-the-odds-1/. Scan of lecture notes Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403. Additional references: Google books n-gram corpus Kenneth W. Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. Proceedings of COLING. Fredette, Marc and Jean-François Angers. 2002. A new approximation of the posterior distribution of the log-odds ratio. Statistica Neerlandica 56(3): 314-329. Kleinberg, Jon. 2002. Bursty and hierarchical structure in streams. Proceedings of KDD, pp. 91-101. Percy Liang and Dan Klein. 2007. Tutorial: Structured Bayesian nonparametric models with variational inference. Included for material and visualizations of Dirichlets. Liberman, Mark. 2014. Obama's favored (and disfavored) SOTU words. Language Log blog post, using the Monroe/Colaresi/Quinn method. Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on kickstarter. Proceedings of CSCW. FAQ: How do I interpret odds ratios in logistic regression? Introduction to SAS. UCLA: Statistical Consulting Group.
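A rough sketch of the kind of computation lecture #18 is about: scoring each word by a z-scored log-odds ratio between two sub-languages, with a Dirichlet prior smoothing the counts. This is written from my reading of the Monroe/Colaresi/Quinn paper and should be checked against it before use; it is not the authors' code. The function name `fightin_words_z` and its arguments are hypothetical, and the sketch assumes a vocabulary of at least two words with strictly positive prior pseudo-counts.

```python
import math


def fightin_words_z(counts_i, counts_j, prior):
    """Per-word z-scores of the log-odds ratio under a Dirichlet prior.

    counts_i, counts_j: dict word -> count in sub-language i and j.
    prior: dict word -> pseudo-count alpha_w > 0 (for example, scaled counts
           from a background corpus); its keys define the vocabulary.
    A positive score means the word leans towards sub-language i.
    """
    n_i = sum(counts_i.get(w, 0) for w in prior)
    n_j = sum(counts_j.get(w, 0) for w in prior)
    a0 = sum(prior.values())
    z = {}
    for w, a_w in prior.items():
        y_i = counts_i.get(w, 0) + a_w  # smoothed count in i
        y_j = counts_j.get(w, 0) + a_w  # smoothed count in j
        # Difference of the two smoothed log-odds of the word
        delta = (math.log(y_i / (n_i + a0 - y_i))
                 - math.log(y_j / (n_j + a0 - y_j)))
        # Approximate variance of that difference
        variance = 1.0 / y_i + 1.0 / y_j
        z[w] = delta / math.sqrt(variance)
    return z
```

Sorting the vocabulary by this score puts the words most characteristic of each sub-language at the two ends of the list. Because the variance term grows as counts shrink, rare words are pulled towards zero.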
#19	30	Continuation of "Fightin' Words" Scan of lecture notes Deferred, for the most part, until next class meeting: Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352. Resources MRC Psycholinguistic database. Citation: Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11. A search interface makes it clear what kind of features (annotations) are identified for the lexicon items.	Sign up for (mandatory) checkup appointments for next week. Link here.
#20	Nov 4	No class meeting — individual team check-up meetings instead.
#21	6	No class meeting — individual team check-up meetings instead.
#22	11	No class meeting — Veteran's Day
#23	13	Case study of feature ingenuity; grammars of various sorts Image source: http://popchartlab.com/products/a-diagrammatical-dissertation-on-opening-lines-of-notable-novels, available for purchase; image crop by Popular Science Scan of lecture notes Our starting point: Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352. Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Slides by Kathy McCoy corresponding to chapter 14 - slide 7 (Fig 14.12) is what I showed in class, and slides 17 and 18 (need to split categories), 20 and 22 (importance of lexical information) were ones I contemplated showing. Section 12.7 discussed dependency parsing. I also displayed figure 12.14, a dependency parse, from chapter 12. Notes for a lecture I gave on context-free grammars in 2007, scribed by Cristian Danescu-Niculescu-Mizil, Nam Nguyen, and Myle Ott. References Abeillé, Anne and Yves Schabes. 1989. Parsing idioms in lexicalized TAGs. EACL, pp. 1–9. Jäger, Gerhard and James Rogers. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences. I picked this for being brief, hitting the whole Chomsky hierarchy, and mentioning the mildly-context-sensitive languages and their relation to natural language; but one would not necessarily argue that this is an easy introduction. Kroch, Anthony and Aravind Joshi. 1985. The linguistic relevance of tree adjoining grammar. UPenn Technical Report MS-CIS-85-16. Joshi, Aravind K. and Yves Schabes. 1997. Tree-adjoining grammars. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 3 (Beyond words), pp. 69–12. Joshi, Aravind, K. 
Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber and Thomas Wasow, Eds., Foundational Issues in Natural Language Processing. Link is to a technical-report version. Nadeau, David and Sekine, Satoshi. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1): 3-26. Alternative link. Pullum, Geoffrey K. 1986. Topic ... Comment: Footloose and context-free. Natural Language & Linguistic Theory 4(3): 409-414. Comments on attempts to prove that natural languages are not context-free. The MaltParser dependency parser. An early paper, mentioning that "The runtime of the algorithm is linear in the length of the input string, and the dependency graph is guaranteed to be projective and acyclic": Nivre, Joakim. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03), pp. 149-160. The XTAG project.	Sign up for (mandatory) checkup appointments for next week. Link here.
#24	18	Tour of (mostly non-ngram) language models Scan of lecture notes References Barzilay, Regina and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, 113-120. Original code by Regina Barzilay (in Lisp); code by Alexandre Passos; other code for later versions Booth, Taylor L. and Richard A. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers 100(5): 442-450. Chi, Zhiyi and Stuart Geman. June 1998. Estimation of probabilistic context-free grammars. Computational Linguistics 24(2): 299-305. Lari, K. and S.J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language 4(1): 35-56. Code by Mark Johnson. Manning, Christopher D. and Hinrich Schuetze. 1999. Section 11.1 "Some features of PCFGs", which can be found in Chapter 11 of Foundations of Statistical Natural Language Processing. MIT Press. Rabiner, Lawrence R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257--286. Errata by Ali Rahimi
#25	20	No class meeting — individual team check-up meetings instead.
#26	25	Language models: characterization and comparison Image source: http://danielsolisblog.blogspot.com/2012/01/writers-dice.html Scan of lecture notes References Baez, John. 2012. The mathematics of biodiversity (part 4). Blog post. Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4): 359-393. Csiszár, Imre. September 2008. Axiomatic characterizations of information measures. Entropy 10(3): 261-273. Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one. Corpus-based Research Into Language: In Honour of Jan Aarts, pp. 189--200. Lee, Lillian. 1999. Measures of distributional similarity. Proceedings of the ACL, pp. 25--32. Rao, Calyampudi Radhakrishna. January 2011. Entropy and cross entropy as diversity and distance measures. In International Encyclopedia of Statistical Science, pp. 440--446.	Sign up for (mandatory) checkup appointments for next week. Link here.
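To make the topic of lecture #26 concrete, here is a small sketch of two standard ingredients that the references for that lecture touch on: an add-lambda (add-one when lambda is 1) smoothed unigram model, and the Jensen-Shannon divergence for comparing two such models. It is an illustration under stated assumptions, not a record of what was presented in class; the function names are hypothetical, and add-one smoothing is used only because it is the simplest scheme to write down (the Gale and Church reference discusses its shortcomings).

```python
import math
from collections import Counter


def add_lambda_model(tokens, vocab, lam=1.0):
    """Unigram distribution over vocab with add-lambda smoothing."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + lam * len(vocab)
    return {w: (counts[w] + lam) / total for w in vocab}


def kl_divergence(p, q):
    """D(p || q) in bits; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric, between 0 and 1."""
    support = set(p) | set(q)
    p_full = {w: p.get(w, 0.0) for w in support}
    q_full = {w: q.get(w, 0.0) for w in support}
    # The average distribution is positive wherever p or q is,
    # so both KL terms below are well defined.
    m = {w: 0.5 * (p_full[w] + q_full[w]) for w in support}
    return 0.5 * kl_divergence(p_full, m) + 0.5 * kl_divergence(q_full, m)
```

Unlike the plain KL divergence, the Jensen-Shannon divergence stays finite when one model assigns zero probability to a word the other uses, which makes it usable even on unsmoothed estimates.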
Nov 27	Thanksgiving Break
#27	Dec 2	No class meeting — individual team check-up meetings instead.
#28	4	10-minute project presentations.
Final-project write-up due-date, as determined by the registrar: December 11 at 4:30 pm. I have no particular page length in mind, but please highlight the most interesting findings (positive or negative). You should include the following sections: introduction/motivation, related work, data description (how you gathered, cleaned, and processed it), a methods section, an experiments section, what you learned and what you concluded, what are directions for future work. You don't need to be particularly formal. My primary evaluation criteria will be the reasonableness (in approach and amount of effort), thoughtfulness, and creativity of what you tried.

Lecture

Date

Agenda and references

Assignments and other handouts

Aug 26

Course overview: scope, course goals, course design

The school of Athens - people talking and reading

Image source: http://en.wikipedia.org/wiki/The_School_of_Athens. Some people are speaking to each other; some are reading and perhaps being influenced by that text; some are writing text, perhaps hoping to have an effect on others; some texts are being read by several people simultaneously.

Scan of lecture notes

Images and webpages displayed in class:

An annotated Wikipedian vote page with comment threads (comments can be seen by clicking on the "1" speech balloons) and notabilia.net visualization of vote dynamics on such pages
A Slashdot page (try out the various filtering/visualizations) and an interactive visualization of the corresponding conversation tree by the ConVis project, Hoque, Enamul, Giuseppe Carenini, and Shafiq Joty. 2014. Interactive exploration of asynchronous conversations: Applying a user-centered approach to design a visual text analytic system. Proceedings of the ACL Workshop on Interactive Language Learning, Visualization, and Interfaces (ILLVI).
A screenshot of another Slashdot conversation-tree visualizer taken from Pascual-Cid, Victor and Andreas Kaltenbrunner. 2009. Exploring asynchronous online discussions through hierarchical visualisation. Information Visualisation.

References

Bryan, Christopher J, Gregory M Walton, Todd Rogers, and Carol S Dweck. 2 August 2011. Motivating voter turnout by invoking the self. Proceedings of the National Academy of Sciences 108 (31): 12653-12656.

Chong, Dennis and James N. Druckman. 2007. Framing theory. Annual Review of Political Science 10:103--126.

Assignment 1 (A1) officially released

To what extent is there social interaction on review sites?

Image source: Dorothy Gambrel, Cat and Girl: http://catandgirl.com/archive/2001-05-21-cg0043drive.gif. Permission policy here.

Scan of lecture notes

Images and webpages displayed in class:

Review with 42/42 helpfulness score
Saved Amazon's "Straight Man" page with text/stars divergence highlighted; and annotated page of comments on the most critical helpful review (note also that one of those comments is hidden because "Customers don't think this post adds to the discussion")

References

Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. Proceedings of WWW, pp. 307--318.

Gilbert, Eric and Karrie Karahalios. 2010. Understanding deja reviewers. Proceedings of CSCW, pp. 225--228. [ACM link]

Jurafsky, Dan, Victor Chahuneau, Bryan R. Routledge and Noah A. Smith. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19(4).

Michael, Loizos and Jahna Otterbacher. 2014. Write like I write: Herding in the language of online reviews. Proceedings of ICWSM.

Mimno, David. Data carpentry. 2014.

Pinch, Trevor and Filip Kesler. 2011. How Aunt Ammy gets her free lunch: A study of the top-thousand customer reviewers at Amazon.com.

Sep 2

Review "quality" and "helpfulness": a lens for studying social influence

Image source: Randall Munroe, xkcd (click on image for original link). Expletive obscured in this presentation.

Scan of lecture notes

Images and handouts from class

outline handout
helpfulness features handout
SGJ, WWW'14, fig 3a
MAT, Science '13, fig 1c
Slides on DNM/K/K/L, WWW'09
C/DNM/L, ICWSM '14, fig 4

References on lecture topics

Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2014. How community feedback shapes user behavior. Proceedings of ICWSM.

Danescu-Niculescu-Mizil, Cristian, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. Proceedings of WWW: 141—150. [alt link]

Ghose, Anindya and Panagiotis Ipeirotis. 2011. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23(10): 1498—1512. Official link can be found through Worldcat, e.g., here.

Muchnik, Lev, Sinan Aral, and Sean Taylor. 2013. Social influence bias: A randomized experiment. Science 341.

Otterbacher, Jahna. 2009. 'Helpfulness' in online communities: a measure of message quality. Proceedings of CHI, 955-964.

Sipos, Ruben, Arpita Ghosh, and Thorsten Joachims. 2014. Was this review helpful to you? It depends! Context and voting patterns in online content. Proceedings of WWW.

Wang, R.Y. and Strong, D.M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5-34.

Representative additional references on "unconventional" text classification, by popular demand

Davidov, Dmitry, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 107--116. http://aclweb.org/anthology/W10-2914

Kiddon, Chloé and Yuriy Brun. That's what she said: Double entendre classification. Proceedings of the ACL (short papers), 89--94.

Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy. 2014. Towards a general rule for identifying deceptive opinion spam. Proceedings of the ACL. The paper showing a learned classifier outperforming humans on Tripadvisor-style reviews is Ott, M, Y Choi, C Cardie, and J T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. Proceedings of the ACL, pp. 309--319.

Mihalcea, Rada and Carlo Strapparava. 2006. Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence 22(2).

What do conversations "look" like?

Scan of lecture notes

Aside: email corpora

The Enron corpus: see this data site. MIT Tech Review '13 article: The immortal life of the Enron e-mails.
The W3C Enterprise Track TREC data
The LDC's Avocado Research Email Collection
The paper “Topic-Specific Communication Patterns from Email Data,” Bruce Desmarais, Peter Krafft, Hanna Wallach, James ben-Aaron, and Juston Moore, presented at the 2012 Text as Data conference, mentions an email dataset consisting of "the entire inboxes and outboxes of the county managers of New Hanover County, North Carolina from the month of February, 2011."

References

Backstrom, Lars, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. Proceedings of WSDM, pp. 13–22. [alt link]

Elsner, Micha and Eugene Charniak. September 2010. Disentangling chat. Computational Linguistics 36(3): 389-409. [data and code]

Gonzalez-Bailon, Sandra, Andreas Kaltenbrunner, and Rafael E Banchs. 2010. The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology 25(2): 230-243.

Kumar, Ravi, Mohammad Mahdian, and Mary McGlohon. 2010. Dynamics of conversations. Proceedings of KDD, pp. 553--562.

Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry, and Yuanxin Wang. 2014. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning 95:381--421. [alt link]. [The talk slides we looked at in class]

Prabhakaran, Vinodkumar, Ashima Arora, and Owen Rambow. 2014. Power of confidence: How poll scores impact topic dynamics in political debates. ACL joint workshop on social dynamics and personal attributes.

Prabhakaran, Vinodkumar and Owen Rambow. 2014. Predicting power relations between participants in written dialog from a single thread. Proceedings of the ACL (short papers).

Seo, Jangwon, W. Bruce Croft, and David A. Smith. 2009. Online community search using thread structure. Proceedings of CIKM, pp. 1907--1910.

Siersdorfer, Stefan, Sergiu Chelaru, Jose San Pedro, Ismail Sengor Altingovde, and Wolfgang Nejdl. July 2014. Analyzing and mining comments and comment ratings on the social web. ACM Trans. Web 8 (3): 17:1-17:39. [alt link]

Wang, Yi-Chia, Mahesh Joshi, and Carolyn Penstein Rosé. 2008. Investigating the effect of discussion forum interface affordances on patterns of conversational interactions. Proceedings of CSCW, pp. 555–558.

Checkpoints of A1 projects; Discourse phenomena: clues regarding structure

Image source: http://www.metmuseum.org/toah/works-of-art/49.70.33. "The image is one for which Picasso did a number of variations in Paris during the autumn–winter of 1912; in each version, a tall bottle and goblet are set out on a small round table."

Scan of lecture notes and the handout

References related to the A1 project discussions

On code switching: Auer, Peter. 2013. Code-switching in Conversation: Language, Interaction and Identity. Routledge. Sample "ACL-style" paper: Elfardy, Heba and Mona Diab. 2012. Token level identification of linguistic code switching, Proceedings of COLING.
The LIWC lexicon (categorized word lists): Tausczik, Y R and J W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1). Technically, single licenses are available for purchase from this site. A freely available similar type of lexicon is the Harvard General Inquirer lexicon.
Duplicate detection: Sorokina, Daria, Johannes Gehrke, Simeon Warner, and Paul Ginsparg. 2006. Plagiarism detection in arxiv. In Proceedings of the Sixth International Conference on Data Mining, 1070-1075. The code may be available upon request; there's a Java port of the original C code, but available only as a web service, I think, here.
Humorous reviews subreddit, Funny reviews on Amazon subreddit, Funniest Amazon review I've ever read thread
Subcommunity review sites: DPRreview, RateMyProfessors

References from discourse lecture

Grice, H.P. 1975. Logic and Conversation. In Syntax and semantics 3: Speech Acts, pp. 41-58.

Jurafsky, Dan, and Martin, James H. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition Second edition. Chapter 21 covers discourse.

Moser, Megan and Johanna Moore. Toward a synthesis of two accounts of discourse structure. Computational Linguistics 22(3):409--419.

Rogers, Todd and Michael I Norton. June 2011. The artful dodger: Answering the wrong question the right way. Journal of Experimental Psychology: Applied 17 (2).

References for the examples on the handout:

Jordan Boyd-Graber Google+ post

Allen, James. 1995. Natural Language Understanding. Benjamin/Cummings Pub Co. Second ed.

Hirst, Graeme. 1981. Anaphora in Natural Language Understanding: A Survey. Lecture Notes in Computer Science. Springer, Berlin.

Sidner, Candace Lee. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. MIT AITR-537.

Wilks, Yorick. 1975. An intelligent analyzer and understander of English. Communications of the ACM 18 (5): 264-274.

Attention, intentions, and discourse structure: the Grosz and Sidner theory

Scan of lecture notes

References:

Grosz, Barbara J., and Sidner, Candace L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204.

Mann, William C., and Thompson, Sandra A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text: Interdisciplinary Journal for the Study of Discourse 8, no. 3: 243-281.

Pinker, Steven and the Royal Society for the Encouragement of Arts, Manufactures and Commerce (RSA) Animate, posted to YouTube on Feb 10, 2011. Language as a Window into Human Nature

A2 out (deadline subsequently extended to Sept. 22)

A1 presentations, part one

A1 presentations, part two

Discussion of application of Grosz/Sidner theory in A2

"Stacking", by Alastair Hesletine. Image source: http://thumbpress.com/the-art-of-stacking-wood/

Scan of discussion notes

References (see also the previous discourse lectures)

Wikipedia entry on Deep Blue vs. Garry Kasparov (pronunciation)

Stolcke, Andreas, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics 26(3): 339--373.

Taboada, Maite and William C. Mann. 2006. Rhetorical structure theory: Looking back and moving ahead. Discourse Studies 8(3): 423-459. Gives an overview of many issues in analyzing discourse structure.

Walker, Marilyn A. 1996. Limited attention and discourse structure. Computational Linguistics 22(2): 255-264.

Read one — your choice — of the readings for Tu Sep 30 (lecture 11) and post a project proposal inspired by it to Piazza by 3pm Mon the 29th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account.

And, read each other's proposals, commenting as you see fit, before class on the 30th.

#10

Language adaptation, power and within-group lifespan

Scan of lecture notes

Danescu-Niculescu-Mizil, Cristian, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. Proceedings of WWW, pp. 699--708. Link includes access to datasets, talk slides, etc. ACM link is here.

References

http://minimalmovieposters.tumblr.com/archive

Beňuš, Štefan, Rivka Levitan, and Julia Hirschberg. 2012. Entrainment in spontaneous speech: The case of filled pauses in supreme court hearings. Proceedings of the 3rd IEEE Conference on Cognitive Infocommunications.

Bramsen, Philip, Martha Escobar-Molana, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. Proceedings of ACL HLT.

Choudhury, Tanzeem and Alex Pentland. 2004. Characterizing social networks using the sociometer. Proceedings of the North American Association of Computational Social and Organizational Science (NAACSOS)

Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. Proceedings of the ACL. Real-life application and links to data and code.

Diehl, Christopher P., Galileo Namata, and Lise Getoor. 2007. Relationship identification for social network discovery. Proceedings of the AAAI Workshop on Enhanced Messaging, pp. 546--552.

Gilbert, Eric. 2012. Phrases that signal workplace hierarchy. Proceedings of CSCW.

Leber, Jessica. 2013. The immortal life of the Enron e-mails. Business News.

Ng, Sik Hung and James J Bradac. 1993. Power in Language: Verbal Communication and Social Influence. Sage Publications, Inc.

Vinod Prabhakaran and Owen Rambow's work on inferring power relationships

#11

Project-possibilities discussion

Image source: http://xkcd.com/1055/

The assigned reading: one of:

Glasgow, Kimberly, Clayton Fink, and Jordan Boyd-Graber. 2014. Our grief is unspeakable: Automatically measuring the community impact of a tragedy. Proceedings of ICWSM.
Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on Kickstarter. Proceedings of CSCW.

Sites examined or mentioned during class

A sample www.debate.org "duel"; notice the plethora of explicit or implicit evaluative annotations. Other debate sites include createdebate, www.forandagainst.com, www.convinceme.net, and idebate.org. See also the Internet Argument Corpus, described in A corpus for research on deliberation and debate. Marilyn A. Walker, Pranav Anand, Jean E. Fox Tree, Rob Abbott, Joseph King. LREC 2012.
Scrape of 87K Kickstarter projects. Discovered by looking at the Greenberg et al. 2013 paper mentioned below, who linked to thekickbackmachine.com by neight-allen. That site itself is no longer maintained, and there may be a problem with the file, but "dm" at thekickbackmachine.com recommends Walter Haas' Kickspy.
- Potato-salad kickstarter project.
Data on 45K Kickstarter projects, by "Jeanne"
Memetracker website, including data: http://www.memetracker.org/index.html. "Palling around with terrorists" quote graph (links = share many words). "Heartbeat" figure for when blogs vs. the mainstream media discuss an event. See publication info below.

Note also the ICWSM 2011 Spinn3r dataset: over 386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th. It spans events such as the Tunisian revolution and the Egyptian protests (see http://en.wikipedia.org/wiki/January_2011 for a more detailed list of events spanning the dataset's time period).

A sample Quirky appeal for help in pricing, tagline selection, etc.
randomactsofpizza.com. Note the need for guarding against scam attempts, described in the "Best Practices" section. The actual subreddit is here. Example of reciprocity offer. Note annotation of who actually received pizza. See publication info below.
TREC 2011 microblog-track tweet dataset and link to twitter-tools by Jimmy Lin ("lintool"), which "provides support for removing deleted tweets from your copy of the corpus", as is important to be compliant with Twitter's policies.

References (including some that came up during class)

Althoff, Tim, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. Proceedings of ICWSM.

Bailey, Michael, Daniel J Hopkins, and Todd Rogers. 2013. Unresponsive and unpersuaded: The unintended consequences of voter persuasion efforts. Working paper on SSRN.

Bamman, David, Brendan O'Connor, and Noah Smith. 2012. Censorship and deletion practices in Chinese social media. First Monday 17(3).

Bell, Brad E and Elizabeth F Loftus. May 1989. Trivial persuasion in the courtroom: The power of (a few) minor details. Journal of Personality and Social Psychology 56(5):669-679.

Danescu-Niculescu-Mizil, Cristian, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. Proceedings of the ACL, pp. 892--901.

Gayo-Avello, Daniel. December 2013. A meta-analysis of state-of-the-art electoral prediction from Twitter data. Social Science Computer Review 31(6): 649-679. Hat tip to Brendan O'Connor; I saw this on his 2013 blog post Some analysis of tweet shares and “predicting” election outcomes. Also of interest, for the title alone: Gayo-Avello's "I wanted to predict elections with twitter and all I got was this lousy paper" -- A balanced survey on election prediction using twitter data, Eprint ArXiv:1204.6441 and On Twitter and Elections, catchy paper titles, press releases and telling scientist's opinions from facts: A brief comment to DiGrazia et al. 2013 and to Fabio Rojas Op-Ed in Washington Post.

Greenberg, Michael D, Bryan Pardo, Karthic Hariharan, and Elizabeth Gerber. 2013. Crowdfunding support tools: Predicting success & failure. Proceedings of CHI: Extended Abstracts, pp. 1815--1820.

Guerini, Marco, Carlo Strapparava, and Oliverio Stock. 2010. Evaluation metrics for persuasive NLP with Google adwords. Proceedings of LREC.

Hannak, Aniko, Drew Margolin, Brian Keegan, and Ingmar Weber. 2014. Get back! You don't know me like that: The social mediation of fact checking interventions in Twitter conversations. Proceedings of ICWSM.

King, Gary, Jennifer Pan, and Margaret E Roberts. 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review 107(02): 326-343.

Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD, 497-506.

Petrovic, Sasa, Miles Osborne, and Victor Lavrenko. 2013. I wish I didn't say that! Analyzing and predicting deleted messages in Twitter. eprint arXiv:1305.3107.

Qazvinian, Vahed, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. Proceedings of EMNLP, 1589--1599.

Thelwall, Mike, Kevan Buckley, and Georgios Paltoglou. 2011. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology 62(2): 406-418.

Read one — your choice — of the readings for Tu Oct 7 (lecture 13) and post a project proposal inspired by it to Piazza by 3pm Mon the 6th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account.

And, read each other's proposals, commenting as you see fit, before the in-class discussion.

#12

Oct 2

Project-possibilities discussion

Image source: http://qwantz.com/index.php?comic=1317

Class is at 3:30 - let's say the Theory Lab.

Papers to be presented:

Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. Proceedings of ICWSM, pp. 353--360.
Acton, Eric K. 2011. On gender differences in the distribution of um and uh. Penn working papers in Linguistics: Selected papers from NWAV 17.

Some conversation/transcript corpora (Cornell is a member of the LDC and so has access to LDC corpora)

Always a good idea to check the corpora mailing list
AMI Meeting Corpus (many annotations): "a multi-modal data set consisting of 100 hours of meeting recordings...Around two-thirds of the data has been elicited using a scenario in which the participants play different roles in a design team, taking a design project from kick-off to completion over the course of a day. The rest consists of naturally occurring meetings in a range of domains"
British Columbia Conversation Corpus (40 email threads)
Enron email dataset
London-Lund corpus of spoken English: mainly face-to-face conversations but also telephone conversations, interviews, radio discussions, sports commentaries, political speeches, court proceedings, etc., and totals nearly half a million words
Penn Discourse Treebank
Saarbrücken Corpus of Spoken English
Santa Barbara Corpus of Spoken American English: based on a large body of recordings of naturally occurring spoken interaction from all over the United States. The Santa Barbara Corpus represents a wide variety of people of different regional origins, ages, occupations, genders, and ethnic and social backgrounds. The predominant form of language use represented is face-to-face conversation, but the corpus also documents many other ways that people use language in their everyday lives: telephone conversations, card games, food preparation, on-the-job talk, classroom lectures, sermons, story-telling, town hall meetings, tour-guide spiels, and more.
Supreme Court dialogs corpus
Switchboard corpus (also in NLTK)
Talkbank looks like a rich set of different domains, including child-directed speech (do check the rules)
"Unshared" task in poliInformatics. Includes meeting transcripts of the Federal Open Market Committee (keep scrolling down the page to also get to tools). A Wall Street Journal article asking for help analyzing the scheduled and emergency meeting transcripts

References

Attempt to reframe (reclaim?) "fracking". Ditto for Obamacare

Centrality measures - here's a presentation by Peter Dodds

Choi, Eunsol, Chenhao Tan, Lillian Lee, Cristian Danescu-Niculescu-Mizil, and Jennifer Spindel. June 2012. Hedge detection as a lens on framing in the GMO debates: A position paper. Proceedings of the ACL Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics

Clark, Herbert H., and Fox Tree, Jean E. 2002. Using uh and um in spontaneous speaking. Cognition 84, no. 1: 73--111.

Gonzales, Amy L., Jeffrey T. Hancock, and James W. Pennebaker. 2010. Language style matching as a predictor of social dynamics in small groups. Communication Research 37(1): 3-19.

Greene, Stephan and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. NAACL, pp. 503--511.

Ireland, Molly E., Richard B. Slatcher, Paul W. Eastwick, Lauren E. Scissors, Eli J. Finkel, and James W. Pennebaker. 2011. Language style matching predicts relationship initiation and stability. Psychological Science 22(1): 39-44.

Kleinberg, Jon. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (5): 604–632.

Liberman, Mark. 2014. Language Log post on all sorts of aspects of the uh/um divide.

Omodei, Elisa, Thierry Poibeau, and Jean-Philippe Cointet. 2012. Multi-level modeling of quotation families morphogenesis. Proceedings of ASE/IEEE SocialCom.

Ranganath, Rajesh, Dan Jurafsky, and Dan McFarland. 2009. It's not you, it's me: Detecting flirting and its misperception in speed-dates. Proceedings of EMNLP.

Schneider, Nathan, Rebecca Hwa, Philip Gianfortoni, Dipanjan Das, Michael Heilman, Alan W. Black, Frederick L. Crabbe, and Noah A. Smith. 2010. Visualizing Topical Quotations Over Time to Understand News Discourse. CMU-LTI-01-103, CMU.

Tagliamonte, Sali. 2005. So who? Like how? Just what? Discourse markers in the conversations of young Canadians. Journal of Pragmatics 37(11): 1896-1915.

#13

Project-possibilities discussion

Image source: www.catandgirl.com/?p=2105

Garley, Matt and Julia Hockenmaier. 2012. Beefmoves: Dissemination, diversity, and dynamics of English borrowings in a German hip hop forum. Proceedings of ACL.
Vasilescu, Bogdan, Alexander Serebrenik, Prem Devanbu, and Vladimir Filkov. 2014. How social Q&A sites are changing knowledge sharing in open source software communities. Proceedings of CSCW, pp. 342--354.

Some things discussed or tried in class

About the Cornell Institutional Review Board for Human Participants. Flowchart: how to decide if your activity is covered by Cornell's Human Research Protection Program.
Link generated for sharing a StackOverflow question for Twitter: http://stackoverflow.com/q/23639039?stw=2 ; for email: http://stackoverflow.com/q/23639039?sem=2 , etc. Note that we can search for such things; for example: https://twitter.com/search?q=stackoverflow.com%20stw%3D2 (this is a query for stackoverflow.com and stw=2).
- About collecting data from Twitter:
  - Tutorial on using twitteR (R) to mine Twitter attitudes towards airlines
  - REST API documentation and Streaming API
  - Cristian Danescu-Niculescu-Mizil's advice:
    
    (a) Rate limitations: http://dev.twitter.com/pages/rate-limiting
    Not complying with the rate limitations will result in your IP getting blacklisted. For the REST API there is a clear limit and an easy way to track your limit status: https://api.twitter.com/1.1/application/rate_limit_status.json (which should be called frequently by your code).
    
    (b) When interacting with the API, use exception clauses. Many things can go wrong. When handling the exceptions, keep (a) in mind.
    
    (c) Design your data gathering so that it can be easily restarted, without losing what was already collected.
    
    (d) For the most popular programming languages there are many Twitter libraries that can be used to send requests to the API: https://dev.twitter.com/overview/api/twitter-libraries
Query for code-switching examples in Thai: https://twitter.com/search?q=%22code-switching%22%20lang%3Ath .
- 55555, or, How to laugh online in other languages. Megan Garber, The Atlantic, 2012.
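Points (a) through (c) of the Twitter data-collection advice above can be illustrated with a minimal Python sketch. It is not tied to any particular Twitter library: `collect`, `fetch`, `out_path`, and `min_interval` are hypothetical names introduced here, and `fetch` stands in for whatever API call you actually make. The fixed pause is only a crude stand-in for checking the rate-limit status endpoint mentioned in (a).

```python
import json
import os
import time


def collect(item_ids, fetch, out_path, min_interval=1.0, max_retries=3,
            sleep=time.sleep):
    """Fetch each id at most once, appending results to out_path as JSON lines.

    `fetch` is a hypothetical callable (id -> JSON-serializable record);
    replace it with a real API request.
    """
    # (c) Restartability: ids already saved on disk are skipped on a rerun.
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except ValueError:
                    # Ignore a partially written line left by a crash.
                    continue

    with open(out_path, "a") as out:
        for item_id in item_ids:
            if item_id in done:
                continue
            for attempt in range(max_retries):
                try:
                    record = fetch(item_id)
                except Exception:
                    # (b) Many things can go wrong; back off before retrying
                    # so that the retries themselves respect (a).
                    sleep(min_interval * 2 ** (attempt + 1))
                    continue
                out.write(json.dumps({"id": item_id, "data": record}) + "\n")
                out.flush()  # keep what was collected even if we crash later
                done.add(item_id)
                break
            # (a) Pause between requests to stay under the rate limit.
            sleep(min_interval)
    return done
```

Rerunning `collect` with the same `out_path` picks up where the previous run stopped, since every saved id is read back before any new request is sent. An id that still fails after `max_retries` attempts is left out of the file and will be tried again on the next run.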

References

Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. Proceedings of WWW.

Farshad Kooti, Haeryun Yang, Meeyoung Cha, Krishna Gummadi, and Winter Mason. 2012. The emergence of conventions in online social networks. Proceedings of ICWSM. Best paper award.

Vasilescu, Bogdan, Andrea Capiluppi, and Alexander Serebrenik. 2012. Gender, representation and online participation: A quantitative study of stackoverflow. Proceedings of Social Informatics, pp. 332--338.

Assignments:

Sign up by 3pm Wed the 15th for a check-up appointment, which will be held on Thursday the 16th. Here is the sign-up link.
By 3pm on Tue the 21st, post your two+ paragraph informal term-project proposal draft on Piazza. If you've already decided to team up, just one person on the team posts.
- At a minimum, your proposal should give the main idea, why you think it is interesting, the dataset you plan to use, and as precise an indication as you can give of what you intend to investigate as the minimum criteria for completion.
- Beyond the minimum, the more thought you put into proving that your project is feasible, the more useful this step will be.
Between 3pm on Tuesday the 21st and 9am on Thursday the 23rd, do the best you can on Piazza to provide helpful comments to each other, perhaps decide to team up, etc.

#14

Project-possibilities discussion

Sites (in addition to the last time we talked about kickstarter)

Transcripts of the meetings of the Federal Open Market Committee (FOMC). I believe I've been told that very important decisions get made in the meetings; pre-1993, the meeting transcripts were not meant to be made public, which means we can perhaps assume that the problem-solving is genuine. Incidentally, here's an Economist article summarizing a paper (see citation below) saying that the change in transcript privacy policy corresponds to a clear change in language behavior.
Gentoo's Bugzilla
http://www.gofundme.com
http://kickingitforward.org
http://www.kiva.org and kiva data snapshots
MathOverflow and stackexchange API and some older mathoverflow data dumps
https://www.patreon.com
Polymath wiki

References

An, Jisun, Daniele Quercia, and Jon Crowcroft. 2014. Recommending investors for crowdfunding projects. Proceedings of the 23rd International Conference on World Wide Web, pp. 261--270.

Barany, Michael J. 2010. '[B]ut this is blog maths and we're free to make up conventions as we go along': Polymath1 and the modalities of 'massively collaborative mathematics'. Proceedings of the 6th International Symposium on Wikis and Open Collaboration, pp. 10:1--10:9.

Barron, Brigid. November 2009. Achieving coordination in collaborative problem-solving groups. The Journal of the Learning Sciences 9(4): 403–436.

Cranshaw, Justin and Aniket Kittur. 2011. The polymath project: Lessons from a successful online collaboration in mathematics. Proceedings of CHI, pp. 1865--1874.

Fogarty, Mignon (Grammar Girl), 2014. What new research on the brain says every writer should do.

Fort, Karën, Gilles Adda, and K Bretonnel Cohen. 2011. Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics 37(2): 413-420.

Hansen, Stephen, Michael McMahon, and Andrea Prat. 2014. Transparency and deliberation within the FOMC: A computational linguistics approach. Centre for Economic Policy Research, paper no 9994.

Roschelle, Jeremy and Stephanie D Teasley. 1995. The construction of shared knowledge in collaborative problem solving. Proceedings of the NATO Advanced Research Workshop on Computer Supported Collaborative Learning, pp. 69--97.

Willemyns, Michael, Cynthia Gallois, and Victor J Callan. 2006. Conversations between postgraduate students and their supervisors: Intergroup communication and accommodation. Proceedings of the World Congress on the Power of Language: Theory, Practice and Performance.

Xu, Anbang, Xiao Yang, Huaming Rao, Wai-Tat Fu, Shih-Wen Huang, and Brian P Bailey. 2014. Show me the money!: An analysis of project updates during crowdfunding campaigns. Proceedings of CHI, pp. 591--600.

Oct 14

Fall Break

#15

No class meeting — instead, individual meetings throughout the day for performance (and, if desired, potential project) feedback.

#16

Tales from the trenches: data scraping

Slides by Amit Sharma and Chenhao Tan

Notes taken during class

Resources mentioned in discussion. All descriptions below taken from the linked webpages

Beautiful Soup, Python library designed for quick turnaround projects like screen-scraping
curl man page. Transfer a URL
cron
pandas, Python Data Analysis Library
/r/redditdev, subreddit for discussion of reddit API clients and the reddit source code.
redis, open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server
social-integrator by Amit Sharma. A project to provide convenient API access to programmers and researchers for downloading data. Currently supports lastfm.
Tweepy, An easy-to-use Python library for accessing the Twitter API.
Twitter social graph 2009 Results of a full crawl of the entire Twitter site. 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. This data was collected for the paper, What is Twitter, a Social Network or a News Media? by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
wget Wikipedia page, retrieves content from web servers
WordNet and SentiWordNet

#17

Class discussion of project proposals and feedback

Upcoming submission deadlines: NAACL long & short papers: sub Dec 4, notification Feb 20, multiple submissions OK. ICWSM full and poster: abstracts Jan 18, papers Jan 23, notification March 9. WWW "Web Science track": abstracts Jan 19, papers Jan 23, notification Feb 27.

Post to Piazza as a followup to your proposal what you commit to doing in the next week/1.5 weeks or so.

#18

Bayesian identification of features distinguishing two sub-languages

Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-never-tell-me-the-odds-1/.

Scan of lecture notes

Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403.
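As a rough illustration of the kind of statistic this lecture's title refers to, here is a minimal Python sketch of z-scored log-odds ratios with a Dirichlet prior, in the spirit of Monroe, Colaresi, and Quinn. It is a sketch under stated assumptions rather than a transcription of the lecture notes: the function name `fightin_words_z` and the `prior_scale` parameter are made up for this example, and the prior is assumed to be proportional to the pooled word frequencies of the two samples (in practice a larger background corpus may be a better source for the prior).

```python
import math
from collections import Counter


def fightin_words_z(counts_a, counts_b, prior_scale=10.0):
    """Return {word: z-score}; positive means the word leans toward corpus A.

    counts_a, counts_b: dicts mapping word -> count in each sub-language.
    prior_scale: total mass of the Dirichlet prior (an assumed setting).
    Requires at least two distinct words overall.
    """
    pooled = Counter(counts_a) + Counter(counts_b)
    total = sum(pooled.values())
    # Assumption: prior pseudo-counts proportional to pooled frequency.
    alpha = {w: prior_scale * pooled[w] / total for w in pooled}
    alpha0 = prior_scale  # the pseudo-counts sum to prior_scale
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())

    z = {}
    for w in pooled:
        ya = counts_a.get(w, 0) + alpha[w]
        yb = counts_b.get(w, 0) + alpha[w]
        # Smoothed log-odds of the word within each corpus.
        log_odds_a = math.log(ya / (n_a + alpha0 - ya))
        log_odds_b = math.log(yb / (n_b + alpha0 - yb))
        # Approximate variance of the difference of the two log-odds.
        variance = 1.0 / ya + 1.0 / yb
        z[w] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z
```

Sorting the returned dictionary by value gives the words most characteristic of each side. Words that are frequent in both samples but evenly split tend to get z-scores near zero, which is the main practical advantage over ranking by raw frequency differences.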

Additional references:

Google books n-gram corpus

Kenneth W. Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p². Proceedings of COLING.

Fredette, Marc and Jean-François Angers. 2002. A new approximation of the posterior distribution of the log-odds ratio. Statistica Neerlandica 56(3): 314-329.

Kleinberg, Jon. 2002. Bursty and hierarchical structure in streams. Proceedings of KDD, pp. 91-101.

Percy Liang and Dan Klein. 2007. Tutorial: Structured Bayesian nonparametric models with variational inference. Included for material and visualizations of Dirichlets.

Liberman, Mark. 2014. Obama's favored (and disfavored) SOTU words. Language Log blog post, using the Monroe/Colaresi/Quinn method.

Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on kickstarter. Proceedings of CSCW.

FAQ: How do I interpret odds ratios in logistic regression? Introduction to SAS. UCLA: Statistical Consulting Group.

#19

Continuation of "Fightin' Words"

Scan of lecture notes

Deferred, for the most part, until next class meeting:

Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352.

Resources

MRC Psycholinguistic database. Citation: Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11. A search interface makes it clear what kind of features (annotations) are identified for the lexicon items.

#20

Nov 4

No class meeting — individual team check-up meetings instead.

#21

No class meeting — individual team check-up meetings instead.

#22

No class meeting — Veterans Day

#23

Case study of feature ingenuity; grammars of various sorts

Image source: http://popchartlab.com/products/a-diagrammatical-dissertation-on-opening-lines-of-notable-novels, available for purchase; image crop by Popular Science

Scan of lecture notes

Our starting point: Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352.

Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Slides by Kathy McCoy corresponding to chapter 14 - slide 7 (Fig 14.12) is what I showed in class, and slides 17 and 18 (need to split categories), 20 and 22 (importance of lexical information) were ones I contemplated showing. Section 12.7 discussed dependency parsing. I also displayed figure 12.14, a dependency parse, from chapter 12.

Notes for a lecture I gave on context-free grammars in 2007, scribed by Cristian Danescu-Niculescu-Mizil, Nam Nguyen, and Myle Ott.

References

Abeillé, Anne and Yves Schabes. 1989. Parsing idioms in lexicalized TAGs. EACL, pp. 1–9.

Jäger, Gerhard and James Rogers. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences. I picked this for being brief, hitting the whole Chomsky hierarchy, and mentioning the mildly-context-sensitive languages and their relation to natural language; but one would not necessarily argue that this is an easy introduction.

Kroch, Anthony and Aravind Joshi. 1985. The linguistic relevance of tree adjoining grammar. UPenn Technical Report MS-CIS-85-16.

Joshi, Aravind K. and Yves Schabes. 1997. Tree-adjoining grammars. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 3 (Beyond words), pp. 69–12.

Joshi, Aravind, K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber and Thomas Wasow, Eds., Foundational Issues in Natural Language Processing. Link is to a technical-report version.

Nadeau, David and Sekine, Satoshi. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1): 3-26. Alternative link.

Pullum, Geoffrey K. 1986. Topic ... Comment: Footloose and context-free. Natural Language & Linguistic Theory 4(3): 409-414. Comments on attempts to prove that natural languages are not context-free.

The MaltParser dependency parser. An early paper, mentioning that "The runtime of the algorithm is linear in the length of the input string, and the dependency graph is guaranteed to be projective and acyclic": Nivre, Joakim. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03), pp. 149-160.

The XTAG project.

#24

Tour of (mostly non-ngram) language models

Scan of lecture notes

References

Barzilay, Regina and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, 113-120. Original code by Regina Barzilay (in Lisp); code by Alexandre Passos; other code for later versions

Booth, Taylor L. and Richard A. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers 100(5): 442-450.

Chi, Zhiyi and Stuart Geman. June 1998. Estimation of probabilistic context-free grammars. Computational Linguistics 24(2): 299-305.

Lari, K. and S.J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language 4(1): 35-56. Code by Mark Johnson.

Manning, Christopher D. and Hinrich Schuetze. 1999. Section 11.1 "Some features of PCFGs", which can be found in Chapter 11 of Foundations of Statistical Natural Language Processing. MIT Press.

Rabiner, Lawrence R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257--286. Errata by Ali Rahimi

#25

No class meeting — individual team check-up meetings instead.

#26

Language models: characterization and comparison

Image source: http://danielsolisblog.blogspot.com/2012/01/writers-dice.html

Scan of lecture notes

References

Baez, John. 2012. The mathematics of biodiversity (part 4). Blog post.

Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4): 359-393.

Csiszár, Imre. September 2008. Axiomatic characterizations of information measures. Entropy 10(3): 261-273.

Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one. Corpus-based Research Into Language: In Honour of Jan Aarts, pp. 189--200.

Lee, Lillian. 1999. Measures of distributional similarity. Proceedings of the ACL, pp. 25--32.

Rao, Calyampudi Radhakrishna. January 2011. Entropy and cross entropy as diversity and distance measures. In International Encyclopedia of Statistical Science, pp. 440--446.
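For readers who want something concrete to experiment with, the following Python sketch compares two smoothed unigram language models using cross entropy, KL divergence, and Jensen-Shannon divergence, which are among the measures treated in the references above. It is an illustration only, not a reconstruction of the lecture notes: the function names and the add-lambda smoothing setting are choices made for this example, and add-one smoothing in particular is known to behave poorly (see Gale and Church above).

```python
import math
from collections import Counter


def add_lambda_unigram(tokens, vocab, lam=1.0):
    """Unigram model over `vocab` with add-lambda smoothing (lam > 0)."""
    counts = Counter(t for t in tokens if t in vocab)
    denom = sum(counts.values()) + lam * len(vocab)
    return {w: (counts[w] + lam) / denom for w in vocab}


def cross_entropy(p, q):
    """H(p, q) in bits; assumes q[w] > 0 wherever p[w] > 0."""
    return -sum(p[w] * math.log2(q[w]) for w in p if p[w] > 0)


def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p), in bits."""
    return cross_entropy(p, q) - cross_entropy(p, p)


def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint."""
    avg = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl_divergence(p, avg) + 0.5 * kl_divergence(q, avg)


# Toy usage: two tiny "corpora" over a shared vocabulary.
text_a = "the cat sat on the mat".split()
text_b = "the dog sat on the log".split()
vocab = set(text_a) | set(text_b)
p = add_lambda_unigram(text_a, vocab)
q = add_lambda_unigram(text_b, vocab)
print(kl_divergence(p, q), jensen_shannon(p, q))
```

Smoothing matters here because the KL divergence is infinite whenever the second model assigns zero probability to a word the first model uses; the Jensen-Shannon divergence avoids that problem by comparing each model to their average.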

Nov 27

Thanksgiving Break

#27

Dec 2

No class meeting — individual team check-up meetings instead.

#28

10-minute project presentations.

Final-project write-up due date, as determined by the registrar: December 11 at 4:30 pm. I have no particular page length in mind, but please highlight the most interesting findings (positive or negative). You should include the following sections: introduction/motivation, related work, data description (how you gathered, cleaned, and processed it), methods, experiments, what you learned and concluded, and directions for future work. You don't need to be particularly formal. My primary evaluation criteria will be the reasonableness (in approach and amount of effort), thoughtfulness, and creativity of what you tried.

Code for generating the calendar above and css was (barely) adapted from the original versions created by Andrew Myers.

CS 6742 Fall 2014: Natural Language Processing and Social Interaction