Course homepage Main site for course info, assignments, readings, lecture references, etc.; updated frequently.
Course CMS page Site for submitting assignments, unless otherwise noted.
Course Piazza page Course announcements and Q&A/discussion site. Social interaction and all that, you know.
Instructor Professor Lillian Lee. For contact info, see
Time and place Tuesdays and Thursdays, 10:10-11:25, Hollister 401 (since this room has reconfigurable seating) Gates Hall 344 breakout room (quietly enter through 344, since students are working there, and go to the room on the right).
This page last modified Thu August 20, 2015 4:24 AM.

Brief course description More and more of life is now manifested online, and many of the digital traces that are left by human activity are increasingly recorded in natural-language format. This research-oriented course examines the opportunities for natural language processing to contribute to the analysis and facilitation of socially embedded processes. Possible topics include sentiment analysis, learning social-network structure, analysis of text in political or legal domains, review aggregation systems, analysis of online conversations, and text categorization with respect to psychological categories.

Prerequisites As previously announced in the 2014-2015 Courses of Study, enrollment is limited to PhD students except by permission of instructor. August 14 addition: given the number of PhD students who have registered for credit, permission will not be granted to non-PhD students, and auditing will not be allowed. Required background: CS 2110 or equivalent programming experience, and at least one course in artificial intelligence or any relevant subfield (e.g., NLP, information retrieval, machine learning).

Related courses In Fall 2014, there's CS4744 Computational linguistics, CS6783 Machine learning theory, CS6788/INFO 6150 Advanced topic modeling, ECE 5960 Graphical models, IS 6320 Games, economic behavior, and the Internet. In Spring 2015, there's CS 4740 Natural language processing, and new IS professor Cristian Danescu-Niculescu-Mizil may be offering a course quite similar to CS6742.

Informative links


QUICK LINKS into the lecture table below
Tuesdays Thursdays

8/26, lec 1: overview

8/28, lec 2: reviews, helpfulness, social interaction

9/2, lec 3: reviews and social, cont.

9/4, lec 4: what do conversations "look" like?

9/9, lec 5: discourse

9/11, lec 6: discourse, cont.

9/16, lec 7: A1 presentations

9/18, lec 8: A1 presentations

9/23, lec 9: discourse, cont.

9/25, lec 10: adaptation

9/30, lec 11: Unspeakable/Kickstarter

10/2, lec 12: Meme mutation/gendered um and uh

10/7, lec 13: hiphopgangstaghettorapper/stackoverflow vs. email. Plus the IRB; finding data samples on Twitter

10/9, lec 14: donating and collaborating

(Fall break)

10/16, lec 15: checkup appointments

10/21, lec 16: scraping

10/23, lec 17: project coordination, conference submission deadlines

10/28, lec 18: Bayesian ID of fightin' words

10/30, lec 19: fightin' words, cont.

11/4, lec 20: checkup appointments

11/6, lec 21: checkup appointments

11/11, lec 22: no meeting - Veteran's Day

11/13, lec 23: features case study: great writing; grammars

11/18, lec 24: language models, mostly non-ngram ones

11/20, lec 25: checkup appointments

11/25, lec 26: language models and comparing language models

(Thanksgiving break)

12/2, lec 27: checkup appointments

12/4, lec 28: in-class project presentations



Lecture Date Agenda and references Assignments and other handouts
#1 Aug 26

Course overview: scope, course goals, course design

The school of Athens - people talking and reading

Image source: Some people are speaking to each other; some are reading and perhaps being influenced by that text; some are writing text, perhaps hoping to have an effect on others; some texts are being read by several people simultaneously.

Scan of lecture notes

Images and webpages displayed in class:


Bryan, Christopher J, Gregory M Walton, Todd Rogers, and Carol S Dweck. 2 August 2011. Motivating voter turnout by invoking the self. Proceedings of the National Academy of Sciences 108 (31): 12653-12656.

Chong, Dennis and James N. Druckman. 2007. Framing theory. Annual Review of Political Science 10:103--126.

Assignment 1 (A1) officially released

#2 28

To what extent is there social interaction on review sites?

Image source: Dorothy Gambrel, Cat and Girl: Permission policy here.

Scan of lecture notes

Images and webpages displayed in class:


Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. Proceedings of WWW, pp. 307--318.

Gilbert, Eric and Karrie Karahalios. 2010. Understanding deja reviewers. Proceedings of CSCW, pp. 225--228. [ACM link]

Jurafsky, Dan, Victor Chahuneau, Bryan R. Routledge and Noah A. Smith. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19(4).

Michael, Loizos and Jahna Otterbacher. 2014. Write like I write: Herding in the language of online reviews. Proceedings of ICWSM.

Mimno, David. Data carpentry. 2014.

Pinch, Trevor and Filip Kesler. 2011. How Aunt Ammy gets her free lunch: A study of the top-thousand customer reviewers at

#3 Sep 2

Review "quality" and "helpfulness": a lens for studying social influence

Image source: Randall Munroe, xkcd (click on image for original link). Expletive obscured in this presentation.

Scan of lecture notes

Images and handouts from class

References on lecture topics

Cheng, Justin, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2014. How community feedback shapes user behavior. Proceedings of ICWSM.

Danescu-Niculescu-Mizil, Cristian, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. 2009. How opinions are received by online communities: A case study on helpfulness votes. Proceedings of WWW: 141--150. [alt link]

Ghose, Anindya and Panagiotis Ipeirotis. 2011. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23(10): 1498--1512. Official link can be found through Worldcat, e.g., here.

Muchnik, Lev, Sinan Aral, and Sean Taylor. 2013. Social influence bias: A randomized experiment. Science 341.

Otterbacher, Jahna. 2009. 'Helpfulness' in online communities: a measure of message quality. Proceedings of CHI, 955-964.

Sipos, Ruben, Arpita Ghosh, and Thorsten Joachims. 2014. Was this review helpful to you? It depends! Context and voting patterns in online content. Proceedings of WWW.

Wang, R.Y. and Strong, D.M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5-34.

Representative additional references on "unconventional" text classification, by popular demand

Davidov, Dmitry, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 107--116.

Kiddon, Chloé and Yuriy Brun. That's what she said: Double entendre classification. Proceedings of the ACL (short papers), 89--94.

Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy. 2014. Towards a general rule for identifying deceptive opinion spam. Proceedings of the ACL. The paper showing a learned classifier outperforming humans on Tripadvisor-style reviews is Ott, M, Y Choi, C Cardie, and J T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. Proceedings of the ACL, pp. 309--319.

Mihalcea, Rada and Carlo Strapparava. 2006. Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence 22(2).

#4 4

What do conversations "look" like?

Scan of lecture notes

Aside: email corpora


Backstrom, Lars, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. Proceedings of WSDM, pp. 13–22. [alt link]

Elsner, Micha and Eugene Charniak. September 2010. Disentangling chat. Computational Linguistics 36(3): 389-409. [data and code]

Gonzalez-Bailon, Sandra, Andreas Kaltenbrunner, and Rafael E Banchs. 2010. The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology 25(2): 230-243.

Kumar, Ravi, Mohammad Mahdian, and Mary McGlohon. 2010. Dynamics of conversations. Proceedings of KDD, pp. 553--562.

Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A Cai, Jennifer E Midberry, and Yuanxin Wang. 2014. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning 95:381--421. [alt link]. [The talk slides we looked at in class]

Prabhakaran, Vinodkumar, Ashima Arora, and Owen Rambow. 2014. Power of confidence: How poll scores impact topic dynamics in political debates. ACL joint workshop on social dynamics and personal attributes.

Prabhakaran, Vinodkumar and Owen Rambow. 2014. Predicting power relations between participants in written dialog from a single thread. Proceedings of the ACL (short papers).

Seo, Jangwon, W. Bruce Croft, and David A. Smith. 2009. Online community search using thread structure. Proceedings of CIKM, pp. 1907--1910.

Siersdorfer, Stefan, Sergiu Chelaru, Jose San Pedro, Ismail Sengor Altingovde, and Wolfgang Nejdl. July 2014. Analyzing and mining comments and comment ratings on the social web. ACM Trans. Web 8 (3): 17:1-17:39. [alt link]

Wang, Yi-Chia, Mahesh Joshi, and Carolyn Penstein Rosé. 2008. Investigating the effect of discussion forum interface affordances on patterns of conversational interactions. Proceedings of CSCW, pp. 555–558.

#5 9

Checkpoints of A1 projects; Discourse phenomena: clues regarding structure

Image source: "The image is one for which Picasso did a number of variations in Paris during the autumn–winter of 1912; in each version, a tall bottle and goblet are set out on a small round table."

Scan of lecture notes and the handout

References related to the A1 project discussions


References from discourse lecture

Grice, H.P. 1975. Logic and Conversation. In Syntax and semantics 3: Speech Acts, pp. 41-58.

Jurafsky, Dan, and Martin, James H. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Second edition. Chapter 21 covers discourse.

Moser, Megan and Johanna Moore. Toward a synthesis of two accounts of discourse structure. Computational Linguistics 22(3):409--419.

Rogers, Todd and Michael I Norton. June 2011. The artful dodger: Answering the wrong question the right way. Journal of Experimental Psychology: Applied 17 (2).

References for the examples on the handout:

Jordan Boyd-Graber Google+ post

Allen, James. 1995. Natural Language Understanding. Benjamin/Cummings Pub Co. Second ed.

Hirst, Graeme. 1981. Anaphora in Natural Language Understanding: A Survey. Lecture Notes in Computer Science. Springer, Berlin.

Sidner, Candace Lee. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. MIT AITR-537.

Wilks, Yorick. 1975. An intelligent analyzer and understander of English. Communications of the ACM 18 (5): 264-274.


#6 11

Attention, intentions, and discourse structure: the Grosz and Sidner theory

Scan of lecture notes


Grosz, Barbara J., and Sidner, Candace L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204.

Mann, William C., and Thompson, Sandra A. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text: Interdisciplinary Journal for the Study of Discourse 8, no. 3: 243-281.

Pinker, Steven and the Royal Society for the Encouragement of Arts, Manufactures and Commerce (RSA) Animate, posted to YouTube on Feb 10, 2011. Language as a Window into Human Nature

A2 out (deadline subsequently extended to Sept. 22)

#7 16

A1 presentations, part one

#8 18

A1 presentations, part two

#9 23

Discussion of application of Grosz/Sidner theory in A2

"Stacking", by Alastair Hesletine. Image source:

Scan of discussion notes

References: see also the previous discourse lectures

Wikipedia entry on Deep Blue vs. Garry Kasparov (pronunciation)

Stolcke, Andreas, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics 26(3): 339--373.

Taboada, Maite and William C. Mann. 2006. Rhetorical structure theory: Looking back and moving ahead. Discourse Studies 8(3): 423-459. Gives an overview of many issues in analyzing discourse structure.

Walker, Marilyn A. 1996. Limited attention and discourse structure. Computational Linguistics 22(2): 255-264.

Read one — your choice — of the readings for Tu Sep 30 (lecture 11) and post a project proposal inspired by it to Piazza by 3pm Mon the 29th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account.

And, read each other's proposals, commenting as you see fit, before class on the 30th.

#10 25

Language adaptation, power and within-group lifespan

Scan of lecture notes

Danescu-Niculescu-Mizil, Cristian, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. Proceedings of WWW, pp. 699--708. Link includes access to datasets, talk slides, etc. ACM link is here.

Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. Proceedings of WWW, pp. 307--318. Link includes access to datasets, talk slides, etc. ACM link is here.


Beňuš, Štefan, Rivka Levitan, and Julia Hirschberg. 2012. Entrainment in spontaneous speech: The case of filled pauses in supreme court hearings. Proceedings of the 3rd IEEE Conference on Cognitive Infocommunications.

Bramsen, Philip, Martha Escobar-Molana, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. Proceedings of ACL HLT.

Choudhury, Tanzeem and Alex Pentland. 2004. Characterizing social networks using the sociometer. Proceedings of the North American Association of Computational Social and Organizational Science (NAACSOS)

Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. Proceedings of the ACL. Real-life application and links to data and code.

Diehl, Christopher P., Galileo Namata, and Lise Getoor. 2007. Relationship identification for social network discovery. Proceedings of the AAAI Workshop on Enhanced Messaging, pp. 546--552.

Gilbert, Eric. 2012. Phrases that signal workplace hierarchy. Proceedings of CSCW.

Leber, Jessica. 2013. The immortal life of the Enron e-mails. Business News.

Ng, Sik Hung and James J Bradac. 1993. Power in Language: Verbal Communication and Social Influence. Sage Publications, Inc.

Vinod Prabhakaran and Owen Rambow's work on inferring power relationships

#11 30

Project-possibilities discussion

Image source:

The assigned reading: one of:

  1. Glasgow, Kimberly, Clayton Fink, and Jordan Boyd-Graber. 2014. Our grief is unspeakable: Automatically measuring the community impact of a tragedy. Proceedings of ICWSM.
  2. Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on Kickstarter. Proceedings of CSCW.

Sites examined or mentioned during class

References (including some that came up during class)

Althoff, Tim, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. Proceedings of ICWSM.

Bailey, Michael, Daniel J Hopkins, and Todd Rogers. 2013. Unresponsive and unpersuaded: The unintended consequences of voter persuasion efforts. Working paper on SSRN.

Bamman, David, Brendan O'Connor, and Noah Smith. 2012. Censorship and deletion practices in Chinese social media. First Monday 17(3).

Bell, Brad E and Elizabeth F Loftus. May 1989. Trivial persuasion in the courtroom: The power of (a few) minor details. Journal of Personality and Social Psychology 56(5):669-679.

Danescu-Niculescu-Mizil, Cristian, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. Proceedings of the ACL, pp. 892--901.

Gayo-Avello, Daniel. December 2013. A meta-analysis of state-of-the-art electoral prediction from Twitter data. Social Science Computer Review 31(6): 649-679. Hat tip to Brendan O'Connor; I saw this on his 2013 blog post Some analysis of tweet shares and “predicting” election outcomes. Also of interest, for the title alone: Gayo-Avello's "I wanted to predict elections with twitter and all I got was this lousy paper" -- A balanced survey on election prediction using twitter data, Eprint ArXiv:1204.6441 and On Twitter and Elections, catchy paper titles, press releases and telling scientist's opinions from facts: A brief comment to DiGrazia et al. 2013 and to Fabio Rojas Op-Ed in Washington Post.

Greenberg, Michael D, Bryan Pardo, Karthic Hariharan, and Elizabeth Gerber. 2013. Crowdfunding support tools: Predicting success & failure. Proceedings of CHI: Extended Abstracts, pp. 1815--1820.

Guerini, Marco, Carlo Strapparava, and Oliverio Stock. 2010. Evaluation metrics for persuasive NLP with Google adwords. Proceedings of LREC.

Hannak, Aniko, Drew Margolin, Brian Keegan, and Ingmar Weber. 2014. Get back! You don't know me like that: The social mediation of fact checking interventions in Twitter conversations. Proceedings of ICWSM.

King, Gary, Jennifer Pan, and Margaret E Roberts. 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review 107(02): 326-343.

Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD, 497-506.

Petrovic, Sasa, Miles Osborne, and Victor Lavrenko. 2013. I wish I didn't say that! Analyzing and predicting deleted messages in Twitter. eprint arXiv:1305.3107.

Qazvinian, Vahed, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. Proceedings of EMNLP, 1589--1599.

Thelwall, Mike, Kevan Buckley, and Georgios Paltoglou. 2011. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology 62(2): 406-418.

Read one — your choice — of the readings for Tu Oct 7 (lecture 13) and post a project proposal inspired by it to Piazza by 3pm Mon the 6th; include the general idea, and a suggestion for a dataset. A paragraph suffices (and more is great, if you feel inspired!). Thoughtfulness and creativity are what I'm most interested in, but take feasibility into account.

And, read each other's proposals, commenting as you see fit, before the in-class discussion.

#12 Oct 2

Project-possibilities discussion

Image source: http://

Class is at 3:30 - let's say the Theory Lab.

Papers to be presented:

  1. Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. Proceedings of ICWSM, pp. 353--360.
  2. Acton, Eric K. 2011. On gender differences in the distribution of um and uh. Penn working papers in Linguistics: Selected papers from NWAV 17.

Some conversation/transcript corpora: Cornell is a member of the LDC and so has access to LDC corpora.


Attempt to reframe (reclaim?) "fracking". Ditto for Obamacare

Centrality measures - here's a presentation by Peter Dodds

Choi, Eunsol, Chenhao Tan, Lillian Lee, Cristian Danescu-Niculescu-Mizil, and Jennifer Spindel. June 2012. Hedge detection as a lens on framing in the GMO debates: A position paper. Proceedings of the ACL Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics

Clark, Herbert H., and Fox Tree, Jean E. 2002. Using uh and um in spontaneous speaking. Cognition 84, no. 1: 73--111.

Gonzales, Amy L., Jeffrey T. Hancock, and James W. Pennebaker. 2010. Language style matching as a predictor of social dynamics in small groups. Communication Research 37(1): 3-19.

Greene, Stephan and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. NAACL, pp. 503--511.

Ireland, Molly E., Richard B. Slatcher, Paul W. Eastwick, Lauren E. Scissors, Eli J. Finkel, and James W. Pennebaker. 2011. Language style matching predicts relationship initiation and stability. Psychological Science 22(1): 39-44.

Kleinberg, Jon. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (5): 604–632.

Liberman, Mark. 2014. Language Log post on all sorts of aspects of the uh/um divide.

Omodei, Elisa, Thierry Poibeau, and Jean-Philippe Cointet. 2012. Multi-level modeling of quotation families morphogenesis. Proceedings of ASE/IEEE SocialCom.

Ranganath, Rajesh, Dan Jurafsky, and Dan McFarland. 2009. It's not you, it's me: Detecting flirting and its misperception in speed-dates. Proceedings of EMNLP.

Schneider, Nathan, Rebecca Hwa, Philip Gianfortoni, Dipanjan Das, Michael Heilman, Alan W. Black, Frederick L. Crabbe, and Noah A. Smith. 2010. Visualizing Topical Quotations Over Time to Understand News Discourse. CMU-LTI-01-103, CMU.

Tagliamonte, Sali. 2005. So who? Like how? Just what? Discourse markers in the conversations of young Canadians. Journal of Pragmatics 37(11): 1896-1915.

#13 7

Project-possibilities discussion

Image source:

  1. Garley, Matt and Julia Hockenmaier. 2012. Beefmoves: Dissemination, diversity, and dynamics of English borrowings in a German hip hop forum. Proceedings of ACL.
  2. Vasilescu, Bogdan, Alexander Serebrenik, Prem Devanbu, and Vladimir Filkov. 2014. How social Q&A sites are changing knowledge sharing in open source software communities. Proceedings of CSCW, pp. 342--354.

Some things discussed or tried in class


Anderson, Ashton, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering user behavior with badges. Proceedings of WWW.

Kooti, Farshad, Haeryun Yang, Meeyoung Cha, Krishna Gummadi, and Winter Mason. The emergence of conventions in online social networks. Proceedings of ICWSM. Best paper award.

Vasilescu, Bogdan, Andrea Capiluppi, and Alexander Serebrenik. 2012. Gender, representation and online participation: A quantitative study of stackoverflow. Proceedings of Social Informatics, pp. 332--338.


  1. Sign up by 3pm Wed the 15th for a check-up appointment, which will be held on Thursday the 16th. Here is the sign-up link.
  2. By 3pm on Tue the 21st, post your two+ paragraph informal term-project proposal draft on Piazza. If you've already decided to team up, just one person on the team posts.
    • At a minimum, your proposal should give the main idea, why you think this is interesting, the dataset you plan to use, and as precise an indication as you can give of what you intend to investigate as the minimum criteria for completion.
    • Beyond the minimum, the more thought you put into proving that your project is feasible, the more useful this step will be.
  3. Between 3pm on Tuesday the 21st and 9am on Thursday the 23rd, you should do the best you can on Piazza to provide helpful comments to each other, perhaps decide to team up, etc.
#14 9

Project-possibilities discussion

Sites (in addition to the last time we talked about kickstarter)


An, Jisun, Daniele Quercia, and Jon Crowcroft. 2014. Recommending investors for crowdfunding projects. Proceedings of the 23rd International Conference on World Wide Web, pp. 261--270.

Barany, Michael J. 2010. '[B]ut this is blog maths and we're free to make up conventions as we go along': Polymath1 and the modalities of 'massively collaborative mathematics'. Proceedings of the 6th International Symposium on Wikis and Open Collaboration, pp. 10:1--10:9.

Barron, Brigid. November 2009. Achieving coordination in collaborative problem-solving groups. The Journal of the Learning Sciences 9(4): 403–436.

Cranshaw, Justin and Aniket Kittur. 2011. The polymath project: Lessons from a successful online collaboration in mathematics. Proceedings of CHI, pp. 1865--1874.

Fogarty, Mignon (Grammar Girl), 2014. What new research on the brain says every writer should do.

Fort, Karën, Gilles Adda, and K Bretonnel Cohen. 2011. Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics 37(2): 413-420.

Hansen, Stephen, Michael McMahon, and Andrea Prat. 2014. Transparency and deliberation within the FOMC: A computational linguistics approach. Centre for Economic Policy Research, paper no 9994.

Roschelle, Jeremy and Stephanie D Teasley. 1995. The construction of shared knowledge in collaborative problem solving. Proceedings of the NATO Advanced Research Workshop on Computer Supported Collaborative Learning, pp. 69--97.

Willemyns, Michael, Cynthia Gallois, and Victor J Callan. 2006. Conversations between postgraduate students and their supervisors: Intergroup communication and accommodation. Proceedings of the World Congress on the Power of Language: Theory, Practice and Performance.

Xu, Anbang, Xiao Yang, Huaming Rao, Wai-Tat Fu, Shih-Wen Huang, and Brian P Bailey. 2014. Show me the money!: An analysis of project updates during crowdfunding campaigns. Proceedings of CHI, pp. 591--600.

Oct 14 Fall Break
#15 16

No class meeting — instead, individual meetings throughout the day for performance (and, if desired, potential project) feedback.

#16 21

Tales from the trenches: data scraping

Slides by Amit Sharma and Chenhao Tan

Notes taken during class

Resources mentioned in discussion. All descriptions below taken from the linked webpages

  • Beautiful Soup, Python library designed for quick turnaround projects like screen-scraping
  • curl man page. Transfer a URL
  • cron
  • pandas, Python Data Analysis Library
  • /r/redditdev, subreddit for discussion of reddit API clients and the reddit source code.
  • redis, open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server
  • social-integrator by Amit Sharma. A project to provide convenient API access to programmers and researchers for downloading data. Currently supports lastfm.
  • Tweepy, An easy-to-use Python library for accessing the Twitter API.
  • Twitter social graph 2009 Results of a full crawl of the entire Twitter site. 41.7 million user profiles, 1.47 billion social relations, 4,262 trending topics, and 106 million tweets. This data was collected for the paper, What is Twitter, a Social Network or a News Media? by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon.
  • wget Wikipedia page, retrieves content from web servers
  • WordNet and SentiWordNet
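A minimal sketch of the kind of link extraction these tools support. It uses Python's standard-library html.parser as a stand-in for Beautiful Soup, so it runs without installing anything, and it parses an inline string instead of fetching a page with curl or wget. The LinkCollector class, the SAMPLE_HTML fragment, and the /thread/ URLs are all hypothetical and were not part of the class discussion.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags that carry an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href may be absent
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment standing in for a downloaded forum index.
SAMPLE_HTML = """
<ul>
  <li><a href="/thread/1">First <b>thread</b></a></li>
  <li><a href="/thread/2">Second thread</a></li>
  <li><a name="anchor-without-href">skip me</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links

for href, text in links:
    print(href, text)
```

In a real crawl, the (href, text) pairs would typically be written to disk or loaded into a pandas DataFrame, with the fetching step rate-limited and scheduled (for example via cron).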
#17 23

Class discussion of project proposals and feedback

Upcoming submission deadlines: NAACL long & short papers: sub Dec 4, notification Feb 20, multiple submissions OK. ICWSM full and poster: abstracts Jan 18, papers Jan 23, notification March 9. WWW "Web Science track": abstracts Jan 19, papers Jan 23, notification Feb 27.

Post to Piazza as a followup to your proposal what you commit to doing in the next week/1.5 weeks or so.

#18 28

Bayesian identification of features distinguishing two sub-languages

Image source:

Scan of lecture notes

Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403.
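A rough sketch of the z-scored log-odds ratio with an informative Dirichlet prior, as I understand it from the paper above; it is an illustration under stated assumptions, not the authors' code. The function name fightin_words_z, the prior_scale parameter, the 0.01 floor for the pseudo-counts, the choice of the pooled counts as the prior, and the toy word lists are all made up for this example.

```python
import math
from collections import Counter


def fightin_words_z(counts_i, counts_j, prior, prior_scale=1.0):
    """Return {word: z-score}; positive means the word leans toward corpus i.

    counts_i, counts_j: Counters of word counts for the two sub-languages.
    prior: Counter of background word counts used as Dirichlet pseudo-counts
           (an assumption of this sketch).
    """
    vocab = set(counts_i) | set(counts_j)
    # alpha_w: pseudo-count for each word; the 0.01 floor (an arbitrary choice)
    # keeps the prior strictly positive for words missing from the background.
    alpha = {w: prior_scale * prior.get(w, 0) + 0.01 for w in vocab}
    alpha_0 = sum(alpha.values())
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())

    z = {}
    for w in vocab:
        y_i = counts_i.get(w, 0) + alpha[w]   # smoothed count in corpus i
        y_j = counts_j.get(w, 0) + alpha[w]   # smoothed count in corpus j
        # Difference of the two smoothed log-odds for word w
        delta = (math.log(y_i / (n_i + alpha_0 - y_i))
                 - math.log(y_j / (n_j + alpha_0 - y_j)))
        # Approximate variance of that difference
        variance = 1.0 / y_i + 1.0 / y_j
        z[w] = delta / math.sqrt(variance)
    return z


# Toy data, invented for illustration only.
left = Counter("tax cut tax relief freedom".split())
right = Counter("tax fairness health care health".split())
background = left + right

scores = fightin_words_z(left, right, background)
for word, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {score:+.2f}")
```

Ranking by z-score rather than by the raw log-odds difference discounts words whose estimates rest on very few observations, which is the main motivation for dividing by the standard deviation.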

Additional references:

Google books n-gram corpus

Church, Kenneth W. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. Proceedings of COLING.

Fredette, Marc and Jean-François Angers. 2002. A new approximation of the posterior distribution of the log-odds ratio. Statistica Neerlandica 56(3): 314-329.

Kleinberg, Jon. 2002. Bursty and hierarchical structure in streams. Proceedings of KDD, pp. 91-101.

Liang, Percy and Dan Klein. 2007. Tutorial: Structured Bayesian nonparametric models with variational inference. Included for material and visualizations of Dirichlets.

Liberman, Mark. 2014. Obama's favored (and disfavored) SOTU words. Language Log blog post, using the Monroe/Colaresi/Quinn method.

Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on Kickstarter. Proceedings of CSCW.

FAQ: How do I interpret odds ratios in logistic regression? Introduction to SAS. UCLA: Statistical Consulting Group.
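As a reminder of the standard identity behind that question (stated here from general knowledge, not quoted from the linked FAQ): in a logistic regression with a single predictor x and success probability p, a one-unit increase in x multiplies the odds by e^{\beta_1}. The symbols \beta_0 and \beta_1 are generic notation chosen for this note.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)}
= \frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}}
= e^{\beta_1}.
\]
```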

#19 30

Continuation of "Fightin' Words"

Scan of lecture notes

Deferred, for the most part, until next class meeting:

Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352.


MRC Psycholinguistic database. Citation: Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11. A search interface makes it clear what kind of features (annotations) are identified for the lexicon items.

Sign up for (mandatory) checkup appointments for next week. Link here.

#20 Nov 4

No class meeting — individual team check-up meetings instead.

#21 6

No class meeting — individual team check-up meetings instead.

#22 11

No class meeting — Veteran's Day

#23 13

Case study of feature ingenuity; grammars of various sorts

Image source:, available for purchase; image crop by Popular Science

Scan of lecture notes

Our starting point: Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352.

Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition. Slides by Kathy McCoy corresponding to chapter 14 - slide 7 (Fig 14.12) is what I showed in class, and slides 17 and 18 (need to split categories), 20 and 22 (importance of lexical information) were ones I contemplated showing. Section 12.7 discussed dependency parsing. I also displayed figure 12.14, a dependency parse, from chapter 12.

Notes for a lecture I gave on context-free grammars in 2007, scribed by Cristian Danescu-Niculescu-Mizil, Nam Nguyen, and Myle Ott.


Abeillé, Anne and Yves Schabes. 1989. Parsing idioms in lexicalized TAGs. EACL, pp. 1–9.

Jäger, Gerhard and James Rogers. 2012. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences. I picked this for being brief, hitting the whole Chomsky hierarchy, and mentioning the mildly-context-sensitive languages and their relation to natural language; but one would not necessarily argue that this is an easy introduction.

Kroch, Anthony and Aravind Joshi. 1985. The linguistic relevance of tree adjoining grammar. UPenn Technical Report MS-CIS-85-16.

Joshi, Aravind K. and Yves Schabes. 1997. Tree-adjoining grammars. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, volume 3 (Beyond words), pp. 69–12.

Joshi, Aravind, K. Vijay-Shanker, and David Weir. 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber and Thomas Wasow, Eds., Foundational Issues in Natural Language Processing. Link is to a technical-report version.

Nadeau, David and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1): 3-26. Alternative link.

Pullum, Geoffrey K. 1986. Topic ... Comment: Footloose and context-free. Natural Language & Linguistic Theory 4(3): 409-414. Comments on attempts to prove that natural languages are not context-free.

The MaltParser dependency parser. An early paper, mentioning that "The runtime of the algorithm is linear in the length of the input string, and the dependency graph is guaranteed to be projective and acyclic": Nivre, Joakim. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03), pp. 149-160.

The XTAG project.
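
To make the "projective and acyclic" guarantee quoted in the MaltParser entry above a bit more concrete, here is a minimal sketch that checks both properties for a dependency analysis. This is not MaltParser's code or Nivre's algorithm; the function names and the head-list encoding (tokens numbered from 1, with 0 standing for an artificial root) are assumptions made purely for illustration.

```python
def is_acyclic(heads):
    """heads[i] is the head of token i+1; a head of 0 means the artificial root.

    Follows head links upward from every token and reports a cycle if a
    token is revisited before the root is reached.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def is_projective(heads):
    """True if no two arcs cross when drawn above the sentence,
    with the artificial root placed at position 0."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if l1 < l2 < r1 < r2:
                return False
    return True


# Hypothetical three-token sentence: tokens 1 and 3 both depend on token 2.
simple = [2, 0, 2]
# Hypothetical analysis with crossing arcs: 1 <- 3 and 2 <- 4.
crossing = [3, 4, 0, 3]
print(is_acyclic(simple), is_projective(simple))
print(is_acyclic(crossing), is_projective(crossing))
```

The quadratic crossing check is chosen for readability only; it says nothing about how a linear-time parser maintains these properties while it builds the graph.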

Sign up for (mandatory) checkup appointments for next week. Link here.

#24 18

Tour of (mostly non-ngram) language models

Scan of lecture notes


Barzilay, Regina and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, 113-120. Original code by Regina Barzilay (in Lisp); code by Alexandre Passos; other code for later versions

Booth, Taylor L. and Richard A. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers C-22(5): 442-450.

Chi, Zhiyi and Stuart Geman. June 1998. Estimation of probabilistic context-free grammars. Computational Linguistics 24(2): 299-305.

Lari, K. and S.J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech & Language 4(1): 35-56. Code by Mark Johnson.

Manning, Christopher D. and Hinrich Schütze. 1999. Section 11.1, "Some features of PCFGs", in Chapter 11 of Foundations of Statistical Natural Language Processing. MIT Press.

Rabiner, Lawrence R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2): 257-286. Errata by Ali Rahimi.

#25 20

No class meeting — individual team check-up meetings instead.

#26 25

Language models: characterization and comparison

Scan of lecture notes


Baez, John. 2012. The mathematics of biodiversity (part 4). Blog post.

Chen, Stanley F. and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4): 359-393.

Csiszár, Imre. September 2008. Axiomatic characterizations of information measures. Entropy 10(3): 261-273.

Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one. Corpus-based Research Into Language: In Honour of Jan Aarts, pp. 189-200.

Lee, Lillian. 1999. Measures of distributional similarity. Proceedings of the ACL, pp. 25-32.

Rao, Calyampudi Radhakrishna. January 2011. Entropy and cross entropy as diversity and distance measures. In International Encyclopedia of Statistical Science, pp. 440-446.
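
As a small illustration of two themes in the readings above, additive ("adding one") smoothing and cross entropy as a way of comparing a model against held-out text, here is a minimal sketch. The toy sentences, the function names, and the choice of a unigram model are all assumptions made for illustration; none of it is taken from the lecture or from the papers listed.

```python
import math
from collections import Counter


def add_k_unigram(tokens, vocab, k=1.0):
    """Unigram model with add-k smoothing (k=1 is 'adding one').

    Assumes every token in `tokens` is also in `vocab`, so that the
    smoothed probabilities sum to one over the vocabulary.
    """
    counts = Counter(tokens)
    total = len(tokens) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}


def cross_entropy(model, tokens):
    """Average negative log2-probability (bits per token) of `tokens`."""
    return -sum(math.log2(model[w]) for w in tokens) / len(tokens)


# Hypothetical toy data: "dog" never occurs in the training text.
train = "the cat sat on the mat".split()
held_out = "the dog sat".split()
vocab = set(train) | set(held_out)

model = add_k_unigram(train, vocab)
print(model["dog"])                    # nonzero only because of smoothing
print(cross_entropy(model, held_out))  # lower means a better fit
```

Without smoothing, the unseen word would get probability zero and the cross entropy on the held-out text would be infinite, which is the usual motivation for smoothing in the first place; whether adding one is a good way to do it is exactly what Gale and Church question.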

Sign up for (mandatory) checkup appointments for next week. Link here.

Nov 27 Thanksgiving Break
#27 Dec 2

No class meeting — individual team check-up meetings instead.

#28 4

10-minute project presentations.

Final-project write-up due date, as determined by the registrar: December 11 at 4:30 pm. I have no particular page length in mind, but please highlight the most interesting findings (positive or negative). You should include the following sections: introduction/motivation, related work, data description (how you gathered, cleaned, and processed it), methods, experiments, what you learned and concluded, and directions for future work. You don't need to be particularly formal. My primary evaluation criteria will be the reasonableness (in approach and amount of effort), thoughtfulness, and creativity of what you tried.

The code for generating the calendar above and the CSS were (barely) adapted from the original versions created by Andrew Myers.