Fall 2016 CS6741 README

We will continuously update this file.

Course Overview

Time: Monday and Wednesday, 10:10-11:25am
Room: Gates 416 / Hackers
Instructor: Yoav Artzi (office hours: by appointment)
Class listing: CS6741
CMS, Trello, CMT (paper reviewing), Piazza (discussion board)

All students are asked to bring their laptops to class. We will use laptops to broadcast content. If you cannot bring a laptop, you will need to share one with another student (this works well in pairs).

In this course, we will study various topics in NLP. We will focus on research results and switch topics every few meetings. In general, most topic discussions will include a technical overview, data analysis, a classical result, and recent results. We will discuss results through research papers, with in-class discussion and written paper reviews. We will use CMT to read and review papers. Each week, one of the students will open the discussion with a 10-15 minute introduction of the papers.

We will use a Trello board to vote on topics. We will select topics based on votes and the general agenda of the course. There is no guarantee that the top-voted topic will be the next to be discussed (but we will try our best). Trello does not give us fine-grained control over voting, so please vote for at most 5 topics. Once a topic has been moved off the Potential Topics list, please re-allocate your vote to a new topic.

The list of topics in Trello is not complete. If you have a topic you would like us to discuss (it can be your own research!), please add it to the suggested topics list in Trello. If you want your topic to be considered, please attach 2-3 papers (1 "old" and the rest "new"). Some good sources for topics are:

Finally, not all resources can be shared publicly. We will use CMS to share copyrighted resources (e.g., Penn Treebank).

Tasks and Problems

We will start with two core topics to warm up:

We will then spend the rest of the semester on topics we select together. This is a partial unordered list of topics:

Schedule

Reviews are due at 5pm the day before class.

| Date | Topic | Readings | Presenting Student | Data | Optional Readings and Others |
|------|-------|----------|--------------------|------|------------------------------|
| Aug 24 | Introduction and Course Overview | | | | Intro Questionnaire due Friday |
| Aug 29 | Tagging | | | | |
| Aug 31 | No Class | | | | |
| Sep 5 | No Class: Labor Day | | | | |
| Sep 7 | | Brants2000, Toutanova2000 | Yiqing | | |
| Sep 12 | | Santos2014 | Ana | Tagging | BackProp, CharRNN, NNLP Primer, Schmid1994 |
| Sep 14 | Dependency Parsing | | | | Proposal due |
| Sep 19 | | Goldberg2013 | Tianze | UD | Nivre2003 |
| Sep 21 | | Chen2014 | Xilun | UD | Weiss2015, Andor2016 |
| Sep 26 | Reading Comprehension | Hirschman1999, Richardson2013 | Sydney | MCTest | |
| Sep 28 | | Danqi2016 | Esin | CNN/DailyMail | Hermann2015 (strongly recommended) |
| Oct 3 | Semantic Parsing | Tutorial | | | |
| Oct 5 | | | | | |
| Oct 10 | No Class: Columbus Day | | | | |
| Oct 12 | No Class: Yom Kippur | | | | |
| Oct 17 | | Miller1994, Miller1996 | Ransen | ATIS (CMS) | Hemphill1990 |
| Oct 19 | | Artzi2013 | Alane | Navi | Zettlemoyer2005 |
| Oct 24 | Midterm Presentations | | | | Presentations due Oct 23 |
| Oct 26 | No Class | | | | |
| Oct 31 | Midterm Presentations | | | | |
| Nov 2 | No Class: EMNLP 2016 - check the proceedings instead | | | | |
| Nov 7 | Embeddings | Turney2010 (Sections 1, 2, 6), Levy2015 | Longqi and Hongyi | | w2v-explained |
| Nov 9 | | Hill2016, Adi2016 | Yao and Xinya | | |
| Nov 14 | Question Answering | Fader2013 | Max | Paralex | Fader2014, Simmons1970, Slides on word alignments, ReVerb (Fader2011) |
| Nov 16 | | Andreas2016 | Arzoo | VQA | Jabri2016 |
| Nov 21 | | Weston2016 | Alane | bAbI | |
| Nov 23 | No Class: Thanksgiving | | | | |
| Nov 28 | Project Presentations | | | | |
| Nov 30 | Project Presentations | | | | Project Report due December 8 |

Paper Reviewing Guidelines

Each paper review will include a short summary of the paper and the review itself. Some questions you may use to guide your review (many others are valid too):

Since this is not a real conference review, please also write what you learned from this paper and why, in your opinion, it was a good choice for reading (or why it was a bad choice). Reviews are due at 5pm the day before class.

Paper Presentation and Discussion Guidelines

In each meeting where readings are discussed, one student will present the papers for 10-15 minutes. The presentation can use slides or be purely verbal. If data was assigned to the topic, use data examples to illustrate your points. We will then go around the room and each student will contribute to the discussion.

Data Analysis Guidelines

Pick at least 2-3 examples to discuss during your presentation in class. Prepare the examples so they can be displayed on screen; we will share your screen as necessary. Pick examples that illustrate various aspects of the paper and the task. The questions you should think about include (but are not limited to):

Data

Tagging

Part-of-speech Tags

Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:

>>> import nltk
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0])))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0])))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.

The universal tag set is described here. The PTB tag set is described here.

Named Entity Recognition Data

The CoNLL 2002 shared task is available in NLTK:

>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0])))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O

CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.
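
To make the scheme concrete, here is a minimal sketch (not part of the course materials; the helper iob_to_spans is hypothetical) that groups IOB-tagged tokens into entity spans, using the sentence printed above:

def iob_to_spans(tagged_tokens):
    """Collect (entity type, tokens) spans from (token, IOB tag) pairs."""
    spans = []
    for token, tag in tagged_tokens:
        if tag.startswith('B-'):            # B-<type> begins a new entity
            spans.append((tag[2:], [token]))
        elif tag.startswith('I-') and spans:
            spans[-1][1].append(token)      # I-<type> continues the previous entity
        # 'O' marks tokens outside any entity
    return spans

# The (word, NE tag) pairs from the CoNLL 2002 sentence printed above:
sent = [('Sao', 'B-LOC'), ('Paulo', 'I-LOC'), ('(', 'O'), ('Brasil', 'B-LOC'),
        (')', 'O'), (',', 'O'), ('23', 'O'), ('may', 'O'), ('(', 'O'),
        ('EFECOM', 'B-ORG'), (')', 'O'), ('.', 'O')]
print(iob_to_spans(sent))
# [('LOC', ['Sao', 'Paulo']), ('LOC', ['Brasil']), ('ORG', ['EFECOM'])]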

NYT Recipe Data

This is another example of a tagging task. The task is explained here, and the data release is described here.

Dependency Parsing

The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
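
The format is easy to read directly; below is a minimal sketch of loading it in Python (the file name en-ud-dev.conllu is an assumption, use any file from the release). Each non-comment line has ten tab-separated columns (ID, FORM, LEMMA, UPOSTAG, XPOSTAG, FEATS, HEAD, DEPREL, DEPS, MISC), and blank lines separate sentences:

def read_conllu(path):
    """Return sentences as lists of (form, upos, head, deprel) tuples."""
    sentences, current = [], []
    for line in open(path):
        line = line.rstrip('\n')
        if not line:                        # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith('#'):          # sentence-level comments
            continue
        else:
            cols = line.split('\t')
            if '-' in cols[0]:              # skip multiword token ranges, e.g. "3-4"
                continue
            current.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences

# Example usage (file name is an assumption):
# for sent in read_conllu('en-ud-dev.conllu')[:3]:
#     print(sent)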

Constituency Parsing

The Penn Treebank is available on CMS. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
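
As a quick sketch of the NLTK route, the small PTB sample bundled with NLTK can be loaded as trees directly (the full copy from CMS can be read the same way with nltk.corpus.reader.BracketParseCorpusReader pointed at its files; the exact paths depend on how you unpack it):

>>> import nltk
>>> tree = nltk.corpus.treebank.parsed_sents()[0]   # the Pierre Vinken sentence from the tagging section
>>> tree.pos()[:3]                                  # leaf tokens with their POS tags
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]
>>> tree.draw()                                     # opens a simple GUI tree viewer (requires tkinter)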

Machine Translation

The WMT shared task from 2016 is a good source for newswire bi-text.

Textual Entailment

TE has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.

Reading Comprehension

MCTest is a relatively new corpus that is receiving significant attention. SQuAD is even newer. Another corpus we will look at is the CNN/DailyMail corpus from DeepMind.

Semantic Parsing

We will look at three data sets commonly used for semantic parsing:

  1. GeoQuery: A natural language interface to a small US geography database. The original data is available here, and the original query language is described here. The data with lambda calculus logical forms is available here.
  2. ATIS: A natural language interface for a flights database. The data will be available on CMS.
  3. Navi: Instructional language for robot navigation. The original data is described here, but we recommend using the data here.

Word Analogy

Question Answering (TODO)

Online Demos, Systems, and Tools

If you encounter an interesting demo or system not listed here, please email the course instructor.

Deep Learning frameworks:

Technical Tips

We recommend using virtualenv with Python. Here is a quick, but sufficient, explanation. All pip installations will then be local to the environment. For example, you can install NLTK to access data with pip install nltk.

Got any good tips? Email to share!

Procedurals

All policies are subject to change.

Grading: The grade will be based on paper reviews, paper introductions, the project, and potentially assignments.

Prerequisites: Strong programming experience (CS 2110 or equivalent). For any questions regarding enrollment, show up in person on the first day of class or email the instructor.