We will continuously update this file.
Time: Monday and Wednesday, 10:10-11:25am
Room: Gates 416 / Hackers
Instructor: Yoav Artzi (office hours: by appointment)
Class listing: CS6741
CMS, Trello, CMT (paper reviewing), Piazza (discussion board)
All students are asked to bring their laptops to class. We will use laptops to broadcast content. If you cannot bring a laptop, you will need to share one with another student (this should work well for pairs).
In this course, we will study various topics in NLP. We will focus on research results, and every few meetings we will switch topics. In general, most topic discussions will include a technical overview, data analysis, a classical result, and recent results. We will discuss results through research papers, with in-class discussion and written paper reviews. We will use CMT to read and review papers. Each week, one of the students will open the discussion with a 10-15 minute introduction of the papers.
We will use a Trello board to vote on topics. We will select topics based on votes and the general agenda of the course. There is no guarantee that the top-voted topic will be the next one discussed (but we will try our best). Trello gives us limited control over voting, so please vote for at most 5 topics. Once a topic has been moved off the Potential Topics list, you should re-allocate your vote to a new topic.
The list of topics in Trello is not complete. If you have a topic you would like us to discuss (it can be your research!), please add it to the suggested topics list in Trello. If you want your topic to be considered, please attach 2-3 papers (1 "old" and the rest "new"). Some good sources for topics are:
Finally, not all resources can be shared publicly. We will use CMS to share copyrighted resources (e.g., Penn Treebank).
We will start with two core topics to warm up:
We will then spend the rest of the semester on topics we select together. This is a partial unordered list of topics:
Reviews are due at 5pm the day before class.
Date | Topic | Readings | Presenting Student | Data | Optional Readings and Others |
---|---|---|---|---|---|
Aug24 | Introduction and Course Overview | Intro Questionnaire due Friday | |||
Aug29 | Tagging | ||||
Aug31 | No Class | ||||
Sep5 | No Class: Labor Day | ||||
Sep7 | Brants2000, Toutanova2000 | Yiqing | |||
Sep12 | Santos2014 | Ana | Tagging | BackProp, CharRNN, NNLP Primer, Schmid1994 | |
Sep14 | Dependency Parsing | Proposal due | |||
Sep19 | Goldberg2013 | Tianze | UD | Nivre2003 | |
Sep21 | Chen2014 | Xilun | UD | Weiss2015, Andor2016 | |
Sep26 | Reading Comprehension | Hirschman1999, Richardson2013 | Sydney | MCTest | |
Sep28 | Danqi2016 | Esin | CNN/DailyMail | Hermann2015 (strongly recommended) | |
Oct3 | Semantic Parsing | Tutorial | |||
Oct5 | |||||
Oct10 | No Class: Columbus Day | ||||
Oct12 | No Class: Yom Kippur | ||||
Oct17 | Miller1994, Miller1996 | Ransen | ATIS (CMS) | Hemphill1990 | |
Oct19 | Artzi2013 | Alane | Navi | Zettlemoyer2005 | |
Oct24 | Midterm Presentations | Presentations due Oct 23 | |||
Oct26 | No Class | ||||
Oct31 | Midterm Presentations | ||||
Nov2 | No Class: EMNLP2016 - check the proceedings instead | ||||
Nov7 | Embeddings | Turney2010 (Sections 1,2,6), Levy2015 | Longqi and Hongyi | w2v-explained | |
Nov9 | Hill2016, Adi2016 | Yao and Xinya | |||
Nov14 | Question Answering | Fader2013 | Max | Paralex | Fader2014, Simmons1970, Slides on word alignments, ReVerb (Fader2011) |
Nov16 | Andreas2016 | Arzoo | VQA | Jabri2016 | |
Nov21 | Weston2016 | Alane | Babi | ||
Nov23 | No Class: Thanksgiving | ||||
Nov28 | Project Presentations | ||||
Nov30 | Project Presentations | Project Report due December 8 |
Each paper review will require a short summary of the paper and the actual review. Some questions you may use to guide your review are (many others are valid too):
Since this is not a real conference review, please also write what you learned from this paper and why, in your opinion, it was a good choice for reading (or why it was a bad choice). Reviews are due at 5pm the day before class.
Each meeting, if readings are discussed, one student will present the papers for 10-15 minutes. The presentation can use slides or be purely verbal. If data was assigned to the topic, use examples from it to illustrate your points. We will then go around the room, and each student will contribute to the discussion.
Pick at least 2-3 examples to discuss during your presentation in class. Prepare the examples so they can be displayed on screen; we will share your screen as necessary. Pick examples that illustrate various aspects of the paper and task. The questions you should think about include (but are not limited to):
Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:
>>> import nltk
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0])))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0])))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.
The universal tag set is described here. The PTB tag set is described here.
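The two outputs above show the same sentence under the fine-grained PTB tag set and the coarse universal tag set. The sketch below reconstructs the part of that mapping visible in the example sentence; it is an illustrative subset only (the complete mapping is distributed with NLTK's universal tag set data):

```python
# Partial PTB -> universal tag mapping, read off the two parallel
# outputs above. Illustrative subset, not the full mapping.
PTB_TO_UNIVERSAL = {
    'NNP': 'NOUN', 'NNS': 'NOUN', 'NN': 'NOUN',
    'CD': 'NUM', 'JJ': 'ADJ', 'MD': 'VERB', 'VB': 'VERB',
    'DT': 'DET', 'IN': 'ADP', ',': '.', '.': '.',
}

def to_universal(tagged):
    """Map (word, PTB tag) pairs to (word, universal tag); 'X' for unknown tags."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, 'X')) for word, tag in tagged]

print(to_universal([('Pierre', 'NNP'), ('61', 'CD'), ('old', 'JJ')]))
# [('Pierre', 'NOUN'), ('61', 'NUM'), ('old', 'ADJ')]
```

Note how several PTB tags (e.g., NNP, NNS, NN) collapse to a single universal tag, which is why the universal output above looks coarser.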
The CoNLL 2002 shared task is available in NLTK:
>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0])))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O
CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.
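In the IOB scheme, B- marks the first token of an entity, I- a continuation, and O a token outside any entity. A minimal sketch of decoding IOB tags into entity spans (the function name and span convention are our own, not part of the CoNLL release):

```python
def iob_to_spans(tagged):
    """Collapse (token, IOB tag) pairs into (entity_type, start, end) spans.

    end is exclusive. B- starts an entity; I- continues one; an I- whose
    type does not match the open entity is treated as starting a new one.
    """
    spans = []
    start, etype = None, None
    for i, (token, tag) in enumerate(tagged):
        if tag.startswith('B-') or (tag.startswith('I-') and etype != tag[2:]):
            if etype is not None:          # close the previous entity
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag == 'O':
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:                  # entity running to end of sentence
        spans.append((etype, start, len(tagged)))
    return spans

sent = [('Sao', 'B-LOC'), ('Paulo', 'I-LOC'), ('(', 'O'),
        ('Brasil', 'B-LOC'), (')', 'O')]
print(iob_to_spans(sent))  # [('LOC', 0, 2), ('LOC', 3, 4)]
```

Applied to the first CoNLL 2002 sentence above, this recovers "Sao Paulo" and "Brasil" as LOC entities.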
This is another example of tagging. The task is explained here, and the data release is described here.
The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
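A CoNLL-U sentence is a block of tab-separated lines, one token per line, with ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and `#`-prefixed comment lines. A minimal reader, with a made-up two-token sentence for illustration:

```python
def parse_conllu(block):
    """Parse one CoNLL-U sentence block into a list of per-token dicts."""
    cols = ['id', 'form', 'lemma', 'upos', 'xpos',
            'feats', 'head', 'deprel', 'deps', 'misc']
    tokens = []
    for line in block.strip().split('\n'):
        if line.startswith('#'):   # comment lines, e.g. "# text = ..."
            continue
        tokens.append(dict(zip(cols, line.split('\t'))))
    return tokens

# Illustrative sentence, not from an actual UD treebank.
sample = '\n'.join([
    '# text = Dogs bark',
    '1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_',
    '2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_',
])
toks = parse_conllu(sample)
print([(t['form'], t['upos'], t['head']) for t in toks])
# [('Dogs', 'NOUN', '2'), ('bark', 'VERB', '0')]
```

HEAD is the 1-based index of the token's syntactic head, with 0 marking the root, so "Dogs" attaches to "bark" and "bark" is the root.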
The Penn Treebank is available on CMS. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
The WMT shared task from 2016 is a good source for newswire bitext.
Textual entailment (TE) has been studied extensively for more than a decade. Recently, SNLI has been receiving significant attention.
MCTest is a relatively new corpus that is receiving significant attention. SQuAD is even newer. Another corpus we will look at is CNN/DailyMail from DeepMind.
We will look at three data sets commonly used for semantic parsing:
If you encounter an interesting demo or system not listed here, please email the course instructor.
Deep Learning frameworks:
We recommend using virtualenv with Python. Here is a quick, but sufficient, explanation. All `pip` installations will then be local to the environment. For example, you may install NLTK to access data (i.e., using `pip install nltk`).
Got any good tips? Email to share!
All policies are subject to change.
Grading: The grade will be based on paper reviews, paper introductions, the project, and potentially assignments.
Prerequisites: Strong programming experience (CS 2110 or equivalent). For any questions regarding enrollment, show up in person on the first day of class or email the instructor.