Time: Monday and Wednesday, 9:50-11:05am
Room: Gates 416 / Bloomberg 398
Instructor: Yoav Artzi, yoav@cs (office hours by appointment)
Class listing: CS6741
CMS, Piazza, Zoom (lectures), CMT (paper reviewing), AOI (topic voting)
All students are asked to bring their laptops to class. We will use laptops to broadcast content. If you cannot bring a laptop, you will need to share one with another student (this works well for pairs). This applies to students on both campuses.
In this course, we will study various topics in NLP. We will focus on research results, switching topics every few meetings. In general, most topic discussions will include a technical overview, data analysis, a classical result, and recent results. We will discuss results through research papers, with class discussion and short paper reviews. We will use CMT to read and review papers. Each meeting will start with a 10-15 minute presentation followed by a discussion. In addition, we will dedicate part of the semester to a deep dive into a single focus topic. This semester the focus topic will be reinforcement learning for NLP. The focused part of the semester will include, in addition to paper reviews and discussions, implementation and analysis of core algorithms.
We will use All Our Ideas to vote on topics. We will select topics based on votes and the general agenda of the course. There is no guarantee that the top-voted topic will be the next to be discussed (but it is very likely). Please vote often: the more you vote, the better the ranking becomes. The password for the website will be given during the first lecture.
We will continuously update this page.
|Date||Topic||Readings||Presenter||Data||Optional Readings and Others|
|Aug 23||Introduction||Intro Questionnaire|
|Aug 28||Ethics||Caliskan et al. 2017||Max||word2vec||Fast Company, Bolukbasi et al. 2016|
|Aug 30||Ethics||Zhao et al. 2017||Andrew||ImSitu||Wired|
|Sep 4||No class - Labor day|
|Sep 6||No class - EMNLP|
|Sep 12||No class - EMNLP|
|Sep 14||No class - campus dedication|
|Sep 18||Guest talk: Felix Hill (DeepMind)|
|Sep 20||Recurrent architectures||Linzen et al. 2016, Kuncoro et al. 2017||Alane (Linzen et al. 2016), Ryan (Kuncoro et al. 2017)|
|Sep 25||Recurrent architectures||Vaswani et al. 2017||Howard|
|Sep 27||Semantic parsing||Matuszek et al. 2011||Dipendra||GenX|
|Oct 2||Semantic parsing||Krishnamurthy et al. 2017||Skyler||WikiTableQuestions||Project abstracts due|
|Oct 4||Semantic parsing||Padmakumar et al. 2017||Valts||Experiment Logs|
|Oct 9||No class - holiday|
|Oct 11||Language+Vision||Kitaev and Klein 2017||Eyvind||Data|
|Oct 16||Language+Vision||Goyal et al. 2017, Agrawal et al. 2017||Trishala||VQA2|
|Oct 18||Grounded generation||FitzGerald et al. 2013||Zexi||GenX|
|Oct 23||Project proposal presentations||Project presentations due Oct 22|
|Oct 25||Project proposal presentations|
|Oct 30||Grounded generation||Mao et al. 2016||Ishaan||Google RefExp|
|Nov 1||Grounded generation||Wiseman et al. 2017||Esin||Boxscore|
|Nov 6||RL||Harrison et al. 2017||Yoav||Kaplan et al. 2017, Krening et al. 2016 Frogger|
|Nov 8||RL||Guu et al. 2017||Dipendra||SCONE|
|Nov 13||RL||Peng et al. 2017||Alane and Valts||Frames|
|Nov 15||RL||Fang et al. 2017||Skyler|
|Nov 20||RL||Nguyen et al. 2017||Ryan||Data|
|Nov 27||Project presentations, Class 9:50-12:00||Project presentation due Nov 26|
|Nov 29||No class||Project reports due Dec 12|
Project: There are two project options: research and survey. If you choose the research option, you will do a research project (it can be your own research if relevant – it probably is!). The survey option requires writing a survey paper for a selected area in NLP. Both options will include: (a) a proposal presentation, (b) a final presentation, and (c) a final report submission.
Auditing: Auditing is allowed and encouraged with instructor permission. It requires attending all classes, submitting reviews, presenting papers, and participating in the discussion. Auditing does not require completing the project or doing any of the project related presentations. The goal is to allow interested students to join while maintaining a lively and productive discussion group. If you want to audit, please email the instructor as soon as possible.
Repeat students: Students who have already taken this class are not required to do the project part of the class.
Grading: The grade will include paper reviews, participation, project, and an intro questionnaire.
Participation of non-PhD students: If you are a master's student or an advanced undergraduate student, and you wish to participate in the class, please email the instructor. Cornell Tech master's students: please follow the application instructions emailed to you.
Policies are subject to change. If something is not clear, please contact the course staff.
Each paper review will require a short summary of the paper and the review itself. Some questions you may use to guide your review (many others are valid too):
Since this is not a real conference review, please also write what you learned from this paper and why, in your opinion, it was a good choice for reading (or why it was a bad choice). Reviews are due at 8pm the day before class.
Each meeting, if readings are discussed, one student will present the papers for 10-15 minutes. The presentation can use slides or be purely verbal. If data is available for the topic, use data examples to illustrate your points. We will then go around the room, and each student will contribute to the discussion.
Some suggested discussion questions (not a comprehensive list):
Pick at least 2-3 examples to discuss during your presentation in class. Examples should be prepared for display on screen. We will share your screen as necessary. Pick the examples to illustrate various aspects of the paper and task. The questions you should think about include (but are not limited to):
Some suggested discussion questions (not a comprehensive list):
The main publication venues are ACL, NAACL, EMNLP, TACL, EACL, CoNLL, and CL. All the papers from these venues can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, and IJCAI. A calendar of NLP events is available here, and ACL sponsored events are listed here.
Both parsing corpora below (PTB and UD) contain POS tags: each parse tree includes POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:
>>> import nltk
>>> print(' '.join('/'.join(pair) for pair in nltk.corpus.treebank.tagged_sents()[0]))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join('/'.join(pair) for pair in nltk.corpus.treebank.tagged_sents(tagset='universal')[0]))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.
The universal tag set is described here. The PTB tag set is described here.
The CoNLL 2002 shared task is available in NLTK:
>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(word + '/' + iob for word, pos, iob in nltk.corpus.conll2002.iob_sents()[0]))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O
CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.
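The IOB scheme can be illustrated with a short helper that groups tagged tokens into entity spans; a minimal sketch (the `iob_to_spans` name and the sample sentence below are illustrative, not part of any NLTK API):

```python
def iob_to_spans(tagged):
    """Convert (token, IOB-tag) pairs into (entity_type, start, end) spans.

    B-X opens a new entity of type X, I-X continues the current entity of
    the same type, and O (or a type mismatch) closes any open entity.
    """
    spans = []
    start, etype = None, None
    for i, (token, tag) in enumerate(tagged):
        if tag.startswith('B-'):
            if start is not None:          # close the previous entity
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
        elif tag.startswith('I-') and etype == tag[2:]:
            continue                       # extend the current entity
        else:                              # O tag or inconsistent I- tag
            if start is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if start is not None:                  # entity runs to the end
        spans.append((etype, start, len(tagged)))
    return spans

# Example sentence in the style of the CoNLL 2002 output above.
sentence = [("Sao", "B-LOC"), ("Paulo", "I-LOC"), ("(", "O"),
            ("Brasil", "B-LOC"), (")", "O")]
print(iob_to_spans(sentence))  # [('LOC', 0, 2), ('LOC', 3, 4)]
```

Note that "Sao Paulo" spans two tokens (B-LOC followed by I-LOC), while "Brasil" is a single-token entity of the same type.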
This is another example of tagging. The task is explained here, and the data release is described here.
The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
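The CoNLL-U format is straightforward to read programmatically: one token per tab-separated line with ten columns, blank lines between sentences, and `#` comment lines. A minimal reader sketch (the `read_conllu` helper and the two-token sample are made up for illustration; real files also contain multiword token ranges, which this sketch ignores):

```python
# Hypothetical two-token sentence in CoNLL-U layout, for illustration only.
SAMPLE = (
    "1\tPierre\tPierre\tPROPN\tNNP\t_\t2\tname\t_\t_\n"
    "2\tVinken\tVinken\tPROPN\tNNP\t_\t0\troot\t_\t_\n"
)

def read_conllu(text):
    """Parse CoNLL-U text into sentences of per-token dicts.

    Each non-comment line carries the 10 standard columns; blank lines
    separate sentences.
    """
    fields = ["id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc"]
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # sentence break or comment
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(fields, line.split("\t"))))
    if current:                                # flush the last sentence
        sentences.append(current)
    return sentences

sentences = read_conllu(SAMPLE)
print(sentences[0][1]["form"], sentences[0][1]["deprel"])  # Vinken root
```

The `head` column holds the 1-based index of each token's syntactic head (0 for the root), so the dependency tree can be recovered directly from the parsed dicts.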
The Penn Treebank is available from the LDC. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
The WMT shared task from 2016 is a good source for newswire bi-text.
Textual entailment (TE) has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.
We will look at three data sets commonly used for semantic parsing:
If you encounter an interesting demo or system not listed here, please email the course instructor.
Deep Learning frameworks and tools: