Spring 2017 CS5740: Natural Language Processing

Time: Monday and Wednesday, 2:30-3:45pm
Room: Big Red
Instructor: Yoav Artzi, yoav@cs (office hours: Monday 5pm, Baron)
TA: Max Grusky, grusky@cs (office hours: Thursday 1pm, by Skype, coordinate via email)
Class listing: CS5740
CMS

This course constitutes a depth-first technical introduction to natural language processing (NLP). NLP is at the heart of many of today's most exciting technological achievements, including machine translation, automatic conversational assistants, and Internet search. The goal of the course is to provide a deep understanding of the language of the field, including algorithms, core problems, methods, and data. Possible topics include text classification, lexical semantics, language modeling, machine translation, tagging and sequence modeling, parsing, compositional semantics, summarization, question answering, language grounding, information extraction, and sentiment analysis.

This course is not about using off-the-shelf NLP tools. If you are interested in tools and frameworks, see the pointers below to get your search started and your questions answered.

We will continuously update this page.

Schedule

Date Topic Recommended Readings Others
Jan25 Introduction
Jan30 Text classification M&S 7.4,16.2-16.3, Collins: Naive Bayes (Sec 1-4), Collins: Log Linear (Sec 2), MaxEnt, Baselines, CNN Classification Naive Bayes prior derivation
Feb1 '' Assignment 1 out
Feb6 ''
Feb8 ''
Neural networks Primer, Back-prop, Deep Averaging Networks, Gradient Checks (briefly), Gradient Checks (in detail)
Feb13 ''
Computation graphs Intro to Computation Graphs
Feb15 Lexical semantics and embeddings w2v explained, word2vec, word2vec phrases, Hill2016, Turney2010
Feb20 No class: February break
Feb22 '' Assignment 1 due, assignment 2 out
Feb27 ''
Language modeling J&M 4, M&S 6, Collins: LM, Smoothing, Char RNN
Mar1 ''
Mar6 ''
Machine translation Neural MT Tutorial, BLEU Score
Mar8 ''
IBM translation models J&M 25.5, M&S 13.1-13.2, Collins: IBM Models, IBM Models, Collins: EM (Sec 5-6), HMM alignments, IBM Model 2 EM Notebook
Mar13 '' Assignment 2 due
Phrase-based machine translation J&M 25.4, 25.8, M&S 13.3, Collins: PBT, Statistical PBT, Pharaoh decoder
Mar15 ''
Mar20 '' Assignment 3 out
Mar22 ''
Sequence modeling J&M 5.1-5.3, 6, M&S 3.1, 9, 10.1-10.3, Collins: HMM, Collins: MEMMs (Sec 3), Collins: CRF (sec 4), Collins: Forward-backward, SOTA Taggers, TnT Tagger, Stanford Tagger
Mar27 ''
Mar29 ''
Apr3 No class: spring break
Apr5 No class: spring break
Apr10 Sequence modeling (contd.)
Apr12 RNNs BPTT, RNN Tutorial, Effectiveness Assignment 3 due April 15
Apr17 '' Assignment 4 out
Apr19 ''
Dependency parsing J&M 12.7, Nivre2003, Chen2014
Apr24 ''
Apr26 NLP@Bloomberg / Amanda Stent and Anju Kambadur
May1 NLP@Google / David Weiss
May3 Dependency parsing
Constituency parsing J&M 12.1-12.6, 13.1-13.4, 14.1-14.4, M&S 11, 12.1, Collins: PCFGs, Eisner: Inside-outside, Collins: Inside-outside
May8 ''
NLP in Startups / S.R.K. Branavan (ASAPP)
May10 Constituency parsing Assignment 4 due, final exam out and due May 15

Readings

Related Readings

Procedurals

Policies are subject to change. If something is not clear, please contact the course staff.

Grading: 40% assignments, 25% take-home final exam, 30% class review quizzes, and 5% participation (including both Slack and class).

Quizzes: First five minutes of every class. The top 20 quizzes count towards your grade, each worth 1.5%. Each quiz covers the material from the slides of the previous lecture. Attendance in class is required to complete a quiz; connecting remotely to complete a quiz is not allowed. Quiz time is just like an exam: no copying, no talking, and no browsing the web. Quizzes may be taken on laptops or mobile devices.

Assignments: All assignments must be completed in pairs. Allowed usage of third-party code/frameworks/tools is specified in each assignment. Please ask before using anything beyond what is specified; better safe than sorry. All assignments should be implemented in Python. The final exam is subject to the same policies as the assignments (e.g., it must be done in pairs).

Kaggle: Some assignments may include participation in Kaggle competitions. Participation must be in teams with all accounts associated with your team. Please clearly list your team in the assignment writeup.

Late policy: 10% off for every 12 hours of delay (e.g., with a 25-hour delay, the grade starts at 70). No late submissions are allowed for the final exam.

Laptop and device policy: Except for quiz taking, no electronic devices are allowed in class.

Prerequisites: Strong programming experience (CS 2110 or equivalent) and CS 4780, CS 4786, or CS 5785 with a grade of B or above. Auditing does not count. If you did not complete any of these classes, or your grade is below B, enrollment requires instructor permission. For any other questions regarding enrollment, show up in person on the first day of class. Personal questions will be addressed following the lecture.

Auditing: Unfortunately, formal auditing is not possible. However, if the number of requests is small, we will provide an alternative way to take the class with a pass/fail grade. This will still require taking the quizzes: the top 20 quizzes will count, and you will need an average of at least 60/100 to pass. All requests must be emailed to the instructor before midnight on Sunday, Jan 27.

Short and Incomplete List of NLP Pointers

NLP Conferences and Journals

The main publication venues are ACL, NAACL, EMNLP, TACL, EACL, CoNLL, and CL. All the papers from these venues can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, and IJCAI. A calendar of NLP events is available here, and ACL-sponsored events are listed here.

A Sample of Tasks and Problems in NLP

Corpora and Other Data

Tagging

Part-of-speech Tags

Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:

>>> import nltk
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0])))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0])))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.

The universal tag set is described here. The PTB tag set is described here.

Named Entity Recognition Data

The CoNLL 2002 shared task is available in NLTK:

>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0])))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O

CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types (LOC, ORG, PER, and MISC).
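
To make the tagging scheme concrete, here is a minimal sketch of decoding IOB tags into typed entity spans. The function name iob_to_spans and the (type, start, end) span convention (end exclusive) are our own choices for illustration, not part of the corpus API.

import nltk

def iob_to_spans(tagged_sent):
    """Convert (word, pos, iob_tag) triples into (entity_type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, (_, _, tag) in enumerate(tagged_sent):
        # Close the open span on 'O', on a new 'B-', or when the entity type changes.
        if etype is not None and (tag == 'O' or tag.startswith('B-') or tag[2:] != etype):
            spans.append((etype, start, i))
            etype = None
        # Open a new span on 'B-', or on 'I-' when no span is currently open.
        if tag.startswith('B-') or (tag.startswith('I-') and etype is None):
            etype, start = tag[2:], i
    if etype is not None:
        spans.append((etype, start, len(tagged_sent)))
    return spans

print(iob_to_spans(nltk.corpus.conll2002.iob_sents()[0]))
# e.g., [('LOC', 0, 2), ('LOC', 3, 4), ('ORG', 9, 10)] for the sentence shown above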

NYT Recipe Data

This is another example of tagging. The task is explained here, and the data release is described here.

Dependency Parsing

The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
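
If you prefer to load the data yourself rather than through a library, here is a minimal sketch of a CoNLL-U reader that keeps only the form, universal POS tag, head index, and dependency relation for each word. The function name read_conllu and the example file path are placeholders.

def read_conllu(path):
    """Read a CoNLL-U file into lists of (form, upos, head, deprel) tuples, one per sentence."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                 # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            elif line.startswith('#'):   # comment lines carry sentence metadata
                continue
            else:
                # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
                cols = line.split('\t')
                if '-' in cols[0] or '.' in cols[0]:
                    continue             # skip multiword-token ranges and empty nodes
                current.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences

# Example (placeholder path to a UD treebank file):
# sentences = read_conllu('en-ud-dev.conllu')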

Constituency Parsing

The Penn Treebank is available from the LDC. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
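
For a quick look without the LDC release, NLTK ships with a small PTB sample whose parse trees can be loaded directly; a short sketch:

import nltk

tree = nltk.corpus.treebank.parsed_sents()[0]  # the "Pierre Vinken" sentence shown earlier
print(tree)            # bracketed (S (NP-SBJ ...) ...) notation
tree.pretty_print()    # ASCII rendering of the tree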

Machine Translation

The WMT shared task from 2016 is a good source for newswire bi-text.

Textual Entailment

TE has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.

Reading Comprehension

MCTest is a relatively new corpus that is receiving significant attention. SQuAD is even newer. Another corpus we will look at is the DailyMail/CNN corpus from DeepMind.

Semantic Parsing

We will look at three data sets commonly used for semantic parsing:

  1. GeoQuery: A natural language interface to a small US geography database. The original data is available here, and the original query language is described here. The data with lambda calculus logical forms is available here.
  2. ATIS: A natural language interface for a flights database. The data is available from the LDC.
  3. Navi: Instructional language for robot navigation. The original data is described here, but we recommend using the data here.

Word Analogy

Question Answering

Online Demos, Systems, and Tools

If you encounter an interesting demo or system not listed here, please email the course instructor.

Deep Learning frameworks and tools:

Technical Tips

We recommend using virtualenv with Python. Here is a quick but sufficient explanation. All pip installations will then be local to the environment. For example, you can install NLTK to access data (using pip install nltk).
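
As a quick sanity check that the environment-local install works (the treebank corpus below is just one example; NLTK data is downloaded separately from the pip package):

import nltk

nltk.download('treebank')               # fetches the corpus into your nltk_data directory
print(nltk.corpus.treebank.sents()[0])  # first sentence of the bundled PTB sample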

Got any good tips? Email to share!