Spring 2017 CS5740: Natural Language Processing

Time: Monday and Wednesday, 2:30-3:45pm
Room: Big Red
Instructor: Yoav Artzi, yoav@cs (office hours: Monday 5pm, Baron)
TA: Max Grusky, grusky@cs (office hours: Thursday 1pm, by Skype, coordinate via email)
Class listing: CS5740
CMS

This course constitutes a depth-first technical introduction to natural language processing (NLP). NLP is at the heart of many of today's most exciting technological achievements, including machine translation, automatic conversational assistants, and Internet search. The goal of the course is to provide a deep understanding of the language of the field, including algorithms, core problems, methods, and data. Possible topics include text classification, lexical semantics, language modeling, machine translation, tagging and sequence modeling, parsing, compositional semantics, summarization, question answering, language grounding, information extraction, and sentiment analysis.

This course is not about using off-the-shelf NLP tools. If you are interested in tools and frameworks, see the pointers below to get your search started and your questions answered.

We will continuously update this page.

Schedule

Date Topic Recommended Readings Others
Jan25 Introduction
Jan30 Text classification M&S 7.4,16.2-16.3, Collins: Naive Bayes (Sec 1-4), Collins: Log Linear (Sec 2), MaxEnt, Baselines, CNN Classification Naive Bayes prior derivation
Feb1 '' Assignment 1 out
Feb6 ''
Feb8 ''
Neural networks Primer, Back-prop, Deep Averaging Networks, Gradient Checks (briefly), Gradient Checks (in detail)
Feb13 ''
Computation graphs Intro to Computation Graphs
Feb15 Lexical semantics and embeddings w2v explained, word2vec, word2vec phrases, Hill2016, Turney2010
Feb20 No class: February break
Feb22 '' Assignment 1 due, assignment 2 out
Feb27 ''
Language modeling J&M 4, M&S 6, Collins: LM, Smoothing, Char RNN
Mar1 ''
Mar6 ''
Machine translation Neural MT Tutorial, BLEU Score
Mar8 ''
IBM translation models J&M 25.5, M&S 13.1-13.2, Collins: IBM Models, IBM Models, Collins: EM (Sec 5-6), HMM alignments, IBM Model 2 EM Notebook
Mar13 '' Assignment 2 due
Phrase-based machine translation J&M 25.4, 25.8, M&S 13.3, Collins: PBT, Statistical PBT, Pharaoh decoder
Mar15 ''
Mar20 '' Assignment 3 out
Mar22 ''
Sequence modeling J&M 5.1-5.3, 6, M&S 3.1, 9, 10.1-10.3, Collins: HMM, Collins: MEMMs (Sec 3), Collins: CRF (sec 4), Collins: Forward-backward, SOTA Taggers, TnT Tagger, Stanford Tagger
Mar27 ''
Mar29 ''
Apr3 No class: spring break
Apr5 No class: spring break
Apr10 Sequence modeling (contd.)
Apr12 RNNs BPTT, RNN Tutorial, Effectiveness Assignment 3 due April 15
Apr17 '' Assignment 4 out
Apr19 ''
Dependency parsing J&M 12.7, Nivre2003, Chen2014
Apr24 ''
Apr26 NLP@Bloomberg / Amanda Stent and Anju Kambadur
May1 NLP@Google / David Weiss
May3 Dependency parsing
Constituency parsing J&M 12.1-12.6, 13.1-13.4, 14.1-14.4, M&S 11, 12.1, Collins: PCFGs, Eisner: Inside-outside, Collins: Inside-outside
May8 ''
NLP in Startups / S.R.K. Branavan (ASAPP)
May10 Constituency parsing Assignment 4 due, final exam out and due May 15

Readings

Related Readings

Procedurals

Policies are subject to change. If something is not clear, please contact the course staff.

Grading: 40% assignments, 25% take-home final exam, 30% class review quizzes, and 5% participation (including both Slack and class).

Quizzes: First five minutes of every class. The top 20 quizzes count towards your grade, each worth 1.5%. Each quiz covers the material from the slides of the previous lecture. Attendance in class is required to complete a quiz; connecting remotely to complete a quiz is not allowed. Quiz time is just like an exam: no copying, no talking, and no browsing the web. Quizzes may be taken on laptops or mobile devices.

Assignments: All assignments must be completed in pairs. Allowed usage of third-party code/frameworks/tools is specified in each assignment. Please ask before using anything beyond what is specified; better safe than sorry. All assignments should be implemented in Python. The final exam is subject to the same policies as the assignments (e.g., it must be done in pairs).

Kaggle: Some assignments may include participation in Kaggle competitions. Participation must be in teams with all accounts associated with your team. Please clearly list your team in the assignment writeup.

Late policy: 10% off for every 12 hours of delay (e.g., with a 25-hour delay, the grade starts at 70). No late submissions are allowed for the final exam.

Laptop and device policy: Except for quiz taking, no electronic devices are allowed in class.

Prerequisites: Strong programming experience (CS 2110 or equivalent) and CS 4780, CS 4786, or CS 5785 with a grade of B or above. Auditing does not count. If you did not complete any of these classes, or your grade is below B, enrollment requires instructor permission. For any other questions regarding enrollment, show up in person on the first day of class. Personal questions will be addressed following the lecture.

Auditing: Unfortunately, formal auditing is not possible. However, if the number of requests is small, we will provide an alternative way to take the class with a pass/fail grade. This will still require taking the quizzes: the top 20 quizzes will count, and you will need an average of at least 60/100 to pass. All requests must be emailed to the instructor before midnight on Sunday, Jan 27.

Short and Incomplete List of NLP Pointers

NLP Conferences and Journals

The main publication venues are ACL, NAACL, EMNLP, TACL, EACL, CoNLL, and CL. All the papers from these venues can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, and IJCAI. A calendar of NLP events is available here, and ACL-sponsored events are listed here.

A Sample of Tasks and Problems in NLP

Corpora and Other Data

Tagging

Part-of-speech Tags

Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:

>>> import nltk
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0])))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0])))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.

The universal tag set is described here. The PTB tag set is described here.

Named Entity Recognition Data

The CoNLL 2002 shared task is available in NLTK:

>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0])))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O

CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types (LOC, ORG, PER, and MISC).
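
To make the tagging scheme concrete, here is a minimal sketch of decoding IOB tags into typed entity spans. The function name iob_to_spans and the (type, start, end) span convention (end exclusive) are our own choices for illustration, not part of the corpus API.

import nltk

def iob_to_spans(tagged_sent):
    """Convert (word, pos, iob_tag) triples into (entity_type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, (_, _, tag) in enumerate(tagged_sent):
        # Close the open span on 'O', on a new 'B-', or when the entity type changes.
        if etype is not None and (tag == 'O' or tag.startswith('B-') or tag[2:] != etype):
            spans.append((etype, start, i))
            etype = None
        # Open a new span on 'B-', or on 'I-' when no span is currently open.
        if tag.startswith('B-') or (tag.startswith('I-') and etype is None):
            etype, start = tag[2:], i
    if etype is not None:
        spans.append((etype, start, len(tagged_sent)))
    return spans

print(iob_to_spans(nltk.corpus.conll2002.iob_sents()[0]))
# e.g., [('LOC', 0, 2), ('LOC', 3, 4), ('ORG', 9, 10)] for the sentence shown above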

NYT Recipe Data

This is another example of tagging. The task is explained here, and the data release is described here.

Dependency Parsing

The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
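
If you prefer to load the data yourself rather than through a library, here is a minimal sketch of a CoNLL-U reader that keeps only the form, universal POS tag, head index, and dependency relation for each word. The function name read_conllu and the example file path are placeholders.

def read_conllu(path):
    """Read a CoNLL-U file into lists of (form, upos, head, deprel) tuples, one per sentence."""
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                 # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            elif line.startswith('#'):   # comment lines carry sentence metadata
                continue
            else:
                # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
                cols = line.split('\t')
                if '-' in cols[0] or '.' in cols[0]:
                    continue             # skip multiword-token ranges and empty nodes
                current.append((cols[1], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences

# Example (placeholder path to a UD treebank file):
# sentences = read_conllu('en-ud-dev.conllu')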

Constituency Parsing

The Penn Treebank is available from the LDC. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
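
For a quick look without the LDC release, NLTK ships with a small PTB sample whose parse trees can be loaded directly; a short sketch:

import nltk

tree = nltk.corpus.treebank.parsed_sents()[0]  # the "Pierre Vinken" sentence shown earlier
print(tree)            # bracketed (S (NP-SBJ ...) ...) notation
tree.pretty_print()    # ASCII rendering of the tree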

Machine Translation

The WMT shared task from 2016 is a good source for newswire bi-text.

Textual Entailment

TE has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.

Reading Comprehension

MCTest is a relatively new corpus that is receiving significant attention. SQuAD is even newer. Another corpus we will look at is the DailyMail/CNN corpus from DeepMind.

Semantic Parsing

We will look at three data sets commonly used for semantic parsing:

  1. GeoQuery: A natural language interface to a small US geography database. The original data is available here, and the original query language is described here. The data with lambda calculus logical forms is available here.
  2. ATIS: A natural language interface for a flights database. The data is available from the LDC.
  3. Navi: Instructional language for robot navigation. The original data is described here, but we recommend using the data here.

Word Analogy

Question Answering

Online Demos, Systems, and Tools

If you encounter an interesting demo or system not listed here, please email the course instructor.

Deep Learning frameworks and tools:

Technical Tips

We recommend using virtualenv with Python. Here is a quick but sufficient explanation. All pip installations will then be local to the environment. For example, you can install NLTK to access data (using pip install nltk).
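
As a quick sanity check that the environment-local install works (the treebank corpus below is just one example; NLTK data is downloaded separately from the pip package):

import nltk

nltk.download('treebank')               # fetches the corpus into your nltk_data directory
print(nltk.corpus.treebank.sents()[0])  # first sentence of the bundled PTB sample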

Got any good tips? Email to share!