Time: Monday and Wednesday, 2:30-3:45pm
Room: Big Red
Instructor: Yoav Artzi, yoav@cs (office hours: Monday 5pm, Baron)
TA: Max Grusky, grusky@cs (office hours: Thursday 1pm, by Skype, coordinate via email)
Class listing: CS5740
This course constitutes a depth-first technical introduction to natural language processing (NLP). NLP is at the heart of many of today's most exciting technological achievements, including machine translation, automatic conversational assistants and Internet search. The goal of the course is to provide a deep understanding of the language of the field, including algorithms, core problems, methods, and data. Possible topics include text classification, lexical semantics, language modeling, machine translation, tagging and sequence modeling, parsing, compositional semantics, summarization, question answering, language grounding, information extraction, and sentiment analysis.
This course is not about using off-the-shelf NLP tools. If interested in tools and frameworks, see below for pointers to get your search started and questions answered.
We will continuously update this page.
|Jan30||Text classification||M&S 7.4,16.2-16.3, Collins: Naive Bayes (Sec 1-4), Collins: Log Linear (Sec 2), MaxEnt, Baselines, CNN Classification, Naive Bayes prior derivation|
|Feb1||''||Assignment 1 out|
|Neural networks||Primer, Back-prop, Deep Averaging Networks, Gradient Checks (briefly), Gradient Checks (in details)|
|Computation graphs||Intro to Computation Graphs|
|Feb15||Lexical semantics and embeddings||w2v explained, word2vec, word2vec phrases, Hill2016, Turney2010|
|No class: February break|
|Feb22||''||Assignment 1 due, assignment 2 out|
|Language modeling||J&M 4, M&S 6, Collins: LM, Smoothing, Char RNN|
|Machine translation||Neural MT Tutorial, BLEU Score|
|IBM translation models||J&M 25.5, M&S 13.1-13.2, Collins: IBM Models, IBM Models, Collins: EM (Sec 5-6), HMM alignments, IBM Model 2 EM Notebook|
|Mar13||''||Assignment 2 due|
|Phrase-based machine translation||J&M 25.4, 25.8, M&S 13.3, Collins: PBT, Statistical PBT, Pharaoh decoder|
|Mar20||''||Assignment 3 out|
|Sequence modeling||J&M 5.1-5.3, 6, M&S 3.1, 9, 10.1-10.3, Collins: HMM, Collins: MEMMs (Sec 3), Collins: CRF (Sec 4), Collins: Forward-backward, SOTA Taggers, TnT Tagger, Stanford Tagger|
|No class: spring break|
|No class: spring break|
|Apr10||Sequence modeling (contd.)|
|Apr12||RNNs||BPTT, RNN Tutorial, Effectiveness||Assignment 3 due April 15|
|Apr17||''||Assignment 4 out|
|Dependency parsing||J&M 12.7, Nivre2003, Chen2014|
|Apr26||NLP@Bloomberg / Amanda Stent and Anju Kambadur|
|May1||NLP@Google / David Weiss|
|Constituency parsing||J&M 12.1-12.6, 13.1-13.4, 14.1-14.4, M&S 11, 12.1, Collins: PCFGs, Eisner: Inside-outside, Collins: Inside-outside|
|NLP in Startups / S.R.K. Branavan (ASAPP)|
|May10||Constituency parsing||Assignment 4 due, final exam out and due May 15|
Policies are subject to change. If something is not clear, please contact the course staff.
Grading: 40% assignments, 25% take-home final exam, 30% class review quizzes, and 5% participation (including both Slack and class).
Quizzes: A quiz is given in the first five minutes of every class. The top 20 quizzes count towards your grade, 1.5% each. Each quiz covers the material from the slides of the previous lecture. Attendance in class is required to complete a quiz. Connecting remotely to complete a quiz is not allowed. Quiz time is just like an exam: no copying, no talking, and no browsing the web. Quizzes may be taken on laptops or mobile devices.
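As a worked illustration of how the grading weights and the top-20 quiz rule fit together, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that missed quizzes are simply absent from the list; the function names are hypothetical and not part of any course tooling.

```python
def quiz_component(quiz_scores, top_n=20, weight_each=1.5):
    """Grade points (out of 30) earned from the best `top_n` quizzes.

    Assumption: each quiz score is on a 0-100 scale.
    """
    best = sorted(quiz_scores, reverse=True)[:top_n]
    return sum(score / 100.0 * weight_each for score in best)


def course_grade(assignments_avg, final_exam, quiz_scores, participation):
    """Weighted total: 40% assignments, 25% final, 30% quizzes, 5% participation."""
    return (0.40 * assignments_avg
            + 0.25 * final_exam
            + quiz_component(quiz_scores)
            + 0.05 * participation)
```

With 20 or more perfect quizzes the quiz component reaches its 30-point cap (20 x 1.5), which is why quizzes beyond the best 20 do not change the result.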
Assignments: All assignments must be completed in pairs. Allowed usage of third-party code/frameworks/tools is specified in each assignment. Please ask before using anything beyond what is specified; better safe than sorry. All assignments should be implemented in Python. The final exam is subject to the same policies as assignments (e.g., it must be done in pairs).
Kaggle: Some assignments may include participation in Kaggle competitions. Participation must be in teams with all accounts associated with your team. Please clearly list your team in the assignment writeup.
Late policy: 10% off for every 12 hours of delay, counting any started 12-hour period (e.g., with a 25-hour delay the grade starts at 70). No late submission is allowed for the final exam.
Laptop and device policy: Except for quiz taking, no electronic devices are allowed in class.
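The late-policy arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the penalty is counted in points off a maximum of 100 and that a partially elapsed 12-hour period counts in full, which is what the 25-hour example implies. The function name is hypothetical.

```python
import math


def max_grade_after_delay(hours_late, step_hours=12, penalty_per_step=10):
    """Highest attainable grade (out of 100) for a submission `hours_late` late.

    Assumption: every started 12-hour period costs 10 points.
    """
    if hours_late <= 0:
        return 100
    periods = math.ceil(hours_late / step_hours)
    return max(0, 100 - penalty_per_step * periods)


print(max_grade_after_delay(25))  # 70, matching the example in the policy
```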
Prerequisites: Strong programming experience (CS 2110 or equivalent) and CS 4780, CS 4786, or CS 5785 with a grade of B or above. Auditing does not count. If you did not complete any of these classes or your grade is below B, enrollment requires instructor permission. For any other questions regarding enrollment, show up in person on the first day of class. Personal questions will be addressed following the lecture.
Auditing: Unfortunately, formal auditing is not possible. However, if the number of requests is small, we will provide an alternative way to take the class with a pass/fail grade. This will still require taking the quizzes. The top 20 quizzes will count, and you will need an average of at least 60/100 to get a pass. All requests must be emailed to the instructor before midnight Sunday, Jan 27.
The main publication venues are ACL, NAACL, EMNLP, TACL, EACL, CoNLL, and CL. All papers from these venues can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, and IJCAI. A calendar of NLP events is available here, and ACL sponsored events are listed here.
Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:
>>> import nltk
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0])))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>>> print(' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0])))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.
The universal tag set is described here. The PTB tag set is described here.
The CoNLL 2002 shared task is available in NLTK:
>>> import nltk
>>> len(nltk.corpus.conll2002.iob_sents())
35651
>>> len(nltk.corpus.conll2002.iob_words())
678377
>>> print(' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0])))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O
CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.
This is another example of tagging. The task is explained here, and the data release is described here.
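To make the IOB scheme concrete, the following minimal Python sketch groups IOB-tagged tokens into entity spans. It uses the sample CoNLL 2002 sentence shown above; the function name `iob_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new entity is an assumption made for robustness rather than part of the task definition.

```python
def iob_to_spans(tagged_tokens):
    """Group (token, IOB tag) pairs into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in tagged_tokens:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type)
        if starts_new:
            # Close any open span, then open a new one of this type.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open span if there is one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


sentence = [("Sao", "B-LOC"), ("Paulo", "I-LOC"), ("(", "O"),
            ("Brasil", "B-LOC"), (")", "O"), (",", "O"), ("23", "O"),
            ("may", "O"), ("(", "O"), ("EFECOM", "B-ORG"), (")", "O"),
            (".", "O")]
print(iob_to_spans(sentence))
# [('LOC', 'Sao Paulo'), ('LOC', 'Brasil'), ('ORG', 'EFECOM')]
```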
The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
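Since CoNLL-U is a plain tab-separated format, a small reader is easy to write. The sketch below is illustrative only: it assumes the ten standard columns (named as in UD v1), skips comment lines starting with `#`, and treats blank lines as sentence boundaries. The two-token sample is hand-written for this example and is not taken from a UD release.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upostag", "xpostag",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines):
    """Yield sentences as lists of dicts, one dict per token line."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comment; ignored in this sketch.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Hand-written illustrative sample (not from a UD release).
sample = (
    "# illustrative sentence\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "\n"
)
for sent in read_conllu(sample.splitlines()):
    print([(tok["form"], tok["upostag"], tok["head"], tok["deprel"])
           for tok in sent])
```

All values are kept as strings here; converting `head` to an integer is left out because multiword-token lines use ID ranges such as `1-2` and would need separate handling.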
The Penn Treebank is available from the LDC. You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
The WMT shared task from 2016 is a good source for newswire bi-text.
Textual entailment (TE) has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.
MCTest is a relatively new corpus, which is receiving significant attention. SQuAD is even newer. Another corpus we will look at is DailyMail/CNN from DeepMind.
We will look at three data sets commonly used for semantic parsing:
If you encounter an interesting demo or system not listed here, please email the course instructor.
Deep Learning frameworks and tools:
We recommend using virtualenv with Python. Here is a quick, but sufficient, explanation. All pip installations will then be local to the environment. For example, you may install NLTK to access data (i.e., using pip install nltk).
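Before running pip, it can be useful to confirm that the interpreter is actually running inside an environment. The following is a small sketch of one common check, not an official virtualenv API; the function name is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv or virtualenv."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ
    # from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print(in_virtual_environment())
```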
Got any good tips? Email to share!