CS5740 SP19

Time: TuThu, 10:55am-12:10pm
Room: Bloomberg 161
Class listing: CS5740

Instructor: Yoav Artzi
Office hours: Tuesday, 4:15pm-5:15pm
Location: Bloomberg 371

Teaching assistants: Xinya Du and Valts Blukis
Graders: Emily Tseng, Kelly Wang, and Iris Zhang
TA Office hours: Friday, 11:00am-12:00pm
Location: Bloomberg 360

CMS | Piazza

Short and Incomplete List of Resources

NLP Conferences and Journals

The main publication venues are ACL, NACCL, EMNLP, TACL, EACL, CoNLL, and CL. All the paper from these publications can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, IJCAI. A calendar of NLP events is available here, and ACL sponsored events are listed here.

Corpora and Other Data


Part-of-speech Tags

Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:

>> import nltk
>> print ' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0]))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>> print ' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0]))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.

The universal tag set is described here. The PTB tag set is described here.

Named Entity Recognition Data

The CoNLL 2002 shared task is available in NLTK:

>> import nltk
>> len(nltk.corpus.conll2002.iob_sents())
>> len(nltk.corpus.conll2002.iob_words())
>> print ' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0]))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O

CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.

NYT Recipe Data

This is another example of tagging. The task is explained here, and the data release is described here.

Dependency Parsing

The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.

Constituency Parsing

The Penn Treebank is available from the LDC You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.

Machine Translation

The WMT shared task from 2016 is a good source for newswire bi-text.

Textual Entailment

TE has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.

Reading Comprehension

Semantic Parsing

We will look at three data sets commonly used for semantic parsing:

  1. GeoQuery: A natural language interface to a small US geography database. The original data is available here, and the original query language is described here. The data with lambda calculus logical forms is available here.
  2. ATIS: A natural language interface for a flights database. The data is available from the LDC.
  3. Navi: Instructional language for robot navigation. The original data is described here, but we recommend using the data here.

Word Analogy

Question Answering

Vision and Language

Online Demos, Systems, and Tools

If you encounter an interesting demo or system not listed here, please email the course instructor.

Deep Learning frameworks and tools: