Time: TuThu, 10:55am-12:10pm
Room: Bloomberg 161
Class listing: CS5740
Instructor: Yoav Artzi
Office hours: Tuesday, 4:15pm-5:15pm
Location: Bloomberg 371
Teaching assistants: Xinya Du and Valts Blukis
Graders: Emily Tseng, Kelly Wang, and Iris Zhang
TA Office hours: Friday, 11:00am-12:00pm
Location: Bloomberg 360
CMS | Piazza
The main publication venues are ACL, NACCL, EMNLP, TACL, EACL, CoNLL, and CL. All the paper from these publications can be found in the ACL Anthology. In addition, NLP publications often appear in ML and AI conferences, including ICML, NIPS, ICLR, AAAI, IJCAI. A calendar of NLP events is available here, and ACL sponsored events are listed here.
Both parsing corpora below (PTB and UD) contain POS tags. Each parse tree contains POS tags for all leaf nodes. You can view a sample of the PTB in NLTK:
>> import nltk
>> print ' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents()[0]))
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
>> print ' '.join(map(lambda x: '/'.join(x), nltk.corpus.treebank.tagged_sents(tagset='universal')[0]))
Pierre/NOUN Vinken/NOUN ,/. 61/NUM years/NOUN old/ADJ ,/. will/VERB join/VERB the/DET board/NOUN as/ADP a/DET nonexecutive/ADJ director/NOUN Nov./NOUN 29/NUM ./.
The universal tag set is described here. The PTB tag set is described here.
The CoNLL 2002 shared task is available in NLTK:
>> import nltk
>> len(nltk.corpus.conll2002.iob_sents())
35651
>> len(nltk.corpus.conll2002.iob_words())
678377
>> print ' '.join(map(lambda x: x[0] + '/' + x[2], nltk.corpus.conll2002.iob_sents()[0]))
Sao/B-LOC Paulo/I-LOC (/O Brasil/B-LOC )/O ,/O 23/O may/O (/O EFECOM/B-ORG )/O ./O
CoNLL 2002 is annotated with the IOB annotation scheme and multiple entity types.
This is another example of tagging. The task is explained here, and the data release is described here.
The Universal Dependencies (UD) project is publicly available online. The website includes statistics for all annotated languages. You can easily download v1.3 from here. UD files follow the simple CoNLL-U format.
The Penn Treebank is available from the LDC You will find tgrep useful for quickly searching the corpus for patterns. NLTK can also be used to load parse trees. A few more browsers are available online.
The WMT shared task from 2016 is a good source for newswire bi-text.
TE has been studied extensively for more than a decade now. Recently, SNLI has been receiving significant attention.
We will look at three data sets commonly used for semantic parsing:
If you encounter an interesting demo or system not listed here, please email the course instructor.
Deep Learning frameworks and tools: