 1/21: Introduction
 Four handouts: course information sheet (includes syllabus), student
information sheet, reaction essay readings, introductory paper
 Introductory paper: "I'm sorry Dave, I'm
afraid I can't do that": Linguistics, Statistics, and Natural
Language Processing circa 2001
 Bill
Gates quote, Gartner symposium, 1997. (Here
is an alternate link.)
 About the word
"undertoad" (see John Irving, The World According to Garp,
1976.)
 Mary Klages, 2001. Structuralism
and Saussure
 Roni
Rosenfeld, the Universal Speech
Interface Manifesto
 Alan M. Turing, 1950. Computing
machinery and intelligence. Mind LIX(236), pp. 433-460.
 1/23: Feature-based context-free grammars
 1/28: Tree adjoining grammars
 Handout: lecture examples
 Christopher Culy, 1985. The complexity of the vocabulary of
Bambara. Linguistics and Philosophy, 8:345-351.
 Aravind K. Joshi, Leon S. Levy, Masako Takahashi, 1975. Tree
adjunct grammars. Journal of Computer and System Sciences
10(1), pp. 136-163.
 Aravind K. Joshi and Yves Schabes, Tree-adjoining
grammars. I don't know of a publication source for this.
 Aravind K. Joshi,
K. Vijay-Shanker, and David Weir,
1991. The convergence of mildly context-sensitive grammar formalisms.
In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational
Issues in Natural Language Processing, pp. 31-81. MIT Press.
 Aravind K. Joshi, 1985. Tree adjoining grammars: how much
context-sensitivity is required to provide reasonable structural
descriptions? In David R. Dowty, Lauri Karttunen, and Arnold M. Zwicky,
eds, Natural Language Parsing: psychological, computational,
and theoretical perspectives, Cambridge. Note: there may be an
error in the parsing-time claim.
 Geoff Pullum, 1986. Footloose
and context-free. Natural Language and
Linguistic Theory 4, pp. 283-289. Reprinted in The Great Eskimo Vocabulary Hoax, U. of
Chicago Press, 1991.
 Stuart
Shieber, 1985. Evidence against the context-freeness of natural
language. Linguistics and Philosophy, 8:333-343.
 Stuart
Shieber and Yves Schabes, 1990. Synchronous
Tree-Adjoining Grammars. In Proceedings of the 13th International Conference on Computational Linguistics, volume 3, pp. 1-6.
 K. Vijay-Shanker and Aravind K. Joshi, 1985. Some computational
properties of Tree Adjoining Grammars. Proc. of the 23rd ACL,
pp. 82-93.
 K. Vijay-Shanker and
David
Weir, 1994. The
equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27:511-545.
 The XTAG home page.
 TAG+6: The 6th
International Workshop on Tree Adjoining Grammars and Related Frameworks.
 1/30: TAGs with adjunction
constraints
 2/04: Feature-based TAGs
 Handout: Lecture examples
 Steven Abney, 1996. Statistical
methods and linguistics. The Balancing
Act, Judith
Klavans and Philip Resnik, eds, MIT Press. Paper available online as ps
or
pdf
 Sanguthevar
Rajasekaran and Shibu Yooseph, 1995. TAL recognition in O(M(n^2)) time. Proc. of the 33rd
ACL, pp. 166-173. (Link is to the journal version, in
Journal of Computer and System Sciences, 56(1), pp. 83-89, 1998.)
 Giorgio Satta, 1994. Tree Adjoining Grammar parsing
and Boolean matrix multiplication. Computational Linguistics, 20(2),
pp. 173-191.
 2/06: (Some) statistics of
language
 Handout: Lecture handout
 J.B. Estoup, 1916. Gammes Sténographiques. Institut
Sténographique de France.
 W. Nelson Francis and Henry Kucera, 1982. Frequency Analysis of
English Usage. Houghton Mifflin.
Also of potential interest: the Brown corpus
manual, 1979; and a search interface.
 Tommi Jaakkola
and David Haussler, 1998. Exploiting
generative models in discriminative classifiers. NIPS 11.
 Mark Johnson, 2001.
Joint
and conditional estimation of tagging and parsing
models. Proc. of ACL, pp. 314-321.
 John Lafferty, Andrew McCallum, and Fernando Pereira,
2001. Conditional
random fields: Probabilistic models for segmenting and labeling
sequence data. Proc. of ICML.
 Benoit Mandelbrot, 1957. Théorie mathématique de la
loi d'Estoup-Zipf. Inst. de Statistique de l'Université.
 George A. Miller, 1957. Some
effects of intermittent silence. American J. Psychology 70,
pp. 311-313.
 Andrew Y. Ng and
Michael Jordan, 2002. On
discriminative vs. generative classifiers: A comparison of logistic
regression and Naive Bayes. Proc. of NIPS.
 Y. Dan Rubenstein and Trevor Hastie, 1997.
Discriminative
vs Informative Learning. Proc. of KDD.
 George Zipf, 1949. Human Behavior and the principle of least
effort. Addison-Wesley Press.
 Zipf's
Law (webpage by Wentian Li). Many Zipf's Law refs in
many fields.
 See also section 1.4 of the Manning/Schütze text or chapter
4 of Timothy C. Bell, John G. Cleary, and Ian H. Witten, Text
Compression, 1990.
 2/11: Hidden Markov Models
 2/13 and 2/18: Hidden Markov Models (cont.)
 2/20: Good-Turing smoothing
 Handout: Lecture handout
 Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza,
Jennifer C. Lai, and Robert L. Mercer, 1992. Class-based n-gram
models of natural language. Computational Linguistics 18(4),
pp. 467-479.

Stanley F. Chen and Joshua Goodman, 1996. An
empirical study of smoothing techniques for language modeling.
Proceedings of the 34th Meeting of the Association for Computational
Linguistics, pp. 310-318.
TR version: TR-10-98,
Computer Science Group, Harvard University, 1998.
 Kenneth W. Church and William A. Gale, 1991. A comparison of the
enhanced Good-Turing and deleted estimation methods for estimating
probabilities of English bigrams. Computer Speech and Language
5, pp. 19-54.
 Michael
Collins and James Brooks, 1995. Prepositional Phrase Attachment
through a Backed-off Model. Third Workshop on Very Large
Corpora, pp. 27-38.
 I. J. Good, 1953. The population frequencies of species and the
estimation of population parameters. Biometrika 40, pp.
237-264.
 Pierre Simon Laplace. Essai Philosophique sur les
probabilités. There appear to be several editions; Ristad's A
natural law of succession dates it to 1775.
 Arthur Nadas, 1985. On Turing's formula for word probabilities.
IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-33(6), pp. 1414-1416.
 Geoffrey Sampson's Good-Turing
page.
 See also section 6.2 of the Manning/Schütze text or chapter
15 of the Jelinek text.
 2/25: Discourse phenomena
 Handout: Lecture handout
 Ralph Grishman, 1986. Computational Linguistics: An
Introduction. Cambridge.
 Jerry R. Hobbs, 1978. Resolving Pronoun References. Reprinted
in Grosz, Sparck Jones, and Webber, Readings in Natural Language
Processing.
 Jerry R. Hobbs, 1979. Coherence and coreference. Cognitive
Science 3(1), pages 67-82.
 Ray Jackendoff, 1972. Semantic Interpretation in Generative
Grammar. MIT Press.
 William
C. Mann and Sandra A. Thompson, 1986. Relational propositions in
discourse. Discourse Processes 9(1), pp. 57-90.
 Vladimir Nabokov, Lolita. 1955.
 Candace Sidner, 1979,
Towards a Computational Theory of Definite Anaphora Comprehension
in English Discourse. PhD Thesis, MIT.
 Remko Scha and Livia Polanyi, 1988. An augmented context free grammar
for discourse. Proceedings of COLING.
 Yorick Wilks, 1975. An intelligent analyzer and understander of
English. Communications of the ACM 18(5), pp. 264-274. Reprinted in Grosz et al.,
Readings in Natural Language Processing.
 See also ch. 14 of Allen or ch. 18 of Jurafsky/Martin.
 2/27: The Grosz and Sidner
discourse theory
 3/4: Word sense disambiguation
 Handouts: Lecture handout
 Kathleen G. Dahlgren, 1988. Naive Semantics for Natural
Language Understanding. Kluwer.
 William Gale, Kenneth Church, and David Yarowsky, 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs.
Proceedings of the ACL, pp. 249-256.
 Nancy Ide and Jean Veronis, 1998. Introduction to the special issue on
word sense disambiguation: The state of the art. Computational
Linguistics 24(1), pp. 1-40.
 Abraham Kaplan, 1950. An experimental study of ambiguity and
context. Mechanical Translation 2(2): 39-46 (issue appeared
in 1955).
 Jerrold
J. Katz and Jerry A. Fodor, 1963. The structure of semantic
theory. Language 39, pp. 170-210.
 Alpha Luk, 1995. Statistical sense disambiguation with
relatively small corpora using dictionary definitions.
Proceedings of the 33rd ACL.
 Ray Mooney, 1996. Comparative experiments on disambiguating word
senses: An illustration of the role of bias in machine learning. Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing, pp. 82-91.
 Erwin Reifler, 1955. The mechanical determination of meaning.
In William N. Locke and A. Donald Booth, eds., Machine
Translation of Languages. John Wiley and Sons. Also included in
the upcoming Readings in Machine Translation, MIT Press.
 Philip
Resnik, 1995. Disambiguating noun
groupings with respect to WordNet senses. Proc. of the 3rd
Workshop on Very Large Corpora.
 Hinrich
Schütze, 1992. Dimensions
of meaning. Proceedings of Supercomputing, pp. 787-796. (postscript)
 Yorick Wilks and
Mark Stevenson, 1998. The
Grammar of Sense: Using part-of-speech tags as a first step in
semantic disambiguation. Journal of Natural Language
Engineering 4(2), pp. 135-144. (See also cmp-lg/9607028)
 David Yarowsky,
1992. Word-Sense
Disambiguation Using Statistical Models of Roget's Categories Trained
on Large Corpora. Proc. of COLING, pp. 454-460.
 See also ch. 17.1-17.2 of Jurafsky and Martin.
 3/6: Word sense
disambiguation methods
 3/11: Supervised and
bootstrapped WSD
methods
 3/13: Mostly-unsupervised
Japanese segmentation
 3/25: Representing semantics
 3/27: Quantifier scope ambiguity
 4/1: Introduction to information theory
 Handouts: lecture handout, sample literature
survey (hardcopy only)
 Thomas
M. Cover and Joy
A. Thomas, 1991. Elements of
Information Theory. Wiley.
 Ming Li and Paul Vitanyi, An Introduction to
Kolmogorov Complexity and Its Applications, 1997 (2nd ed).
 Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and
623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The
Mathematical Theory of Communication, U. Illinois Press,
1949.
 See also ch. 7 and 8 of the Jelinek text, or ch. 2.2 of the Manning/Schütze
text. Jelinek recommends
Abramson's Information Theory and Coding, 1963.
 4/3: Class-based language
modeling with hard clusters
 4/3: Class-based language
modeling with soft clusters
 Handout: Lecture handout
 Slava M. Katz, 1987. Estimation of Probabilities from Sparse
Data for the Language Model Component of a Speech Recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-35(3), pp. 400-401.
 Lillian Lee and Fernando Pereira, 1999. Distributional
similarity models: Clustering vs. nearest neighbors.
Proceedings of the 37th ACL, pp. 33-40.
 Fernando Pereira,
Naftali Tishby, and Lillian Lee, 1993. Distributional
Clustering of English Words. Proceedings of the 31st ACL,
pp. 183-190.
 See also Chapter 4 of Similarity-Based
Approaches to Natural Language Processing (has proofs)
 4/10: Introduction to
machine translation
 4/15: Statistical
machine translation
 Handout: lecture handout
 Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent
J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer,
Harry Printz, and Lubos Ures, 1994. The
Candide System for Machine Translation. Proceedings of the
1994 ARPA Workshop on Human Language Technology
 Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra,
and Robert L. Mercer, 1993.
The
Mathematics of Statistical Machine Translation. Computational
Linguistics 19(2), pp. 263-311.
 Kevin
Knight, 1999. A Statistical
MT Tutorial Workbook. (html)
 Voltaire, 1759. Candide
 Warren Weaver. Translation. Memorandum. See MT
News International discussion, July 1999.
 4/17: The EM algorithm
 Michael
Collins, 1997. The EM
Algorithm. Written Preliminary Exam (WPE) paper.
 Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, 1977. Maximum
Likelihood From Incomplete Data via the EM Algorithm. Journal of
the Royal Statistical Society Series B, 39(1), pp. 1-38.
 Geoffrey J. McLachlan and Thriyambakam Krishnan, 1997. The
EM Algorithm and Extensions. Wiley.
 Ted Pedersen's
introductory talk
and paper
 See also chapter 9 of the Jelinek text, or chapter 11 of the
Durbin, Eddy et al text, or the "When does EM work?" panel at EMNLP 2001
 4/22: Statistical Summarization
 4/2: Linguistics and
statistics: the case of POS tagging
 Handouts: lecture handout

Eric Brill and Grace
Ngai, 1999. Man
[and Woman] vs. Machine: A Case Study in Base Noun Phrase Learning
 Jean-Pierre Chanod and Pasi Tapanainen, 1995. Tagging French - comparing a statistical and a constraint-based method. Proceedings of the EACL.
 Kenneth W. Church,
1992. Current practice in part of speech tagging and suggestions for
the future. In Simmons, ed., Sbornik praci: In Honor of Henri Kucera.
 Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun,
1992. A Practical Part-of-Speech Tagger. Proceedings of the
Third Conference on Applied Natural Language Processing,
pp. 133-140.
 Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz, 1993. Building
a large annotated corpus of English: the Penn Treebank. Computational
Linguistics 19(2), pp. 313-330.
 Christer Samuelsson and Atro Voutilainen, 1997. Comparing
a linguistic and a stochastic tagger. Proceedings of the 35th ACL/8th
EACL. Also cmp-lg/97060
 Pasi Tapanainen and Atro Voutilainen, 1994. Tagging
accurately - Don't guess if you know. Proceedings of the Fourth
ACL Conference on Applied Natural Language Processing.
Also cmp-lg/9408009
