CS 775: Seminar in Natural Language Understanding, Spring 2001
"Statistical Natural Language Processing: Models and Methods"
Wednesdays, 1:25-2:15, Hollister 362
Natural language processing (NLP) has been considered one of the "holy
grails" for artificial intelligence ever since Turing proposed his
famed "imitation game" (the Turing Test).
Statistical approaches have revolutionized the way NLP is done.
Furthermore, some of these approaches can be employed in other
applications, such as computational biology and data mining.
This course will explore important classes of probabilistic models of
language and survey some of the common general techniques.
Syllabus
 I. Models
 Zipf's law: introduction to the statistics of language
 Hidden Markov Models and related formalisms: the Viterbi and
Baum-Welch algorithms, integration with semantic hierarchies,
language modeling, smoothing, and Maximum Entropy Markov Models
 The noisy channel model (basics of information theory): entropy,
the Kullback-Leibler divergence, and mutual information
 Parametric clustering
 II. Methods
 The Expectation-Maximization (EM) algorithm
 The Maximum Entropy (ME) principle
 The singular value decomposition (SVD) and latent semantic
indexing (LSI)
Reference Texts
 Eugene Charniak, Statistical Language Learning, 1993.
 Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 1991.
 Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.
 Frederick Jelinek, Statistical Methods for Speech Recognition, 1997.
 Christopher D. Manning and Hinrich Schuetze, Foundations of Statistical Natural Language Processing, 1999.
Notes and References from Lecture
Some references may be available only internally to Cornell.
 Jan 24: Models and (vs.) methods. What is NLP - AI-completeness
 - statistical NLP - generative models - benefits and tradeoffs.
References:
 Jan 31: Zipf's law and the model moral. Statistics of words -
Zipf-like behavior - why log-log - least effort - Miller's monkeys.
References:
 Timothy C. Bell, John G. Cleary, and Ian H. Witten, 1990. Text Compression, Prentice-Hall.
 J.-B. Estoup, 1916. Gammes Sténographiques. Institut Sténographique de France.
 Wentian Li, 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory 38(6), pp. 1842-1845.
 Zipf's Law (webpage by Wentian Li): many Zipf's Law references from many fields.
 Benoit Mandelbrot, 1957. Théorie mathématique de la loi d'Estoup-Zipf. Institut de Statistique de l'Université de Paris.
 The Manning/Schuetze text, section 1.4.3
 George A. Miller, 1957. Some effects of intermittent silence. American Journal of Psychology 70, pp. 311-314.
 George Kingsley Zipf, 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.
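To make Miller's argument concrete, here is a toy simulation (not from the course materials; all numbers invented): monkeys typing uniformly at random over a few letters plus a space bar already produce a word rank-frequency curve that falls off in a Zipf-like, power-law fashion.

```python
# Miller's monkeys, sketched: random keystrokes over "abcde " yield
# "words" whose rank-frequency plot is roughly linear on log-log axes.
import random
from collections import Counter

random.seed(0)
keys = "abcde "                       # five letters plus the space bar
text = "".join(random.choice(keys) for _ in range(200_000))
counts = Counter(text.split())        # "words" between spaces

# Frequencies sorted by rank; Zipf-like behavior means log(freq)
# falls roughly linearly in log(rank).
freqs = sorted(counts.values(), reverse=True)
for rank in (1, 10, 100):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

The shape is the point, not the exact counts: shorter words are exponentially more likely, and there are exponentially many longer words, which together generate the power law without any "least effort" on the monkey's part.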
 Feb 7: Hidden Markov Models (I). Markov chains - decoupling the
output from the mechanism - HMMs -
calculating the probability of an observed sequence the hard way - the
forward-backward algorithm to the rescue.
References:
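A toy illustration of the Feb 7 contrast (all parameters invented): the "hard way" sums over every one of the |S|^T state sequences, while the forward algorithm computes the same probability by dynamic programming in O(|S|^2 T).

```python
# Comparing brute-force enumeration with the forward algorithm for
# P(observations) under a tiny two-state HMM.
from itertools import product

states = [0, 1]
pi = [0.6, 0.4]                       # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]          # transition probabilities
B = [[0.5, 0.5], [0.1, 0.9]]          # emission probabilities over {0, 1}
obs = [0, 1, 1]

def brute_force(obs):
    # Sum over all state sequences: exponential in len(obs).
    total = 0.0
    for seq in product(states, repeat=len(obs)):
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

def forward(obs):
    # alpha[s] = P(obs[:t+1], state at time t is s), updated in place.
    alpha = [pi[s] * B[s][obs[0]] for s in states]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[r] * A[r][s] for r in states) * B[s][obs[t]]
                 for s in states]
    return sum(alpha)

print(abs(brute_force(obs) - forward(obs)) < 1e-12)  # the two agree
```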
 Feb 14: Hidden Markov Models (II). Finding likely state sequences -
finding likely paths with the Viterbi algorithm - revealing the hidden
states - Baum-Welch when
nobody in the lab knows the language.
Further references:
 Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun,
1992. A Practical Part-of-Speech Tagger. Proceedings of the
Third Conference on Applied Natural Language Processing, pp. 133-140.
 Stephen J. DeRose, 1988. Grammatical category disambiguation by
statistical optimization. Computational Linguistics 14(2),
pp. 31-39.
 David Elworthy, 1994. Does Baum-Welch Re-estimation help taggers?
Proceedings of the 4th Conference on Applied Natural Language
Processing.
 Julian Kupiec, 1992. Robust Part-of-Speech Tagging Using a Hidden
Markov Model. Computer Speech and Language 6, pp. 225-242.
 Bernard Merialdo, 1994. Tagging English Text with a
Probabilistic Model. Computational Linguistics 20(2), pp. 155-171.
 Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance
A. Ramshaw, and Jeff Palmucci, 1993. Coping with ambiguity and unknown
words through probabilistic models. Computational
Linguistics 19(2), pp. 359-382.
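As a pocket-sized companion to the Feb 14 material (toy parameters, not any of the taggers cited above), here is the Viterbi algorithm on a two-tag "part-of-speech" HMM: replace the forward algorithm's sum with a max and keep back-pointers.

```python
# Viterbi decoding for a toy tagger: hypothetical two-tag model.
states = ["N", "V"]
pi = {"N": 0.7, "V": 0.3}
A = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
B = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}

def viterbi(obs):
    # delta[s] = probability of the best path ending in state s;
    # each ptr table remembers the predecessor achieving it.
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    back = []
    for word in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * A[r][s])
            ptr[s] = best
            delta[s] = prev[best] * A[best][s] * B[s][word]
        back.append(ptr)
    # Recover the hidden states by following back-pointers.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["fish", "swim"]))  # -> ['N', 'V']
```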
 Feb 21: Abney and Light: a WordNet-based HMM. Selectional
preferences - word sense disambiguation via HMMs - WordNet-induced
HMM topology - fun with hand-simulation - alterations to Baum-Welch -
problems with Baum-Welch, WordNet, or the approach?
References:
 Steven Abney and Marc Light, 1999. Hiding a
semantic hierarchy in a Markov model. Proceedings of the
ACL'99 Workshop on Unsupervised Learning in Natural Language
Processing, pp. 1-8. (Longer version)
 Massimiliano Ciaramita and Mark Johnson, 2000. Explaining away ambiguity:
Learning verb selectional preference with Bayesian networks.
Proceedings of the 18th International Conference on Computational
Linguistics, Volume 1.
 Stephen Clark and David Weir, An iterative approach to
estimating frequencies over a semantic hierarchy. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 258-265.
 Hang Li and Naoki Abe, 1998. Generalizing
Case Frames Using a Thesaurus and the MDL Principle. Computational Linguistics 24(2).
 Philip Resnik, 1997. Selectional
preference and sense disambiguation. ACL SIGLEX Workshop on
Tagging Text with Lexical Semantics: Why, What, and How?
 WordNet. See
also WordNet: An Electronic Lexical Database, Christiane
Fellbaum, ed., 1998.
 Feb 28: Smoothing, the Good
old-fashioned Turing way. The zero-frequency problem - the
return of Zipf's law - the Good-Turing estimate - counting bunnies in
Bletchley Park and cracking the Enigma - a Bayesian derivation -
singleton rates and novel-item rates - Jelinek and Mercer's
empirical derivation - leave-one-out - leave-high-counts-alone.
References:

Stanley F. Chen and Joshua Goodman, 1996. An
empirical study of smoothing techniques for language modeling.
Proceedings of the 34th Meeting of the Association for Computational
Linguistics, pp. 310-318.
TR version: TR-10-98,
Computer Science Group, Harvard University, 1998.
 Kenneth W. Church
and William A. Gale, 1991. A comparison of the enhanced Good-Turing
and deleted estimation methods for estimating probabilities of English
bigrams. Computer Speech and Language 5, pp. 19-54.
 Irving J. Good, 1953. The population frequencies of species and
the estimation of population parameters. Biometrika 40(3-4),
pp. 237-264.
 Jelinek text: see last chapter.
 Slava M. Katz, 1987. Estimation of Probabilities from Sparse
Data for the Language Model Component of a Speech Recognizer.
IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-35(3), pp. 400-401.
 Reinhard Kneser and Hermann Ney, 1995. Improved backing-off for m-gram
language modeling. Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, pp. 181-184.
 Arthur Nadas, 1985. On Turing's formula for word probabilities.
IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-33(6), pp. 1414-1416.
 Geoffrey Sampson's Good-Turing Frequency Estimation page
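The core Good-Turing quantities fit in a few lines. A minimal sketch on toy counts (hedged: real systems first smooth the N_r values, as Sampson's page discusses, since N_{r+1} is often zero): the adjusted count is r* = (r+1) N_{r+1} / N_r, and the singleton rate N_1 / N is the probability mass reserved for never-seen items.

```python
# Good-Turing on a toy sample: counts of counts drive everything.
from collections import Counter

sample = "a a a a b b b c c d".split()
counts = Counter(sample)             # a:4, b:3, c:2, d:1
N = sum(counts.values())             # total tokens
Nr = Counter(counts.values())        # N_r = number of types seen r times

def adjusted_count(r):
    # r* = (r+1) * N_{r+1} / N_r.  Undefined when N_{r+1} = 0;
    # practical estimators fit a smooth curve to the N_r first.
    return (r + 1) * Nr[r + 1] / Nr[r]

p_unseen = Nr[1] / N                 # singleton rate ~ novel-item rate
print(p_unseen)                      # -> 0.1: one singleton in ten tokens
print(adjusted_count(1))             # -> 2.0: singletons promoted to 2
```

Note how items seen once get their counts discounted toward the unseen mass, exactly the zero-frequency fix the lecture motivates via Zipf's law.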
 Mar 7: EM and
Baum-Welch. Using hidden structure variables - maximizing the expected log joint likelihood increases the likelihood - deriving
Baum-Welch as a special case.
References:
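For a concrete feel for the E- and M-steps before specializing to Baum-Welch, a classic toy problem (invented data, not from lecture): estimate the biases of two coins when we never observe which coin produced each run of flips.

```python
# EM for a two-coin mixture: hidden variable = which coin was flipped.
flips = [(8, 2), (3, 7), (9, 1), (2, 8), (7, 3)]  # (heads, tails) per run

def em_step(pA, pB):
    # E-step: posterior responsibility of coin A for each run,
    # accumulated as expected heads/tails counts per coin.
    sA = [0.0, 0.0]
    sB = [0.0, 0.0]
    for h, t in flips:
        likA = pA**h * (1 - pA)**t
        likB = pB**h * (1 - pB)**t
        wA = likA / (likA + likB)          # P(coin A | this run)
        sA[0] += wA * h; sA[1] += wA * t
        sB[0] += (1 - wA) * h; sB[1] += (1 - wA) * t
    # M-step: re-estimate each bias from its expected counts.
    return sA[0] / (sA[0] + sA[1]), sB[0] / (sB[0] + sB[1])

pA, pB = 0.6, 0.5                          # arbitrary initialization
for _ in range(20):
    pA, pB = em_step(pA, pB)               # likelihood never decreases
print(round(pA, 2), round(pB, 2))          # coin A ~ 0.8, coin B ~ 0.25
```

Baum-Welch is this same recipe with the hidden variable being the HMM state sequence and the expected counts supplied by the forward-backward algorithm.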
 Mar 14: Statistical
Generation. Presented by Regina Barzilay. Note the use of
generative models for ranking proposed utterances.
References:
 Mar 28: Introduction to
Information Theory. Relating information to communication,
randomness, and surprise - semantics vs. engineering? - definitions and axioms for entropy - cross
entropy when an unknown source is involved - relative entropy (Kullback-Leibler divergence) to measure distributional similarity - mutual
information as a special case - can the mutual information be
negative? It depends on whom you ask.
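The Mar 28 definitions on a toy joint distribution (numbers invented). Mutual information is the KL divergence between the joint and the product of the marginals, hence never negative; the "it depends on whom you ask" caveat concerns pointwise mutual information, whose individual terms can dip below zero.

```python
# Entropy, KL divergence, and mutual information on a 2x2 toy joint.
from math import log2

joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
px = {"a": 0.5, "b": 0.5}            # marginal of X
py = {0: 0.5, 1: 0.5}                # marginal of Y

def entropy(p):
    # H(p) = -sum p log2 p, skipping zero-probability outcomes.
    return -sum(v * log2(v) for v in p.values() if v > 0)

def kl(p, q):
    # D(p || q) = sum p log2 (p/q); nonnegative by Gibbs' inequality.
    return sum(p[k] * log2(p[k] / q[k]) for k in p if p[k] > 0)

product = {(x, y): px[x] * py[y] for x in px for y in py}
mi = kl(joint, product)              # I(X;Y) = D(joint || px * py)
print(round(entropy(px), 3), round(mi, 3))
```

Note that the pointwise term for ("a", 1), 0.1 * log2(0.1 / 0.25), is negative even though the total mutual information is positive.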
 April 11: Maximum Entropy
Models. The missing key, the principle of insufficient
reason, and the probability that the sun will
rise tomorrow - constraints on the empirical and estimated
expectations - deriving the maximum-entropy distribution (the return
of Lagrange) - Candide "appears that the Bank of Boston has almost
completed its review of Shawmut" (or, improving statistical machine
translation).
 Adam Berger's
MaxEnt reading list
 Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996. A
maximum entropy approach to natural language processing.
Computational Linguistics 22(1).
 P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek,
J. Lafferty, R. Mercer, and P. Roossin, 1990. A statistical approach
to machine translation. Computational Linguistics 16, pp. 79-85.
 P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer, 1993. The
mathematics of statistical machine translation: parameter
estimation. Computational Linguistics 19(2), pp. 263-311.
 Albert Einstein. "Everything should be made as simple as
possible, but not simpler".
 John Maynard Keynes, 1921. A Treatise on Probability
 Pierre Simon Laplace, 1775. Essai philosophique sur les probabilités.
 See also chapters 1314 of the Jelinek text.
 April 18: Maximum Entropy
Models: iterative scaling (references as above). Deriving
the update equations - sanity checks - comparisons to EM.
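A toy sketch of generalized iterative scaling, heavily hedged: real GIS requires the features to sum to a constant C on every outcome, which holds here with C = 1 because each outcome fires exactly one binary indicator feature. With such features, the constrained max-ent solution must simply reproduce the empirical distribution, which gives a built-in sanity check.

```python
# Generalized iterative scaling on three outcomes with one indicator
# feature apiece (C = 1).  The model expectations are pulled toward
# the empirical expectations at each update.
from math import exp, log

outcomes = ["x", "y", "z"]
target = {"x": 0.5, "y": 0.3, "z": 0.2}   # empirical feature expectations
lam = {o: 0.0 for o in outcomes}          # one lambda per feature

for _ in range(50):
    Z = sum(exp(lam[o]) for o in outcomes)
    model = {o: exp(lam[o]) / Z for o in outcomes}
    # GIS update: lambda_i += (1/C) * log(empirical_i / model_i).
    for o in outcomes:
        lam[o] += log(target[o] / model[o])

Z = sum(exp(lam[o]) for o in outcomes)
model = {o: exp(lam[o]) / Z for o in outcomes}
print({o: round(model[o], 3) for o in outcomes})
```

With overlapping features the fixed point is no longer the empirical distribution itself, and the iteration earns its keep; the EM-style argument from lecture shows each update cannot decrease the constrained likelihood.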
 April 25: Latent Semantic
Indexing and Iterative Residual Rescaling. Subspace
projection, or, why hopefully no one will make a model of car named
the Chomsky - a geometric view of the singular value decomposition -
an analysis not based on a generative model (for once) - compensating
for non-uniformity via the IRR algorithm
 Rie Kubota Ando, 2000. Latent
semantic space: Iterative scaling improves precision of inter-document
similarity measurement. Proceedings of SIGIR 2000.
 Rie Kubota Ando and Lillian Lee, 2001. Semantic space: On the
effectiveness of iterative residual rescaling. Proceedings of SIGIR
2001, pp. 154-162. (alternate formats)
 Scott Deerwester, Susan Dumais,
George Furnas, Thomas Landauer,
and Richard Harshman, 1990. Indexing by Latent Semantic Analysis.
Journal of the American Society for Information Science 41(6),
pp. 391-407.
 Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki and
Santosh Vempala, 2000. Latent Semantic Indexing: A
Probabilistic Analysis. Journal of Computer and System
Sciences 61(2), pp. 217235.
 Telcordia (formerly Bellcore) LSI page
 Colorado LSA homepage
 Tennessee LSI web page
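The geometric core of LSI fits in a few lines. A hedged sketch (toy term-document matrix, not the papers' data): truncate the SVD to rank k and compare documents in the reduced space, where the leading singular directions fold co-occurring terms together.

```python
# LSI's subspace projection: rank-k truncated SVD of a term-document
# matrix, with documents compared by cosine in the latent space.
import numpy as np

# Rows = terms, columns = documents (toy counts).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
                                              # in Frobenius norm

# Latent document representations: columns of diag(s_k) @ Vt_k.
docs = np.diag(s[:k]) @ Vt[:k, :]

def cos(i, j):
    # Cosine similarity of documents i and j in the latent space.
    return float(docs[:, i] @ docs[:, j] /
                 (np.linalg.norm(docs[:, i]) * np.linalg.norm(docs[:, j])))

print(cos(0, 1), cos(0, 3))
```

IRR's departure, per the Ando and Lee papers above, is to rescale residuals between projections so that dominant topics do not drown out minority ones, compensating for the non-uniformity the April 25 lecture discusses.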
Administrative Information
2 credits, S/U only. Grade based on attendance/participation.
Prereqs: elementary probability, mathematical maturity. COMS674 not
required.
URL: http://www.cs.cornell.edu/courses/cs775/2001sp/
CS775, Spring '01
Lillian Lee
Related Links: NLP at CUCS - COMS674 - Other COMS course web pages - CS775, Spring 2000