CS574 Language Technologies

Fall 2002

Lecture Notes and Assigned Readings
Assignments and Other Handouts

Time and Place
Mondays and Wednesdays, 2:55-4:10
Hollister 110
  • First lecture: Monday, September 2
  • No class: Monday, October 14 (fall break)
  • No class: Wednesday, November 27 (Thanksgiving break)
  • Last lecture: Wednesday, December 4
  • Final examination: Tuesday, December 17, 3:00 - 5:30 pm; Thurston 202.
Instructors
  • Claire Cardie, 5161 Upson Hall, office hours: Tuesday 3:00-4:00, Thursday 9:30-10:30
  • Thorsten Joachims, 4153 Upson Hall, office hours: Monday 4:15-5:00, Wednesday 4:15-5:00
Course Description
This course studies computational techniques for large-scale text-processing applications, including information retrieval, text classification, information extraction, document clustering, document ranking, summarization, topic detection and tracking, and question answering. The course focuses on statistical and machine learning approaches to these natural language processing tasks, as well as methods for their empirical evaluation.

Syllabus (tentative)
  • introduction to the course (1 lecture)
  • information retrieval basics: vector space model, inverted indexes, statistical properties of text, evaluation in information retrieval (4 lectures)
  • text classification: support vector machines, naive Bayes, k-nearest neighbors, feature selection, transduction and the use of unlabeled data for supervised learning (4 lectures)
  • text clustering and unsupervised learning: distance-based clustering, probabilistic clustering, (probabilistic) latent semantic analysis (3 lectures)
  • information extraction: traditional system architecture, part-of-speech tagging, learning partial parsers, learning extraction patterns, coreference resolution (3 lectures)
  • question answering: system architecture, named entity detection, learning question types (3 lectures)
  • extracting information from hypertext structure: PageRank, hubs and authorities (2 lectures)
  • learning rankings: ranking SVMs, rank boosting, learning from pairwise preferences (2 lectures)
  • document summarization: single- and multi-document summarization, summarization evaluation (1 lecture)
  • statistical parsing techniques for language technologies (1 lecture)
Lecture Notes, Slides, and Handouts
Electronic versions of handouts, homework assignments, and lecture slides will be posted as they become available. Hard copies will be provided in class.

Reference Material
We will provide reading material and hand it out in class. For further reading, we recommend parts of the following books:
  • Christopher Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval", Addison-Wesley, 1999.
  • James Allen, "Natural Language Understanding", 2nd edition.
  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd edition, Morgan Kaufmann, 1999.
  • Karen Sparck Jones and Peter Willett (editors), "Readings in Information Retrieval", Morgan Kaufmann, 1997.
  • Thorsten Joachims, "Learning to Classify Text using Support Vector Machines", Kluwer, 2002.
  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
  • Nello Cristianini and John Shawe-Taylor, "An Introduction to Support Vector Machines", Cambridge University Press, 2000.
Prerequisites
Any of the following:
  • CS472
  • CS478
  • CS578
  • or the equivalent of any of the above
Grading
  • 30%: 2 prelim exams
    • prelim 1: Monday, October 7 
    • prelim 2: Monday, November 20 
  • 25%: final exam
  • 25%: 4-6 homework assignments
  • 10%: critiques of selected readings and research papers
  • 10%: class participation (You'll be expected to participate in class discussion or otherwise demonstrate an interest in the material studied in the course.)

Roughly: A=90-100; B=80-90; C=70-80; D=60-70; F=below 60
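
As an illustration only, the weights and rough cutoffs above can be combined as in the following sketch. The names `WEIGHTS`, `CUTOFFS`, `course_score`, and `letter_grade` are hypothetical, the example scores are made up, and treating each lower cutoff as inclusive is an assumption; actual grades are determined by the instructors.

```python
# Weights taken from the grading breakdown above; they sum to 1.0.
WEIGHTS = {
    "prelims": 0.30,
    "final": 0.25,
    "homework": 0.25,
    "critiques": 0.10,
    "participation": 0.10,
}

# Rough cutoffs from the scale above.
# Assumption: a score exactly on a boundary gets the higher letter.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def course_score(component_scores):
    """Weighted average of per-component percentages (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a 0-100 course score to a rough letter grade."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Made-up component percentages for a hypothetical student.
example = {
    "prelims": 85,
    "final": 78,
    "homework": 92,
    "critiques": 90,
    "participation": 100,
}
total = course_score(example)
print(round(total, 2), letter_grade(total))
```

The sketch covers only the weighted average and the rough letter scale; it does not model the late-assignment penalty.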

Late assignment policy: Barring extenuating circumstances, all homework assignments and critiques must be turned in on the date specified, AT THE START OF CLASS. Assignments turned in up to 24 hours late will be penalized one full letter grade (e.g., an A becomes a B). Assignments more than 24 hours late will not be accepted.

Academic Integrity
You are responsible for knowing and following Cornell's academic integrity policy. Absolute integrity is expected of every Cornell student in all academic undertakings; he/she must in no way misrepresent his/her work fraudulently or unfairly advance his/her academic status, or be a party to another student's failure to maintain academic integrity. The maintenance of an atmosphere of academic honor and the fulfillment of the provisions of this Code are the responsibilities of the students and faculty of Cornell University. Therefore, all students and faculty members shall refrain from any action that would violate the basic principles of this Code. Violation of the academic integrity policy will not be tolerated, and will result in an F in the course.

See the University Code of Academic Integrity and the Department Policy on Academic Integrity.

Professor Cardie received NSF support under Award 0074896 for development of this course.  Any opinions, findings, and conclusions or recommendations expressed in these materials or on this web site are those of the instructors and do not necessarily reflect the views of the National Science Foundation.