Representing and Accessing Digital Information

CS/INFO 630 - FALL 2004
Cornell University
Department of Computer Science

 
Time and Place
First lecture: August 30th, 2004
Last lecture: December 1st, 2004
  • Monday, 2:55pm - 4:10pm in Rhodes 484
  • Wednesday, 2:55pm - 4:10pm in Rhodes 484

Exam: Wednesday, October 20th (in class).

Instructor
Thorsten Joachims, tj@cs.cornell.edu, 4153 Upson Hall. Office hours: Wednesdays 4:15pm - 5:00pm
Syllabus
This course discusses automated techniques that enhance our ability to handle unstructured and semi-structured information, with an emphasis on textual information. This includes the analysis of content (e.g. text), meta-data (e.g. reference structure and authorship), as well as usage data (e.g. clickthrough and purchases). The course spans the areas of information retrieval, natural language processing, and machine learning, with links to work in databases and data mining. In particular, the course will cover the following topics:
  • Text Retrieval: introduction to basic information retrieval, boolean retrieval, vector-space model, term weighting, inverted index, query processing, evaluation
  • Text Classification: machine learning, bag-of-words representation, support vector machines, naive Bayes, Rocchio 
  • Text Clustering: k-means clustering, hierarchical agglomerative clustering
  • Latent Semantic Analysis: SVD, probabilistic LSA
  • Novelty Detection: topic detection and tracking
  • Semi-Structured Data and Semantic Web: XML databases and information retrieval, metadata, RDF 
  • Information Extraction: manual and learned extraction patterns, hidden Markov models, part-of-speech tagging, named entity detection
  • Bibliographic Analysis: co-citation, PageRank
  • Usage Data: clickthrough patterns, navigation paths, personalized adaptive retrieval
  • Recommender Systems: product recommendations, collaborative filtering

Methods and theory will be illustrated with practical examples.

Slides and Lecture Notes
  • 08/30: Introduction (slides pdf)
  • 09/01: Information retrieval basics, evaluation, statistical properties of text (slides pdf)
  • 09/06: Information retrieval data structures (slides pdf)
  • 09/08: Indexing and preprocessing (slides pdf)
  • 09/13: Retrieval models (slides pdf)
  • 09/20: Hypertext retrieval (slides pdf)
  • 09/22: Text classification and Naive Bayes (slides pdf)
  • 09/27: Rocchio and K-NN (slides pdf)
  • 09/29: Support Vector Machines (slides pdf)
  • 10/13: Text clustering (slides pdf)
  • 10/25: Latent Semantic Indexing (slides pdf)
  • 10/27: Transduction (slides pdf)
  • 11/01: Supervised clustering (slides pdf)
  • 11/03: Implicit feedback in retrieval (slides pdf)
  • 11/08: Part-of-speech tagging (slides pdf)
  • 11/10: Named-entity recognition (slides pdf)
  • 11/15: Information extraction (slides pdf)
  • 11/17: Recommender systems (slides pdf)
Readings
  • 09/01: MIR 1.1-1.4, 2.1-2.3, 3.1-3.3, 6.3.3. SNLP 1.4.2-1.4.5.
  • 09/06: MIR 8.1, 8.2, 8.4. GIGA 4.6.
  • 09/08: MIR 6.1-6.2, 7.1-7.3 (extra info: Appendix Porter Stemmer).
  • 09/13: MIR 2.4, 2.5.1-2.5.3.
  • 09/20: MWEB 5
  • 09/22: TCAT 2
  • 09/27: TCAT 6
  • 09/29: TCAT 3
  • 10/04: TCAT 4
  • 10/13: SNLP 14.0-14.1.2, 14.2-14.2.1
  • 10/25: SNLP 15.4
  • 10/27: TCAT 7
  • 11/08: SNLP 3.1-3.2.0 (background), SNLP 10.0-10.1, 10.4, HMM tutorial.
  • 11/15: NLPOA Chapter 3
  • 11/17: MWEB Chapter 8
Homework Assignments
Reference Material
  • Christopher Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
  • Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval", Addison-Wesley, 1999.
  • Ian H. Witten, Alistair Moffat, and Timothy C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd edition, Morgan Kaufmann, 1999.
  • Karen Sparck Jones and Peter Willett (editors), "Readings in Information Retrieval", Morgan Kaufmann, 1997.
  • Thorsten Joachims, "Learning to Classify Text using Support Vector Machines", Kluwer, 2002.
  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
  • Thomas Connolly and Carolyn Begg, "Database Systems", 3rd edition, Addison Wesley, 2002.
  • Peter Jackson and Isabelle Moulinier, "Natural Language Processing for Online Applications", Benjamins, 2002.
  • Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, "Modelling the Internet and the Web", Wiley, 2003.
Prerequisites
Any of the following:
  • CS472
  • CS478 / CS578 / CS678
  • CS674
  • equivalent of any of the above
  • permission from the instructor
Grading
This is a 4-credit course. Grades will be determined based on a written mid-term exam, a final research project, homework assignments, and student presentations of selected papers.
  • 25%: Mid-Term Exam
  • 25%: Final Project
  • 30%: Homework (~3 assignments; some programming, some non-programming)
  • 10%: Paper Critiques
  • 10%: Class Participation

Roughly: A=90-100; B=80-90; C=70-80; D=60-70; F=below 60
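
As an illustration of the arithmetic behind the weights above, here is a minimal Python sketch. The component names, the 0-100 scale for each component, the example scores, and the rule that a score exactly on a boundary receives the higher letter are all assumptions made for this example; the cutoffs are only rough, so this is not an official grade calculator.

```python
# Weights taken from the grading breakdown (they sum to 100%).
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "midterm": 0.25,
    "project": 0.25,
    "homework": 0.30,
    "critiques": 0.10,
    "participation": 0.10,
}


def weighted_score(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(total):
    """Map a total to a letter using the rough cutoffs.

    Assumption: a total exactly on a boundary (e.g. 80) gets the
    higher letter, since the listed ranges overlap at their ends.
    """
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if total >= cutoff:
            return letter
    return "F"


# Made-up scores for a hypothetical student.
example = {
    "midterm": 84,
    "project": 92,
    "homework": 88,
    "critiques": 80,
    "participation": 100,
}
total = weighted_score(example)
print(round(total, 1), letter_grade(total))
```

For the made-up scores above, the weighted total is 21 + 23 + 26.4 + 8 + 10 = 88.4, which falls in the rough B range.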

Academic Integrity
This course follows the Cornell University Code of Academic Integrity, and each student is expected to abide by it. Any work submitted by a student in this course for academic credit must be the student's own work. Violations of the rules (e.g. cheating, copying, non-approved collaborations) will not be tolerated.