CS/INFO 630: Representing and Accessing Digital Information

Representing and Accessing Digital Information
CS/INFO 630 - FALL 2004
Cornell University, Department of Computer Science

Time and Place
	First lecture: August 30th, 2004
	Last lecture: December 1st, 2004
	Monday, 2:55pm - 4:10pm in Rhodes 484
	Wednesday, 2:55pm - 4:10pm in Rhodes 484
	Exam: Wednesday, October 20th (in class)
Instructor
	Thorsten Joachims, tj@cs.cornell.edu, 4153 Upson Hall. Office hours: Wednesdays 4:15pm - 5:00pm
Syllabus
	This course discusses automated techniques that enhance our ability to handle unstructured and semi-structured information, with an emphasis on textual information. This includes the analysis of content (e.g. text), meta-data (e.g. reference structure and authorship), as well as usage data (e.g. clickthrough and purchases). The course spans the areas of information retrieval, natural language processing, and machine learning, with links to work in databases and data mining. In particular, the course will cover the following topics:
	Text Retrieval: introduction to basic information retrieval, boolean retrieval, vector-space model, term weighting, inverted index, query processing, evaluation
	Text Classification: machine learning, bag-of-words representation, support vector machines, naive Bayes, Rocchio
	Text Clustering: k-means clustering, hierarchical agglomerative clustering
	Latent Semantic Analysis: SVD, probabilistic LSA
	Novelty Detection: topic detection and tracking
	Semi-Structured Data and Semantic Web: XML databases and information retrieval, metadata, RDF
	Information Extraction: manual and learned extraction patterns, hidden Markov models, part-of-speech tagging, named entity detection
	Bibliographic Analysis: co-citation, PageRank
	Usage Data: clickthrough patterns, navigation paths, personalized adaptive retrieval
	Recommender Systems: product recommendations, collaborative filtering
	Methods and theory will be illustrated with practical examples.
Slides and Lecture Notes
	08/30: Introduction (slides pdf)
	09/01: Information retrieval basics, evaluation, statistical properties of text (slides pdf)
	09/06: Information retrieval datastructures (slides pdf)
	09/08: Indexing and preprocessing (slides pdf)
	09/13: Retrieval models (slides pdf)
	09/20: Hypertext retrieval (slides pdf)
	09/22: Text classification and Naive Bayes (slides pdf)
	09/27: Rocchio and K-NN (slides pdf)
	09/29: Support Vector Machines (slides pdf)
	10/13: Text clustering (slides pdf)
	10/25: Latent Semantic Indexing (slides pdf)
	10/27: Transduction (slides pdf)
	11/01: Supervised clustering (slides pdf)
	11/03: Implicit feedback in retrieval (slides pdf)
	11/08: Part-of-speech tagging (slides pdf)
	11/10: Named-entity recognition (slides pdf)
	11/15: Information extraction (slides pdf)
	11/17: Recommender systems (slides pdf)
Readings
	09/01: MIR 1.1-1.4, 2.1-2.3, 3.1-3.3, 6.3.3. SNLP 1.4.2-1.4.5
	09/06: MIR 8.1, 8.2, 8.4. GIGA 4.6
	09/08: MIR 6.1-6.2, 7.1-7.3 (extra info: Appendix Porter Stemmer)
	09/13: MIR 2.4, 2.5.1-2.5.3
	09/20: MWEB 5
	09/22: TCAT 2
	09/27: TCAT 6
	09/29: TCAT 3
	10/04: TCAT 4
	10/13: SNLP 14.0-14.1.2, 14.2-14.2.1
	10/25: SNLP 15.4
	10/27: TCAT 7
	11/08: SNLP 3.1-3.2.0 (background), SNLP 10.0-10.1, 10.4, HMM tutorial
	11/15: NLPOA Chapter 3
	11/17: MWEB Chapter 8
Homework Assignments
	Critique 1: Justin Zobel. How reliable are large-scale information retrieval experiments. SIGIR 1998. Due: 09/06 before class.
		Critique guidelines
	Homework 1: Building a basic retrieval system. Due: 09/15 before class.
		Assignment, Data, Trie Tutorial, Perl Tutorial
		Solutions: invindex.pl, zipf.ps
	Critique 2: Jon Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Due: 09/27 before class.
	Homework 2: Text Classification. Due: 10/08 at noon.
		Assignment, arxiv_doc.train.gz, arxiv_classes.train.gz, arxiv_doc.test.gz, arxiv_classes.test.gz, SVM-light, Lemur
		Solution: vectorizer.pl
	Final Project Guidelines
	Critique 3: Doug Beeferman, Adam Berger. Agglomerative clustering of a search engine query log. ACM SIGKDD international conference on Knowledge discovery and data mining, 2000. Due: 10/18 before class.
	Critique 4: Avrim Blum, Tom Mitchell. Combining Labeled and Unlabeled Data with Co-Training. Proceedings of the Workshop on Computational Learning Theory (COLT), 1998. Due: 11/08 before class.
	Homework 3: Part-of-speech tagging. Due: 11/22 before class.
		Assignment, Data via email
Reference Material
	Christopher Manning and Hinrich Schütze, "Foundations of Statistical NLP", MIT Press, 1999.
	Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval", Addison-Wesley, 1999.
	Ian H. Witten, Alistair Moffat, and Timothy C. Bell, "Managing Gigabytes: Compressing and Indexing Documents and Images", 2nd edition, Morgan Kaufmann, 1999.
	Karen Sparck Jones and Peter Willett (editors), "Readings in Information Retrieval", Morgan Kaufmann, 1997.
	Thorsten Joachims, "Learning to Classify Text using Support Vector Machines", Kluwer, 2002.
	Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
	Thomas Connolly and Carolyn Begg, "Database Systems", 3rd edition, Addison Wesley, 2002.
	Peter Jackson and Isabelle Moulinier, "Natural Language Processing for Online Applications", Benjamins, 2002.
	Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, "Modelling the Internet and the Web", Wiley, 2003.
Prerequisites
	Any of the following:
	CS472
	CS478 / CS578 / CS678
	CS674
	equivalent of any of the above
	permission from the instructors
Grading
	This is a 4-credit course. Grades will be determined based on a written mid-term exam, a final research project, homework assignments, and student presentations of selected papers.
	25%: Mid-Term Exam
	25%: Final Project
	30%: Homework (~3 homework assignments, some programming, some non-programming)
	10%: Paper Critiques
	10%: Class Participation
	Roughly: A=90-100; B=80-90; C=70-80; D=60-70; F=below 60
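	As an illustration only, the weighting above can be sketched in a few lines of Python. The component names, the 0-100 scale for each component, and the handling of boundary scores (a score of exactly 90 counted as an A, and so on) are assumptions made for this sketch. The cutoffs are stated only as rough guidelines, so this is not the official grading procedure.

```python
# Weights copied from the grading breakdown (percent of the final grade).
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "midterm": 25,
    "project": 25,
    "homework": 30,
    "critiques": 10,
    "participation": 10,
}


def weighted_score(scores):
    """Combine per-component scores (each assumed to be 0-100) into one 0-100 score."""
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a 0-100 score to a letter using the rough cutoffs.

    Assumption: a score on a boundary falls into the higher band.
    """
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical student, for illustration only.
example = {
    "midterm": 80,
    "project": 90,
    "homework": 85,
    "critiques": 100,
    "participation": 100,
}
total = weighted_score(example)
print(total, letter_grade(total))
```

	With these made-up scores the weighted total is (25*80 + 25*90 + 30*85 + 10*100 + 10*100) / 100 = 88, which falls in the rough B range.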
Academic Integrity
	This course follows the Cornell University Code of Academic Integrity, and each student is expected to abide by it. Any work submitted by a student in this course for academic credit must be the student's own work. Violations of the rules (e.g. cheating, copying, non-approved collaborations) will not be tolerated.