Representing and Accessing Digital Information
CS/INFO 630 - Fall 2004
Cornell University, Department of Computer Science |
|
Time and Place |
|
First lecture: August 30th, 2004
Last lecture: December 1st, 2004
- Monday, 2:55pm - 4:10pm in Rhodes 484
- Wednesday, 2:55pm - 4:10pm in Rhodes 484
Exam: Wednesday, October 20th (in class).
|
Instructor |
|
Thorsten Joachims, tj@cs.cornell.edu, 4153 Upson Hall.
Office hours: Wednesdays 4:15pm - 5:00pm |
Syllabus |
|
This course discusses automated techniques that enhance our
ability to handle unstructured and semi-structured information, with an
emphasis on textual information. This includes the analysis of content
(e.g. text), meta-data (e.g. reference structure and authorship), as well
as usage data (e.g. clickthrough and purchases). The course spans the
areas of information retrieval, natural language processing, and machine learning,
with links to work in databases and data mining. In particular, the course will cover the following topics:
- Text Retrieval: introduction to basic
information retrieval, boolean retrieval, vector-space model, term
weighting, inverted index, query processing, evaluation
- Text Classification: machine learning,
bag-of-words representation, support vector machines, naive Bayes,
Rocchio
- Text Clustering: k-means clustering,
hierarchical agglomerative clustering
- Latent Semantic Analysis: SVD, probabilistic
LSA
- Novelty Detection: topic detection and
tracking
- Semi-Structured Data and Semantic Web: XML
databases and information retrieval, metadata, RDF
- Information Extraction: manual and learned
extraction patterns, hidden Markov models, part-of-speech tagging, named entity
detection
- Bibliographic Analysis: co-citation, PageRank
- Usage Data: clickthrough patterns, navigation
paths, personalized adaptive retrieval
- Recommender Systems: product recommendations,
collaborative filtering
Methods and theory will be illustrated with practical examples. |
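As one small illustration of the Text Retrieval topics listed above (inverted index, boolean retrieval, query processing), here is a minimal Python sketch. It is not taken from the course materials: the function names, the whitespace tokenizer, and the three toy documents are all assumptions made for this example, and a real system would add stemming, stopword handling, and term weighting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it.

    `docs` is a dict of {doc_id: text}; tokenization is plain whitespace
    splitting, which is an assumption made to keep the sketch short.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # index.get avoids creating empty entries for unseen terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy collection, for illustration only.
docs = {
    1: "support vector machines for text classification",
    2: "inverted index for text retrieval",
    3: "clustering of text documents",
}
index = build_inverted_index(docs)
print(boolean_and(index, "text retrieval"))
```

A conjunctive query is answered by intersecting the posting sets of its terms, so a term that occurs in no document yields an empty result.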
Slides and Lecture Notes |
|
- 08/30: Introduction (slides pdf)
- 09/01: Information retrieval basics, evaluation,
statistical properties of text (slides pdf)
- 09/06: Information retrieval data structures (slides pdf)
- 09/08: Indexing and preprocessing (slides pdf)
- 09/13: Retrieval models (slides pdf)
- 09/20: Hypertext retrieval (slides pdf)
- 09/22: Text classification and Naive Bayes (slides pdf)
- 09/27: Rocchio and K-NN (slides pdf)
- 09/29: Support Vector Machines (slides pdf)
- 10/13: Text clustering (slides pdf)
- 10/25: Latent Semantic Indexing (slides pdf)
- 10/27: Transduction (slides pdf)
- 11/01: Supervised clustering (slides pdf)
- 11/03: Implicit feedback in retrieval (slides pdf)
- 11/08: Part-of-speech tagging (slides pdf)
- 11/10: Named-entity recognition (slides pdf)
- 11/15: Information extraction (slides pdf)
- 11/17: Recommender systems (slides pdf)
|
Readings |
|
- 09/01: MIR 1.1-1.4, 2.1-2.3, 3.1-3.3, 6.3.3. SNLP 1.4.2-1.4.5.
- 09/06: MIR 8.1, 8.2, 8.4. GIGA 4.6.
- 09/08: MIR 6.1-6.2, 7.1-7.3 (extra info: Appendix Porter Stemmer).
- 09/13: MIR 2.4, 2.5.1-2.5.3.
- 09/20: MWEB 5
- 09/22: TCAT 2
- 09/27: TCAT 6
- 09/29: TCAT 3
- 10/04: TCAT 4
- 10/13: SNLP 14.0-14.1.2, 14.2-14.2.1
- 10/25: SNLP 15.4
- 10/27: TCAT 7
- 11/08: SNLP 3.1-3.2.0 (background), SNLP 10.0-10.1, 10.4, HMM tutorial.
- 11/15: NLPOA Chapter 3
- 11/17: MWEB Chapter 8
|
Homework Assignments |
|
- Critique 1: Justin Zobel. How reliable are the results of
large-scale information retrieval experiments? SIGIR 1998.
Due: 09/06 before class.
Critique guidelines
- Homework 1: Building a basic retrieval system.
Due: 09/15 before class.
Assignment, Data, Trie Tutorial, Perl Tutorial
Solutions: invindex.pl, zipf.ps
- Critique 2: Jon Kleinberg. Authoritative
sources in a hyperlinked environment. Proc. 9th ACM-SIAM
Symposium on Discrete Algorithms, 1998.
Due: 09/27 before class.
- Homework 2: Text Classification.
Due: 10/08 at noon.
Assignment, arxiv_doc.train.gz, arxiv_classes.train.gz,
arxiv_doc.test.gz, arxiv_classes.test.gz, SVM-light, Lemur
Solution: vectorizer.pl
- Final Project Guidelines
- Critique 3: Doug Beeferman, Adam Berger. Agglomerative
clustering of a search engine query log. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2000.
Due: 10/18 before class.
- Critique 4: Avrim Blum, Tom Mitchell. Combining
Labeled and Unlabeled Data with Co-Training, Proceedings of the
Workshop on Computational Learning Theory (COLT), 1998.
Due: 11/08 before class.
- Homework 3: Part-of-speech tagging.
Due: 11/22 before class.
Assignment, Data via email
|
Reference Material |
|
- Christopher Manning and Hinrich Schütze,
"Foundations of Statistical Natural Language Processing", MIT Press, 1999.
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto,
"Modern Information Retrieval", Addison-Wesley, 1999.
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell,
"Managing Gigabytes: Compressing and Indexing Documents and
Images", 2nd edition, Morgan Kaufmann, 1999.
- Karen Sparck Jones and Peter Willett (editors),
"Readings in Information Retrieval", Morgan Kaufmann, 1997.
- Thorsten Joachims, "Learning to Classify Text
using Support Vector Machines", Kluwer, 2002.
- Tom Mitchell, "Machine Learning", McGraw-Hill, 1997.
- Thomas Connolly and Carolyn Begg, "Database Systems", 3rd
edition, Addison Wesley, 2002.
- Peter Jackson and Isabelle Moulinier, "Natural
Language Processing for Online Applications", Benjamins, 2002.
- Pierre Baldi, Paolo Frasconi, and Padhraic
Smyth, "Modelling the Internet and the Web", Wiley, 2003.
|
Prerequisites |
|
Any of the following:
- CS472
- CS478 / CS578 / CS678
- CS674
- equivalent of any of the above
- permission from the instructor
|
Grading |
|
This is a 4-credit course. Grades will be determined based on a written
mid-term exam, a final research project, homework assignments, and
student presentations of selected papers.
- 25%: Mid-Term Exam
- 25%: Final Project
- 30%: Homework (~3 assignments, some programming, some
non-programming)
- 10%: Paper Critiques
- 10%: Class Participation
Roughly: A=90-100; B=80-90; C=70-80; D=60-70; F=below 60 |
Academic Integrity |
|
Each student in this course is expected to abide by the Cornell
University Code of Academic Integrity. Any work submitted by a student
in this course for academic credit must be the student's own work.
Violations of the rules (e.g. cheating, copying, non-approved
collaborations) will not be tolerated. |