CS 430
Information Discovery
Fall 2003

Readings

General Books

There is no textbook for this course. The following books cover much of the material.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
William B. Frakes and Ricardo Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979. http://www.dcs.gla.ac.uk/Keith/

Discussion Classes

Readings for discussion classes are to be studied in preparation for the classes on Wednesday evenings.

Discussion Class 1, September 3, 2003

In preparation for this class, explore three information retrieval systems and compare them:

Google -- a Web search engine (http://www.google.com/).
The Library of Congress catalog -- a very large bibliographic catalog (http://catalog.loc.gov/).
Medline -- an indexing and abstracting service for medicine and related fields (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi).

Consider the following two information discovery tasks:

What is the medical evidence that red wine is good or bad for your health? (Use Medline and Google.)
What in history led to the current turmoil in Palestine and the neighboring countries? (Use the Library of Congress catalog and Google.)

Study each search service in two ways. (a) From a technical viewpoint. Does the service search full text or surrogates? Are fielded searches offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted and, if so, in what order? (b) From a usability viewpoint. What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?

Overall, how effective is each service? What do you consider its strengths and its weaknesses? When would you use it?

Discussion Class 2, September 10, 2003

Read and be prepared to discuss:

G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of the ACM, Volume 18, Issue 11 (November 1975), pages 613-620. http://doi.acm.org/10.1145/361219.361220

This paper describes many of the concepts behind the vector space model and the SMART system.

{Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address.}
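
As a warm-up for the reading, the central idea of the vector space model can be sketched in a few lines: documents and queries become vectors of term weights, and similarity is the cosine of the angle between them. The sketch below is illustrative only and is not taken from the paper or from SMART. It assumes raw term counts as weights and whitespace tokenization, and the names term_vector and cosine_similarity are hypothetical. The term weighting schemes discussed in the paper are not reproduced here.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to raw term counts (lowercased, whitespace-split).

    Assumption: raw counts stand in for the term weights in the paper.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("red wine health")
doc1 = term_vector("red wine and heart health")
doc2 = term_vector("history of the library catalog")

# Rank the two documents against the query, most similar first.
ranked = sorted(
    [("doc1", cosine_similarity(query, doc1)),
     ("doc2", cosine_similarity(query, doc2))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Here doc1 shares three terms with the query and doc2 shares none, so doc1 ranks first. When reading the paper, consider how a different choice of term weights would change such a ranking.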

Discussion Class 3, September 17, 2003

Read and be prepared to discuss:

M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp 130-137, July 1980.) http://www.tartarus.org/~martin/PorterStemmer/def.txt

This paper describes one of the standard algorithms used for stemming English text.
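
To give a feel for the style of the algorithm before reading, here is a minimal sketch of Step 1a only, the first of its suffix-stripping steps, which handles plural endings. It is not a full stemmer: the later steps and the measure condition they depend on are omitted. The function name porter_step_1a is hypothetical, and the input is assumed to be a single lowercase word.

```python
# Step 1a rules, ordered so that the longest matching suffix is tried first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word):
    """Apply only Step 1a of Porter's algorithm to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


examples = ["caresses", "ponies", "ties", "caress", "cats"]
stems = [porter_step_1a(w) for w in examples]
```

The "ss" rule maps the suffix to itself so that words such as "caress" match it and stop there, instead of falling through to the plain "s" rule.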


William Y. Arms
(wya@cs.cornell.edu)
Last changed: September 15, 2003