CS430/INFO430: Information Retrieval

CS 430 / INFO 430
Information Retrieval
Fall 2004

Books and Readings

General Books

There is no textbook for this course, but the following books cover much of the material.

Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
William B. Frakes and Ricardo Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
C. J. van Rijsbergen, Information Retrieval, Butterworths, 1979. http://www.dcs.gla.ac.uk/Keith/

Discussion Classes

Readings for discussion classes are to be studied in preparation for the classes on Wednesday evenings.

Discussion Class 1, September 1, 2004

In preparation for this class, explore three information retrieval systems and compare them:

Google -- a Web search engine (http://www.google.com/).
The Library of Congress catalog -- a very large bibliographic catalog (http://catalog.loc.gov/).
Medline -- an indexing and abstracting service for medicine and related fields (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi).

Consider the two information discovery tasks:

What is the medical evidence that red wine is good or bad for your health? (Use Medline and Google.)
What in history led to the current turmoil in Palestine and the neighboring countries? (Use the Library of Congress catalog.)

Study each search service in two ways.

(a) From a technical viewpoint. Does the service search full text or surrogates? Is fielded searching offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted and, if so, in what order?

(b) From a usability viewpoint. What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?

Overall, what do you consider the strengths and weaknesses of each service? When would you use them?

Discussion Class 2, September 8, 2004

Read and be prepared to discuss:

G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing. Communications of the ACM, Volume 18, Issue 11 (November 1975), pages 613-620. http://doi.acm.org/10.1145/361219.361220

This paper describes many of the concepts behind the vector space model and the SMART system.

{Note that to access this paper from the ACM Digital Library, you need to use a computer with a Cornell IP address.}
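
As a warm-up for the discussion, here is a minimal sketch of the idea behind the vector space model: documents become vectors of term weights, and similarity is the cosine of the angle between them. The weighting used (term frequency times log(N/df)) is one common tf-idf variant chosen for illustration; it is an assumption and not necessarily the exact scheme in the paper. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df).

    This weighting is an illustrative assumption, not the paper's exact formula.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sample documents.
docs = [
    "information retrieval systems index documents",
    "vector space retrieval ranks documents",
    "red wine and health",
]
vectors = tf_idf_vectors(docs)
```

Comparing `cosine(vectors[0], vectors[1])` with `cosine(vectors[0], vectors[2])` shows that the two documents sharing terms score higher than the pair with no terms in common.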

Discussion Class 3, September 15, 2004

Read and be prepared to discuss:

M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp 130-137, July 1980.) http://www.tartarus.org/~martin/PorterStemmer/def.txt

This paper describes one of the standard algorithms used for stemming English text.
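
To give a flavor of how suffix stripping works, the sketch below implements only Step 1a of the algorithm, which handles plural endings (SSES to SS, IES to I, SS unchanged, S removed). It is a partial illustration under that stated scope and not a full stemmer; the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply only Step 1a of Porter's algorithm to a lowercase word.

    Rules are tried longest suffix first:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

# Sample words for each rule.
examples = ["caresses", "ponies", "ties", "caress", "cats"]
stems = [porter_step_1a(w) for w in examples]
```

The later steps of the algorithm depend on the "measure" of a stem (its count of vowel-consonant sequences), which this sketch does not attempt.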

Discussion Class 4, September 22, 2004

Read and be prepared to discuss the following paper, concentrating on Sections 1 to 4, and 5.3. You do not need to study the details of the methods described in Sections 5.1 and 5.2. Section 6 is for general interest only.

E. Voorhees, D. Harman, Overview of the Eighth Text REtrieval Conference (TREC-8). http://trec.nist.gov/pubs/trec8/papers/overview_8.ps.

This is one of a sequence of publications. The full sequence of TREC publications is at http://trec.nist.gov/pubs.html.

{Note that this paper is in PostScript format. You can view it using the GhostView viewer, which is available on the Web for downloading for all standard computer systems. The PDF version of the file on the TREC Web site is damaged. Here is a PDF version that was generated from the PostScript file.}

Discussion Class 5, September 29, 2004

Read and be prepared to discuss the following paper:

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis". Journal of the American Society for Information Science, Volume 41, Issue 6, 1990. http://www3.interscience.wiley.com/cgi-bin/issuetoc?ID=10049584

{Note that to access this paper from Wiley InterScience, you need to use a computer with a Cornell IP address.}

Discussion Class 6, October 6, 2004

Read and be prepared to discuss:

Caroline R. Arms and William Y. Arms, "Mixed Content and Mixed Metadata: Information Discovery in a Messy World." In Metadata in Practice, edited by Diane Hillmann and Elaine Westbrooks, ALA Editions, 2004. http://www.cs.cornell.edu/wya/papers/ALA-2003.php.

This paper provides an overview of many of the topics that will be covered in the second half of the course.

Discussion Class 7, October 20, 2004

The purpose of this class is to explore the Jakarta Lucene search engine. It is described on the web site:

http://jakarta.apache.org/lucene/

This is a large web site and you are not expected to read everything on the site. Concentrate on the following:

What are the underlying search mechanisms supported by Lucene? What algorithms does it use? What data structures?
How do you load free text into Lucene? How do you load fielded text? What format options are there? How does it handle various character sets, stoplists, stemming, etc.?
How do you incorporate Lucene queries and results into your own user interface?
If you wanted to modify Lucene to support a novel search algorithm, how would you go about it?

Discussion Class 8, October 27, 2004

Read and be prepared to discuss the following paper:

Betty Furrie, "Understanding MARC Bibliographic: Machine-Readable Cataloging". Library of Congress, Network Development and MARC Standards Office, 2003. http://www.loc.gov/marc/umb/

In reading this paper, concentrate on understanding what MARC is and what it does. Who uses MARC and for what purpose? What is the underlying data model? MARC was developed in the 1960s. What problems does this cause today?

Discussion Class 9, November 3, 2004

Read and be prepared to discuss:

Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. Seventh International World Wide Web Conference. Brisbane, Australia, 1998. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm

Note. A second copy of this paper is available at http://www-db.stanford.edu/~backrub/google.html.
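
The paper introduces PageRank, which ranks pages by the link structure of the Web. The following is a minimal sketch of computing it by repeated iteration on a tiny hypothetical graph. It assumes the normalized form of the formula, in which ranks sum to 1, and spreads the rank of pages with no outgoing links evenly over all pages; both choices, along with the function name and the sample graph, are assumptions made for illustration and not details taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a dict {page: [pages it links to]}.

    Assumes the normalized variant (ranks sum to 1); pages without
    outgoing links share their rank evenly among all pages.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Each page divides its rank equally among its outgoing links.
                new_rank[t] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page graph: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this graph page C receives the most incoming links and ends up with the highest rank, while page D, which nothing links to, ends up with the lowest.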

Discussion Class 10, November 10, 2004

Read and be prepared to discuss:

Howard Wactlar, Informedia - Search and Summarization in the Video Medium. Proceedings of Imagina 2000 Conference, Monaco, January 31 - February 2, 2000. http://www.informedia.cs.cmu.edu/documents/imagina2000.pdf

Discussion Class 11, November 17, 2004

The aim of this class is to explore Search/Retrieve Web Services (SRW). The main reading is:

Rob Sanderson, A Gentle Introduction to SRW. Z39.50 International Maintenance Agency, February 2004. http://www.loc.gov/z3950/agency/zing/srw/introduction.html

To appreciate this paper, you will need also to read an introduction to Z39.50:

Clifford A. Lynch, The Z39.50 Information Retrieval Standard, Part I: A Strategic View of Its Past, Present and Future, D-Lib Magazine, April 1997. http://www.dlib.org/dlib/april97/04lynch.html

You will also need to read an introduction to the Common Query Language (CQL):

Mike Taylor, A Gentle Introduction to CQL. Z39.50 International Maintenance Agency, September 2003. http://zing.z3950.org/cql/intro.html

Discussion Class 12, December 1, 2004

The purpose of this discussion class is to explore the Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS), both maintained by the National Library of Medicine. There is no fixed reading. You have to search for documents about MeSH and UMLS, decide what to read, and gain your own overview of the subject matter.

In the class, we will discuss what you found about MeSH and UMLS. We will also discuss the search strategy that you followed.

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 9, 2004