CS430/INFO430: Information Retrieval

CS 430 / INFO 430
Information Retrieval
Fall 2005

Assignments

General Instructions

During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ is permitted.

Running your Programs

We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant.

The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.

When we run your Java programs, we run them from the command prompt using the "java ..." command. It is fine to develop your program and compile and test it in a development environment, such as Codewarrior or JCreator. But realize that it will be run from the command prompt. So, before submitting your program, test it from the command prompt.
If you are working in Java, you should submit your source Java files and your compiled .class files. Do not send .mcp files (i.e., Codewarrior project files) or any other project type files.
If you work in C or C++, send .c or .cpp and .exe files, with instructions on how to run from the command prompt. Your program should compile and run on Visual Studio .NET.

Repeat all instructions separately for each assignment.

Submission Instructions

Each assignment will require you to submit a list of files.

Assemble all the files into one .zip file.
Name the .zip file xxx-1.zip, where xxx is your NetID. (For example, if your NetID is abc123, create a file called abc123-1.zip.)
If you want to revise your assignment, submit the entire assignment again in a new file called abc123-2.zip, etc. We will grade the one with the highest sequence number.
Email your .zip file to cs430-l@cs.cornell.edu.
The subject field of the email message should be, "NetID - Assignment x", where x is the number of the assignment.

Assignment 1
Due: Saturday, September 17, at midnight

This assignment was slightly modified for clarity on September 8 and 15. The paragraphs that have been changed are identified by notes.

The object of this assignment is to write a suite of programs that take raw text and creates an inverted file.

Test collection

The test collection that you will use to test your programs has a set of files that are described on the Web page, testData.html. This Web page also describes several Excel spreadsheets that will be useful if you wish to check test results manually.

Programs

1. Inverted file setup

Write a program that reads the input data and create three files, as defined in Lecture 4.

(1a) Index file. A sorted list of the terms in the test collection, excluding stop words from the file stoplist.txt. The index data structure should be organized for both random look-up and sequential processing (lexicographic order). The structure should permit new terms to be added without rebuilding it completely. The index should contain the document frequency for each term (i.e., the number of documents that contain the term) and a link to the list of postings for that term.

[Note: This paragraph was modified on September 8 for clarity.]
(1b) Postings file. A list of postings for each term, sorted in order of documents and location within document. For each posting, the file should include the term frequency (i.e., the number of times that the term occurs in the document).
(1c) Document file. This file should contain the documents in the test collection. It may be organized in any convenient manner or you may use the documents stored on the web site.

The program should finish by writing out these files to disk, so that they can be read by the Test program. A binary dump of the data structures is not adequate.

[Note: This paragraph was added on September 15 for clarity.]

2. Test program

Write a test program. The program begins by reading the files that were created by the Index program. It then reads as input one term at a time.

[Note: This paragraph was modified on September 15 for clarity.]

If the term is in the word file, the program should display a list of the postings for that term. For each posting it should display:
- The term weightings, tf, idf and tf.idf (as defined in Lecture 3), for that posting.
- The document identifier and location within the document for that posting.
- About ten words from the document that include the term in its context within the document.
If the term is not in the inverted file, it should display a suitable message.
The program should loop until the term ZZZ is input.

Report

Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.

Submission

You should submit the following:

Programs 1 and 2 (source and binary)
Dumps of files 1a and 1b [For each file, list out the data in the first few records, with the values in the various fields. The definitions of the fields and the data structures used to store the records should be described in the report.]
Report

As part of the grading, we will run your programs against the test data.

Hints

Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:

You can choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., you do not need to implement B trees).
Consider batch loading of data (e.g., no need to provide for incremental update).
User interface can be minimal (e.g., single letter commands).
Use standard Java (or C++) classes where available, even if their performance is slow.

Assignment 2
Due: Sunday, October 9, at midnight

Because of disruptions to the computing facilities in Upson Hall scheduled for Saturday, the due date for Assignment 2 has been extended to midnight on Sunday, October 9.

[Minor revisions to the wording were made on, Wednesday, September 28.]

The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your programs in Java. For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition.

If you choose to write in C++ you will have to find a matrix package that will do singular value decomposition.

For the assignment, you will use the set of files that were used for Assignment 1. They are described on the Web page, testData.html.

A search engine that indexes full text using latent semantic indexing

Write a single program LSI that does the following:

Building the index(es)

Create the term document vector space and apply singular value decomposition to create an LSI representation.
Store the representation of the pages in the concept space.

Searching the indexes

After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms.

The program should do the following:

Transform the query vector to the LSI space.
Use a suitable similarity measure to compute and rank relevant documents.
For each hit, the program should print out a portion of the document from testData.

Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating.

Report

Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one page).

The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file.

Submission

You should submit the following:

Program LSI
Output from your test runs
Report

You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs.

Assignment 3
Due: Saturday, November 5, at midnight

The aim of this assignment is to implement a program, explore, in Java or C++ that provides a simple information discovery system with fielded searching and basic relevance feedback. This assignment builds on Assignment 1.

The information discovery process

The information discovery process is as follows.

The user types a query to the search system.
The search system searches a catalog and provides a ranked list of all records that match the query.
The user looks at the list and selects one or more that appear useful
The system creates a new query by combining terms from the selected records.
The new query is used to search the catalog and provide a new ranked list of matching records.

Each of these steps is described in more detail below.

Catalog data

The records to be indexed are contained in a file of tagged ASCII text of the following format:

The input data is delimited by a pair of tags: <metadata>, </metadata>.
Each record within the file is delimited by a pair of tags: <record>, </record>.
Each record consists of author and title fields delimited by pairs of tags: <author>, </author> or <title>,</title>

Each record begins on a new line. There may be blank lines. See the test data file for examples of this format. Your search engine does not need to handle records in any other format. A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.)

Inverted file set up

When the program starts up, it reads a file in the format of the test data, and builds an inverted index, which includes a word list and postings files of the search terms. Each posting should be labeled to show whether the term is in an <author> or <title> field. It should use the same stop list as for Assignment 1, stoplist.txt.

User interface

The program has a simple user interface that allows the user to input a query, display the results, identify relevant results as relevance feedback, and run the revised query. It should be possible to run several queries in a single run. This can be a very simple command line interface.

Input a query

To specify a search, the user must specify a query and a search option.

A query is a string of printable ASCII text containing one or more search terms.

Three search options are provided:

Author: Search for terms in the query that are in the <author> field
Title: Search for terms in the query that are in the <title> field
General: Search for terms in query that are in any field.

Search algorithm

The search algorithm should rank the similarity of each document to a query. Depending on the search option chosen, the similarity should be calculated using terms in the <author>, <title> or all fields.

Relevance feedback

After a search has been carried out, the user selects one or more catalog records from the results as relevant. The user also specifies a search option for the new search (author, title or general). A new query is created by the system which combines all the terms in the appropriate field of the selected records.

Another search is carried out using this revised query and search option.

{Hint: The standard tf.idf scheme is optimized for full text documents. In class we discussed fielded searches in general but made no specific recommendation for weights. You will have to devise a reasonable scheme and explain briefly why you chose it.}

Report

Provide a brief report that describes the algorithms and data structures that you use (about one page). Be sure to specify the similarity and ranking methods, and how the new query is created using relevance feedback.

The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files.

Submission checklist

You should submit the following:

The program explore
Report

Before submitting your assignment check the following:

Does the program compile and run program run "out of the box", or does the grader need to ensure certain things are in order (e.g., put files in the right directory, edit a script file). If so, did you note clearly what those things are?
Did your instructions include the arguments the program requires (if any)?
Did you include default arguments/datafiles for the program (if applicable)?

Did you provide detailed instructions for running the program in your report?
Does the program use absolute paths? (The answer should be no.)

While not required, the graders would be grateful if you also kept the following things in mind

Java is meant to be portable but it allows file paths to be system dependent. For example, the command new File("sourcefiles\datafile1.txt") will only function properly on a Windows machine. Consider the following alternatives for hard coding path names:

   new File(new URI("sourcefiles/datafile1.txt"))
   new File("sourcefiles", "datafile1.txt")
   new File("sourcefiles"+File.pathSeparator+"datafile1.txt")

Alternatively, do not hard code paths. Keep them in an external file, or allow the user to specify them on the command line (making note of this in your report).
When zipping your files for submission, zip a directory containing your files rather than the files themselves.
Not hard coding filenames is a good thing. It makes your code and your program modular. However, assignments often include large lists of files. If you avoid hard coding filenames, please don't make your grader type in all the filenames by hand (and be sure note how the grader can avoid this in your instructions)

Assignment 4
Due: Friday, December 2, at midnight

The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text.

The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer.

1. Extracting hyperlinks from the input data

The file test4.txt in the test directory contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience.

You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set.

Use the links to pages that are in the test data set to build the link matrix as defined in Lecture 20, Slide 16.

Extract the anchor text associated with each hyperlink and associate it with the page to which it points.

[See the file URLhints.html in the test directory for hints on extracting hyperlinks from these web pages.]

2. PageRank

Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.2.

3. Index

For each page build a short index record that contains only:

The title of the page. (This is tagged by a pair of tags <title> and </title>.)
Anchor text from pages that link to the page being indexed.

4. Metadata

Write a file, metadata, that contains the following metadata for each page: page ID, PageRank, list of terms in anchor text from pages that link to the page.

5. Searching

Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page that has that term in its short index record. The pages should be returned in the order of their PageRank.

Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page.

Report

Provide a brief report that describes the algorithms and data structures that you use (about one page).

Test

Run your program with five queries. Your report should list the queries and indicate what function each is testing.

Submission

You should submit the following:

Your program(s)
The file metadata
Test output and results file
Report

Before submitting your assignment check the following:

Does the program compile and run program run "out of the box", or does the grader need to ensure certain things are in order (e.g., put files in the right directory, edit a script file). If so, did you note clearly what those things are?
Did your instructions include the arguments the program requires (if any)?
Did you include default arguments/datafiles for the program (if applicable)?

Did you provide detailed instructions for running the program in your report?
Does the program use absolute paths? (The answer should be no.)

While not required, the graders would be grateful if you also kept the following things in mind

Java is meant to be portable but it allows file paths to be system dependent. For example, the command new File("sourcefiles\datafile1.txt") will only function properly on a Windows machine. Consider the following alternatives for hard coding path names:

   new File(new URI("sourcefiles/datafile1.txt"))
   new File("sourcefiles", "datafile1.txt")
   new File("sourcefiles"+File.pathSeparator+"datafile1.txt")

Alternatively, do not hard code paths. Keep them in an external file, or allow the user to specify them on the command line (making note of this in your report).
When zipping your files for submission, zip a directory containing your files rather than the files themselves.
Not hard coding filenames is a good thing. It makes your code and your program modular. However, assignments often include large lists of files. If you avoid hard coding filenames, please don't make your grader type in all the filenames by hand (and be sure note how the grader can avoid this in your instructions)

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 10, 2005