CS430/INFO430: Information Retrieval

CS 430 / INFO 430
Information Retrieval
Fall 2004

Assignments

General Instructions

During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ is permitted.

Running your Programs

We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant.

The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.

When we run your Java programs, we run them from the command prompt using the "java ..." command. It is fine to develop your program and compile and test it in a development environment, such as Codewarrior or JCreator. But realize that it will be run from the command prompt. So, before submitting your program, test it from the command prompt.
If you are working in Java, you should submit your source Java files and your compiled .class files. Do not send .mcp files (i.e., Codewarrior project files) or any other project type files.
If you work in C or C++, send .c or .cpp and .exe files, with instructions on how to run from the command prompt. Your program should compile and run on Visual Studio .NET.

Repeat all instructions separately for each assignment.

Submission Instructions

Each assignment will require you to submit a list of files.

Assemble all the files into one .zip file.
Name the .zip file xxx-1.zip, where xxx is your NetID. (For example, if your NetID is abc123, create a file called abc123-1.zip.)
If you want to revise your assignment, submit the entire assignment again in a new file called abc123-2.zip, etc. We will grade the one with the highest sequence number.
Email your .zip file to cs430-l@cs.cornell.edu.
The subject field of the email message should be, "NetID - Assignment x", where x is the number of the assignment.

Assignment 1
Due: Monday, September 20, at 5:00 p.m.

The object of this assignment is to write a suite of programs that take raw text and creates an inverted file.

Test collection

The test collection that you will use to test your programs has a set of files that are stored in the directory, testData.

Programs

1. Inverted file setup

Write a program that reads the input data and create three files, as defined in Lecture 4.

(1a) Index file. A sorted list of the terms in the test collection, excluding stop words from the file stoplist.txt. The index file should be organized for both random look-up and sequential processing (lexicographic order). The file structure should permit new terms to be added without rebuilding the entire file. The index file should contain the document frequency for each term (i.e., the number of documents that contain the term) and a link to the list of postings for that term.
(1b) Postings file. A list of postings for each term, sorted in order of documents and location within document. For each posting, the file should include the term frequency (i.e., the number of times that the term occurs in the document).
(1c) Document file. This file should contain the documents in the test collection. It may be organized in any convenient manner or you may use the documents stored on the web site.

2. Test program

Write a test program that receives as input one term at a time.

If the term is in the word file, the program should display a list of the postings for that term. For each posting it should display:
- The term weightings, tf.idf (as defined in Lecture 6), for that posting.
- The document identifier and location within the document for that posting.
- About ten words from the document that include the term in its context within the document.
If the term is not in the inverted file, it should display a suitable message.
The program should loop until the term ZZZ is input.

Report

Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.

Submission

You should submit the following:

Programs 1 and 2 (source and binary)
Dumps of files 1a and 1b [For each file, list out the data in the first few records, with the values in the various fields. The definitions of the fields and the data structures used to store the records should be described in the report.]
Report

As part of the grading, we will run your programs against the test data.

Hints

Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:

You can choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., you do not need to implement B trees).
Consider batch loading of data (e.g., no need to provide for incremental update).
User interface can be minimal (e.g., single letter commands).
Use standard Java (or C++) classes where available, even if their performance is slow.

Assignment 2
Due: Friday, October 8, at 5:00 p.m.

The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your programs in Java. For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition.

For the assignment, you will use the set of files that are stored in the directory, testData. This is the same test data as was used for Assignment 1.

A search engine that indexes full text using latent semantic indexing

Write a program LSI that does the following:

Building the index(es)

Create the term vector space and apply singular value decomposition to create an LSI representation for the pages.
Store the representation of the pages.
Create appropriate inverted file(s).

Searching the indexes

After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms.

The program should do the following:

Transform the query vector to the LSI space.
Use a suitable similarity measure to compute and rank relevant documents.
For each hit, the program should print out a portion of the document from testData.

Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating.

Report

Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one page).

The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file.

Submission

You should submit the following:

Program LSI
Output from your test runs
Report

You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs.

Assignment 3
Due: Monday, November 8, at 5:00 p.m.

The aim of this assignment is to implement a simple information retrieval system. This assignment builds on Assignment 1. If you did not complete Assignment 1, please contact a Teaching Assistant (email to cs430-l@cs.cornell.edu) to help you prepare for this assignment.

The search engine

The Syracuse Information Retrieval Experiment, SIRE, was a search engine that combined features from both Boolean and the vector space model. The basic concept is to use Boolean retrieval to find a set of records that match a query and then to use vector similarity to rank those records.

For this assignment, you will build a search engine that uses some of these concepts. The search engine processes each query against the collection of records as follows.

It does a Boolean match of the query against the terms in the records to create a set of hits, set1. Each result in set1 is an exact match against the query.
It ignores the Boolean operators and matches the query against the terms in the records to create a second set of results, set2. Each record in set2 has at least one term in common with a term in the query.
It ranks the records in set1 and set2 against the query, using standard tf.idf ranking. [Note. This item has been reworded for clarity.]
It provides as output the ranked records in set1 followed by the ranked records in set2.

Input data

A file of test data, test3.txt has been provided in testData.� For convenience the documents in the test data set have been combined into a single file, separated by <record>, </record> tags. See the test data file for examples of this format. Your search engine does not need to handle records in any other format.

The input data is delimited by a pair of tags: <metadata>, </metadata>.
Each record within the file is delimited by a pair of tags: <record>, </record>.
Each record begins on a new line. There may be blank lines.

This file corresponds to the document file in Assignment 1, with each record being a separate document. The objective of the search engine is to identify which documents (records) match each query.�

This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file.� Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.)

Query

A query is a string of printable ASCII text.� It may contain the following:

One or more search terms.�
The Boolean, and and or operators, with and having higher precedence.
The truncation symbol, ?,� at the end of a search term to indicate 0 or more characters.

Here are some examples of queries, with interpretation:

moll��
[ A single search term "moll".]
digit?�or number
[ A term that begins with the letters "digit" or the term "number".]
moll and digit? or number�
[ Either both the term "moll" and a term that begins with the letters "digit", or the term "number".]

Programs

Write two programs in Java or C++:

1. Inverted file set up

The first program should be called build. It will be a modified version of the first program that you wrote for Assignment 1. It reads a test file in this format, builds an inverted index, which includes an index file and postings files of the search terms, and writes the files to disk. It should use the same stop list as for Assignment 1, stoplist.txt.

2. Search

The second program should be called search. It allows a user to search the indexes and writes the results as an neatly formatted Web page.

The search program should have the following features:

The program is run from a command line interface. It begins by reading the index file and postings file created by the build program.
The user types in a query and receives a response, which is either an error message or a message:
```
  Number of records that match your query exactly: x
  Number of records that match your query partially: y
```
where x and y are the number of results in set1 and set2.
The results found should be written in html format to a file results.html. The results should contain the following
:

For a single run of the program, the results of each search should be appended to the file in sequence.
For each query the results should begin with a line containing the query and should end with two or more blank lines and a horizontal line, using the <hr> tag.
For each query, the results in set1 should be separated from the results in set2.
For each hit, the program should print out the record with the search terms highlighted in bold font.

Report

Provide a brief report that describes the algorithms and data structures that you use (about one page).

The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files.

Test

Run your program with five queries. Your report should list the queries and indicate what function each is testing.

Submission

You should submit the following:

Programs 1, 2
Test output and results file created by the second program
Report

Assignment 4
Due: Friday, December 3, at 5:00 p.m.

[Notes added November 24,2004.

A problem in line 13 of the test data has been corrected. The URL should end ".html".

For the calculation of PageRank, please use a damping factor of 0.3.

See the hint below about the form of snippet that is returned for each record found by the search.]

The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text.

The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer.

1. Extracting hyperlinks from the input data

The file test4.txt in the testData directory contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience.

You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set.

Use the links to pages that are in the test data set to build the link matrix as defined in Lecture 20, Slide 25.

Extract the anchor text associated with each hyperlink and associate it with the page to which it points.

[See the file URLhints.html in the testData directory for hints on extracting hyperlinks from these web pages.]

2. PageRank

Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.3.

3. Index

For each page build a short index record that contains only:

The title of the page. (This is tagged by a pair of tags <title> and </title>.)
Anchor text from pages that link to the page being indexed.

4. Searching

Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page that has that term in its short index record. The pages should be returned in the order of their PageRank.

Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page.

Report

Provide a brief report that describes the algorithms and data structures that you use (about one page).

Test

Run your program with five queries. Your report should list the queries and indicate what function each is testing.

Submission

You should submit the following:

Your program(s)
Test output and results file
Report

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 24, 2004