CS 430 / INFO 430
Information Retrieval
Fall 2004
Assignments
|
General Instructions During this course there will be several assignments which require programming. Java is strongly preferred as the programming language. If, however, you have limited Java experience C++ is permitted. Running your Programs We will run your programs on a Windows computer, using the software environment in the CSUG lab. If you have any questions about this, please email cs430-l@cs.cornell.edu or make arrangements to meet with the Teaching Assistant. The assignments give specific instructions that are intended to make your programs easy for us to run and grade consistently. Points will be subtracted if you do not follow these instructions.
Repeat all instructions separately for each assignment. Submission Instructions Each assignment will require you to submit a list of files.
Assignment 1 The object of this assignment is to write a suite of programs that take raw text and creates an inverted file. Test collection The test collection that you will use to test your programs has a set of files that are stored in the directory, testData.Programs 1. Inverted file setup Write a program that reads the input data and create three files, as defined in Lecture 4.
2. Test program Write a test program that receives as input one term at a time.
Report Write a brief report (one to two pages) that describes the algorithms and data structures that you used for each file and Program 1. The report should include instructions on how to run your programs. The report will be about one third of the grade.Submission You should submit the following:
As part of the grading, we will run your programs against the test data. Hints Remember that you are not building a production system and the volume of test data is quite small. This suggests the following:
Assignment 2 The purpose of this assignment is to demonstrate your understanding of latent semantic indexing. For this assignment it is recommended that you write your programs in Java. For the matrix calculations in Java, you can use the JAMA package http://math.nist.gov/javanumerics/jama/, which provides singular value decomposition. For the assignment, you will use the set of files that are stored in the directory, testData. This is the same test data as was used for Assignment 1. A search engine that indexes full text using latent semantic indexing Write a program LSI that does the following: Building the index(es)
Searching the indexes After the indexes have been built, the user interface accepts queries from users. A query is a simple list of terms. The program should do the following:
Run four tests of your program that demonstrate its features. For each test, state which features it is demonstrating. Report Provide a brief report that describes: the mathematical basis for your programs, how your program takes advantage of the capabilities of JAMA, and which features you have implemented yourself (about one page). The graders will run your programs, but they will not go into the source code and modify it. All files and any other options must be specified as input. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data file. Submission You should submit the following:
You do not need to submit JAMA, but your report should state very clearly what the graders should do to build and run your programs. Assignment 3 The aim of this assignment is to implement a simple information retrieval system. This assignment builds on Assignment 1. If you did not complete Assignment 1, please contact a Teaching Assistant (email to cs430-l@cs.cornell.edu) to help you prepare for this assignment. The search engine The Syracuse Information Retrieval Experiment, SIRE, was a search engine that combined features from both Boolean and the vector space model. The basic concept is to use Boolean retrieval to find a set of records that match a query and then to use vector similarity to rank those records. For this assignment, you will build a search engine that uses some of these concepts. The search engine processes each query against the collection of records as follows.
Input data A file of test data, test3.txt has been provided in testData. For convenience the documents in the test data set have been combined into a single file, separated by <record>, </record> tags. See the test data file for examples of this format. Your search engine does not need to handle records in any other format.
This file corresponds to the document file in Assignment 1, with each record being a separate document. The objective of the search engine is to identify which documents (records) match each query. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag. If you are knowledgeable about tools for processing XML files, e.g., XSLT, you may use them for this assignment.) Query A query is a string of printable ASCII text. It may contain the following:
Here are some examples of queries, with interpretation:
Programs Write two programs in Java or C++: 1. Inverted file set up The first program should be called build. It will be a modified version of the first program that you wrote for Assignment 1. It reads a test file in this format, builds an inverted index, which includes an index file and postings files of the search terms, and writes the files to disk. It should use the same stop list as for Assignment 1, stoplist.txt. 2. Search The second program should be called search. It allows a user to search the indexes and writes the results as an neatly formatted Web page. The search program should have the following features:
Report Provide a brief report that describes the algorithms and data structures that you use (about one page). The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the stop list and test data files. Test Run your program with five queries. Your report should list the queries and indicate what function each is testing. Submission You should submit the following:
Assignment 4 [Notes added November 24,2004.
The purpose of this assignment is to demonstrate familiarity with concepts used in searching the web, such as the PageRank algorithm and the use of anchor text. The assignment is divided into four separate parts. You will probably find that it is most convenient to write a single program that does all four parts, but you may submit several programs if you prefer. 1. Extracting hyperlinks from the input data The file test4.txt in the testData directory contains a list of URLs, one per line. Each URL identifies an html page on the Information Science web site. The pages have been numbered for convenience. You will need to write code that reads the html pages in the test data. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set. [See the file URLhints.html in the testData directory for hints on extracting hyperlinks from these web pages.] 2. PageRank Use the PageRank algorithm to calculate a ranking of the pages. Use a damping factor of 0.3. 3. Index For each page build a short index record that contains only:
4. Searching Build a simple search program that receives as input a single term, searches the short index records and returns a snippet for each page that has that term in its short index record. The pages should be returned in the order of their PageRank. Hint: It is not necessary that the snippet contains the search term. If fact, it may happen that the a page is returned that does not contain the search term. A possible snippet might consist of the title followed by the first few words of the page. Report Provide a brief report that describes the algorithms and data structures that you use (about one page). The graders will run your program, but they will not go into the source code and modify it. All files and any other options must be specified as input. One option is to provide file names as command line parameters. The report should include a short set of instructions that tells the grader how to run your program, including how to specify the test data file. Test Run your program with five queries. Your report should list the queries and indicate what function each is testing. Submission You should submit the following:
|
[ Home | Notices | Syllabus | Readings | Assignments | Examinations | Academic Integrity ]
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 24, 2004