CS 430
Information Discovery
Fall 2002

Assignments


Instructions on How to Submit Programs

Please see the additional information below under "Running your Programs."

11/4/02

Notice:  Assignment 2

Assignment 2 is posted below.  Note that the due date has been changed.

This assignment builds on Assignment 1.  If you did not complete Assignment 1, please contact a Teaching Assistant (email to cs430@cs.cornell.edu) to help you prepare for this assignment.

10/8/02

Assignment 1

There has been a delay in setting up the email address for this assignment (cs430-1@cs.cornell.edu).  If this address is not set up when you submit, please send your assignment to cs430@cs.cornell.edu.

Several people have asked for more details about the file formats for Files 1a, 1b and 2a.  Designing these formats is part of the assignment.  There is no single correct answer.  Your report should specify the format and explain the choices that you made.

9/19/02


General Instructions

During this course there will be four programming assignments, which together form a simple web search system.

The preferred programming language is Java.  If, however, you have limited Java experience C++ is permitted.

Running your Programs

The graders will run your programs to grade them.  In addition, the assignments give specific instructions that are intended to make your programs easy for the graders to run and grade consistently.  Points will be subtracted if you do not follow these instructions.

  • When the graders run your Java programs, they run them from the command prompt using the "java ..." command.  It is fine to develop your program and compile and test it in a development environment, such as Codewarrior or JCreator.  But realize that it will be run from the command prompt. So, before submitting your program, test it from the command prompt and submit instructions on how to run it from the command prompt. 
  • What programming files should you submit? If you are working in Java, submit your source .java files and your compiled .class files. Do not send .mcp files (i.e., Codewarrior project files) or any other project type files. If you work in C or C++, send .c or .cpp and .exe files (both with instructions on how to run from the command prompt). 
  • If you use the -classpath option when running your code and there is an aspect that is operating system specific, please tell us about it.  The graders test in both Unix and Windows. Please tell us how to do both. 
  • The same person will not be grading your assignment every week. This is because we want to make sure we are fair to everyone. Therefore, repeat all instructions separately for each assignment. Just be clear and explain things. 

If you have any questions about this, PLEASE email cs430@cs.cornell.edu or make arrangements to meet with one of the Teaching Assistants.

Submission Instructions

Each assignment will require you to submit a list of files.  

  • Assemble all  the files into one *.zip file.
  • Name the *.zip file xxx-1.zip, where xxx is your NetID. (For example, if your NetID is abc123, create a file called abc123-1.zip.) 
  • If you want to revise your assignment, submit the entire assignment again in a new file called abc123-2.zip, etc. We will grade the one with the highest sequence number. 

There will be special email addresses set up for each assignment.  They are cs430-1@cs.cornell.edu, cs430-2@cs.cornell.edu, etc.  

  • Email your  *.zip file to the email address for this assignment. 
  • The subject field of the email message should be, "NetID - Assignment x", where x is the number of the assignment.

9/13/02

Assignment 1: Building an Inverted Index
Due: Friday, September 20, at 5:00 p.m.

The object of this assignment is to write a suite of programs that take raw text and prepare the entries for an inverted file. 

Test data 

The text that you will use to test your programs is 16 source files from the Computer Science web site. The files are stored in a Zip archive, webpages.zip.

Programs 

1. Lexical analysis and stop list

This program should read the input data and output two files: 

  • (1a) Sorted word list with document frequencies 
  • (1b) Postings file, sorted in alphabetic order

Use the stop list in the file stoplist.txt.  In addition, reject all HTML tags (strings that are delimited by the symbols < and >) and HTML special characters (alphanumeric strings that are delimited by & and ;).  

2. Term frequency and inverse document frequency

Use the standard tf.idf weighting scheme, described in Lecture 5. The program should read the files written by the first program and output one additional file: 

  • (2a) Postings file with term frequency and inverse document frequency for each posting, sorted in alphabetic order. 

3. Test program

This program receives as input one term at a time.  If the term is in the inverted file, it should display a list of the postings for that term, including the term weightings.  If the term is not in the inverted file, it should display a suitable message.

The program should loop until the term ZZZ is input.

Report 

Write a brief report (not more than one page) that describes the algorithms and data structures that you used for each file and program. 

Submission 

You should submit the following: 

  • Programs 1, 2, 3 (source and binary)
  • Files 1a, 1b, 2a using the test data 
  • Report 

As part of the grading, the graders will run your programs against the test data.  

Hints

Remember that you are not building a production system and the volume of test data is quite small.  This suggests the following:

  • You can choose data structures, etc. that illustrate the concepts but are straightforward to implement (e.g., do not implement B trees). 
  • Consider batch loading of data (e.g., no need to provide for incremental update). 
  • User interface can be minimal (e.g., single letter commands). 
  • Use standard Java or C++ classes where available, even if their performance is slow..

To save typing, the files C Code and Java Code contain the arrays char_class and convert_class from Frake Section 7.2. 

Revised: 9/14/02

Assignment 2: A Boolean Fielded Search Engine
Due: Monday, October 21, at 5 p.m.

The aim of this assignment is to implement a Boolean information retrieval system for fielded searching.

This assignment builds on Assignment 1.  If you did not complete Assignment 1, please contact a Teaching Assistant (email to cs430@cs.cornell.edu) to help you prepare for this assignment.

Input data

The input data that is to be indexed is a file of tagged ASCII text of the following format:

  • The input data is delimited by a pair of tags: <metadata>, </metadata>.
  • Each record within the file is delimited by a pair of tags: <record>, </record>.
  • Each record consists of a sequence of fields, where each field is delimited by a pair of tags: <field name>, </field name> The field name can be any alphabetic string.

Each record and each field begins on a new line. Fields may not extend over more than one line. There may be blank lines.

See the test data file for examples of this format. Your search engine does not need to handle data in any other format, but must be prepared to recognize other field tags for different metadata types.

A file of test data test2.txt has been provided.  This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file.  Your program can ignore the data before the first <metadata> tag.)

Query

A query is a string of printable ASCII text.  It may contain the following:

  • One or two search terms, consisting of a field name enclosed in <>, followed by a search term.  If the field name <any> is given, the query is applied to all fields. 
  • The Boolean, and and or operators.  
  • Parentheses need not be supported.
  • Phrases need not be supported.
  • The truncation symbol, *,  at the end of a search term to indicate 0 or more characters. 

Here are some examples of queries, with interpretation:

  • <creator> moll     
    (The creator field includes the term "moll".)
  • <title> digit*     
    (The title field includes a term that begins with the letters "digit".)
  • <creator> moll and <title> digit*    
    (The creator field includes the term "moll" and the title field includes a term that begins with the letters "digit".)
  • <any> digital or <any> library   
    (Any field contains the term "digital" or the term "library" .)

Programs

Write two programs in Java or C++:

1.  The first program should be called build.  It reads a test file in this format and builds an inverted index of the search terms. It should use the same stop list as for Assignment 1, stoplist.txt.  The program should create:
  • a file called word-list.txt which is a list of the terms in the index, after removal of words from the stop list.  This list should be in alphabetical order with one term per line.
  • the index file(s) to be used as input to the second program.
2.  The second program should be called search.  It allows a user to search the indexes. It should have the following features:
  • The program is run from a command line interface.  The user types in a query and receives a response, which is either an error message or a message , "x records match your query", where x is a count, 0 or greater.
  • The results found should be written in html format to a file results.html.  The results should contain the following:
    • For a single run of the program, the results of each search should be appended to the file in sequence.
    • For each query the results should begin with a line containing the query and should end with two or more blank lines and a horizontal line, using the <hr> tag.
    • For each hit, the program should print out a catalog record with the search terms highlighted in bold font.
  • The interface should allow a search on any of the field names that are tagged in the metadata records.  Searching on other field names is an error.

Report

1.  Provide a brief report that describes the algorithms that you use (not more than one page).

2.  The graders will run your program, but they will not go into the source code and modify it.  All files and any other options must be specified as input.  One option is to provide file names as command line parameters.  Write a short set of instructions called ReadMe that tells the grader how to run your program, including how to specify the stop list and test data files.

Test

Run your program with five queries.  Your report should list the queries and indicate what function each is testing.

Submission 

You should submit the following: 

  • Programs 1, 2
  • Files created by the first program
  • Test output and file(s) created by  the second program
  • Report 
  • ReadMe

As part of the grading, the graders will run your programs using the instructions in your ReadMe file. 

10/8/02


Assignment 3
Due:   Monday, November 11 at 5:00 p.m.

The purpose of this assignment is to write a simple web crawler.  For the final assignment you will combine the work of the first three assignments to create a basic web search engine.

Program

Write a program in Java or C++, called crawl, that crawls a web site.  The program should have the following features:

  • The program is run from a command line.  It has two input parameters: a URL and a stopping parameter.
  • The URL defines the web site that is to be crawled.  For example, if the URL is http://www.foo.edu/bar/index.html, the program should crawl www.foo.edu/bar/, beginning with the file index.htm.
  • The stopping parameter is either a positive integer or zero.  Zero indicates that the crawl should continue until all links have been examined.  A positive integer is a count of the number of pages to examine before stopping.
  • The program should ignore links to all files except those of type html (including htm) or txt.  Do not look for links from pages of type txt.
  • The program should print out a list of all pages discovered and, for pages of type html, of all links from that page.
  • The program should respect robots.txt files.

Report

1.  Provide a brief report that describes the algorithms that you use (not more than one page).

2.  The graders will run your program, but they will not go into the source code and modify it.  All files and any other options must be specified as input.  Write a short set of instructions called ReadMe that tells the grader how to run your program, including how to specify the input parameters.

Test

Test you program with two files.  One should be the Information Science web site, http://www.cis.cornell.edu/infoscience/index.htm.  The other should be a site that you choose to test the program.

Submission 

You should submit the following: 

  • Programs, crawl
  • Test outputs
  • Report 
  • ReadMe

As part of the grading, the graders will run your programs using the instructions in your ReadMe file. 

10/29/02

Assignment 4
Due:   Friday, December 6 at 5:00 p.m.

The purpose of this assignment is to combine the work done in the previous assignments to create a simple search and discovery system for a Web site.  If you did the other assignments well, this should be a comparatively simple assignment.  Alternatively, it is an opportunity to compensate for problems with the earlier assignments.

Programs

1.  Index a Web Site

Write a program in Java or C++, called index, that crawls a web site and indexes it.  The program should have the following features:

  • The program is run from a command line.  It has two input parameters: a URL and a stopping parameter.
  • The index or indexes should be designed to support the services provided by the Web search and discovery service.

2.  Web Search and Discovery Service

Write a program in Java or C++, called discover, that reads the indexes created by first program and provides the following services to users:

Search options.  Given one or more words as input:

  • Find every page with all of the words.
  • Find every page with at least one of the words.
  • Find every page with all the words in a <title> field.
  • Find every page with an exact phrase which is the same as the input.

Browse options.  Given the URL of a page in the Web site:

  • List the URL's of all pages that it links to.
  • List the URL's of pages in the Web site that link to it.

Interface.

  • The program should run from the command prompt.  
  • The program should loop until the user instructs it to stop, e.g., with special input message, ZZZ.  The stopping command should be specified in the ReadMe file.
  • (Extra credit) Provide a Graphical User Interface and display the output as formatted html.

Report

1.  Provide a brief report that describes the algorithms and data structures that you use (not more than one page).

2.  The graders will run your programs, but they will not go into the source code and modify it.  All files and any other options must be specified as input.  Write a short set of instructions called ReadMe that tells the grader how to run your program, including how to specify the input parameters.

Test

Test you program with two files.  One should be the Information Science web site, http://www.cis.cornell.edu/infoscience/index.htm.  The other should be a site that you choose to test the program.

Submission 

You should submit the following: 

  • Programs, index and discover
  • Test outputs
  • Report 
  • ReadMe

As part of the grading, the graders will run your programs using the instructions in your ReadMe file. 

11/18/02


[CS 430 Home Page]

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 19, 2002