|
CS
430
Information Discovery
Fall 2002
Assignments |
Instructions on How to Submit Programs
Please see the additional information below under "Running your
Programs."
11/4/02
Notice: Assignment 2
Assignment 2 is posted below. Note that the due date has been changed.
This assignment builds on Assignment 1. If you did not complete
Assignment 1, please contact a Teaching Assistant (email to cs430@cs.cornell.edu)
to help you prepare for this assignment.
10/8/02
Assignment 1
There has been a delay in setting up the email address for this assignment (cs430-1@cs.cornell.edu).
If this address is not set up when you submit, please send your assignment to cs430@cs.cornell.edu.
Several people have asked for more details about the file formats for Files
1a, 1b and 2a. Designing these formats is part of the assignment.
There is no single correct answer. Your report should specify the format
and explain the choices that you made.
9/19/02
General Instructions
During this course there will be four programming assignments, which together
form a simple web search system.
The preferred programming language is Java. If, however, you have
limited Java experience C++ is permitted.
Running your Programs
The graders will run your programs to grade them. In addition,
the assignments give specific instructions that are intended to make your
programs easy for the graders to run and grade consistently. Points
will be subtracted if you do not follow these instructions.
- When the graders run your Java programs, they run them from the
command prompt using the "java ..." command. It is
fine to develop your program and compile and test it in a development
environment, such as Codewarrior or JCreator. But realize that
it will be run from the command prompt. So, before submitting your
program, test it from the command prompt and submit
instructions on how to run it from the command prompt.
- What programming files should you submit? If you are working in
Java, submit your source .java files and your compiled .class files. Do
not send .mcp files (i.e., Codewarrior project files) or any other
project type files. If you work in C or C++, send .c or .cpp and .exe
files (both with instructions on how to run from the command
prompt).
- If you use the -classpath option when running your code and there is
an aspect that is operating system specific, please tell us about
it. The graders test in both Unix and Windows. Please tell us
how to do both.
- The same person will not be grading your assignment every
week. This is because we want to make sure we are fair to everyone.
Therefore, repeat all instructions separately for each assignment.
Just be clear and explain things.
If you have any questions about this, PLEASE email cs430@cs.cornell.edu or
make arrangements to meet with one of the Teaching Assistants.
Submission Instructions
Each assignment will require you to submit a list of files.
- Assemble all the files into one *.zip file.
- Name the *.zip file xxx-1.zip, where xxx is your
NetID. (For example, if your NetID is abc123, create a file called abc123-1.zip.)
- If you want to revise your assignment, submit the entire assignment
again in a new file called abc123-2.zip, etc. We will grade the
one with the highest sequence number.
There will be special email addresses set up for each assignment.
They are cs430-1@cs.cornell.edu,
cs430-2@cs.cornell.edu,
etc.
- Email your *.zip file to the email address for this
assignment.
- The subject field of the email message should be, "NetID -
Assignment x", where x is the number of the assignment.
9/13/02
Assignment 1: Building an Inverted Index
Due: Friday, September 20, at 5:00 p.m.
The object of this assignment is to write a suite of programs that take
raw text and prepare the entries for an inverted file.
Test data
The text that you will use to test your programs is 16 source
files from the Computer Science web site. The files are stored in a Zip
archive, webpages.zip.
Programs
1. Lexical analysis and stop list
This program should read the input data and output two
files:
- (1a) Sorted word list with document frequencies
- (1b) Postings file, sorted in alphabetic order
Use the stop list in the file stoplist.txt. In addition, reject
all HTML tags (strings that are delimited by the symbols < and >)
and HTML special characters (alphanumeric strings that are delimited by
& and ;).
2. Term frequency and inverse document frequency
Use the standard tf.idf weighting scheme, described in Lecture 5. The program
should read the files written by the first program and output one additional
file:
- (2a) Postings file with term frequency and inverse document
frequency for each posting, sorted in alphabetic order.
3. Test program
This program receives as input one term at a time. If the term is
in the inverted file, it should display a list of the postings for that
term, including the term weightings. If the term is not in the
inverted file, it should display a suitable message.
The program should loop until the term ZZZ is input.
Report
Write a brief report (not more than one page) that describes the
algorithms and data structures that you used for each file and program.
Submission
You should submit the following:
- Programs 1, 2, 3 (source and binary)
- Files 1a, 1b, 2a using the test data
- Report
As part of the grading, the graders will run your programs against the
test data.
Hints
Remember that you are not building a production system and the volume
of test data is quite small. This suggests the following:
- You can choose data structures, etc. that illustrate the concepts
but are straightforward to implement (e.g., do not implement B
trees).
- Consider batch loading of data (e.g., no need to provide for
incremental update).
- User interface can be minimal (e.g., single letter commands).
- Use standard Java or C++ classes where available, even if their
performance is slow..
To save typing, the files C Code and Java
Code contain
the arrays char_class and convert_class from Frake Section 7.2.
Revised: 9/14/02
Assignment 2: A Boolean Fielded Search Engine
Due: Monday, October 21, at 5 p.m.
The aim of this assignment is to implement a Boolean information retrieval system for fielded searching.
This assignment builds on Assignment 1. If you did not complete
Assignment 1, please contact a Teaching Assistant (email to cs430@cs.cornell.edu)
to help you prepare for this assignment.
Input data
The input data that is to be indexed is a file of tagged ASCII text of the following format:
- The input data is delimited by a pair of tags: <metadata>, </metadata>.
- Each record within the file is delimited by a pair of tags: <record>, </record>.
- Each record consists of a sequence of fields, where each field is delimited by a pair of
tags: <field name>, </field name>. The field name can be any
alphabetic string.
Each record and each field begins on a new line. Fields may not extend over more than one line. There may be blank lines.
See the test data file for examples of this format. Your search engine does not need to handle data in any other format,
but must be prepared to recognize other field tags for different metadata types.
A file of test data test2.txt has been
provided. This data is the catalog records from one year of articles in D-Lib Magazine,
lightly edited. (This test data is actually an XML file. Your program can
ignore the data before the first <metadata> tag.)
Query
A query is a string of printable ASCII text. It may contain the
following:
- One or two search terms, consisting of a field name enclosed in
<>, followed by a search term. If the field name <any> is given, the
query is applied to all fields.
- The Boolean, and and or operators.
-
Parentheses need not be supported.
- Phrases need not be supported.
- The truncation symbol, *, at the end of a search term
to indicate 0 or more characters.
Here are some examples of queries, with interpretation:
- <creator> moll
(The creator field includes
the term "moll".)
- <title> digit*
(The title field includes a
term that begins with the letters "digit".)
- <creator> moll and <title> digit*
(The
creator field includes the term "moll" and the title field
includes a term that begins with the letters "digit".)
- <any> digital or <any> library
(Any field contains the term
"digital" or the term "library" .)
Programs
Write two programs in Java or C++:
1. The first program should be called build. It reads a
test file in this format and builds an inverted index of the search terms.
It should use the same stop list as for Assignment 1, stoplist.txt.
The program should create:
- a file called word-list.txt which is a list of the terms in
the index, after removal of words from the stop list. This list
should be in alphabetical order with one term per line.
- the index file(s) to be used as input to the second program.
2. The second program should be called search. It
allows a user to search the indexes. It should have the following
features:
- The program is run from a command line interface. The user
types in a query and receives a response, which is either an error
message or a message , "x records match your query", where x
is a count, 0 or greater.
- The results found should be written in html format to a file results.html.
The results should contain the following:
- For a single run of the program, the results of each search should
be appended to the file in sequence.
- For each query the results should begin with a line containing the
query and should end with two or more blank lines and a horizontal
line, using the <hr> tag.
- For each hit, the program should print out a catalog record with
the search terms highlighted in bold font.
- The interface should allow a search on any of the field names that
are tagged in the metadata records. Searching on other
field names is an error.
Report
1. Provide a brief report that describes the algorithms that you use (not
more than one page).
2. The graders will run your program, but they will not go
into the source code and modify it. All files and any other
options must be specified as input. One option is to provide file
names as command line parameters. Write a short set of instructions
called ReadMe that tells the grader how to run your program, including how
to specify the stop list and test data files.
Test
Run your program with five queries. Your report should list
the queries and indicate what function each is testing.
Submission
You should submit the following:
- Programs 1, 2
- Files created by the first program
- Test output and file(s) created by the second program
- Report
- ReadMe
As part of the grading, the graders will run your programs using the
instructions in your ReadMe file.
10/8/02
Assignment 3
Due: Monday, November 11 at 5:00 p.m.
The purpose of this assignment is to write a simple web crawler. For
the final assignment you will combine the work of the first three assignments to
create a basic web search engine.
Program
Write a program in Java or C++, called crawl, that crawls a web
site. The program should have the following features:
- The program is run from a command line. It has two input
parameters: a URL and a stopping parameter.
- The URL defines the web site that is to be crawled. For
example, if the URL is http://www.foo.edu/bar/index.html, the program
should crawl www.foo.edu/bar/,
beginning with the file index.htm.
- The stopping parameter is either a positive integer or zero.
Zero indicates that the crawl should continue until all links have
been examined. A positive integer is a count of the number of
pages to examine before stopping.
- The program should ignore links to all files except those of type
html (including htm) or txt. Do not look for links from pages of
type txt.
- The program should print out a list of all pages discovered and, for
pages of type html, of all links from that page.
- The program should respect robots.txt files.
Report
1. Provide a brief report that describes the algorithms that you use (not
more than one page).
2. The graders will run your program, but they will not go
into the source code and modify it. All files and any other
options must be specified as input. Write a short set of instructions
called ReadMe that tells the grader how to run your program, including how
to specify the input parameters.
Test
Test you program with two files. One should be the Information
Science web site, http://www.cis.cornell.edu/infoscience/index.htm.
The other should be a site that you choose to test the program.
Submission
You should submit the following:
- Programs, crawl
- Test outputs
- Report
- ReadMe
As part of the grading, the graders will run your programs using the
instructions in your ReadMe file.
10/29/02
Assignment 4
Due: Friday, December 6 at 5:00 p.m.
The purpose of this assignment is to combine the work done in the previous
assignments to create a simple search and discovery system for a Web site.
If you did the other assignments well, this should be a comparatively simple
assignment. Alternatively, it is an opportunity to compensate for problems
with the earlier assignments.
Programs
1. Index a Web Site
Write a program in Java or C++, called index, that crawls a web site
and indexes it. The program should have the following features:
- The program is run from a command line. It has two input
parameters: a URL and a stopping parameter.
- The index or indexes should be designed to support the services
provided by the Web search and discovery service.
2. Web Search and Discovery Service
Write a program in Java or C++, called discover, that reads the indexes
created by first program and provides the following services to users:
Search options. Given one or more words as input:
- Find every page with all of the words.
- Find every page with at least one of the words.
- Find every page with all the words in a <title> field.
- Find every page with an exact phrase which is the same as the input.
Browse options. Given the URL of a page in the Web site:
- List the URL's of all pages that it links to.
- List the URL's of pages in the Web site that link to it.
Interface.
- The program should run from the command prompt.
- The program should loop until the user instructs it to stop, e.g.,
with special input message, ZZZ. The stopping command should be
specified in the ReadMe file.
- (Extra credit) Provide a Graphical User Interface and display the
output as formatted html.
Report
1. Provide a brief report that describes the algorithms and data
structures that you use (not
more than one page).
2. The graders will run your programs, but they will not go
into the source code and modify it. All files and any other
options must be specified as input. Write a short set of instructions
called ReadMe that tells the grader how to run your program, including how
to specify the input parameters.
Test
Test you program with two files. One should be the Information
Science web site, http://www.cis.cornell.edu/infoscience/index.htm.
The other should be a site that you choose to test the program.
Submission
You should submit the following:
- Programs, index and discover
- Test outputs
- Report
- ReadMe
As part of the grading, the graders will run your programs using the
instructions in your ReadMe file.
11/18/02
[CS 430 Home Page]
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 19, 2002
|