CS 430: Information Discovery: Assignments

CS 430
Information Discovery
Spring 2001

Assignments

Submission Instructions (Revised February 23, 2001)

Reports should be carefully written and well formatted, using a good word processor. Use spelling and grammar checkers to remove errors.

All assignments are to be submitted online. When you have completed your work, submit it as follows. There is a shared network folder on nomad1.nomad.cornell.edu called CS430. Students can submit to this folder, but cannot read it.

To submit your homework, Map Network Drive to: \\nomad1.nomad.cornell.edu\cs430 and use your nomad account to log in. If you do not have a nomad account, please go to Upson 311.

You will find a folder named with the number of the assignment. Create a folder called xxx-1, where xxx is your NetID, and copy your folder, containing your answers, into the assignment folder. (For example, if your NetID is abc123, create a folder called abc123-1.) If you want to revise your assignment, submit the entire assignment again in a new folder called xxx-2, etc. We will grade the one with the highest sequence number.

Assignment 4, A Fielded Search Engine (Extra Information added April 23)
Due Friday, April 27, at 5:00 p.m.

The aim of this assignment is to implement a simple information retrieval system for fielded searching.

The objective of the system is to search simple metadata records. A file of test data is provided. This data is the catalog records from one year of articles in D-Lib Magazine, which have been formatted as a simple tagged file. Your search engine does not need to handle data in any other format or any other metadata tags.

The user interface should allow a search on any of the fields that are tagged in the metadata records. Thus it should be possible to search for records in which the <creator> field contains the word "Moll". If the user does not specify a field, the system should search all fields. Boolean operators and wild cards need not be supported explicitly. If the query has several terms, after the removal of stop words, treat them as having an implied "and".
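
As an illustration only, the query rules above might be sketched as follows. The "field:term" query syntax, the stop list, the function names and the sample record are all assumptions made for this example; the assignment does not prescribe any of them.

```python
# Hypothetical stop list; choose your own for the assignment.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def parse_query(query):
    """Split a query into (field, term) pairs.

    A token written as "field:term" (an assumed syntax) is restricted to
    that field; a bare token gets field None, meaning "search all fields".
    Stop words are dropped.
    """
    pairs = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        if not sep:  # no "field:" prefix, so the whole token is the term
            field, term = None, token
        if term and term not in STOP_WORDS:
            pairs.append((field, term))
    return pairs


def matches(record, pairs):
    """Implied "and": every (field, term) pair must match the record."""
    for field, term in pairs:
        fields = [field] if field else list(record)
        if not any(term in record.get(f, "").lower().split() for f in fields):
            return False
    return True


# A made-up record, not taken from the test data.
SAMPLE_RECORD = {
    "creator": "Moll",
    "title": "The Future of Digital Libraries",
}
```

Here the query "creator:Moll the libraries" reduces to two terms after the stop word is removed, and a record is returned only if both terms match.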

Write two programs. The first reads the test file and indexes the test data. The second allows a user to search the indexes and retrieve the URL of the records that are found.
Provide a brief description of the algorithms that you use. The algorithms need not be complex. For example, in building the index, you might use a binary tree for the inverted file, a very simple stop list and no stemming. Retrieval could be either Boolean or a simple ranked list.
Run your program with about five queries.
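
The two programs described above can be sketched, under stated assumptions, as an index builder and a Boolean search. This sketch uses a Python dictionary of sets for the inverted file rather than the binary tree mentioned as an example, keeps both programs in one file for brevity, and assumes that each record has already been parsed into a dictionary with a "url" entry. The record layout, names and sample data are hypothetical; the format of the real test file is not shown here.

```python
import re
from collections import defaultdict

# Hypothetical stop list; no stemming is applied.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}

# Made-up records standing in for the parsed test data.
SAMPLE_RECORDS = [
    {"url": "http://example.org/a", "creator": "Moll",
     "title": "Metadata for digital libraries"},
    {"url": "http://example.org/b", "creator": "Smith",
     "title": "Moll on metadata"},
]


def build_index(records):
    """Build an inverted file mapping (field, term) to a set of record URLs.

    Each term is also stored under the field None, which stands for
    "any field" and serves queries that do not name a field.
    """
    index = defaultdict(set)
    for rec in records:
        for field, text in rec.items():
            if field == "url":
                continue  # the URL is what we return, not what we index
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                if term not in STOP_WORDS:
                    index[(field, term)].add(rec["url"])
                    index[(None, term)].add(rec["url"])
    return index


def search(index, pairs):
    """Boolean retrieval with an implied "and": intersect the posting sets."""
    postings = [index.get(pair, set()) for pair in pairs]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

With the sample records, search(build_index(SAMPLE_RECORDS), [("creator", "moll")]) returns only the first URL, because the second record mentions "Moll" in its title and not in its creator field.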

Submission

Submit the following six items as separate files that can be graded individually:

Source code listing of the first program that reads the test file and indexes the data.
Executable version of the first program.
Source code listing of the second program that allows a user to search the indexes and retrieve the URL of the records found.
Executable version of the second program.
A short (less than one page) description of the algorithms used.
A file containing the output when your second program is run with the test queries.

To submit your homework, Map Network Drive to: \\nomad1.nomad.cornell.edu\cs430 and use your nomad account to log in. You will find a folder named with the number of the assignment. Create a folder called xxx-1, where xxx is your NetID, and copy your folder containing your answers into the assignment folder. (For example, if your NetID is abc123, create a folder called abc123-1.) If you want to revise your assignment, submit the entire assignment again in a new folder called xxx-2, etc. We will grade the one with the highest sequence number.

Assignment 3, Design Study on Automatic Indexing
Due Friday, April 6, at 5:00 p.m.

Suppose that you have taken a new job with the company that operates the online news service http://www.cnn.com/. The company is setting up an archive of all the articles that have been published in the news service, including the associated services that cover business, sports, etc. You are set the task of designing an automatic text indexing system for this archive. Write a report with your design and recommendations.

The report should contain:

An estimate of the size of the document collection to be indexed.
A description of any classes of material that will not be indexed by your system (e.g., images).
Your assumptions about the user community and recommendations for the user interface, including wild cards and Boolean operators (if any).
Recommendations for the methods that will be used to divide the documents into tokens, including the algorithms used for stop words, stemming and term weighting (if any).
The file structures that will be used.
Procedures for creating and maintaining the indexes.
The procedures that will be used to process queries.
The ranking algorithms that will be used and the order in which the results will be returned to the users.
Hardware requirements.
Any other important design considerations.

It is important that your report should be clearly written and well presented.

Submit your assignment according to the instructions at the beginning of this web page.

Assignment 2, Metadata Crosswalk
Due Friday, March 2, at 5:00 p.m.

Onix is a metadata standard used by the book trade. It is defined on http://www.editeur.org/onixfiles1.2/onixfiles.html. Your assignment is based on Level 1, which is a kernel subset. (See the "Guidelines for Publishers, Level 1", linked from the Onix page.)

A metadata crosswalk is a mapping that takes metadata elements from one metadata scheme and replaces them with metadata elements in another scheme. Your task is to create a crosswalk between Onix, Level 1 and the Dublin Core metadata standard, using appropriate qualifiers. The Dublin Core metadata must be fully compliant with the latest Dublin Core specifications, which are at http://www.dublincore.org/. Use only qualifiers that are recommended on this site. (Note that this is a new web site.)
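
To make the idea of a crosswalk concrete, here is a minimal, language-neutral sketch written in Python for brevity (the assignment itself requires Java or C++). The source element names are placeholders invented for this example and are not taken from the Onix specification; only the Dublin Core element names title, creator and publisher are real. Choosing the actual mapping and qualifiers is the substance of the assignment.

```python
# Placeholder source names (NOT real Onix elements) mapped to Dublin Core.
CROSSWALK = {
    "SourceTitle": "dc:title",
    "SourceAuthor": "dc:creator",
    "SourcePublisher": "dc:publisher",
}


def apply_crosswalk(record, crosswalk):
    """Translate a record given as a list of (element, value) pairs.

    Returns (mapped, rejected): the translated pairs, and the names of
    source elements that have no equivalent in the target scheme.
    """
    mapped, rejected = [], []
    for element, value in record:
        target = crosswalk.get(element)
        if target is None:
            rejected.append(element)
        else:
            mapped.append((target, value))
    return mapped, rejected
```

A record is kept as a list of pairs rather than a dictionary so that repeated elements, such as several authors, survive the translation.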

Identify the elements of Onix, Level 1 for which there are equivalent elements in Dublin Core. State your reasons for rejecting the other elements.
For those elements that you have identified, specify a crosswalk from Onix, Level 1 to qualified Dublin Core. For each Onix element, specify precisely the rules that determine what Dublin Core element it maps into and the corresponding syntax.
The file ass2-test.txt contains five Onix, Level 1 records (with XML tags). Write a program that takes an Onix file in this format, converts the metadata elements in Group b to their Dublin Core equivalents, and prints them out.

For part #3, your program should be written in Java or C++.

Submit your assignment according to the instructions at the beginning of this web page.

[Questions have been asked about the choice of programming languages. After careful consideration, the original wording of the question remains. The program should be in Java or C++. I am sorry not to be more flexible. WYA 2/27/01]

Assignment 1, Market Research
Due Friday, February 9, at 5:00 p.m.

The objective of this assignment is to explore a number of information retrieval systems and to compare them. Explore the following indexes and catalogs:

Google – a web search engine (http://www.google.com/).
Ask Jeeves – an alternative web search engine (http://www.ask.com/).
The Library of Congress catalog – a very large bibliographic catalog (http://catalog.loc.gov/).
Inspec using OCLC's FirstSearch system – an indexing and abstracting service (access through the Cornell University Library gateway).
American Memory at the Library of Congress – a digital library of materials converted from physical artifacts (http://memory.loc.gov/).

Analyze each search service in three ways.

Technical: Describe the services offered from a technical viewpoint. Does the service search full text or surrogates? Are fielded searches offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted and, if so, in what order?
User interface: What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?
Experience: Experiment with a number of searches on each system. Try some long queries and some short ones. Can you find a search that is very slow? Or one that fails? Try searches with simple and complex syntax. How effective is each service? Do you find it easy to use? What do you consider its strengths and its weaknesses?

Hints. Some of these questions can be answered by reading information provided at the search site. Others may require detective work. Search the computer science literature to see if you can find out about the underlying search engines. There may be a technical paper that the creators have written, or you may find a review article that compares search systems. Some information can be discovered by experiment. Some information is a trade secret, so you will not be able to answer all of the questions. Remember to cite your sources.

William Y. Arms

(wya@cs.cornell.edu)
Last changed: April 17, 2001
