CS 430 Assignments
Reports should be carefully written and well formatted, using a good word processor. Use spelling and grammar checkers to remove errors.
All assignments are to be submitted online. When you have completed your work, submit it as follows: There is a shared network folder on nomad1.nomad.cornell.edu called CS430. Students can submit to this folder, but not read it.
To submit your homework, Map Network Drive to: \\nomad1.nomad.cornell.edu\cs430 and use your nomad account to log in. If you do not have a nomad account, please go to Upson 311.
You will find a folder named with the number of the assignment. Create a folder called xxx-1, where xxx is your NetID, and copy your folder into the assignment folder. (For example, if your NetID is abc123, create a folder called abc123-1.) If you want to revise your assignment, submit the entire assignment again in a new folder called xxx-2, etc. We will grade the one with the highest sequence number.
The aim of this assignment is to implement a simple information retrieval system for fielded searching.
The objective of the system is to search simple metadata records. A file of test data is provided. This data is the catalog records from one year of articles in D-Lib Magazine, which have been formatted as a simple tagged file. Your search engine does not need to handle data in any other format or any other metadata tags.
The user interface should allow a search on any of the fields that are tagged in the metadata records. Thus it should be possible to search for records in which the <creator> field contains the word "Moll". If the user does not specify a field, the system should search all fields. Boolean operators and wild cards need not be supported explicitly. If the query has several terms, after the removal of stop words, treat them as having an implied "and".
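The search behavior described above can be illustrated with a minimal sketch. It is not a model solution and makes several assumptions: records have already been parsed into dictionaries of field name to text, the stop-word list is a placeholder, and the function names (`tokenize`, `build_index`, `search`) and the sample records are invented for illustration.

```python
import re
from collections import defaultdict

# Placeholder stop-word list; the assignment does not prescribe one.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to"}

def tokenize(text):
    """Lowercase the text, split it into alphanumeric terms, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]

def build_index(records):
    """Build an inverted index mapping (field, term) to a set of record ids.

    The key (None, term) collects every record containing the term in
    any field, which supports searches where no field is specified.
    """
    index = defaultdict(set)
    for rec_id, record in enumerate(records):
        for field, text in record.items():
            for term in tokenize(text):
                index[(field, term)].add(rec_id)
                index[(None, term)].add(rec_id)
    return index

def search(index, query, field=None):
    """Return ids of records matching all query terms (implied "and")."""
    terms = tokenize(query)
    if not terms:
        return set()  # the query contained only stop words
    postings = [index.get((field, term), set()) for term in terms]
    return set.intersection(*postings)

# Invented sample records, assumed to be already parsed from the tagged file.
records = [
    {"creator": "Moll, Jane", "title": "Digital libraries and metadata"},
    {"creator": "Smith", "title": "The Moll report"},
]
index = build_index(records)
print(search(index, "Moll", field="creator"))
print(search(index, "Moll"))
```

In this sketch an unfielded query with several terms matches a record even when the terms occur in different fields of that record. Requiring all terms in the same field is an equally defensible reading of the specification.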
Submission
Submit the following six items as separate files that can be graded individually:
To submit your homework, Map Network Drive to: \\nomad1.nomad.cornell.edu\cs430 and use your nomad account to log in. You will find a folder named with the number of the assignment. Create a folder called xxx-1, where xxx is your NetID, and copy your folder containing your answers into the assignment folder. (For example, if your NetID is abc123, create a folder called abc123-1.) If you want to revise your assignment, submit the entire assignment again in a new folder called xxx-2, etc. We will grade the one with the highest sequence number.
Suppose that you have taken a new job with the company that operates the online news service http://www.cnn.com/. The company is setting up an archive of all the articles that have been published in the news service, including the associated services that cover business, sports, etc. You are set the task of designing an automatic text indexing system for this archive. Write a report with your design and recommendations.
The report should contain:
It is important that your report should be clearly written and well presented.
Submit your assignment according to the instructions at the beginning of this web page.
Onix is a metadata standard used by the book trade. It is defined at http://www.editeur.org/onixfiles1.2/onixfiles.html. Your assignment is based on Level 1, which is a kernel subset. (See the "Guidelines for Publishers, Level 1", linked from the Onix page.)
A metadata crosswalk is a mapping that takes metadata elements from one metadata scheme and replaces them with metadata elements in another scheme. Your task is to create a crosswalk between Onix, Level 1 and the Dublin Core metadata standard, using appropriate qualifiers. The Dublin Core metadata must be fully compliant with the latest Dublin Core specifications, which are at http://www.dublincore.org/. Use only qualifiers that are recommended on this site. (Note that this is a new web site.)
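To make the idea of a crosswalk concrete, the following sketch applies a small lookup table to one record. The four element pairs and the `date.issued` qualifier are illustrative assumptions only; they are not a complete or checked mapping, and every pair should be verified against the Level 1 guidelines and the Dublin Core site. The names `ONIX_TO_DC` and `crosswalk`, the dotted `element.qualifier` notation, and the sample record are invented for this example.

```python
# Illustrative pairs only: Onix element -> (Dublin Core element, qualifier).
# These are assumptions for the sketch, not a verified crosswalk.
ONIX_TO_DC = {
    "ISBN": ("identifier", None),
    "DistinctiveTitle": ("title", None),
    "PublisherName": ("publisher", None),
    "PublicationDate": ("date", "issued"),
}

def crosswalk(onix_record, mapping=ONIX_TO_DC):
    """Translate a flat Onix record into Dublin Core using the mapping.

    Returns the Dublin Core record and a list of the Onix elements that
    have no entry in the mapping, so that gaps stay visible.
    """
    dc_record = {}
    unmapped = []
    for element, value in onix_record.items():
        if element in mapping:
            dc_element, qualifier = mapping[element]
            key = f"{dc_element}.{qualifier}" if qualifier else dc_element
            # Dublin Core elements are repeatable, so collect values in lists.
            dc_record.setdefault(key, []).append(value)
        else:
            unmapped.append(element)
    return dc_record, unmapped

# Invented sample record; "SomeOtherElement" stands for anything unmapped.
sample = {
    "DistinctiveTitle": "An Example Book",
    "PublicationDate": "2001",
    "SomeOtherElement": "value",
}
dc, unmapped = crosswalk(sample)
print(dc)
print(unmapped)
```

Reporting unmapped elements separately is one way to document where the two schemes do not line up, which is often the most informative part of a crosswalk.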
For part #3, your program should be written in Java or C++.
Submit your assignment according to the instructions at the beginning of this web page.
[Questions have been asked about the choice of programming languages. After careful consideration, the original wording of the question remains. The program should be in Java or C++. I am sorry not to be more flexible. WYA 2/27/01]
The objective of this assignment is to explore a number of information retrieval systems and to compare them. Explore the following indexes and catalogs:
Google – a web search engine (http://www.google.com/).
Ask Jeeves – an alternative web search engine (http://www.ask.com/).
The Library of Congress catalog – a very large bibliographic catalog (http://catalog.loc.gov/).
Inspec using OCLC's First Search system – an indexing and abstracting service (access through the Cornell University Library gateway).
American Memory at the Library of Congress – a digital library of materials converted from physical artifacts (http://memory.loc.gov/).
Analyze each search service in three ways.
Hints. Some of these questions can be answered by reading information provided at the search site. Others may require detective work. Search the computer science literature to see if you can find out about the underlying search engines. There may be a technical paper that the creators have written or you may find a review article that compares search systems. Some information can be discovered by experiment. Some information is trade secret and you will not be able to answer all of the questions. Remember to cite your sources.
William Y. Arms
(wya@cs.cornell.edu)
Last changed: April 17, 2001