CS 430 Information Discovery

Midterm Examination

Wednesday, March 7, 2001

7:30 to 9:00 p.m.

Instructions

1) Answer all questions.

2) Write your answers in an examination book. WRITE YOUR NETID ON THE FRONT OF EACH BOOK.

3) This is an open book examination.

Question 1

(a) Define the terms inverted file, inverted list, posting.

(b) When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?

(c) You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents. New documents are being continually added to the collection.

(i) What file structure(s) would you use?

(ii) How well does your design satisfy the criteria listed in Part (b)?
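
For orientation, the terms in Part (a) can be illustrated with a minimal in-memory sketch. It is not a model answer and not a design for Part (c). The toy documents, the function names build_inverted_file and boolean_and, and the choice to store a term frequency in each posting are all assumptions made for illustration. A large-scale system would keep the inverted lists on disk and would need an update strategy, which this sketch ignores.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to its inverted list: postings (doc_id, term_frequency) sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def boolean_and(index, term_a, term_b):
    """Evaluate 'term_a AND term_b' by merging two sorted inverted lists."""
    a = [doc_id for doc_id, _ in index.get(term_a, [])]
    b = [doc_id for doc_id, _ in index.get(term_b, [])]
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical toy collection, not taken from the examination.
toy_docs = {1: "alpha bravo alpha", 2: "bravo charlie", 3: "alpha charlie bravo"}
index = build_inverted_file(toy_docs)

print(index["alpha"])                        # [(1, 2), (3, 1)]
print(boolean_and(index, "alpha", "bravo"))  # [1, 3]
```

Keeping each inverted list sorted by document identifier is what lets the Boolean AND run as a single linear merge.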

Question 2

(a) Explain how vector space concepts can be used to calculate the similarity between two documents.

(b) You have a collection of documents that contain the following index terms:

D₁: alpha bravo charlie delta echo foxtrot golf

D₂: golf golf golf delta alpha

D₃: bravo charlie bravo echo foxtrot bravo

D₄: foxtrot alpha alpha golf golf delta

(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

(ii) Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
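
The calculation in Part (b) can be sketched as follows, under two assumptions that the question does not fix: similarity is measured as the cosine of the angle between document vectors, and the weight of a term in a document is its term frequency divided by its document frequency. Other weightings, for example one using a logarithm of the inverse document frequency, would give different values. The helper names are hypothetical.

```python
import math

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}
# Alphabetical term list: alpha, bravo, charlie, delta, echo, foxtrot, golf.
TERMS = sorted({term for text in DOCS.values() for term in text.split()})

def frequency_matrix(docs, terms):
    """One row per document, one column per term, entries are term counts."""
    return {name: [text.split().count(t) for t in terms] for name, text in docs.items()}

def cosine(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities, rows and columns in document order."""
    names = list(vectors)
    return [[cosine(vectors[a], vectors[b]) for b in names] for a in names]

freq = frequency_matrix(DOCS, TERMS)

# Part (i): incidence matrix, 1 if the term occurs in the document, else 0.
incidence = {name: [1 if f else 0 for f in row] for name, row in freq.items()}

# Part (ii): assumed weighting tf / df, where df counts documents containing the term.
doc_freq = [sum(row[k] for row in incidence.values()) for k in range(len(TERMS))]
weighted = {name: [f / df for f, df in zip(row, doc_freq)] for name, row in freq.items()}

sim_unweighted = similarity_matrix(incidence)
sim_weighted = similarity_matrix(weighted)

for label, matrix in (("unweighted", sim_unweighted), ("weighted", sim_weighted)):
    print(label)
    for row in matrix:
        print("  ".join(f"{value:.3f}" for value in row))
```

Under these assumptions D₂ and D₃ share no terms, so their similarity is zero in both matrices. D₃ and D₄ share only foxtrot, which gives an unweighted similarity of 0.25.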

Question 3

(a) Define the terms recall and precision.

(b) Q is a query. D is a collection of 1,000,000 documents. When the query Q is run, a set of 200 documents is returned.

(i) In a practical experiment, how would you calculate the precision?

(ii) In a practical experiment, how would you calculate the recall?

(c) Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q. Of the 200 documents returned by the search, 50 are relevant.

(i) What is the precision?

(ii) What is the recall?

(d) Explain in general terms the method used by TREC to estimate the recall.
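
The arithmetic in Part (c) can be checked with a short sketch that uses the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name precision_recall and its parameter names are hypothetical.

```python
def precision_recall(retrieved, relevant_retrieved, relevant_total):
    """Return (precision, recall) from raw counts."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant_total
    return precision, recall

# Figures from Part (c): 200 returned, 50 of them relevant, 100 relevant in D.
precision, recall = precision_recall(retrieved=200, relevant_retrieved=50, relevant_total=100)
print(precision, recall)  # 0.25 0.5
```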

Question 4

Here is a Dublin Core metadata record:

Title Gore/Lieberman 2000

Title.alternative Welcome to the Gore-Lieberman 2000 official campaign Web site

Title.alternative Gore 200

Title.alternative Viva Gore Lieberman 2000

Identifier.LCCN 00530047

Identifier.URI http://www.algore2000.com/

Type.OCLCg Computer file

Type.AACR2g-gmd [computer file]

Contributor.nameCorporate Gore/Lieberman, Inc.

Coverage.spatial.MARC21-gac n-us---

Date.issued.MARC21-Date 2000-9999

Description.note Title from home page as viewed on Nov. 1, 2000.

Description.summary Presents information on U.S. Vice President Albert Arnold Gore, Jr. (b. 1948) and his presidential campaign, provided by Gore 2000, Inc.

Language.ISO639-2 eng

Language.ISO639-2 engspa

Language In English and Spanish

Publisher Gore/Lieberman,

Publisher.place Nashville, Tenn. :

Relation.requires Mode of access: World Wide Web

Subject.class.LCC E840.8.G65

Subject.class.DDC 324.973

Subject.namePersonal.LCSH Gore, Albert, • 1948-

Subject.topical.LCSH Vice-Presidents • United States • Biography.

Subject.topical.LCSH Presidential candidates • United States • Biography.

Subject.topical.LCSH Presidents • United States • Election • 2000.

Subject.topical.LCSH Political campaigns • United States.

(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?

(b) The metadata values in the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for this?

(c) This record has no Creator field. It has a Contributor.nameCorporate field with the value "Gore/Lieberman, Inc." Do you consider this a correct use of Dublin Core? What would you put in the Creator and Contributor fields? Why?