# Midterm Examination

Wednesday March 7, 2001

7:30 to 9:00 p.m.

## Instructions

2) Write your answers in an examination book. WRITE YOUR NETID ON THE FRONT OF EACH BOOK.

3) This is an open-book examination.

## Question 1

(a)  Define the terms inverted file, inverted list, posting.

(b)  When implementing an inverted file system, what are the criteria that you would use to judge whether the system is suitable for very large-scale information retrieval?

(c)  You are designing an inverted file system to be used with Boolean queries on a very large collection of textual documents.  New documents are being continually added to the collection.

(i)  What file structure(s) would you use?

(ii)  How well does your design satisfy the criteria listed in Part (b)?
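
For readers who want a concrete picture of the terms in Part (a), here is a minimal in-memory sketch in Python. It is a toy illustration, not a design that answers Part (c): the class name `InvertedFile`, its methods, and the sample documents are all hypothetical, and a large-scale system would keep its inverted lists on disk.

```python
from collections import defaultdict


class InvertedFile:
    """Toy inverted file: maps each term to its inverted list of postings."""

    def __init__(self):
        # term -> inverted list; each posting is a (doc_id, term_frequency) pair
        self.lists = defaultdict(list)

    def add_document(self, doc_id, text):
        """Index one document by appending a posting to each term's list."""
        frequencies = {}
        for term in text.lower().split():
            frequencies[term] = frequencies.get(term, 0) + 1
        for term, freq in frequencies.items():
            self.lists[term].append((doc_id, freq))

    def docs(self, term):
        """Return the set of document ids whose postings mention the term."""
        return {doc_id for doc_id, _ in self.lists.get(term, [])}

    def query_and(self, *terms):
        """Boolean AND: documents that contain every one of the given terms."""
        sets = [self.docs(term) for term in terms]
        return set.intersection(*sets) if sets else set()


# Hypothetical sample data, added one document at a time.
index = InvertedFile()
index.add_document(1, "alpha bravo golf")
index.add_document(2, "golf golf alpha")
index.add_document(3, "bravo charlie")
both = index.query_and("alpha", "golf")
```

Appending postings as documents arrive keeps additions cheap in this toy version; whether that property survives on disk at very large scale is exactly what Parts (b) and (c) ask you to assess.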

## Question 2

(a)  Explain how vector space concepts can be used to calculate the similarity between two documents.

(b)  You have a collection of documents that contain the following index terms:

D1:  alpha bravo charlie delta echo foxtrot golf

D2:  golf golf golf delta alpha

D3:  bravo charlie bravo echo foxtrot bravo

D4:  foxtrot alpha alpha golf golf delta

(i)  Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.

(ii)  Use a frequency matrix of terms to calculate a similarity matrix for these documents, with weights proportional to the term frequency and inversely proportional to the document frequency.
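
The computation described in Part (b) can be sketched in Python as follows. This is one possible reading rather than a model answer. The question does not fix the similarity measure or the exact weighting formula, so the cosine measure, the weight `tf * log(N / df)` with a natural logarithm, and the helper names `cosine` and `similarity_matrix` are all assumptions.

```python
import math
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

names = list(docs)
counts = {name: Counter(text.split()) for name, text in docs.items()}
terms = sorted({term for c in counts.values() for term in c})

# Document frequency: the number of documents that contain each term.
df = {term: sum(1 for c in counts.values() if term in c) for term in terms}
n_docs = len(docs)


def cosine(u, v):
    """Cosine of the angle between two term vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise similarities, with rows and columns in the order of `names`."""
    return [[cosine(vectors[a], vectors[b]) for b in names] for a in names]


# (i) Incidence matrix: 1 if the term occurs in the document, else 0.
incidence = {d: [1 if t in counts[d] else 0 for t in terms] for d in names}

# (ii) Frequency matrix weighted by tf * log(N / df) (assumed formula).
weighted = {
    d: [counts[d][t] * math.log(n_docs / df[t]) for t in terms] for d in names
}

incidence_sim = similarity_matrix(incidence)
weighted_sim = similarity_matrix(weighted)
```

Under either weighting, D2 and D3 share no terms, so their similarity is zero. Other weighting schemes that meet the wording of the question would give different off-diagonal values.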

## Question 3

(a)  Define the terms recall and precision.

(b)  Q is a query.  D is a collection of 1,000,000 documents.  When the query Q is run, a set of 200 documents is returned.

(i)   In a practical experiment, how would you calculate the precision?

(ii)  In a practical experiment, how would you calculate the recall?

(c)  Suppose that, by some means, it is known that 100 of the documents in D are relevant to Q.  Of the 200 documents returned by the search, 50 are relevant.

(i)   What is the precision?

(ii)  What is the recall?
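
The arithmetic in Part (c) follows from the standard set-based definitions, which can be written out as a short Python sketch. The function and variable names are hypothetical, and the sketch assumes the usual definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved, relevant_total):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / relevant_total


# Figures taken from Parts (b) and (c) of the question.
retrieved = 200           # documents returned for query Q
relevant_total = 100      # documents in D known to be relevant to Q
relevant_retrieved = 50   # returned documents that are relevant

p = precision(relevant_retrieved, retrieved)
r = recall(relevant_retrieved, relevant_total)
```

Note that the collection size of 1,000,000 does not enter either formula.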

(d)  Explain in general terms the method used by TREC to estimate the recall.

## Question 4

Here is a Dublin Core metadata record:

| Element | Value |
| --- | --- |
| Title | Gore/Lieberman 2000 |
| Title.alternative | Welcome to the Gore-Lieberman 2000 official campaign Web site |
| Title.alternative | Gore 200 |
| Title.alternative | Viva Gore Lieberman 2000 |
| Identifier.LCCN | 00530047 |
| Identifier.URI | http://www.algore2000.com/ |
| Type.OCLCg | Computer file |
| Type.AACR2g-gmd | [computer file] |
| Contributor.nameCorporate | Gore/Lieberman, Inc. |
| Coverage.spatial.MARC21-gac | n-us--- |
| Date.issued.MARC21-Date | 2000-9999 |
| Description.summary | Presents information on U.S. Vice President Albert Arnold Gore, Jr. (b. 1948) and his presidential campaign, provided by Gore 2000, Inc. |
| Language.ISO639-2 | eng |
| Language.ISO639-2 | engspa |
| Language | In English and Spanish |
| Publisher | Gore/Lieberman, |
| Publisher.place | Nashville, Tenn. : |
| Relation.requires | Mode of access: World Wide Web |
| Subject.class.LCC | E840.8.G65 |
| Subject.class.DDC | 324.973 |
| Subject.namePersonal.LCSH | Gore, Albert, • 1948- |
| Subject.topical.LCSH | Vice-Presidents • United States • Biography. |
| Subject.topical.LCSH | Presidential candidates • United States • Biography. |
| Subject.topical.LCSH | Presidents • United States • Election • 2000. |
| Subject.topical.LCSH | Political campaigns • United States. |

(a)  What is the Dublin Core principle of dumbing-down?  Are there any fields in this record that do not satisfy the principle?

(b)  The metadata in the Publisher and Publisher.place fields ends in punctuation marks.  Can you suggest any reasons for this?

(c)  This record has no Creator field.  It has a Contributor.nameCorporate field with the value "Gore/Lieberman, Inc."  Do you consider this a correct use of Dublin Core?  What would you put in the Creator and Contributor fields?  Why?