CS 430 / INFO 430
Information Retrieval
Fall 2004

Test Data


Stoplist

Use the following stop list for all assignments: stoplist.txt.

Test data for Assignments 1 and 2

The test collection for testing your programs consists of 20 files, which are stored in the testData directory. The files are news articles taken from the NASA web site. The average length is 600 terms.

The files are:

file01.txt
file02.txt
file03.txt
file04.txt
file05.txt
file06.txt
file07.txt
file08.txt
file09.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt

Test data for Assignment 3

The test collection for testing your programs is a set of documents (records) stored in a single file, each enclosed in a pair of <record> and </record> tags. The documents are short catalog records from D-Lib Magazine. They are stored in the file:

test3.txt
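
As a minimal sketch of how such a file might be split into documents, the Python below pulls out the text between each pair of tags. It assumes the tags appear literally as <record> and </record> and that records do not nest; the function name split_records and the sample string are illustrative only and are not taken from test3.txt.

```python
import re

# Assumption: records are delimited by literal <record> ... </record> tags
# and are never nested inside one another.
RECORD_RE = re.compile(r"<record>(.*?)</record>", re.DOTALL | re.IGNORECASE)


def split_records(text):
    """Return the body of each record, with surrounding whitespace removed."""
    return [body.strip() for body in RECORD_RE.findall(text)]


# Made-up sample input, not the contents of test3.txt.
sample = "<record>\nFirst catalog record\n</record>\n<record>Second record</record>\n"
records = split_records(sample)
print(len(records), records)
```

For the real file, the same function could be applied to the result of reading test3.txt into a single string.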

Test data for Assignment 4

The test data for this assignment is a file containing a list of URLs, one per line. Each URL identifies an HTML page on the Information Science web site. The pages have been numbered for convenience. The file is:

test4.txt

The file URLhints.html provides some hints about extracting hyperlinks from data such as this.
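
One possible approach to extracting hyperlinks, sketched here with the Python standard library, is to collect the href attribute of every anchor tag and resolve it against the URL of the page. This is not taken from URLhints.html; the class name LinkExtractor, the function extract_links, and the example.org URLs are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html_text):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links


# Made-up page content and URLs, for illustration only.
page = '<p><a href="courses.html">Courses</a> <a href="http://example.org/x">X</a></p>'
links = extract_links("http://example.org/dir/index.html", page)
print(links)
```

Resolving each link against the page URL makes relative and absolute links directly comparable with the URLs listed in the test file.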

Calculating tf.idf manually

To understand tf.idf and to be able to test the output of programs, such as the one for Assignment 1, it is useful to calculate a few sample values manually. The files AllFiles1.xls and DocumentFreq1.xls are Excel spreadsheets of the terms in the 20 test documents.

AllFiles1.xls

The file AllFiles1.xls has a column for the terms in each of the 20 files. Terms from the stop list and terms that do not begin with a letter have been removed. Otherwise no editing has been done. The terms are sorted in lexicographical order and each term is repeated as many times as it occurs in the file.

For each file, a second column gives a running total of the number of occurrences of each term. For example, if the term active appears twice, the first occurrence is labeled 1 and the second is labeled 2.

For each file, the following statistics are calculated:

Terms
The total number of terms in the file after removing terms from the stop list and terms that do not begin with a letter.
Max freq
The number of instances in the file of the most frequently occurring term.
Distinct terms
The number of distinct terms in the file.
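
The layout and statistics described above can be reproduced with a short Python sketch, which may help when comparing program output against the spreadsheet. It assumes the file has already been split into terms; the function names, the sample terms, and the two-word stop list are made up for illustration and do not come from stoplist.txt or the test files.

```python
from collections import Counter


def filter_terms(tokens, stoplist):
    """Drop stop-list terms and terms that do not begin with a letter."""
    return [t for t in tokens if t[:1].isalpha() and t not in stoplist]


def running_totals(terms):
    """Sort the terms and label each occurrence 1, 2, ... per term."""
    seen = Counter()
    rows = []
    for term in sorted(terms):
        seen[term] += 1
        rows.append((term, seen[term]))
    return rows


def file_statistics(terms):
    """Compute the three per-file statistics from the filtered terms."""
    counts = Counter(terms)
    return {
        "terms": len(terms),
        "max_freq": max(counts.values(), default=0),
        "distinct_terms": len(counts),
    }


# Made-up tokens and stop list, for illustration only.
tokens = ["the", "active", "shuttle", "is", "active", "2004", "launch"]
kept = filter_terms(tokens, stoplist={"the", "is"})
rows = running_totals(kept)
stats = file_statistics(kept)
print(rows)
print(stats)
```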

DocumentFreq1.xls

The file DocumentFreq1.xls has all terms from the 20 files merged into a single list. In this table, each row represents the occurrence of one term in one document. The columns are as follows:

1. File
The number of the file, from 1 to 20, in which the term occurs.
2. Term
The term.
3. Doc frequency
A running total of the number of different documents in which the term occurs. For example, if the term ability appears in four documents, the row for the first document is labeled 1, the row for the second is labeled 2, etc.
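
As an illustration of how these rows could be generated, the following Python sketch builds one row per (term, document) pair and numbers the documents for each term. The function name document_frequency_rows and the three small sample documents are hypothetical; the ordering by term and then by file number is an assumption about the spreadsheet layout.

```python
from collections import Counter


def document_frequency_rows(docs):
    """docs maps a file number to its list of terms.

    Returns (file, term, running document frequency) rows,
    ordered by term and then by file number (an assumed ordering).
    """
    # One pair per distinct term per document, however often it occurs there.
    pairs = sorted(
        (term, file_no) for file_no, terms in docs.items() for term in set(terms)
    )
    seen = Counter()
    rows = []
    for term, file_no in pairs:
        seen[term] += 1
        rows.append((file_no, term, seen[term]))
    return rows


# Made-up documents, for illustration only.
docs = {1: ["ability", "orbit", "ability"], 2: ["orbit"], 3: ["ability", "mars"]}
df_rows = document_frequency_rows(docs)
unique_terms = len({term for _, term, _ in df_rows})
print(df_rows)
print(unique_terms)
```

The last running value for a term is its document frequency, and the number of distinct terms in the rows gives the count of unique terms.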

The second row of the spreadsheet calculates the following statistic:

Unique terms
The total number of unique terms across all 20 files after removing terms from the stop list and terms that do not begin with a letter.
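
With a term's frequency and the max freq value from AllFiles1.xls, and its document frequency from DocumentFreq1.xls, a sample tf.idf value can be checked by hand or with a few lines of Python. This page does not state which weighting formula to use, so the sketch below assumes one common variant: term frequency normalized by the maximum frequency in the document, and inverse document frequency computed as log2(N / n) + 1. Substitute the formula specified for the assignment if it differs. The input numbers are invented, apart from the collection size of 20.

```python
import math


def tf_idf(freq, max_freq, num_docs, doc_freq):
    """Weight of one term in one document under the assumed formula.

    tf  = freq / max_freq                 (assumed normalization)
    idf = log2(num_docs / doc_freq) + 1   (assumed idf variant)
    """
    tf = freq / max_freq
    idf = math.log2(num_docs / doc_freq) + 1
    return tf * idf


# Invented example: a term occurring 2 times in a file whose most frequent
# term occurs 8 times, and appearing in 5 of the 20 documents.
weight = tf_idf(freq=2, max_freq=8, num_docs=20, doc_freq=5)
print(weight)
```

Here tf is 2/8 = 0.25 and idf is log2(20/5) + 1 = 3, so the weight under these assumptions is 0.75.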



William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 12, 2004