CS 430 / INFO 430
Information Retrieval
Fall 2006

Test Data


Stoplist

Use the following stop list for all assignments: stoplist.txt. It is stored in the test directory.

Test Data for Assignment 1

The test collection that you will use to test your programs consists of 20 files, which are stored in the test directory. The files are news articles taken from the NASA web site. The average length is 600 terms.

The files are:

file01.txt
file02.txt
file03.txt
file04.txt
file05.txt
file06.txt
file07.txt
file08.txt
file09.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt

Calculating tf.idf manually

To understand tf.idf and to be able to check the output of programs such as the one for Assignment 1, it is useful to calculate a few sample values manually. The files test/AllFiles1.xls and test/DocumentFreq1.xls are Excel spreadsheets of the terms in the 20 test documents.
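
As a concrete illustration, the short Python sketch below computes a single tf.idf weight from the statistics that the spreadsheets provide. The exact weighting formula used for Assignment 1 is not specified on this page; the sketch assumes one common variant, with term frequency normalized by the maximum frequency in the document and idf taken as log2(N / document frequency). The numbers in the example are invented.

from math import log2

# A minimal sketch of one common tf.idf variant, for checking a few values
# by hand against the spreadsheets. The exact weighting scheme required for
# Assignment 1 may differ; the numbers below are made-up illustrations.

N = 20  # number of documents in the test collection

def tf_idf(term_freq, max_freq, doc_freq, num_docs=N):
    """Normalized term frequency times inverse document frequency.

    term_freq -- occurrences of the term in the document
    max_freq  -- occurrences of the most frequent term in the document
                 (the "Max freq" statistic in AllFiles1.xls)
    doc_freq  -- number of different documents containing the term
                 (the final running total in DocumentFreq1.xls)
    """
    tf = term_freq / max_freq
    idf = log2(num_docs / doc_freq)
    return tf * idf

# Example with invented numbers: a term that occurs 2 times in a file whose
# most frequent term occurs 10 times, and that appears in 4 of the 20 files.
print(tf_idf(term_freq=2, max_freq=10, doc_freq=4))   # 0.2 * log2(5) = 0.464...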

test/AllFiles1.xls

The file test/AllFiles1.xls has a column for the terms in each of the 20 files. Terms from the stop list and terms that do not begin with a letter have been removed. Otherwise no editing has been done. The terms are sorted in lexicographical order and each term is repeated as many times as it occurs in the file.

For each file, a second column gives a running total of the number of occurrences of each term. For example, if the term active appears twice in the file, the first occurrence is labeled 1 and the second is labeled 2.

For each file, the following statistics are calculated:

  1. Terms
    The total number of terms in the file after removing terms from the stop list and terms that do not begin with a letter.
  2. Max freq
    The number of instances in the file of the most frequently occurring term.
  3. Distinct terms
    The number of distinct terms in the file.
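
As a rough sketch of how these statistics might be reproduced programmatically, the Python fragment below assumes that a file has already been split into lower-case tokens. The function names and the tokenization details are illustrative only; they are not the required Assignment 1 code.

from collections import Counter

def load_stoplist(path="test/stoplist.txt"):
    """Read the stop list (one set of whitespace-separated words)."""
    with open(path) as f:
        return set(f.read().split())

def file_statistics(tokens, stoplist):
    """Return (Terms, Max freq, Distinct terms) for one file."""
    # Keep only terms that are not stop words and that begin with a letter.
    terms = [t for t in tokens if t not in stoplist and t[:1].isalpha()]
    counts = Counter(terms)
    total_terms = len(terms)            # "Terms"
    max_freq = max(counts.values())     # "Max freq"
    distinct_terms = len(counts)        # "Distinct terms"
    return total_terms, max_freq, distinct_terms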

test/DocumentFreq1.xls

The file test/DocumentFreq1.xls has all terms from the 20 files merged into a single list. In this table, each row represents the occurrence of one term in one document. The columns are as follows:

  1. File
    The number of the file, from 1 to 20, in which the term occurs.
  2. Term
    The term.
  3. Doc frequency
    A running total of the number of different documents in which the term occurs. For example, if the term ability appears in four documents, the first is labeled 1, the second is labeled 2, etc.

The second row of the spreadsheet calculates the following statistic:

  1. Unique terms
    The total number of unique terms in the test data after removing terms from the stop list and terms that do not begin with a letter.
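
The following Python sketch shows one way to check the document frequencies and the Unique terms count, assuming each of the 20 files has already been reduced to its list of terms (stop words and terms that do not begin with a letter removed, as above). The function name and data layout are assumptions for illustration.

from collections import Counter

def document_frequencies(files_terms):
    """files_terms: a list of term lists, one per file (20 in the test data).

    Returns a Counter mapping each term to the number of different
    documents in which it occurs, i.e. the final value of the running
    total in DocumentFreq1.xls.
    """
    doc_freq = Counter()
    for terms in files_terms:
        doc_freq.update(set(terms))   # count each term at most once per document
    return doc_freq

# "Unique terms" is the number of distinct terms over all 20 files:
# len(document_frequencies(files_terms))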



William Y. Arms
(wya@cs.cornell.edu)
Last changed: September 4, 2006