Course 574									Oren Kurland

“How Reliable are the Results of Large-Scale Information Retrieval Experiments?”
[Justin Zobel]
		
The goal of the paper is to test how reliable the performance
measurement techniques used in TREC are (and, by extension, those
used for large-scale IR systems in general) and how reliable the
pooling method is as a means of determining which documents are
relevant.

Estimating the validity of the precision measurements

1. The precision measurements in TREC are shown to be valid, so TREC
can be regarded as a reliable tool for differentiating between the
precision of different IR systems.  Moreover, the techniques used to
measure precision can be deployed safely (without fear of unfairness
or invalidity) by those who measure the performance of large-scale IR
systems.

2. The paper shows that as long as the difference between the
performance of two systems (retrieval techniques) is statistically
significant, the choice of measure is relatively unimportant. This
matters for the question of which measure to use when testing an IR
system. The result also underlines the importance of significance
testing (which is easy to implement) when evaluating the improvements
one makes to a specific IR system. Nevertheless, the result is only
partially established, since the author did not test the correlation
between the different measures.

3. Although the author shows that Wilcoxon’s test discriminates best
for 25 samples (compared with the t-test and ANOVA), the paper does
not mention that Wilcoxon’s n(s/r) parameter is probably greater than
10, so that in this case the Wilcoxon statistic is approximately
normally distributed, just as the t statistic is (with a sample size
of ~30). The choice of Wilcoxon’s test is therefore not obvious, and
the author should have compared the statistical tests over a varying
number of queries (10 or 20 for the non-normal case and more than 30
for the normal one).
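
The comparison asked for in point 3 can be tried on a small scale
with the sketch below. It computes the Wilcoxon signed-rank statistic
with its normal approximation next to the paired t statistic for the
same per-query scores. The function names and the score lists are
hypothetical, the scores are made-up illustrations rather than TREC
data, and the normal approximation omits tie and continuity
corrections.

```python
import math
from statistics import mean, stdev


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, n, z) for paired samples a and b.

    Zero differences are dropped and tied absolute differences share
    their average rank. z is the plain normal approximation (no tie or
    continuity correction), usually considered usable once n > 10.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, n, (w_plus - mu) / sigma


def paired_t(a, b):
    """Paired t statistic: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-query average precision for two systems, written in
# whole percentage points so that the differences are exact integers.
system_a = [42, 31, 55, 60, 27, 48, 39, 66, 52, 35, 44, 58]
system_b = [38, 33, 47, 51, 30, 41, 36, 54, 49, 29, 40, 50]

w_plus, n, z = wilcoxon_signed_rank(system_a, system_b)
t = paired_t(system_a, system_b)
print(f"W+ = {w_plus}, n = {n}, z = {z:.2f}, t = {t:.2f}")
```

Repeating the computation for query sets of different sizes (for
example 10, 20 and more than 30 queries) would show how the two
statistics behave as the sample grows.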

The Pooling method

1. An important result is that using a measurement depth larger than
the pool depth is unjustified, because it introduces a great deal of
uncertainty into the results. The TREC organizers should take this
into account.

2. The pool depth of 100 used in TREC is adequate for measuring
precision but not for measuring recall, since only 50%-70% of the
relevant documents in the corpus are found using the pooling
technique.

3. The suggested (incremental) pooling algorithm is important in that
it enables those who evaluate the performance of large-scale IR
systems both to economize on the manual effort of determining the
relevant documents (because the algorithm may converge at relatively
small depths) and to obtain more reliable results when determining
the set of relevant documents in the entire collection.
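
The pooling points above can be illustrated with a minimal sketch.
It builds a pool from the top-ranked documents of each run, measures
precision at a given depth while counting documents that were never
judged, and deepens the pool step by step. All names, document ids
and relevance data are hypothetical. In particular, the stopping rule
in `incremental_pool` (stop when a step finds too few new relevant
documents) is an assumption made for illustration and is not the
procedure from the paper.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents of every run (single query)."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at(ranking, k, judged_relevant, pool):
    """Precision at depth k, counting unjudged documents as non-relevant.

    Also returns how many of the top k documents lie outside the pool,
    i.e. were never shown to an assessor.
    """
    top = ranking[:k]
    relevant = sum(1 for doc in top if doc in judged_relevant)
    unjudged = sum(1 for doc in top if doc not in pool)
    return relevant / k, unjudged


def incremental_pool(runs, judge, step, min_new_relevant, max_depth):
    """Deepen the pool `step` ranks at a time (assumed stopping rule).

    Stops once a step yields fewer than `min_new_relevant` newly found
    relevant documents, or when `max_depth` is reached.
    """
    judged, relevant = set(), set()
    depth = 0
    while depth < max_depth:
        depth = min(depth + step, max_depth)
        new_docs = build_pool(runs, depth) - judged
        new_relevant = {doc for doc in new_docs if judge(doc)}
        judged |= new_docs
        relevant |= new_relevant
        if len(new_relevant) < min_new_relevant:
            break
    return depth, judged, relevant


# Hypothetical rankings of two systems for one query.
runs = {
    "A": ["d1", "d2", "d3", "d4"],
    "B": ["d2", "d5", "d6", "d1"],
}
truth = {"d1", "d3", "d5"}  # stands in for the human assessor

pool = build_pool(runs, depth=2)
judged_relevant = pool & truth

# Measuring inside the pool depth: every counted document was judged.
print(precision_at(runs["A"], 2, judged_relevant, pool))
# Measuring beyond the pool depth: some documents were never judged.
print(precision_at(runs["A"], 4, judged_relevant, pool))
```

In this toy data, d3 is relevant but falls outside the depth-2 pool,
so measuring system A at depth 4 counts it as non-relevant. This is
the kind of uncertainty described in point 1.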