Course 574
Oren Kurland
“How Reliable are the Results of Large-Scale Information Retrieval Experiments?” [Justin Zobel]
The goal of the paper is to test how reliable the performance measurement techniques used in TREC are (and to generalize the findings to large-scale IR systems), and how reliable the pooling method is as a means of determining which documents are relevant.
Estimating the validity of the precision measurements
1. The precision measurements in TREC are shown to be valid, so TREC can be regarded as a reliable tool for differentiating between the precision of different IR systems. Moreover, the techniques used to measure precision can be deployed safely (without fear of unfairness or invalidity) by those who measure the performance of large-scale IR systems.
2. The paper shows that as long as the difference between the systems’ (retrieval techniques’) performance is statistically significant, the choice of measure is relatively unimportant. This bears directly on the question of which measure to use when testing an IR system. It also underlines the importance of significance testing (which is easy to implement) when evaluating improvements made to a specific IR system. Nevertheless, the result is only partially established, since the author did not test the correlation between the different measures.
3. Although the author shows that Wilcoxon’s test discriminates best for 25 samples (compared with the t-test and ANOVA), the paper does not mention that Wilcoxon’s n(s/r) parameter is probably greater than 10, in which case the Wilcoxon statistic is approximately normally distributed, as the t statistic is (with a sample size of ~30). The choice of Wilcoxon’s test is therefore not obvious, and the author should have compared the statistical tests over a varying number of queries (10 and 20 for the non-normal case, more than 30 for the normal one).
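To make the significance-testing point concrete, here is a minimal sketch of a Wilcoxon signed-rank test on paired per-query scores from two systems. It is not the paper's procedure: the function name and the data layout are assumptions made for illustration, and the p-value uses the plain normal approximation (no tie or continuity correction), which is only reasonable when the number of non-zero differences is fairly large.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (w_plus, z, p_two_sided) for paired per-query scores.

    Hypothetical helper for illustration. Zero differences are dropped,
    tied absolute differences share their average rank, and the p-value
    comes from the normal approximation without corrections.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of the ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Mean and standard deviation of W+ under the null hypothesis.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_two_sided
```

Calling `wilcoxon_signed_rank(ap_system_a, ap_system_b)` with two equally long lists of per-query scores (hypothetical names) gives the statistic and an approximate two-sided p-value. Repeating the call on query subsets of different sizes is one way to run the comparison over a varying number of queries that point 3 asks for.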
The Pooling method
1. An important result is that using a measurement depth larger than the pool depth is unjustified, because it introduces a great deal of uncertainty into the results; the TREC organizers should take this into account.
2. The pool depth of 100 used in TREC is adequate for measuring precision but not recall, since only 50%-70% of the relevant documents in the corpus are found by the pooling technique.
3. The suggested (incremental) pooling algorithm is important because it would let those who evaluate large-scale IR systems both reduce the manual effort of judging relevance (the algorithm may converge at relatively small depths) and obtain more reliable results when determining the set of relevant documents in the entire collection.
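The first pooling point can be illustrated with a small sketch of standard fixed-depth pooling for a single query. This is not the paper's incremental algorithm, and the function names and the dictionary-of-rankings layout are assumptions for illustration. It shows why measuring deeper than the pool depth is risky: documents below the pool depth may never have been judged, yet they are counted as non-relevant.

```python
def build_pool(runs, depth):
    """Union of the top-`depth` documents of every run for one query.

    `runs` maps a (hypothetical) system name to its ranked list of
    document ids. Only documents in the pool are judged by assessors.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at(ranking, judged_relevant, k):
    """Precision at k, counting unjudged documents as non-relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in judged_relevant) / k


def unjudged_fraction(ranking, pool, k):
    """Share of the top-k documents that were never judged."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc not in pool) / len(top)
```

With a pool built at depth d, `unjudged_fraction(ranking, pool, k)` is zero for any pooled run when k is at most d and can become positive once k exceeds d. That positive share is the uncertainty the first point refers to.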