Abstract

Benchmarking practices in information retrieval rely on measuring the per-topic performance of systems and aggregating these scores across the topics in a given test set. In an evaluation experiment, the per-topic scores form the entries of a matrix indexed by the participating systems and the set of topics. In the absence of explicit external reference points indicating the true performance of systems, such a matrix represents a relative view over a sample of the universe of possible system-topic pairs, in which a cyclical dependency exists between the systems and the topics. In this paper, we develop a unified model for system evaluation by systematically modeling the relationship between topics and systems and by generalizing the way overall system performance is obtained from the individual topic scores, using a generalized mean function with adaptive weights. We experiment with multiple definitions of the mean on TREC evaluation runs and compare the resulting rankings with those given by the standard TREC averages. Our analysis of the different evaluations leads to recommendations for calibrating evaluation experiments.