On Bayesian models and event spaces in information retrieval

  • Stephen Robertson

Presented at SIGIR 2002 Workshop on Mathematical/Formal Methods in Information Retrieval; revised version in Information Retrieval

There have been several recent attempts to reconcile, or at least to understand the relationship between, traditional probabilistic models of information retrieval and the newer language models. Since both treat the retrieval problem probabilistically, it might be expected that they can be formulated in comparable terms. However, this has proved difficult. One question concerns the role of relevance, which takes a central position in some traditional models (such as Robertson and Sparck Jones [1976], referred to as RSJ), but does not appear explicitly in at least the early language models (e.g. Ponte and Croft [1998]). The present author and others [Sparck Jones et al. 2002] have recently claimed that the early language models assume that there is only one relevant document per query. This claim is based on the observation that language models ask the question of each document: What is the probability that this document, or rather the model which generated this document, also generated the query? Since each document is taken to have its own language model, if it turns out that a particular document is relevant (that is, its model did indeed generate the query), it would seem that no other model could have done so. In a recent paper, on the other hand, Lafferty and Zhai [2002] develop a basic probabilistic model from which they derive both the RSJ model and the simple language model. They claim in conclusion that (a) RSJ and the simple language model are equivalent; and (b) the language model requires no assumption that there is only one relevant document per query. The present paper discusses an issue underlying all probabilistic models, that of the event space assumed, and draws in part from a pair of old papers [Robertson et al. 1982; Robertson et al. 1983].
I discuss possible views of the event space in the case of documents, queries and relevance judgements, and come to some different conclusions about the relationship between RSJ and the simple language models. However, in order to illustrate the event space issues, the paper first introduces an example quite different from the IR one, with different structural characteristics.