Information retrieval effectiveness is usually evaluated using measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP) and Precision at some cutoff (Precision@k) on a set of judged queries. Recent research has suggested an alternative, evaluating information retrieval systems based on user behavior. Particularly promising are experiments that interleave two rankings and track user clicks. According to a recent study, interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods.
We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, and this is about as reliable as interleaving with 50,000 user impressions. Amongst the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present some new forms of analysis, including an approach to enhance interleaving sensitivity.