This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context—a common scenario in web search, ads and recommender systems. We develop the first practical technique for evaluating page-level metrics of such policies offline using logged past data, alleviating the need for online A/B tests. Our method models the observed quality of the recommended set (e.g., time to success in web search) as an additive decomposition across items. Crucially, the per-item quality is not directly observed or easily modeled from the item’s features. A thorough empirical evaluation reveals that this model fits many realistic measures of quality and theoretical analysis shows exponential savings in the amount of required data compared with prior off-policy evaluation approaches.