Abstract

Although user simulations are increasingly employed in the development and assessment of spoken dialog systems, there is no accepted method for evaluating user simulations. In this paper, we propose a novel quality measure for user simulations. We view a user simulation as a predictor of the performance of a dialog system, where per-dialog performance is measured with a domain-specific scoring function. The quality of the user simulation is measured as the divergence between the distributions of scores in real and simulated dialogs, and we argue that the Cramér-von Mises divergence is well-suited to this task. The technique is demonstrated on a corpus of real calls, and we present a table of critical values for practitioners to interpret the statistical significance of comparisons between user simulations.
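
To make the core idea concrete, the following is a minimal sketch (not the paper's exact formulation) of comparing the empirical distributions of per-dialog scores from real and simulated dialogs with a Cramér-von Mises style statistic. The function names, the normalization of the divergence, and the synthetic scores are assumptions for illustration only; the paper's table of critical values should be used for significance judgments.

    import numpy as np

    def empirical_cdf(sample, grid):
        """Fraction of observations in `sample` that are <= each point in `grid`."""
        sample = np.sort(np.asarray(sample, dtype=float))
        return np.searchsorted(sample, grid, side="right") / sample.size

    def cramer_von_mises_divergence(real_scores, sim_scores):
        """Cramér-von Mises style divergence between two empirical score distributions.

        Averages the squared difference between the two empirical CDFs over the
        pooled score values; smaller values mean the user simulation reproduces
        the distribution of per-dialog performance scores more faithfully.
        (Illustrative normalization, not necessarily the paper's.)
        """
        grid = np.sort(np.concatenate([real_scores, sim_scores]))
        f_real = empirical_cdf(real_scores, grid)
        f_sim = empirical_cdf(sim_scores, grid)
        return float(np.mean((f_real - f_sim) ** 2))

    # Hypothetical per-dialog scores from real calls and two candidate simulations
    rng = np.random.default_rng(0)
    real = rng.normal(10.0, 3.0, size=200)
    sim_a = rng.normal(10.5, 3.0, size=500)   # closer to the real score distribution
    sim_b = rng.normal(6.0, 3.0, size=500)    # systematically pessimistic scores
    print(cramer_von_mises_divergence(real, sim_a))  # smaller divergence
    print(cramer_von_mises_divergence(real, sim_b))  # larger divergence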