Whereas traditional dialog systems act only on the top ASR hypothesis, statistical dialog systems claim greater robustness to ASR errors by maintaining a distribution over multiple hidden dialog states. Recently, these techniques have been deployed publicly for the first time, making empirical measurement possible. In this paper, we analyze two of these deployments. We find that performance was mixed: in some cases statistical techniques improved accuracy with respect to the top speech recognition hypothesis; in other cases, accuracy was degraded. Investigating the degradations, we find three main causes, none obvious a priori: inaccurate parameter estimates, poor confidence scores, and correlations among speech recognition errors. Overall, the results suggest fundamental weaknesses in the formulation as a generative model, and we propose alternatives as future work.
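To make the core idea concrete, the following is a minimal sketch of the kind of belief tracking the abstract describes: a distribution over hidden dialog states (e.g., user goals) is updated from an ASR N-best list rather than the single top hypothesis. This is an illustrative simplification, not the model used in the deployed systems; the function name, the treatment of ASR confidences as observation likelihoods, and the uniform spreading of leftover probability mass are all assumptions made for this sketch.

```python
def update_belief(belief, nbest):
    """One simplified belief update over hidden dialog states.

    belief: dict mapping state -> probability (sums to 1).
    nbest: list of (hypothesis, confidence) pairs from the ASR;
           confidences are treated as P(user said hypothesis),
           an assumption of this sketch. Leftover probability mass
           is spread uniformly over all states, modeling the case
           where the truth is absent from the N-best list.
    """
    states = list(belief)
    leftover = max(0.0, 1.0 - sum(conf for _, conf in nbest))
    unnormalized = {}
    for s in states:
        # P(observation | state): confidence of the matching N-best
        # entry, plus a uniform share of the leftover mass.
        p_obs = leftover / len(states)
        for hyp, conf in nbest:
            if hyp == s:
                p_obs += conf
        unnormalized[s] = belief[s] * p_obs
    z = sum(unnormalized.values())
    return {s: p / z for s, p in unnormalized.items()} if z > 0 else belief


# Example: three equally likely city goals, then one ASR observation.
belief = {"boston": 1 / 3, "austin": 1 / 3, "aspen": 1 / 3}
belief = update_belief(belief, [("austin", 0.6), ("boston", 0.3)])
# "austin" now carries the most mass, but "boston" retains some,
# which is the claimed advantage over committing to the top hypothesis.
```

Note that this toy update inherits exactly the vulnerabilities the paper identifies: if the confidence scores are poorly calibrated or the ASR errors are correlated across turns, the same wrong hypothesis is reinforced repeatedly and the belief converges on it.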