This paper examines two statistical spoken dialog systems deployed to the public, extending an earlier study of one system. Results across the two systems show that statistical techniques improved performance in some cases but degraded it in others. Investigating the degradations, we find, non-obviously, that the three main causes are inaccurate parameter estimates, poor confidence scores, and correlations in speech recognition errors. We also find evidence of fundamental weaknesses in the formulation of the model as a generative process, and briefly show the potential of a discriminatively trained alternative.