Large-margin discriminative training of hidden Markov models has received significant attention recently. A natural and interesting question is whether the existing discriminative training algorithms can be extended directly to embed the concept of margin. In this paper, we give this question an affirmative answer by showing that the sigmoid bias in the conventional minimum classification error (MCE) training can be interpreted as a soft margin. We justify this claim from a theoretical classification risk minimization perspective where the loss function associated with a non-zero sigmoid bias is shown to include not only empirical error rates but also a margin-bound risk. Based on this perspective, we propose a practical optimization strategy that adjusts the margin (sigmoid bias) incrementally in the MCE training process so that a desirable balance between the empirical error rates on the training set and the margin can be achieved. We call this modified MCE training process large-margin minimum classification error (LM-MCE) training to differentiate it from the conventional MCE. Speech recognition experiments have been carried out on two tasks. First, in the TIDIGITS recognition task, LM-MCE outperforms the state-of-the-art MCE method with 17% relative digit-error reduction and 19% relative string-error reduction. Second, on the Microsoft internal large vocabulary telephony speech recognition task (with 2000 h of training data and 120 K words in the vocabulary), significant recognition accuracy improvement is achieved, demonstrating that our formulation of LM-MCE can be successfully scaled up and applied to large-scale speech recognition tasks.
Keywords: Minimum classification error training; Discriminative training; Large-margin training; Speech recognition; Large-scale speech recognition; Theoretical classification risk minimization