Discriminative training of acoustic models has become one of the most important training methods for state-of-the-art speech recognition systems. The topic attracts growing attention from researchers, who continue to develop new training criteria, parameter optimization methods, and application techniques. In this context, this thesis focuses on discriminative training of acoustic models and its application to automatic speech recognition. It presents a systematic, in-depth study of the topic and introduces our contributions to the training criterion, the optimization method, and the application of discriminative training.
First, this thesis proposes a novel discriminative training criterion, Minimum Word Classification Error (MWCE). By localizing the conventional string-level MCE loss function to the word level, MWCE approximates and minimizes a more direct measure of word classification error. Because a word-level criterion better matches the performance measures used in LVCSR, such as WER, it can improve word recognition performance. Compared with other sub-string-level criteria (e.g. MWE / MPE), MWCE offers another perspective on word-level classification error and achieves the best recognition performance in our experiments. This result suggests that it is still worthwhile to develop new discriminative training criteria that have an explicit physical meaning and are better justified.
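As a minimal sketch of the idea (not the implementation used in this thesis), the following Python code applies the standard MCE misclassification measure and sigmoid smoothing to each word segment and sums the results. The function names, the smoothing constants `eta` and `gamma`, and the representation of a word segment as one correct score plus a list of competitor scores are assumptions made for illustration only.

```python
import math


def misclassification_measure(correct_score, competitor_scores, eta=1.0):
    """Standard MCE measure: d = -g_correct + (1/eta) * log(mean(exp(eta * g_k))).

    Scores are log-domain discriminant values; d > 0 indicates an error.
    """
    m = max(competitor_scores)
    # Log-mean-exp, computed stably by factoring out the largest score.
    mean_exp = sum(math.exp(eta * (s - m)) for s in competitor_scores)
    mean_exp /= len(competitor_scores)
    return -correct_score + m + math.log(mean_exp) / eta


def sigmoid_loss(d, gamma=1.0):
    """Smoothed 0/1 loss; written in two branches to avoid overflow."""
    x = gamma * d
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def word_level_loss(word_segments, eta=1.0, gamma=1.0):
    """Sum the smoothed classification error over word segments.

    word_segments: list of (correct_score, [competitor_scores]) pairs.
    This layout is hypothetical and chosen only for the sketch.
    """
    return sum(
        sigmoid_loss(misclassification_measure(c, comps, eta), gamma)
        for c, comps in word_segments
    )
```

With a large `gamma`, the sum approaches a count of misclassified words, which is why a word-level loss of this kind tracks WER more closely than a single string-level loss.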
Second, the thesis proposes a new parameter optimization method for discriminative training: trust-region-based optimization of the MMIE criterion. By imposing a trust region constraint on the optimization process, we avoid some disadvantages of the unbounded optimization used in the conventional EB method. The new optimization method is mathematically better founded and also physically meaningful. Moreover, because a global optimum can be reached in each iteration, the proposed method optimizes the criterion more efficiently. Our experimental results suggest that the trust-region-based approach outperforms the conventional EB method both in optimizing the criterion and in recognition performance.
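For orientation, the MMIE criterion and a generic trust-region step can be written as follows. The first equation is the standard MMIE objective. The second is only a schematic form of a constrained update: the auxiliary function $Q$, the distance $D$, and the radius $\epsilon$ are placeholder symbols assumed here, not the specific choices made in this thesis.

```latex
% Standard MMIE objective over training utterances r = 1..R,
% with observations O_r, reference transcription w_r, and acoustic scale \kappa.
\mathcal{F}_{\mathrm{MMIE}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(O_r \mid w_r)^{\kappa} \, P(w_r)}
         {\sum_{w} p_{\lambda}(O_r \mid w)^{\kappa} \, P(w)}

% Schematic trust-region update at iteration n (placeholder Q, D, \epsilon):
\lambda^{(n+1)}
  = \arg\max_{\lambda} \; Q\bigl(\lambda;\, \lambda^{(n)}\bigr)
  \quad \text{subject to} \quad
  D\bigl(\lambda,\, \lambda^{(n)}\bigr) \le \epsilon^{2}
```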
Third, this thesis presents our work on improving the Soft Margin Estimation (SME) method. By incorporating several important discriminative training techniques developed in recent years, we successfully apply SME to LVCSR for the first time. We also propose a frame-level separation measure to select the frames that carry important discriminative information. Our experiments compare conventional MCE, string-level SME, and the proposed frame-level SME. The results show that, by using the concept of a soft margin, both SME methods achieve better performance than MCE. Moreover, by introducing a factor that removes noisy frames, frame-level SME achieves the best recognition performance and significantly outperforms MCE.
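The soft margin idea with frame selection can be illustrated by the short Python sketch below. It uses the commonly cited SME objective, a margin term `lam / rho` plus the average hinge loss `max(0, rho - d)`, and a simple threshold rule for discarding noisy frames. The threshold `tau`, the function name, and the selection rule itself are assumptions for illustration; the abstract does not specify the exact separation measure or selection factor used in this thesis.

```python
def frame_level_sme_objective(frame_separations, rho, lam, tau):
    """Soft-margin objective over selected frames (illustrative sketch).

    frame_separations: per-frame separation d_t, e.g. the log-likelihood of the
        correct model minus that of the competing model (assumed definition).
    rho: soft margin (> 0).
    lam: weight balancing margin size against empirical risk.
    tau: hypothetical threshold; frames with d_t < -tau are treated as
        noisy and removed before the loss is computed.
    """
    if rho <= 0:
        raise ValueError("rho must be positive")
    # Frame selection: keep only frames that are not badly misclassified.
    kept = [d for d in frame_separations if d >= -tau]
    margin_term = lam / rho
    if not kept:
        return margin_term
    # Hinge loss: only frames inside the margin (d < rho) contribute.
    hinge = sum(max(0.0, rho - d) for d in kept) / len(kept)
    return margin_term + hinge
```

Frames whose separation already exceeds `rho` contribute nothing, so training concentrates on frames near the decision boundary, while the `tau` cut-off keeps extreme outliers from dominating the loss.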
Finally, this thesis proposes an application of discriminative training: MMIE-based HMM topology optimization. We define a heuristic metric based on the MMIE criterion and use it to guide the topology optimization process. The approach “exchanges” Gaussian kernels among HMM states so as to allocate model parameters non-uniformly. In addition, a post-processing step refines the model topology along the time axis. This provides a more direct connection between topology optimization and discrimination. As a result, the discriminative topology optimization outperforms conventional likelihood-based optimization methods in our experiments.