Abstract

In this paper, we report our development of context-dependent allophonic hidden Markov models (HMMs) implemented in a 75 000-word speaker-dependent Gaussian-HMM recognizer. The context explored is the immediate left and/or right adjacent phoneme. To achieve reliable estimation of the model parameters, phonemes are grouped into classes based on their expected co-articulatory effects on neighboring phonemes. Only five separate preceding and following contexts are identified explicitly for each phoneme. By grouping the contexts we ensure that they occur frequently enough in the training data to allow reliable estimation of the parameters of the HMM representing the context-dependent units. Further improvement in the estimation reliability is obtained by tying the covariance matrices in the HMM output distributions across all contexts. Speech recognition experiments show that when a large amount of data (e.g. over 2500 words) is used to train context-dependent HMMs, the word recognition error rate is reduced by 33%, compared with the context-independent HMMs. For smaller amounts of training data the error reduction becomes less significant.

‚Äč