We propose and evaluate a new acoustic model that combines the hidden Markov model (HMM) with a special type of hidden dynamic model (HDM), a target-directed hidden trajectory model, into a single integrated model named HTHMM. The new model provides a computational model of coarticulation by representing the internal dynamics of human speech through the hidden trajectory of the vocal-tract resonances. This paper focuses on the general structure of the new model and its EM training procedure. The corresponding MAP decoding algorithm and a more detailed evaluation are given in [1].
Speech recognition experiments on the Aurora2 task demonstrate that the new model, although it uses only context-independent phoneme units (no context-dependent parameters), still achieves a slightly lower word error rate than the corresponding cross-word triphone HMM. This provides evidence that the coarticulatory mechanism represented by the HTHMM through its model structure matches the traditional context-dependent modeling approach, which relies on enumerating model parameters.
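
To illustrate how a target-directed hidden trajectory can yield context-dependent behavior from context-independent parameters, the following is a minimal sketch only. It assumes a generic first-order target-directed recursion, which is not necessarily the HTHMM formulation; the function names, the `rate` constant, and the resonance targets in `TARGETS` are all hypothetical values chosen for illustration.

```python
# Illustrative sketch, NOT the HTHMM formulation: a generic first-order
# target-directed recursion  z[t+1] = rate * z[t] + (1 - rate) * target.
# All names and numeric values below are assumptions made for this example.

def target_directed_trajectory(targets, durations, rate=0.8, start=500.0):
    """Return a hidden trajectory that moves toward each target in turn."""
    z = start
    trajectory = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            # Each frame closes a fixed fraction of the gap to the target.
            z = rate * z + (1.0 - rate) * target
            trajectory.append(z)
    return trajectory

# Hypothetical resonance targets in Hz, one per context-independent phone.
TARGETS = {"a": 700.0, "i": 300.0, "u": 350.0}

def realized_value(left, phone, n_frames=5):
    """Final trajectory value of `phone` when it follows `left`."""
    traj = target_directed_trajectory(
        [TARGETS[left], TARGETS[phone]], [20, n_frames]
    )
    return traj[-1]

# The same phone "i" with the same single target is realized differently
# depending on its left neighbor, because a short segment undershoots.
after_a = realized_value("a", "i")
after_u = realized_value("u", "i")
print(after_a, after_u)
```

In this sketch the phone "i" has one target regardless of context, yet its realized value after "a" differs from its value after "u" because the short segment does not reach the target. Lengthening the segment shrinks the difference, which is the kind of structural coarticulation effect that a triphone system instead captures by enumerating separate parameters per context.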