The earlier version of the hidden trajectory model (HTM) for speech dynamics, which predicts the “static” cepstra as the observed acoustic feature, is generalized to one that jointly predicts static cepstra and their temporal differentials (i.e., delta cepstra). The generalized HTM is formulated in the generative-modeling framework, which enables efficient computation of the joint likelihood of the static and delta cepstral sequences, as the acoustic features, given the model. Parameter estimation techniques for the new model are developed, yielding closed-form estimation formulas after a vector Taylor series approximation is applied. We show a principled generalization from the earlier static-cepstra HTM to the new static/delta-cepstra HTM, not only in the model formulation but also in the analytical forms of the respective (monophone) parameter estimates. Experimental results on the standard TIMIT phonetic recognition task demonstrate improved recognition accuracy over the earlier best HTM system, and both HTM systems perform significantly better than state-of-the-art triphone HMM systems.

Index Terms: phonetic recognition, hidden trajectory modeling, delta cepstra, joint static/dynamic feature, generative modeling