Abstract

In the study reported in this paper, we investigate interactions of front-end feature extraction and back-end classification techniques in hidden Markov model (HMM)-based speech recognition. The proposed model focuses on dimensionality reduction of the mel-warped discrete Fourier transform (DFT) feature space subject to maximal preservation of speech classification information, and aims at finding an optimal linear transformation of the mel-warped DFT according to the minimum classification error (MCE) criterion. This linear transformation, along with the HMM parameters, is automatically trained by gradient descent to minimize a measure of overall empirical error counts. A further generalization of the model allows the discriminatively derived, state-dependent transformation to be integrated with the construction of dynamic feature parameters. Experimental results show that the state-dependent transformation of mel-warped DFT features outperforms the conventional mel-frequency cepstral coefficients (MFCCs). An error-rate reduction of 15% is obtained on the standard 39-class TIMIT phone classification task, compared with a conventional MCE-trained HMM using MFCCs that were not subject to optimization during training.
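As a rough illustration of the training idea summarized above, the following is a minimal numerical sketch of MCE-style joint optimization of a linear feature transform and classifier parameters by gradient descent. It is not the paper's system: HMM state distributions are replaced by simple per-class prototypes in the transformed space, and all dimensions, class counts, learning rates, and the sigmoid smoothing constant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 20-dim input frames (stand-in for mel-warped DFT),
# a 4-dim transformed feature space, 3 classes, 300 frames per class.
D, d, C, N = 20, 4, 3, 300
centers = rng.normal(0.0, 2.0, (C, D))
X = np.vstack([c + rng.normal(0.0, 1.0, (N, D)) for c in centers])
y = np.repeat(np.arange(C), N)

W = rng.normal(0.0, 0.1, (d, D))   # linear transform, trained discriminatively
M = rng.normal(0.0, 0.1, (C, d))   # class prototypes (stand-in for HMM states)
alpha, lr = 1.0, 0.01              # sigmoid slope and step size (assumed)

def sigmoid(t):
    # numerically stable logistic function
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def discriminants(x):
    # g_c(x) = -||Wx - m_c||^2: larger means closer to class c
    z = W @ x
    return -np.sum((z - M) ** 2, axis=1)

def smoothed_error():
    # average sigmoid-smoothed misclassification measure over the data,
    # a differentiable proxy for the empirical error count
    total = 0.0
    for x, c in zip(X, y):
        g = discriminants(x)
        j = np.argmax(np.where(np.arange(C) == c, -np.inf, g))
        total += sigmoid(alpha * (g[j] - g[c]))
    return total / len(X)

before = smoothed_error()
for epoch in range(30):
    for x, c in zip(X, y):
        g = discriminants(x)
        j = np.argmax(np.where(np.arange(C) == c, -np.inf, g))  # best competitor
        s = sigmoid(alpha * (g[j] - g[c]))   # smoothed 0/1 error for this frame
        coef = alpha * s * (1.0 - s)         # slope of the sigmoid loss
        z = W @ x
        # analytic gradients of the misclassification measure d = g_j - g_c:
        #   dd/dW   = 2 (m_j - m_c) x^T
        #   dd/dm_c = -2 (Wx - m_c),  dd/dm_j = 2 (Wx - m_j)
        W    -= lr * coef * 2.0 * np.outer(M[j] - M[c], x)
        M[c] += lr * coef * 2.0 * (z - M[c])   # pull correct prototype toward Wx
        M[j] -= lr * coef * 2.0 * (z - M[j])   # push competitor away from Wx
after = smoothed_error()
print(before, "->", after)
```

Both the transform W and the back-end parameters receive gradients from the same smoothed error measure, which is the key point of the joint front-end/back-end optimization the abstract describes; the smoothed error should drop over the training epochs.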