We investigate the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective. We recently showed that for speaker-independent transcription of phone calls (NIST RT03S Fisher data), CD-DNN-HMMs reduced the word error rate by as much as one third, from 27.4% (obtained by discriminatively trained Gaussian-mixture HMMs with HLDA features) to 18.5%, using 300+ hours of training data (Switchboard), 9000+ tied triphone states, and up to 9 hidden network layers.

In this paper, we evaluate the effectiveness of feature transforms developed for GMM-HMMs (HLDA, VTLN, and fMLLR) when applied to CD-DNN-HMMs. Results show that HLDA is subsumed (expected), as is much of the gain from VTLN (not expected): deep networks learn vocal-tract-length-invariant structures to a significant degree. Unsupervised speaker adaptation with discriminatively estimated fMLLR-like transforms works (as hoped for) nearly as well as fMLLR for GMM-HMMs.

We also improve model training with a discriminative pretraining procedure, yielding a small accuracy gain due to a better internal feature representation.