The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a recently proposed acoustic-modeling technique for HMM-based speech recognition that can greatly outperform conventional Gaussian-mixture-based HMMs. For example, a CD-DNN-HMM trained on the 2000-hour Fisher corpus achieves a 14.4% word error rate on the Hub5’00-FSH speaker-independent phone-call transcription task, compared to 19.6% obtained by a state-of-the-art, conventional, discriminatively trained GMM-based HMM.

That CD-DNN-HMM, however, took 59 days to train on a modern GPGPU: the immense computational cost of minibatch-based back-propagation (BP) training is a major roadblock. Unlike the familiar Baum-Welch training for conventional HMMs, BP cannot be efficiently parallelized across data.

In this paper we show that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server. Using 2 and 4 GPGPUs, we achieve end-to-end speed-ups of 1.9 and 3.3 times, at parallelization efficiencies of 0.95 and 0.82, respectively, with no loss of recognition accuracy.
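
The relation between the speed-up and efficiency figures can be checked with a short calculation. The sketch below assumes the usual definition, which the text above does not spell out: parallelization efficiency is the end-to-end speed-up divided by the number of GPGPUs. The function name `parallel_efficiency` is hypothetical and used only for illustration.

```python
def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Efficiency under the assumed definition: speed-up / number of devices."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup / num_devices


# Speed-ups as reported above, keyed by the number of GPGPUs.
reported_speedups = {2: 1.9, 4: 3.3}

efficiencies = {
    n: parallel_efficiency(s, n) for n, s in reported_speedups.items()
}

for n, eff in efficiencies.items():
    print(f"{n} GPGPUs: speed-up {reported_speedups[n]}, efficiency ~ {eff:.3f}")
```

Under this assumption, 2 GPGPUs give 1.9 / 2 = 0.95, matching the reported value. For 4 GPGPUs, 3.3 / 4 = 0.825, which agrees with the reported 0.82 if the 3.3 figure is itself rounded.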