Abstract

The recent success of deep neural networks (DNNs) in speech recognition can be attributed largely to their ability to extract a specific form of high-level features from raw acoustic data for subsequent sequence classification or recognition tasks. Among the many possible forms of DNN features, which forms are more useful than others, and how effective they are when paired with different types of downstream sequence recognizers, have remained unexplored and are the focus of this paper. We report our recent work on constructing a diverse set of DNN features, including vectors extracted from the output layer and from various hidden layers of the DNN. We then apply these features as inputs to four types of classifiers, all carrying out the same sequence classification task of phone recognition. The experimental results show that the features derived from the top hidden layer of the DNN perform best for all four classifiers, especially for the autoregressive-moving-average (ARMA) version of a recurrent neural network. The feature vector derived from the DNN’s output layer performs slightly worse than the top hidden layer but better than any of the other hidden layers.