Abstract

In deep learning research, the focus has often been on the representation of input features, while the output representation has tended to receive much less attention. In this paper, three largely separate case studies are presented to argue for the importance of learning output representations. These studies discuss and analyze three ways of designing and/or learning output representations for the deep-learning approach to speech recognition. First, the very large number of output units in current context-dependent (CD) deep neural network (DNN) based speech recognizers can be effectively reduced, without lowering recognition accuracy and while improving decoding efficiency, by applying dimensionality reduction via low-rank approximation to the large DNN output matrices. Second, the currently popular CD-DNN, which uses “beads-on-a-string” or linear-sequence representations of linguistic speech units in the DNN output layer, can be generalized to structured multi-linear or graph representations. Temporally overlapping linguistic “features” or symbols serve as the basis for this phonological design. Third, when a special type of deep network, the deep convex network (DCN), is used as a representational model for speech acoustic patterns, the output units in each DCN module are designed to be linear, enabling drastic simplification in learning the parameters of the entire network.
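As a rough illustration of the first case study, the sketch below shows the core of the low-rank idea: a large output-layer weight matrix is factorized by a truncated SVD into the product of two much smaller matrices. The variable names, layer sizes, and the use of NumPy are assumptions made for this example and are not taken from the paper; in a real system the matrix would come from a trained CD-DNN rather than being drawn at random.

```python
# A minimal sketch of low-rank approximation of a CD-DNN output layer.
# Assumed, illustrative sizes and names only; not the paper's actual setup.
import numpy as np

hidden_dim, num_senones, rank = 2048, 9304, 256  # illustrative sizes

# Stand-in for the output-layer weights (hidden activations -> senone logits).
# In practice W would be taken from a trained CD-DNN, where the singular
# values typically decay quickly, so a small rank captures most of W.
W = np.random.randn(hidden_dim, num_senones)

# Truncated SVD: W is approximated by A @ B, with
# A of shape (hidden_dim, rank) and B of shape (rank, num_senones).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]   # absorb singular values into the left factor
B = Vt[:rank, :]

params_before = W.size
params_after = A.size + B.size
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)

print(f"parameters: {params_before:,} -> {params_after:,}")
print(f"relative approximation error: {rel_error:.3f}")
```

One common way to use such a factorization is to replace the single large affine output layer with two smaller affine layers (a linear bottleneck of width `rank`) and then fine-tune the network, which reduces both the parameter count and the per-frame cost of decoding.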