Abstract

For hidden Markov model (HMM) based speech recognition where the basic speech unit is smaller than the recognizer's output unit, the standard full Baum-Welch re-estimation procedure for HMM training is computationally very costly. This is because it requires evaluation of the HMM output densities and of the forward/backward probabilities over the entire region of the state-frame trellis. In this paper, we present an algorithm which exploits the fact that the entries of the trellis are essentially zero except near the block diagonal and hence achieves significant computational savings. The algorithm is evaluated in experiments with a large vocabulary word recognizer based on a mixture-density HMM representation of phonemes. The HMM parameters trained with the new algorithm are essentially identical to those trained with the full Baum-Welch algorithm, in that the resulting HMMs have nearly the same likelihood values on the same set of training data. The HMMs trained with the two algorithms yield identical word recognition accuracies. However, the new algorithm is shown to be about an order of magnitude faster than the full Baum-Welch algorithm.

1. Introduction

One major class of HMM-based speech recognition problems involves the use of basic speech units which are smaller than the utterance to be recognized, such as connected word recognition (Rabiner, Wilpon & Juang, 1986), isolated word recognition using subword units (Deng et al., 1990; Deng, Lennig & Mermelstein, 1990), and continuous speech recognition using either word or subword units (Bahl, Jelinek & Mercer, 1983; Lee et al., 1989). Despite apparently distinct task domains, a common characteristic exists within this class of problems: the speech utterances with which any particular recognition task is concerned are modeled by concatenating a sequence of HMMs, each representing a corresponding constituent speech unit, according to the rules which establish the composition of the speech utterance from these basic units. This paper addresses the problem of how to train the HMMs efficiently for this class of speech recognition.
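To make this concatenation concrete, the following sketch (illustrative only, not taken from the paper) builds a composite left-to-right HMM from per-unit HMMs according to a transcription. The dictionary unit_hmms, the exit-probability convention and the function name are assumptions made for the example.

import numpy as np

def concatenate_units(unit_hmms, transcription):
    """unit_hmms: dict mapping unit label -> (A, emissions), where A is an
    (n x n) left-to-right transition matrix whose last state exits the unit
    with probability 1 - sum(A[-1, :]), and emissions holds one output
    density per state (illustrative convention)."""
    blocks = [unit_hmms[u] for u in transcription]
    N = sum(A.shape[0] for A, _ in blocks)
    A_comp = np.zeros((N, N))
    emissions = []
    offset = 0
    for i, (A, emis) in enumerate(blocks):
        n = A.shape[0]
        A_comp[offset:offset + n, offset:offset + n] = A
        # Route the exit probability of this unit's last state into the
        # first state of the next unit in the transcription.
        if i + 1 < len(blocks):
            A_comp[offset + n - 1, offset + n] = 1.0 - A[-1, :].sum()
        emissions.extend(emis)
        offset += n
    return A_comp, emissions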

The training procedures developed and utilized until recently for the HMMs representing a sequence of connected units can be classified into two major categories. The first, which we call segmented training, contains two separate steps: estimate the boundaries of each constituent speech unit using a set of initial models, and apply the Baum-Welch algorithm (Baum, 1972) to estimate the HMM parameters for each unit given the boundary information. In practice, these two steps are often iterated to achieve better performance (Rabiner, Wilpon & Juang, 1986). There are two undesirable attributes associated with this procedure. First, the requirement of this two-step iteration incurs a high computational cost relative to the otherwise efficient segmented training. Second, for speech recognition applications, both training and testing should be performed with the same optimization criterion. Thus, for speech recognition systems where the total likelihood is used as the decoding criterion (Deng et al., in press b; Deng et al., 1990), segmented training is not a natural choice. This is particularly true when the requirement on the training data size is stringent.
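As an illustration of the two-step structure just described, the sketch below outlines the segmented training loop in Python. The callables segment_fn and reestimate_fn stand in for a boundary estimator and a per-unit Baum-Welch re-estimator; all names here are illustrative assumptions rather than the paper's notation.

def segmented_training(unit_hmms, training_tokens, segment_fn, reestimate_fn,
                       n_iterations=5):
    """training_tokens: list of (frames, transcription) pairs.
    segment_fn(models, frames, transcription) -> list of (unit, begin, end).
    reestimate_fn(model, frame_segments) -> updated model for one unit."""
    for _ in range(n_iterations):
        # Step 1: estimate unit boundaries in every token with the current models.
        segmentations = [segment_fn(unit_hmms, frames, transcription)
                         for frames, transcription in training_tokens]
        # Step 2: re-estimate each unit's HMM from the frames assigned to it.
        for unit in unit_hmms:
            unit_segments = [frames[b:e]
                             for (frames, _), segs in zip(training_tokens,
                                                          segmentations)
                             for (u, b, e) in segs if u == unit]
            unit_hmms[unit] = reestimate_fn(unit_hmms[unit], unit_segments)
    return unit_hmms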

The second procedure, which we call the (fully) relaxed training, allows the sequence of speech units to map into all possible segmentations (Jelinek, 1976). This is implemented as follows. First, a series of HMMs corresponding to the transcription of the training token are concatenated. The composite HMM constructed in this way is then trained using the standard full Baum-Welch algorithm. It is well known that this relaxed training requires intensive computation, especially in the case of continuous mixture output densities with many mixture components, as in the system described in Deng et al. (in press a).
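The cost of relaxed training can be seen from the full forward/backward recursions over the composite model, sketched below under simplified assumptions (numerical scaling omitted, output densities precomputed for every frame/state pair, illustrative names): every one of the T x N trellis cells is visited.

import numpy as np

def forward_backward(A_comp, b, pi):
    """A_comp: (N x N) composite transition matrix; b: (T x N) output densities
    b_j(o_t) evaluated for every frame/state pair; pi: initial distribution.
    Scaling against underflow is omitted for clarity."""
    T, N = b.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * b[0]
    for t in range(1, T):                       # forward pass over all N states
        alpha[t] = (alpha[t - 1] @ A_comp) * b[t]
    for t in range(T - 2, -1, -1):              # backward pass over all N states
        beta[t] = A_comp @ (b[t + 1] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # state posteriors per frame
    return alpha, beta, gamma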

This paper describes the semi-relaxed training algorithm, which overcomes the limitations of both the segmented training and the relaxed training. Experimental results with the semi-relaxed algorithm are reported, demonstrating its computational efficiency and the effectiveness of the models trained with it.

2. The semi-relaxed training algorithm

The basis for the semi-relaxed algorithm is the fact that in relaxed training, most of the entries in the state-frame trellis, which correspond to the conditional probabilities of state transitions given the entire observation sequence, are near zero. These entries are typically located off the block diagonal of the trellis. The computation of the output densities and of the forward/backward probabilities for these entries, which the relaxed training actually carries out, is unnecessary and can be avoided. The semi-relaxed algorithm we propose here imposes a block-diagonal structure on the state-frame trellis in which the blocks overlap in the frame dimension. The number of blocks is the same as the number of the constituent speech units in the training token, as in the segmented training. The size of each block is determined by crude estimates of the segment boundaries of the constituent speech units plus the overlap sizes (left and right). The overlap of blocks is intended to compensate for possible inaccuracies in the segment boundary estimates. It should be emphasized that the algorithm does not require estimates of the segment boundaries to be obtained by computationally expensive Viterbi-like algorithms (Jelinek, 1976). Uniform segmentation of the speech units can work well so long as the overlap sizes are chosen to be greater than half of the longest speech unit's length.
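The following sketch, under assumed conventions (uniform segmentation, overlaps measured in frames, illustrative names; not the paper's code), shows one way the block-diagonal restriction could be realized: each constituent unit's states are paired with a widened frame interval, and only the resulting cells are marked active for output-density evaluation and for the forward/backward recursions. Output densities outside the active cells are never evaluated, which is the source of the computational saving.

import numpy as np

def block_frame_ranges(T, n_units, left_overlap, right_overlap):
    """Frame interval [lo, hi) assigned to each unit's block under uniform
    segmentation of the T frames, extended by the left/right overlaps."""
    bounds = np.linspace(0, T, n_units + 1).astype(int)
    return [(max(0, bounds[u] - left_overlap),
             min(T, bounds[u + 1] + right_overlap))
            for u in range(n_units)]

def active_states_per_frame(T, unit_state_ranges, frame_ranges):
    """For each frame t, the composite-model states allowed to be active.
    unit_state_ranges[u] = (first_state, last_state + 1) of unit u in the
    composite model; cells outside these sets are treated as zero."""
    active = [set() for _ in range(T)]
    for (s_lo, s_hi), (f_lo, f_hi) in zip(unit_state_ranges, frame_ranges):
        for t in range(f_lo, f_hi):
            active[t].update(range(s_lo, s_hi))
    return active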