Abstract

In fluently spoken English utterances, the dynamic pattern in the spectral aspect of the acoustic signal strikes most speech researchers. Even within the vowel segments that have traditionally been characterized solely by static acoustic properties, one ubiquitously observes continually varying spectral patterns [13], [16]. The (dynamic) trajectories of speech have, albeit to a gross degree of approximation, been partially captured by the conventional hidden Markov model (HMM) via multiple left-to-right states, each representing a piecewise-constant acoustic segment of speech. In order to move from the piecewise-constant approximation to real patterns of speech, often characterized by smooth, systematic, and continuous motions in the spectral domain, various classes of nonstationary-state (or regressive-state) HMMs have recently been developed [5], [7], [9]. For the parametric form of such models [5], [7], the conventional assumption of state-conditioned IID observations has been relaxed to include state-conditioned polynomial Gaussian means, regressive on the Markov-state sojourn time. All of the above earlier development of the nonstationary-state HMM was focused on the algorithmic aspect of the model, and little attention was paid to the choice of speech units suitable for representation by the new model. The study of [7] was based on the conventional phonemic unit and that of [9] on the diphone unit, both of which have been known to encounter overwhelming difficulties when the size of the speech-recognition vocabulary outgrows the amount of data used to train the recognizer's parameters.
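As a concrete illustration (in our own notation, which need not match that of [5], [7]), the output distribution of such a nonstationary-state HMM can be written with a polynomial trend in the state-conditioned mean:
$$
\mathbf{o}_t \mid (s_t = i,\ \tau) \;\sim\; \mathcal{N}\!\Big(\sum_{k=0}^{K} \mathbf{b}_i(k)\,\tau^{k},\ \boldsymbol{\Sigma}_i\Big),
$$
where $\tau$ denotes the time elapsed since entry into state $i$ (the sojourn time), $\mathbf{b}_i(k)$ are the polynomial regression coefficients of state $i$, and $\boldsymbol{\Sigma}_i$ is the residual covariance. Setting $K = 0$ recovers the conventional stationary-state HMM as a special case.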

Parallel to the development of the regressive-state HMM, we have, over the past several years, also concentrated on the development of a feature-based statistical framework and of the related primitive units of speech, aiming at a parsimonious and parametric description of context-dependent behaviors in fluent speech [6]. The focus of that development is a process of feature-defined lexicon compilation via elaborate construction of atomic speech units at the feature (subphonemic) level. Motivated by the theory of distinctive features [4] and by the principles of articulatory phonology [3], the speech recognizer with the phonological component designed from the feature-based atomic speech units demonstrated clear effectiveness in phonetic classification tasks (TIMIT) involving all classes of English sounds. This effectiveness was achieved despite the fact that the acoustic-mapping component of the recognizer was simply the conventional, stationary-state HMM.
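As a rough, hypothetical sketch of what feature-defined lexicon compilation could look like (the feature names, feature values, and the helper `compile_lexical_entry` below are invented for illustration and are not the construction of [6]), each phonemic symbol is mapped to a constellation of articulatory features, and context dependence arises when underspecified features spread across neighboring atomic units:

```python
# Hypothetical feature bundles for illustration only; the tier names and values
# are assumptions of this sketch, not the inventory defined in [6].
FEATURE_BUNDLES = {
    "b":  {"lips": "closure", "tongue_body": None,        "velum": "closed", "larynx": "voiced"},
    "ae": {"lips": None,      "tongue_body": "low-front",  "velum": "closed", "larynx": "voiced"},
    "n":  {"lips": None,      "tongue_body": None,         "velum": "open",   "larynx": "voiced"},
}

def compile_lexical_entry(phonemes):
    """Map a phoneme sequence to a sequence of feature constellations.

    Unspecified tiers (None) are filled by spreading the value of the nearest
    specified neighbor, which is one simple way context-dependent behavior
    can be encoded at the feature level.
    """
    bundles = [dict(FEATURE_BUNDLES[p]) for p in phonemes]
    for tier in {"lips", "tongue_body", "velum", "larynx"}:
        for i, bundle in enumerate(bundles):
            if bundle[tier] is None:
                # Spread from the left neighbor if available, else the right.
                neighbors = bundles[i - 1:i] + bundles[i + 1:i + 2]
                for nb in neighbors:
                    if nb[tier] is not None:
                        bundle[tier] = nb[tier]
                        break
    return bundles

print(compile_lexical_entry(["b", "ae", "n"]))  # e.g., a feature-level entry for "ban"
```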

One unique and overwhelming advantage of the feature-based speech recognition framework described in [6] is the endowment of each HMM state with explicit interpretations in terms of the underlying articulatory-feature constellation and of the manipulation of the associated articulatory structure responsible for generating the corresponding acoustic observation. A natural step to advance the earlier feature-based framework is to first identify the intrinsically “dynamic” HMM states according to the articulatory interpretation of the states and, based on such identification, to enhance the acoustic-mapping component of the recognizer from the conventional stationary-state HMM to the more general nonstationary-state counterpart. The study reported here is aimed at this goal.
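A minimal sketch of this enhancement, assuming the polynomial-trend Gaussian output density written above (the class and parameter names are ours, not the authors' implementation): states judged intrinsically dynamic from their articulatory interpretation receive polynomial order K > 0, while the remaining static states keep K = 0 and thus reduce to conventional stationary-state HMM states.

```python
import numpy as np

class TrendedGaussianState:
    """HMM state whose Gaussian mean is a polynomial in the sojourn time tau.

    With polynomial order K = 0 the mean is constant, i.e., the conventional
    stationary-state HMM output density.
    """

    def __init__(self, coeffs, cov):
        # coeffs: (K+1, D) polynomial coefficients b_i(k); cov: (D, D) residual covariance
        self.coeffs = np.asarray(coeffs, dtype=float)
        self.cov = np.asarray(cov, dtype=float)
        self.cov_inv = np.linalg.inv(self.cov)
        _, logdet = np.linalg.slogdet(self.cov)
        self.log_norm = -0.5 * (self.cov.shape[0] * np.log(2.0 * np.pi) + logdet)

    def mean(self, tau):
        # mu_i(tau) = sum_k b_i(k) * tau**k
        powers = float(tau) ** np.arange(self.coeffs.shape[0])
        return powers @ self.coeffs

    def log_likelihood(self, obs, tau):
        diff = np.asarray(obs, dtype=float) - self.mean(tau)
        return self.log_norm - 0.5 * diff @ self.cov_inv @ diff

# A state interpreted as articulatorily dynamic (e.g., a transitional region)
# is given K = 2; an intrinsically static state keeps K = 0.
rng = np.random.default_rng(0)
dynamic_state = TrendedGaussianState(coeffs=rng.standard_normal((3, 2)), cov=np.eye(2))
static_state = TrendedGaussianState(coeffs=rng.standard_normal((1, 2)), cov=np.eye(2))

obs = np.zeros(2)
print(dynamic_state.log_likelihood(obs, tau=4))  # likelihood depends on sojourn time
print(static_state.log_likelihood(obs, tau=4))   # identical for any tau
```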