Abstract

An overview is given of a statistical paradigm for speech recognition in which phonetic and phonological knowledge sources, drawn from the current understanding of the global characteristics of human speech communication, are seamlessly integrated into the structure of a stochastic model of speech. A consistent statistical formalism is presented in which the submodels for the discrete, feature-based phonological process and the continuous, dynamic phonetic process in human speech production are computationally interfaced. This interface enables global optimization of a parsimonious set of model parameters that accurately characterize the symbolic, dynamic, and static components in speech production, and it explicitly separates distinct sources of the speech variability observable at the acoustic level. The formalism is founded on a rigorous mathematical basis, encompassing computational phonology, Bayesian analysis and statistical estimation theory, nonstationary time series and dynamic system theory, and nonlinear function approximation (neural network) theory. Two principal ways of implementing the speech model and recognizer are presented, one based on the trended hidden Markov model (HMM), or explicitly defined trajectory model, and the other on the state-space, or recursively defined trajectory model. Both implementations build into their respective recognition and model-training algorithms a continuity constraint on the internal, production-affiliated trajectories across feature-defined phonological units. The continuity and the parameterized structure in the dynamic speech model permit a joint characterization of the contextual and speaking-style variations manifested in speech acoustics, thereby holding promise to overcome some key limitations of current speech recognition technology. © 1998 Elsevier Science B.V. All rights reserved.