A statistical generative model for the speech process is described that
embeds a substantially richer structure than the hidden Markov model (HMM) currently
in predominant use for automatic speech recognition. This switching dynamic-system
model generalizes and integrates the HMM and the piecewise stationary nonlinear
dynamic system (state-space) model. Depending on the level and the nature of the
switching in the model
design, various key properties of the speech dynamics can be naturally represented in
the model. Such properties include the temporal structure of the speech acoustics, the
articulatory movements that cause the acoustics, and the control of such movements by
multidimensional targets correlated with the phonological (symbolic) units of speech,
expressed in terms of overlapping articulatory features.
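To make the layered structure concrete, the following is a minimal simulation sketch of one possible switching dynamic system, not the chapter's actual model. It assumes a two-state Markov chain for the phonological layer, a target-directed linear recursion for the hidden (articulatory-like) layer, and a fixed `tanh` mapping to the acoustic layer. The function name, dimensions, and every numeric value (`trans`, `targets`, `phi`, `proj`, noise scales) are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_switching_system(n_frames=200, seed=0):
    """Sample (s, x, o) from a toy switching dynamic system (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Discrete (phonological) layer: hypothetical 2-state Markov chain.
    trans = np.array([[0.95, 0.05],
                      [0.05, 0.95]])
    # Hidden continuous layer: each discrete state has its own target vector
    # and its own time constant (both values are assumptions).
    targets = np.array([[1.0, -0.5],
                        [-1.0, 0.8]])
    phi = np.array([0.9, 0.8])
    # Acoustic layer: assumed static nonlinear mapping tanh(proj @ x).
    proj = rng.standard_normal((3, 2))

    s = np.zeros(n_frames, dtype=int)   # discrete states, s[0] fixed to 0
    x = np.zeros((n_frames, 2))         # hidden dynamics
    o = np.zeros((n_frames, 3))         # observations

    for t in range(n_frames):
        if t > 0:
            # Switch (or stay) according to the Markov transition row.
            s[t] = rng.choice(2, p=trans[s[t - 1]])
        prev = x[t - 1] if t > 0 else np.zeros(2)
        # Target-directed dynamics: x moves toward the current state's target.
        x[t] = (phi[s[t]] * prev
                + (1.0 - phi[s[t]]) * targets[s[t]]
                + 0.01 * rng.standard_normal(2))
        # Nonlinear observation with additive noise.
        o[t] = np.tanh(proj @ x[t]) + 0.05 * rng.standard_normal(3)

    return s, x, o
```

Calling `simulate_switching_system(n_frames=50)` returns the discrete state sequence, the hidden trajectory, and the observations. Whenever the discrete state switches, the hidden trajectory changes direction smoothly toward the new target, which is the kind of temporal structure a frame-wise stationary HMM output distribution does not capture.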
One main challenge in using this multi-level switching dynamic-system model for
successful speech recognition is that exact inference (decoding) of the posterior
probabilities of the hidden states is computationally intractable, which in turn makes
optimal parameter learning (training) intractable. Several versions of Bayesian networks have been
devised with detailed dependency implementation specified to represent the switching
dynamic-system model of speech. We discuss the variational technique developed for
general Bayesian networks as a suboptimal solution to the decoding and learning
problems. The variational technique and the human auditory function, which uses neural
transient responses to detect temporal landmarks associated with phonological
features, share some common operations for estimating the switching times of
phonological states. This suggests that variational-style learning may actually take
place in human speech perception under an encoding-decoding theory of speech
communication. This theory highlights the critical role of modeling articulatory
dynamics for speech recognition and forms a main motivation for the switching
dynamic-system model described in this chapter.
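As an illustration of why a variational technique is suboptimal yet tractable, one common form of the approximation is sketched here. It is a generic mean-field factorization and not necessarily the one adopted in the chapter. The symbols are assumptions introduced for this sketch: $s_{1:T}$ denotes the discrete phonological states, $x_{1:T}$ the hidden continuous dynamics, $o_{1:T}$ the acoustic observations, $\theta$ the model parameters, and $q$ the approximating distribution.

```latex
% Mean-field assumption: the discrete and continuous hidden layers are
% decoupled in the approximate posterior.
q(s_{1:T}, x_{1:T}) = q(s_{1:T}) \, q(x_{1:T})

% Variational lower bound on the log-likelihood, maximized in place of
% the intractable exact posterior computation.
\log p(o_{1:T} \mid \theta) \;\ge\;
  \mathbb{E}_{q}\!\left[ \log p(o_{1:T}, s_{1:T}, x_{1:T} \mid \theta) \right]
  - \mathbb{E}_{q}\!\left[ \log q(s_{1:T}, x_{1:T}) \right]
```

Under this assumption, decoding alternates between updating $q(s_{1:T})$, which carries the estimated switching times, and updating $q(x_{1:T})$. The bound is tight only when $q$ equals the true posterior, which the factorized form generally cannot achieve.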