Abstract

This paper presents a new technique for high-accuracy
tracking of vocal-tract resonances (which coincide with
formants for nonnasalized vowels) in natural speech. The technique
is based on a discretized nonlinear prediction function,
which is embedded in a temporal constraint on the quantized
input values over adjacent time frames as prior knowledge of
their temporal behavior. The nonlinear prediction function is constructed
from its analytical form, derived in detail in this paper, as
a parameter-free, discrete mapping function that approximates
the “forward” relationship from the resonance frequencies and
bandwidths to the Linear Predictive Coding (LPC) cepstra of
real speech. Discretization of the function permits the “inversion”
of the function via a search operation. We further introduce the
nonlinear-prediction residual, modeled as a multivariate
Gaussian random vector with trainable mean vectors and covariance
matrices, to account for the errors due to the functional approximation.
We develop and describe an expectation–maximization
(EM)-based algorithm for training the parameters of the residual,
and a dynamic programming-based algorithm for resonance
tracking. Implementation details that speed up the computation
are provided. Experimental results demonstrate the effectiveness
of our new paradigm for tracking vocal-tract resonances. In
particular, we show that training the prediction-residual parameters
is effective in obtaining high-accuracy resonance estimates,
especially during consonantal closure.
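
As an illustration of the "forward" mapping and its "inversion" by search described above, the following is a minimal sketch, not the implementation used in this paper. It assumes the standard closed form for the LPC cepstrum of an all-pole model with complex-conjugate pole pairs, c_n = sum_k (2/n) exp(-pi n b_k / f_s) cos(2 pi n f_k / f_s), which is taken here to be the kind of analytical form the abstract refers to. The function names, the 16 kHz sampling rate, the cepstral order of 12, the grid values, and the squared-error criterion are all assumptions made for the example. The temporal constraint and the trained residual are omitted.

```python
import math
from itertools import product


def vtr_to_cepstrum(freqs_hz, bws_hz, n_ceps=12, fs_hz=16000.0):
    """Forward mapping: resonance frequencies/bandwidths -> cepstra c_1..c_n.

    Assumed closed form (all-pole model, conjugate pole pairs):
        c_n = sum_k (2/n) * exp(-pi*n*b_k/fs) * cos(2*pi*n*f_k/fs)
    """
    ceps = []
    for n in range(1, n_ceps + 1):
        c_n = sum(
            (2.0 / n)
            * math.exp(-math.pi * n * b / fs_hz)
            * math.cos(2.0 * math.pi * n * f / fs_hz)
            for f, b in zip(freqs_hz, bws_hz)
        )
        ceps.append(c_n)
    return ceps


def invert_by_search(target_ceps, freq_grids_hz, bw_grid_hz, fs_hz=16000.0):
    """Invert the mapping by exhaustive search over a discretized grid.

    freq_grids_hz holds one list of candidate frequencies per resonance;
    bw_grid_hz is a single list of candidate bandwidths shared by all
    resonances. Returns the (freqs, bws) pair with the smallest squared
    cepstral error, together with that error.
    """
    n_ceps = len(target_ceps)
    best, best_err = None, math.inf
    for freqs in product(*freq_grids_hz):
        # Keep resonances in increasing frequency order (F1 < F2 < ...).
        if any(lo >= hi for lo, hi in zip(freqs, freqs[1:])):
            continue
        for bws in product(bw_grid_hz, repeat=len(freqs)):
            pred = vtr_to_cepstrum(freqs, bws, n_ceps, fs_hz)
            err = sum((t - p) ** 2 for t, p in zip(target_ceps, pred))
            if err < best_err:
                best, best_err = (freqs, bws), err
    return best, best_err


# Hypothetical grids and target values, chosen only for the example.
freq_grids = [[300.0, 500.0, 700.0],
              [1200.0, 1500.0, 1800.0],
              [2300.0, 2500.0, 2700.0]]
bw_grid = [80.0, 100.0, 120.0]
target = vtr_to_cepstrum((500.0, 1500.0, 2500.0), (80.0, 100.0, 120.0))
estimate, error = invert_by_search(target, freq_grids, bw_grid)
```

Because the target cepstra here are generated from a grid point, the search returns that point. On real speech the minimum error is nonzero, which is the role played by the prediction residual; an exhaustive search of this kind also grows quickly with grid size, which is why frame-to-frame constraints and dynamic programming matter in practice.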