Abstract

This paper describes a unifying framework for both formant
tracking and speech synthesis using hidden Markov models
(HMMs). The feature vector in the HMM is composed of the
first three formant frequencies, their bandwidths, and their
time derivatives (deltas). Speech is synthesized by generating the most likely
sequence of feature vectors from an HMM trained on a set of
sentences from a given speaker. Higher formant tracking
accuracy can be achieved by finding the most likely formant
track given a distribution of the formants of every sound. This
data-driven formant synthesizer bridges the gap between rule-based
formant synthesizers and concatenative synthesizers by
synthesizing speech that is smooth and resembles the
speaker in the training data.
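
As an illustration of the feature vector described above, the following minimal sketch stacks the first three formant frequencies and their bandwidths with frame-to-frame deltas. It rests on several assumptions not stated in the abstract: deltas are taken over both formants and bandwidths (giving 12 dimensions), the delta is a simple first-order difference, the first frame's delta is set to zero, and the input is non-empty. The names `delta` and `build_features` and the sample values are hypothetical.

```python
from typing import List, Sequence


def delta(track: Sequence[Sequence[float]]) -> List[List[float]]:
    """First-order difference x[t] - x[t-1] per dimension.

    Assumption: the first frame has no predecessor, so its delta is zero.
    """
    out = [[0.0] * len(track[0])]
    for prev, cur in zip(track, track[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def build_features(
    formants: Sequence[Sequence[float]],
    bandwidths: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Build one vector per frame: [F1, F2, F3, B1, B2, B3, deltas of all six]."""
    static = [list(f) + list(b) for f, b in zip(formants, bandwidths)]
    deltas = delta(static)
    return [s + d for s, d in zip(static, deltas)]


# Two illustrative frames (values in Hz, made up for this example).
formants = [[500.0, 1500.0, 2500.0], [520.0, 1480.0, 2500.0]]
bandwidths = [[60.0, 90.0, 120.0], [60.0, 95.0, 120.0]]
features = build_features(formants, bandwidths)
```

Each row of `features` would then serve as one observation for HMM training; a real system would likely use a smoother delta estimate computed over several frames.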
