HMM-Based Smoothing for Concatenative Speech Synthesis

Proc. of the Int. Conf. on Spoken Language Processing

This paper focuses on our recent efforts to further improve
the acoustic quality of the Whistler Text-to-Speech engine. We
have developed an advanced smoothing system that a small
pilot study indicates significantly improves quality. We
represent speech as being composed of a number of frames,
where each frame can be synthesized from a parameter vector.
Each frame is represented by a state in an HMM, where the
output distribution of each state is a Gaussian random vector
consisting of x and Δx. The set of vectors that maximizes the
HMM probability is the representation of the smoothed speech
output. This technique follows our traditional goal of
developing methods whose parameters are automatically
learned from data with minimal human intervention. The
general framework is demonstrated to be robust by maintaining
improved quality with a significant reduction in data.
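
As an illustration of the maximization described above, the following is a minimal sketch and not the Whistler implementation. It makes several assumptions that the abstract does not state: a single scalar parameter per frame, one Gaussian state per frame with known means and variances, and a delta defined as Δx_t = x_t - x_{t-1}. Under those assumptions the log-probability is quadratic in x, so the maximizing trajectory is the solution of a tridiagonal linear system. The function name smooth_trajectory and its arguments are hypothetical.

```python
import numpy as np

def smooth_trajectory(mu, var, dmu, dvar):
    """Return the trajectory x that maximizes the joint Gaussian likelihood.

    Assumption (illustrative): delta x_t = x_t - x_{t-1} for t >= 1.
    mu, var   : per-frame mean and variance of the static parameter x_t
    dmu, dvar : per-frame mean and variance of the delta; index 0 is ignored
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    dmu = np.asarray(dmu, dtype=float)
    dvar = np.asarray(dvar, dtype=float)
    T = len(mu)

    # Minimize sum_t (x_t - mu_t)^2 / var_t
    #        + sum_{t>=1} (x_t - x_{t-1} - dmu_t)^2 / dvar_t.
    # Setting the gradient to zero gives the normal equations A x = b.
    A = np.diag(1.0 / var)
    b = mu / var
    for t in range(1, T):
        w = 1.0 / dvar[t]
        # Contribution of the delta term that couples frames t-1 and t.
        A[t, t] += w
        A[t - 1, t - 1] += w
        A[t, t - 1] -= w
        A[t - 1, t] -= w
        b[t] += w * dmu[t]
        b[t - 1] -= w * dmu[t]
    return np.linalg.solve(A, b)

# A step at a unit boundary: static means jump from 0 to 1, while the
# delta statistics (mean 0, small variance) prefer a slowly varying signal.
step = smooth_trajectory(
    mu=[0, 0, 0, 1, 1, 1],
    var=[1.0] * 6,
    dmu=[0.0] * 6,
    dvar=[0.1] * 6,
)
print(step)
```

A smaller delta variance pulls neighbouring frames closer together, while a smaller static variance keeps each frame nearer its own mean, so the relative variances control how strongly a discontinuity is smoothed. The sketch builds a dense matrix for readability; a banded solver would be the natural choice for long utterances.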