Abstract

The Whistler Text-to-Speech engine was designed so that its model
parameters can be constructed automatically from training data
[7]. This paper focuses on recent improvements in prosody
and acoustic modeling, all of which are derived through the use of
probabilistic learning methods. Whistler can produce synthetic
speech that sounds very natural and resembles the acoustic and
prosodic characteristics of the original speaker. The underlying
technologies used in Whistler can significantly facilitate the
process of creating generic TTS systems for a new language, a
new voice, or a new speech style. The Whistler TTS engine supports
the Microsoft Speech API [10] and requires less than 3 MB of
working memory.