Constructing Stylistic Synthesis Databases from Audio Books

Yong Zhao; Di Peng; Lijuan Wang; Min Chu; Yining Chen; Roger (Peng) Yu; Jun Guo

INTERSPEECH 2006 | September 2006

Published by International Speech Communication Association

In this paper, we explore how to construct stylistic TTS databases from audio books in which a storyteller performs multiple roles. The goal is to identify and build a set of speech corpora, each of which portrays a representative voice style performed by the speaker and contains enough sentences to synthesize natural speech with a unit selection approach. We solve the problem in two steps. First, each role is represented with a Gaussian Mixture Model (GMM), and all speech data are partitioned into a number of voice style clusters under a criterion that maximizes the likelihood of all utterances with respect to the roles’ speaker models. Second, the clusters are pruned using both acoustic and prosodic measures to purify them. The resulting four voice styles are subjectively interpreted as Neutral, Young, Elder and Adult, respectively. Perceptual experiments show that the proposed approach can synthesize speech with recognizable voice styles, at an average identification rate of 72.5%, and that the synthesized speech sounds better than speech synthesized with utterances from a single role.
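The likelihood-based partitioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each role model is a small diagonal-covariance GMM with already-known parameters, that each utterance is a list of feature frames, and that an utterance is simply assigned to the model giving it the highest average per-frame log-likelihood. The names `assign_utterances`, `role_a`, `role_b` and the toy feature values are hypothetical, and the pruning step is not covered.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_frame_loglik(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) triples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))


def utterance_loglik(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one GMM."""
    return sum(gmm_frame_loglik(x, gmm) for x in frames) / len(frames)


def assign_utterances(utterances, models):
    """Assign each utterance to the model that gives it the highest likelihood."""
    return {
        utt_id: max(models, key=lambda name: utterance_loglik(frames, models[name]))
        for utt_id, frames in utterances.items()
    }


# Hypothetical role models: each is a list of (weight, mean, variance) components.
models = {
    "role_a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "role_b": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}

# Hypothetical utterances: each is a list of 2-dimensional feature frames.
utterances = {
    "utt1": [[0.2, -0.1], [0.9, 1.2]],
    "utt2": [[4.8, 5.3], [5.1, 4.7]],
}

assignment = assign_utterances(utterances, models)
```

In practice the model parameters would be estimated from each role's speech rather than written by hand, and the frames would be acoustic features extracted from the recordings. The sketch only shows how a maximum-likelihood criterion turns per-role models into a partition of the utterances.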