Voice Conversion with Neural Network

Established: March 24, 2014

Sequence Error (SE) Minimization Training of Neural Network for Voice Conversion

Neural network (NN) based voice conversion, which employs a nonlinear function to map features from a source speaker to a target speaker, has been shown to outperform GMM-based voice conversion approaches. However, NN-based voice conversion still has limitations: the NN is trained on a frame error (FE) minimization criterion, and its weights are adjusted to minimize the squared error over the whole source-target stereo training data set. In this paper, we borrow the idea of sentence-level optimization from minimum generation error (MGE) training in HMM-based TTS synthesis and change the criterion from frame error (FE) minimization to sequence error (SE) minimization in NN training for voice conversion. The conversion error over a training sentence from a source speaker to a target speaker is minimized via a gradient descent-based back propagation (BP) procedure. Experimental results show that speech converted by an NN that is first trained with frame error minimization and then refined with sequence error minimization sounds subjectively better than speech converted by an NN trained with frame error minimization only. Scores on both naturalness and similarity to the target speaker are improved.
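The two-stage schedule described above (frame error training first, then sequence error refinement) can be illustrated with a toy sketch. This is not the authors' implementation: a 1-D affine map stands in for the NN, the data are made up, and the sequence error is assumed here to be the squared error summed over all frames of one sentence, with one gradient step per sentence. The page does not define the sequence error in more detail, so the real criterion may differ. All function and variable names are hypothetical.

```python
# Toy sketch: a 1-D affine map y = w*x + b stands in for the NN.
# Names, data, and the exact form of the sequence error are assumptions.

def convert(params, frames):
    """Map source frames to converted frames."""
    w, b = params
    return [w * x + b for x in frames]

def frame_error(params, x, y):
    """Squared error of a single source/target frame pair."""
    w, b = params
    return (w * x + b - y) ** 2

def sequence_error(params, src, tgt):
    """Squared error summed over every frame of one sentence (assumed form)."""
    return sum(frame_error(params, x, y) for x, y in zip(src, tgt))

def fe_epoch(params, sentences, lr):
    """Frame-level pass: one gradient step per frame pair."""
    w, b = params
    for src, tgt in sentences:
        for x, y in zip(src, tgt):
            d = w * x + b - y
            w, b = w - lr * 2 * d * x, b - lr * 2 * d
    return (w, b)

def se_epoch(params, sentences, lr):
    """Sentence-level pass: accumulate the gradient over a whole sentence,
    then take a single step."""
    w, b = params
    for src, tgt in sentences:
        gw = gb = 0.0
        for x, y in zip(src, tgt):
            d = w * x + b - y
            gw += 2 * d * x
            gb += 2 * d
        w, b = w - lr * gw, b - lr * gb
    return (w, b)

# Hypothetical time-aligned (stereo) data: two sentences of 1-D features.
sentences = [
    ([0.0, 0.5, 1.0], [1.0, 2.0, 3.0]),
    ([0.2, 0.4, 0.8, 1.0], [1.4, 1.8, 2.6, 3.0]),
]

# Stage 1: train with the frame error criterion.
params = (0.0, 0.0)
for _ in range(50):
    params = fe_epoch(params, sentences, lr=0.05)
fe_total = sum(sequence_error(params, s, t) for s, t in sentences)

# Stage 2: refine the same parameters with the sequence error criterion.
for _ in range(50):
    params = se_epoch(params, sentences, lr=0.05)
se_total = sum(sequence_error(params, s, t) for s, t in sentences)

converted = convert(params, sentences[0][0])
```

In this toy setting the two criteria share the same optimum and differ only in how the error is grouped before each update, so it shows the training schedule rather than the perceptual gains reported above.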

Some samples

Source   Target   FE           SE

BDL      SLT      BDL to SLT   BDL to SLT
BDL      SLT      BDL to SLT   BDL to SLT
SLT      BDL      SLT to BDL   SLT to BDL
SLT      BDL      SLT to BDL   SLT to BDL

SLT      CLB      SLT to CLB   SLT to CLB
SLT      CLB      SLT to CLB   SLT to CLB
CLB      SLT      CLB to SLT   CLB to SLT
CLB      SLT      CLB to SLT   CLB to SLT

RMS      BDL      RMS to BDL   RMS to BDL
RMS      BDL      RMS to BDL   RMS to BDL
BDL      RMS      BDL to RMS   BDL to RMS
BDL      RMS      BDL to RMS   BDL to RMS

 

People

  • Frank Soong

    Principal Researcher and Manager, Speech Group