Robustness to Telephone Handset Distortion in Speaker Recognition by Discriminative Feature Design

  • Larry Heck
  • Yochai Konig
  • M. Kemal Sonmez
  • Mitch Weintraub

Speech Communication, Vol 31: pp. 181-192

A deep neural network (deep learning) method is described for designing speaker recognition features that are robust to telephone handset distortion. The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network. The neural network is discriminatively trained to maximize speaker recognition performance specifically in the setting of telephone handset mismatch between training and testing. The algorithm requires neither stereo recordings of speech during training nor manual labeling of handset types either in training or testing. Results on the 1998 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation corpus show relative improvements as high as 28% for the new multilayered perceptron (MLP)-based features as compared to a standard mel-cepstral feature set with cepstral mean subtraction (CMS) and handset-dependent normalizing impostor models.
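The baseline named in the abstract applies cepstral mean subtraction (CMS). As a rough illustration of why CMS helps with handset distortion, the sketch below assumes the handset acts as a fixed linear channel, which appears as a constant offset on every cepstral frame; subtracting the per-utterance mean then removes that offset. The function name, the toy frames, and the offset values are hypothetical and are not taken from the paper; this is a sketch of the general CMS technique only, not of the MLP-based features the paper proposes.

```python
from typing import List


def cepstral_mean_subtraction(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-utterance mean of each cepstral coefficient.

    `frames` is a list of frames, each a list of cepstral coefficients.
    (Hypothetical helper for illustration; not code from the paper.)
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient taken over all frames of the utterance.
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance: 3 frames x 2 coefficients (made-up numbers).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Assumption: a fixed handset channel adds the same offset to every frame.
channel_offset = [0.5, -1.5]
distorted = [[c + o for c, o in zip(f, channel_offset)] for f in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_distorted = cepstral_mean_subtraction(distorted)
```

Under this idealized assumption, `normalized_clean` and `normalized_distorted` coincide. Real handsets also introduce non-linear effects that a constant offset does not capture, which is the kind of residual mismatch the non-linear, discriminatively trained transform described above is meant to address.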