A Comparison of Neural Network Feature Transforms for Speaker Diarization

  • Sree Harsha Yella,
  • Andreas Stolcke

Proc. Interspeech

Published by ISCA - International Speech Communication Association

Speaker diarization finds contiguous speaker segments in an audio stream and clusters them by speaker identity, without using a priori knowledge about the number of speakers or enrollment data. Diarization typically clusters speech segments based on short-term spectral features. In prior work, we showed that neural networks can serve as discriminative feature transformers for diarization by training them to perform same/different speaker comparisons on speech segments, yielding improved diarization accuracy when combined with standard MFCC-based models. In this work, we explore a wider range of neural network architectures for feature transformation, adding layers and nonlinearities and varying the objective function during training. We find that the original speaker comparison network can be improved by adding a nonlinear transform layer, and that further gains are possible by training the network to perform speaker classification rather than comparison. Overall, we achieve relative reductions in speaker error between 18% and 34% on a variety of test data from the AMI, ICSI, and NIST-RT corpora.
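
The sketch below illustrates the general idea of the best-performing variant described in the abstract: a network trained for speaker classification whose hidden-layer activations are then used as a discriminative feature transform for diarization clustering. It is a minimal illustration only, not the authors' exact architecture; the framework (PyTorch), layer sizes, speaker count, and the names `SpeakerClassifierTransform` and `transform` are all assumptions introduced here for exposition.

```python
# Minimal sketch: a speaker-classification MLP whose bottleneck activations
# serve as transformed features for diarization clustering. Architecture,
# dimensions, and names are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class SpeakerClassifierTransform(nn.Module):
    def __init__(self, input_dim, hidden_dim, bottleneck_dim, num_speakers):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),                          # nonlinear transform layer
            nn.Linear(hidden_dim, bottleneck_dim),
            nn.Tanh(),
        )
        # classification head used only during training
        self.classifier = nn.Linear(bottleneck_dim, num_speakers)

    def forward(self, x):
        return self.classifier(self.encoder(x))

    def transform(self, x):
        # at diarization time, drop the classifier and keep the
        # bottleneck activations as the transformed features
        with torch.no_grad():
            return self.encoder(x)

# Illustrative sizes: 10 stacked 19-dim MFCC frames -> 190-dim input
model = SpeakerClassifierTransform(input_dim=190, hidden_dim=512,
                                   bottleneck_dim=40, num_speakers=200)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy training step on random data standing in for MFCC windows
frames = torch.randn(32, 190)               # batch of stacked MFCC frames
labels = torch.randint(0, 200, (32,))       # speaker identity targets
loss = criterion(model(frames), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Transformed features to cluster alongside the standard MFCC-based models
features = model.transform(frames)          # shape: (32, 40)
```

In this reading, the same/different comparison network of the prior work would instead be trained on pairs of segment representations, while the classification objective above replaces that pairwise comparison with a multi-class speaker target, which the abstract reports as giving further gains.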