Abstract

Recently, we proposed an ensemble speaker and speaking
environment modeling (ESSEM) framework to characterize
speaker variability and speaking environments. In contrast to
multi-style training, ESSEM uses single-style training to prepare
multiple sets of environment-specific acoustic models. The
ensemble of these acoustic models forms a prior structure of the
environment for flexible prediction of an unknown environment
during testing. In this study, we present methods to further improve
the precision of model characterization. We first study a weighted
N-best information technique that makes better use of the N-best
transcription hypotheses during unsupervised adaptation.
Next, we introduce cohort selection and environment space
adaptation techniques that improve the resolution and
coverage of the prior structure online. By integrating the proposed
methods, we further improve ESSEM performance over our
previous study. On the Aurora-2 task, ESSEM achieves an average
word error rate (WER) of 4.64%, corresponding to a 15.64%
relative WER reduction (from 5.50% to 4.64% WER) over our best
baseline result, obtained with multi-condition training.
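
The relative reduction quoted above follows directly from the two WER figures. A minimal sketch of that arithmetic is given here; the helper name `relative_wer_reduction` is illustrative and does not come from the ESSEM work.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, in percent, with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER figures as reported for the Aurora-2 task (in percent).
reduction = relative_wer_reduction(5.50, 4.64)
print(f"{reduction:.2f}%")  # prints "15.64%"
```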
