In statistical learning theory, good generalization
capability refers to small performance degradation when the
model is evaluated on unseen testing data that are drawn from the
same distribution as the training data, i.e., in the matched training-testing
case. Recently, the soft-margin estimation (SME) method was
proposed to improve the acoustic model's generalization capability
for clean speech recognition and has achieved success. In this paper,
we study the generalization capability of acoustic models for
robust speech recognition, where the training and testing data follow
different distributions (i.e., the mismatched training-testing case).
Our analysis of the effect of noise on the log-likelihood values of
noisy speech features shows that, although mismatch exists between testing
and training data, it is still possible to achieve better robustness
by improving the acoustic model's generalization capability
using SME. This is confirmed by our experimental study on
Aurora-2 and Aurora-3 tasks, where SME improves recognition
performance significantly for both matched and low/medium
mismatched testing cases. However, the improvement in severely
mismatched cases is relatively small. To alleviate the violation of
the SME assumption that training and testing data follow the same
distribution, we apply mean and variance normalization (MVN)
to the speech features prior to model training. Experimental
study shows that when training-testing mismatch is reduced,
SME delivers better performance improvement. We expect SME
to improve the robustness of speech recognition further when it
is combined with other robustness methods. Although this study
is conducted on noisy speech recognition tasks, the method and findings in
this paper make no assumption about the type of distortion, and can
be extended to deal with different types of distortions in other
machine learning applications.
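The MVN step mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the per-utterance, per-dimension normalization convention and the feature matrix shape are assumptions.

```python
import numpy as np

def mvn(features, eps=1e-8):
    """Mean and variance normalization (MVN) of speech features.

    features: array of shape (num_frames, num_dims), e.g. MFCC
    vectors for one utterance. Each feature dimension is shifted
    to zero mean and scaled to unit variance over the utterance,
    reducing the training-testing mismatch caused by additive
    noise and channel effects.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # eps guards against division by zero for constant dimensions.
    return (features - mean) / (std + eps)

# Example: normalize a synthetic 100-frame, 13-dimensional
# feature matrix with a nonzero offset and non-unit scale.
feats = np.random.randn(100, 13) * 3.0 + 5.0
norm = mvn(feats)
```

After normalization each column of `norm` has (approximately) zero mean and unit variance, so clean-trained and noisy-test features are brought onto a more comparable scale before SME training.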