Abstract

This paper addresses the problem of developing appropriate features for direct modeling approaches to speech recognition, such as those based on Maximum Entropy models or Segmental Conditional Random Fields (CRFs). We propose a feature based on the detection of word-level templates that are chosen discriminatively according to a mutual information criterion. The templates for a word are derived directly from MFCC feature vectors, based on self-similarity across examples of that word. No pronunciation dictionary is used, and the resulting templates match in-class examples closely and out-of-class examples poorly. The template detection events serve as input to a segmental CRF speech recognizer. We evaluate the entire scheme on a voice search task. The results show that using discriminative template-based word detector streams improves the speech recognizer’s performance over the HMM baseline.
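To make the mutual information criterion concrete, the following is a minimal sketch of how candidate templates for a word might be ranked. It assumes that each candidate is summarized by a binary sequence recording whether it fired on each utterance, and that a second binary sequence records whether the word was actually present. The function name, the binary encoding, and the toy data are illustrative assumptions, not details taken from the paper.

```python
import math

def mutual_information(fired, present):
    """Mutual information in bits between two equal-length binary sequences.

    fired[i]   -- 1 if the (hypothetical) template detector fired on utterance i
    present[i] -- 1 if the target word actually occurs in utterance i
    """
    n = len(fired)
    # Count joint occurrences of (detector output, word label).
    joint = {}
    for f, p in zip(fired, present):
        joint[(f, p)] = joint.get((f, p), 0) + 1

    mi = 0.0
    for (f, p), count in joint.items():
        p_joint = count / n
        # Marginals are recovered by summing the joint counts.
        p_f = sum(c for (a, _), c in joint.items() if a == f) / n
        p_p = sum(c for (_, b), c in joint.items() if b == p) / n
        mi += p_joint * math.log2(p_joint / (p_f * p_p))
    return mi

# Toy data (assumed): word presence labels and two candidate templates.
present = [1, 1, 0, 0]
candidates = {
    "template_a": [1, 1, 0, 0],  # fires exactly when the word is present
    "template_b": [1, 0, 1, 0],  # fires independently of the word
}

# Keep the candidate whose detections are most informative about the word.
best = max(candidates, key=lambda name: mutual_information(candidates[name], present))
```

In this toy setting the detector that fires exactly on the in-class utterances carries one full bit of information about the word, while the detector that fires independently of the word carries none, so the former would be selected.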