Time-Frequency Features for Speech Recognition
- Jasha Droppo
Conventional speaker-independent, continuous speech recognition systems are built upon assumptions that are, in general, not met. This dissertation focuses on one deficiency in particular: the non-stationary speech signal is modeled as a single series of stationary spectral estimates. Time-frequency representations (TFRs) have the potential to be powerful features for non-stationary signals. Whereas short-term spectral estimates must make implicit time and frequency resolution tradeoffs, a single TFR simultaneously contains both short-term and long-term spectral estimates. Unfortunately, the proper way to harness this power is still a matter of debate. This dissertation proposes a class-dependent time-frequency feature for speech recognition. The feature is automatically derived from time-frequency representations of speech signals by maximizing the discriminability within classes. A two-stage speech recognition system incorporating these representations achieves a 1.6% error rate, which is 39% lower than the best published result for the chosen task.
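The time/frequency resolution tradeoff the abstract refers to can be seen in the most basic TFR, the short-time Fourier transform magnitude (spectrogram): a fixed analysis window length sets both the time resolution (the hop between frames) and the frequency resolution (the spacing of frequency bins). The following is a minimal illustrative sketch, not taken from the dissertation; the function name and parameters are hypothetical.

```python
import numpy as np

def stft_spectrogram(x, win_len=256, hop=128):
    """Magnitude spectrogram: a basic time-frequency representation.
    win_len fixes the tradeoff: longer windows give finer frequency
    bins (fs / win_len apart) but coarser time localization."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # One-sided magnitude spectrum per frame: (n_frames, win_len//2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1 kHz tone sampled at 8 kHz: energy concentrates near one bin.
fs = 8000
t = np.arange(fs) / fs
S = stft_spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(S.mean(axis=0).argmax())
print(S.shape, peak_bin)  # bin width fs/win_len = 31.25 Hz, so peak at 1000/31.25 = 32
```

With win_len=256 at 8 kHz, each bin spans 31.25 Hz and each frame spans 32 ms; doubling the window halves the bin width but doubles the frame span. A TFR-based feature, as proposed in the dissertation, aims to sidestep committing to a single such tradeoff.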