Time-Frequency Features for Speech Recognition
- Jasha Droppo
Conventional speaker-independent, continuous speech recognition systems are built upon assumptions that are, in general, not met. This dissertation focuses on one deficiency in particular: the non-stationary speech signal is modeled as a single series of stationary spectral estimates. Time-frequency representations (TFRs) have the potential to be powerful features for non-stationary signals. Whereas short-term spectral estimates must make implicit time and frequency resolution tradeoffs, a single TFR simultaneously contains both short-term and long-term spectral estimates. Unfortunately, the proper way to harness this power is still a matter of debate. This dissertation proposes a class-dependent time-frequency feature for speech recognition. The feature is automatically derived from time-frequency representations of speech signals by maximizing the discriminability within classes. A two-stage speech recognition system incorporating these representations achieves a 1.6% error rate, which is 39% lower than the best published result for the chosen task.
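The time/frequency resolution tradeoff the abstract refers to can be seen in the most basic TFR, the short-time Fourier transform magnitude (spectrogram): a fixed analysis window length sets both the time resolution (the hop between frames) and the frequency resolution (the spacing of frequency bins). The following is a minimal illustrative sketch, not taken from the dissertation; the function name and parameters are hypothetical.

```python
import numpy as np

def stft_spectrogram(x, win_len=256, hop=128):
    """Magnitude spectrogram: a basic time-frequency representation.
    win_len fixes the tradeoff: longer windows give finer frequency
    bins (fs / win_len apart) but coarser time localization."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # One-sided magnitude spectrum per frame: (n_frames, win_len//2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1 kHz tone sampled at 8 kHz: energy concentrates near one bin.
fs = 8000
t = np.arange(fs) / fs
S = stft_spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(S.mean(axis=0).argmax())
print(S.shape, peak_bin)  # bin width fs/win_len = 31.25 Hz, so peak at 1000/31.25 = 32
```

With win_len=256 at 8 kHz, each bin spans 31.25 Hz and each frame spans 32 ms; doubling the window halves the bin width but doubles the frame span. A TFR-based feature, as proposed in the dissertation, aims to sidestep committing to a single such tradeoff.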