Joint Encoding of the Waveform and Speech Recognition Features Using a Transform Codec

Xing Fan; Mike Seltzer; Jasha Droppo; Henrique S. Malvar; Alex Acero

International Conference on Acoustics, Speech and Signal Processing | May 2011

Published by Institute of Electrical and Electronics Engineers, Inc.

We propose a new transform speech codec that jointly encodes a wideband waveform and its corresponding wideband and narrowband speech recognition features. For distributed speech recognition, wideband features are compressed and transmitted as side information. The waveform is then encoded in a manner that exploits the information already captured by the speech features. Narrowband speech acoustic features can be synthesized at the server by applying a transformation to the decoded wideband features. An evaluation conducted on an in-car speech recognition task shows that at 16 kbps our new system typically has essentially no impact on word error rate compared to uncompressed audio, whereas the standard transform codec produces up to a 20% increase in word error rate. In addition, good quality speech is obtained for playback and transcription, with PESQ scores ranging from 3.2 to 3.4.
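
The server-side step, in which narrowband features are synthesized from the decoded wideband features, can be illustrated with a minimal sketch. The abstract does not state the form of the transformation, so the affine map y = A x + b used here is purely an assumption for illustration, and the function name `synthesize_narrowband`, the matrix `A`, the offset `b`, and all numeric values are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def synthesize_narrowband(
    wideband: Sequence[float], A: Matrix, b: Sequence[float]
) -> List[float]:
    """Apply an ASSUMED affine map y = A x + b to one decoded wideband frame.

    The real codec may use a different (for example, nonlinear) transformation;
    this only shows the data flow on the server side.
    """
    if len(A) != len(b):
        raise ValueError("A must have one row per output (narrowband) dimension")
    narrowband = []
    for row, offset in zip(A, b):
        if len(row) != len(wideband):
            raise ValueError("each row of A must match the wideband dimension")
        # Dot product of one row of A with the wideband frame, plus the offset.
        narrowband.append(sum(w * x for w, x in zip(row, wideband)) + offset)
    return narrowband


# Toy values, not taken from the paper: a 3-dimensional wideband frame
# mapped to a 2-dimensional narrowband frame.
decoded_wideband = [1.0, 2.0, 3.0]
A = [[1.0, 0.0, 0.0],
     [0.0, 0.5, 0.5]]
b = [0.0, 1.0]

narrowband = synthesize_narrowband(decoded_wideband, A, b)
print(narrowband)  # → [1.0, 3.5]
```

In such a setup only the wideband features travel as side information, and the narrowband features cost no additional bits because they are derived from what the server has already decoded.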

© 2011 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.