Not All Frames Are Created Equal: Temporal Sparsity for Robust and Efficient ASR

September 10, 2010
Aren Jansen | Johns Hopkins University

Traditional frame-based speech recognition technologies build sequence models of temporally dense vector time-series representations that account for the entirety of the speech signal. However, under non-stationary distortion, the burden of accounting for everything can propagate errors beyond the corrupted frames. I will advocate an alternative strategy where the speech signal is instead (i) transformed into a sparse set of temporal point patterns of the most salient acoustic events and (ii) decoded using explicit models of the temporal statistics of these patterns. Formalized under a point process model framework, the proposed sparse methods are sufficient for clean speech recognition, provide a new avenue to improve noise robustness, and hold potential for significantly increased computational efficiency over their frame-based counterparts.

Speaker Details

Aren Jansen is a Research Scientist at the Human Language Technology Center of Excellence and an Assistant Research Professor in the Center for Language and Speech Processing, both at Johns Hopkins University.
Aren received a B.A. in Physics from Cornell University in 2001. He received an M.S. in Physics, an M.S. in Computer Science, and a Ph.D. in Computer Science from the University of Chicago in 2003, 2005, and 2008, respectively. His interests lie in developing acoustic representations and models for speech technologies from the directions of machine learning and natural science.