Distant Speech Recognition: No Black Boxes Allowed

  • John McDonough | Spoken Language Systems, Saarland University, Saarbruecken, Germany

A complete system for distant speech recognition (DSR) typically consists of several distinct components. Among these are:

  • An array of microphones for far-field sound capture;
  • An algorithm for tracking the positions of the active speaker or speakers;
  • A beamforming algorithm for focusing on the desired speaker and suppressing noise, reverberation, and competing speech from other speakers (see the illustrative sketch after this list);
  • A recognition engine to extract the most likely hypothesis from the output of the beamformer;
  • A speaker adaptation component for adapting to the characteristics of a given speaker as well as to channel effects;
  • Postfiltering to further enhance the beamformed output.
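
To make the beamforming component concrete, the following is a minimal sketch of a frequency-domain delay-and-sum beamformer, the simplest member of the family of algorithms referred to above. It is not the system described in the talk: the function name, the far-field plane-wave model, and the known look direction are simplifying assumptions made here purely for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # speed of sound in air, m/s

def delay_and_sum(signals, mic_positions, direction, fs):
    """Illustrative frequency-domain delay-and-sum beamformer.

    signals       : (num_mics, num_samples) array of microphone recordings
    mic_positions : (num_mics, 3) microphone coordinates in meters
    direction     : length-3 unit vector pointing from the array toward the source
    fs            : sampling rate in Hz
    """
    num_mics, num_samples = signals.shape
    # Far-field model: a mic at position p receives the plane wave
    # (p . direction) / c seconds *earlier* than a mic at the origin.
    advances = mic_positions @ direction / SPEED_OF_SOUND

    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    # Compensate each channel's time advance with a linear phase shift so
    # all channels are aligned on the look direction, then average them.
    steering = np.exp(-2j * np.pi * freqs[None, :] * advances[:, None])
    return np.fft.irfft((spectra * steering).mean(axis=0), n=num_samples)

# Toy usage: a 1 kHz tone arriving broadside at a four-mic linear array.
fs = 16000
t = np.arange(fs) / fs
mics = np.array([[0.05 * m, 0.0, 0.0] for m in range(4)])  # 5 cm spacing
broadside = np.array([0.0, 1.0, 0.0])  # equal path lengths to all mics
out = delay_and_sum(np.tile(np.sin(2 * np.pi * 1000 * t), (4, 1)),
                    mics, broadside, fs)
```

In an actual DSR system the look direction would be supplied by the speaker-tracking component, and the simple channel average would typically be replaced by adaptive weights; that dependency between tracking and beamforming is one instance of the cross-component interactions the talk addresses.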

Moreover, several of these components consist of one or more subcomponents. While it is tempting to isolate and optimize each component individually, experience has shown that such an approach cannot lead to optimal performance. In this talk, we will discuss several examples of the interactions between the individual components of a DSR system. In addition, we will describe the synergies that become possible as soon as each component is no longer treated as a “black box”. To wit, instead of treating each component as having solely an input and an output, it is necessary to peel back the lid and look inside. Only then does it become apparent how the individual components of a DSR system can be viewed not as separate entities, but as the various organs of a complete body, and how optimal performance of such a system can be obtained.

Joint work with:
Kenichi Kumatani, Barbara Rauch, Friedrich Faubel, Matthias Wolfel, and Dietrich Klakow

Speaker Details

John McDonough received the B.S. and M.S. degrees from Rensselaer Polytechnic Institute in Troy, NY, in 1989 and 1992, respectively. He received the Ph.D. degree from the Johns Hopkins University in Baltimore, MD, in April 2000. His Ph.D. advisor was Prof. Fred Jelinek. From January 1993 until August 1997, John worked at the Bolt, Beranek, and Newman Corporation, Cambridge, MA, primarily on large-vocabulary speech recognition systems. From January 2000 until December 2006, he worked at the Interactive Systems Laboratories, University of Karlsruhe, Germany, as a Researcher and Lecturer. Since 2007, he has been with the Institute for Computer Science and Engineering, Intelligent Sensor-Actuator Systems (ISAS), University of Karlsruhe, and also with Spoken Language Systems, Saarland University, Saarbruecken, Germany.

John assembled and supervised the team responsible for collecting multimodal data at the University of Karlsruhe in connection with the EU project CHIL, Computers in the Human Interaction Loop. Dr. McDonough also led the University of Karlsruhe’s research effort for developing far-field ASR technology, and supervised the University of Karlsruhe’s participation in the audio technologies portion of the CHIL evaluation campaigns. From August 2006 until June 2007, Dr. McDonough led the team, consisting of members from the University of Karlsruhe and Saarland University, that developed a system for recognizing simultaneous or overlapping speech captured with eight-channel circular microphone arrays. This system achieved the lowest word error rate in the PASCAL Speech Separation Challenge, Part II.