Abstract

We describe how computer vision can be used in combination with a microphone array to improve speech recognition accuracy in the presence of noise. Speech recognition systems are notoriously susceptible to interfering noise, especially when that noise is unwanted speech. For computer users who cannot or prefer not to wear a headset, a microphone array by itself can significantly improve speech recognition accuracy over a fixed microphone. The improvement comes from steering a beam of sensitivity toward the loudest sound and using the directional sensitivity of the array to improve the signal-to-noise ratio of that source. In a noisy environment, however, the loudest sound is not always the intended source, so unintended noise and speech are picked up. Even when the beam is focused on the user, loud background noise and conversations not directed toward the computer still corrupt speech recognition. To overcome these problems, we propose using computer vision to help determine the location of the user and to infer whether the user is talking to the computer. This information can be used to focus the microphone array beam on the user, filter out background noise not coming from the user, and suppress conversations not intended for the computer.