Sound Capture and Speech Enhancement

Established: July 1, 2002




An important part of design for devices that contain microphones and loudspeakers is the acoustical design of the sound capture system. Any enclosure changes the directivity patterns of the microphones and their frequency response. Even with a well-designed sound capture system, the signal gets distorted by room noise and reverberation. The goal of device design is to overcome the device, room, and noise effects, ultimately producing a clean audio signal good enough for people and machines to understand.


Acoustic echo reduction

Acoustic echo cancellation, a straightforward application of adaptive filters, is one of the oldest signal processing algorithms. A part of every speakerphone, it estimates the portion of the loudspeaker signal that is captured by the microphone, and then subtracts it from the microphone channel. This results in a signal that contains only the speech in the room, which is called the near-end signal. For many years, stereo acoustic echo cancellation was not considered theoretically possible, with many scientists trying to find a solution good enough for engineering purposes. We solved this problem in 2007, by designing the first surround sound echo canceller in the industry, and then productizing it as part of Kinect for Xbox 360.
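The adaptive-filter idea can be illustrated with a minimal single-channel sketch. It uses the normalized least mean squares (NLMS) update, a common textbook choice that is assumed here for illustration and is not necessarily the algorithm shipped in any product. The function name, the three-tap `echo_path`, and the step size `mu` are all hypothetical, and the demo contains no near-end speech, so the residual should simply decay toward zero.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Single-channel NLMS echo canceller (illustrative sketch).

    far_end: samples sent to the loudspeaker
    mic:     samples captured by the microphone
    Returns (residual, weights); the residual is the near-end estimate.
    """
    w = [0.0] * taps   # adaptive estimate of the loudspeaker-to-mic echo path
    x = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for s, d in zip(far_end, mic):
        x = [s] + x[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = d - echo_estimate                     # subtract the estimated echo
        norm = sum(xi * xi for xi in x) + eps     # input energy (regularized)
        step = mu * e / norm
        w = [wi + step * xi for wi, xi in zip(w, x)]
        residual.append(e)
    return residual, w

# Synthetic check with a made-up echo path and white-noise far-end signal.
random.seed(0)
echo_path = [0.6, -0.3, 0.1]   # hypothetical room response
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far_end))]
residual, weights = nlms_echo_canceller(far_end, mic)
```

After adaptation, the first three entries of `weights` approximate `echo_path`. A practical canceller would also need double-talk handling and, for stereo or surround playback, a way to cope with correlated loudspeaker channels, neither of which this sketch attempts.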

A typical audio pipeline includes another component: the echo suppressor. It applies a suppression gain that is based on an estimate of the ratio of the residual echo to the desired signal. This non-linear processing is complementary to linear acoustic echo cancellation.
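A rough sketch of such a suppression gain is shown below. It assumes that per-bin power estimates of the residual echo and of the desired signal are already available (how to obtain them is outside this sketch), and it uses a Wiener-like rule with a gain floor. The function names, the rule itself, and the `min_gain` value are illustrative assumptions.

```python
def echo_suppression_gain(residual_echo_power, desired_power, min_gain=0.1):
    """Real gain in [min_gain, 1] for one frequency bin (illustrative rule)."""
    if desired_power <= 0.0:
        return min_gain                      # nothing worth keeping in this bin
    ratio = residual_echo_power / desired_power
    return max(min_gain, 1.0 / (1.0 + ratio))

def suppress_echo_frame(spectrum, residual_echo_powers, desired_powers,
                        min_gain=0.1):
    """Scale each complex bin of one frame by its suppression gain."""
    return [echo_suppression_gain(r, d, min_gain) * x
            for x, r, d in zip(spectrum, residual_echo_powers, desired_powers)]

# Two bins: equal echo and speech power in the first, no echo in the second.
frame = suppress_echo_frame([2 + 0j, 1j], [1.0, 0.0], [1.0, 1.0])
```

The gain floor keeps heavily suppressed bins from dropping to silence, which is one common way to limit audible artifacts.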

Microphone array processing

Given multiple microphones, called a microphone array, we can combine the signals from them by using a technology called beamforming. The resulting signal contains the speech coming from the desired direction and reduces noise and other speech signals coming from other directions, increasing the understandability of the words. The beamformer converts the microphone array into a directional microphone, and this also helps to reduce the reverberation in the desired signal. The listening direction can be electronically steered by changing the way we mix the signals from the microphones, pointing to the desired sound source when it changes its position or another person starts to talk. An integral part of the microphone array processor is the sound source localizer. It determines the direction of the dominant sound source and points the beam towards it. The sound source localizer needs to address both noise and reverberation challenges.
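The simplest beamformer, delay-and-sum, makes the steering idea concrete: each channel is time-aligned for the chosen direction and the channels are averaged. The sketch below assumes a four-microphone linear array, whole-sample delays, and a noise-free sinusoid, and it adds a crude steered-power search as a stand-in for a sound source localizer. All names and numbers are hypothetical, and production systems typically use more robust designs than this.

```python
import math

def delay_and_sum(channels, delays):
    """Advance channel m by delays[m] samples, then average the channels."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]

def mean_power(signal):
    return sum(v * v for v in signal) / len(signal)

def localize(channels, candidate_steps):
    """Return the inter-microphone delay step with the highest output power."""
    def steered_power(step):
        delays = [m * step for m in range(len(channels))]
        return mean_power(delay_and_sum(channels, delays))
    return max(candidate_steps, key=steered_power)

# A sinusoid with a 12-sample period reaches each microphone 3 samples later
# than the previous one (a source off to one side of the array).
source = [math.sin(2.0 * math.pi * n / 12.0) for n in range(240)]
true_delays = [0, 3, 6, 9]
channels = [[0.0] * d + source[:len(source) - d] for d in true_delays]

steered = delay_and_sum(channels, true_delays)      # beam points at the source
unsteered = delay_and_sum(channels, [0, 0, 0, 0])   # beam points broadside
best_step = localize(channels, range(5))
```

With the beam pointed at the source, the output reproduces the sinusoid. With the beam pointed broadside, the four phase-shifted copies cancel once all channels are active, and the power search selects a step of 3 samples.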

Suppression gain-based spatial filtering is complementary to linear beamforming. The gain is estimated based on the direction of the sound in every frequency bin in every frame – higher if it comes from the desired direction, lower if it is away from it.
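One way to picture such a gain is a smooth function of the angular distance between the direction estimated for a bin and the desired direction. The sketch below assumes the per-bin direction estimates are given, and uses a Gaussian-shaped gain with a floor. The shape, the `beam_width_deg` parameter, and the function names are illustrative assumptions rather than the method used in the products named here.

```python
import math

def spatial_gain(bin_direction_deg, target_direction_deg,
                 beam_width_deg=20.0, min_gain=0.1):
    """Gain near 1 for sound from the target direction, near min_gain otherwise."""
    # Wrap the angular difference into [-180, 180) degrees.
    diff = (bin_direction_deg - target_direction_deg + 180.0) % 360.0 - 180.0
    gain = math.exp(-0.5 * (diff / beam_width_deg) ** 2)
    return max(min_gain, gain)

def apply_spatial_filter(spectrum, bin_directions_deg, target_direction_deg):
    """Scale each complex bin of one frame by its direction-dependent gain."""
    return [spatial_gain(direction, target_direction_deg) * x
            for x, direction in zip(spectrum, bin_directions_deg)]

# Two bins: one arriving from the target direction (30 degrees), one far off it.
filtered = apply_spatial_filter([1 + 1j, 1 + 1j], [30.0, 120.0], 30.0)
```

In this example the first bin passes unchanged, and the second is scaled down to the gain floor.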

This technology has been integrated into the Microsoft RoundTable device, Kinect for Xbox, and Microsoft HoloLens for better capture of the speaker’s voice.

Noise reduction

Given a mixture of speech and noise, a noise suppressor is used to estimate the clean speech signal. It does this by applying a time-varying real gain to the complex value in each frequency bin of each audio frame. Traditionally, the estimation of the suppression gain is based on statistical models of the noise and the speech signals. The gain is derived under the assumption of a Gaussian distribution, which holds for the noise but not for the speech signal. In the past, using more complex and better models for the statistical distribution of the speech signal made the derivation of the suppression gain formula practically impossible.
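As a concrete instance of a statistically derived gain, the sketch below uses the classical Wiener rule, one of the standard suppression rules obtained under the Gaussian assumption. It assumes a per-bin noise power estimate is already available, uses a deliberately crude instantaneous SNR estimate, and the function names and `min_gain` floor are hypothetical.

```python
def wiener_gain(noisy_power, noise_power, min_gain=0.1):
    """Real suppression gain for one frequency bin (Wiener rule with a floor)."""
    if noise_power <= 0.0:
        return 1.0                                    # no noise estimated
    # Crude SNR estimate: power above the noise floor, relative to the noise.
    snr = max(noisy_power / noise_power - 1.0, 0.0)
    return max(min_gain, snr / (1.0 + snr))

def suppress_noise(spectrum, noise_powers, min_gain=0.1):
    """Apply the real gain to each complex bin; the phase is left untouched."""
    return [wiener_gain(abs(x) ** 2, n, min_gain) * x
            for x, n in zip(spectrum, noise_powers)]

# Two bins: one well above the noise floor, one below it.
clean_estimate = suppress_noise([2 + 0j, 0.5j], [1.0, 1.0])
```

Because the gain is real, only the magnitude of each bin changes. Practical suppressors smooth the SNR estimate over time to reduce musical noise, which this sketch does not attempt.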

With the advances in machine learning (ML) and artificial intelligence (AI), we now have a powerful tool for building noise suppressors. After some encouraging initial attempts in the summer of 2016, we made substantial progress by designing and evaluating several noise suppression algorithms based on deep neural networks.

Technology Transfers

Over the past few years, our group has transferred multiple algorithms and code for speech enhancement to Microsoft products. Notable examples include:

  • Microsoft HoloLens: speech enhancement audio processing pipeline for capturing the wearer’s voice and environmental audio.
  • Windows 10: speech enhancement audio pipeline, including support of microphone arrays with arbitrary geometry.
  • Kinect for Windows: the software development kit contains a light version of the audio pipeline for Kinect.
  • Kinect for Xbox 360 and Kinect for Xbox One: speech enhancement audio pipeline. This was the first audio pipeline in the industry to support surround sound echo cancellation and hands-free distant speech recognition.
  • Microsoft Auto Platform: algorithms for speech enhancement.
  • Windows Vista: microphone array support for five preselected geometries.
  • Microsoft RoundTable device: algorithms for speech enhancement.