{"id":488189,"date":"2018-06-12T09:35:37","date_gmt":"2018-06-12T16:35:37","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=488189"},"modified":"2022-04-08T12:58:58","modified_gmt":"2022-04-08T19:58:58","slug":"sound-capture-speech-enhancement","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/sound-capture-speech-enhancement\/","title":{"rendered":"Sound Capture and Speech Enhancement"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"720\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1.jpg\" class=\"attachment-full size-full\" alt=\"Sound capture and speech enhancement: frequency graphic\" style=\"object-position: 73% 53%\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1.jpg 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1-300x113.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1-1024x384.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1-768x288.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1-1536x576.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/06\/SOUNDC1-1600x600.jpg 1600w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 align-self-center\">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div 
class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 id=\"sound-capture-and-speech-enhancement\" class=\"h2\">Sound Capture and Speech Enhancement<\/h1>\n\n\n\n<p>Producing clean audio signals in noisy environments<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 id=\"summary\">Summary<\/h2>\n\n\n\n<p>An important part of design for devices that contain microphones and loudspeakers is the acoustical design of the sound capture system. Any enclosure changes the directivity patterns of the microphones and their frequency response. Even with a well-designed sound capture system, the signal gets distorted by room noise and reverberation. The goal of device design is to overcome the device, room, and noise effects, ultimately producing a clean audio signal good enough for people and machines to understand.<\/p>\n\n\n\n<h3 id=\"acoustic-echo-reduction\">Acoustic echo reduction<\/h3>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"150\" height=\"150\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/acoustic-echo-cancellation-150x150.jpg\" alt=\"acoustic echo cancellation diagram\" class=\"wp-image-489029\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/acoustic-echo-cancellation-150x150.jpg 150w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/acoustic-echo-cancellation-180x180.jpg 180w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/acoustic-echo-cancellation-360x360.jpg 360w\" sizes=\"auto, (max-width: 150px) 100vw, 150px\" \/><\/figure><\/div>\n\n\n\n<p>Acoustic echo cancellation, a straightforward application of adaptive filters, is one of the oldest signal processing algorithms. 
Being part of every speakerphone, it estimates the signal sent to the loudspeaker and captured by the microphone, and then subtracts it from the microphone channel. This results in a signal that contains only the speech in the room, which is called the near-end signal. For many years, stereo acoustic echo cancellation was not considered theoretically possible, with many scientists trying to find a solution good enough for engineering purposes. We solved this problem in 2011, by designing the first surround sound echo canceller in the industry, and then productizing it as part of Kinect for Xbox 360.<\/p>\n\n\n\n<p>A typical audio pipeline includes another component: the echo suppressor. It works by applying a suppression gain, based on the estimation of the proportion of the echo residual to the desired signal. This non-linear processing is complementary to linear acoustic echo cancellation.<\/p>\n\n\n\n<h3 id=\"microphone-array-processing\">Microphone array processing<\/h3>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignright\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"269\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/microphone-array-processing-ARRAY-300x269.jpg\" alt=\"microphone array processing graphs\" class=\"wp-image-490541\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/microphone-array-processing-ARRAY-300x269.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/06\/microphone-array-processing-ARRAY.jpg 487w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/figure><\/div>\n\n\n\n<p>Given multiple microphones, called a microphone array, we can combine the microphone signals using a technology called beamforming. 
The resulting signal preserves the speech coming from the desired direction while attenuating noise, reverberation, and speech coming from other directions, which makes the words easier to understand. The beamformer converts the microphone array into a software-controlled directional microphone: The listening direction can be electronically steered by the way we mix the signals from the microphones, pointing to the desired sound source when it changes its position or another person starts to talk. An integral part of the microphone array processor is the sound source localizer: It determines the direction of either the dominant sound source or even multiple sound sources, so that a beam can be pointed towards each one. The sound source localizer needs to address both noise and reverberation challenges.<\/p>\n\n\n\n<p>Complementary to linear beamforming is suppression gain-based spatial filtering. The gain is estimated based on the direction of the sound in every frequency bin in every frame \u2013 higher if it comes from the desired direction, lower if it comes from elsewhere.<\/p>\n\n\n\n<p>This technology has been integrated into the Microsoft RoundTable device, Kinect for Xbox, and Microsoft HoloLens for better capture of the speaker\u2019s voice.<\/p>\n\n\n\n<h3 id=\"noise-reduction\">Noise reduction<\/h3>\n\n\n\n<p>Given a mixture of speech and noise, a noise suppressor is used to estimate the clean speech signal. Classic signal processing-based methods relied on assumptions about the statistical distributions and stationarity of speech and noise signals. While these techniques worked well in practice for stationary noise, recent advances in machine learning (ML) and artificial intelligence (AI) provide us with even more powerful tools for implementing ML-based noise suppressors. 
Although ML-based techniques already achieve impressive results, a lot of research is still needed to find efficient and robust models for real-time processing on resource-constrained devices without degrading the audio quality.<\/p>\n\n\n\n<h2 id=\"technology-transfers\">Technology transfers<\/h2>\n\n\n\n<p>Over the years, our group has transferred multiple algorithms and code for speech enhancement to Microsoft products. Notable examples include:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Microsoft_HoloLens\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft HoloLens<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a href=\"https:\/\/www.microsoft.com\/en-us\/hololens\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft HoloLens 2<\/a>: speech enhancement audio processing pipeline for capturing the wearer\u2019s voice and environmental audio.<\/li><li><a href=\"https:\/\/www.microsoft.com\/en-us\/windows\" target=\"_blank\" rel=\"noopener noreferrer\">Windows 10<\/a>: speech enhancement audio pipeline, including support for microphone arrays with arbitrary geometry.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/developer.microsoft.com\/en-us\/windows\/kinect\" target=\"_blank\" rel=\"noopener noreferrer\">Kinect for Windows<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: the software development kit contains a light version of the audio pipeline for Kinect. 
Read more about the history of the device <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Kinect\" target=\"_blank\" rel=\"noopener noreferrer\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/marketplace.xbox.com\/en-US\/Product\/Kinect-for-Xbox-360\/66acd000-77fe-1000-9115-d8025858084b\" target=\"_blank\" rel=\"noopener noreferrer\">Kinect for Xbox 360<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Kinect\" target=\"_blank\" rel=\"noopener noreferrer\">Kinect for Xbox One<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: speech enhancement audio pipeline. This was the first audio pipeline in the industry to support surround sound echo cancellation and hands-free distant speech recognition.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Windows_Embedded_Automotive\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft Auto Platform<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: algorithms for speech enhancement.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Windows_Vista\" target=\"_blank\" rel=\"noopener noreferrer\">Windows Vista<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: microphone array support for five preselected geometries.<\/li><li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Microsoft_RoundTable\" target=\"_blank\" rel=\"noopener noreferrer\">Microsoft RoundTable<span class=\"sr-only\"> 
(opens in new tab)<\/span><\/a> device: algorithms for speech enhancement.<\/li><\/ul>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"169\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox-300x169.jpg\" alt=\"Kinect for Xbox One\" class=\"wp-image-489044\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox.jpg 800w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/xbox-343x193.jpg 343w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/figure><\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>The goal of device design is to overcome the device, room, and noise effects, ultimately producing a clean audio signal good enough for people and machines to 
understand.<\/p>\n","protected":false},"featured_media":668841,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[243062,13551,13554],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-488189","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-audio-acoustics","msr-research-area-graphics-and-multimedia","msr-research-area-human-computer-interaction","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2002-07-01","related-publications":[238132,162754,162757,163085,164058,164093,166572,167515,167965,168297,168299,161872,250184,372026,377132,574680,581593,618039,658839,754306,787132,810181,155575,155560,155562,155566,155567,155568,155569,155571,155572,155573,155574,155559,155577,155951,155954,156678,156679,158122,158123,159389,159878,160244],"related-downloads":[],"related-videos":[185671,185697,186630,186803,189076,544887,668202,742930],"related-groups":[144923],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Sebastian Braun","user_id":37688,"people_section":"Project contributors","alias":"sebraun"},{"type":"user_nicename","display_name":"Hannes Gamper","user_id":31943,"people_section":"Project contributors","alias":"hagamper"},{"type":"user_nicename","display_name":"David Johnston","user_id":31562,"people_section":"Project contributors","alias":"davidjo"},{"type":"user_nicename","display_name":"Ivan Tashev","user_id":32127,"people_section":"Project contributors","alias":"ivantash"},{"type":"guest","display_name":"Sahar  Hashemgeloogerdi","user_id":664350,"people_section":"Past interns","alias":""},{"type":"guest","display_name":"Sebastian Braun","user_id":664353,"people_section":"Past 
interns","alias":""},{"type":"guest","display_name":"Long Le","user_id":664377,"people_section":"Past interns","alias":""},{"type":"guest","display_name":"Felicia Lim","user_id":664359,"people_section":"Past interns","alias":""},{"type":"guest","display_name":"Jason Wung","user_id":664365,"people_section":"Past interns","alias":""},{"type":"guest","display_name":"Xing Li","user_id":664371,"people_section":"Past interns","alias":""}],"msr_research_lab":[199565,1161007],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/488189","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":22,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/488189\/revisions"}],"predecessor-version":[{"id":833833,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/488189\/revisions\/833833"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/668841"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=488189"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=488189"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=488189"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=488189"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=488189"}],"curies":[{"name":"wp","href":"https:\/\/a
pi.w.org\/{rel}","templated":true}]}}