DNN Research Improves Bing Voice Search
Speed=good. Accuracy=good. Put them together, and you’ve got a leap forward: recent advancements in Bing Voice Search for Windows Phone enable customers to get faster, more accurate results than ever before.
Those improvements come, in part, from contributions delivered via Microsoft Research’s work on deep neural networks (DNNs). Such networks are a computational framework for automatic pattern recognition that is inspired by the basic circuits of the human brain. Refinements in mathematical formulas, coupled with greater computational power and large data sets, enable DNNs to learn from data and come noticeably closer than traditional speech technologies to matching humans’ ability to recognize speech and images.
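At its core, a DNN of the kind described here stacks several layers of simple units, each transforming its input and passing the result upward until a final layer produces probabilities over recognition classes. The toy sketch below (all layer sizes and weights are hypothetical, not Microsoft’s actual model) shows that basic forward pass:

```python
import numpy as np

def sigmoid(x):
    """Classic squashing nonlinearity used in early DNN acoustic models."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    """Turn raw scores into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_forward(features, weights, biases):
    """Toy forward pass: sigmoid hidden layers, softmax output layer."""
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)                   # hidden layers
    return softmax(weights[-1] @ h + biases[-1])  # class probabilities

rng = np.random.default_rng(0)
# Hypothetical tiny net: 40-dim acoustic features -> two 64-unit hidden
# layers -> 10 output classes (real models are vastly larger).
dims = [40, 64, 64, 10]
weights = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
posteriors = dnn_forward(rng.standard_normal(40), weights, biases)
```

The output is a probability over classes for one frame of audio; a production system repeats this for every frame, with far deeper and wider layers.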
Over the past few years, Frank Seide of Microsoft Research Asia and Li Deng and Dong Yu of Microsoft Research Redmond have been at the forefront of this advance, working with scientists from the Bing Speech team, including Jinyu Li and Jui-Ting Huang, to provide vast improvements in the speed and accuracy of Bing Voice Search.
Their success has been dramatic. With the judicious use of DNNs, that service has seen its speed double in recent weeks, and its word-error rate has improved by 15 percent. In addition, Bing Voice Search now performs significantly better amid noisy conditions.
That breakthrough was showcased in a demonstration in Tianjin, China, which combined three technologies: machine translation, text-to-speech conversion, and automatic speech recognition. The last of these is where the work by Yu, Deng, Seide, Huang, and Li on DNNs began to pay off.
The DNN research enabled a new acoustic model and decoder for Bing Voice Search for Windows Phone, and the decoding runtime worked like a charm in Tianjin.
The application of DNNs for speech recognition, building on recent advances by Geoffrey Hinton at the University of Toronto, is hardly a simple task. DNN models can contain hundreds of millions of parameters, representing patterns of the human voice, and are trained through a process developed by Microsoft Research scientists. Bing’s back-end infrastructure completes the pipeline, which results in an instantaneous user experience.
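Training those hundreds of millions of parameters boils down to repeatedly nudging them in the direction that reduces classification error on labeled speech frames. The fragment below is a minimal illustration of one such gradient step for a single softmax layer (a simplification of full backpropagation; the variables and learning rate are illustrative, not the process Microsoft Research actually used):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(W, b, x, target, lr=0.1):
    """One stochastic-gradient step on cross-entropy loss for a softmax layer.
    Returns the class probabilities computed *before* the update."""
    p = softmax(W @ x + b)
    grad = p.copy()
    grad[target] -= 1.0            # dL/dz for softmax + cross-entropy
    W -= lr * np.outer(grad, x)    # dL/dW = grad * x^T
    b -= lr * grad
    return p

rng = np.random.default_rng(1)
W = np.zeros((5, 8))
b = np.zeros(5)
x = rng.standard_normal(8)
before = sgd_step(W, b, x, target=2)[2]
after = softmax(W @ x + b)[2]
# After the step, the model assigns higher probability to the correct class.
```

Real training repeats steps like this billions of times over thousands of hours of transcribed speech, across all layers at once.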
In May 2011, during the International Conference on Acoustics, Speech and Signal Processing, George Dahl of the University of Toronto and Microsoft Research colleagues Yu, Deng, and Alex Acero presented a paper titled “Large Vocabulary Continuous Speech Recognition with Context-Dependent DBN-HMMs” that reported an unprecedented 16 percent error reduction using the DNN technology they developed during the previous summer on the business-critical Bing Voice Search task.
Subsequently, during Interspeech 2011, the 12th annual Conference of the International Speech Communication Association, Seide, Yu, and Gang Li of Microsoft Research Asia presented a paper called “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks,” detailing their DNN explorations, which produced the most dramatic improvement in speech-recognition accuracy in more than two decades.
Yu, while working closely with Deng and Dahl during the summer of 2010, contributed the “context-dependent” work to enable DNNs to be applied to large-vocabulary speech recognition. Yu recalls being absolutely giddy with enthusiasm when he realized what he and his colleagues had achieved.
“I first realized the effect of the DNN when we successfully achieved significant error-rate reduction on the voice-search data set after implementing the context-dependent deep-neural-network hidden Markov model,” Yu smiles. “It was an exciting moment. I was so excited that I did not sleep that night. I realized that we had made a breakthrough and called Qiang Huo [a Microsoft Research Asia research manager who also has worked on speech recognition] late at night—daytime in China—to describe the ideas and results.”
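In the context-dependent DNN-HMM hybrid Yu describes, the neural network predicts posterior probabilities for context-dependent tied HMM states (“senones”), and a standard conversion turns those posteriors into scores the HMM decoder can use: by Bayes’ rule, p(x|s) is proportional to p(s|x) / p(s). A minimal sketch of that conversion (the senone counts and probabilities here are made-up toy numbers):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Hybrid DNN-HMM scoring: p(x|s) ∝ p(s|x) / p(s).
    The DNN outputs senone posteriors p(s|x); dividing by the senone
    prior p(s) yields a scaled likelihood usable by the HMM decoder."""
    return log_posteriors - log_priors

# Hypothetical toy example: 4 senones (context-dependent tied HMM states)
posteriors = np.array([0.70, 0.15, 0.10, 0.05])  # DNN output for one frame
priors = np.array([0.40, 0.30, 0.20, 0.10])      # senone frequencies in training data
scores = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
best = int(np.argmax(scores))
```

Note that the best-scoring senone need not be the one with the highest raw posterior, because frequent senones are discounted by their priors; here the decoder would then combine such per-frame scores with HMM transition probabilities and the language model.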
Interestingly, the researchers also have discovered that DNNs can learn across languages. This is of critical importance, because speech recognizers must be trained on huge amounts of example speech data—thousands of hours of it—and the burden of transcribing such voluminous files can be reduced significantly when data from one language can help improve accuracy for another.
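One common way such cross-language learning is realized is to share the hidden layers, which act as a language-independent feature extractor, while giving each language its own output layer. The sketch below illustrates that structure only; the class names, layer sizes, and language codes are hypothetical, not the architecture Microsoft deployed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SharedHiddenDNN:
    """Hidden layers shared across languages; only the output
    (softmax) layer is language-specific. Toy sizes throughout."""

    def __init__(self, rng, in_dim=40, hid_dim=64, n_layers=2):
        self.layers = [
            (rng.standard_normal((hid_dim, in_dim if i == 0 else hid_dim)) * 0.1,
             np.zeros(hid_dim))
            for i in range(n_layers)
        ]
        self.heads = {}  # language -> (W_out, b_out)

    def add_language(self, rng, name, n_states):
        hid_dim = self.layers[-1][1].size
        self.heads[name] = (rng.standard_normal((n_states, hid_dim)) * 0.1,
                            np.zeros(n_states))

    def posteriors(self, x, language):
        for W, b in self.layers:       # shared feature extractor
            x = sigmoid(W @ x + b)
        W, b = self.heads[language]    # language-specific classifier
        return softmax(W @ x + b)

rng = np.random.default_rng(0)
net = SharedHiddenDNN(rng)
net.add_language(rng, "en-US", n_states=10)
net.add_language(rng, "zh-CN", n_states=12)
p_en = net.posteriors(rng.standard_normal(40), "en-US")
p_zh = net.posteriors(rng.standard_normal(40), "zh-CN")
```

Because the shared layers are trained on data from every language, a new language with little transcribed speech can reuse them and train only its own output layer, which is why transcription burdens shrink.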
That’s just one example of how this research continues to evolve. As the scientists continue to explore the expanding frontiers of the DNN work and to collaborate accordingly, the DNN-fueled speech-recognition improvements can only continue. Broad-scale speech-to-speech translation, once simply a dream, suddenly seems an alluring possibility.
“Our result significantly advanced the state of the art, both in industry and in the academic community,” Yu says. “Now, most industrial automatic-speech-recognition systems are DNN-based. This also helped to popularize deep learning.
“Before our result, deep learning was only tested on small tasks and did not attract wide attention. I believe this is just the first step in advancing the state of the art. Many difficult problems may be attacked under this framework, which might lead to even greater advances.”
The new, improved Bing Voice Search represents yet another enhancement in the user experience customers can enjoy with a combination of Windows Phone 8 and Bing technologies, following November’s update that brought Bing Translator to the Windows Phone 8 platform. Given the popularity of the latter, the enhancements to Bing Voice Search seem certain to attract even more attention in the mobile-phone realm.