Microsoft Research Blog

Deep-Neural-Network Speech Recognition Debuts

June 14, 2012 | Posted by Microsoft Research Blog

Posted by Rob Knies

Last August, my colleague Janie Chang wrote a feature story titled “Speech Recognition Leaps Forward” that was published on the Microsoft Research website. The article outlined how Dong Yu, of Microsoft Research Redmond, and Frank Seide, of Microsoft Research Asia, had extended the state of the art in real-time, speaker-independent, automatic speech recognition.

Now, that improvement has been deployed to the world. Microsoft is updating the Microsoft Audio Video Indexing Service with new algorithms that enable customers to take advantage of the improved accuracy detailed in a paper Yu, Seide, and Gang Li, also of Microsoft Research Asia, delivered in Florence, Italy, during Interspeech 2011, the 12th annual Conference of the International Speech Communication Association.
The update marks the first time a company has released a speech-recognition algorithm based on deep neural networks (DNNs) in a commercial product.

It’s a big deal. The benefits, says Behrooz Chitsaz, director of Intellectual Property Strategy for Microsoft Research, are improved accuracy and faster processing.

He says that tests have demonstrated that the algorithm provides a 10- to 20-percent relative error reduction and uses about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models.
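To make the term concrete: a relative error reduction measures the share of the older system's errors that the newer one eliminates, rather than the raw difference between the two error rates. The short Python sketch below works through the arithmetic. The error rates in it are hypothetical values chosen only for illustration; they are not figures from the tests Chitsaz describes, and the function name is likewise an assumption.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline's errors that the new system removes."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates, for illustration only (not measured results).
baseline_error = 0.25  # assumed error rate of a Gaussian-mixture-based recognizer
dnn_error = 0.20       # assumed error rate of a DNN-based recognizer

reduction = relative_error_reduction(baseline_error, dnn_error)
print(f"Absolute reduction: {baseline_error - dnn_error:.0%} of all words")
print(f"Relative reduction: {reduction:.0%} of the baseline's errors")
```

With these assumed numbers, a drop of five percentage points in the error rate corresponds to a 20 percent relative reduction, which is why relative figures look larger than the absolute change.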

Importantly, deep neural networks achieve these gains without the need for “speaker adaptation.” In comparison, today’s state-of-the-art technology operates in “speaker-adaptive” mode, in which an audio file is recognized multiple times, and after each time, the recognizer “tunes” itself a little more closely to the specific speaker or speakers in the file, so that the next time, it gets better—an expensive process.

The ultimate goal of automatic speech recognition, Chang’s story indicates, is out-of-the-box speaker-independent services that don’t require user training. Such services are critical in mobile scenarios, at call centers, and in web services for speech-to-speech translation. It’s difficult to overstate the impact that this technology will have as it rolls out across the breadth of Microsoft’s other services and applications that employ speech recognition.

Artificial neural networks are mathematical models of low-level circuits in the human brain. They have been in use for speech recognition for more than 20 years, but only a few years ago did computer scientists gain access to enough computing power to make it possible to build models that are fine-grained and complex enough to show promise in automatic speech recognition.

An intern at Microsoft Research Redmond, George Dahl, now at the University of Toronto, contributed insights into the workings of DNNs and experience in training them. His work helped Yu and teammates produce a paper called “Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition.”

In October 2010, Yu presented the paper during a visit to Microsoft Research Asia. Seide was intrigued by the research results, and the two joined forces in a collaboration that has scaled up the new, DNN-based algorithms to thousands of hours of training data.