Microsoft Research Blog

Thinking outside-of-the-black-box of machine learning on the long quest to perfecting automatic speech recognition


Speech recognition is something we humans do remarkably well, which includes our ability to understand speech even in noisy multi-talker environments. While our natural sophistication at this is something we take for granted, speech recognition researchers continue to pursue refinements and improvements on the frontiers of the research space of automatic speech recognition. Significant technological progress that has been made over decades has shaped automatic speech recognition technology into its current form, which is already powering various Microsoft products, including Cortana, Skype Translator, Presentation Translator, Office Dictation, HoloLens, and Azure Cognitive Services. Yet, there is still a long way to go. Particularly challenging for humans – and almost impossible for machines – is zeroing in on one speaker in a noisy multi-talker environment. A pair of significant recent advances in the field coming out of Microsoft’s AI investments promises to get us even closer to the day in which AI speech recognition surpasses even the abilities of humans to process and understand the dynamic buzz of words in complex interactions and settings and to perhaps leverage speech in ways previously unimagined.

In papers to be presented at Interspeech 2018 in Hyderabad, India, September 2-6, Microsoft AI researchers outline a pair of significant innovations: one in overlapped speech recognition and one in rethinking established methods of temporal modeling for automatic speech recognition.

“Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks,” by Microsoft AI and Research researchers Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, and Fil Alleva, approaches the real-world problem of developing a far-field meeting transcription system that can recognize speech even when utterances of different speakers overlap. While automatic speech recognition technology has made significant progress in recent years thanks to deep learning, when it comes to dealing with speech overlaps, AI still can’t compete with humans, especially in the one realm where humans dominate: zeroing in on one speaker in a noisy multi-talker environment and understanding what the speaker is saying even when his or her voice is overlapped by the chatter of other speakers within earshot – what the researchers call the cocktail party problem. Current automatic speech recognition systems perform poorly when utterances of two or more speakers overlap.

The challenges that need to be overcome include an unknown and varying number of speakers, unknown speaker identities, unknown speech activity segments, and background noise and reverberation.

“In order to separate overlapped speech in real meeting audio, we have to solve two challenges in a speaker-independent fashion: overlap detection and speech separation. In our paper, we jointly addressed these problems by using a neural network and integrating it with traditional signal processing techniques in a cohesive way,” explained Takuya Yoshioka.

The team developed the unmixing transducer, a novel signal processing module that converts multi-channel (multi-microphone) audio signals into a fixed number of separated speech streams, and implemented it using a windowed BLSTM. They also proposed a novel neural network architecture that effectively leverages beamforming capability. Compared with a state-of-the-art neural network-based beamformer, the approach delivered significant gains in meeting transcription performance, especially in multi-talker segments. The team emphasizes that the new method makes no assumptions about the total number of meeting attendees or their identities.

In typical meetings, overlapped speech accounts for more than 10 percent of the speaking time. While this is far too significant to ignore, handling it requires great care because the system must always consider the possibility of overlap. Otherwise, the system ends up inserting many redundant ‘ghost’ words between correct words.

In this system, the unmixing transducer continuously receives microphone signals and generates a fixed number of time-synchronous audio streams. The acoustic signal of each utterance found in the input “spurts” from one of the output channels. When the number of active speakers is fewer than that of the outputs, the extra channels generate zero-valued signals. The signal from each output channel is segmented and transcribed by a back-end speech recognizer connected to that channel.
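To make the output behavior described above concrete, here is a toy sketch in Python. It does not perform any separation and is not the team’s BLSTM-based implementation; it only mimics the output contract: a fixed number of time-synchronous streams, each utterance emerging from one channel, and zeros on channels that are not in use. The function name `route_utterances`, the "first free channel" policy, and the representation of an utterance as a `(start, end, value)` triple are all assumptions made for illustration.

```python
def route_utterances(utterances, num_frames, num_outputs=2):
    """Toy model of the output contract only (no real separation).

    utterances: list of (start, end, value) triples, end exclusive.
                `value` stands in for that utterance's audio samples.
    Returns num_outputs streams, each a list of num_frames samples.
    """
    # Every output channel exists for the whole recording and defaults to zeros.
    streams = [[0.0] * num_frames for _ in range(num_outputs)]
    busy_until = [0] * num_outputs  # frame at which each channel becomes free

    for start, end, value in sorted(utterances):
        # Hypothetical policy: take the first channel that is free at `start`.
        for ch in range(num_outputs):
            if busy_until[ch] <= start:
                break
        else:
            raise ValueError("more simultaneous speakers than output channels")
        for t in range(start, end):
            streams[ch][t] = value
        busy_until[ch] = end
    return streams


# Two overlapping utterances followed by a third, non-overlapping one.
streams = route_utterances([(0, 4, 1.0), (2, 6, 2.0), (7, 9, 3.0)], num_frames=10)
print(streams[0])  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 3.0, 3.0, 0.0]
print(streams[1])  # [0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]
```

In this sketch the overlapped portion (frames 2 and 3) is split cleanly across the two channels, so a recognizer attached to each channel only ever sees one speaker at a time.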

Yoshioka applied the method to recordings of the team’s own meetings and, to his surprise and delight, it worked well. The result was unexpected because real-world overlapped speech recognition had remained a persistent challenge in the community; previous methods had been tested only in simplified laboratory settings, and none had succeeded in real-world settings. “That was the moment I decided to bet on this approach,” said Yoshioka. The team has been actively pursuing the technology, and performance continues to improve.

To their knowledge, it represents the first overlapped speech recognition system that has been demonstrated to work well for actual meetings with no prior assumptions.

“Speech separation or overlapped speech recognition is paramount for far-field conversational speech recognition,” said Yoshioka. “It has a wide range of potential applications, such as meeting assistance and medical dialog transcription. As computers begin to sense the world better and get smarter, they will be able to provide us more effective assistance and help us focus on more important things.”

In the accompanying paper, “Layer Trajectory LSTM,” Microsoft AI researcher Jinyu Li and fellow researchers Changliang Liu and Yifan Gong reassess the potential for innovation in traditional time-based LSTM networks. Jinyu Li described his conceptual approach, saying, “Sometimes deep learning is treated as a black box and researchers just keep trying different model structures without taking a couple of steps back and thinking about why the models work – and what else might be possible.”

Traditional LSTM networks, a type of recurrent neural network (RNN) well-suited to classifying and making predictions based on time series data such as speech, still leave room for improvement in advanced speech recognition. Traditionally, the AI takes speech and builds a layer-by-layer structure to get a phoneme-level abstraction that better models the speech signal over time. What Li’s team proposes in the paper is to separate the tasks of temporal modeling and phoneme classification, assigning them to a time-based LSTM and a layer-based LSTM, respectively. Because every layer of a traditional time-based LSTM carries its own information, a layer-trajectory LSTM can be built to scan all of it, rather than relying only on the top layer’s output as traditional LSTM models do. In other words, Layer Trajectory LSTM uses the outputs of every layer, not just the top time-based LSTM layer, scanning along the depth axis rather than the time axis.
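The split between a time-based LSTM and a layer-based LSTM can be sketched in a few lines of plain Python. This is a rough illustration of the idea as described here, not the paper’s exact formulation: the class `LSTMCell`, the function `layer_trajectory_lstm`, the omission of bias terms, the random untrained weights, and the choice to reset the layer-LSTM state to zeros at every frame are all assumptions made for this sketch.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class LSTMCell:
    """Minimal LSTM cell: random untrained weights, biases omitted for brevity."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix per gate: input, forget, output, candidate.
        self.weights = [[[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                         for _ in range(hidden_size)] for _ in range(4)]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i, f, o, g = (matvec(w, z) for w in self.weights)
        c_new = [sigmoid(fk) * ck + sigmoid(ik) * math.tanh(gk)
                 for ik, fk, gk, ck in zip(i, f, g, c)]
        h_new = [sigmoid(ok) * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def layer_trajectory_lstm(frames, time_cells, layer_cell):
    """frames: one feature vector per time step. Returns one vector per frame."""
    n = layer_cell.hidden_size
    # Time-LSTM state for each layer, carried from frame to frame.
    h_time = [[0.0] * cell.hidden_size for cell in time_cells]
    c_time = [[0.0] * cell.hidden_size for cell in time_cells]
    outputs = []
    for x in frames:
        # Temporal modeling: an ordinary stacked time-LSTM.
        layer_outputs = []
        inp = x
        for l, cell in enumerate(time_cells):
            h_time[l], c_time[l] = cell.step(inp, h_time[l], c_time[l])
            inp = h_time[l]
            layer_outputs.append(h_time[l])
        # Layer-LSTM: scan across depth at this frame, using every layer's
        # output rather than only the top one (state reset per frame here).
        g, m = [0.0] * n, [0.0] * n
        for h_l in layer_outputs:
            g, m = layer_cell.step(h_l, g, m)
        outputs.append(g)  # would feed the phoneme classifier
    return outputs


rng = random.Random(0)
feat, hid, depth = 4, 3, 2
time_cells = [LSTMCell(feat if l == 0 else hid, hid, rng) for l in range(depth)]
layer_cell = LSTMCell(hid, hid, rng)
frames = [[rng.uniform(-1, 1) for _ in range(feat)] for _ in range(5)]
outputs = layer_trajectory_lstm(frames, time_cells, layer_cell)
print(len(outputs), len(outputs[0]))  # 5 3
```

The point of the sketch is structural: the recurrence over time and the recurrence over depth live in separate cells, so the second one could be swapped for a different kind of unit without touching the first.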

“We’re excited about this breakthrough for how it significantly advances LSTM while consistently improving performance across every Microsoft speech recognition product.” – Jinyu Li

“Every layer of standard time-based LSTM has its own information; we built Layer Trajectory LSTM to scan all that untapped information for our phoneme classification and prediction instead of using only the top-layer time-based LSTM.”

Like most of their peers in the space, the team had been using traditional time-based LSTM models and had run many experiments aimed at improving performance, but progress was very challenging. Li had been thinking hard about the shape of the problem and came to wonder whether the real issue was that the model relied on a single time-based LSTM block to perform two very different tasks – temporal modeling of speech signals and layer-by-layer handling of phonemes for classification on the layer axis. What if these two tasks were handled by separate blocks in the model? It was a eureka moment. After implementing the idea, he observed over the next few days of model training that it yielded very good accuracy.

With the two blocks now each having its own assigned tasks and clear goals, they no longer interfered with each other, explained Li. “We didn’t just blindly try different modeling structures; this innovation is based on very clear thinking on what kind of modelling speech recognition should use.”

How innovative is this? “It’s definitely new. The insight that modeling the time sequence and phonetic classification on separate axes extends the LSTM framework in an important new dimension that already has yielded a huge (10%) improvement in quality,” said Li.

Such task decoupling makes it possible to use modeling units other than LSTM for modeling layer dependency, opening a door for flexible model design.

“This is very good technology, not only for the meeting scenario, but for all Microsoft far-field speaker applications,” said Li. “Cortana, Harman Kardon Invoke with Cortana by Microsoft, Skype Translator – all these products are experiencing the benefits of our research.”

At the upcoming Interspeech conference, Microsoft researchers and scientists will be presenting many more papers, listed below. We encourage you to look for these papers and meet the people behind them in Hyderabad, September 2-6, and we look forward to seeing this knowledge applied throughout the field in the coming months.

Microsoft @ Interspeech 2018

If you are in Hyderabad, please take time to chat with us and stop by our booth at location L1. And be sure to check our Interspeech event page.

September 3

17:50 Entity-Aware Language Model as an Unsupervised Re-ranker
We demonstrate an n-best reranking method to incorporate entity relationships from a knowledge-base into a language model without the need for difficult-to-obtain human annotated training data for the ranker.
Hall 1 Mohammad Sadegh Rasooli and Sarangarajan Parthasarathy

September 4

10:00 HoloCompanion: An MR Friend for Everyone
MR56 Annam Naresh, Rushabh Gandhi, Mallikarjuna Rao Bellamkonda, Mithun Das Gupta

10:00 Cycle-Consistent Speech Enhancement
Hall 4 Zhong Meng, Jinyu Li, Yifan Gong and Biing-Hwang (Fred) Juang

10:00 Effect of TTS Generated Audio on OOV Detection and Word Error Rate in ASR for Low-resource Languages
MR12_1 Savitha Murthy, Dinkar Sitaram and Sunayana Sitaram

14:30 Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment
We propose to incorporate paired phone-posteriors as input features into a neural net model for assessing an ESL learner’s pronunciation quality, which improves the evaluation quality of existing methods and gives learners more effective feedback.
Hall 4 Yujia Xiao, Frank Soong and Wenping Hu

September 5

10:00 Layer Trajectory LSTM
We propose the layer trajectory LSTM (ltLSTM), which builds a layer-LSTM using all the layer outputs from a standard multi-layer time-LSTM. Compared with LSTM and variants, which work layer-by-layer and time-by-time, ltLSTM’s joint optimization drives a 9% error rate reduction, model design flexibility, and effective implementation.
Hall 3 Jinyu Li, Changliang Liu and Yifan Gong

10:00 A New Glottal Neural Vocoder for Speech Synthesis
We propose a novel neural network-based vocoder for synthesis, which generates high quality speech with good CPU cost, outperforming traditional glottal vocoders.
Hall 4 Yang Cui, Xi Wang, Lei He and Frank K. Soong

10:00 Homophone Identification and Merging for Code-switched Speech Recognition
We propose a pronunciation-based approach to disambiguate and merge homophones in cross-transcribed multilingual text and a metric to measure authentic word error rate in code-switched speech recognition.
MR Brij Mohan Lal Srivastava and Sunayana Sitaram

17:00 Improved Training for Online End-to-end Speech Recognition Systems
Hall 4 Suyoun Kim, Michael Seltzer, Jinyu Li and Rui Zhao

September 6

10:00 Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
A multi-channel neural network-based separation system is proposed. In contrast to previous methods, which work in “laboratory settings,” the proposed system enables overlapped speech recognition in real, unconstrained meetings.
Hall 3 Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao and Fil Alleva

10:00 Adversarial Feature-Mapping for Speech Enhancement
Hall 4 Zhong Meng, Jinyu Li, Yifan Gong and Biing-Hwang (Fred) Juang

10:00 What to Expect from Expected Kneser-Ney Smoothing
We describe practical extensions and applications of Kneser-Ney Smoothing on Expected Counts that allow for training of a KN LM that takes full advantage of fractional n-gram counts.
Hall 4 Michael Levit, Sarangarajan Parthasarathy and Shuangyu Chang

16:10 Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation
We investigate novel SNR-based loss functions and on-the-fly data augmentation for separating speech from background audio and, as a result, improve on the best published result on the CHiME-2 medium track database.
Hall 1 Hakan Erdogan and Takuya Yoshioka