Microsoft Research Blog

Sounding the Future: Microsoft Research brings its best to ICASSP 2018 in Calgary

May 14, 2018 | By Dimitrios Dimitriadis, Researcher


Speech technology has come a long way since Alexander Graham Bell’s famous “Mr. Watson, come here, I want to see you” became the first speech heard over the telephone in 1876. Today, speech technology has moved into realms such as VoIP, teleconferencing systems and home automation. Its importance has grown rapidly with the emergence of mobile and wearable devices, and many existing and upcoming Microsoft services, devices and algorithms depend on these voice-based interfaces.

As far as things have come, there is still a lot of inefficiency, and the importance of high-performing speech-processing technologies has never been more apparent. Traditional signal processing algorithms that used to be the state of the art, especially in speech recognition and computer vision, are facing performance plateaus. Meanwhile, a new class of algorithms has emerged that can learn directly from data and remain robust in diverse and adverse application environments. The development of speech technologies has accelerated thanks to advances in machine learning and AI. These advances have made voice interfaces more practical and useful, leading to easier and more efficient communication with the machines around us. Experts believe that speech applications are approaching a level of reliability at which everyday use will become second nature.


The 2018 International Conference on Acoustics, Speech and Signal Processing (ICASSP), held in Calgary, Canada, is the world’s largest and most comprehensive technical conference focused on signal processing and its applications; it is the global event for presenting important developments in speech technology. The conference is sponsored by the IEEE Signal Processing Society and has been held annually since 1976. It features world-class speakers, tutorials, exhibits, a show-and-tell event and over 120 presentation and poster sessions. Microsoft’s presence was significant, with researchers presenting over 25 papers on novel machine-learning methods for speech processing. This work should help improve the quality of speech technology in many backend services and devices.

At ICASSP, Microsoft offered a glimpse of future speech services: a world of lightly supervised training, enhanced robustness and more intuitive interaction with machines. Far-field ASR and voice control have become a lot more practical and now work reliably in noisy environments, for example when interacting across a room or handling multiple speakers even when they speak simultaneously. Virtual assistants such as Microsoft Cortana offer a simpler way of accessing information, cueing up songs and building shopping lists, all using just your voice. As part of these applications, multimodal speech processing is gaining more attention, and several of the conference sessions were dedicated to such areas. Microsoft is well placed here, especially considering the size of the team dedicated to advancing the accuracy of speech recognition and improving conversational interfaces overall.

It’s also worth noting that more and more research teams are moving away from doing only core ASR, broadening their focus to include areas such as multi-speaker ASR, language ID, and diarization, all of which are required to build end-to-end applications.

Sounding the Future

Natural language understanding and dialogue systems are two of the next challenges in AI. Using speech and image recognition to analyze inflections and facial expressions as part of a dialogue system will make machines interact more naturally with their human users. Although many researchers expect voice interfaces to become more natural, this remains a big challenge for AI: language interfaces are complex, and responding well requires domain-specific intelligence together with knowledge about effective human-machine interaction. A number of significant Microsoft papers presented at ICASSP advance the conversation in these areas, including “Improving End-of-Turn Detection in Spoken Dialogues by Detecting Speaker Intentions as a Secondary Task”, “The Microsoft 2017 Conversational Speech Recognition System”, “Domain and Speaker Adaptation for Cortana Speech Recognition”, “Sequence Modeling in Unsupervised Single-Channel Overlapped Speech Recognition” and “Towards Language-Universal End-to-End Speech Recognition”.

One of the hottest trends in machine learning is Generative Adversarial Networks (GANs). These systems consist of one neural network, the generator, that produces artificial data and another, the discriminator, that is trained to distinguish fake from real data. Trained against each other, the two networks can create realistic synthetic data that is hard to distinguish from real data. Papers like “Adversarial Teacher-Student Learning for Unsupervised Domain Adaptation”, “Speaker-Invariant Training via Adversarial Learning” and “Adversarial Advantage Actor-critic Model for Task-Completion Dialogue Policy Learning” attest to Microsoft’s pioneering efforts in GANs as applied to AI.
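
To make the tug-of-war between the two networks concrete, here is a minimal sketch of the two competing training objectives. It assumes the standard binary cross-entropy formulation from the GAN literature and is not taken from any of the papers listed above; the function names are illustrative, and the inputs stand in for the probabilities a real discriminator network would output.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy the discriminator tries to minimize.

    d_real: probabilities (in (0, 1)) it assigns to real samples being real.
    d_fake: probabilities it assigns to generated samples being real.
    """
    # Real samples should be scored close to 1.
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    # Generated samples should be scored close to 0.
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss (an assumed, commonly used variant).

    It is low when the discriminator scores generated samples as real.
    """
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A generator whose samples are easily spotted is penalized more heavily.
spotted = generator_loss([0.1])
convincing = generator_loss([0.9])
```

In a full system each network would be updated in turn to lower its own loss, which is what pushes the generated data toward realism.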

As previously noted, ICASSP covers a wide range of technologies that reflect the broader trends in machine learning. A large fraction of the ASR-related papers is dedicated to attention mechanisms, end-to-end modeling and sequence-to-sequence models. Microsoft has been using sequence-to-sequence systems for machine translation; in the case of ASR, there are still important problems to iron out. Nevertheless, Microsoft is advancing the field in these areas with papers like “Advancing Connectionist Temporal Classification with Attention Modeling”, “Advancing Acoustic-to-Word CTC Model”, and “Neural Sequential Malware Detection with Parameters”.

What’s Next?

Clearly, core areas of speech technology like automatic speech recognition and text-to-speech synthesis have reached an impressive level of maturity. But there remain significant open questions around how to use the voice modality to create more natural user interfaces. Much attention was devoted during the ICASSP sessions to far-field speech processing, diarization, speech separation and similar technical challenges. Microsoft’s strong interest in these areas is reflected in multiple papers, including “Developing Far-field Speaker System via Teacher-Student Learning”, “Exploring sequential characteristics in speaker bottleneck feature for text-dependent speaker verification”, and “Efficient Integration of Fixed Beamformers and Speech Separation Networks for Multi-channel Far-Field Speech Separation”.

Challenges remain across the cognitive and behavioral sciences in how to design truly effective and efficient human-computer interaction scenarios. As part of these challenges, it is very likely that affective computing (such as emotion processing) will continue to gain momentum and that most of the prominent problems will be solved. The challenge will ultimately be to combine such increasingly accurate sensing capabilities to improve and elevate human-machine communication in both home and work environments.
