Speech and Dialog Research Group

Established: March 27, 2000

Research in speech recognition, language modeling, language understanding, spoken language systems and dialog systems.


Our goal is to fundamentally advance the state of the art in speech and dialog technology. To achieve this, we are working on all aspects of machine learning, neural network modeling, signal processing, and dialog modeling. Recently, to support our work, we have developed the Microsoft Cognitive Toolkit (CNTK, formerly Computational Network Toolkit), which makes it easy to define complex neural network structures and train them across multiple GPUs with unprecedented efficiency. You can find out more about this work by exploring the projects and individual home pages listed below.

In addition to advancing our basic scientific understanding of natural language processing and advancing the state of the art, our work finds an outlet in Microsoft products such as Cortana, Xbox, and the Project Oxford web services suite. We have developed two of the key services. LUIS (Language Understanding Intelligent Service) makes it very easy for a developer to add language understanding to applications. From a small number of examples, LUIS is able to determine a user’s intent when they talk or type. CRIS (Custom Recognition Intelligent Service) provides companies with the ability to deploy customized speech recognition. The developer uploads sample audio files and transcriptions, and the recognizer is customized to the specific circumstances. This can make recognition far better in unusual circumstances, such as recognition on a factory floor, or outdoors. At runtime, both LUIS and CRIS are accessed via web APIs.

The Speech & Dialog Group is managed by Xuedong Huang.

Previous projects

Former Members







An Introduction to Computational Networks and the Computational Network Toolkit
Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Zhiheng Huang, Brian Guenter, Huaming Wang, Jasha Droppo, Geoffrey Zweig, Chris Rossbach, Jie Gao, Andreas Stolcke, Jon Currey, Malcolm Slaney, Guoguo Chen, Amit Agarwal, Chris Basoglu, Marko Padmilac, Alexey Kamenev, Vladimir Ivanov, Scott Cypher, Hari Parthasarathi, Bhaskar Mitra, Baolin Peng, Xuedong Huang, Microsoft Research, October 1, 2014


Decoding Auditory Attention (in Real Time) with EEG
Edmund Lalor, Nima Mesgarani, Siddharth Rajaram, Adam O'Donovan, James Wright, Inyong Choi, Jonathan Brumberg, Nai Ding, Adrian KC Lee, Nils Peters, Sudarshan Ramenahalli, Jeffrey Pompe, Barbara Shinn-Cunningham, Malcolm Slaney, Shihab Shamma, in Proceedings of the 37th ARO MidWinter Meeting, Association for Research in Otolaryngology (ARO), February 17, 2013


Generating Exact Lattices in the WFST Framework
Daniel Povey, Mirko Hannemann, Gilles Boulianne, Lukas Burget, Arnab Ghoshal, Milos Janda, Martin Karafiat, Stefan Kombrink, Petr Motlicek, Yanmin Qian, Korbinian Riedhammer, Karel Vesely, Ngoc Thang Vu, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 1, 2012



Pitch Change Toolbox

October 2014

This Matlab toolbox implements the pitch-change algorithm described by Slaney, Shriberg and Huang in their Interspeech 2013 paper “Pitch-gesture modeling using subband autocorrelation change detection.” Calculating speaker pitch (or f0) is typically the first computational step in modeling tone and intonation for spoken language understanding. Usually pitch is treated as a fixed, single-valued quantity. The…

Size: 12 MB


Computational Network Toolkit

April 2014

MSR Identity Toolbox (Without Binaries)

October 2013

This is the MSR Identity Toolbox: A MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. Version 1.0 of the Identity Toolbox provides code to implement both the conventional GMM-UBM and the state-of-the-art i-vector-PLDA based speaker-recognition strategies. It…

Size: 2 MB


MSR Identity Toolbox (With Binaries)

October 2013

This is the MSR Identity Toolbox: A MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. Version 1.0 of the Identity Toolbox provides code to implement both the conventional GMM-UBM and the state-of-the-art i-vector-PLDA based speaker-recognition strategies. It…

Size: 52 MB




Human Parity in Speech Recognition

Established: December 1, 2015

This ongoing project aims to drive the state of the art in speech recognition toward matching, and ultimately surpassing, humans, with a focus on unconstrained conversational speech. The goal is a moving target as the scope of the task is broadened from high signal-to-noise speech between strangers (as in the Switchboard corpus) to include scenarios that make recognition more challenging, such as conversation among familiar speakers, multi-speaker meetings, and speech captured in noisy or distant-microphone environments.

From Captions to Visual Concepts and Back

Established: April 9, 2015

We introduce a novel approach for automatically generating image descriptions. Visual detectors, language models, and deep multimodal similarity models are learned directly from a dataset of image captions. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. Human judges consider the captions to be as good as or better than those written by humans 34% of the time.

Eye Gaze and Face Pose for Better Speech Recognition

Established: October 2, 2014

We want to use eye gaze and face pose to understand what users are looking at and what they are attending to, and to use this information to improve speech recognition. Any sort of language constraint makes speech recognition and understanding easier, since we know what words might come next. Our work has shown significant performance improvements in all stages of the speech-processing pipeline, including addressee detection, speech recognition, and spoken-language understanding.

Knowledge Graphs and Linked Big Data Resources for Conversational Understanding

Established: August 13, 2014

Interspeech 2014 Tutorial Web Page

State-of-the-art statistical spoken language processing typically requires significant manual effort to construct domain-specific schemas (ontologies) as well as manual effort to annotate training data against these schemas. At the same time, a recent surge of activity and progress on semantic web-related concepts from the large search-engine companies represents a potential alternative to the manually intensive design of spoken language processing systems. Standards such as schema.org have been…

Dialog and Conversational Systems Research

Established: March 14, 2014

Conversational systems interact with people through language to assist, enable, or entertain. Research at Microsoft spans dialogs that use language exclusively or in conjunction with additional modalities like gesture; where language is spoken or in text; and in a variety of settings, such as conversational systems in apps or devices, and situated interactions in the real world.

Projects

  • Spoken Language Understanding

Meeting Recognition and Understanding

Established: July 30, 2013

In most organizations, staff spend many hours in meetings. This project addresses all levels of analysis and understanding, from speaker tracking and robust speech transcription to meaning extraction and summarization, with the goal of increasing productivity both during the meeting and after, for both participants and nonparticipants. The Meeting Recognition and Understanding project is a collection of online and offline spoken language understanding tasks. The following functions could be performed both on- and offline, but…

Spoken Language Understanding

Established: May 1, 2013

Spoken language understanding (SLU) is an emerging field between the areas of speech processing and natural language processing. The term spoken language understanding has largely been coined for targeted understanding of human speech directed at machines. This project covers our research on SLU tasks such as domain detection, intent determination, and slot filling, using data-driven methods.

Projects

  • Deeper Understanding: Moving beyond shallow targeted understanding towards building domain-independent SLU models.
  • Scaling SLU: Quickly bootstrapping SLU…

Recurrent Neural Networks for Language Processing

Established: November 23, 2012

This project focuses on advancing the state of the art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding, and meaning representation. A special interest is adding side-channels of information as input, to model phenomena that are not easily handled in other frameworks. A toolkit for doing RNN language modeling with side-information is in the associated download. Sample word vectors for use with this toolkit…

Speech Technology for Corpus-based Phonetics

Established: March 1, 2011

This project aims to develop new tools for phonetics research on large speech corpora without requiring traditional phonetic annotations by humans.  The idea is to adapt tools from speech recognition to replace the costly and time-consuming annotations usually required for phonetics research. This project is funded by an NSF grant "New tools and methods for very-large-scale phonetics research" to UPenn and SRI, with a Microsoft researcher as a consultant.

Understand User’s Intent from Speech and Text

Established: December 17, 2008

Understanding what users want to do or need to get is critical in human-computer interaction. When a natural user interface such as speech or natural language is used in human-computer interaction, for example in a spoken dialogue system or with an internet search engine, language understanding becomes an important issue. Intent understanding is about identifying the action a user wants a computer to take or the information she/he would like to obtain, conveyed in a spoken utterance or…

Whisper: Windows Highly Intelligent Speech Recognizer

Established: December 16, 2008

Our first ASR system

We are trying to perfect the ability of computers to recognize human speech by building speech and language models that are accurate, efficient, and easy to use. Our goal is to make human-computer interaction more natural. Our speech recognition engine, code-named Whisper (Windows Highly Intelligent SPEech Recognizer), offers state-of-the-art speaker-independent continuous speech recognition. The Whisper speech engine has been shipped by our Speech Products Group as part of the SAPI SDK, which…

Voice Search: Say What You Want and Get It

Established: December 15, 2008

In the Voice Search project, we envision a future where you can ask your cellphone for any kind of information and get it. With a small cellphone, there is a heavy tax on traditional keyboard-based information entry, and we believe it can be significantly more convenient to communicate by voice. Our work focuses on making this communication more reliable and able to cover the full range of information needed in daily life.

Acoustic Modeling

Established: January 29, 2004

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic model. Other acoustic models include segmental models, super-segmental models (including hidden dynamic models), neural networks, maximum entropy models, and (hidden) conditional random fields. Acoustic modeling also encompasses "pronunciation modeling", which describes how a sequence or multi-sequences of fundamental speech units (such as phones or…
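As a rough illustration of what "statistical representations for feature vector sequences" means in the HMM case, the sketch below computes the likelihood of an observation sequence with the forward algorithm. It is a toy under stated assumptions, not the group's implementation: the two-state model, its probabilities, and the function name forward_likelihood are all hypothetical, and observations are discrete symbols standing in for quantized feature vectors, whereas real acoustic models score continuous features.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations) under a discrete HMM via the forward algorithm."""
    n_states = len(initial)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        # Sum over every predecessor state r, then score the new observation.
        alpha = [
            emission[s][obs] * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical 2-state model over 2 quantized feature symbols (0 and 1).
INITIAL = [0.6, 0.4]                    # P(first state)
TRANSITION = [[0.7, 0.3], [0.4, 0.6]]   # TRANSITION[r][s] = P(next = s | current = r)
EMISSION = [[0.9, 0.1], [0.2, 0.8]]     # EMISSION[s][o] = P(symbol o | state s)

LIKELIHOOD = forward_likelihood([0, 1], INITIAL, TRANSITION, EMISSION)
print(LIKELIHOOD)
```

A recognizer would compare such likelihoods across competing models (for example, one per speech unit) and combine them with language model scores.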

Speech Application Language Tags (SALT)

Established: January 29, 2004

SALT is an XML-based API that brings speech interactions to the Web. Starting as a research project that aimed to apply the Web interaction model to spoken dialog, SALT has evolved into an industry standard with more than 70 companies and universities participating in advancing the standard. The design philosophy and brief overview of SALT is published here. Microsoft has announced many SALT-based products, including SALT-aware desktop IE and Pocket IE add-ins.…

Language Modeling for Speech Recognition

Established: January 29, 2004

Did I just say "It's fun to recognize speech?" or "It's fun to wreck a nice beach?" It's hard to tell because they sound about the same. Of course, it's a lot more likely that I would say "recognize speech" than "wreck a nice beach." Language models help a speech recognizer figure out how likely a word sequence is, independent of the acoustics. This lets the recognizer make the right guess when two different sentences…
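The idea of scoring word sequences independently of the acoustics can be sketched with a tiny bigram model. This is a minimal illustration, not the group's language model: the four-sentence training corpus, the add-one smoothing, and the function names train_bigram and log_prob are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy training text (hypothetical); a real language model uses far more data.
CORPUS = [
    "it is fun to recognize speech",
    "we recognize speech with a computer",
    "computers recognize speech",
    "it is fun to walk on a nice beach",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Log probability of a sentence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


UNIGRAMS, BIGRAMS = train_bigram(CORPUS)
LP_SPEECH = log_prob("it is fun to recognize speech", UNIGRAMS, BIGRAMS)
LP_BEACH = log_prob("it is fun to wreck a nice beach", UNIGRAMS, BIGRAMS)
print(LP_SPEECH > LP_BEACH)
```

With this toy corpus the "recognize speech" sentence receives the higher score, partly because its word pairs were seen in training and partly because it is shorter; a recognizer would weigh such scores against the acoustic evidence.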

Multimodal Conversational User Interface

Established: January 29, 2004

Researchers in the Speech Technology group at Microsoft are working to allow the computer to travel through our living spaces as a handy electronic HAL pal that answers questions, arranges our calendars, and sends messages to our friends and family. Most of us use computers to create text, understand numbers, view images, and send messages. There's only one problem with this marvelous machine. Our computer lives on a desktop, and though we command it with…

Speaker Identification (WhisperID)

Established: January 29, 2004

When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound. Home PC Security. In your home, Speaker Identification will make it easier for you to log into your computer, just by saying "Log me in!" Office PC Security. In your office, Speaker ID can add an extra level of protection to…

Personalized Language Model for improved accuracy

Established: January 29, 2004

Traditionally speech recognition systems are built with models that are an average of many different users. A speaker-independent model is provided that works reasonably well for a large percentage of users. But the accuracy can be improved if the acoustic model is personalized to the given user. We have built a service that constantly looks at the user's sent emails to personalize the language model and we've observed a 30% reduction in error rate for…

Automatic Grammar Induction

Established: February 19, 2002

Automatic learning of speech recognition grammars from example sentences to ease the development of spoken language systems. Researcher Ye-Yi Wang wants to have more time for vacation, so he is teaching his computer to do some work for him. Wang has been working on Spoken Language Understanding for the MiPad project since he joined Microsoft Research. He has developed a robust parser and the understanding grammars for several projects. "Grammar development is painful…

Noise Robust Speech Recognition

Established: February 19, 2002

Techniques to improve the robustness of automatic speech recognition systems to noise and channel mismatches

Robustness of ASR Technology to Background Noise

You have probably seen that most people using speech dictation software wear a close-talking microphone. So, why has senior researcher Li Deng been trying to get rid of close-talking microphones? Close-talking microphones pick up relatively little background noise and speech recognition systems can obtain decent accuracy with them. If you are…


Established: February 19, 2002

Your Pad or MiPad

It only took one scientist mumbling at a monitor to give birth to the idea that a computer should be able to listen, understand, and even talk back. But years of effort haven't gotten us closer to the Jetson dream: a computer that listens better than your spouse, better than your boss, and even better than your dog Spot. Using state-of-the-art speech recognition, and strengthening this new science with pen input,…

Whistler Text-to-Speech Engine

Established: November 5, 2001

The talking computer HAL in the 1968 film "2001: A Space Odyssey" had an almost human voice, but it was the voice of an actor, not a computer. Getting a real computer to talk like HAL has proven one of the toughest problems posed by "2001." Microsoft's contribution to this field is "Whistler" (Windows Highly Intelligent STochastic taLkER), a trainable text-to-speech engine which was released in 1998 as part of the SAPI 4.0 SDK, and then as…


Speech Recognition Leaps Forward

By Janie Chang, Writer, Microsoft Research

During Interspeech 2011, the 12th annual Conference of the International Speech Communication Association being held in Florence, Italy, from Aug. 28 to 31, researchers from Microsoft Research will present work that dramatically improves the potential of real-time, speaker-independent, automatic speech recognition. Dong Yu, researcher at Microsoft Research Redmond, and Frank Seide, senior researcher and research manager with Microsoft Research Asia, have been spearheading this work, and their teams have collaborated…

August 2011

Microsoft Research Blog

Kinect Audio: Preparedness Pays Off

By Rob Knies, Senior Editor, Microsoft Research

It always helps to be prepared. Just ask Ivan Tashev. A principal software architect in the Speech group at Microsoft Research Redmond, Tashev played an integral role in developing the audio technology that enabled Kinect for Xbox 360 to become the fastest-selling consumer-electronics device ever, with eight million units sold in its first 60 days on the market. Kinect represents part of Microsoft’s deep investment in natural user…

April 2011

Microsoft Research Blog

Making Car Infotainment Simple, Natural

By Rob Knies, Managing Editor, Microsoft Research

You’re steering with your left hand while your right is punching car-stereo buttons in eager search of that amazing new Lady Gaga song. Your mobile phone rings, and as you adjust your headset—hands-free, naturally—the driver in front of you slams on his brakes … Sound familiar? For drivers, such a scenario is almost commonplace. These days, the automobile is tricked out with all sorts of conveniences, designed to…

November 2009

Microsoft Research Blog