Speech and Dialog Research Group

Established: March 27, 2000

Research in speech recognition, language modeling, language understanding, spoken language systems and dialog systems.


Our goal is to fundamentally advance the state of the art in speech and dialog technology. To achieve this, we work on all aspects of machine learning, neural network modeling, signal processing, and dialog modeling. To support this work, we recently developed the Microsoft Cognitive Toolkit (CNTK, formerly the Computational Network Toolkit), which makes it easy to define complex neural network structures and to train them efficiently across multiple GPUs. You can find out more about this work by exploring the projects and individual home pages listed below.

In addition to advancing our basic scientific understanding of natural language processing and advancing the state of the art, our work finds an outlet in Microsoft products such as Cortana, Xbox, and the Project Oxford web services suite. We have developed two of the key services. LUIS (Language Understanding Intelligent Service) makes it very easy for a developer to add language understanding to applications. From a small number of examples, LUIS is able to determine a user’s intent when they talk or type. CRIS (Custom Recognition Intelligent Service) provides companies with the ability to deploy customized speech recognition. The developer uploads sample audio files and transcriptions, and the recognizer is customized to the specific circumstances. This can make recognition far better in unusual circumstances, such as recognition on a factory floor, or outdoors. At runtime, both LUIS and CRIS are accessed via web APIs.

The Speech & Dialog Group is managed by Xuedong Huang.

Previous projects

Former Members







An Introduction to Computational Networks and the Computational Network Toolkit
Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Zhiheng Huang, Brian Guenter, Huaming Wang, Jasha Droppo, Geoffrey Zweig, Chris Rossbach, Jie Gao, Andreas Stolcke, Jon Currey, Malcolm Slaney, Guoguo Chen, Amit Agarwal, Chris Basoglu, Marko Radmilac, Alexey Kamenev, Vladimir Ivanov, Scott Cypher, Hari Parthasarathi, Bhaskar Mitra, Baolin Peng, Xuedong Huang, Microsoft Research, October 1, 2014


Decoding Auditory Attention (in Real Time) with EEG
Edmund Lalor, Nima Mesgarani, Siddharth Rajaram, Adam O'Donovan, James Wright, Inyong Choi, Jonathan Brumberg, Nai Ding, Adrian KC Lee, Nils Peters, Sudarshan Ramenahalli, Jeffrey Pompe, Barbara Shinn-Cunningham, Malcolm Slaney, Shihab Shamma, in Proceedings of the 37th ARO MidWinter Meeting, Association for Research in Otolaryngology (ARO), February 17, 2013


Generating Exact Lattices in the WFST Framework
Daniel Povey, Mirko Hannemann, Gilles Boulianne, Lukas Burget, Arnab Ghoshal, Milos Janda, Martin Karafiat, Stefan Kombrink, Petr Motlicek, Yanmin Qian, Korbinian Riedhammer, Karel Vesely, Ngoc Thang Vu, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 1, 2012



Pitch Change Toolbox

October 2014

This Matlab toolbox implements the pitch-change algorithm described by Slaney, Shriberg and Huang in their Interspeech 2013 paper “Pitch-gesture modeling using subband autocorrelation change detection.” Calculating speaker pitch (or f0) is typically the first computational step in modeling tone and intonation for spoken language understanding. Usually pitch is treated as a fixed, single-valued quantity. The…

Size: 12 MB


Computational Network Toolkit

April 2014

MSR Identity Toolbox (Without Binaries)

October 2013

This is the MSR Identity Toolbox: A MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. Version 1.0 of the Identity Toolbox provides code to implement both the conventional GMM-UBM and the state-of-the-art i-vector-PLDA based speaker-recognition strategies. It…

Size: 2 MB


MSR Identity Toolbox (With Binaries)

October 2013

This is the MSR Identity Toolbox: A MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. Version 1.0 of the Identity Toolbox provides code to implement both the conventional GMM-UBM and the state-of-the-art i-vector-PLDA based speaker-recognition strategies. It…

Size: 52 MB




Human Parity in Speech Recognition

Established: December 1, 2015

This ongoing project aims to drive the state of the art in speech recognition toward matching, and ultimately surpassing, humans, with a focus on unconstrained conversational speech. The goal is a moving target as the scope of the task is…

Speech Technology for Corpus-based Phonetics

Established: March 1, 2011

This project aims to develop new tools for phonetics research on large speech corpora without requiring traditional phonetic annotations by humans. The idea is to adapt tools from speech recognition to replace the costly and time-consuming annotations usually required for phonetics…

From Captions to Visual Concepts and Back

Established: April 9, 2015

We introduce a novel approach for automatically generating image descriptions. Visual detectors, language models, and deep multimodal similarity models are learned directly from a dataset of image captions. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a…

Eye Gaze and Face Pose for Better Speech Recognition

Established: October 2, 2014

We want to use eye gaze and face pose to understand what users are looking at, to what they are attending, and use this information to improve speech recognition. Any sort of language constraint makes speech recognition and understanding easier…

Dialog and Conversational Systems Research

Established: March 14, 2014

Conversational systems interact with people through language to assist, enable, or entertain. Research at Microsoft spans dialogs that use language exclusively, or in conjunction with additional modalities like gesture; where language is spoken or in text; and in a variety…

Meeting Recognition and Understanding

Established: July 30, 2013

In most organizations, staff spend many hours in meetings. This project addresses all levels of analysis and understanding, from speaker tracking and robust speech transcription to meaning extraction and summarization, with the goal of increasing productivity both during the meeting…

Spoken Language Understanding

Established: May 1, 2013

Spoken language understanding (SLU) is an emerging field in between the areas of speech processing and natural language processing. The term spoken language understanding has largely been coined for targeted understanding of human speech directed at machines. This project covers…

Recurrent Neural Networks for Language Processing

Established: November 23, 2012

This project focuses on advancing the state of the art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding and meaning representation. A special interest is in adding side-channels of information…

Understand User’s Intent from Speech and Text

Established: December 17, 2008

Understanding what users would like to do or need to get is critical in human-computer interaction. When a natural user interface like speech or natural language is used in human-computer interaction, such as in a spoken dialogue system or with an internet search…

Whisper: Windows Highly Intelligent Speech Recognizer

Established: December 16, 2008

Our first ASR system. We are trying to perfect the ability of computers to recognize human speech by building speech and language models that are accurate, efficient, and easy to use. Our goal is to make human-computer interaction more natural.…

Voice Search: Say What You Want and Get It

Established: December 15, 2008

In the Voice Search project, we envision a future where you can ask your cellphone for any kind of information and get it. With a small cellphone, there is a heavy tax on traditional keyboard-based information entry, and we…

Speaker Identification (WhisperID)

Established: January 29, 2004

When you speak to someone, they don't just recognize what you say: they recognize who you are. WhisperID will let computers do that, too, figuring out who you are by the way you sound. Home PC Security. In your home,…

Personalized Language Model for improved accuracy

Established: January 29, 2004

Traditionally speech recognition systems are built with models that are an average of many different users. A speaker-independent model is provided that works reasonably well for a large percentage of users. But the accuracy can be improved if the acoustic…

Multimodal Conversational User Interface

Established: January 29, 2004

Researchers in the Speech Technology group at Microsoft are working to allow the computer to travel through our living spaces as a handy electronic HAL pal that answers questions, arranges our calendars, and sends messages to our friends and family.…

Speech Application Language Tags (SALT)

Established: January 29, 2004

SALT is an XML-based API that brings speech interactions to the Web. Starting as a research project that aimed to apply the Web interaction model to spoken dialog, SALT has evolved into an industry standard with more than 70…

Language Modeling for Speech Recognition

Established: January 29, 2004

Did I just say "It's fun to recognize speech?" or "It's fun to wreck a nice beach?" It's hard to tell because they sound about the same. Of course, it's a lot more likely that I would say "recognize speech"…
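The "recognize speech" versus "wreck a nice beach" contrast can be made concrete with a toy language model. The following is only a minimal sketch, not the group's method: it assumes a tiny made-up corpus, a bigram model with add-one smoothing, and hypothetical helper names (`train_bigram`, `log_prob`). It shows how a language model can prefer one of two similar-sounding transcriptions.

```python
import math
from collections import Counter

# Tiny made-up training corpus (an assumption for illustration only).
CORPUS = [
    "it's fun to recognize speech",
    "we recognize speech with computers",
    "computers recognize speech well",
    "it's fun to walk on a nice beach",
]


def train_bigram(sentences):
    """Count context words and bigrams, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens[:-1], tokens[1:])
    )


model = train_bigram(CORPUS)
hypotheses = [
    "it's fun to recognize speech",
    "it's fun to wreck a nice beach",
]
# Pick the hypothesis the language model considers most likely.
best = max(hypotheses, key=lambda h: log_prob(h, model))
print(best)
```

In a real recognizer the language model score would be combined with an acoustic score for each hypothesis; here the two hypotheses are treated as acoustically equal, so the language model alone decides.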

Acoustic Modeling

Established: January 29, 2004

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic models. Other acoustic models include segmental models, super-segmental models…

Noise Robust Speech Recognition

Established: February 19, 2002

Techniques to improve the robustness of automatic speech recognition systems to noise and channel mismatches. Robustness of ASR Technology to Background Noise: You have probably seen that most people using speech dictation software are wearing a close-talking microphone. So,…


Established: February 19, 2002

Your Pad or MiPad: It only took one scientist mumbling at a monitor to give birth to the idea that a computer should be able to listen, understand, and even talk back. But years of effort haven't gotten us closer…

Automatic Grammar Induction

Established: February 19, 2002

Automatic learning of speech recognition grammars from example sentences to ease the development of spoken language systems. Researcher Ye-Yi Wang wants to have more time for vacation, so he is teaching his computer to do some work for him. Wang…

Whistler Text-to-Speech Engine

Established: November 5, 2001

The talking computer HAL in the 1968 film "2001: A Space Odyssey" had an almost human voice, but it was the voice of an actor, not a computer. Getting a real computer to talk like HAL has proven one of the…


Speech Recognition Leaps Forward

By Janie Chang, Writer, Microsoft Research. During Interspeech 2011, the 12th annual Conference of the International Speech Communication Association being held in Florence, Italy, from Aug. 28 to 31, researchers from Microsoft Research will present work that dramatically improves the potential…

August 2011

Microsoft Research Blog

Kinect Audio: Preparedness Pays Off

By Rob Knies, Senior Editor, Microsoft Research. It always helps to be prepared. Just ask Ivan Tashev. A principal software architect in the Speech group at Microsoft Research Redmond, Tashev played an integral role in developing the audio technology that…

April 2011

Microsoft Research Blog

Making Car Infotainment Simple, Natural

By Rob Knies, Managing Editor, Microsoft Research. You’re steering with your left hand while your right is punching car-stereo buttons in eager search of that amazing new Lady Gaga song. Your mobile phone rings, and as you adjust your headset, hands-free,…

November 2009

Microsoft Research Blog