Speech and Dialog Research Group

Established: March 27, 2000

Research in speech recognition, language modeling, language understanding, spoken language systems and dialog systems.


Our goal is to fundamentally advance the state-of-the-art in speech and dialog technology. To achieve this, we are working in all aspects of machine learning, neural network modeling, signal processing, and dialog modeling. Recently, to support our work, we have developed the Computational Network Toolkit (CNTK), which makes it easy to define complex neural network structures, and train them across multiple GPUs with unprecedented efficiency. You can find out more about this work by exploring the projects and individual home pages listed below.

In addition to advancing our basic scientific understanding of natural language processing, our work finds an outlet in Microsoft products such as Cortana, Xbox, and the Project Oxford web services suite. We have developed two of the key services. LUIS (Language Understanding Intelligent Service) makes it very easy for a developer to add language understanding to applications. From a small number of examples, LUIS is able to determine a user’s intent when they talk or type. CRIS (Custom Recognition Intelligent Service) provides companies with the ability to deploy customized speech recognition. The developer uploads sample audio files and transcriptions, and the recognizer is customized to the specific circumstances. This can make recognition far better in unusual circumstances, such as recognition on a factory floor, or outdoors. At runtime, both LUIS and CRIS are accessed via web APIs.

The Speech & Dialog Group is managed by Geoffrey Zweig.

Previous projects

  • Language Understanding: Don’t  just recognize the words a user spoke, but understand what they mean.
  • Meeting Recognition and Understanding: Make meetings more useful using speech recognition and understanding technology.
  • Noise Robustness: How do we make the system work when background noise is present?
  • Voice search. Users can search for information such as a business from your phone.
  • Automatic Grammar Induction: How do create grammars to ease the development of spoken language systems?
  • (MiPad) Multimodal Interactive Pad. Our first multimodal prototype.
  • SALT (Speech Enabled Language Tags): A markup language for the multimodal web
  • Intent Understanding. Not recognize the words the user says, but understand what they mean.
  • Multimodal Conversational User Interface
  • Personalized Language Model for improved accuracy
  • (Whisper) Speech Recognition. Our previous dictation-oriented speech recognition project is a state-of-the-art general-purpose speech recognizer.
  • (WhisperID) Speaker Identification . Who is doing the talking?
  • Speech Application Programming Interface (SAPI) Development Toolkit. The Whisper speech recognizer can be used by developers to produce applications using speech recognition

Former Members

  • Asela Gunawardana
  • Kuansan Wang
  • Hsiao-Wuen Hon
  • XD Huang
  • Mei-Yuh Hwang
  • Fil Alleva
  • Li Jiang
  • Mike Plumpe



Decoding Auditory Attention (in Real Time) with EEG

Edmund Lalor, Nima Mesgarani, Siddharth Rajaram, Adam O'Donovan, James Wright, Inyong Choi, Jonathan Brumberg, Nai Ding, Adrian KC Lee, Nils Peters, Sudarshan Ramenahalli, Jeffrey Pompe, Barbara Shinn-Cunningham, Malcolm Slaney, Shihab Shamma

February 2013

Association for Research in Otolaryngology (ARO)


Pitch Change Toolbox

October 2014

This Matlab toolbox implements the pitch-change algorithm described by Slaney, Shriberg and Huang in their Interspeech 2013 paper “Pitch-gesture modeling using subband autocorrelation change detection.” Calculating speaker pitch (or f0) is typically the first computational step in modeling tone and intonation for spoken language understanding. Usually pitch is treated as a fixed, single-valued quantity. The…

Size: 12 MB

    Click the icon to access this download

  • Website

Computational Network Toolkit

April 2014



From Captions to Visual Concepts and Back

Established: April 9, 2015

We introduce a novel approach for automatically generating image descriptions. Visual detectors, language models, and deep multimodal similarity models are learned directly from a dataset of image captions. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a…

Eye Gaze and Face Pose for Better Speech Recognition

Established: October 2, 2014

We want to use eye gaze and face pose to understand what users are looking at, to what they are attending, and use this information to improve speech recognition. Any sort of language constraint makes speech recognition and understanding easier…

Dialog and Conversational Systems Research

Established: March 14, 2014

Conversational systems interact with people through language to assist, enable, or entertain. Research at Microsoft spans dialogs that use language exclusively, or in conjunctions with additional modalities like gesture; where language is spoken or in text; and in a variety…

Meeting Recognition and Understanding

Established: July 30, 2013

In most organizations, staff spend many hours in meetings. This project addresses all levels of analysis and understanding, from speaker tracking and robust speech transcription to meaning extraction and summarization, with the goal of increasing productivity both during the meeting…

Recurrent Neural Networks for Language Processing

Established: November 23, 2012

This project focuses on advancing the state-of-the-art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding and meaning representation. A special interest in is adding side-channels of information…

Language Modeling for Speech Recognition

Established: January 29, 2004

Did I just say "It's fun to recognize speech?" or "It's fun to wreck a nice beach?" It's hard to tell because they sound about the same. Of course, it's a lot more likely that I would say "recognize speech"…

Acoustic Modeling

Established: January 29, 2004

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. Hidden Markov Model (HMM) is one most common type of acoustuc models. Other acosutic models include segmental models, super-segmental models…

Whistler Text-to-Speech Engine

Established: November 5, 2001

The talking computer HAL in the 1968 film "2001-A Space Odyssey" had an almost human voice, but it was the voice of an actor, not a computer. Getting a real computer to talk like HAL has proven one of the…