Cognitive Services Research



Artificial Neural Network Features for Speaker Diarization

The relation of eye gaze and face pose: Potential impact on speech recognition

An Introduction to Computational Networks and the Computational Network Toolkit

Neural Network Models for Lexical Addressee Detection

Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine

Highly Accurate Phonetic Segmentation Using Boundary Correction Models and System Fusion


Statistical Modeling of the Speech Signal

Dual stage probabilistic voice activity detector

Reverberated Speech Signal Separation Based on Regularized Subband Feedforward ICA and Instantaneous Direction of Arrival


Commute UX: Voice Enabled In-car Infotainment System

Unified Framework for Single Channel Speech Enhancement


Sound Capture System and Spatial Filter for Small Devices

Data Driven Beamformer Design for Binaural Headset

Robust Design of Wideband Loudspeaker Arrays

An EM-based Probabilistic Approach for Acoustic Echo Suppression


Commute UX: Telephone Dialog System for Location-based Services

Robust Location Understanding in Spoken Dialog Systems Using Intersections

Robust Adaptive Beamforming Algorithm Using Instantaneous Direction of Arrival with Enhanced Noise Suppression Capability

Microphone Array Post-Filter Using Incremental Bayes Learning to Track the Spatial Distribution of Speech and Noise


Microphone Array Post-Processor Using Instantaneous Direction of Arrival

Suppression Rule for Speech Recognition Friendly Noise Suppressors


A Compact Multi-Sensor Headset for Hands-Free Communication

Microphone Array for Headset with Spatial Noise Suppressor

Reverberation Reduction for Better Speech Recognition

News & features

Current Projects

The mission of the Cognitive Services Research group (CSR) is to make fundamental contributions to advancing the state of the art of the most challenging problems in speech, language, and vision—both within Microsoft and the external research community.

We conduct cutting-edge research in all aspects of spoken language processing and computer vision. This includes audio-visual fusion; visual-semantic reasoning; federated learning; speech recognition; speech enhancement; speaker recognition and diarization; machine reading comprehension; text summarization; multilingual language modeling; and related topics in natural language processing, understanding, and generation; as well as face forgery detection; object detection and segmentation; dense pose, head, and mask tracking; action recognition; image and video captioning; and other topics in image and real-time video understanding. We leverage large-scale GPU and CPU clusters, as well as internal and public data sets, to develop world-leading deep learning technologies for forward-looking topics such as audio-visual far-field meeting transcription, automatic meeting minutes generation, and multi-modal dialog systems. We publish our research on public benchmarks, including our breakthrough human-parity performances on the Switchboard conversational speech recognition task and Stanford’s Conversational Question Answering Challenge (CoQA).

In addition to expanding our scientific understanding of speech, language, and vision, our work finds outlets in Microsoft products such as Azure Cognitive Services, HoloLens, Teams, Windows, Office, Bing, Cortana, Skype Translator, Xbox, and more.

The Cognitive Services Research group is managed by Michael Zeng.

For more information on our vision research or recent progress leveraging knowledge and language, please see the pages for our Computer Vision and Knowledge and Language teams.


Current members

Speech and Dialog alumni

Speech and Dialog

The former Speech and Dialog Research Group (SDRG) was responsible for fundamental advances in speech and language technologies, including speech recognition, language modeling, language understanding, spoken language systems and multi-modal dialog systems. Contributions included the breakthrough human parity performances on the Switchboard conversational speech recognition task and Stanford’s Conversational Question Answering Challenge (CoQA). SDRG merged with the Azure computer vision group in 2020 to form the Cognitive Services Research Group.

Former members

Computer Vision

The Azure Computer Vision Research (ACVR) group is part of the Cognitive Services Research (CSR) group, focusing on cutting-edge research in computer vision to advance the state of the art and develop the next-generation framework for visual recognition.

The problems we are interested in include image classification; object detection and segmentation; motion analysis and object tracking; dense pose, head, and mask tracking; action recognition; image generation; real-time video understanding; visual representation learning; multi-modality representation learning; and unsupervised, self-supervised, and contrastive learning. We leverage large-scale GPU and CPU clusters, as well as internal and public data sets, to develop world-leading deep learning technologies for core vision problems and generic visual representations that can be customized to a wide range of downstream tasks and real applications.

The team also runs Project Florence, which focuses on developing universal backbones with shared representations for a wide spectrum of visual categories, aiming to accelerate the shipping of Microsoft vision products using state-of-the-art large-scale deep learning models.


CSR organizes the Distinguished Talk Series to host discussions with leaders in academia and industry. If you’re interested in giving a talk, please contact Chenguang Zhu.
Speaker | Affiliation | Date | Talk Title
Prof. Ashton Anderson | University of Toronto | 4/09/2021 | The Cultural Structure of Online Platforms
Prof. Aditya Grover | Facebook AI Research/UCLA | 3/18/2021 | Transformer Language Models as Universal Computation Engines
Prof. Diyi Yang | Georgia Tech | 2/18/2021 | Language Understanding in Social Context: Theory and Practice
Prof. Song Han | MIT | 1/21/2021 | Putting AI on a Diet: TinyML and Efficient Deep Learning
Prof. Tianqi Chen | Carnegie Mellon University | 1/15/2021 | Elements of Learning Systems
Prof. Xiang Ren | University of Southern California | 12/18/2020 | Label Efficient Learning with Human Explanations
Prof. Jiajun Wu | Stanford University | 11/19/2020 | Neuro-Symbolic Visual Concept Learning
Prof. Fei Liu | University of Central Florida | 10/30/2020 | Toward Robust Abstractive Multi-Document Summarization and Information Consolidation
Prof. Vivian Yun-Nung Chen | National Taiwan University | 10/2/2020 | Are Your Dialogue Systems Robust and Scalable?
Prof. Meng Jiang | University of Notre Dame | 9/10/2020 | Scientific Knowledge Extraction: New Tasks and Methods