The mission of the Cognitive Services Research group (CSR) is to make fundamental contributions that advance the state of the art on the most challenging problems in speech, language, and vision, both within Microsoft and in the external research community.

We conduct cutting-edge research in all aspects of spoken language processing and computer vision. This includes audio-visual fusion; visual-semantic reasoning; federated learning; speech recognition; speech enhancement; speaker recognition and diarization; machine reading comprehension; text summarization; multilingual language modeling; and related topics in natural language processing, understanding, and generation; as well as face forgery detection; object detection and segmentation; dense pose, head, and mask tracking; action recognition; image and video captioning; and other topics in image and real-time video understanding. We leverage large-scale GPU and CPU clusters as well as internal and public data sets to develop world-leading deep learning technologies for forward-looking topics such as audio-visual far-field meeting transcription, automatic meeting minutes generation, and multi-modal dialog systems. We publish our results on public benchmarks, including our breakthrough human-parity performance on the Switchboard conversational speech recognition task and Stanford’s Conversational Question Answering Challenge (CoQA).

In addition to expanding our scientific understanding of speech, language, and vision, our work finds outlets in Microsoft products such as Azure Cognitive Services, HoloLens, Teams, Windows, Office, Bing, Cortana, Skype Translator, Xbox, and more.

The Cognitive Services Research group is managed by Michael Zeng.


Speech and Dialog

The former Speech and Dialog Research Group (SDRG) was responsible for fundamental advances in speech and language technologies, including speech recognition, language modeling, language understanding, spoken language systems, and multi-modal dialog systems. Contributions included the breakthrough human-parity performance on the Switchboard conversational speech recognition task and Stanford’s Conversational Question Answering Challenge (CoQA). SDRG merged with the Azure computer vision group in 2020 to form the Cognitive Services Research Group.

Computer Vision

The former Computer Vision Research Group (CVRG) oversaw research in core computer vision tasks, including object detection, object tracking, human understanding, and cross-modal pretraining. CVRG merged with the Speech and Dialog Research Group in 2020 to form the Cognitive Services Research Group.


CSR organizes the Distinguished Talk Series to host discussions with leaders in academia and industry.






Speaker | Affiliation | Date | Talk title
Prof. Meng Jiang | University of Notre Dame | 9/10/2020 | Scientific Knowledge Extraction: New Tasks and Methods
Prof. Vivian Yun-Nung Chen | National Taiwan University | 10/2/2020 | Are Your Dialogue Systems Robust and Scalable?
Prof. Fei Liu | University of Central Florida | 10/30/2020 | Toward Robust Abstractive Multi-Document Summarization and Information Consolidation
Prof. Jiajun Wu | Stanford University | 11/19/2020 | Neuro-Symbolic Visual Concept Learning
Prof. Xiang Ren | University of Southern California | 12/18/2020 | Label Efficient Learning with Human Explanations
Prof. Tianqi Chen | Carnegie Mellon University | 1/15/2021 | Elements of Learning Systems
Prof. Song Han | MIT | 1/21/2021 | Putting AI on a Diet: TinyML and Efficient Deep Learning
Prof. Diyi Yang | Georgia Tech | 2/18/2021 | Language Understanding in Social Context: Theory and Practice
Prof. Aditya Grover | Facebook AI Research/UCLA | 3/18/2021 | Transformer Language Models as Universal Computation Engines