
Publications
2014
Artificial Neural Network Features for Speaker Diarization
The relation of eye gaze and face pose: Potential impact on speech recognition
An Introduction to Computational Networks and the Computational Network Toolkit
Neural Network Models for Lexical Addressee Detection
Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine
Highly Accurate Phonetic Segmentation Using Boundary Correction Models and System Fusion
2010
Statistical Modeling of the Speech Signal
Dual stage probabilistic voice activity detector
2009
Commute UX: Voice Enabled In-car Infotainment System
Unified Framework for Single Channel Speech Enhancement
2008
Sound Capture System and Spatial Filter for Small Devices
Data Driven Beamformer Design for Binaural Headset
Robust Design of Wideband Loudspeaker Arrays
An EM-based Probabilistic Approach for Acoustic Echo Suppression
2007
Commute UX: Telephone Dialog System for Location-based Services
Robust Location Understanding in Spoken Dialog Systems Using Intersections
2006
Microphone Array Post-Processor Using Instantaneous Direction of Arrival
Suppression Rule for Speech Recognition Friendly Noise Suppressors
2005
A Compact Multi-Sensor Headset for Hands-Free Communication
Microphone Array for Headset with Spatial Noise Suppressor
Reverberation Reduction for Improved Speech Recognition
Previous projects
- Language Understanding: Don’t just recognize the words a user spoke, but understand what they mean.
- Noise Robustness: How do we make the system work when background noise is present?
- Voice search: Users can search for information, such as a business, from their phones.
- Automatic Grammar Induction: How do we create grammars to ease the development of spoken language systems?
- (MiPad) Multimodal Interactive Pad: Our first multimodal prototype.
- SALT (Speech Application Language Tags): A markup language for the multimodal web.
- From Captions to Visual Concepts and Back: Image captioning and understanding
- Intent Understanding: Not just recognizing the words the user says, but understanding what they mean.
- Multimodal Conversational User Interface
- Personalized Language Model for improved accuracy
- Recurrent Neural Networks for Language Processing
- Speech Technology for Computational Phonetics and Reading Assessment
- (Whisper) Speech Recognition: Our dictation-oriented speech recognition project, a state-of-the-art general-purpose speech recognizer.
- (WhisperID) Speaker Identification: Who is doing the talking?
- Speech Application Programming Interface (SAPI) Development Toolkit: The Whisper speech recognizer can be used by developers to build applications that use speech recognition.
Overview
The mission of the Cognitive Services Research group (CSR) is to make fundamental contributions that advance the state of the art on the most challenging problems in speech, language, and vision, both within Microsoft and in the external research community.
We conduct cutting-edge research in all aspects of spoken language processing and computer vision. On the speech and language side, this includes audio-visual fusion; visual-semantic reasoning; federated learning; speech recognition; speech enhancement; speaker recognition and diarization; machine reading comprehension; text summarization; multilingual language modeling; and related topics in natural language processing, understanding, and generation. On the vision side, it spans face forgery detection; object detection and segmentation; dense pose, head, and mask tracking; action recognition; image and video captioning; and other topics in image and real-time video understanding. We leverage large-scale GPU and CPU clusters, as well as internal and public data sets, to develop world-leading deep learning technologies for forward-looking topics such as audio-visual far-field meeting transcription, automatic meeting minutes generation, and multi-modal dialog systems. We evaluate our research on public benchmarks, achieving breakthrough human parity performances on the Switchboard conversational speech recognition task and on Stanford's Conversational Question Answering Challenge (CoQA).
In addition to expanding our scientific understanding of speech, language, and vision, our work finds outlets in Microsoft products such as Azure Cognitive Services, HoloLens, Teams, Windows, Office, Bing, Cortana, Skype Translator, Xbox, and more.
The Cognitive Services Research group is managed by Michael Zeng.
People
Current members
Xiyang Dai
Senior Researcher
Mei Gao
Research SDE II
Nick Gonsalves
Research SDE II
Bin (Leo) Hsiao
Senior Research SDE
Kenichi Kumatani
Principal Researcher
Canrun Li
Research SDE
Leo Shen
Senior Research SDE
Manthan Thakker
Research SDE
Zhen Xiao
Principal Architect
Yichong Xu
Senior Researcher
Speech and Dialog alumni
Jasha Droppo
Hakan Erdogan
Asela Gunawardana
Senior Researcher
Li Jiang
Distinguished Engineer
Microsoft Cloud & AI
Sungjin Lee
Abdelrahman Mohamed
Researcher
Mike Seltzer
Dong Yu
Principal Researcher
Speech and Dialog
The former Speech and Dialog Research Group (SDRG) was responsible for fundamental advances in speech and language technologies, including speech recognition, language modeling, language understanding, spoken language systems, and multi-modal dialog systems. Contributions included the breakthrough human parity performances on the Switchboard conversational speech recognition task and Stanford's Conversational Question Answering Challenge (CoQA). SDRG merged with the Azure computer vision group in 2020 to form the Cognitive Services Research Group.
Former members
- Xie Chen
- Jasha Droppo
- Hakan Erdogan
- Asela Gunawardana
- Hsiao-Wuen Hon
- Mei-Yuh Hwang
- Li Jiang
- Y. C. Ju
- Sungjin Lee
- Abdelrahman Mohamed
- Frank Seide
- Mike Seltzer
- Andreas Stolcke
- Kuansan Wang
- Jason Williams
- Wayne Xiong
- Dong Yu
- Geoff Zweig
Computer Vision
The former Computer Vision Research Group (CVRG) oversaw research in core computer vision tasks including object detection, object tracking, human understanding, and cross-modal pretraining. CVRG merged with the Speech and Dialog Research Group in 2020 to form the Cognitive Services Research Group.
Talks
CSR organizes the Distinguished Talk Series to host discussions with leaders in academia and industry. If you’re interested in giving a talk, please contact Chenguang Zhu (chezhu@microsoft.com).
| Presenter | Affiliation | Date | Title |
| --- | --- | --- | --- |
| Prof. Meng Jiang | University of Notre Dame | 9/10/2020 | Scientific Knowledge Extraction: New Tasks and Methods |
| Prof. Vivian Yun-Nung Chen | National Taiwan University | 10/2/2020 | Are Your Dialogue Systems Robust and Scalable? |
| Prof. Fei Liu | University of Central Florida | 10/30/2020 | Toward Robust Abstractive Multi-Document Summarization and Information Consolidation |
| Prof. Jiajun Wu | Stanford University | 11/19/2020 | Neuro-Symbolic Visual Concept Learning |
| Prof. Xiang Ren | University of Southern California | 12/18/2020 | Label Efficient Learning with Human Explanations |
| Prof. Tianqi Chen | Carnegie Mellon University | 1/15/2021 | Elements of Learning Systems |
| Prof. Song Han | MIT | 1/21/2021 | Putting AI on a Diet: TinyML and Efficient Deep Learning |
| Prof. Diyi Yang | Georgia Tech | 2/18/2021 | Language Understanding in Social Context: Theory and Practice |
| Prof. Aditya Grover | Facebook AI Research/UCLA | 3/18/2021 | Transformer Language Models as Universal Computation Engines |