Portrait of Xiaodong He

Xiaodong He

Principal Researcher


Xiaodong He is a Principal Researcher in the Deep Learning Technology Center of Microsoft Research, Redmond, WA, USA. He is also an Affiliate Professor in the Department of Electrical Engineering at the University of Washington (Seattle), where he serves on doctoral supervisory committees. His research interests span artificial intelligence, including deep learning, natural language processing, computer vision, speech, information retrieval, and knowledge representation.

He has published more than 100 papers in venues including ACL, EMNLP, NAACL, CVPR, SIGIR, WWW, CIKM, NIPS, ICLR, ICASSP, Proceedings of the IEEE, IEEE TASLP, and IEEE SPM. He has received several awards, including the Outstanding Paper Award at ACL 2015. He led the development of the MSR-NRC-SRI entry and the MSR entry that took first place in the 2008 NIST Machine Translation Evaluation and the 2011 IWSLT Evaluation (Chinese-to-English), respectively. He is also a co-inventor of the DSSM (2013, 2014a, 2014b), which is broadly applied to language, vision, IR, and knowledge representation tasks. More recently, he and colleagues developed the MSR image captioning system that achieved the highest score in the Turing test and won first prize, tied with Google, at the COCO Captioning Challenge 2015. This work was reported by Communications of the ACM in January 2016. The image captioning effort he leads is now part of Microsoft Cognitive Services and CaptionBot, and was widely covered in media including Business Insider, TechCrunch, Forbes, The Washington Post, CNN, and BBC. The services also support applications such as Seeing AI, Microsoft Word, and PowerPoint.

He has held editorial positions on several IEEE journals, served as an area chair for NAACL-HLT 2015, and served on the organizing and program committees of major speech and language processing conferences. He is an elected member of the IEEE SLTC for the 2015-2017 term, a senior member of the IEEE, and a member of the ACL. He was elected Chair of the IEEE Seattle Section in 2016.

He received his bachelor's degree from Tsinghua University (Beijing) in 1996, his MS degree from the Chinese Academy of Sciences (Beijing) in 1999, and his PhD from the University of Missouri – Columbia in 2003.


Invited talks, tutorials, and code release


Vision and Language Intelligence

Established: June 28, 2017

This project aims at driving disruptive advances in vision and language intelligence. We believe future breakthroughs in multimodal intelligence will empower smart communications between humans and the world and enable next-generation scenarios such as a universal chatbot and intelligent augmented reality. To these ends, we are focusing on understanding, reasoning, and generation across language and vision, and creation of intelligent services, including vision-to-text captioning, text-to-vision generation, and question answering/dialog about images and videos.

Deep Learning for Machine Reading Comprehension

Established: September 1, 2016

The goal of this project is to teach a computer to read and answer general questions pertaining to a document. We recently released a large scale MRC dataset, MS MARCO.  We developed a ReasoNet model to mimic the inference process of human readers. With a question in mind, ReasoNets read a document repeatedly, each time focusing on different parts of the document until a satisfying answer is found or formed. The extension of ReasoNet (ReasoNet-Memory)…
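The iterative reading loop described above can be sketched in a few lines. The following is an illustrative, untrained toy in NumPy, not the published model: the attention, state-update, and termination-gate weights are random stand-ins for the parameters ReasoNet learns with reinforcement learning.

```python
import numpy as np

def reasonet_step(memory, state, W_att, W_upd):
    """One reading step: attend over the document memory, update the state."""
    scores = memory @ (W_att @ state)          # dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over document positions
    context = weights @ memory                 # weighted read of the document
    # simple gated update (stand-in for the GRU used in the paper)
    state = np.tanh(W_upd @ np.concatenate([state, context]))
    return state

def read_until_confident(memory, state, W_att, W_upd, w_term,
                         max_steps=10, threshold=0.9):
    """Re-read the document until the termination gate fires."""
    for t in range(max_steps):
        state = reasonet_step(memory, state, W_att, W_upd)
        p_stop = 1.0 / (1.0 + np.exp(-w_term @ state))  # termination gate
        if p_stop > threshold:
            break
    return state, t + 1

rng = np.random.default_rng(0)
d = 8                                   # hidden size (arbitrary for the toy)
memory = rng.standard_normal((20, d))   # 20 encoded document positions
state = rng.standard_normal(d)          # question encoding as initial state
W_att = rng.standard_normal((d, d))
W_upd = rng.standard_normal((d, 2 * d)) * 0.5
w_term = rng.standard_normal(d)
final_state, steps = read_until_confident(memory, state, W_att, W_upd, w_term)
```

Each pass focuses the attention weights on different parts of the memory, and the learned gate decides when the current state is good enough to emit an answer.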


Established: August 1, 2016

Embed objects of any modality (e.g. images, queries, sentences, etc.) to the same semantic vector space.

MS-Celeb-1M: Challenge of Recognizing One Million Celebrities in the Real World

Established: June 29, 2016

MSR Image Recognition Challenge (IRC) @ACM Multimedia 2016. Important Dates/Updates:

  • New! We are hosting new challenges at ICCV 2017. Visit MsCeleb.org for more details.
  • Participants' information is disclosed in the "Team Information" section below.
  • 6/21/2016: Evaluation result announced in the "Evaluation Result" section below.
  • 6/17/2016: Evaluation finished. 14 teams finished the grand challenge!
  • 6/13/2016: Evaluation started.
  • 6/13/2016: Dry run finished; 14 out of 19 teams passed. See details in "Update Details" below.
  • 6/10/2016: Dry run update 3: 8 teams…

From Captions to Visual Concepts and Back

Established: April 9, 2015

We introduce a novel approach for automatically generating image descriptions. Visual detectors, language models, and deep multimodal similarity models are learned directly from a dataset of image captions. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. Human judges consider the system's captions to be as good as or better than those written by people 34% of the time.


Established: January 30, 2015

The goal of this project is to develop a class of deep representation learning models. DSSM stands for Deep Structured Semantic Model or, more generally, Deep Semantic Similarity Model. DSSM, developed by the MSR Deep Learning Technology Center (DLTC), is a deep neural network (DNN) modeling technique for representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space and modeling the semantic similarity between two text strings (e.g., Sent2Vec). DSSM has wide applications including information retrieval…
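As an illustration of the idea, here is a minimal untrained sketch of the DSSM pipeline in NumPy: letter-trigram hashing of a string, a small feed-forward projection into a shared semantic space, and cosine similarity as the matching score. The layer sizes and random weights are toy stand-ins, not the trained model.

```python
import numpy as np

def char_trigrams(text):
    """DSSM-style letter-trigram features; '#' marks word boundaries."""
    s = "#" + text.lower().replace(" ", "#") + "#"
    return [s[i:i + 3] for i in range(len(s) - 2)]

def embed(text, vocab, W1, W2):
    """Map any string into the shared semantic space with a tiny MLP."""
    x = np.zeros(len(vocab))
    for tg in char_trigrams(text):
        if tg in vocab:
            x[vocab[tg]] += 1.0        # bag of letter-trigram counts
    return np.tanh(W2 @ np.tanh(W1 @ x))

def cosine(a, b):
    """DSSM scores semantic similarity with cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build a trigram vocabulary from a handful of strings (toy corpus).
corpus = ["deep learning", "neural network", "semantic model", "web search"]
vocab = {}
for t in corpus:
    for tg in char_trigrams(t):
        vocab.setdefault(tg, len(vocab))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((32, len(vocab))) * 0.1   # input -> hidden
W2 = rng.standard_normal((16, 32)) * 0.1           # hidden -> semantic space
q = embed("deep learning", vocab, W1, W2)
d = embed("neural network", vocab, W1, W2)
score = cosine(q, d)
```

Because both sides pass through the same projection, objects of different modalities (a query and a document, or an image feature and a caption) can be compared in the same vector space; in the real model the weights are trained so that matching pairs score higher than mismatched ones.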

Spoken Language Understanding

Established: May 1, 2013

Spoken language understanding (SLU) is an emerging field at the intersection of speech processing and natural language processing. The term has largely been coined for the targeted understanding of human speech directed at machines. This project covers our research on SLU tasks such as domain detection, intent determination, and slot filling, using data-driven methods. Projects:
  • Deeper Understanding: Moving beyond shallow targeted understanding towards building domain-independent SLU models.
  • Scaling SLU: Quickly bootstrapping SLU…
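The three sub-tasks can be illustrated on a single hypothetical utterance; the domain, intent, and slot labels below are made-up examples of the kind of output a trained SLU model would produce, with slots in the commonly used BIO tagging scheme.

```python
# A made-up utterance for illustration only.
utterance = "book a flight from seattle to boston"

# Domain detection and intent determination are utterance-level
# classification tasks (labels here are hypothetical).
domain = "travel"          # e.g. travel vs. music vs. weather
intent = "book_flight"

# Slot filling is per-token sequence labeling (BIO scheme):
# B- opens a slot, I- continues it, O means no slot.
tokens = utterance.split()
slots = ["O", "O", "O", "O", "B-from_city", "O", "B-to_city"]
assert len(tokens) == len(slots)
pairs = list(zip(tokens, slots))
```

Data-driven SLU models learn the classifier and the tagger from labeled utterances rather than hand-written rules.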

Recurrent Neural Networks for Language Processing

Established: November 23, 2012

This project focuses on advancing the state of the art in language processing with recurrent neural networks. We are currently applying these to language modeling, machine translation, speech recognition, language understanding, and meaning representation. A special interest is adding side-channels of information as input, to model phenomena that are not easily handled in other frameworks. A toolkit for RNN language modeling with side-information is in the associated download. Sample word vectors for use with this toolkit…
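One step of such an RNN language model can be sketched as follows. This is a toy with random weights, not the released toolkit: the side-channel vector (e.g. topic or context features) simply enters the hidden-layer update alongside the current word.

```python
import numpy as np

def rnn_lm_step(x_word, x_side, h, Wx, Ws, Wh, Wo):
    """One step of an RNN LM with a side-information input channel.

    x_word: one-hot vector for the current word; x_side: side-channel
    features fed into the hidden layer together with the word input.
    """
    h = np.tanh(Wx @ x_word + Ws @ x_side + Wh @ h)   # recurrent update
    logits = Wo @ h
    p = np.exp(logits - logits.max())                  # softmax over vocab
    return h, p / p.sum()

rng = np.random.default_rng(1)
V, S, H = 50, 5, 16                 # vocab, side-channel, hidden sizes (toy)
Wx = rng.standard_normal((H, V)) * 0.1
Ws = rng.standard_normal((H, S)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1
Wo = rng.standard_normal((V, H)) * 0.1
h = np.zeros(H)
x = np.zeros(V)
x[7] = 1.0                          # current word as a one-hot vector
side = rng.standard_normal(S)       # side-channel features
h, probs = rnn_lm_step(x, side, h, Wx, Ws, Wh, Wo)
```

Training adjusts all four weight matrices so that the next-word distribution is conditioned on both the word history and the side information.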




Visual Storytelling
Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, Margaret Mitchell, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), June 2016.


From Captions to Visual Concepts and Back
Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, Larry Zitnick, Geoffrey Zweig, in Proceedings of CVPR, IEEE, June 2015.


















Deep Learning for Text Processing


August 4, 2014


Li Deng, Eric Xing, Xiaodong He, Jianfeng Gao, Christopher Manning, Paul Smolensky, and Jeff A Bilmes


Microsoft Research Redmond, Carnegie Mellon University, Stanford University, Johns Hopkins University, University of Washington


Recent media coverage

News and events

Invited talks & tutorials

Selected work on Deep Learning and applications to NLP, Vision, SLU, IR, and Knowledge Representation

Academic services

  • Member of the IEEE Speech and Language Processing Technical Committee 2015-2017
  • Area Chair, Spoken Language Processing, NAACL 2015
  • Associate Editor, IEEE Signal Processing Letters since 2014
  • Member of the Organizing Committee, Chair of Special Sessions, IEEE ICASSP 2013
  • Associate Editor, IEEE Signal Processing Magazine since 2012
  • Guest Editor, Special Issue on Continuous-space and related methods in natural language processing, in IEEE Transactions on Audio, Speech, and Language Processing, 2014
  • Guest Editor, Special Issue on Large-Scale Optimization for Audio, Speech, and Language Processing, in IEEE Transactions on Audio, Speech, and Language Processing, 2013
  • Lead Guest Editor, Special Issue on Statistical Learning Methods for Speech and Language Processing, in IEEE Journal of Selected Topics in Signal Processing, 2010
  • Co-Chair, NIPS 2008 Workshop on Speech and Language: Learning-Based Methods and Systems, Whistler, BC, Canada, 2008
  • Grant Reviewer: Swiss National Science Foundation
  • Program Committee Member: ACL, NAACL, EMNLP, COLING, AAAI
  • Reviewer: IEEE Transactions on Speech and Audio Processing, Proceedings of the IEEE, IEEE Signal Processing Magazine, IEEE Signal Processing Letters, IEEE Transactions on Computers, Speech Communication, Pattern Recognition, Pattern Recognition Letters, ICASSP, Interspeech, NIPS

Honors and awards

  • ACL 2015 Outstanding Paper Award
  • 1st Prize, MS COCO Captioning Challenge 2015
  • No. 1 Place, Chinese to English MT track, 2011 IWSLT Evaluation
  • No. 1 Place, Chinese to English common data track, 2008 NIST MT Evaluation
  • ICASSP 2011 Best Student Paper Award (co-author)
  • IEEE senior member since 2008
  • Microsoft Gold Star Award, 2005
  • Microsoft Patent awards, 2005-2014
  • Microsoft Technology Transfer Award, 2009, 2014

Special issues

NIPS 2008 workshop

The NIPS 2008 workshop on Speech and Language: Learning-Based Methods and Systems covered a variety of advanced topics in speech and language processing. More details can be found on the workshop's homepage.