Vision and Language Intelligence

Established: June 28, 2017

This project aims to drive disruptive advances in vision and language intelligence. We believe future breakthroughs in multimodal intelligence will empower smart communication between humans and the world and enable next-generation scenarios such as a universal chatbot and intelligent augmented reality. To these ends, we focus on understanding, reasoning, and generation across language and vision, and on the creation of intelligent services, including vision-to-text captioning, text-to-vision generation, and question answering/dialog about images and videos.


Talks and Tutorials

Invited Talks

  1. Multimodal Learning for Image Captioning and Visual Question Answering, Xiaodong He, invited talk at UC Berkeley, BVLC, April 1, 2016
  2. Towards Human-level Quality Image Captioning: Deep Semantic Learning of Text and Images, Xiaodong He, invited talk at the INNS Deep Learning Workshop, August 1, 2015
  3. Deep Semantic Learning: Teach machines to understand text, image, and knowledge graph, Xiaodong He, invited talk at the CVPR DeepVision workshop, June 1, 2015


In the News

  1. Communications of the ACM interviewed Fei-Fei Li, Rob Fergus, Richard Zemel, and Xiaodong He on recent progress in computer vision and language processing; the interview was highlighted in “Seeing More Clearly” in the January 2016 issue of CACM.
  2. This project led to the image captioning cloud service that is now part of Microsoft Cognitive Services and CaptionBot. The work was widely covered in media including Business Insider, TechCrunch, Forbes, The Washington Post, CNN, and BBC. The service also supports applications such as Seeing AI, Microsoft Word, and PowerPoint. Fun stories have also been shared at Engadget, Gizmodo, The Telegraph, Daily Mail, The Guardian, Mashable, and more.
  3. Business Insider reported on our deep image question answering work in “Microsoft Research creates a multi-step reasoning computer.” The work was also covered by ZDNet, eWeek, and others. The paper was presented as an oral presentation at CVPR 2016.
  4. Our MSR entry won first prize, tied with Google, at the MS COCO Captioning Challenge 2015, and achieved the highest score in the Turing test among all submissions. More details are available in the CVPR paper, demo, relevant talk, and media coverage by the Microsoft blog, TechNet, SlashGear, Engadget, VentureBeat, and Android Headlines.