Deep Attention Mechanism for Multimodal Intelligence: Perception, Reasoning, and Expression across Vision and Language
- Xiaodong He | Microsoft Research
We have long envisioned that machines will one day perform human-like perception, reasoning, and expression across multiple modalities, including vision and language, which will augment and transform the ways humans communicate with each other and with the real world. With this vision, I'll use three tasks as examples to demonstrate recent progress in multimodal intelligence: image-to-language generation, visual question answering, and language-to-image synthesis. I'll discuss the open problems behind these tasks that we are eager to solve, including image and language understanding, joint reasoning across both modalities, and expressing abstract concepts through natural language or image generation. I'll also discuss the deep attention mechanisms recently developed to address these challenging problems, and analyze interpretability and controllability in learning algorithms, which are of fundamental importance to general intelligence.
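As a concrete reference point for the attention mechanisms mentioned in the abstract, here is a minimal sketch of a single attention step in which a language-side query vector attends over a set of image-region features, in the spirit of attention-based captioning and VQA models. The function name, feature shapes, and the scaled dot-product scoring are illustrative assumptions, not the specific models presented in the talk.

```python
# Illustrative sketch only: one attention step where a text/query vector
# attends over k image-region features. Shapes and scoring are assumptions.
import numpy as np

def attend(query, regions, scale=True):
    """query: (d,) language/state vector; regions: (k, d) image-region features.
    Returns attention weights over regions and the attended context vector."""
    scores = regions @ query                       # (k,) dot-product relevance
    if scale:
        scores = scores / np.sqrt(query.shape[0])  # scaled dot-product variant
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax over the k regions
    context = weights @ regions                    # (d,) weighted sum of regions
    return weights, context

# Toy usage: 36 region features of dimension 512, one query vector.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 512))
query = rng.standard_normal(512)
weights, context = attend(query, regions)
print(weights.shape, context.shape)  # (36,) (512,)
```

The attention weights make such models partly interpretable: they indicate which image regions the model relies on when generating a word or answering a question, which connects to the interpretability and controllability themes of the talk.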
Speaker Details
- Emre Kiciman, Partner Research Manager