Deep Attention Mechanism for Multimodal Intelligence: Perception, Reasoning, and Expression across Vision and Language
- Xiaodong He | Microsoft Research
We have long envisioned that machines will one day perform human-like perception, reasoning, and expression across multiple modalities, including vision and language, which will augment and transform the ways humans communicate with each other and with the real world. With this vision, I'll use three tasks as examples to demonstrate recent progress in multimodal intelligence: image-to-language generation, visual question answering, and language-to-image synthesis. I'll discuss the open problems behind these tasks that we are eager to solve, including image and language understanding, joint reasoning across both modalities, and expressing abstract concepts through natural language or image generation. I'll also discuss the deep attention mechanisms recently developed to address these challenging problems, and analyze interpretability and controllability in learning algorithms, which are of fundamental importance to general intelligence.
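As a concrete reference point for the attention mechanisms mentioned in the abstract, here is a minimal sketch of a single attention step in which a language-side query vector attends over a set of image-region features, in the spirit of attention-based captioning and VQA models. The function name, feature shapes, and the scaled dot-product scoring are illustrative assumptions, not the specific models presented in the talk.

```python
# Illustrative sketch only: one attention step where a text/query vector
# attends over k image-region features. Shapes and scoring are assumptions.
import numpy as np

def attend(query, regions, scale=True):
    """query: (d,) language/state vector; regions: (k, d) image-region features.
    Returns attention weights over regions and the attended context vector."""
    scores = regions @ query                       # (k,) dot-product relevance
    if scale:
        scores = scores / np.sqrt(query.shape[0])  # scaled dot-product variant
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax over the k regions
    context = weights @ regions                    # (d,) weighted sum of regions
    return weights, context

# Toy usage: 36 region features of dimension 512, one query vector.
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 512))
query = rng.standard_normal(512)
weights, context = attend(query, regions)
print(weights.shape, context.shape)  # (36,) (512,)
```

The attention weights make such models partly interpretable: they indicate which image regions the model relies on when generating a word or answering a question, which connects to the interpretability and controllability themes of the talk.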
Speaker Details
- Emre Kiciman, Partner Research Manager