November 14, 2024

2024 MSR Asia TAB Workshop: Spatial Intelligence for Embodied AI

Location: Beijing, China

Agenda
9:00 – 9:10 Opening
  • Lily Sun (host)
  • Baining Guo
9:10 – 10:00 Keynote Learning to Interact: Challenges and Opportunities for Embodied AI
The transformative advancements we’ve witnessed in generative AI present unprecedented opportunities for enabling AI agents to interact with the physical world on our behalf. With private-sector investment in robotic platforms on the rise, we see a frontier of exciting research challenges that must be solved to unlock the full promise of embodied AI (EAI). In this talk, we will share our current thinking on an EAI roadmap along with associated research problems that we see MSR as well-positioned to address. We will also provide an overview of the partnerships and assets that MSR Accelerator is investing in to help propel EAI research at MSR.
  • Ashley Llorens
  • Andrey Kolobov
  • Vivan Amin
10:00 – 11:00 Foundation Models for Robotics Exploring Interactive Robotics: Teleoperation and Dexterous Grasping with Live Demos
This presentation delves into two critical aspects of interactive robotics: 1) human-robot interaction, with a focus on teleoperation; and 2) robot-environment interaction, emphasizing reinforcement learning and policy distillation for dexterous robotic grasping. Both segments feature live demos showcasing practical implementations and the latest advancements in these domains.
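As background for the policy-distillation portion of the talk, the snippet below sketches the common teacher-student recipe used for dexterous grasping: a reinforcement-learning teacher trained with privileged state information supervises an observation-only student policy. All module names, dimensions, and data here are illustrative assumptions rather than the speakers' actual implementation.

```python
# Illustrative sketch (not the speakers' implementation): distilling a
# privileged-state RL teacher into an observation-only student policy.
import torch
import torch.nn as nn

STATE_DIM, OBS_DIM, ACT_DIM = 32, 64, 12  # assumed dimensions

teacher = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
student = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, ACT_DIM))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    # In practice these batches would come from teacher rollouts in simulation;
    # random tensors are used here purely to keep the sketch runnable.
    privileged_state = torch.randn(256, STATE_DIM)
    observation = torch.randn(256, OBS_DIM)

    with torch.no_grad():
        target_action = teacher(privileged_state)   # teacher's action labels
    loss = nn.functional.mse_loss(student(observation), target_action)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```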
  • Lily Sun (host)
  • Fangyun Wei
  • Yaobo Liang
  • Jianlong Fu
  • Jianwei Yang
Developing Vision-Language-Action Foundation Models for Embodied AI
In recent years, the development of large vision-language-action (VLA) models has significantly advanced robotic manipulation, enabling robots to execute complex tasks described by natural-language instructions. However, current approaches primarily focus on extending vision-language models or diffusion models into VLA models. In this work, we carefully examine and design the three sub-models of vision, language, and action, creating state-of-the-art foundation models for robotics.
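To make the vision/language/action decomposition concrete, here is a minimal, hypothetical sketch of a VLA forward pass: a vision encoder and a language encoder produce embeddings that an action head maps to a continuous robot command. The architectures, dimensions, and names below are assumptions for illustration only and do not describe the models presented in the talk.

```python
# Minimal illustrative VLA sketch: vision encoder + language encoder + action head.
# All architectures and dimensions are assumptions, not the talk's actual models.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, action_dim=7):
        super().__init__()
        # Vision sub-model: a small CNN over RGB observations.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        # Language sub-model: embed instruction tokens and mean-pool them.
        self.text = nn.Embedding(vocab_size, embed_dim)
        # Action sub-model: map the fused embedding to a continuous command
        # (e.g., a 7-DoF end-effector delta).
        self.action_head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, image, instruction_tokens):
        v = self.vision(image)
        t = self.text(instruction_tokens).mean(dim=1)
        return self.action_head(torch.cat([v, t], dim=-1))

model = TinyVLA()
action = model(torch.randn(1, 3, 96, 96), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```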
GenRobot: Creating End-to-end Service Robots with Large Action Foundation Models
In our project, we are developing a versatile and efficient service robot designed to assist the elderly and individuals in need with their daily tasks. The robot can perform actions such as picking up a water cup and opening doors, with plans for more advanced interactions in the future. By leveraging our advanced PIE (Planning-Imagination-Execution) foundation model, we have achieved promising results in manipulation tasks, demonstrating its effectiveness in handling everyday objects. A key innovation in our approach is generalizable robot manipulation: once pre-trained, our robot foundation model can be adapted to new objects, environments, and robot platforms using few-shot learning techniques. This capability allows the robot to quickly learn and adapt to its surroundings, enhancing its utility and effectiveness in real-world scenarios. By integrating large action foundation models, we aim to create a service robot that not only performs tasks efficiently but also interacts meaningfully with people, ultimately improving their quality of life.
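The few-shot adaptation idea mentioned above can be pictured in its generic form: freeze a pre-trained backbone and fit only a small task-specific head on a handful of new demonstrations. The PIE model itself is not described in detail here, so everything in this sketch, including names and dimensions, is a hypothetical illustration of the general technique.

```python
# Generic few-shot adaptation sketch (hypothetical; not the PIE model itself):
# freeze the pre-trained backbone and fit a small head on a few demonstrations.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256))  # stands in for a pre-trained model
for p in backbone.parameters():
    p.requires_grad = False          # keep the foundation model frozen

adapter = nn.Linear(256, 7)          # small task-specific action head
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

# A "few-shot" dataset: e.g., 10 demonstrations of picking up a new cup.
demo_obs = torch.randn(10, 64)
demo_actions = torch.randn(10, 7)

for epoch in range(200):
    pred = adapter(backbone(demo_obs))
    loss = nn.functional.mse_loss(pred, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```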
Towards Multimodal Agentic Model that Can See, Talk and Act
Recently, large multimodal models (LMMs) have shown remarkable capability for comprehending visual content and following language instructions. Trained on large numbers of image-text pairs, followed by sophisticated image-based instruction tuning, existing LMMs understand static images well but still fall significantly short of understanding temporal dynamics and cause-and-effect in the physical world, let alone acting and planning for embodied AI tasks. In this talk, I will present our most recent efforts to build agentic multimodal models that can not only understand images and text, but also act and plan in the physical world. In particular, I will first introduce TraceVLA, a new vision-language-action model built on top of large multimodal models (e.g., Phi-3-V) that leverages simple yet effective visual trace prompting to augment spatial-temporal awareness for robot manipulation. I will then introduce LatentVLA, which explores a simple way of learning a vision-language-action policy from raw video data. Finally, I will cover a joint project within the Deep Learning group aimed at building generalist multimodal agent models for a wide range of embodied AI tasks. In this project, we draw the connection between conventional LMMs and VLAs, and explore how to move beyond action labels and leverage large-scale vision-language data to train action models in a scalable way.
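Visual trace prompting can be pictured as overlaying the robot's recent keypoint trajectory directly onto the observation image before it is passed to the multimodal model, so that motion history is visible in pixel space. The code below is a hypothetical illustration of that general idea, not TraceVLA's released implementation; the trace coordinates and drawing choices are assumptions.

```python
# Hypothetical illustration of visual trace prompting: draw the recent
# end-effector keypoint trajectory onto the observation before passing it
# to a vision-language-action model. Not TraceVLA's actual implementation.
from PIL import Image, ImageDraw

def overlay_trace(image: Image.Image, trace_xy, color=(255, 0, 0)) -> Image.Image:
    """Return a copy of `image` with the 2D keypoint trace drawn as a polyline."""
    prompted = image.copy()
    draw = ImageDraw.Draw(prompted)
    if len(trace_xy) >= 2:
        draw.line(trace_xy, fill=color, width=3)                 # motion history
    if trace_xy:
        x, y = trace_xy[-1]
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=color)   # current position
    return prompted

# Example: a synthetic 128x128 observation and a short trace of image coordinates.
obs = Image.new("RGB", (128, 128), (40, 40, 40))
trace = [(20, 100), (40, 80), (60, 70), (85, 60)]
prompted_obs = overlay_trace(obs, trace)
# `prompted_obs` would then be fed to the multimodal model alongside the text prompt.
```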
11:00 – 11:20 Coffee Break
11:20 – 11:50 Real-World Impact and Systems Automating Data Center Maintenance with Robotics
Data centers are critical infrastructure in today’s digital society, yet their maintenance predominantly depends on manual operations. This reliance stems from the complexity and reliability requirements of the operations, as well as the challenging environment of computer rooms, making automation a formidable task. Recent advancements in artificial intelligence, however, present new opportunities for automating these processes. Data from our data center partners reveal that over half of all malfunctions are caused by contamination on optical fiber connectors, which can be mitigated through meticulous cleaning, a repetitive and tedious task. In collaboration with MSRC, we are exploring the automation of optical fiber cleaning using robotics. In this talk, I will discuss our latest progress on this project and share our insights into the future of data center automation.
  • Mawo Kamakura (host)
  • Yizhong Zhang
  • Aaron Weissbart (remote)
Lessons from the Field: Engineering a Robotics Platform on Azure with Real-World Customer Insights
In designing a general-purpose robotics platform, real-world customer needs present both challenges and unique opportunities. This talk distills the lessons learned from working closely with customers, partners, and MSR across industries, uncovering the requirements for a scalable robotics solution on Azure. We will delve into the process of aligning platform capabilities with practical applications—leveraging Azure’s high-performance computing for robotic simulation, reinforcement learning training, teleoperation, and seamless data integration. Join us to explore how insights from the field inform a resilient, Azure-supported platform, built to power the next generation of adaptable robotics solutions.
11:50 – 13:00 Lunch Break
13:00 – 13:45 World Modeling and Planning IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI
We introduce Image-GOal Representations (IGOR), aiming to learn a unified, semantically consistent action space across humans and various robots. Through this unified latent action space, IGOR enables knowledge transfer across large-scale robot and human activity data. We achieve this by compressing the visual change between an initial image and its goal state into a latent action. IGOR allows us to generate latent action labels for internet-scale video data. This unified latent action space enables the training of foundation policies and world models across a wide variety of tasks performed by both robots and humans. We believe IGOR opens new possibilities for human-to-robot knowledge transfer and control.
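The core mechanism, compressing the visual change between a current frame and a goal frame into a low-dimensional latent action, can be sketched as an encoder-decoder bottleneck. The sketch below is an illustrative assumption of how such a latent action model could be set up, not IGOR's actual architecture or training pipeline.

```python
# Illustrative latent-action sketch (assumed details, not IGOR's code): compress the
# change between a current frame and a goal frame into a small latent action z, and
# train a decoder to reproduce the goal frame from (current frame, z).
import torch
import torch.nn as nn

OBS_DIM, LATENT_ACTION_DIM = 512, 16   # flattened frame features; tiny bottleneck

encoder = nn.Sequential(nn.Linear(2 * OBS_DIM, 256), nn.ReLU(),
                        nn.Linear(256, LATENT_ACTION_DIM))
decoder = nn.Sequential(nn.Linear(OBS_DIM + LATENT_ACTION_DIM, 256), nn.ReLU(),
                        nn.Linear(256, OBS_DIM))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                             lr=1e-3)

for step in range(100):
    # Stand-ins for embedded video frames; real training would use large-scale video.
    frame_t = torch.randn(64, OBS_DIM)
    frame_goal = torch.randn(64, OBS_DIM)

    z = encoder(torch.cat([frame_t, frame_goal], dim=-1))   # latent "action"
    recon = decoder(torch.cat([frame_t, z], dim=-1))         # predicted goal frame
    loss = nn.functional.mse_loss(recon, frame_goal)         # the bottleneck forces z
                                                              # to carry the change
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```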
  • Jianlong Fu (host)
  • Li Zhao
  • Tim Pearce
  • Dongqi Han
Scaling Laws for Pre-Training Agents and World Models
Progress in AI in the early 2020s has largely been driven by increasing model size, dataset size, and training compute. Whilst conceptually simple, the importance of this practice has led to an emerging subfield studying the science of scaling. This talk considers how a precise understanding of scaling can guide the design of embodied AI systems, for instance in optimally trading off model and dataset size.
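As a concrete reference point, a common parametric form from the scaling-laws literature (e.g., the Chinchilla analysis) writes loss as a function of parameter count N and training tokens D, and finds the compute-optimal split under a budget C ≈ 6ND. The coefficients and budgets below are placeholders chosen for illustration, not values from this talk.

```python
# Illustrative Chinchilla-style scaling law; coefficients are placeholders,
# not values reported in this talk.
#   L(N, D) = E + A / N**alpha + B / D**beta,  with compute budget C ≈ 6 * N * D.

E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28   # assumed coefficients

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_split(C, num_points=2000):
    """Grid-search the model size N that minimizes loss under C = 6 * N * D."""
    best = None
    for i in range(1, num_points):
        N = 10 ** (6 + 6 * i / num_points)     # sweep N from 1e6 to 1e12 params
        D = C / (6 * N)                        # tokens implied by the budget
        candidate = (loss(N, D), N, D)
        if best is None or candidate < best:
            best = candidate
    return best

for C in (1e21, 1e23, 1e25):                   # example FLOP budgets
    L, N, D = best_split(C)
    print(f"C={C:.0e}  N≈{N:.2e} params  D≈{D:.2e} tokens  loss≈{L:.3f}")
```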
Efficient and flexible diffusion planning for robotic tasks
How to make decisions efficiently yet flexibly is a central problem for both biological animals and robots. It involves processing sensory data, learning from interactions, making plans, and executing actions that aim to fulfill specific objectives. In this talk I will share our recent work on leveraging diffusion models for decision making, in particular using them as powerful planners for simulated robotic tasks. Diffusion models are powerful neural architectures that can generate high-dimensional data from complex distributions, with guidance toward specific goals. We advance diffusion models to (1) generate efficient behavior by learning from offline datasets and (2) perform flexible planning for goals not seen during training. Our findings support a promising integration of diffusion models and embodied intelligence.
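The planning recipe described here can be summarized as: train a diffusion model over trajectories offline, then at test time run the reverse denoising process while nudging each step toward a goal with a guidance gradient. The loop below is a deliberately simplified, untrained sketch of that general Diffuser-style idea; all shapes, schedules, and the guidance term are assumptions.

```python
# Simplified, untrained sketch of diffusion-based planning (Diffuser-style guided
# sampling). Shapes, schedules, and the guidance term are assumptions for illustration.
import torch
import torch.nn as nn

HORIZON, STATE_DIM, T = 16, 4, 50            # plan length, state size, diffusion steps

denoiser = nn.Sequential(                    # would be trained on offline trajectories;
    nn.Linear(HORIZON * STATE_DIM + 1, 256), # here it is randomly initialized
    nn.ReLU(), nn.Linear(256, HORIZON * STATE_DIM))

betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

goal = torch.tensor([1.0, 1.0, 0.0, 0.0])    # desired final state
guide_scale = 0.1

x = torch.randn(HORIZON, STATE_DIM)          # start from pure noise
for t in reversed(range(T)):
    t_emb = torch.tensor([t / T])
    eps = denoiser(torch.cat([x.flatten(), t_emb])).reshape(HORIZON, STATE_DIM)
    # Standard DDPM reverse-step mean.
    x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    # Guidance: nudge the trajectory so its last state moves toward the goal.
    x_g = x.detach().requires_grad_(True)
    reward = -((x_g[-1] - goal) ** 2).sum()
    reward.backward()
    x = (x_g + guide_scale * x_g.grad).detach()
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

plan = x                                      # denoised trajectory = the candidate plan
print(plan.shape)                             # torch.Size([16, 4])
```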
13:45 – 14:00 Research Incubation Learning Spatial Intelligence from Nature’s blueprint
Spatial intelligence, the capacity to comprehend, manipulate, and navigate the physical environment, is crucial in both natural and artificial systems. In nature, diverse organisms, from ants to birds, demonstrate complex spatial reasoning through strategies shaped by hundreds of millions of years of evolution since the Cambrian explosion. This talk explores how the principles underlying spatial intelligence in nature can inform and inspire advancements in embodied artificial intelligence and robotics. By studying nature, we can design AI models that incorporate resilient mechanisms inspired by “nature’s blueprint”.
  • Jianlong Fu (host)
  • Ade Famoti
14:00 – 15:00 Panel Discussion Embodied AI: Challenges and Opportunities
  • Jiaolong Yang (host)
  • Ade Famoti
  • Baining Guo
  • Jianfeng Gao
  • Yasuyuki Matsushita
15:00 – 15:05 Closing  
  • Baining Guo
  • Jiang Bian