

Microsoft Research Asia StarTrack Scholars 2026: Crafting Spatial and Embodied Foundation Models for AI 


We live in a three-dimensional physical world, and Spatial Intelligence stands as a critical frontier in the evolution of AI, demanding not only a deep understanding of three-dimensional environments but also the capacity to act effectively within them. Unlike existing foundation models that focus on static data in language or images, spatially intelligent systems require dynamic reasoning in the 3D world: predicting how an environment changes and choosing actions to accomplish meaningful goals. Our vision is to develop foundation models that unify perception, reasoning, and action, enabling digital agents and real-world robots to generalize across diverse environments and tasks with minimal adaptation.

Figure: Spatial AI foundation models support spatial perception, reasoning, and interaction capabilities 

If you are fascinated by spatial intelligence, the crucial ability of AI to deeply understand and dynamically act in three-dimensional environments, we cordially invite you to join the Microsoft Research Asia StarTrack Scholars Program. Applications are now open for the 2026 program. For more details and to submit your registration, visit our official website: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research.

Build Foundation Models for Spatial Perception and Understanding 

Developing spatial AI foundation models requires a sophisticated approach to perceiving and understanding three-dimensional environments. These models integrate advanced techniques from several key research areas, including but not limited to spatial-reasoning-enhanced large language models (LLMs) and vision-language models (VLMs), 3D world models, and fundamental 3D reconstruction and generation models. By weaving together the capabilities of these foundation models, we aim to create systems that not only perceive but also reason intelligently about the dynamic complexities of the three-dimensional world.

From Spatial Reasoning to Embodied Physical Mastery 

Building upon spatial reasoning, our research will also explore the transition to embodied AI with action capabilities. As the next AI/AGI wave, Embodied AI innovations promise to revolutionize various industries and significantly impact human life. Our exploration involves developing large, generalizable Vision-Language-Action (VLA) models that allow embodied systems to interpret and interact with different environments. We will delve into robot manipulation tasks, exploring how machines can physically engage with diverse objects and environments using nuanced understanding and precision. Reinforcement Learning will also play a crucial role, as it provides a framework for systems to learn optimal actions through trial and error, adapting to new scenarios and tasks. Our approach aims to create agents capable of fluidly transitioning from perception to action, achieving mastery in diverse and dynamic environments.  

Figure: Embodied AI foundation models seamlessly integrating vision, language, and action capabilities 
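
As a purely illustrative aside, the trial-and-error learning that Reinforcement Learning provides can be sketched with tabular Q-learning on a toy problem. Everything in this sketch is an assumption made for illustration: the one-dimensional corridor environment, the function names `train_q_table` and `greedy_policy`, and the hyperparameter values are hypothetical and are not taken from the StarTrack program or from any model mentioned above.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D corridor.

    States are 0..n_states-1. The agent starts at state 0 and receives a
    reward of 1.0 when it reaches the last state, which ends the episode.
    Actions: 0 = move left (stays put at state 0), 1 = move right.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Explore with probability epsilon, or when both actions tie;
            # otherwise exploit the action with the higher estimate.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # The goal state's values are never updated and stay at 0,
            # so no future value is bootstrapped from the terminal state.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Return the greedy action for every non-terminal state."""
    return [0 if left > right else 1 for left, right in q[:-1]]


if __name__ == "__main__":
    print(greedy_policy(train_q_table()))
```

In this toy setting the agent starts with no knowledge, wanders at random, and gradually learns to prefer moving right in every state. Real embodied systems face continuous observations and actions, so this sketch conveys only the basic update rule, not a practical robot-learning method.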

Tackling the Grand Challenge of Data Scarcity and Heterogeneity 

Despite the promise of spatial and embodied AI, the field faces a significant hurdle compared with the progress seen in LLMs. Data scarcity remains a grand challenge, hindering our ability to replicate scaling laws. Existing 3D data and robotic action data are not only orders of magnitude smaller than text and image corpora but also suffer from limited diversity and high heterogeneity. To overcome these challenges, we plan to craft large-scale 3D vision-language datasets, leverage Internet-scale video for robotic manipulation VLA pretraining, and develop latent action models trained on diverse video sources for embodiment-agnostic action learning. Through these initiatives, we aim to increase the diversity and volume of available data and address the heterogeneity challenge, enabling more robust training and broader generalization for spatial intelligence models.

Figure: Data scarcity blocks the construction of large Spatial and Embodied AI models  

StarTrack Scholar Tong Zhang: A Journey Exploring the Fusion of 3D Understanding and Large Models

In 2025, Dr. Tong Zhang, an assistant professor at the University of Chinese Academy of Sciences, joined the StarTrack program and embarked on an in-depth collaboration with Jiaolong Yang's team at Microsoft Research Asia, focusing on Spatial AI, the intersection of large language models and 3D visual understanding.

Tong Zhang’s expertise in geometric processing and visual algorithms complemented Jiaolong’s accumulated experience in depth estimation and systematic frameworks. With the support of MSRA interns Sicheng, Dengyu, Wuyue, and students from the Chinese Academy of Sciences, the team shared data pipelines and computing resources and met frequently to advance the project. Tong Zhang recalls: “Our Tuesday and Thursday group meetings, plus casual chats over meals, often sparked fresh perspectives. The engineering prowess of Jiaolong’s team was eye-opening for someone with an academic background like mine. Everyone was professional yet down-to-earth, and the collaboration was incredibly efficient!”

Their collaboration centered on evaluating existing Spatial AI methods and constructing a more generalizable paradigm for 3D understanding. To address the challenge of 3D data scarcity, the team generated static and dynamic scene datasets by rendering existing meshes; to tackle model context inconsistencies, they designed fair evaluation protocols to ensure reliable cross-method comparisons. Reflecting on a key breakthrough, Tong Zhang could barely contain his excitement: “When we discovered that the Merge2Depth model, submitted by Jiaolong’s team to NeurIPS in May, outperformed existing methods in outdoor dynamic scene depth estimation, I immediately decided to build on it as the core, integrating geometric priors to create a dynamic reasoning module. This not only validated our hypothesis but also dramatically boosted the performance of the entire pipeline!” The project is now sprinting toward a CVPR 2026 submission, demonstrating the immense potential of injecting traditional 3D expertise into large models. 

During the StarTrack program, Tong Zhang actively participated in MSRA academic activities, such as the July StarTrack Scholar Forum and regular group meetings, quickly integrating into the ecosystem and broadening his horizons. He marveled: “MSRA’s global footprint showed me the possibilities of collaboration with labs in Zurich, Hong Kong, Singapore, and beyond. The international atmosphere strengthened my confidence in multi-regional partnerships.” 

The intense three-month collaboration opened a window for Tong Zhang into industry-grade research, allowing him to experience the win-win value of academia-industry partnerships. In his view, the strength of such collaborations lies in complementarity and division of labor: academia focuses on original algorithms, while industry excels in engineering validation and open-source delivery. He suggested that the StarTrack program consider long-term mechanisms: “We should better understand each other’s needs and maintain stable, ongoing exchanges.” 

As a former scholar, Tong Zhang welcomes more idealistic scholars to join. He advises young researchers to seize this opportunity: “Alignment of research directions is key—communicate deeply with the mentor in advance; once on board, actively participate in activities, prioritize joint cultivation, and focus on long-term mechanisms post-visit. MSRA is the most deeply rooted enterprise research institute in China; collaborating with seasoned researchers can give your work systematic impact.” He emphasized that the StarTrack program is not just a short-term visit but an accelerator for young scholars to connect with industry resources and achieve cross-disciplinary breakthroughs. 

Although the visit has concluded, the collaboration is just beginning to flourish. MSRA welcomes more young scholars with shared research interests to gather here and explore the mysteries of scientific research.

Potential Research Topics for StarTrack Scholars 

We invite scholars to explore a range of exciting research topics within the StarTrack program, including but not limited to: 

  • Spatial LLM/VLM 
  • 3D Vision Foundation Models 
  • Robotic VLA Models 
  • Dexterous Hand Manipulation 
  • Latent Action Learning 
  • Data and Benchmarks for Spatial AI 
  • 2D/3D World Models 
  • Real-World Reinforcement Learning 

The Microsoft Research Asia StarTrack Scholars Program advocates an open attitude, encouraging dialogue and joint experimentation with researchers from various disciplines to discover viable solutions. Visit our official website to learn more: Microsoft Research Asia StarTrack Scholars Program – Microsoft Research

Theme Team

  • Baining Guo, Technical Fellow, Microsoft Research Asia 
  • Jiang Bian, Partner Research Manager, Microsoft Research Asia 
  • Jiaolong Yang, Principal Research Manager, Microsoft Research Asia 
  • Li Zhao, Principal Researcher, Microsoft Research Asia 
  • Hao Chen, Senior Research PM, Microsoft Research Asia 
  • Yaobo Liang, Senior Researcher, Microsoft Research Asia 
  • Yu Deng, Senior Researcher, Microsoft Research Asia 
  • Sicheng Xu, Senior Researcher, Microsoft Research Asia 
  • Chuheng Zhang, Researcher, Microsoft Research Asia
  • Kaixin Wang, Senior Researcher, Microsoft Research Asia 

If you have any questions, please email Ms. Yanxuan Wu, program manager of the Microsoft Research Asia StarTrack Scholars Program, at v-yanxuanwu@microsoft.com
