{"id":1155413,"date":"2025-12-04T04:13:17","date_gmt":"2025-12-04T12:13:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1155413"},"modified":"2026-02-24T11:07:32","modified_gmt":"2026-02-24T19:07:32","slug":"physics-guided-vision-language-world-models-for-agentic-4d-scene-understanding","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/physics-guided-vision-language-world-models-for-agentic-4d-scene-understanding\/","title":{"rendered":"Physics-Guided Vision-Language World Models for Agentic 4D Scene Understanding"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background- card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"721\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1.jpg\" class=\"attachment-full size-full\" alt=\"background pattern\" style=\"\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1.jpg 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-300x113.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-1024x385.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-768x288.jpg 768w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-1536x577.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-1600x600.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/11\/Physics-Guided-Vision-Language-World-Models-for-Agentic-4D-Scene-Understanding_Banner-1920x721-1-240x90.jpg 240w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"physics-guided-vision-language-world-models-for-agentic-4d-scene-understanding\">Physics-Guided Vision-Language World Models for Agentic 4D Scene Understanding<\/h1>\n\n\n\n<p><\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p>This project develops a unified framework for physically grounded world modelling that combines video-based temporal prediction with Gaussian Splatting for photorealistic 3D representation. A Physics Vision-Language Model translates natural-language instructions into transformations that respect physical constraints, enabling interpretable and goal-directed control in dynamic scenes. 
By integrating perception, prediction, and action in a Vision-Language-Action loop, the research aims to advance agentic AI systems capable of transparent, physics-aware reasoning, supporting applications in robotics, simulation, and education.<\/p>\n\n\n\n<p>This research is conducted via&nbsp;<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/academic-program\/agentic-ai-research-and-innovation\/\">The Agentic AI Research and Innovation&nbsp;<\/a>(AARI) Initiative, which focuses on the next frontier of agentic systems through&nbsp;<em>Grand Challenges<\/em>&nbsp;with the academic community and Microsoft Research.<\/p>\n\n\n","protected":false},"excerpt":{"rendered":"<p>This project develops a unified framework for physically grounded world modelling that combines video-based temporal prediction with Gaussian Splatting for photorealistic 3D representation. A Physics Vision-Language Model translates natural-language instructions into transformations that respect physical constraints, enabling interpretable and goal-directed control in dynamic scenes. 
By integrating perception, prediction, and action in a Vision-Language-Action loop, the [&hellip;]<\/p>\n","protected":false},"featured_media":1155699,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1155413","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"guest","display_name":"Benjamin  Busam","user_id":1157218,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Hyunjun Jung","user_id":1157220,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Alican Karaomer","user_id":1162806,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Mert  Kiray","user_id":1157224,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Steven Kuang","user_id":1162808,"people_section":"Section name 0","alias":""},{"type":"guest","display_name":"Weihang Li","user_id":1162810,"people_section":"Section name 0","alias":""},{"type":"user_nicename","display_name":"Sarah Parisot","user_id":43638,"people_section":"Section name 0","alias":"sarahparisot"},{"type":"user_nicename","display_name":"Sergio Valcarcel Macua","user_id":42507,"people_section":"Section name 0","alias":"sergiov"},{"type":"guest","display_name":"Dani Velikova","user_id":1157222,"people_section":"Section name 
0","alias":""}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1155413","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1155413\/revisions"}],"predecessor-version":[{"id":1162811,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1155413\/revisions\/1162811"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1155699"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1155413"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1155413"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1155413"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1155413"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1155413"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}