{"id":1147868,"date":"2025-08-15T13:33:51","date_gmt":"2025-08-15T20:33:51","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1147868"},"modified":"2025-08-20T09:36:43","modified_gmt":"2025-08-20T16:36:43","slug":"mindjourney","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/mindjourney\/","title":{"rendered":"MindJourney"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-auburn card-background--inset-right\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"mindjourney\">MindJourney<\/h1>\n\n\n\n<p>Test-time scaling with world models for spatial reasoning<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-is-mindjourney\">What is MindJourney?<\/h2>\n\n\n\n<p>MindJourney is a framework that equips AI agents with a \u201csimulation loop\u201d to explore hypothetical 3D viewpoints before answering spatial reasoning questions\u2014tackling a core limitation of vision-language models (VLMs), which recognize objects well in 2D images but struggle to infer the interactive 3D world behind them. Given a spatial reasoning query, a world model (a camera-controllable video generator) renders candidate novel views along short action sequences; a Spatial Beam Search steers exploration toward the most promising trajectories; and an off-the-shelf VLM ranks and integrates the most informative observations to produce an answer\u2014without extra training. 
![MindJourney pipeline diagram](https://www.microsoft.com/en-us/research/wp-content/uploads/2025/08/MindJourney_pipeline_1400x788.jpg)

- [Source code](https://github.com/UMass-Embodied-AGI/MindJourney)
- [External project page](https://umass-embodied-agi.github.io/MindJourney/)
- [Read the paper](https://arxiv.org/abs/2507.12508)

## Acknowledgements

The project is by Yuncong Yang (Research Intern), Reuben Tan (Senior Researcher), Swadheen Shukla (Principal Program Manager), and Jianfeng Gao (Distinguished Scientist).
We thank our external collaborators: Jiageng Liu (University of Massachusetts Amherst), Zheyuan Zhang (Johns Hopkins University), Siyuan Zhou (Hong Kong University of Science and Technology), Jianwei Yang (work done at Microsoft Research), Yilun Du (Harvard University), and Chuang Gan (University of Massachusetts Amherst).
Zhou","user_id":1147892,"people_section":"Collaborators","alias":""}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1147868","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1147868\/revisions"}],"predecessor-version":[{"id":1148047,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1147868\/revisions\/1148047"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1147897"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1147868"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1147868"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1147868"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1147868"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1147868"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}