{"id":1135867,"date":"2025-04-04T13:00:00","date_gmt":"2025-04-04T20:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1135867"},"modified":"2025-04-30T17:51:15","modified_gmt":"2025-05-01T00:51:15","slug":"whamm-real-time-world-modelling-of-interactive-environments","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/whamm-real-time-world-modelling-of-interactive-environments\/","title":{"rendered":"WHAMM! Real-time world modelling of interactive environments."},"content":{"rendered":"\n<p>Today we are making available an interactive real-time gameplay experience in Copilot Labs. Head over to this <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/muse-quakeii-whamm\">link<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to play an AI rendition of Quake II gameplay, powered by Muse.<\/p>\n\n\n\n<div align=\"center\" class=\"is-layout-flex\">\n<video controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/vid1_square.mp4\" width=\"300px\">5 second video of Quake II gameplay generated in real-time by WHAMM in response to the user&#8217;s controller inputs.\n<\/video>\n<video controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/vid2_square.mp4\" width=\"300px\">5 second video of Quake II gameplay generated in real-time by WHAMM in response to the user&#8217;s controller inputs.<\/video>\n<video controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/vid3_square.mp4\" width=\"300px\">5 second video of Quake II gameplay generated in real-time by WHAMM in response to the user&#8217;s controller inputs.<\/video>\n<\/div>\n\n\n\n<p><strong>Example generations from our WHAMM model showcasing Quake II gameplay.<\/strong><\/p>\n\n\n\n<h2 
class=\"wp-block-heading\" id=\"what-are-we-doing\">What are we doing?<\/h2>\n\n\n\n<p>Muse is our family of world models for video games at Microsoft. Following on from our announcement of <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/introducing-muse-our-first-generative-ai-model-designed-for-gameplay-ideation\/\">Muse<\/a> in February, and the World and Human Action Model (WHAM) that was recently published in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41586-025-08600-3\">Nature<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we introduce a real-time playable extension of our model. Our approach, WHAMM, which stands for <span style=\"text-decoration: underline\">W<\/span>orld and <span style=\"text-decoration: underline\">H<\/span>uman <span style=\"text-decoration: underline\">A<\/span>ction <span style=\"text-decoration: underline\">M<\/span>askGIT <span style=\"text-decoration: underline\">M<\/span>odel (pronounced WHAM, the second M is silent \u2013 yes, this is intentionally silly), generates visuals much faster than WHAM. This means that you can interact with the model through keyboard\/controller actions and see the effects of your actions immediately, essentially allowing you to play inside the model.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-has-changed\">What has changed?<\/h2>\n\n\n\n<p>Since the release of WHAM-1.6B, our first WHAM trained on Bleeding Edge, we have changed and improved a number of aspects that affect the overall experience.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>First and foremost, we have improved the speed of generation. WHAMM is able to generate images at 10+ frames a second, enabling real-time video generation. 
In contrast, WHAM-1.6B can generate about 1 image a second.<\/li>\n\n\n\n<li>The WHAMM recipe successfully transferred to a new game: Quake II. (We teased an earlier variant of the WHAMM model trained on Bleeding Edge <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/youtu.be\/4OVcVG52hGA\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>). In comparison to Bleeding Edge, Quake II is a faster-paced first-person shooter game which plays very differently.<\/li>\n\n\n\n<li>Transferring to a new game was made possible by substantially reducing the quantity of data that we required for training WHAMM. This was achieved through more intentional data collection and curation, resulting in only 1 week of data being used for WHAMM training. This is a substantial decrease from the 7 years of gameplay that we used to train WHAM-1.6B. We did this by working with professional game testers to collect the data, and by focusing on a single level with intentional gameplay, ensuring we collected enough high-quality and diverse data.<\/li>\n\n\n\n<li>Lastly, we doubled the resolution of WHAMM\u2019s output, increasing it to 640&#215;360 \u2013 WHAM-1.6B used 300&#215;180. We found this was possible with only minor modifications to the image encoder\/decoder, but it resulted in a large bump in the perceived quality of the overall experience. To accomplish this, we simply increased the patch size of the ViT to 20 (up from 10), which allowed us to keep the number of tokens roughly the same.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"whamm-architecture\">WHAMM architecture<\/h2>\n\n\n\n<p>In order to enable a real-time experience, we changed our modelling strategy. 
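As a quick sanity check on the resolution change described above, the token count per image follows directly from the image resolution and the ViT patch size. The helper below is ours, purely for illustration, and is not model code:

```python
def num_image_tokens(width: int, height: int, patch_size: int) -> int:
    """Number of ViT tokens for one frame: one token per non-overlapping patch."""
    assert width % patch_size == 0 and height % patch_size == 0
    return (width // patch_size) * (height // patch_size)

# WHAMM: 640x360 at patch size 20 -> 32 x 18 = 576 tokens per image
print(num_image_tokens(640, 360, 20))   # 576
# WHAM-1.6B: 300x180 at patch size 10 -> 30 x 18 = 540 tokens per image
print(num_image_tokens(300, 180, 10))   # 540
```

Doubling both image dimensions while doubling the patch size leaves the token count almost unchanged, which is why the transformer's workload stays roughly the same.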
Moving from an autoregressive LLM-like setup, where WHAM-1.6B would generate 1 token at a time, to a MaskGIT [2] setup allows us to generate all of the tokens for an image in as few forward passes as we choose.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"589\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview-1024x589.png\" alt=\"WHAM overview. WHAM first tokenises our gameplay data of image, action, image, action, etc. sequences into a longer sequence of tokens. Then we train a decoder-only transformer to predict the next token in the sequence. Left: We tokenise each image using a ViT-VQGAN. Right: We train a transformer on the resulting sequence of tokens. Please refer to our blog post and the Nature article for more details on WHAM.\" class=\"wp-image-1135893\" style=\"width:715px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview-1024x589.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview-300x173.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview-768x442.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview-240x138.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/wham_overview.png 1104w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Figure 1: WHAM overview. WHAM first tokenises our gameplay data of image, action, image, action, etc. sequences into a longer sequence of tokens. Then we train a decoder-only transformer to predict the next token in the sequence. Left: We tokenise each image using a ViT-VQGAN [3]. Right: We train a transformer on the resulting sequence of tokens. 
Please refer to our <\/strong><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/introducing-muse-our-first-generative-ai-model-designed-for-gameplay-ideation\/\"><strong>blog post<\/strong><\/a><strong> and the <\/strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41586-025-08600-3\"><strong>Nature article<\/strong><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><strong> for more details on WHAM.<\/strong><\/p>\n\n\n\n<p>An overview of the WHAM setup is shown in Figure 1. On the left, we utilise a ViT-VQGAN [3] to tokenise the image. On the right, we model the resulting sequence of tokens using a decoder-only transformer. Much like an LLM, it is trained to predict the next token in the sequence. Please refer to our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.nature.com\/articles\/s41586-025-08600-3\">paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> [1] for more details.<\/p>\n\n\n\n<p>An overview of the WHAMM architecture is shown in Figure 2. Shown on the left, exactly like WHAM, we first tokenise the image. For this specific setting, each 640&#215;360 image is turned into 576 tokens (for WHAM, each 300&#215;180 image was turned into 540 tokens). Since WHAM generates 1 token at a time, generating the 540 tokens needed for an image can take a long time. In contrast, a MaskGIT-style setup can generate all of the tokens for an image in as few forward passes as we want. This enables us to generate the image tokens fast enough to facilitate a real-time experience. Typically, in a MaskGIT setup you would start with all of the tokens for an image masked and then produce predictions for each and every one of them. 
We can think of this as producing a rough and ready first pass for the image. We would then re-mask some of those tokens, predict them again, re-mask, and so on. This iterative procedure allows us to gradually refine our image prediction. However, since we have tight constraints on the time we can take to produce an image, we are very limited in how many passes we can do through a big transformer.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"403\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview-1024x403.png\" alt=\"WHAMM overview. Left: We tokenise each image using a ViT-VQGAN, exactly like WHAM. Middle: The Backbone transformer takes in the context, the 9 previous image-action pairs, and predicts the tokens for the next image. Right: The Refinement transformer iteratively refines the image token predictions by repeatedly masking and predicting them.\" class=\"wp-image-1135894\" style=\"width:784px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview-1024x403.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview-300x118.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview-768x302.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview-240x95.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/whamm_overview.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>Figure 2: WHAMM overview. Left: We tokenise each image using a ViT-VQGAN, exactly like WHAM. Middle: The Backbone transformer takes in the context, the 9 previous image-action pairs, and predicts the tokens for the next image. 
Right: The Refinement transformer iteratively refines the image token predictions by repeatedly masking and predicting them [2].<\/strong><\/p>\n\n\n\n<p>To work around this, we adopt a two-stage setup for WHAMM. First, we have the \u201cBackbone\u201d transformer (~500M parameters), shown in the middle of Figure 2. This module takes as input the context (in our case, the tokens of the 9 previous image-action pairs) and produces an initial prediction for all the tokens of the image. Next, shown on the right of Figure 2, we have a separate \u201cRefinement\u201d transformer, which is responsible for refining our initial predictions for the image tokens. This module is both smaller (~250M parameters) and takes in substantially fewer tokens as input, allowing it to run much faster. This lets us run many iterative MaskGIT steps to ensure a better final prediction. To ensure the Refinement module has the necessary information from the context available, instead of directly conditioning on the context tokens (as the Backbone transformer does), it takes as input a smaller set of \u201cconditioning\u201d tokens from the output of the bigger Backbone transformer (shown in pink in Figure 2).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"quake-ii-whamm\">Quake II WHAMM<\/h2>\n\n\n\n<p>The fun part is then being able to play a simulated version of the game inside the model. After the release of WHAM-1.6B only 6 short weeks ago, we immediately launched into this project, training WHAMM on a new game. 
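Before going on: the two-stage decode described above can be summarised in a short sketch. This is our illustrative pseudocode, not the actual implementation \u2013 the model callables, the re-masking schedule, and all names here are stand-ins we invented:

```python
import numpy as np

MASK_TOKEN = -1   # placeholder id for a masked position (illustrative)
NUM_TOKENS = 576  # image tokens per 640x360 frame (32 x 18 patches)

def generate_image_tokens(backbone, refiner, context_tokens, num_refine_steps=4):
    """Two-stage MaskGIT-style decode: one backbone pass proposes every token,
    then a smaller refiner repeatedly re-masks low-confidence tokens and
    predicts them again."""
    # Backbone: a single pass over the full context (9 image-action pairs)
    # yields an initial guess for all image tokens, plus a compact set of
    # conditioning tokens that stand in for the context during refinement.
    tokens, confidence, conditioning = backbone(context_tokens)
    for step in range(num_refine_steps):
        # Re-mask a shrinking fraction of the least-confident tokens.
        frac = 0.5 * (1.0 - step / num_refine_steps)
        n_mask = max(1, int(frac * NUM_TOKENS))
        to_mask = np.argsort(confidence)[:n_mask]
        tokens = tokens.copy()
        tokens[to_mask] = MASK_TOKEN
        # The refiner is cheap: it sees only the current image tokens and the
        # conditioning tokens, never the whole context.
        tokens, confidence = refiner(tokens, conditioning)
    return tokens
```

The key design point is that only the first (expensive) pass touches the full context; every refinement pass runs over a much shorter sequence, which is what makes many iterations affordable within a real-time frame budget.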
The project entailed gathering the data from scratch, refining our earlier WHAMM prototypes, and then training both the image encoder\/decoder and the WHAMM models.<\/p>\n\n\n\n<p>A concerted effort by the team resulted in both planning out what data to collect (which game, how the testers should play it, what kinds of behaviours we might need to train a world model, etc.), and the actual collection, preparation, and cleaning of the data required for model training.<\/p>\n\n\n\n<p>Much to our initial delight, we were able to play inside the world that the model was simulating. We could wander around, move the camera, jump, crouch, shoot, and even blow up barrels, just like in the original game. Additionally, since it features in our data, we can also discover some of the secrets hidden in this level of Quake II.<\/p>\n\n\n\n<div align=\"center\">\n<video controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/quakeii_long_square.mp4\" width=\"400px\">1-minute-long video generated by WHAMM in real-time of Quake II gameplay generated in response to the user&#8217;s controller actions. The generated video shows the user navigating through one of the secret areas and ends with blowing up barrels to unlock another area.\n<\/video>\n<\/div>\n\n\n\n<p><strong>Figure 3: A video from our internal research prototype portal demonstrating one of the \u201csecret\u201d areas in Quake II\u2019s first level.<\/strong><\/p>\n\n\n\n<p>We can also insert images into the model\u2019s context and have those modifications <em>persist<\/em> in the scene.<\/p>\n\n\n\n<div align=\"center\">\n<video controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/be_powercell_crop.mp4\" width=\"600px\">\nShort clip demonstrating insertion of a power cell into the world generated by a version of WHAMM trained on Bleeding Edge. 
The power cell is dragged onto the visuals, and is then assimilated into the world, allowing the user to interact with it.\n<\/video>\n<\/div>\n\n\n\n<p><strong>Figure 4: An example of inserting an object into the world and then being able to interact with it. This is from the end of this <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/youtu.be\/4OVcVG52hGA\">video<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, showing power cell insertion into a WHAMM trained on Bleeding Edge.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"limitations\">Limitations<\/h2>\n\n\n\n<p>Whilst we feel it is incredibly fun to play a simulated version of the game inside the model, there are of course limitations and shortcomings of our current approach.<\/p>\n\n\n\n<p>The most important is that <strong>this is a generative model<\/strong>. Thus, we are learning an approximation of the real environment that generated its training data. We do not intend for this to fully replicate the actual experience of playing the original Quake II game. This is intended to be a research exploration of what we are able to build using current ML approaches. Think of this as <em>playing the model<\/em> as opposed to playing the game.<\/p>\n\n\n\n<p><strong>Enemy interactions. <\/strong>The interactions with enemy characters are a big area for improvement in our current WHAMM model. Often, enemies will appear fuzzy in the images, and combat with them (damage dealt to both the enemy and the player) can be incorrect. Whilst the entire experience is not 100% faithful to the original environment, this aspect of it is particularly noticeable since the enemies are one of the primary things the player will <em>interact <\/em>with.<\/p>\n\n\n\n<p><strong>Context length. <\/strong>In our current model, the context length is 0.9 seconds of gameplay (9 frames at 10fps). 
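A fixed context window like this behaves as a rolling buffer: each new frame pushes the oldest one out. A minimal sketch (all names here are ours, for illustration only) of why off-screen objects get forgotten:

```python
from collections import deque

CONTEXT_FRAMES = 9  # 0.9 s of context at 10 fps

# The model only ever conditions on the last 9 image-action pairs,
# so the context is effectively a fixed-size rolling buffer.
context = deque(maxlen=CONTEXT_FRAMES)
for t in range(20):
    # (frame, action) stand in for one tokenised image and its controller input
    context.append((f"frame_{t}", f"action_{t}"))

# Frames 0-10 have been pushed out: anything last seen more than
# 9 steps (0.9 s) ago can no longer influence the next prediction.
print(context[0])  # ('frame_11', 'action_11')
```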
The model can and will forget about objects that stay out of view for longer than this window. This can also be a source of fun, whereby you can defeat or spawn enemies by looking at the floor for a second and then looking back up. Or it can let you teleport around the map by looking up at the sky and then back down. These are some examples of <em>playing the model<\/em>.<\/p>\n\n\n\n<p><strong>Counting. <\/strong>The health value is not always reliable. In particular, the model doesn\u2019t always count correctly. This can affect the interactions with the health packs and with enemies.<\/p>\n\n\n\n<p><strong>Scope of the experience is limited. <\/strong>At the moment, WHAMM is only trained on a single part of a single level of Quake II. If you reach the end of the level (going down the elevator), then the generations freeze because we stopped recording data at that point and restarted the level.<\/p>\n\n\n\n<p><strong>Latency. <\/strong>Making WHAMM widely available for anybody to try at scale has introduced noticeable latency between your actions and their effects on screen.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"future-work\">Future work<\/h2>\n\n\n\n<p>This WHAMM model is an early exploration of real-time generated gameplay experiences. As a team, we are excited about exploring what new kinds of interactive media could be made possible by these kinds of models. We highlight the limitations above not to take away from the fun of the experience, but to bring attention to areas in which future models could be improved, enabling new kinds of interactive experiences and empowering game creators to bring to life the stories they wish to tell.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"contributions\">Contributions<\/h2>\n\n\n\n<p>This was a big joint-team effort involving Game Intelligence, Xbox Gaming AI, and the Xbox Certification Team. 
The contributions listed here focus on the data and model-training pipeline.<\/p>\n\n\n\n<p><strong>Model Training.<br><\/strong>Tabish Rashid. Victor Fragoso. Chuyang Ke.<\/p>\n\n\n\n<p><strong>Data\/Infrastructure.<\/strong><br>Yuhan Cao. Dave Bignell. Shanzheng Tan. Lukas Sch\u00e4fer. Sarah Parisot. Abdelhak Lemkhenter. Chris Lovett. Pallavi Choudhury. Raluca Stevenson. Sergio Valcarcel Macua. Andrew Donnelly.<\/p>\n\n\n\n<p><strong>Advisory.<\/strong><br>Daniel Kennett. Andrea Trevi\u00f1o Gavito.<\/p>\n\n\n\n<p><strong>Project Management.<\/strong><br>Linda Wen. Jason Entenmann.<\/p>\n\n\n\n<p><strong>Project Leadership.<br><\/strong>Katja Hofmann. Haiyan Zhang.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"references\">References<\/h3>\n\n\n\n<p><em>[1] Kanervisto, Anssi, et al. &#8220;World and Human Action Models towards gameplay ideation.&#8221;&nbsp;Nature&nbsp;638.8051 (2025): 656-663.<br>[2] Chang, Huiwen, et al. &#8220;MaskGIT: Masked Generative Image Transformer.&#8221;&nbsp;Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2022.<br>[3] Yu, Jiahui, et al. &#8220;Vector-quantized Image Modeling with Improved VQGAN.&#8221;&nbsp;International Conference on Learning Representations. 2022.<\/em><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today we are making available an interactive real-time gameplay experience in Copilot Labs. Head over to this link (opens in new tab) to play an AI rendition of Quake II gameplay, powered by Muse. 5 second video of Quake II gameplay generated in real-time by WHAMM in response to the user&#8217;s controller inputs. 
5 second [&hellip;]<\/p>\n","protected":false},"author":41784,"featured_media":1135894,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":0,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556,13551],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1135867","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-graphics-and-multimedia","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1135867","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/41784"}],"version-history":[{"count":16,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1135867\/revisions"}],"predecessor-version":[{"id":1135947,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1135867\/revisions\/1135947"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1135894"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1135867"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1135867"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json
\/wp\/v2\/msr-locale?post=1135867"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1135867"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}