{"id":1141975,"date":"2025-06-12T20:16:48","date_gmt":"2025-06-13T03:16:48","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1141975"},"modified":"2025-06-13T08:56:20","modified_gmt":"2025-06-13T15:56:20","slug":"maag-a-new-framework-for-consistent-ai-generated-games","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/maag-a-new-framework-for-consistent-ai-generated-games\/","title":{"rendered":"MaaG: A new framework for consistent AI-generated games"},"content":{"rendered":"\n<p>World models are a key concept in AI, used to simulate how agents behave in virtual environments and enable immersive, interactive experiences. They\u2019re not only transforming game and media generation, they\u2019re also opening new frontiers for using AI in complex, dynamic settings.<\/p>\n\n\n\n<p>One emerging trend is generative games, where game environments are created frame by frame using neural networks. 
Microsoft\u2019s MUSE system, for example, can generate scenes from the game <em>Bleeding Edge<\/em> using deep learning models.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2512\" height=\"612\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1.png\" alt=\"Microsoft\u2019s MUSE model uses neural networks to generate frames from the game Bleeding Edge\" class=\"wp-image-1136176\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1.png 2512w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-300x73.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-1024x249.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-768x187.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-1536x374.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-2048x499.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-1-240x58.png 240w\" sizes=\"auto, (max-width: 2512px) 100vw, 2512px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Microsoft\u2019s MUSE generates frames from Bleeding Edge using neural networks.<\/figcaption><\/figure>\n\n\n\n<p>Yet beneath the visual polish, generative games often contain noticeable inconsistencies. Background elements may disappear or shift abruptly after minor player actions, like a form of short-term memory loss. 
These disruptions highlight one of the field\u2019s biggest challenges: maintaining consistency.<\/p>\n\n\n\n<p>In response, researchers from Microsoft Research Asia, the Hong Kong University of Science and Technology, and the University of Chinese Academy of Sciences have introduced a new framework called <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/model-as-a-game-on-numerical-and-spatial-consistency-for-generative-games\/\">Model as a Game (MaaG)<\/a>. This approach addresses two core inconsistencies in generative games: numerical and spatial.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"defining-the-problem-numerical-and-spatial-consistency\">Defining the problem: Numerical and spatial consistency<\/h2>\n\n\n\n<p>Numerical consistency refers to the logical accuracy of score updates based on game events. For example, if an action yields a +1 score, the result should reflect that exact change. Spatial consistency, by contrast, means the environment remains visually coherent when players revisit previously explored areas.<\/p>\n\n\n\n<p>To examine these issues in a controlled setting, the team created a minimalist 2D game called <em>Traveler<\/em>. In it, a small black block moves left and right. As it passes through empty spaces, colorful buildings are randomly generated, and the score increases by one.<\/p>\n\n\n\n<p>Despite its simplicity, <em>Traveler<\/em> clearly reveals the limitations of current generative models. Notably, the game was generated using large language models (LLMs) and built with Pygame, a set of Python modules for writing video games. 
It also supports frame-by-frame data export with synchronized numerical states, offering a strong foundation for research.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2632\" height=\"464\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2.png\" alt=\"chart, bar chart\" class=\"wp-image-1136177\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2.png 2632w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-300x53.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-1024x181.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-768x135.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-1536x271.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-2048x361.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-2-240x42.png 240w\" sizes=\"auto, (max-width: 2632px) 100vw, 2632px\" \/><figcaption class=\"wp-element-caption\">Figure 2. In <em>Traveler<\/em>, a moving block generates buildings and scores, exposing consistency challenges.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"inside-the-maag-framework-numerical-and-spatial-modules\">Inside the MaaG framework: Numerical and spatial modules<\/h2>\n\n\n\n<p>The MaaG framework uses a numerical module and a spatial module to enhance the Diffusion Transformer (DiT) architecture. 
Together, they ensure that generative models do more than produce images: they also recognize and follow game logic.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2590\" height=\"804\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3.png\" alt=\"diagram\" class=\"wp-image-1136178\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3.png 2590w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-300x93.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-1024x318.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-768x238.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-1536x477.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-2048x636.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-3-240x75.png 240w\" sizes=\"auto, (max-width: 2590px) 100vw, 2590px\" \/><figcaption class=\"wp-element-caption\">Figure 3. MaaG incorporates numerical (red, left) and spatial (blue, right) modules to improve consistency.<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Numerical module:<\/strong> At the core of this module is LogicNet, a compact, trainable network that determines whether specific in-game events should occur. For example, it decides if a +1 score event should be triggered in <em>Traveler<\/em>.<br>LogicNet doesn\u2019t perform arithmetic itself. Instead, the updated score is calculated externally, converted into special numerical tokens, and reinjected into the DiT model using the TextDiffuser-2 approach. 
This design offloads computation from the generative model, significantly improving numerical consistency.<\/li>\n\n\n\n<li><strong>Spatial module:<\/strong> This component introduces External Map, a persistent memory mechanism that stores all previously explored scenes, such as building colors and locations. Before rendering a new frame, the model consults this map to retrieve surrounding context, including areas outside the current field of view, supporting visual continuity.<br>After generating a frame, it uses a sliding window matching algorithm to align the local environment with the external map and updates it in real time. It\u2019s as if the model has both GPS and a world atlas, keeping the environment consistent as the player moves.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"showcasing-the-results-traveler-pong-and-pac-man\">Showcasing the results: <em>Traveler<\/em>, <em>Pong<\/em>, and <em>Pac-Man<\/em><\/h2>\n\n\n\n<p>Unlike traditional games that rely on graphics engines, generative games synthesize each frame using neural networks. 
The following video demonstrates the MaaG framework in action across <em>Traveler<\/em>, <em>Pong<\/em>, and <em>Pac-Man<\/em>\u2014showing how the framework keeps the scenes visually consistent as gameplay unfolds.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1080 \/ 1080;\" width=\"1080\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-video-1-1.mp4\"><\/video><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-video\"><video height=\"1080\" style=\"aspect-ratio: 1080 \/ 1080;\" width=\"1080\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-video-2-1.mp4\"><\/video><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-video\"><video height=\"128\" style=\"aspect-ratio: 128 \/ 128;\" width=\"128\" controls src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-video-3.mp4\"><\/video><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p><em>Generative games differ from traditional games that rely on game engines for rendering; instead, each frame in a generative game is directly synthesized by a neural network. 
The videos above present a sequence of examples from three such games \u2014 Traveler, Pong, and Pac-Man \u2014 shown from left to right.<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1692\" height=\"1420\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4.png\" alt=\"MaaG significantly improves consistency across a range of games, resolving the score fluctuations and abrupt scene changes seen in the baseline, while remaining flexible and general.\" class=\"wp-image-1136180\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4.png 1692w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4-300x252.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4-1024x859.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4-768x645.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4-1536x1289.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-4-214x180.png 214w\" sizes=\"auto, (max-width: 1692px) 100vw, 1692px\" \/><figcaption class=\"wp-element-caption\">Figure 4. MaaG resolves issues like score fluctuations and scene glitches across multiple games, enhancing both flexibility and coherence.<\/figcaption><\/figure>\n\n\n\n<p>Table 1 presents qualitative results demonstrating that MaaG effectively mitigates common issues in baseline models, such as erratic score changes and sudden visual transitions. Thanks to its modular architecture, MaaG is highly adaptable. 
Developers can adjust LogicNet\u2019s rules and modify the dimensions of the spatial map to support a wide range of 1D and 2D games.<\/p>\n\n\n\n<p>The system also allows creators to predefine or dynamically update the external map during gameplay, offering more control over the gaming environment than previous systems like GameGAN.<\/p>\n\n\n\n<p>Despite introducing new logic and spatial modules, MaaG maintains a low inference latency of approximately 0.015 seconds, preserving gameplay fluidity.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1760\" height=\"436\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5.png\" alt=\"table\" class=\"wp-image-1136179\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5.png 1760w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5-300x74.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5-1024x254.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5-768x190.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5-1536x381.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/04\/maag-5-240x59.png 240w\" sizes=\"auto, (max-width: 1760px) 100vw, 1760px\" \/><figcaption class=\"wp-element-caption\">Table 1: MaaG improves key metrics\u2014numerical consistency (NumCon), spatial consistency (SpaCon), action recognition accuracy (ActAcc), and FID\/FVD quality scores\u2014across all tested games.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"pushing-the-boundaries-of-ai-driven-game-generation\">Pushing the boundaries of AI-driven game generation<\/h2>\n\n\n\n<p>MaaG offers major improvements, though it still has limitations in repetitive environments, where spatial alignment can 
break down. Still, this framework represents a step forward in addressing the consistency challenges that have long plagued generative games.<\/p>\n\n\n\n<p>The work shows that by decoupling numeric logic and spatial memory from the core pixel-generation process and incorporating these elements as explicit conditions, AI can generate game worlds that are both visually compelling and mechanically coherent.<\/p>\n\n\n\n<p>Looking ahead, the team plans to expand MaaG into more complex 2D and 3D environments and explore more robust strategies for ensuring spatial consistency. With continued advances in approaches like MaaG, AI-generated, highly playable, and logically sound game worlds are rapidly becoming a reality.<\/p>\n\n\n\n<p><strong>References<\/strong><\/p>\n\n\n\n<p>[1] &nbsp;Kanervisto, Anssi, et al. 2025. \u201cWorld and Human Action Models towards gameplay ideation.\u201d <em>Nature<\/em>, 656\u2013663.<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ora.ox.ac.uk\/objects\/uuid:519b4d38-1ee2-4c1b-95a0-ed116a149bf3\">https:\/\/ora.ox.ac.uk\/objects\/uuid:519b4d38-1ee2-4c1b-95a0-ed116a149bf3<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>World models are a key concept in AI, used to simulate how agents behave in virtual environments and enable immersive, interactive experiences. They\u2019re not only transforming game and media generation, they\u2019re also opening new frontiers for using AI in complex, dynamic settings. 
One emerging trend is generative games, where game environments are created frame by [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1136185,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1141975","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1141975","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":5,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1141975\/revisions"}],"predecessor-version":[{"id":1141992,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1141975\/revisions\/1141992"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1136185"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1141975"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1141975"},{"taxonomy":"msr-locale","embeddable":true,"href":
"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1141975"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1141975"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}