{"id":726529,"date":"2021-02-17T10:23:31","date_gmt":"2021-02-17T18:23:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=726529"},"modified":"2022-04-28T07:32:27","modified_gmt":"2022-04-28T14:32:27","slug":"designer-centered-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/designer-centered-reinforcement-learning\/","title":{"rendered":"Designer-centered reinforcement learning"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/02\/1400x788_Athens_no_logo-2.gif\" alt=\"An animation titled \u201cAdding style to RL agent navigation\u201d plays through three versions of agent behavior in a navigation task. Each shows the agent, represented by a robot icon, moving around two walls to reach a goal, represented by a trophy icon. The first version applies only a task reward, and the agent path to the goal is too close to the wall. The second applies an excess shaping reward, and the path is too far from the wall. The third balances reward components, and the path is more central. \"\/><\/figure>\n\n\n\n<p>In video games, nonplayer characters, bots, and other game agents help bring a digital world and its story to life. They can help make the mission of saving humanity feel urgent, transform every turn of a corner into a gamer\u2019s potential demise, and intensify the rush of driving behind the wheel of a super-fast race car. These agents are meticulously designed and preprogrammed to contribute to an immersive player experience.<\/p>\n\n\n<table style=\"float: right; width: 50%; margin: 15px; text-align: center; border: 1px solid #000000; border-collapse: collapse; border-spacing: inherit;\">\n<tbody>\n<tr style=\"height: 24px;\">\n<td style=\"background-color: #000000; padding: 5px 30px; border: inherit; height: 24px;\"><span style=\"color: #ffffff;\"><strong>Join Us<\/strong><\/span><\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: 5px 30px; border: inherit; height: 23px;\"> This work was undertaken during an internship at Microsoft Research Cambridge. If you\u2019re interested in exploring similar real-world challenges and developing actionable user-focused solutions, visit the lab\u2019s <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/lab\/microsoft-research-cambridge\/internships\/?\"> internship page<\/a> for details on internships in deep RL for games and other research areas. <\/td>\n<\/tr>\n<tr style=\"height: 23px;\">\n<td style=\"padding: 5px 30px; border: inherit; height: 23px;\">For insights into AI and gaming research, register for the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/event\/aiandgaming2020\/\">Microsoft AI and Gaming Research Summit 2021 (February 23\u201324)<\/a>, and for career opportunities in RL, check out the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/theme\/machine-intelligence\/#!opportunities\">open positions with the Machine Intelligence theme at Microsoft Research Cambridge<\/a> and other<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/theme\/reinforcement-learning-group\/#!opportunities\"> opportunities across Microsoft Research. <\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n\n<p>Now, what if these same agents could learn to behave in lifelike and interesting ways <em>without <\/em>a developer having to hardcode every possible natural behavior in each scenario? 
Imagine agents in an action game learning a variety of offensive strategies to challenge a protagonist, or agents in an adventure game learning how to support the player in unlocking information about an unfamiliar environment. Reinforcement learning (RL), in which agents learn how to act when they must sequentially take actions over time, provides a framework for achieving that. Through RL, agents can be trained to devise their *own* solutions to tasks, transforming the role of game designers from defining behavior to defining tasks and letting the agents learn. Such a shift has the potential to lead to surprising responses, possibly ones a game designer may not have even imagined, helping to create more engaging characters and worlds.

Reinforcement learning is already showing promising results.
For example, we've demonstrated [agents' ability to effectively collaborate with each other in the Ninja Theory game *Bleeding Edge*](https://www.youtube.com/watch?v=dcngdjfhGXI) as part of the [Project Paidia](https://www.microsoft.com/en-us/research/project/project-paidia/) research collaboration, which ultimately seeks to enable teamwork between agents and human players (for an RL overview, visit our [Project Paidia website and interactive experience](https://innovation.microsoft.com/en-us/exploring-project-paidia)). At the same time, many experts feel the use of RL in the commercial game industry is still far below its ultimate potential. The reasons why are numerous, including the need for a certain level of expertise to execute the technology. From our previous [research into the experiences of game agent creators](https://www.microsoft.com/en-us/research/publication/its-unwieldy-and-it-takes-a-lot-of-time-challenges-and-opportunities-for-creating-agents-in-commercial-games/), we've come to realize that for RL techniques to be used in the game industry, we need to design them with potential users and their existing workflows and requirements in mind. In recent work, we focus on three specific challenges:

- exercising authorial control when it comes to specifying the aesthetic style of game agents
- balancing multiple design constraints, specifically task completion and behaving in a desired style
- developing RL tools and infrastructure that are more meaningful from a designer perspective, allowing designers to make desired changes without formal engineering training

In this work, we establish the first steps toward a *designer-centered approach to RL*: making it easier for designers to specify an agent's style through preference learning, automatically and robustly combining reward signals to satisfy varied design constraints, and providing a contextually meaningful workflow.

We show our results in a navigation task, as navigation is one of the most fundamental agent capabilities. In our experiments, we start with an agent that's rewarded for getting closer to a goal as quickly as possible (in our case, a blue circle behind two "walls"). This results in the agent learning to take the shortest path to the blue circle, leading the agent to bump into and drag along the walls on its way. In this exploration, we assume the role of a designer aiming for movements more reflective of how a human player might approach the challenge, by taking a more central path.

[Video: an agent (red circle) moves around two rectangular walls to reach its goal (a smaller blue circle); a black dotted line shows the agent's path running closely along the walls.]
Video 1: In a navigation task like the above, it's common for agents to drag along the walls, especially if they're rewarded for getting closer to the goal (the blue circle). We want to tune this agent so that it doesn't hug the walls as much and behaves in a way that's more reflective of how a human player might achieve the task, via a more central path.

## Preference learning as a method to specify style rewards

RL algorithms learn through a reward function. Unfortunately, it's very difficult to computationally specify aesthetic style. If we're building a stealth game, we might want our agents to creep near building edges, but if we're making a game about cyborg warriors, we would much prefer them charging through a scene. It's unclear, though, how one might go about writing an RL *style reward* for being "stealthy" or "boisterous." Even if it were clear, the designers who decide and tweak a game's aesthetic aspects are often separate from the engineers who implement the underlying AI behavior, requiring that designers become proficient enough in RL to adjust the AI codebase to achieve a desired style. Such an expectation is unrealistic in larger teams and impractical with most designer workflows used today.

Video 2: In this human-controlled demonstration (exaggerated to illustrate how an RL agent would behave), the character grinding along the wall while moving to its target in *Bleeding Edge* looks jarring. It's not representative of how a human player would likely move, nor does this type of behavior make sense for the aesthetic style of the game. From the perspective of an RL agent, however, there wouldn't be any problem! The agent would take the quickest path to a given goal. (This video is for demonstration purposes only. It does not represent the agents used in *Bleeding Edge* or Project Paidia and is not representative of final gameplay or visuals.)

The traditional method to achieve our goal of reaching the blue circle as quickly as possible while taking a more central path would be to write an extension to the reward function, also known as reward shaping. We can punish the agent for being too close to the walls while still rewarding it for reaching the goal. However, even with extensive experimentation, it's difficult to attain a behavior that's exactly what we want, as what we really want isn't simply "staying away from the walls" but a more nuanced style of movement that's hard to capture mathematically.
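To make the reward-shaping idea concrete, here is a minimal sketch of what such a hand-written shaped reward might look like. Everything in it is an assumption for illustration: the 2-D state, the representation of walls as sampled points, and the constants are not the rewards used in our experiments.

```python
import numpy as np

def shaped_reward(agent_pos, goal_pos, wall_points,
                  wall_margin=1.0, penalty_weight=0.5):
    """Hand-written reward shaping: task reward plus a wall penalty.

    `wall_points` is a set of points sampled along the walls; all names
    and constants here are illustrative, not from the experiments.
    """
    # Task term: reward progress toward the goal (closer = higher).
    task_reward = -np.linalg.norm(agent_pos - goal_pos)

    # Style term: punish the agent when it comes within `wall_margin`
    # of the nearest wall point.
    wall_distance = min(np.linalg.norm(agent_pos - w) for w in wall_points)
    style_penalty = penalty_weight * max(0.0, wall_margin - wall_distance)

    return task_reward - style_penalty

# Example usage with made-up positions:
r = shaped_reward(np.array([0.0, 0.0]), np.array([5.0, 5.0]),
                  wall_points=[np.array([0.5, 2.0]), np.array([2.0, 0.5])])
```

Tuning a constant like `penalty_weight` by hand, rerunning training, and inspecting the result is exactly the kind of slow iteration the rest of this post tries to eliminate.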
Video 3: To achieve a more central path with a traditional reward shaping approach, the agent receives a penalty for being too close to the walls in addition to the original task reward. However, this can result in another type of undesirable behavior if the shaping reward is weighted too heavily, as in this video, where the agent's path runs far from the walls. Consequently, finding the right balance between task and style rewards isn't a simple task.

It would be much easier to recognize a desired style as opposed to describing it mathematically. Because of this, we implemented a [preference-based learning method](https://proceedings.neurips.cc/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf) to allow designers to specify their desired style through a simple user interface, no coding required!

Our proposed method works as follows (a sketch of the reward-network update in step 3 follows the list):

1. The policy, pretrained on the task reward, interacts with the environment and produces a set of trajectories.
2. The designer is shown segments of these trajectories, and they pick which segment is closer to their desired style.
3. A reward network tasked with capturing the style is updated according to the designer's preferences.
4. The reward network predicts how much a state exhibits the learned style. This predicted style reward plus the original task reward is used to optimize the original policy.
5. A new set of trajectories is collected with the updated policy to get new preferences.
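Step 3 is the machine-learning heart of the loop: fitting a reward network to pairwise choices. Below is a minimal, runnable sketch using the standard pairwise (Bradley-Terry) preference loss from the preference-based learning method linked above (Christiano et al., 2017). The network architecture, the 2-D state, and the segment shapes are assumptions for illustration, not our actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative style-reward network: maps a 2-D state (x, y) to a
# scalar style score. Sizes are assumptions for this sketch.
reward_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def update_from_preference(segment_a, segment_b, designer_prefers_a):
    """Fit the reward network to a single designer choice (step 3).

    `segment_a` / `segment_b` are (T, 2) tensors holding the states of
    the two trajectory segments the designer compared.
    """
    score_a = reward_net(segment_a).sum()  # total style score of segment A
    score_b = reward_net(segment_b).sum()  # total style score of segment B
    # The probability that the designer prefers A is modeled as
    # sigmoid(score_a - score_b); train with binary cross-entropy.
    target = torch.tensor(1.0 if designer_prefers_a else 0.0)
    loss = F.binary_cross_entropy_with_logits(score_a - score_b, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example: the designer preferred segment A over segment B.
update_from_preference(torch.randn(20, 2), torch.randn(20, 2), True)
```

In steps 4 and 5, the trained network's predicted style reward is added to the task reward, the policy is fine-tuned against the sum, and fresh trajectory segments are collected for the next round of comparisons.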
Because RL training takes time, we fine-tune an already competent agent, cutting the amount of iteration down drastically compared to training an agent from scratch; all we need to do is add style to it. Further, this makes it easier to apply different styles to the same base agent, allowing AI engineers to train a base model and then designers to fine-tune style preferences.

Video 4: Describing an aesthetic style mathematically is difficult. Our prototype user interface shows two agent trajectories side by side and lets designers select which behavior is closer to their desired style (left, even, or right) from a series of trajectories generated by the agent interacting with its environment.

In a feedback efficiency study, we show we can reliably train a successful agent in our given task. For our task, training is only considered successful if the agent completes the task with a high enough style reward (measured by the mean distance from the nearest wall) and an acceptable task reward (an approximate equivalent to the time it takes to reach the goal); a sketch of this success test follows. We reached our desired behavior in 50 comparisons. We expect the number of required preferences to increase as the complexity of the environment and of the desired style goes up.
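Spelled out as code, the success criterion might look like the check below. The two threshold values are invented placeholders for illustration; only the two-part structure of the test comes from the description above.

```python
def trial_succeeded(mean_wall_distance, task_reward,
                    style_threshold=0.8, task_threshold=-20.0):
    """Success test from the feedback-efficiency study, as described above.

    A trial counts as successful only if the style reward (mean distance
    from the nearest wall) is high enough AND the task reward (roughly
    the negative of the time taken to reach the goal) stays acceptable.
    Both threshold values here are hypothetical placeholders.
    """
    return mean_wall_distance >= style_threshold and task_reward >= task_threshold
```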
Video 5: In our navigation task, the agent on the left was fine-tuned to maximize distance to the walls (its path stays uniformly far from the wall), while the agent on the right was fine-tuned using preference learning (its path starts far from the wall and becomes progressively more centered). Through preference learning, we were able to achieve more nuanced behavior that better captured our intended style.

## Potential-based shaping to combine style and task rewards

`total_reward = A * style_reward + B * task_reward`

The above represents the common workflow when combining multiple sources of reward. The RL agent is trying to optimize the total reward. It's up to the designer to decide on the correct ratio of style reward to task reward by specifying the weight of each, A and B here. This workflow requires a lot of iteration and offers little control.

Specifying the reward that captures the desired style is only the first step; the style reward must then be integrated into the agent's preexisting task reward. This is by no means a trivial undertaking. If the ratio of the style reward to the task reward is too high, the style reward overwhelms the task reward and navigation performance suffers. If the ratio is too low, then there's no observable behavior change. The default approach to solving this problem is to iterate: tweak the ratio slightly and run another experiment. Since each RL training run can take hours, trying to tune the style and task reward ratios manually is laborious and mind-numbingly boring.
Video 6: In the above video, the agent demonstrates "reward hacking" by moving to the point farthest away from the walls instead of reaching the goal. This happens when the combination of task and style rewards is misconfigured in a way that simply maximizing the style reward to avoid walls, while ignoring the task reward to reach the goal, yields the highest total reward.

When we first tried to fine-tune our agent with our new style, it failed at the initial task because it gave too much weight to exhibiting the style! We used [potential-based reward shaping (PBRS)](http://people.eecs.berkeley.edu/~russell/papers/icml99-shaping.pdf) to solve this problem. PBRS ensures that when a shaping reward is introduced (that is, a reward encouraging behavior other than the initial task, like our style reward), the optimal policy for the initial task remains the same.

PBRS is a simple yet powerful technique where, at each step, we subtract the previous step's style reward from the total reward (task reward plus style reward) of the current step. This means the agent is rewarded for *being* in a certain state rather than for *moving into* it. The intuition behind PBRS can be expressed with the following example: imagine an agent is rewarded for crossing the finish line in a race. After crossing the finish line initially, the agent might be encouraged to step back over the finish line and forward again multiple times, effectively gaining infinite rewards.
However, with PBRS, we only reward the agent for *being* at the finish line, not crossing it: whenever the agent steps back, we take away the reward we had given it, preventing it from accruing more reward by simply crossing back and forth.
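In symbols, PBRS treats the style reward as a potential Φ over states and pays the agent the difference γΦ(s') − Φ(s) between successive states (Ng, Harada, and Russell, 1999, linked above). Here is a minimal sketch; the function and argument names are our own, not from any particular codebase.

```python
def pbrs_total_reward(task_reward, style_prev, style_curr, gamma=0.99):
    """Combine the task reward with a potential-based style shaping term.

    The style reward acts as a potential over states: the agent collects
    the *difference* in potential between successive states, so whatever
    style reward it gains by entering a state it gives back on leaving.
    This rules out the back-and-forth finish-line exploit and leaves the
    optimal policy for the original task unchanged.
    """
    shaping = gamma * style_curr - style_prev
    return task_reward + shaping
```

With `gamma` close to 1, this is exactly the subtraction described above: the current step's task-plus-style reward minus the previous step's style reward.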
While the distinction is subtle, this prevents the agent from "reward hacking" to maximize rewards without accomplishing the initial task. When we started using PBRS to integrate our style reward into the task reward, the training was more successful in both keeping the original task rewards high and displaying the desired style. This alternative to the time-consuming task of manually fine-tuning the ratio of style reward to task reward means designers can explore many more style variations instead of spending their resources getting a single style to work properly.

Figure 1: Potential-based reward shaping (PBRS) makes integrating the style reward into the task reward easier, preventing the style reward from degrading task performance even at high style-reward-to-task-reward ratios. The graph shows the percentage of successful trials at each reward ratio. In our navigation task, a successful trial was defined as the agent completing the task with a high enough style reward, measured by the mean distance from the nearest wall, and an acceptable task reward, an approximate equivalent to the time it takes to reach the goal. Without PBRS, the agent is only successful at a 0.1 style-reward-to-task-reward ratio. With PBRS, the agent is successful at ratios between 0.5 and 100.

## Automatic reward ratio adjustment to increase designer control

Even though using PBRS made finding an acceptable ratio of style reward to task reward much easier, we're still asking designers to fine-tune a behavior by changing an arbitrary numeric ratio. There's no designer-interpretable meaning to "combining one part task reward with four parts style reward."

Rewards, especially when designed intentionally, can be meaningful from a design perspective. For example, if the agent is penalized by 1 point every second, we can see how fast the agent reaches the goal by simply looking at the final reward: a -15 task reward means the agent took 15 seconds to reach the goal.
In scenarios where it's possible to provide similar types of meaningful rewards, it would be much more efficient for a designer to specify a minimum acceptable performance, a *lower bound reward threshold*, than to tweak arbitrary numeric ratios.

Toward this end, we implemented an automatic reward ratio schedule that tries to maximize the style reward while respecting the designer-specified threshold. The automated scheduler increases the ratio of style reward to task reward while the task reward is higher than the designer-specified threshold and reduces the ratio when the task performance starts to degrade. To be more specific, we linearly scale the style reward ratio between a maximum number and 0 as the task reward moves between the starting performance and the threshold performance. In the above example, if a designer wanted their agent to reach the goal in at most 15 seconds, the automated scheduler would increase the ratio of style reward to task reward until the agent started taking longer than the specified 15 seconds. At that point, the scheduler would reduce the style reward weight until a performance time of 15 seconds was achieved. This automated schedule would continue over the course of training.
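As a sketch, that linear schedule could be written as below. The specific numbers (a starting task reward of 150, a threshold of 120, a maximum ratio of 10) are placeholders; only the linear-interpolation rule comes from the description above.

```python
def style_ratio(task_reward, start_reward=150.0, threshold=120.0, max_ratio=10.0):
    """Linearly schedule the style-reward weight from the current task reward.

    The ratio is `max_ratio` while task performance sits at its starting
    level and falls linearly to 0 as the task reward approaches the
    designer-specified lower bound, so style pressure automatically backs
    off whenever task performance starts to degrade. All constants here
    are illustrative placeholders.
    """
    # Fraction of the designer's slack (start -> threshold) still unused.
    slack = (task_reward - threshold) / (start_reward - threshold)
    return max_ratio * min(max(slack, 0.0), 1.0)
```

Recomputing this weight throughout training is what lets the agent trade exactly as much task performance for style as the designer has said they can tolerate, and no more.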
Figures 2a and 2b: The figures demonstrate how designer-specified minimum task reward thresholds affect training. Figure 2a (top) plots the task rewards over total timesteps (the length of an experiment) under five different minimum reward thresholds, from 150 to 110. Training starts from the same initial task reward of 150 but goes down to the minimum task threshold in each of the experiments; the automatic reward ratio scheduler is effective in keeping the task reward above the specified threshold. Figure 2b (bottom) plots the mean distance from the walls (our proxy for the style reward) over total timesteps under the same five reward thresholds. The mean distance from the wall increases as the threshold is reduced from 150 to 110. Most notably, the agent fails to move away from the walls with a reward threshold of 150, since it has no slack to sacrifice task reward.

Figure 2a shows the task reward with different designer-specified minimum task reward thresholds. The automated reward ratio adjustment is effective in keeping the task reward above the specified performance threshold.

Figure 2b shows the mean distance to the closest wall, a simple approximation of our target style, under different reward thresholds. When the minimum task reward threshold is very high (150), the change in behavior is rather small, as the agent is prioritizing the task reward over exhibiting the style. However, as the designer relaxes the constraints, more style behavior emerges.

This method of specifying a desired reward is much more meaningful than iteratively changing the numeric ratio to hit the desired target. We believe this workflow simplifies the job of the designer immensely.

## Open questions and continued collaboration

While these results are encouraging, there are several open research questions. First, we need to validate our findings with a user study. While the high-level workflow is established, there's more to learn regarding the specifics. In the context of our proposed solutions, do designers continuously monitor the training, or do they give feedback in batches? What information is shown to the designers for them to make accurate choices?

Another open question is exploring different methods of specifying a style. While preferences are useful, there are many other methods we can employ. Designers can demonstrate the desired style by taking control of the agent, or they can annotate individual states to guide the fine-tuning.
It's unclear which of these methods (or what combination) offers the most control to the designers.
The journey toward RL that can be easily and organically incorporated into commercial game design is a long one. We feel taking a designer-centered approach, as demonstrated by the prototypes above, offers a promising means to achieving that goal, and we look forward to continuing to work with professionals in the game industry to deliver practical and empowering solutions.

### Additional resources and opportunities

- [Research Collection - Shall we play a game?](https://www.microsoft.com/en-us/research/blog/research-collection-shall-we-play-a-game/)
- ["Reinforcement learning in Minecraft: Challenges and opportunities in multiplayer games" webinar](https://note.microsoft.com/MSR-Webinar-Project-Malmo-Registration-Live.html?wt.mc_id=blog_MSR-WBNR_malmo_link)

---

*This work was spearheaded by UC Santa Cruz PhD student [Batu Aytemiz](http://batuaytemiz.com/) during a [Microsoft Research Cambridge internship](https://www.microsoft.com/en-us/research/lab/microsoft-research-cambridge/internships/).*
*Team members [Mikhail Jacob](https://www.microsoft.com/en-us/research/people/t-mijaco/), [Sam Devlin](https://www.microsoft.com/en-us/research/people/sadevlin/), and [Katja Hofmann](https://www.microsoft.com/en-us/research/people/kahofman/) served as advisors on the work.*