{"id":750835,"date":"2021-06-07T10:32:29","date_gmt":"2021-06-07T17:32:29","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=750835"},"modified":"2021-06-07T12:37:22","modified_gmt":"2021-06-07T19:37:22","slug":"building-stronger-semantic-understanding-into-text-game-reinforcement-learning-agents","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/building-stronger-semantic-understanding-into-text-game-reinforcement-learning-agents\/","title":{"rendered":"Building stronger semantic understanding into text game reinforcement learning agents"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_animation_new.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>AI agents capable of understanding natural language, communicating, and accomplishing tasks hold the promise of revolutionizing the way we interact with computers in our everyday lives. Text-based games, such as the <em>Zork <\/em>series, act as testbeds for development of novel learning agents capable of understanding and interacting exclusively through language. Beyond requiring the use of imagination and myriad concepts of everyday life to solve, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/by-making-text-based-games-more-accessible-to-rl-agents-jericho-framework-opens-up-exciting-natural-language-challenges\/\">these fictional world\u2013based narratives are also a safe sandbox for AI testing<\/a> that avoids the expense of collecting user data and the risk of users having a bad experience interacting with agents that are still learning. <\/p>\n\n\n\n<p>In this blog post, we share two papers that explore reinforcement learning methods to improve semantic understanding in text agents, a key process by which AI understands and reacts to text-based input. 
We\u2019re also releasing source code for these agents to encourage the community to continue to improve semantic understanding in text-based games.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/reading-and-acting-while-blindfolded-the-need-for-semantics-in-text-game-agents\/\" data-bi-cN=\"Reading and Acting While Blindfolded: The Need for Semantics in Text Game Agents\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Reading and Acting While Blindfolded: The Need for Semantics in Text Game Agents<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">CODE<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/princeton-nlp\/blindfold-textgame\" data-bi-cN=\"Blindfold Text Game\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Blindfold Text Game<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" 
aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/keep-calm-and-explore-language-models-for-action-generation-in-text-based-games\/\" data-bi-cN=\"Keep CALM and Explore: Language Models for Action Generation in Text-based Games\" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Keep CALM and Explore: Language Models for Action Generation in Text-based Games<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"citation\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 \">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Code<\/span>\n\t\t\t<a href=\"https:\/\/github.com\/princeton-nlp\/calm-textgame\" data-bi-cN=\"CALM text game \" data-external-link=\"false\" data-bi-aN=\"citation\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>CALM text game <\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p>Ever since text-based games were proposed as a benchmark for language understanding agents, a key challenge 
in these games has been the enormous action space. Games like<em> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Zork_I\">Zork 1<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> <\/em>can have up to 98 million possible actions in each state\u2014the majority of which are nonsensical, ungrammatical, or inapplicable. In order to make text-based games more approachable to reinforcement learning (RL) agents, the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/jericho\">Jericho framework <span class=\"sr-only\"> (opens in new tab)<\/span><\/a>provides several handicaps such as valid-action identification, which uses the game engine to identify a minimal set of 10\u2013100 textual actions applicable in the current game state. Other handicaps involve the ability to extract the recognized vocabulary for a game and to save and restore previously visited states. Certain RL agents depend on the valid-action handicap, like <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/deep-reinforcement-learning-with-an-action-space-defined-by-natural-language\/\">deep reinforcement relevance network (DRRN)<\/a>, which learns to choose the action within the set of valid actions at each timestep that maximizes expected game scores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-1024x620.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"620\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-1024x620.png\" alt=\"Example from text-based game. 
From top to bottom: Box with dotted line reads \u201cObservation: This bedroom is extremely spare, with dirty laundry scattered haphazardly all over the floor. Cleaner clothing can be found in the dresser. A bathroom lies to the south, while a door to the east leads to the living room. On the end table are a telephone, a wallet and some keys. The phone rings.\u201d\nAction: Answer phone\nBox with dotted line reads \u201cObservation: You pick up the phone. \"Hadley!\" a shrill voice cries. \"Hadley, haven't you even left yet?? You knew that our presentation was at nine o'clock sharp! First the thing with the printers, now this - there won't even be enough left of you for Bowman to fire once he's done with you. Now get the hell down here!!\"\nAction: Examine me\nBox with dotted line reads \u201cObservation: You're covered with mud and dried sweat. It was an exhausting night no wonder you overslept! Even making it to the bed before conking out was a heroic accomplishment.\u201d\nValid Actions: get up, take phone, take off watch, take off clothing, take off all, take wallet, take keys, close door, put watch down, put clothing down, look under bed, open wallet, take all from end, put watch on end, put watch on phone, put clothing on end, put clothing on phone\n\" class=\"wp-image-751192\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-1024x620.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-300x182.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-768x465.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ-16x10.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig1_TBGJ.png 1506w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 1: An example text-based game consists of language-based observations describing the 
simulated world and actions, taken by the agent, that alter the state of the world and advance the story. Valid Actions, identified through introspection into the game emulator, are a list of actions applicable at the current step that are guaranteed to change the state of the game.<\/figcaption><\/figure>\n\n\n\n<p>Despite its usefulness from an RL tractability standpoint, the valid-action handicap, through the use of privileged game insights, can expose hidden information about the environment to the agent. For example, in Figure 1, the valid action <em>take off watch<\/em> inadvertently leaks the existence of a watch that was not revealed in the observation text.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/reading-and-acting-while-blindfolded-the-need-for-semantics-in-text-game-agents\/\" data-bi-cN=\"Reading and Acting While Blindfolded: The Need for Semantics in Text Game Agents\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Reading and Acting While Blindfolded: The Need for Semantics in Text Game Agents<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>Our recent paper,&nbsp;\u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/reading-and-acting-while-blindfolded-the-need-for-semantics-in-text-game-agents\/\">Reading and Acting while Blindfolded: The need for semantics in text game agents<\/a>,\u201d shows that when using 
Jericho-provided handicaps, such as the identification of valid-actions, it\u2019s possible for text agents to achieve competitive scores while using textual representations entirely devoid of semantics. The paper has been accepted to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/2021.naacl.org\/\">Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 id=\"hash-based-representation-reveals-that-handicap-agents-bypass-semantic-understanding\">Hash-based representation reveals that handicap agents bypass semantic understanding<\/h2>\n\n\n\n<p>To probe the extent of semantic understanding of the DRRN agent, we replace the observation text and valid-action texts with consistent, yet semantically meaningless text strings derived by hashing the current state and actions. Figure 2 compares the standard textual representation of observations and actions with two ablations. MIN-OB (b) shortens observation text to only include the name of the current location without description of the objects or characters therein. HASH (c) entirely replaces the text of the observation and actions with the output of a hash-based representation.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-1024x301.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"301\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-1024x301.png\" alt=\"Three boxes: A, B, and C show a standard textual representation and two ablations respectively.\n(A) ZORK 1\nObservation 21: You are in the living room. 
There is a doorway to the\neast, a wooden door with strange gothic lettering to the west, which\nappears to be nailed shut, a trophy case, and a large oriental rug in the\ncenter of the room. You are carrying: A brass lantern. \nAction 21: move rug\nObservation 22: With a great effort, the rug is moved to one side of the\nroom, revealing the dusty cover of a closed trap door. Living room\nYou are carrying: ellipsis\nAction 22: open trap\n(B) MIN-OB\nObservation 21: Living Room\nAction 21: move rug\nObservation 22: Living Room\nAction 22: open trap\n\n(C) HASH\nObservation 21: 0x6FC\nAction 21: 0x3A04\nObservation 22: 0x103B\nAction 22: 0x16BB\n\" class=\"wp-image-751201\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-1024x301.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-300x88.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-768x226.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-1536x452.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ-16x5.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig2_TBGJ.png 1744w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 2: A standard textual representation (left) and two ablations: MIN-OB (top right) and HASH (bottom right).<\/figcaption><\/figure>\n\n\n\n<p>Specifically, we use Python\u2019s hash function to map each text string to a unique hash value, which ensures that even if one word is changed in the text, the representation will be completely different. 
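The hashing scheme can be sketched in a few lines of Python. This is a simplified illustration of the HASH ablation; the 32-bit mask and the token format are our own choices, not necessarily those used in the paper:

```python
def hash_repr(text: str) -> str:
    """Map a text string to an opaque, semantics-free token.

    Within one Python process, identical strings always yield the same
    token, while changing even a single word yields an unrelated one.
    (Python salts str hashes per process, so tokens differ across runs
    unless PYTHONHASHSEED is fixed.)
    """
    return hex(hash(text) & 0xFFFFFFFF)

obs = "You are in the living room. There is a doorway to the east..."
assert hash_repr(obs) == hash_repr(obs)         # consistent for the same state
assert hash_repr(obs) != hash_repr("move rug")  # unrelated for different text
```

Because the mapping is consistent within a training run, the RL agent can still memorize state-action values; it just can no longer exploit any linguistic similarity between states.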
From the perspective of playing this game, representations (b) and (c) would pose significant challenges to a human due to the lack of semantic information.<\/p>\n\n\n\n<p>Subsequently, a DRRN text agent was trained from scratch using RL to estimate the quality of each candidate action. Surprisingly, the agent trained on the hashed representation achieved scores comparable to those of the control agents using the original text-based observations and actions (see Figure 3), and it even outperformed the baseline (DRRN-base) in 3 out of the 12 games. This finding suggests that current text agents are likely circumventing the development of semantic understanding, particularly when trained on single games using the valid-action handicap.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-1024x420.jpg\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"420\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-1024x420.jpg\" alt=\"Average normalized score for: DRRN (HASH) 25.0%, DRRN (MIN-OB) 12.0%, DRRN (base) 21.0%. 
Table showing game name followed by raw scores for DRRN, MIN-OB, HASH, and INV-DY: \nBalances: DRRN 10\/10, MIN-OB 10\/10, HASH 10\/10, INV-DY 10\/10, Max 51\nDeephome: DRRN 57\/66, MIN-OB 8.5\/27, HASH 58\/67, INV-DY 57.6\/67, Max 300\nDetective: DRRN 290\/337, MIN-OB 86.3\/350, HASH 290\/317, INV-DY 290\/323, Max 360\nDragon: DRRN 5.0\/6, MIN-OB 5.4\/3, HASH 5.0\/7, INV-DY -2.7\/8, Max 25\nEnchanter: DRRN 20\/20, MIN-OB 20\/40, HASH 20\/30, INV-DY 20\/30, Max 400\nInhumane: DRRN 21.1\/45, MIN-OB 12.4\/40, HASH 21.9\/45, INV-DY 19.6\/45, Max 90\nLibrary: DRRN 15.7\/21, MIN-OB 12.8\/21, HASH 17\/21, INV-DY 16.2\/21, Max 30\nLudicorp: DRRN 12.7\/23, MIN-OB 11.6\/21, HASH 14.8\/23, INV-DY 13.5\/23, Max 150\nOmniquest: DRRN 4.9\/5, MIN-OB 4.9\/5, HASH 4.9\/5, INV-DY 5.3\/10, Max 50\nPentari: DRRN 26.5\/45, MIN-OB 21.7\/45, HASH 51.9\/60, INV-DY 37.2\/50, Max 70\nzork1: DRRN 39.4\/53, MIN-OB 29\/46, HASH 35.5\/50, INV-DY 43.1\/87, Max 350\nzork3: DRRN 0.4\/4.5, MIN-OB 0.0\/4, HASH 0.4\/4, INV-DY 0.4\/4, Max 7\nAverage Norm: DRRN .21\/.38, MIN-OB .12\/.35, HASH .25\/.39, INV-DY .23\/.40\n\" class=\"wp-image-751204\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-1024x420.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-300x123.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-768x315.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1-16x7.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure3_TBGJ1.jpg 1520w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 3: The average normalized scores for DRRN (HASH), DRRN (MIN-OB), and DRRN-base (left) show that DRRN using the hash representation outperforms DRRN-base trained using textual inputs. 
Raw scores (right) show average and maximum performance on each game.<\/figcaption><\/figure>\n\n\n\n<p>In light of this finding, we highlight two recent methods that we believe encourage agents to build stronger semantic understanding throughout the learning process:<\/p>\n\n\n\n<h2 id=\"regularizing-semantics-via-inverse-dynamics-decoding\">Regularizing semantics via inverse dynamics decoding<\/h2>\n\n\n\n<p>Standard RL centers around the idea of learning a policy mapping from observations to actions that maximizes cumulative long-term discounted rewards. We contend that such an objective only encourages robust semantic understanding to the extent that\u2019s required to pick out the most rewarding actions from a list of candidates.<\/p>\n\n\n\n<p>We propose the use of an auxiliary training task that focuses on developing stronger semantic understanding\u2014the inverse dynamics task challenges the agent to predict the action that spans two adjacent state observations. Consider the following observations: <em>You stand at the foot of a tall tree<\/em> and <em>You sit atop a large branch, feet dangling in the air.<\/em> It\u2019s reasonable to surmise that the intervening action may have been <em>climb tree<\/em>. Specifically, the agent is presented with a finite set of valid actions that could have been selected from the previous state and must choose which of them was most likely to have been selected given the text of the subsequent state. 
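This auxiliary objective amounts to a cross-entropy loss over the candidate actions. Below is a minimal sketch: `score_fn` stands in for the paper's learned text-encoder scorer, and the hand-set scores are purely illustrative:

```python
import math

def inverse_dynamics_loss(score_fn, obs_before, obs_after, candidates, taken):
    """Cross-entropy for the inverse-dynamics task: given two adjacent
    observations, predict which candidate action caused the transition."""
    logits = [score_fn(obs_before, obs_after, a) for a in candidates]
    z = max(logits)  # subtract the max for numerical stability
    log_norm = z + math.log(sum(math.exp(l - z) for l in logits))
    return log_norm - logits[candidates.index(taken)]

before = "You stand at the foot of a tall tree."
after = "You sit atop a large branch, feet dangling in the air."
candidates = ["climb tree", "go north", "open window"]
toy_scores = {"climb tree": 2.0, "go north": 0.1, "open window": -1.0}
score_fn = lambda b, a, act: toy_scores[act]

# The loss is lower when the model ranks the true action highest.
loss_true = inverse_dynamics_loss(score_fn, before, after, candidates, "climb tree")
loss_wrong = inverse_dynamics_loss(score_fn, before, after, candidates, "go north")
assert loss_true < loss_wrong
```

Minimizing this loss forces the observation and action representations to carry enough semantics to explain what changed between the two states.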
This <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1705.05363\">auxiliary loss has been investigated in the context of video games like <em>Super Mario Bros.<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p>By training agents with this inverse dynamics auxiliary task, we show that they have stronger semantic understanding, as&nbsp;evidenced by their ability to generalize between semantically similar states in Figure 4.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-1024x238.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"238\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-1024x238.png\" alt=\"DRRN (base) on left shows seen and unseen text observations and semantically similar observations originating from the living room. There are three distinct groupings on a scatter plot. The bottom right corner contains the majority of observations in a fairly tight grouping. Most are unseen, with about 13 marked unseen in living room. One observation is seen. A small grouping in the middle, slightly upper left grouping shows all seen observations with about 4 marked in living room. A medium-sized, tightly concentrated grouping in the upper left corner shows mostly unseen observations, with a few seen observations and a concentrated group of unseen observations in living room. On the right, in DRRN (INV-DY), a different pattern occurs. There is a more loosely concentrated large grouping of unseen observations in the lower right corner, with more seen observations (about 10). The middle grouping in closer to center, small, and contains roughly an equal number of seen and unseen observations. 
All of the unseen and seen observations in the living room fall in the upper left, tightly concentrated grouping. There are a few outlying unseen and seen observations surrounding this grouping. \" class=\"wp-image-751207\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-1024x238.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-300x70.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-768x179.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-1536x357.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ-16x4.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Fig4_-TBGJ.png 1719w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 4: Above are t-SNE visualizations of text observations of the game <em>Zork 1<\/em>. Outlined in black, observations originating from the Living Room are all semantically similar (see text observation 21 in Figure 2). The Inverse-Dynamics agent (inv-dy in middle) minimizes distances between these semantically similar observations, whereas DRRN-base (left) clusters them disparately. Furthermore, when encountering unseen states, DRRN-base is unable to relate these novel observations to any of the previously encountered states, likely leading to poor generalization. On the other hand, Inverse-Dynamics can use its language understanding to relate unseen states to semantically similar, previously seen states.<\/figcaption><\/figure>\n\n\n\n<p>As an added benefit, we show that it\u2019s possible to use the inverse-dynamics model to incentivize the agent to explore regions of the state space in which model predictions are inaccurate. 
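One simple way to implement that incentive is to add the inverse-dynamics prediction loss to the game reward as an intrinsic curiosity bonus. This is a sketch under our own assumptions; the mixing coefficient and shaping details are illustrative, not the paper's exact formulation:

```python
def shaped_reward(score_delta: float, inv_dyn_loss: float, beta: float = 0.5) -> float:
    """Combine the change in game score with a curiosity bonus.

    Transitions the inverse-dynamics model cannot yet explain (high
    prediction loss) earn extra reward, pushing the agent toward
    under-explored regions. beta is an illustrative mixing weight.
    """
    return score_delta + beta * inv_dyn_loss

# A surprising but unscored transition can still be worth revisiting:
assert shaped_reward(0.0, 2.0) > shaped_reward(0.0, 0.1)
```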
This improved exploration led the agent to notably high single-episode scores of 54, 94, and 113 on<em> Zork 1<\/em>, compared to previous models that get stuck at a score of 55. Looking at the trajectories, we observe that the inverse-dynamics model uniquely exhibits some challenging exploration behaviors, such as navigating through \u201ca maze of twisty little passages, all alike\u201d; taking a coin at a specific spot of the maze (+10 score); or finding the \u201cCyclops Room\u201d at the exit of the maze and then going up to the \u201cTreasure Room\u201d (+25 score). Thus, we find that by improving semantic understanding, approaches like inverse dynamics can also lead to better exploration in agents.<\/p>\n\n\n\n<h2 id=\"leveraging-human-priors-for-action-selection\">Leveraging human priors for action selection<\/h2>\n\n\n\n<p>Another way to build in semantic language understanding is to leverage it for action generation. Published in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/2020.emnlp.org\/\">Conference on Empirical Methods in Natural Language Processing (EMNLP 2020)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, \u201c<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/keep-calm-and-explore-language-models-for-action-generation-in-text-based-games\/\">Keep CALM and Explore: Language Models for Action Generation in Text-based Games,<\/a>\u201d introduces a text game\u2013oriented language model called CALM.<\/p>\n\n\n\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Publication<\/span>\n\t\t\t<a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/keep-calm-and-explore-language-models-for-action-generation-in-text-based-games\/\" data-bi-cN=\"Keep CALM and Explore: Language Models for Action Generation in Text-based Games\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Keep CALM and Explore: Language Models for Action Generation in Text-based Games<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<p>CALM (short for Contextual-Action Language Model) avoids the use of the valid-action handicap altogether by using a language model to generate a compact set of candidate actions. CALM starts from a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.persagen.com\/files\/misc\/radford2019language.pdf\">pretrained GPT-2 model&nbsp;<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>and is fine-tuned on a corpus featuring years of&nbsp;human-gameplay transcripts spanning 590 different text-based games. 
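Concretely, such a corpus can be serialized into (context, next-action) pairs for fine-tuning a causal language model. The bracketed separator tokens below are hypothetical choices of ours, not necessarily the paper's exact format:

```python
def build_examples(transcript):
    """Flatten a gameplay transcript of (observation, action) pairs into
    training strings: given the previous action and the resulting
    observation, the language model learns to continue with the action
    the human player chose next."""
    examples = []
    prev_action = "<s>"
    for obs, action in transcript:
        context = f"{prev_action} [OBS] {obs} [ACT]"
        examples.append((context, " " + action))
        prev_action = action
    return examples

transcript = [
    ("West of House. You are standing in an open field west of a white house.", "open mailbox"),
    ("Opening the small mailbox reveals a leaflet.", "read leaflet"),
]
pairs = build_examples(transcript)
```

Fine-tuning on millions of such pairs is what lets CALM later propose plausible candidate actions for games it has never seen.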
During this training process, CALM is instilled with priors that humans use when playing a large variety of text-based games.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-medium is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Observation: You are in the living room.\nThere is a doorway to the east, a wooden door\nwith strange gothic lettering to the west, which\nappears to be nailed shut, a trophy case, and\na large oriental rug in the center of the room.\nYou are carrying: A brass lantern ...\nRandom Actions (red outline):\nclose door, north a, eat troll with egg, ...\nCALM (n-gram) Actions (blue outline):\nenter room, leave room, lock room,\nopen door, close door, knock on door, ...\nCALM (GPT-2) Actions (green outline):\neast, open case, get rug, turn on lantern,\nmove rug, unlock case with key, ...\nNext Observation: With a great effort, the rug\nis moved to one side of the room, revealing\nthe dusty cover of a closed trap door...\" href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure-5_TBGJ2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure-5_TBGJ2-257x300.png\" alt=\"Observation: You are in the living room.\nThere is a doorway to the east, a wooden door\nwith strange gothic lettering to the west, which\nappears to be nailed shut, a trophy case, and\na large oriental rug in the center of the room.\nYou are carrying: A brass lantern ...\nRandom Actions (red outline):\nclose door, north a, eat troll with egg, ...\nCALM (n-gram) Actions (blue outline):\nenter room, leave room, lock room,\nopen door, close door, knock on door, ...\nCALM (GPT-2) Actions (green outline):\neast, open case, get rug, turn on lantern,\nmove rug, unlock case with key, ...\nNext Observation: With a great effort, the rug\nis moved to one side of the room, revealing\nthe dusty cover of a closed trap door...\" class=\"wp-image-751210\" 
width=\"363\" height=\"424\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure-5_TBGJ2-257x300.png 257w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure-5_TBGJ2-10x12.png 10w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/Figure-5_TBGJ2.png 604w\" sizes=\"auto, (max-width: 363px) 100vw, 363px\" \/><\/a><figcaption>Figure 5: CALM can pair common verbs with objects from an observation when trained with an n-gram (blue), and it can generate more complex actions when trained with GPT-2 (green).<\/figcaption><\/figure><\/div>\n\n\n\n<p>As shown in Figure 5, conditioned on the observation text, randomly sampled actions are nonsensical (outlined in red). CALM, trained with an n-gram model, pairs common verbs with objects from the observation (outlined in blue), and CALM trained with GPT-2 (outlined in green) can generate actions that are more complex and highly relevant.<\/p>\n\n\n\n<p>To quantify this effect, we compared the quality of CALM-generated actions to ground-truth valid actions on walkthrough trajectories from 28 games. We find higher precision and recall with CALM than with a comparable model trained using a bag-of-words representation.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-1024x351.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"351\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-1024x351.png\" alt=\"The full CALM architecture: Given the current game context, the CALM language model generates action candidates such as \u201copen mailbox\u201d. 
The DRRN reinforcement learning agent conditions on the current observation, which is encoded word-by-word using a GRU, as well as each action candidate, which is encoded similarly through a separate GRU. DRRN subsequently estimates a Q-value for each observation-action pair. The final action is sampled from the Q-values using a softmax distribution. \" class=\"wp-image-751213\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-1024x351.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-300x103.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-768x263.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-1536x526.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ-16x5.png 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/fig6_TBGJ.png 1706w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 6: The full CALM architecture. Given the current game context, the CALM language model generates action candidates such as \u201copen mailbox\u201d. The DRRN reinforcement learning agent conditions on the current observation, which is encoded word-by-word using a GRU, as well as each action candidate, which is encoded similarly through a separate GRU. DRRN subsequently estimates a Q-value for each observation-action pair. The final action is sampled from the Q-values using a softmax distribution.<\/figcaption><\/figure><\/div>\n\n\n\n<p>The figure above shows the full integration of the CALM language model with the DRRN learning agent. CALM is responsible for producing Action Candidates conditioned on the current context, while DRRN attempts to select the best Action Candidate from the produced set. 
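The final selection step described in Figure 6 can be sketched as follows; the candidate list and Q-values here are hard-coded stand-ins for CALM's generations and DRRN's learned estimates:

```python
import math
import random

def softmax_sample(candidates, q_values, temperature=1.0):
    """Sample one action from a softmax distribution over Q-values,
    as DRRN does over the CALM-generated candidate set."""
    z = max(q_values)  # shift by the max so exp() cannot overflow
    weights = [math.exp((q - z) / temperature) for q in q_values]
    return random.choices(candidates, weights=weights, k=1)[0]

candidates = ["open mailbox", "go north", "eat leaflet"]  # stand-in for CALM output
q_values = [1.8, 0.4, -2.0]                               # stand-in for DRRN estimates
action = softmax_sample(candidates, q_values)
assert action in candidates
```

Sampling rather than greedily taking the arg-max keeps some exploration in the loop, which matters when the Q-estimates for rarely tried candidates are unreliable.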
This action is then executed in the environment and the cycle repeats.<\/p>\n\n\n\n<p>We subsequently deployed CALM in the RL context on held-out games excluded from the training set. CALM acts as a drop-in replacement for the valid-action handicap by generating a small set of likely actions conditioned on the current text observation. We find that in eight of the 28 games, CALM can entirely replace Jericho\u2019s valid-action handicap, and agents trained to pick from CALM-generated actions outperform those trained with the handicap.<\/p>\n\n\n\n<p>We additionally compare against NAIL, a competition-winning general game-playing agent that uses handwritten rules to explore and act. We find that CALM achieves an average normalized completion of 9.4% across games, compared to 5.6% for NAIL.<\/p>\n\n\n\n<h2 id=\"valid-action-identification-and-calm-a-springboard-for-furthering-semantic-understanding\">Valid-action identification and CALM: A springboard for furthering semantic understanding<\/h2>\n\n\n\n<p>Text-based games remain an exciting and challenging testbed for the development of language agents. Our recent work highlights the need to carefully examine how these handicaps affect the development of semantic understanding in the text agents trained with them. Our findings show that handicaps such as valid-action identification can allow agents to bypass the challenges of developing semantic language understanding. We outlined two different ways of increasing semantic understanding in text agents. First, we show better semantic clustering of text observations using an auxiliary task focused on inverse dynamics prediction. 
Second, we show compelling performance by using the contextual action language model to generate candidate actions, eliminating the valid-action handicap altogether.<\/p>\n\n\n\n<p>More broadly, we aim to spark discussion with the community around new methods and experimental protocols that encourage agents to develop better language understanding, and we are excited about the continued development of text-based agents with strong semantic understanding.<\/p>\n\n\n\n<p>Read our paper <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/keep-calm-and-explore-language-models-for-action-generation-in-text-based-games\/\">here<\/a> and find the source code <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/princeton-nlp\/calm-textgame\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI agents capable of understanding natural language, communicating, and accomplishing tasks hold the promise of revolutionizing the way we interact with computers in our everyday lives. Text-based games, such as the Zork series, act as testbeds for development of novel learning agents capable of understanding and interacting exclusively through language. 
Beyond requiring the use of [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":751741,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13554],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-750835","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-human-computer-interaction","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[740920],"related-researchers":[{"type":"guest","value":"karthik-narasimhan","user_id":"751732","display_name":"Karthik  Narasimhan","author_link":"<a href=\"https:\/\/www.cs.princeton.edu\/~karthikn\/\" aria-label=\"Visit the profile page for Karthik  Narasimhan\">Karthik  Narasimhan<\/a>","is_active":true,"last_first":"Narasimhan, Karthik ","people_section":0,"alias":"karthik-narasimhan"},{"type":"guest","value":"shunyu-yao","user_id":"751735","display_name":"Shunyu  Yao","author_link":"Shunyu  Yao","is_active":true,"last_first":"Yao, Shunyu ","people_section":0,"alias":"shunyu-yao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-960x540.jpg\" 
class=\"img-object-cover\" alt=\"New research explores reinforcement learning methods to improve semantic understanding in text agents, a key process by which AI understands and reacts to text-based input. Learning agents understand the world by parsing observations: West of House You are standing in an open field west of a white house, with a boarded front door. There is a small mailbox here. And generate language-based actions that change the state of the game: &gt; south South of House You are facing the south side of a white house. There is no door here, and all the windows are boarded. &gt; east Behind House You are behind the white house. A path leads into the forest to the east. In one corner of the house there is a small window which is slightly ajar. &gt; open window\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-300x169.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-1024x576.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-768x432.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-1536x865.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-2048x1153.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-16x9.jpg 16w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-1066x600.jpg 1066w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-343x193.jpg 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-1280x720.jpg 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/06\/1400x788_Text_based_games_no_logo_still-1-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Matthew Hausknecht, <a href=\"https:\/\/www.cs.princeton.edu\/~karthikn\/\" title=\"Go to researcher profile for Karthik  Narasimhan\" aria-label=\"Go to researcher profile for Karthik  Narasimhan\" data-bi-type=\"byline author\" data-bi-cN=\"Karthik  Narasimhan\">Karthik  Narasimhan<\/a>, and Shunyu  Yao","formattedDate":"June 7, 2021","formattedExcerpt":"AI agents capable of understanding natural language, communicating, and accomplishing tasks hold the promise of revolutionizing the way we interact with computers in our everyday lives. 
Text-based games, such as the Zork series, act as testbeds for development of novel learning agents capable of understanding&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/750835","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=750835"}],"version-history":[{"count":23,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/750835\/revisions"}],"predecessor-version":[{"id":751849,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/750835\/revisions\/751849"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/751741"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=750835"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=750835"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=750835"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=750835"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=750835"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=750
835"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=750835"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=750835"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=750835"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=750835"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=750835"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}