Building stronger semantic understanding into text game reinforcement learning agents


By , Senior Researcher , Assistant Professor , PhD student, Princeton University

AI agents capable of understanding natural language, communicating, and accomplishing tasks hold the promise of revolutionizing the way we interact with computers in our everyday lives. Text-based games, such as the Zork series, act as testbeds for development of novel learning agents capable of understanding and interacting exclusively through language. Beyond requiring the use of imagination and myriad concepts of everyday life to solve, these fictional world–based narratives are also a safe sandbox for AI testing that avoids the expense of collecting user data and the risk of users having a bad experience interacting with agents that are still learning.

In this blog post, we share two papers that explore reinforcement learning methods to improve semantic understanding in text agents, a key process by which AI understands and reacts to text-based input. We’re also releasing source code for these agents to encourage the community to continue to improve semantic understanding in text-based games.

Ever since text-based games were proposed as a benchmark for language understanding agents, a key challenge in these games has been the enormous action space. Games like Zork 1 can have up to 98 million possible actions in each state, the majority of which are nonsensical, ungrammatical, or inapplicable. To make text-based games more approachable to reinforcement learning (RL) agents, the Jericho framework provides several handicaps, such as valid-action identification, which uses the game engine to identify a minimal set of 10–100 textual actions applicable in the current game state. Other handicaps include the ability to extract the recognized vocabulary for a game and to save and restore previously visited states. Certain RL agents depend on the valid-action handicap, such as the deep reinforcement relevance network (DRRN), which learns to choose, from the set of valid actions at each timestep, the action that maximizes the expected game score.

Example from text-based game. From top to bottom: Box with dotted line reads “Observation: This bedroom is extremely spare, with dirty laundry scattered haphazardly all over the floor. Cleaner clothing can be found in the dresser. A bathroom lies to the south, while a door to the east leads to the living room. On the end table are a telephone, a wallet and some keys. The phone rings.”
Action: Answer phone
Box with dotted line reads “Observation: You pick up the phone.”
Figure 1: An example text-based game consists of language-based observations describing the simulated world and actions, taken by the agent, that alter the state of the world and advance the story. Valid Actions, identified through introspection into the game emulator, are a list of actions applicable at the current step that are guaranteed to change the state of the game.

Despite its usefulness from an RL tractability standpoint, the valid-action handicap relies on privileged game insights and can therefore expose hidden information about the environment to the agent. For example, in Figure 1, the valid action “take off watch” inadvertently leaks the existence of a watch that was not revealed in the observation text.

Our recent paper, “Reading and Acting while Blindfolded: The need for semantics in text game agents,” shows that when using Jericho-provided handicaps, such as the identification of valid actions, it’s possible for text agents to achieve competitive scores while using textual representations entirely devoid of semantics. The paper has been accepted to the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021).

Hash-based representation reveals that handicap agents bypass semantic understanding

To probe the extent of semantic understanding of the DRRN agent, we replace the observation text and valid-action texts with consistent, yet semantically meaningless text strings derived by hashing the current state and actions. Figure 2 compares the standard textual representation of observations and actions with two ablations. MIN-OB (b) shortens observation text to only include the name of the current location without description of the objects or characters therein. HASH (c) entirely replaces the text of the observation and actions with the output of a hash-based representation.

Three boxes, (A), (B), and (C), show a standard textual representation and two ablations, respectively.
(A) ZORK 1
Observation 21: You are in the living room. There is a doorway to the
east, a wooden door with strange gothic lettering to the west, which
appears to be nailed shut, a trophy case, and a large oriental rug in the
center of the room. You are carrying: A brass lantern.
Action 21: move rug
Observation 22: With a great effort, the rug is moved to one side of the
room, revealing the dusty cover of a closed trap door. Living room
You are carrying: ...
Action 22: open trap

(B) MIN-OB
Observation 21: Living Room
Action 21: move rug
Observation 22: Living Room
Action 22: open trap

(C) HASH
Observation 21: 0x6FC
Action 21: 0x3A04
Observation 22: 0x103B
Action 22: 0x16BB
Figure 2: A standard textual representation (left) and two ablations: MIN-OB (top right) and HASH (bottom right).

Specifically, we use Python’s hash function to map each text string to an effectively unique hash value, which ensures that even if a single word in the text changes, the representation will be completely different. For a human playing the game, representations (b) and (c) would pose significant challenges because they lack semantic information.
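To make the idea concrete, here is a minimal sketch of a hash-based representation built on Python's built-in hash function. The helper name hash_text and the hexadecimal formatting are assumptions for illustration, not the code used in the paper. Note that Python salts string hashes per process unless PYTHONHASHSEED is set, so the mapping is consistent within one run but not across runs.

```python
def hash_text(text: str) -> str:
    """Map a string to a consistent but semantically meaningless token.

    hash_text is a hypothetical helper, not the paper's implementation.
    """
    # Mask to 64 bits so negative hash values also print as plain hex.
    return hex(hash(text) & 0xFFFFFFFFFFFFFFFF)


observation = "You are in the living room."
# The same text always maps to the same token within one process.
print(hash_text(observation) == hash_text(observation))
# A one-word edit yields an unrelated token, so no meaning can be shared.
print(hash_text("move rug"), hash_text("move the rug"))
```

An agent fed only these tokens can still tell states and actions apart, but it cannot exploit any similarity between “move rug” and “move the rug.”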

Subsequently, a DRRN text agent was trained from scratch using RL to estimate the quality of each candidate action. Surprisingly, the agent trained on the hashed representation achieved scores comparable to those of the control agents using the original text-based observations and actions (see Figure 3), and it even outperformed the baseline (DRRN-base) in 3 out of the 12 games. This finding suggests that current text agents are likely circumventing the development of semantic understanding, particularly when trained on single games using the valid-action handicap.

Average normalized score for: DRRN (HASH) 25.0%, DRRN (MIN-OB) 12.0%, DRRN (base) 21.0%. Table showing game name followed by raw scores (average/maximum) for DRRN, MIN-OB, HASH, and INV-DY, and the game's maximum score (Max):
Balances: DRRN 10/10, MIN-OB 10/10, HASH 10/10, INV-DY 10/10, Max 51
Deephome: DRRN 57/66, MIN-OB 8.5/27, HASH 58/67, INV-DY 57.6/67, Max 300
Detective: DRRN 290/337, MIN-OB 86.3/350, HASH 290/317, INV-DY 290/323, Max 360
Dragon: DRRN -5.0/6, MIN-OB -5.4/3, HASH -5.0/7, INV-DY -2.7/8, Max 25
Enchanter: DRRN 20/20, MIN-OB 20/40, HASH 20/30, INV-DY 20/30, Max 400
Inhumane: DRRN 21.1/45, MIN-OB 12.4/40, HASH 21.9/45, INV-DY 19.6/45, Max 90
Library: DRRN 15.7/21, MIN-OB 12.8/21, HASH 17/21, INV-DY 16.2/21, Max 30
Ludicorp: DRRN 12.7/23, MIN-OB 11.6/21, HASH 14.8/23, INV-DY 13.5/23, Max 150
Omniquest: DRRN 4.9/5, MIN-OB 4.9/5, HASH 4.9/5, INV-DY 5.3/10, Max 50
Pentari: DRRN 26.5/45, MIN-OB 21.7/45, HASH 51.9/60, INV-DY 37.2/50, Max 70
zork1: DRRN 39.4/53, MIN-OB 29/46, HASH 35.5/50, INV-DY 43.1/87, Max 350
zork3: DRRN 0.4/4.5, MIN-OB 0.0/4, HASH 0.4/4, INV-DY 0.4/4, Max 7
Average Norm: DRRN .21/.38, MIN-OB .12/.35, HASH .25/.39, INV-DY .23/.40
Figure 3: The average normalized scores for DRRN (HASH), DRRN (MIN-OB), and DRRN-base (left) show that DRRN using the hash representation outperforms DRRN-base trained using textual inputs. Raw scores (right) show average and maximum performance on each game.

In light of this finding, we highlight two recent methods that we believe encourage agents to build stronger semantic understanding throughout the learning process:

Regularizing semantics via inverse dynamics decoding

Standard RL centers around the idea of learning a policy mapping from observations to actions that maximizes cumulative long-term discounted rewards. We contend that such an objective only encourages robust semantic understanding to the extent that’s required to pick out the most rewarding actions from a list of candidates.

We propose the use of an auxiliary training task that focuses on developing stronger semantic understanding: the inverse dynamics task, which challenges the agent to predict the action that links two adjacent state observations. Consider the following observations: “You stand at the foot of a tall tree” and “You sit atop a large branch, feet dangling in the air.” It’s reasonable to surmise that the intervening action may have been “climb tree.” Specifically, the agent is presented with a finite set of valid actions that could have been selected from the previous state and must choose which of them was most likely to have been selected given the text of the subsequent state. This auxiliary loss has been investigated in the context of video games like Super Mario Bros.
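The inverse dynamics task can be sketched as a small classification problem. The example below is an illustration under stated assumptions, not the paper's model: the 2-d embeddings are made up, and the scoring rule (a dot product between each action embedding and the change between the two state embeddings) is a hypothetical stand-in for a learned network.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_dynamics_loss(state_vec, next_state_vec, action_vecs, taken_index):
    """Cross-entropy loss for guessing which valid action was taken.

    Assumed scoring rule: dot product between each action embedding and
    the change from the previous state embedding to the next one.
    """
    delta = [n - s for s, n in zip(state_vec, next_state_vec)]
    scores = [sum(d * a for d, a in zip(delta, vec)) for vec in action_vecs]
    probs = softmax(scores)
    return -math.log(probs[taken_index]), probs


# Toy 2-d embeddings, invented for illustration only.
foot_of_tree = [0.0, 0.0]
atop_branch = [1.0, 0.0]
actions = {"climb tree": [1.0, 0.0], "go north": [0.0, 1.0], "wait": [0.0, 0.0]}

loss, probs = inverse_dynamics_loss(
    foot_of_tree, atop_branch, list(actions.values()), taken_index=0
)
print(dict(zip(actions, probs)), loss)
```

In training, a loss of this kind would be added to the usual RL objective, so the text encoder is rewarded for capturing what changed between observations.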

By training agents with this inverse dynamics auxiliary task, we show that they have stronger semantic understanding, as evidenced by their ability to generalize between semantically similar states in Figure 4.

DRRN (base), on the left, shows seen and unseen text observations and semantically similar observations originating from the living room. There are three distinct groupings on a scatter plot. The bottom right corner contains the majority of observations in a fairly tight grouping. Most are unseen, with about 13 marked as unseen in the living room. One observation is seen. A small grouping in the middle, slightly toward the upper left, shows all seen observations, with about 4 marked as in the living room. A medium-sized, tightly concentrated grouping in the upper left corner shows mostly unseen observations, with a few seen observations and a concentrated group of unseen observations in the living room. On the right, in DRRN (INV-DY), a different pattern occurs. There is a more loosely concentrated large grouping of unseen observations in the lower right corner, with more seen observations (about 10). The middle grouping is closer to the center, small, and contains roughly an equal number of seen and unseen observations. All of the unseen and seen observations in the living room fall in the upper left, tightly concentrated grouping. There are a few outlying unseen and seen observations surrounding this grouping.
Figure 4: Above are t-SNE visualizations of text observations of the game Zork 1. Outlined in black, observations originating from the Living Room are all semantically similar (see text observation 21 in Figure 2). The Inverse-Dynamics agent (inv-dy in middle) minimizes distances between these semantically similar observations, whereas DRRN-base (left) clusters them disparately. Furthermore, when encountering unseen states, DRRN-base is unable to relate these novel observations to any of the previously encountered states, likely leading to poor generalization. On the other hand, Inverse-Dynamics can use its language understanding to relate unseen states to semantically similar, previously seen states.

As an added benefit, we show that it’s possible to use the inverse-dynamics model to incentivize the agent to explore regions of the state space in which model predictions are inaccurate. This improved exploration led the agent to notably high single-episode scores of 54, 94, and 113 on Zork 1, compared to previous models that get stuck at a score of 55. Looking at the trajectories, we observe that the inverse-dynamics model uniquely exhibits some challenging exploration behaviors, such as navigating through “a maze of twisty little passages, all alike”; taking a coin at a specific spot of the maze (+10 score); or finding the “Cyclops Room” at the exit of the maze and then going up to the “Treasure Room” (+25 score). Thus, we find that by improving semantic understanding, approaches like inverse dynamics can also lead to better exploration in agents.
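One simple way to picture this kind of exploration incentive is a bonus that grows when the inverse-dynamics model assigns low probability to the action that was actually taken. The sketch below assumes that form; the function names, the scale parameter, and the additive combination with the game reward are illustrative choices, not details taken from the paper.

```python
import math


def exploration_bonus(prob_of_taken_action, scale=1.0):
    """Larger bonus when the model predicted the taken action poorly."""
    # Clamp to avoid log(0) when the model gives the action zero probability.
    return scale * -math.log(max(prob_of_taken_action, 1e-8))


def shaped_reward(game_reward, prob_of_taken_action, scale=1.0):
    """Game reward plus the (assumed additive) exploration bonus."""
    return game_reward + exploration_bonus(prob_of_taken_action, scale)


# A well-predicted transition earns almost no bonus; a surprising one earns more.
print(shaped_reward(0.0, prob_of_taken_action=0.9))
print(shaped_reward(0.0, prob_of_taken_action=0.1))
```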

Leveraging human priors for action selection

Another way to build in semantic language understanding is to leverage it for action generation. Published in the Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), “Keep CALM and Explore: Language Models for Action Generation in Text-based Games,” introduces a text game–oriented language model called CALM.

CALM (short for Contextual-Action Language Model) avoids the use of the valid-action handicap altogether by using a language model to generate a compact set of candidate actions. CALM starts from a pretrained GPT-2 model and is fine-tuned on a corpus featuring years of human-gameplay transcripts spanning 590 different text-based games. During this training process, CALM is instilled with priors that humans use when playing a large variety of text-based games.

Observation: You are in the living room.
There is a doorway to the east, a wooden door
with strange gothic lettering to the west, which
appears to be nailed shut, a trophy case, and
a large oriental rug in the center of the room.
You are carrying: A brass lantern ...
Random Actions (red outline):
close door, north a, eat troll with egg, ...
CALM (n-gram) Actions (blue outline):
enter room, leave room, lock room,
open door, close door, knock on door, ...
CALM (GPT-2) Actions (green outline):
east, open case, get rug, turn on lantern,
move rug, unlock case with key, ...
Next Observation: With a great effort, the rug
is moved to one side of the room, revealing
the dusty cover of a closed trap door...
Figure 5: CALM can pair common verbs with objects from an observation when trained with an n-gram (blue), and it can generate more complex actions when trained with GPT-2 (green).

As shown in Figure 5, conditioned on the observation text, randomly sampled actions are nonsensical (outlined in red). CALM, trained with an n-gram model, pairs common verbs with objects from the observation (outlined in blue), and CALM trained with GPT-2 (outlined in green) can generate actions that are more complex and highly relevant.

To quantify this effect, we compared the quality of CALM-generated actions to ground-truth valid actions on walkthrough trajectories from 28 games. We find higher precision and recall with CALM than with a comparable model trained using a bag-of-words representation.
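For readers unfamiliar with the metric, the comparison can be sketched as set overlap between generated and valid actions. This is a minimal illustration, assuming exact string matching; the generated list is taken from the CALM (GPT-2) actions in Figure 5, and the valid list is invented for the example.

```python
def precision_recall(generated, valid):
    """Precision and recall of generated actions against valid actions."""
    gen, val = set(generated), set(valid)
    hits = len(gen & val)
    precision = hits / len(gen) if gen else 0.0
    recall = hits / len(val) if val else 0.0
    return precision, recall


generated = ["east", "open case", "get rug", "move rug"]
# Hypothetical ground-truth list, not taken from the game engine.
valid = ["east", "open case", "move rug", "turn on lantern", "take lantern"]

precision, recall = precision_recall(generated, valid)
print(precision, recall)  # 0.75 0.6
```

Precision penalizes generated actions that the game would reject, while recall penalizes valid actions that the generator misses.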

Figure 6: The full CALM architecture. Given the current game context, the CALM language model generates action candidates such as “open mailbox.” The DRRN reinforcement learning agent conditions on the current observation, which is encoded word by word using a GRU, as well as on each action candidate, which is encoded similarly through a separate GRU. DRRN subsequently estimates a Q-value for each observation-action pair. The final action is sampled from the Q-values using a softmax distribution.

The figure above shows the full integration of the CALM language model with the DRRN learning agent. CALM is responsible for producing Action Candidates conditioned on the current context, while DRRN attempts to select the best Action Candidate from the produced set. This action is then executed in the environment and the cycle repeats.
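The final selection step, sampling an action from the Q-values with a softmax, can be sketched as follows. The candidate strings and Q-values are stand-ins rather than real CALM or DRRN outputs, and the temperature parameter is an assumption added for illustration.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Turn Q-values into a probability distribution (numerically stable)."""
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(candidates, q_values, rng):
    """Sample one candidate with probability given by softmax(Q)."""
    probs = softmax(q_values)
    action = rng.choices(candidates, weights=probs, k=1)[0]
    return action, probs


candidates = ["open mailbox", "go north", "take leaflet"]  # stand-ins for CALM output
q_values = [2.0, 0.5, 1.0]  # stand-ins for DRRN estimates
action, probs = select_action(candidates, q_values, random.Random(0))
print(action, probs)
```

Sampling rather than always taking the highest-valued action keeps some exploration in the loop: lower-valued candidates are still tried occasionally.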

We subsequently deployed CALM in the RL context on held-out games that were excluded from the training set. CALM acts as a drop-in replacement for the valid-action handicap by generating a small set of likely actions conditioned on the current text observation. We find that in eight of 28 games, CALM can entirely replace Jericho’s valid-action handicap: agents trained to pick from CALM-generated actions outperform those trained with the handicap.

We additionally compare against NAIL, a competition-winning general game-playing agent that uses handwritten rules to explore and act. We find that CALM achieves an average normalized completion of 9.4% across games, compared to 5.6% for NAIL.

Valid-action identification and CALM: A springboard for furthering semantic understanding

Text-based games remain an exciting and challenging testbed for the development of language agents. Our recent work highlights the need for careful examination of the handicaps as they relate to the development of semantic understanding in the text agents trained with them. Our findings show that handicaps such as valid-action identification can allow agents to bypass the challenges of developing semantic language understanding. We outlined two different ways of increasing semantic understanding in text agents. First, we show better semantic clustering of text observations using an auxiliary task focused on inverse dynamics prediction. Second, we show compelling performance by using the contextual-action language model to create candidate actions and eliminating the valid-action handicap altogether.

More broadly, we aim to spark discussion with the community around the new methods and experimental protocols to encourage agents to develop better language understanding, and we are excited about the continued development of text-based agents with strong semantic understanding.

Read our paper here and find the source code here.
