Microsoft Research Blog


Provably efficient reinforcement learning with rich observations

June 3, 2019 | By Akshay Krishnamurthy, Researcher

From left to right: Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal

Reinforcement learning, a machine learning paradigm for sequential decision making, has stormed into the limelight, receiving tremendous attention from both researchers and practitioners. When combined with deep learning, reinforcement learning (RL) has produced impressive empirical results, but the successes to date are limited to simulation scenarios in which data is cheap, primarily because modern “deep RL” algorithms are extremely data hungry. In “Provably Efficient RL with Rich Observations via Latent State Decoding”, Microsoft Research PhD intern Simon S. Du of Carnegie Mellon University, Nan Jiang of UIUC, Alekh Agarwal of Microsoft Research AI, along with me, Miroslav Dudík, and John Langford of Microsoft Research New York City, develop a provably data-efficient reinforcement learning algorithm, making important progress towards practical RL for real-world applications. The paper will be presented at the Thirty-sixth International Conference on Machine Learning (ICML 2019) in June by Simon Du.

Reinforcement learning involves an agent interacting with an environment through trial and error. The agent repeatedly uses environment cues, or observations, to perform actions, and these actions alter the state of the environment. The goal is for the agent to accomplish some pre-specified task—for example, to have a robot navigate to its charging station. In modern RL scenarios, the observations are typically rich and noisy sensor readings, like images from a front-view camera on the robot. However, the state of the environment is often much simpler; it might just be the location of the robot.

The key to data-efficient RL is strategic exploration—the agent must learn about the environment through interaction, which often requires performing specific action sequences to reach particular environment states. Classical algorithms address the exploration question when the state is directly observed (the “tabular” setting), by counting the number of times each state has been visited and driving the agent to minimally visited states. However, these methods fail with rich observations because the state is not directly observed and must be inferred. Indeed, prior to our ICML 2017 paper, there were no provably data-efficient methods for handling rich observations.
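
To make the tabular idea concrete, here is a minimal count-based exploration sketch in Python. The integer-state environment interface (reset/step) and all hyperparameters are illustrative assumptions for this post, not the classical algorithms with formal guarantees.

```python
import numpy as np

def count_based_q_learning(env, num_states, num_actions,
                           episodes=500, horizon=40, lr=0.5, gamma=0.99):
    """Tabular Q-learning with a count-based exploration bonus.

    Assumes `env.reset()` returns an integer state id and `env.step(action)`
    returns (next_state, reward, done); all hyperparameters are illustrative.
    """
    q = np.zeros((num_states, num_actions))
    counts = np.ones((num_states, num_actions))  # N(s, a); start at 1 to avoid dividing by zero

    for _ in range(episodes):
        state = env.reset()
        for _ in range(horizon):
            # The bonus shrinks as a state-action pair is visited more often,
            # so the greedy choice is pulled toward rarely visited states.
            bonus = 1.0 / np.sqrt(counts[state])
            action = int(np.argmax(q[state] + bonus))

            next_state, reward, done = env.step(action)
            counts[state, action] += 1

            target = reward + gamma * np.max(q[next_state]) * (not done)
            q[state, action] += lr * (target - q[state, action])

            state = next_state
            if done:
                break
    return q
```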

Unfortunately, our previous method was not computationally tractable, but the new paper resolves this shortcoming. The new algorithm is a computationally and data-efficient method for strategic exploration with rich observations. It explores by first extracting underlying states from the observations, a form of dimensionality reduction. Once we have a small learned state space, classical RL algorithms are tractable, and therefore, our key contribution is a new technique for learning state representations.
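
To see how a learned state space plugs into classical methods, the sketch below wraps a rich-observation environment with a decoder so that a tabular learner (such as the count-based routine sketched above) only ever sees small integer state ids. The wrapper and its reset/step interface are illustrative assumptions, not the construction analyzed in the paper.

```python
class DecodedEnv:
    """Wrap a rich-observation environment so that a tabular algorithm sees
    small integer state ids instead of raw observations.

    `decode` maps an observation to a learned state id (one way to obtain
    such a mapping is sketched below); the wrapper itself is an illustrative
    stand-in, not the interface used by the paper's algorithm.
    """

    def __init__(self, env, decode):
        self.env = env
        self.decode = decode

    def reset(self):
        return self.decode(self.env.reset())

    def step(self, action):
        observation, reward, done = self.env.step(action)
        return self.decode(observation), reward, done
```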

We learn representations by leveraging supervised learning. Specifically, our supervised learner uses the observation to predict the “backward transition probability”—a distribution over previous actions and the state representation of the previous observation. We then construct the state representation from the predictions of the supervised learner. The intuition is that observations arising from semantically similar behaviors will induce the same predictions, so they will be collapsed into a single underlying state. We call the final algorithm “PCID,” which stands for “Policy Cover by Inductive Decoding.”
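
Here is one way such a decoding step could look in code: train a classifier to predict the joint (previous state, previous action) label from the current observation, then cluster observations whose predicted backward distributions agree. The particular learner (logistic regression) and clustering routine (k-means) are stand-ins chosen for illustration; PCID's actual subroutines and guarantees are spelled out in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def learn_state_decoder(observations, prev_state_ids, prev_actions,
                        num_actions, num_states):
    """Sketch: decode latent states by clustering backward predictions.

    observations   : (n, d) array of rich observations
    prev_state_ids : (n,) learned state id of the preceding observation
    prev_actions   : (n,) action taken just before each observation arrived
    """
    # 1. Supervised step: predict the joint (previous state, previous action)
    #    label from the current observation.
    joint_labels = prev_state_ids * num_actions + prev_actions
    classifier = LogisticRegression(max_iter=1000).fit(observations, joint_labels)

    # 2. The predicted probability vector for an observation estimates its
    #    backward transition distribution.
    backward_probs = classifier.predict_proba(observations)

    # 3. Observations whose predictions (nearly) coincide should share a latent
    #    state, so cluster the prediction vectors into num_states groups.
    clusterer = KMeans(n_clusters=num_states, n_init=10).fit(backward_probs)

    def decode(new_observations):
        # Map fresh observations to learned state ids.
        return clusterer.predict(classifier.predict_proba(new_observations))

    return decode
```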

We obtained a data-efficiency guarantee: the algorithm learns a high-quality state representation and the amount of experience required is independent of the size of the observation space, scaling only with the number of underlying states. This allowed us to scale to rich observation settings and it represents an exponential improvement over earlier approaches. Furthermore, via the reduction to supervised learning, the algorithm is computationally tractable.

Figure 1. Empirical results comparing PCID with Q-Learning in a simple synthetic reinforcement learning environment. Q-Learning has the unfair advantage of directly accessing the latent state, yet, PCID is exponentially more sample efficient.

We also benchmark PCID on synthetic environments, comparing with a simple but popular baseline: Q-learning with epsilon-greedy exploration. Our environments have a small underlying state space but rich high-dimensional observations. While we run PCID on the observations, we give Q-learning an unfair advantage by running it directly on the underlying state space. Despite being severely handicapped in this manner, PCID is exponentially more efficient than Q-learning, scaling gracefully with problem difficulty. The paper includes several other experiments as well.
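
For reference, the baseline's exploration rule is entirely undirected; a minimal sketch of epsilon-greedy action selection (with illustrative defaults) is below. Because nothing steers the agent toward unvisited states, it can need exponentially many episodes on hard exploration problems.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon=0.1, rng=np.random):
    """With probability epsilon take a uniformly random action; otherwise act
    greedily on the current Q-value estimates for this state. Unlike the
    count-based rule sketched earlier, no bonus directs the agent toward
    rarely visited states."""
    if rng.rand() < epsilon:
        return rng.randint(len(q_row))
    return int(np.argmax(q_row))
```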

“Provably Efficient RL with Rich Observations via Latent State Decoding” is the latest in a series of works by Microsoft researchers and colleagues developing a new statistical theory for modern reinforcement learning.

We look forward to seeing you at ICML 2019 in Long Beach, California!
