{"id":590116,"date":"2019-06-03T10:29:11","date_gmt":"2019-06-03T17:29:11","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=590116"},"modified":"2019-06-26T14:11:56","modified_gmt":"2019-06-26T21:11:56","slug":"provably-efficient-reinforcement-learning-with-rich-observations","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/provably-efficient-reinforcement-learning-with-rich-observations\/","title":{"rendered":"Provably efficient reinforcement learning with rich observations"},"content":{"rendered":"<div id=\"attachment_590707\" style=\"width: 1034px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-590707\" class=\"wp-image-590707 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-1024x576.png\" alt=\"From left to right: Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal\" width=\"1024\" height=\"576\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/05\/Provably-Efficient-Reinforcement-Learning-With-Rich-Observations_Site_05_2019_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><p id=\"caption-attachment-590707\" class=\"wp-caption-text\">From left to right: Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal<\/p><\/div>\n<p>Reinforcement learning, a machine learning paradigm for sequential decision making, has stormed into the limelight, receiving tremendous attention from both researchers and practitioners. When combined with deep learning, reinforcement learning (RL) has produced impressive empirical results, but the successes to date are limited to simulation scenarios in which data is cheap, primarily because modern &#8220;deep RL&#8221; algorithms are extremely data hungry. 
In "Provably Efficient RL with Rich Observations via Latent State Decoding" (https://www.microsoft.com/en-us/research/publication/provably-efficient-rl-with-rich-observations-via-latent-state-decoding/), Microsoft Research PhD intern Simon S. Du of Carnegie Mellon University, Nan Jiang of UIUC, and Alekh Agarwal of Microsoft Research AI, along with myself, Miroslav Dudík, and John Langford of Microsoft Research New York City, develop a provably data-efficient reinforcement learning algorithm, making important progress toward practical RL for real-world applications. The paper will be presented at the Thirty-sixth International Conference on Machine Learning (ICML 2019) in June by Simon Du.

Reinforcement learning involves an agent interacting with an environment through trial and error. The agent repeatedly uses environment cues, or observations, to perform actions, and these actions alter the state of the environment. The goal is for the agent to accomplish some pre-specified task, such as having a robot navigate to its charging station. In modern RL scenarios, the observations are typically rich and noisy sensor readings, like images from a front-view camera on the robot. However, the state of the environment is often much simpler; it might just be the location of the robot.

The key to data-efficient RL is strategic exploration: the agent must learn about the environment through interaction, which often requires performing specific action sequences to reach particular environment states. Classical algorithms address the exploration question when the state is directly observed (the "tabular" setting) by counting the number of times each state has been visited and driving the agent toward minimally visited states. However, these methods fail with rich observations because the state is not directly observed and must be inferred.
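For concreteness, here is a minimal sketch of the classical count-based idea just described: tabular Q-learning augmented with a visitation-count bonus that steers the agent toward rarely visited states. The environment interface (env.reset, env.step returning integer state ids) and all hyperparameters are illustrative assumptions, not details from the paper.

```python
import numpy as np

def count_bonus_q_learning(env, n_states, n_actions, episodes=500,
                           horizon=100, alpha=0.1, gamma=0.99, beta=1.0):
    """Tabular Q-learning with a count-based exploration bonus: the
    agent earns extra reward for reaching rarely visited states, so it
    is driven toward the minimally visited parts of the environment."""
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()  # assumed to return an integer state id
        for _ in range(horizon):
            a = int(np.argmax(Q[s]))  # act greedily; the bonus drives exploration
            s_next, reward, done = env.step(a)  # assumed interface
            visits[s_next] += 1
            bonus = beta / np.sqrt(visits[s_next])  # shrinks as visits grow
            target = reward + bonus + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if done:
                break
    return Q
```

Note how the method hinges on indexing Q and the visit counts by the state itself; with rich observations there is no such index available, which is exactly why these classical approaches break down.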
Indeed, prior to our ICML 2017 paper (https://www.microsoft.com/en-us/research/publication/contextual-decision-processes-low-bellman-rank-pac-learnable/), there were no provably data-efficient methods for handling rich observations.

Unfortunately, our previous method was not computationally tractable; the new paper resolves this shortcoming. The new algorithm is a computationally and data-efficient method for strategic exploration with rich observations. It explores by first extracting underlying states from the observations, a form of dimensionality reduction. Once we have a small learned state space, classical RL algorithms are tractable, and therefore our key contribution is a new technique for learning state representations.

We learn representations by leveraging supervised learning. Specifically, our supervised learner uses the observation to predict the "backward transition probability": a distribution over the previous action and the state representation of the previous observation. We then construct the state representation from the predictions of the supervised learner. The intuition is that observations arising from semantically similar behaviors will induce the same predictions, so they will be collapsed into a single underlying state (a sketch of this decoding step appears below). We call the final algorithm "PCID," which stands for "Policy Cover by Inductive Decoding."

We obtained a data-efficiency guarantee: the algorithm learns a high-quality state representation, and the amount of experience required is independent of the size of the observation space, scaling only with the number of underlying states. This allows us to scale to rich observation settings and represents an exponential improvement over earlier approaches. Furthermore, via the reduction to supervised learning, the algorithm is computationally tractable.
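To illustrate the decoding step, here is a minimal sketch, assuming featurized observations in a NumPy array and using off-the-shelf logistic regression and k-means as the supervised learner and the grouping step. The function name and these specific model choices are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def decode_latent_states(obs, prev_states, prev_actions, n_latent_states):
    """Learn a state representation by predicting where each observation
    came from. obs: (n, d) array of observations; prev_states and
    prev_actions: integer ids of the decoded previous state and of the
    action taken just before each observation was generated."""
    # Supervised target: the (previous state, previous action) pair.
    n_actions = int(prev_actions.max()) + 1
    labels = prev_states * n_actions + prev_actions

    # Predict the backward transition probabilities from the observation.
    clf = LogisticRegression(max_iter=1000).fit(obs, labels)
    backward_probs = clf.predict_proba(obs)

    # Observations arising from the same latent state induce (nearly)
    # identical predictions, so clustering the prediction vectors
    # collapses them into a single learned state id per observation.
    kmeans = KMeans(n_clusters=n_latent_states, n_init=10).fit(backward_probs)
    return kmeans.labels_, clf
```

The learned state ids can then stand in for the true states in a classical tabular exploration routine like the one sketched earlier, which is what makes the small learned state space so useful.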
[Figure 1. Empirical results comparing PCID with Q-learning in a simple synthetic reinforcement learning environment. Q-learning has the unfair advantage of directly accessing the latent state, yet PCID is exponentially more sample efficient.]

We also benchmark PCID on synthetic environments, comparing with a simple but popular baseline: Q-learning with epsilon-greedy exploration. Our environments have a small underlying state space but rich, high-dimensional observations. While we run PCID on the observations, we give Q-learning an unfair advantage by running it directly on the underlying state space. Despite being severely handicapped in this manner, PCID is exponentially more efficient than Q-learning, scaling gracefully with problem difficulty. The paper includes several other experiments as well.

"Provably Efficient RL with Rich Observations via Latent State Decoding" is the latest in a series of works by Microsoft researchers and colleagues developing a new statistical theory for modern reinforcement learning.

We look forward to seeing you at ICML 2019 in Long Beach, California!