{"id":852783,"date":"2022-06-16T08:05:33","date_gmt":"2022-06-16T15:05:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=852783"},"modified":"2022-06-16T13:36:16","modified_gmt":"2022-06-16T20:36:16","slug":"offlinerltheory","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/offlinerltheory\/","title":{"rendered":"Theoretical foundations for Offline Reinforcement Learning"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background bg-gray-200 has-background- card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 align-self-center\">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 id=\"theoretical-foundations-for-offline-reinforcement-learning\">Theoretical foundations for Offline Reinforcement Learning<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">MSR contributions to the theoretical foundations of Offline RL<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p class=\"wp-block-paragraph\">Globally, MSR has made recent advances in the statistical foundations of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/offlinerl\">Offline RL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, where a central question is to understand what representational conditions (involving the function approximator) and coverage 
conditions (involving the data distribution) enable sample-efficient offline RL in large state spaces. Other theoretical questions about specific algorithms have also been addressed:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>A natural starting point is to consider linear function approximation, where <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/a-sharp-characterization-of-linear-estimators-for-offline-policy-evaluation\/\">we have developed<\/a> a precise understanding of how these conditions affect the performance of traditional linear methods like the least squares temporal difference (LSTD) method.<\/li><li>Beyond this, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/offline-reinforcement-learning-fundamental-barriers-for-value-function-approximation\/\">we have established<\/a> a trade-off between these conditions in a strict sense: Offline RL is possible with (comparatively) strong representation conditions and (comparatively) weak coverage conditions, or vice versa, but it is not possible under weak versions of both conditions. These results raise intriguing questions about Offline RL frameworks or settings that allow for significant weakening of these conditions.<\/li><li>We provide a theoretical and empirical analysis of the reward-conditioned supervised learning approach, with tight performance bounds that expose its failure modes (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2206.01079\">in review<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/li><li>Any non-Markovian<sup>#<\/sup>&nbsp;policy admits an occupancy measure (i.e., a distribution of transition samples) that can be reproduced with a Markovian policy. 
This <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/non-markovian-policies-occupancy-measures\/\">theoretical result<\/a> matters for Offline RL because many algorithms (such as the SPIBB family) estimate the baseline policy under the assumption that it is Markovian, which is generally not the case in practice. Our result shows that this method is well-founded even when the behavioral policy is not Markovian (in review).<\/li><\/ul>\n\n\n","protected":false},"excerpt":{"rendered":"<p>MSR contributions in the space of theoretical foundation for Offline RL Globally, MSR has made some recent advances in the space of the statistical foundations of Offline RL (opens in new tab), where a central question is to understand what representational conditions (involving the function approximator) and coverage conditions (involving the data distribution) enable sample [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-852783","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[888921],"related-downloads":[],"related-videos":[],"related-groups":[896463,1148823],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"related-researchers":[],"msr_research_lab":[437514,1148609],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852783","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\
/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852783\/revisions"}],"predecessor-version":[{"id":853020,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/852783\/revisions\/853020"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=852783"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=852783"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=852783"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=852783"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=852783"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}