Safe Policy Improvement with Baseline Bootstrapping

Romain Laroche; Paul Trichelair


European Workshop for Reinforcement Learning (EWRL) | October 2018


In this paper, we consider the Batch Reinforcement Learning task and adopt the safe policy improvement (SPI) approach: we compute a target policy guaranteed to perform at least as well as a given baseline policy, approximately and with high probability. Our SPI strategy, inspired by the knows-what-it-knows paradigm, consists of bootstrapping the target with the baseline when the target does not know. We develop a policy-based, computationally efficient bootstrapping algorithm, accompanied by theoretical SPI bounds for the tabular case. We empirically show the limits of the existing algorithms on a small stochastic gridworld problem, and then demonstrate that our algorithm improves not only the worst-case scenario but also the mean performance.
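
The bootstrapping idea in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: it assumes that "the target does not know" means a state-action pair appears fewer than a threshold number of times in the batch, and the names bootstrapped_greedy_step, counts, and n_wedge are hypothetical. Under that assumption, the target copies the baseline's probabilities on rarely observed pairs and moves the remaining probability mass to the best well-observed action.

```python
import numpy as np

def bootstrapped_greedy_step(q, pi_b, counts, n_wedge):
    """One greedy improvement step that defers to the baseline on unknown pairs.

    q, pi_b, counts: arrays of shape (n_states, n_actions).
    n_wedge: hypothetical count threshold (an assumption of this sketch);
             pairs seen fewer than n_wedge times are treated as unknown.
    """
    q = np.asarray(q, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    uncertain = np.asarray(counts) < n_wedge

    # On unknown pairs, copy the baseline probability; elsewhere start at 0.
    pi = np.where(uncertain, pi_b, 0.0)

    # Probability mass not reserved by the baseline in each state.
    free_mass = 1.0 - pi.sum(axis=1)

    # Among well-observed actions, pick the one with the highest value.
    masked_q = np.where(uncertain, -np.inf, q)
    best = masked_q.argmax(axis=1)

    # States where every action is unknown keep the baseline unchanged.
    rows = np.flatnonzero(~uncertain.all(axis=1))
    pi[rows, best[rows]] += free_mass[rows]
    return pi

# Toy example: 2 states, 3 actions.
q = [[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]]
pi_b = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
counts = [[10, 1, 10], [0, 2, 1]]
print(bootstrapped_greedy_step(q, pi_b, counts, n_wedge=5))
```

In the first state, the second action is rarely observed, so it keeps its baseline probability of 0.3 even though its estimated value is highest, and the remaining 0.7 goes to the best well-observed action. In the second state, no action is well observed, so the target falls back to the baseline entirely.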

Related Tools

Implementation of Safe Policy Improvement with Baseline Bootstrapping

May 13, 2019

This project can be used to reproduce the finite-MDP experiments presented in the ICML 2019 paper Safe Policy Improvement with Baseline Bootstrapping, by Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. For the DQN implementation, please refer to the separate SPIBB-DQN git repository.


Implementation of SPIBB-DQN

May 13, 2019

This project can be used to reproduce the DQN implementation presented in the ICML 2019 paper Safe Policy Improvement with Baseline Bootstrapping, by Romain Laroche, Paul Trichelair, and Rémi Tachet des Combes. For the finite-MDP experiments, please refer to the separate git repository.
