Safe Policy Improvement with Baseline Bootstrapping

In this umbrella project, we investigate a class of conservative Offline RL algorithms that use uncertainty estimators to decide whether they can trust their predictions enough to optimize the policy, or whether they should instead reproduce the policy that was used to collect the dataset.

This umbrella project focuses on Offline RL algorithmic development in the space of conservative algorithms, i.e. algorithms that constrain the set of candidate policies to remain close to the behavioral policy (also called the baseline). Our algorithmic contributions to the field have focused on the SPIBB* algorithmic family, which offers guarantees on the policy improvement achieved by the trained policy relative to the behavioral policy (see blog post):

  • SPIBB [ICML’19] [RL1] provides the seminal algorithm with a theoretical and empirical analysis in the tabular MDP setting. Its motto: Allow change only when you have sufficient evidence that it is for the better (see the sketch after this list).
  • Soft-SPIBB [ECML’20] provides a softening of the SPIBB algorithm with a theoretical and empirical analysis in the tabular MDP setting. Its motto: Allow change for the better proportionally to the amount of evidence.
  • Estimated-baseline SPIBB [AAMAS’20] proves that both SPIBB and Soft-SPIBB keep their guarantees when the behavioral policy is not given, but learnt with behavioral cloning.
  • Multi-Objective SPIBB [NeurIPS’21] provides an algorithmic, theoretical, and empirical extension to the setting of Multi-Objective RL.
  • Deep SPIBB [RLDM’19 + in review] algorithmically and empirically studies the adaptation of SPIBB algorithms to complex tasks requiring function approximation.
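
To make the bootstrapping rule behind SPIBB concrete, here is a minimal sketch of one greedy policy-improvement step in the tabular setting: state-action pairs observed fewer than N∧ times in the dataset keep the baseline's probabilities, while the baseline's remaining probability mass is moved greedily onto the best-looking well-estimated action. The function name, array layout, and the tiny usage example are illustrative assumptions, not code from the project.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    """One SPIBB-style greedy policy-improvement step (illustrative sketch).

    q       : (n_states, n_actions) action-value estimates on the data-fitted MDP
    pi_b    : (n_states, n_actions) behavioral (baseline) policy
    counts  : (n_states, n_actions) state-action visit counts in the dataset
    n_wedge : count threshold below which a pair is considered uncertain
    """
    pi = np.zeros_like(pi_b)
    for s in range(q.shape[0]):
        uncertain = counts[s] < n_wedge
        # Bootstrapped pairs: not enough evidence, so copy the baseline exactly.
        pi[s, uncertain] = pi_b[s, uncertain]
        # Well-estimated pairs: place all of the baseline's remaining mass
        # on the action with the highest estimated value among them.
        safe = ~uncertain
        if safe.any():
            best = np.flatnonzero(safe)[np.argmax(q[s, safe])]
            pi[s, best] += pi_b[s, safe].sum()
    return pi

# Tiny usage example with made-up numbers (2 states, 3 actions).
q = np.array([[1.0, 2.0, 0.5],
              [0.2, 0.1, 0.9]])
pi_b = np.full((2, 3), 1 / 3)
counts = np.array([[10, 2, 10],
                   [0, 1, 3]])
print(spibb_greedy_step(q, pi_b, counts, n_wedge=5))
```

In the example above, the second state has too few samples everywhere, so the baseline is reproduced unchanged there; in the first state, only the well-sampled actions compete for the improvable probability mass. This is the sense in which the trained policy can only deviate from the baseline where the dataset provides sufficient evidence.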