This umbrella project focuses on Offline RL algorithmic development in the space of conservative algorithms, i.e. algorithms that constrain the set of candidate policies to remain close to the behavioral policy (also called the baseline). Our algorithmic contributions have focused more specifically on the SPIBB* algorithmic family, which offers guarantees on the policy improvement granted by the trained policy over the behavioral policy (see blog post):
- SPIBB [ICML’19] [RL1] provides the seminal algorithm, with a theoretical and empirical analysis in the tabular MDP setting. Its motto: Allow change only when you have sufficient evidence that it is for the better (see the sketch after this list).
- Soft-SPIBB [ECML’20] provides a softening of the SPIBB algorithm with a theoretical and empirical analysis in the tabular MDP setting. Its motto: Allow change for the better proportionally to the amount of evidence.
- Estimated-baseline SPIBB [AAMAS’20] proves that both SPIBB and Soft-SPIBB retain their guarantees when the behavioral policy is not given but is instead estimated with behavioral cloning.
- Multi-Objective SPIBB [NeurIPS’21] provides an algorithmic, theoretical, and empirical extension to the setting of Multi-Objective RL.
- Deep SPIBB [RLDM’19 + in review] algorithmically and empirically studies the adaptation of SPIBB algorithms to complex tasks requiring function approximation.
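To make the SPIBB motto concrete, here is a minimal Python sketch of a greedy tabular SPIBB policy-improvement step. It is an illustration rather than the published implementation: the array names, the count threshold `n_wedge`, and the Q-estimates are assumptions for the example. State-action pairs observed too rarely in the dataset keep the baseline's probabilities, and the probability mass of the sufficiently observed actions is reassigned greedily according to the estimated Q-values.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    """Illustrative greedy SPIBB projection step (tabular setting).

    q:       (S, A) action-value estimates from the dataset
    pi_b:    (S, A) behavioral (baseline) policy
    counts:  (S, A) occurrences of each state-action pair in the dataset
    n_wedge: count below which a pair is considered insufficiently observed
    """
    n_states, _ = q.shape
    pi = np.zeros_like(pi_b)
    # Insufficient evidence: the trained policy copies the baseline there.
    bootstrapped = counts < n_wedge
    for s in range(n_states):
        pi[s, bootstrapped[s]] = pi_b[s, bootstrapped[s]]
        free = ~bootstrapped[s]
        if free.any():
            # Reassign the remaining baseline mass greedily to the best
            # sufficiently observed action.
            free_mass = pi_b[s, free].sum()
            best = np.flatnonzero(free)[np.argmax(q[s, free])]
            pi[s, best] += free_mass
    return pi
```

Roughly speaking, the full algorithm interleaves this constrained projection with policy evaluation on the MDP model estimated from the dataset, while Soft-SPIBB relaxes the hard count threshold so that the allowed deviation from the baseline grows with the amount of evidence for each state-action pair.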