Abstract

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs) where compact function approximation has to be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The crucial element of this algorithm is a reduction of RL to online regression in the recently proposed KWIK learning model. We show that, if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. Therefore, the reduction suggests a new and sound direction to tackle general RL problems. The efficiency of our algorithm is verified on a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.