Abstract

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large or even continuous Markov decision processes (MDPs), where compact function approximation must be used. In this paper, we provide a provably efficient, model-free RL algorithm for finite-horizon problems with linear value-function approximation that addresses the exploration-exploitation trade-off in a principled way. The key element of this algorithm is the use of a hypothesized online linear-regression algorithm in the recently proposed KWIK framework. We show that, if the sample complexity of the KWIK online linear-regression algorithm is polynomial, then the sample complexity of exploration of the RL algorithm is also polynomial. This connection provides a promising approach to efficient RL with function approximation by studying a simpler learning problem.