Towards a Theory for Sample-efficient Reinforcement Learning with Rich Observations

  • Alekh Agarwal

How can we tractably solve sequential decision making problems where the learning agent receives rich observations? We begin with a new model called Contextual Decision Processes (CDPs) for studying such problems, and show that it encompasses several prior setups to study RL such as MDPs and POMDPs. Several special cases of CDPs are, however, known to be provably intractable in their sample complexities. To overcome this challenge, we further propose a structural property of such processes, called the Bellman Rank. We find that the Bellman Rank of a CDP (and an associated class of functions) provides an intuitive measure of the hardness of a problem in terms of sample complexity and is small in several practical settings. In particular, we propose an algorithm, whose sample complexity scales with the Bellman Rank of the process, and is completely independent of the size of the observation space of the agent. We also show that our techniques are robust to our modeling assumptions, and make connections to several known results as well as highlight novel consequences of our results.

Finally, we also discuss the computational difficulties which arise in operationalizing the resulting methods and directions for further progress.

This talk is based on joint works with Christoph Dann Nan Jiang, Akshay Krishnamurthy, John Langford and Rob Schapire and the longer articles can be found in the below links.

Speaker Details

I am currently a researcher in the New York lab of Microsoft Research, where I have also spent two wonderful years as a postdoc. Prior to that, I obtained my PhD in Computer Science from UC Berkeley, working with Peter Bartlett and Martin Wainwright. Alekh is also is teaching a class at Columbia.