Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms such as their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an inﬂuential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems.
An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploring the problem that may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and sciences. For example, a Backgammon program strives to take good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to discover how the opponent reacts in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good tradeoﬀ between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies with a small sample complexity), but a general theoretical analysis of this tradeoﬀ remained open until recently.
In this dissertation, we introduce a novel computational learning model called KWIK (Knows What It Knows) that is designed particularly for its utility in analyzing learning problems like RL where active exploration can impact the training data the learner is exposed to. My thesis is that the KWIK learning model provides a ﬂexible, modularized, and unifying way for creating and analyzing reinforcement-learning algorithms with provably eﬃcient exploration. In particular, we show how the KWIK perspective can be used to unify the analysis of existing RL algorithms with polynomial sample complexity. It also facilitates the development of new algorithms with smaller sample complexity, which have demonstrated empirically faster learning speed in real-world problems. Furthermore, we provide an improved, matching sample complexity lower bound, which suggests the optimality (in a sense) of one of the KWIK-based algorithms known as delayed Q-learning.