## Finding the best learning targets automatically: Fully Parameterized Quantile Function for distributional RL

December 18, 2019 | By Li Zhao, Senior Researcher

Reinforcement learning has achieved great success in game scenarios, with RL agents beating human competitors in such games as Go and poker. Distributional reinforcement learning, in particular, has proven to be an effective approach for training an agent to maximize reward, producing state-of-the-art results on Atari games, which are widely used as benchmarks for testing RL algorithms.

Because game environments are intrinsically random (with the roll of the dice in Monopoly, for example, you can lose $2,000 by landing on someone else’s property or gain $200 by passing “Go”), the return an agent receives is a random variable. Unlike traditional reinforcement learning, distributional RL takes that intrinsic randomness into account, modeling not just the expectation of the return, or Q-value, but the *distribution* of the return. This makes much more efficient use of the training data. Since the return distribution is generally continuous and would take infinitely many parameters to represent exactly, how to parameterize it effectively has become an important and ongoing area of research in the field.

A series of distributional RL algorithms developed over the last couple of years has achieved state-of-the-art performance by parameterizing either the probability side or the value side and then uniformly fixing or randomly sampling the other. My co-authors and I went one step further. In our paper “Fully Parameterized Quantile Function for Distributional Reinforcement Learning,” which was presented at the 33rd Conference on Neural Information Processing Systems (NeurIPS), we propose Fully Parameterized Quantile Function (FQF), a distributional RL algorithm in which both the probability and value sides are parameterized and jointly trained to better characterize the distribution of return. I developed this work with co-authors Derek Yang and Zichuan Lin, who were Microsoft Research interns at the time, Senior Principal Research Manager Tao Qin, Principal Research Manager Jiang Bian, and Assistant Managing Director Tie-Yan Liu.

### Capturing distribution of return

In distributional reinforcement learning, the distribution of return can be represented as a probability density function (PDF), a cumulative distribution function (CDF), or the inverse of the CDF, that is, the quantile function. Each captures the mapping between probability and return value. We parameterize these functions by selecting several points on the function and representing the original function with a histogram or staircase function, which keeps the computation efficient.
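As a rough illustration of the staircase idea (a sketch for this post, not code from the paper), the snippet below represents a quantile function by a handful of fractions and one value per step. The function names and the example numbers are made up for illustration.

```python
import numpy as np

def staircase_quantile(taus, quantile_values, omega):
    """Evaluate a staircase approximation of a quantile function.

    taus: increasing fractions with taus[0] = 0 and taus[-1] = 1 (length N + 1).
    quantile_values: one return value per interval [taus[i], taus[i+1]) (length N).
    omega: probabilities in [0, 1] at which to evaluate the staircase.
    """
    taus = np.asarray(taus, dtype=float)
    quantile_values = np.asarray(quantile_values, dtype=float)
    omega = np.asarray(omega, dtype=float)
    # Find the interval that contains each omega; clip so omega == 1 maps to the last step.
    idx = np.searchsorted(taus, omega, side="right") - 1
    idx = np.clip(idx, 0, len(quantile_values) - 1)
    return quantile_values[idx]

def expected_return(taus, quantile_values):
    """Mean of the staircase distribution: sum of step width times step height."""
    widths = np.diff(np.asarray(taus, dtype=float))
    return float(np.sum(widths * np.asarray(quantile_values, dtype=float)))

# Hypothetical example: three steps with unequal widths.
taus = [0.0, 0.25, 0.5, 1.0]
values = [-1.0, 0.0, 2.0]
print(staircase_quantile(taus, values, [0.1, 0.25, 0.9, 1.0]))
print(expected_return(taus, values))  # 0.25 * (-1) + 0.25 * 0 + 0.5 * 2 = 0.75
```

With this representation, the Q-value is just the width-weighted sum of the step heights, so a few numbers per action are enough to recover both the distribution and its expectation.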

FQF was inspired by the existing distributional algorithms C51, QR-DQN, and IQN. Using the probability density function, C51 parameterizes only the probability side, leaving the value side as uniformly fixed classes. With QR-DQN, researchers turned to the quantile function to capture the distribution and parameterized only the value side, or the *quantile value*, uniformly fixing the probability side, or the *quantile fraction*. QR-DQN surpassed C51 in performance. IQN, which builds on QR-DQN, was developed to characterize the random variable more fully than fixed learning targets allow: it approximates the distribution at sampled quantile fractions rather than fixed ones and learns a mapping from the sampled quantile fractions to quantile values. However, the sampled quantile fractions aren’t necessarily the best quantile fractions.
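QR-DQN and IQN fit their quantile values by quantile regression. The sketch below shows the plain quantile regression (pinball) loss on synthetic data. It is a toy illustration, not the loss code of any of these algorithms (in particular, it omits the Huber smoothing that practical implementations often add), and the sample data and candidate grid are assumptions made for the example.

```python
import numpy as np

def pinball_loss(q, samples, tau):
    """Quantile regression loss of the estimate q at quantile fraction tau."""
    u = samples - q
    # Under-estimates are weighted by tau, over-estimates by (1 - tau).
    return float(np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u)))

# Synthetic "returns" drawn from a standard normal, purely for illustration.
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=1.0, size=20_000)

# Brute-force search over candidate quantile values for a few fixed fractions.
candidates = np.linspace(-3.0, 3.0, 601)
best = {}
for tau in (0.1, 0.5, 0.9):
    losses = [pinball_loss(q, returns, tau) for q in candidates]
    best[tau] = float(candidates[int(np.argmin(losses))])
    print(tau, best[tau], float(np.quantile(returns, tau)))
```

The value that minimizes the loss for a given fraction lands next to the empirical quantile at that fraction, which is why the choice of fractions determines what the network is asked to learn.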

As shown in Figure 1, two sets of sampled quantile fractions can lead to quite different approximation errors. This motivated us to find the learning targets—that is, the quantile fractions—that would result in the least approximation error.
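The effect is easy to reproduce numerically. The sketch below compares two sets of fractions on a made-up quantile function, measuring the 1-Wasserstein error between the function and its staircase approximation by numerical integration. The quantile function `true_quantile`, the two fraction sets, and the choice of taking each step height at the midpoint of its interval are assumptions for this example.

```python
import numpy as np

def true_quantile(omega):
    # Hypothetical quantile function, chosen only for illustration:
    # most of the variation in return sits near probability 1.
    return omega ** 4

def w1_error(taus, quantile_fn, grid_size=100_000):
    """Numerically integrate |F^-1(omega) - staircase(omega)| over [0, 1]."""
    taus = np.asarray(taus, dtype=float)
    # Midpoint rule: grid points sit at the centers of equal-width cells.
    omega = (np.arange(grid_size) + 0.5) / grid_size
    # Step heights are the quantile values at the midpoint of each interval.
    mids = 0.5 * (taus[:-1] + taus[1:])
    heights = quantile_fn(mids)
    idx = np.clip(np.searchsorted(taus, omega, side="right") - 1, 0, len(heights) - 1)
    return float(np.mean(np.abs(quantile_fn(omega) - heights[idx])))

uniform = np.linspace(0.0, 1.0, 5)               # evenly spaced fractions
skewed = np.array([0.0, 0.5, 0.75, 0.9, 1.0])    # more fractions near 1

print("uniform:", w1_error(uniform, true_quantile))
print("skewed: ", w1_error(skewed, true_quantile))
```

With the same number of steps, the fractions that crowd into the region where the quantile function changes quickly give a noticeably smaller error for this particular function.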

### Predicting good learning targets

Our method consists of two jointly trained networks: a fraction proposal network and a quantile function network.

We propose predicting the best quantile fractions for each state using the fraction proposal network, which is trained to minimize the approximation error, that is, the 1-Wasserstein distance shown in Figure 1. Generally speaking, the 1-Wasserstein distance can’t be computed accurately, but we derive its gradient with respect to the quantile fractions, which lets us minimize the distance without actually computing it. The quantile function network is then trained, using quantile regression and a distributional Bellman update as in IQN, to approximate the quantile values at the fractions generated by the fraction proposal network. Since we parameterize both the quantile fractions and the quantile values, we call our method Fully Parameterized Quantile Function.
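To give a feel for how the fractions can be improved without ever evaluating the distance, here is a toy sketch. When each step height is the quantile value at the midpoint of its interval, the derivative of the 1-Wasserstein error with respect to an interior fraction tau_i works out to 2 F^-1(tau_i) - F^-1(mid_i) - F^-1(mid_{i-1}), where mid_i is the midpoint of [tau_i, tau_{i+1}]. The snippet applies that expression to a known, made-up quantile function. In the actual algorithm the true quantile function is unknown and the quantile function network has to stand in for it, and the fractions come from a network rather than from the plain gradient steps used here, so treat this as an illustration under those simplifying assumptions.

```python
import numpy as np

def true_quantile(omega):
    # Hypothetical quantile function, chosen only for illustration.
    return omega ** 4

def w1_fraction_gradient(taus, quantile_fn):
    """Gradient of the 1-Wasserstein error w.r.t. the interior fractions.

    For each interior tau_i: 2 * F^-1(tau_i) - F^-1(mid_i) - F^-1(mid_{i-1}),
    where mid_i is the midpoint of [tau_i, tau_{i+1}].
    """
    taus = np.asarray(taus, dtype=float)
    mids = 0.5 * (taus[:-1] + taus[1:])
    return 2.0 * quantile_fn(taus[1:-1]) - quantile_fn(mids[1:]) - quantile_fn(mids[:-1])

def descend_fractions(taus, quantile_fn, lr=0.1, steps=2000):
    """Plain gradient descent on the interior fractions (endpoints stay at 0 and 1)."""
    taus = np.array(taus, dtype=float)
    for _ in range(steps):
        taus[1:-1] -= lr * w1_fraction_gradient(taus, quantile_fn)
        # Keep the fractions inside [0, 1]; a simplification for this sketch.
        taus[1:-1] = np.clip(taus[1:-1], 0.0, 1.0)
    return taus

start = np.linspace(0.0, 1.0, 5)
learned = descend_fractions(start, true_quantile)
print("start:  ", start)
print("learned:", learned)
```

For this quantile function the fractions drift toward 1, where the function is steepest. The distance itself is never computed along the way.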

### State dependence of the fractions

We tested FQF on 55 Atari 2600 games within the Arcade Learning Environment (ALE) framework. According to our results, the generated fractions depend strongly on the state and change as the state changes. This leads us into an interesting area of reinforcement learning where we focus not only on learning the values of target attributes, but also on finding the attributes that are worth learning.

In summary, we propose adjusting the quantile fractions automatically to better approximate the quantile function, leading to a novel parameterization method in distributional reinforcement learning and achieving new state-of-the-art performance on Atari games for non-distributed model-free algorithms. Finding the best learning targets automatically is not limited to RL; it is widely applicable to machine learning research. We believe the idea can motivate the broader machine learning community.