FlowRL: Matching Reward Distributions for LLM Reasoning
- Xuekai Zhu,
- Daixuan Cheng,
- Dinghuai Zhang,
- Hengli Li,
- Kaiyan Zhang,
- Che Jiang,
- Youbang Sun,
- Ermo Hua,
- Yuxin Zuo,
- Xingtai Lv,
- Qizheng Zhang,
- Lin Chen,
- Fanghao Shao,
- Bo Xue,
- Yunchong Song,
- Zhenjie Yang,
- Ganqu Cui,
- Ning Ding,
- Jianfeng Gao,
- Xiaodong Liu,
- Bowen Zhou,
- Hongyuan Mei,
- Zhouhan Lin
ICLR 2026
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward-distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
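As a rough sketch of the objective described above (the notation here is ours, not taken from the paper), the scalar reward $r(x, y)$ is lifted to a target distribution via a learnable partition function $Z_\phi(x)$, and the policy $\pi_\theta$ is trained to match it under the reverse KL divergence:

$$
p_\phi(y \mid x) \;=\; \frac{\exp\!\big(\beta\, r(x, y)\big)}{Z_\phi(x)},
\qquad
\min_\theta\; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\|\, p_\phi(\cdot \mid x)\big)
\;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\log \pi_\theta(y \mid x) - \beta\, r(x, y)\big] + \log Z_\phi(x).
$$

Up to the constant $\log Z_\phi(x)$, minimizing this reverse KL is equivalent to maximizing $\beta\,\mathbb{E}_{\pi_\theta}[r] + \mathcal{H}(\pi_\theta)$, i.e., reward maximization plus an entropy term, which is one way to see why distribution matching keeps probability mass on less frequent but valid reasoning paths rather than collapsing onto the dominant reward mode.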