FlowRL: Matching Reward Distributions for LLM Reasoning
- Xuekai Zhu,
- Daixuan Cheng,
- Dinghuai Zhang,
- Hengli Li,
- Kaiyan Zhang,
- Che Jiang,
- Youbang Sun,
- Ermo Hua,
- Yuxin Zuo,
- Xingtai Lv,
- Qizheng Zhang,
- Lin Chen,
- Fanghao Shao,
- Bo Xue,
- Yunchong Song,
- Zhenjie Yang,
- Ganqu Cui,
- Ning Ding,
- Jianfeng Gao,
- Xiaodong Liu,
- Bowen Zhou,
- Hongyuan Mei,
- Zhouhan Lin
ICLR 2026
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward-distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
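As a rough sketch of the objective described above (the notation here is ours, not taken from the paper), the scalar reward $r(x, y)$ is lifted to a target distribution via a learnable partition function $Z_\phi(x)$, and the policy $\pi_\theta$ is trained to match it under the reverse KL divergence:

$$
p_\phi(y \mid x) \;=\; \frac{\exp\!\big(\beta\, r(x, y)\big)}{Z_\phi(x)},
\qquad
\min_\theta\; D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\|\, p_\phi(\cdot \mid x)\big)
\;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[\log \pi_\theta(y \mid x) - \beta\, r(x, y)\big] + \log Z_\phi(x).
$$

Up to the constant $\log Z_\phi(x)$, minimizing this reverse KL is equivalent to maximizing $\beta\,\mathbb{E}_{\pi_\theta}[r] + \mathcal{H}(\pi_\theta)$, i.e., reward maximization plus an entropy term, which is one way to see why distribution matching keeps probability mass on less frequent but valid reasoning paths rather than collapsing onto the dominant reward mode.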