FlowRL: Matching Reward Distributions for LLM Reasoning

  • Xuekai Zhu,
  • Daixuan Cheng,
  • Dinghuai Zhang,
  • Hengli Li,
  • Kaiyan Zhang,
  • Che Jiang,
  • Youbang Sun,
  • Ermo Hua,
  • Yuxin Zuo,
  • Xingtai Lv,
  • Qizheng Zhang,
  • Lin Chen,
  • Fanghao Shao,
  • Bo Xue,
  • Yunchong Song,
  • Zhenjie Yang,
  • Ganqu Cui,
  • Ning Ding,
  • Jianfeng Gao,
  • Xiaodong Liu,
  • Bowen Zhou,
  • Hongyuan Mei,
  • Zhouhan Lin

ICLR 2026 | Publication

We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
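To make the flow-balancing idea concrete, below is a minimal PyTorch sketch of a trajectory-level squared flow-balance objective whose minimum corresponds to a policy proportional to the exponentiated reward, i.e., the kind of reward-distribution matching described above. This is an illustrative sketch under our own assumptions (the symbol names `beta` and `log_z`, the single-scalar partition head, and the batch setup are ours), not the paper's released implementation.

```python
# Illustrative sketch (not the authors' code): a trajectory-level
# flow-balance objective. We assume the policy assigns a summed
# log-probability log_pi(y|x) to each sampled reasoning trajectory y,
# a scalar reward r(x, y) is available, and a learnable log-partition
# estimate logZ(x) normalizes exp(beta * r) into a target distribution.
# Minimizing the squared residual
#     (logZ(x) + log_pi(y|x) - beta * r(x, y))^2
# pushes the policy toward pi(y|x) proportional to exp(beta * r(x, y)).
import torch
import torch.nn as nn


class LogPartition(nn.Module):
    """Learnable estimate of log Z(x); a single scalar here for simplicity."""

    def __init__(self):
        super().__init__()
        self.log_z = nn.Parameter(torch.zeros(1))

    def forward(self, batch_size: int) -> torch.Tensor:
        # Broadcast the scalar estimate across the batch of prompts.
        return self.log_z.expand(batch_size)


def flow_balance_loss(log_pi: torch.Tensor,
                      rewards: torch.Tensor,
                      log_z: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Squared flow-balance residual averaged over the batch.

    log_pi:  (B,) summed token log-probs of each sampled trajectory
    rewards: (B,) scalar rewards for the trajectories
    log_z:   (B,) learnable log-partition estimates
    """
    residual = log_z + log_pi - beta * rewards
    return (residual ** 2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 4
    log_z_head = LogPartition()
    # Stand-ins for policy log-probs and rewards of sampled trajectories.
    log_pi = torch.randn(batch, requires_grad=True)
    rewards = torch.rand(batch)
    loss = flow_balance_loss(log_pi, rewards, log_z_head(batch), beta=2.0)
    loss.backward()  # gradients flow to both the policy log-probs and log Z
    print(f"loss = {loss.item():.4f}")
```

In practice the policy log-probability would come from the LLM's token-level log-probs summed over the sampled reasoning trajectory, and the partition estimate would typically condition on the prompt rather than being a single global scalar; the sketch only shows the shape of the objective.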