Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks


Published on arXiv

RL training of LLMs on open-ended tasks is challenging due to the lack of direct verifiability. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout-group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) reasoning prefix, selectively emphasizing reasoning-reflective tokens to capture how likely the generated reasoning is to yield the desired answer. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, learns faster and more sample-efficiently, and respects the feasibility constraints.
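
The sketch below is not the paper's implementation; it only illustrates the two ingredients the abstract describes, under assumed interfaces. `answer_logprobs` would come from teacher-forced scoring of the reference answer tokens conditioned on the model's generated CoT prefix, `reflective_mask` marks tokens treated as reasoning-reflective, and the emphasis weight and rubric checks are hypothetical placeholders. The paper applies feasibility at the rollout-group level; for simplicity this shows a single-rollout gate.

```python
from typing import Callable, Sequence
import math


def r3_reward(
    answer_logprobs: Sequence[float],   # log p(answer_token_t | CoT prefix, answer_<t)
    reflective_mask: Sequence[bool],    # True where a token is reasoning-reflective
    emphasis: float = 2.0,              # hypothetical up-weight for reflective tokens
) -> float:
    """Token-level dense reward: certainty of the reference answer under the
    model's own reasoning prefix, with reasoning-reflective tokens emphasized."""
    weights = [emphasis if m else 1.0 for m in reflective_mask]
    # Weighted mean token probability, in [0, 1]; the actual aggregation in the
    # paper may differ.
    return sum(w * math.exp(lp) for w, lp in zip(weights, answer_logprobs)) / sum(weights)


def rubric_gate(final_answer: str, checks: Sequence[Callable[[str], bool]]) -> bool:
    """Feasibility constraint: accept a rollout only if every rubric check passes."""
    return all(check(final_answer) for check in checks)


# Toy usage with two illustrative rubric checks and per-token scores for a short answer.
checks = [lambda a: len(a) > 0, lambda a: a.strip().endswith(".")]
logprobs = [-0.1, -0.3, -1.2]           # e.g. from teacher-forced scoring
mask = [False, True, True]
rollout_answer = "The answer is 42."

if rubric_gate(rollout_answer, checks):          # hard accept/reject on the final answer
    print(f"R3 reward: {r3_reward(logprobs, mask):.3f}")
else:
    print("Rollout rejected by rubric gate (infeasible).")
```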