Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
- Yifei Xu,
- Tusher Chakraborty,
- Srinagesh Sharma,
- Leonardo Nunes,
- Swati Sharma,
- Kate Drakos Demopulos,
- Emre Kiciman,
- Songwu Lu,
- Ranveer Chandra
arXiv preprint
RL training of LLMs on open-ended tasks is challenging due to the lack of direct verifiability. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric gates as feasibility constraints at the rollout group level. R3 measures the model’s token-level certainty of a reference answer under its chain-of-thought (CoT) reasoning prefix, selectively emphasizing reasoning-reflective tokens to capture how likely the generated reasoning is to yield the desired answer. Rubric gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, learns faster and more sample-efficiently, and respects the feasibility constraints.
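To make the two ingredients concrete, below is a minimal sketch of how a token-level dense reward in the spirit of R3 and a group-level rubric gate could be combined. The specific weighting scheme (contrasting answer-token log-probabilities with and without the CoT prefix to locate reasoning-reflective tokens), the zero-reward treatment of gated-out rollouts, and all helper names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an R3-style dense reward plus a rubric gate over a rollout
# group. Inputs are per-token log-probabilities of the reference answer,
# conditioned on the prompt with and without the generated CoT prefix.
from typing import Callable, Dict, List
import math


def r3_reward(
    answer_logprobs_with_cot: List[float],  # log p(answer token | prompt + CoT prefix)
    answer_logprobs_no_cot: List[float],    # log p(answer token | prompt only)
) -> float:
    """Certainty of the reference answer under the CoT prefix, re-weighted
    toward tokens whose certainty the reasoning actually changes (assumed
    proxy for "reasoning-reflective" tokens)."""
    assert len(answer_logprobs_with_cot) == len(answer_logprobs_no_cot)
    # How much the reasoning shifted each answer token's log-probability.
    shifts = [abs(a - b) for a, b in zip(answer_logprobs_with_cot, answer_logprobs_no_cot)]
    total_shift = sum(shifts)
    if total_shift > 0:
        weights = [s / total_shift for s in shifts]
    else:
        weights = [1.0 / len(shifts)] * len(shifts)  # fall back to uniform weighting
    # Weighted average probability of the reference answer tokens.
    return sum(w * math.exp(lp) for w, lp in zip(weights, answer_logprobs_with_cot))


def gated_group_rewards(
    rollouts: List[Dict],                 # each: {"with_cot": [...], "no_cot": [...], "answer": str}
    rubric_ok: Callable[[str], bool],     # hard accept/reject check on the final answer
) -> List[float]:
    """Rubric gate as a feasibility constraint: rollouts whose final answer
    fails the rubric receive zero reward; the rest keep their dense reward."""
    return [
        r3_reward(r["with_cot"], r["no_cot"]) if rubric_ok(r["answer"]) else 0.0
        for r in rollouts
    ]
```

In this sketch the gate simply zeroes out infeasible rollouts within a group; a constrained-RL formulation could instead treat the rubric pass rate as an explicit constraint on the policy update.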