MetroRLHF: Enabling Memory-Effective Training for On-Policy RLHF via Adaptive Sequence Streaming
NeurIPS 2025, The First Workshop on Efficient Reasoning
Reinforcement learning from human feedback (RLHF) has become the
standard post-training technique for endowing large language models (LLMs)
with helpful, harmless, and intent-consistent behavior. In practice, however, its
adoption is hampered by prohibitive memory consumption during the policy-model update phase, especially when training on long-form generation tasks.
In this paper, we propose MetroRLHF, a memory-efficient, on-policy RLHF approach that exploits inference-time computations to reduce the training-time memory budget and to skip unnecessary work. By reusing the K,V context materialized during the inference phase, MetroRLHF removes, at no extra cost, the inter-token dependencies that normally force the entire sequence to be trained in parallel. Building on fine-grained subsequence streaming, RLHF can then train on the productive tokens efficiently. This yields a training pipeline that matches the exact behavior of conventional full-sequence RLHF while using less memory and incurring no arithmetic recomputation. Experiments on the Qwen-3 models demonstrate that MetroRLHF's rescheduled algorithm reduces peak training memory usage to 1/3.8 ∼ 1/5.9 of the baseline, enabling not only memory-efficient but also semantically reliable fine-tuning of LLMs.
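As a rough illustration of the mechanism described above, and not the paper's actual implementation, the minimal PyTorch sketch below shows one way a streamed chunk of policy tokens could attend to a detached K,V cache reused from the inference phase, so that only that chunk's activations are live during the backward pass. The function name stream_chunk_attention, the chunk size, and the tensor shapes are illustrative assumptions.

import torch

def stream_chunk_attention(q_chunk, k_cache, v_cache, chunk_start):
    """Causal attention for one streamed chunk of query tokens.

    q_chunk : (B, H, C, D) queries for tokens [chunk_start, chunk_start + C)
    k_cache : (B, H, T, D) keys materialized during inference (no grad)
    v_cache : (B, H, T, D) values materialized during inference (no grad)
    """
    C, D = q_chunk.shape[-2], q_chunk.shape[-1]
    T = k_cache.size(-2)
    # Global causal mask: chunk token i may attend to cache positions <= chunk_start + i.
    pos_q = torch.arange(chunk_start, chunk_start + C, device=q_chunk.device).unsqueeze(-1)
    pos_k = torch.arange(T, device=q_chunk.device).unsqueeze(0)
    allowed = pos_k <= pos_q                                   # (C, T)
    scores = q_chunk @ k_cache.transpose(-2, -1) / D ** 0.5    # (B, H, C, T)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return scores.softmax(dim=-1) @ v_cache                    # (B, H, C, D)

# Toy usage: stream the sequence in chunks instead of one full-sequence pass.
B, H, T, D, chunk = 1, 4, 64, 32, 16
k_cache = torch.randn(B, H, T, D)   # reused from generation, gradients not required
v_cache = torch.randn(B, H, T, D)
for start in range(0, T, chunk):
    # Stands in for the chunk's hidden states produced by the trainable policy.
    q_chunk = torch.randn(B, H, chunk, D, requires_grad=True)
    out = stream_chunk_attention(q_chunk, k_cache, v_cache, start)
    out.sum().backward()            # only this chunk's activation graph is alive

Because the cached keys and values carry no gradient, each backward pass touches only chunk-local activations, which mirrors the memory-saving argument made in the abstract under these assumptions.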