Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

  • Wanyi Chen,
  • Xiao Yang,
  • Xu Yang,
  • Tianming Sha,
  • Qizheng Li,
  • Zhuo Wang,
  • Bowen Xian,
  • Fang Kong,
  • Jiang Bian

arXiv

We introduce Agent^2 RL-Bench, a benchmark for evaluating agentic RL post-training — whether LLM agents can autonomously design, implement, and run complete RL pipelines that improve foundation models. This capability is important because RL post-training increasingly drives model alignment and specialization, yet existing benchmarks remain largely static: supervised fine-tuning alone yields strong results, leaving interactive RL engineering untested. Agent^2 RL-Bench addresses this with six tasks across three levels — from static rule-based training to closed-loop online RL with trajectory collection — each adding a structural requirement that prior levels do not impose. The benchmark provides isolated workspaces with a grading API, runtime instrumentation that records every submission and code revision, and automated post-hoc analysis that generates structured run reports, enabling the first automated diagnostic of agent-driven post-training behavior. Across multiple agent stacks spanning five agent systems and six driver LLMs, we find that agents achieve striking interactive gains — on ALFWorld, an RL-only agent improves from 5.97 to 93.28 via SFT warm-up and GRPO with online rollouts — yet make only marginal progress on others (DeepSearchQA: +2.75 within evaluation noise), and that driver choice has a large effect on interactive tasks — within the same scaffold, switching drivers changes interactive improvement from near-zero to +78pp. More broadly, the benchmark reveals that supervised pipelines dominate agent-driven post-training under fixed budgets, with online RL succeeding as the final best route only on ALFWorld. Code is available at https://github.com/microsoft/RD-Agent/tree/main/rdagent/scenarios/rl/autorl_bench.
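The abstract describes isolated workspaces exposing a grading API, with runtime instrumentation that records every submission. A minimal sketch of what such an interaction loop might look like is below; all names (`GradingClient`, `submit`, `report`) are illustrative assumptions, not the benchmark's actual API, and the scores are the ALFWorld numbers quoted above.

```python
# Hypothetical sketch of an agent submitting checkpoints to a per-workspace
# grading endpoint. Names and structure are assumptions for illustration only.

class GradingClient:
    """Toy stand-in for a workspace grading API with submission logging."""

    def __init__(self, baseline_score: float):
        self.baseline = baseline_score
        # Runtime instrumentation: every submission is recorded for post-hoc analysis.
        self.submissions: list[dict] = []

    def submit(self, checkpoint_id: str, score: float) -> dict:
        """Grade one checkpoint and log the result."""
        entry = {
            "checkpoint": checkpoint_id,
            "score": score,
            "gain": score - self.baseline,
        }
        self.submissions.append(entry)
        return entry

    def report(self) -> dict:
        """Structured run report: submission count and best checkpoint so far."""
        best = max(self.submissions, key=lambda e: e["score"])
        return {"n_submissions": len(self.submissions), "best": best}


# Example run mirroring the ALFWorld result from the abstract (5.97 -> 93.28):
client = GradingClient(baseline_score=5.97)
client.submit("sft_warmup", 41.0)          # intermediate score is hypothetical
client.submit("grpo_online_rollouts", 93.28)
print(client.report()["best"]["checkpoint"])  # prints grpo_online_rollouts
```

The point of the sketch is the logging design: because every `submit` call is recorded, the post-hoc analyzer can reconstruct the full submission trajectory rather than only the final score.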

Related Tools

RD-Agent

Research and development (R&D) is crucial for enhancing industrial productivity, especially in the AI era, where R&D centers largely on data and models. We are committed to automating these high-value, generic R&D processes through our open-source R&D automation tool RD-Agent, which lets AI drive data-driven AI.