Agent Lightning: Adding reinforcement learning to AI agents without code rewrites



AI agents are reshaping software development, from writing code to carrying out complex instructions. Yet LLM-based agents are prone to errors and often perform poorly on complicated, multi-step tasks. Reinforcement learning (RL), an approach in which AI systems learn to make better decisions through trial and error by receiving rewards or penalties for their actions, can help agents improve, but it typically requires developers to extensively rewrite their code. This discourages adoption, even though the data these agents generate could significantly boost performance through RL training.

To address this, a research team from Microsoft Research Asia – Shanghai has introduced Agent Lightning. This open-source framework makes AI agents trainable through RL by separating how agents execute tasks from model training, allowing developers to add RL capabilities with virtually no code modification.

Capturing agent behavior for training

Agent Lightning converts an agent’s experience into a format that RL can use by treating the agent’s execution as a sequence of states and actions, where each state captures the agent’s status and each LLM call is an action that moves the agent to a new state.

This approach works for any workflow, no matter how complex. Whether it involves multiple collaborating agents or dynamic tool use, Agent Lightning breaks it down into a sequence of transitions. Each transition captures the LLM’s input, output, and reward (Figure 1). This standardized format means the data can be used for training without any additional steps.
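
To make the idea concrete, the sketch below models a transition and a trajectory as plain Python data classes with toy placeholder values; these are illustrative structures only, not Agent Lightning's actual classes.

```python
# Minimal sketch of the standardized format (illustrative only, not Agent
# Lightning's real data classes): each LLM call in a run becomes one
# transition holding its prompt, the model's output, and an immediate reward.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Transition:
    prompt: str        # full input handed to the LLM at this step
    generation: str    # the LLM's output, i.e., the action taken
    reward: float      # immediate reward credited to this call

@dataclass
class Trajectory:
    task_id: str
    transitions: List[Transition] = field(default_factory=list)

# A RAG run like the one in Figure 1 would collapse into two transitions:
# one for query generation, one for answer generation (toy placeholder values).
trajectory = Trajectory(
    task_id="rag-example",
    transitions=[
        Transition(prompt="<query-generation prompt>", generation="<search query>", reward=0.0),
        Transition(prompt="<answer prompt with passages>", generation="<final answer>", reward=1.0),
    ],
)
```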

Figure 1: Diagram illustrating Agent Lightning’s unified data interface for a retrieval-augmented generation (RAG) agent. On the left, four states (state₀ to state₃) show the agent’s execution flow, where semantic variables—UserInput, Query, Passages, and Answer—are updated after each component call (LLM or Search). Green blocks represent populated variables; gray blocks indicate empty ones. On the right, the unified data interface converts these transitions into a trajectory format containing prompt, generation, and immediate reward for RL training.
Figure 1. An illustration of Agent Lightning’s standardized format using a retrieval-augmented generation (RAG) agent. Left: The full agent workflow, where the agent’s state updates after each component step. The green blocks show assigned variables, and the gray blocks indicate variables without content. Right: The collected transitions are based on the standardized format for the RL training process, with each transition corresponding to one LLM step that contains its prompt, result, and immediate reward.

Hierarchical reinforcement learning

Traditional RL training for agents that make multiple LLM requests involves stitching all the generated content together into one long sequence and then marking which parts should be learned from and which ignored during training. This approach is difficult to implement and can create excessively long sequences that degrade model performance.

Instead, Agent Lightning’s LightningRL algorithm takes a hierarchical approach. After a task completes, a credit assignment module determines how much each LLM request contributed to the outcome and assigns it a corresponding reward. These independent steps, now paired with their own reward scores, can be used with any existing single-step RL algorithm, such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) (Figure 2).
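
As a rough illustration of the hierarchy, the sketch below spreads a task's final reward uniformly across its LLM calls and emits one single-step training sample per call; uniform credit assignment is an assumption made for brevity, not a description of LightningRL's actual credit assignment module.

```python
# Hedged sketch of hierarchical credit assignment (the uniform split is an
# illustrative assumption, not LightningRL's actual scheme). Each LLM call
# becomes an independent single-step sample usable by PPO, GRPO, etc.
from typing import Dict, List, Tuple

def assign_credit(calls: List[Tuple[str, str]], final_reward: float) -> List[Dict]:
    """calls: (prompt, generation) pairs from one multi-step run of a task."""
    per_call = final_reward / max(len(calls), 1)
    return [{"prompt": p, "generation": g, "reward": per_call} for p, g in calls]

# Samples from several runs of the same task can then be grouped and handed
# to an unmodified single-step RL algorithm.
samples = assign_credit(
    calls=[("<first LLM prompt>", "<first output>"),
           ("<second LLM prompt>", "<second output>")],
    final_reward=1.0,
)
```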

Figure 2: Comparison of three reinforcement learning approaches for LLM tasks. (a) Single-step GRPO: The model completes the task in one call, and multiple outputs for the same task are compared with associated rewards. (b) Previous multi-step GRPO: The task spans multiple LLM calls, forming trajectories; non-LLM tokens (gray boxes) are ignored during training, and entire multi-step runs are compared. (c) LightningRL: Breaks multi-step runs into individual LLM calls, each including input, context, output, and reward assigned by a credit assignment module. Calls from the same task are grouped for reinforcement.
Figure 2. (a) Single-step GRPO: The LLM completes the task in one call. Multiple responses for the same task are compared to determine how strongly each should be reinforced. (b) Previous multi-step GRPO: The task involves multiple LLM calls. Multiple multi-step runs of the same task are compared, with non-LLM generated tokens (grey boxes) ignored during training. (c) LightningRL: The multi-step run is divided into individual LLM calls. Calls from the same task are compared to determine how strongly each should be reinforced. Each call includes its input, context, output, and reward, assigned by the credit assignment module.

This design offers several benefits. It remains fully compatible with widely used single-step RL algorithms, allowing existing training methods to be applied without modification. Organizing data as a sequence of independent transitions lets developers flexibly construct the LLM input as needed, supporting complex behaviors like agents that use multiple tools or work with other agents. Additionally, by keeping sequences short, the approach scales cleanly and keeps training efficient.

Agent Lightning as middleware

Agent Lightning serves as middleware between RL algorithms and agent environments, providing modular components that enable scalable RL through standardized protocols and well-defined interfaces.

An agent runner manages the agents as they complete tasks. It distributes work and collects and stores the results and progress data. It operates separately from the LLMs, enabling them to run on different resources and scale to support multiple agents running concurrently.

An algorithm trains the models and hosts the LLMs used for inference and training. It orchestrates the overall RL cycle, managing which tasks are assigned, how agents complete them, and how models are updated based on what the agents learn. It typically runs on GPU resources and communicates with the agent runner through shared protocols.

The LightningStore serves as the central repository for all data exchanges within the system. It provides standardized interfaces and a shared format, ensuring that the different components can work together and enabling the algorithm and agent runner to communicate effectively.

Figure 3: Diagram showing the architecture of Agent Lightning (AGL). On the left, the AGL Algorithm block includes an inference engine (e.g., vLLM), an algorithm iteration loop, and an adapter for trainable data and weights update. In the center, the AGL Core contains LightningStore, which manages tasks, resources, spans, and LLM calls. On the right, the AGL Agent Runner & Tracer includes a user-defined agent using OpenAI chat completion and agl.emit(). Arrows indicate flows of prompts, responses, tasks, resources, spans, and datasets between components, with roles for algorithm researchers and agent developers highlighted.
Figure 3. The Agent Lightning framework

Each RL cycle follows two steps: (1) Agent Lightning collects agent execution data (called “spans”) and stores them in the data store; (2) it then retrieves the required data and sends it to the algorithm for training. Through this design, the algorithm can delegate tasks asynchronously to the agent runner, which completes them and reports the results back (Figure 4).

Figure 4: Diagram of the training loop in Agent Lightning. The central element is ‘Trainer,’ with arrows forming a cycle between three components: Agent on the left, Algorithm on the right, and Trainer in the middle. The top arrow labeled ‘Tasks’ flows from Algorithm to Agent, while the bottom arrow labeled ‘Spans’ flows from Agent to Algorithm. ‘Prompt Templates’ is noted above the cycle, indicating its role in task generation.
Figure 4. Agent Lightning’s RL cycle
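
The sketch below compresses this two-step cycle into a single process: a toy in-memory list stands in for LightningStore, and run_agent and update_policy are hypothetical stubs for the agent runner and the RL algorithm, not Agent Lightning APIs.

```python
# Single-process sketch of the RL cycle (all names here are hypothetical
# stand-ins; the real system runs the algorithm and agent runner separately
# and exchanges data through LightningStore).
from typing import Dict, List

def run_agent(task: Dict) -> List[Dict]:
    """Stand-in for the agent runner: execute one task, return its spans."""
    return [{"task": task["id"], "prompt": "<prompt>", "generation": "<output>", "reward": 1.0}]

def update_policy(spans: List[Dict]) -> None:
    """Stand-in for the algorithm's update step (e.g., GRPO over the spans)."""
    print(f"training on {len(spans)} spans")

store: List[Dict] = []                       # toy stand-in for LightningStore
tasks = [{"id": i} for i in range(4)]        # tasks delegated by the algorithm

for task in tasks:                           # step 1: runner executes tasks,
    store.extend(run_agent(task))            # emitting spans into the store

update_policy(store)                         # step 2: algorithm trains on the spans
store.clear()
```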

One key advantage of this approach is its algorithmic flexibility. The system makes it easy for developers to customize how agents learn, whether they’re defining different rewards, capturing intermediate data, or experimenting with different training approaches.
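
For instance, a custom reward can be as simple as a small scoring function applied to the agent's final output; the exact-match scorer below is an illustrative example, not something shipped with the framework.

```python
# Illustrative custom reward (not part of Agent Lightning): score the agent's
# final answer against a reference with normalized exact match.
def exact_match_reward(answer: str, reference: str) -> float:
    """Return 1.0 if the case- and whitespace-normalized strings match."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return 1.0 if normalize(answer) == normalize(reference) else 0.0
```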

Another advantage is resource efficiency. Agentic RL systems are complex, integrating agent logic, LLM inference engines, and training frameworks. By separating these components, Agent Lightning makes this complexity manageable and allows each part to be optimized independently.

A decoupled design allows each component to use the hardware that suits it best. The agent runner can use CPUs while model training uses GPUs. Each component can also scale independently, improving efficiency and making the system easier to maintain. In practice, developers can keep their existing agent frameworks and switch model calls to the Agent Lightning API without changing their agent code (Figure 5).

Figure 5: Side-by-side code comparison showing agent implementation before and after integrating Agent Lightning. The left panel (dark background) displays the original agent code written by the developer, including logic for LLM calls, tool usage, and reward assignment. The right panel (light background) shows the modified version using Agent Lightning, where most of the agent logic remains unchanged but includes additional imports and calls to Agent Lightning components such as agl.PromptTemplate, agl.emit(), and agl.Trainer for training and credit assignment. A stylized lightning icon is centered between the two panels.
Figure 5. On the left, the developer implements the agent code. On the bottom right is the code required for Agent Lightning. The main body of the agent code is unchanged.
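
The pattern in Figure 5 can be sketched roughly as follows; the call to agl.emit comes from the names shown in the figure, but its signature and everything else below (the import alias, model name, and reward logic) are assumptions for illustration, and the real Agent Lightning API may differ.

```python
# Rough sketch of the Figure 5 pattern. Only the name agl.emit appears in the
# figure; the import alias, arguments, and reward logic below are assumptions.
import agentlightning as agl          # assumed package/import name
from openai import OpenAI

client = OpenAI()                     # the agent keeps its usual LLM client

def my_agent(question: str) -> str:
    # Existing agent logic: one ordinary chat-completion call, left unchanged.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content or ""
    # Added for Agent Lightning: report a reward for this run
    # (the reward keyword is an assumed signature, not documented here).
    agl.emit(reward=float(len(answer) > 0))
    return answer
```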

Evaluation across three real-world scenarios

Agent Lightning was tested on three distinct tasks, achieving consistent performance improvements across all scenarios (Figure 6):

Text-to-SQL (LangChain): In a system with three agents handling SQL generation, checking, and rewriting, Agent Lightning simultaneously optimized two of them, significantly improving the accuracy of generating executable SQL from natural language queries.

Retrieval-augmented generation (OpenAI Agents SDK implementation): On the multi-hop question-answering dataset MuSiQue, which requires querying a large Wikipedia database, Agent Lightning helped the agent generate more effective search queries and reason better from retrieved content.

Mathematical QA and tool use (AutoGen implementation): For complex math problems, Agent Lightning trained the LLM to more accurately determine when and how to call tools and how to integrate the results into its reasoning, increasing accuracy.

Figure 6: Figure with six line charts showing reward curves across three evaluation scenarios (Spider, MuSiQue, Calculator) for train and test splits. Top row: Train Rewards on Spider, MuSiQue, and Calculator—each plot shows a blue line with noisy upward trend over steps, indicating increasing rewards; Spider and Calculator rise faster with more variance, MuSiQue climbs more gradually. Bottom row: Test Rewards on Spider, MuSiQue, and Calculator—each plot shows a blue line that increases and then stabilizes at higher rewards; Calculator reaches near-plateau earliest, Spider shows steady gains with minor fluctuations, MuSiQue improves more slowly. All plots use ‘Steps’ on the x‑axis and ‘Rewards’ on the y‑axis, with a legend labeled ‘ours’ and light gridlines.
Figure 6. Reward curves across the three evaluation scenarios

Enabling continuous agent improvement

By simplifying RL integration, Agent Lightning can make it easier for developers to build, iterate, and deploy high-performance agents. We plan to expand Agent Lightning’s capabilities to include automatic prompt optimization and additional RL algorithms.

The framework is designed to serve as an open platform where any AI agent can improve through real-world practice. By bridging existing agentic systems with reinforcement learning, Agent Lightning aims to help create AI systems that learn from experience and improve over time.
