docs/advance/agent_loop.rst
Last updated: 07/17/2025.
.. versionadded:: 0.4.2 [status: alpha]
.. warning:: Agent Loop is ready for use, but the API may change in future releases.
Agent Loop is designed as a general interface for multi-turn rollout and agentic reinforcement learning.
Design goal:
Non-goal:
At a high level, the agent loop is given a prompt and runs a user-defined loop: call the LLM generate API, call tools, ..., and return the final output. A reward is then computed on the final output, which is used as a trajectory for RL training.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_overview.svg?raw=true
The ``AgentLoopBase`` class is the abstraction of the agent loop; its ``run`` method is the only interface users need to implement.
Given prompt messages in the format ``[{"role": "user", "content": "..."}]`` and additional sampling params, the ``run`` method
can do whatever the user wants, such as calling the LLM generate API and invoking tools:
.. code:: python

    class AgentLoopBase(ABC):
        @abstractmethod
        async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
            """Run agent loop to interact with LLM server and environment.

            Args:
                sampling_params (Dict[str, Any]): LLM sampling params.
                **kwargs: dataset fields from `verl.utils.dataset.RLHFDataset`.

            Returns:
                AgentLoopOutput: Agent loop output.
            """
            raise NotImplementedError
After running the user-defined loop, the ``run`` method should return an ``AgentLoopOutput``, including the prompt token ids,
the response token ids, and the response mask.
.. code:: python

    class AgentLoopOutput(BaseModel):
        """Agent loop output."""

        prompt_ids: list[int]
        """Prompt token ids."""
        response_ids: list[int]
        """Response token ids, including LLM generated tokens and tool response tokens."""
        response_mask: list[int]
        """Response mask: 1 for LLM generated tokens, 0 for tool response tokens."""
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_output.svg?raw=true
.. note:: ``AgentLoopOutput`` only outputs one trajectory for a given prompt; support for outputting multiple trajectories is still under discussion.
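To make the interface concrete, below is a minimal sketch of a single-turn loop with no tool calls. It assumes the instance exposes a ``tokenizer`` and a ``server_manager`` (an ``AsyncLLMServerManager``, described later), and that the dataset provides a ``raw_prompt`` message list; these names are illustrative, not the exact constructor contract.

.. code:: python

    import uuid
    from typing import Any


    class SingleTurnAgentLoop(AgentLoopBase):
        """Minimal sketch: one LLM call, no tools (hypothetical attribute/field names)."""

        async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
            # Assumed dataset field, e.g. [{"role": "user", "content": "..."}].
            messages = kwargs["raw_prompt"]

            # Tokenize the chat prompt once; from here on we stay in token space.
            prompt_ids = self.tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, tokenize=True
            )

            # Ask the async LLM server for response token ids (token in, token out).
            response_ids = await self.server_manager.generate(
                request_id=uuid.uuid4().hex,
                prompt_ids=prompt_ids,
                sampling_params=sampling_params,
            )

            # Every response token was produced by the LLM, so the mask is all ones.
            return AgentLoopOutput(
                prompt_ids=prompt_ids,
                response_ids=response_ids,
                response_mask=[1] * len(response_ids),
            )

A multi-turn loop would follow the same pattern, appending tool response tokens to ``response_ids`` with a mask of 0 for each appended token.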
A single PPO step contains two phases: rollout and train. In the rollout phase:

1. ``AgentLoopManager.generate_sequences`` wakes up all async LLM server instances, which syncs weights between the inference engine (vLLM/SGLang) and the training engine (FSDP/Megatron-LM).
2. Each ``AgentLoopWorker`` creates an ``AgentLoopBase`` instance per prompt and runs its ``run`` coroutine to completion to obtain an ``AgentLoopOutput``.

   .. tip:: ``AgentLoopWorker`` schedules multiple coroutines concurrently. If the number of ``AgentLoopWorker`` instances equals the batch size, then each worker is responsible for one prompt.

3. Inside the agent loop, whenever the user needs the LLM to generate a response, the loop calls ``AsyncLLMServerManager.generate`` with ``prompt_ids``.
4. When all prompts in all ``AgentLoopWorker`` instances finish, ``AgentLoopManager`` gathers the results and returns them to ``PPOTrainer``.
5. ``AgentLoopManager`` puts all server instances to sleep, which frees the KV cache and offloads weights to CPU memory.

AsyncLLMServer
--------------
``AsyncLLMServer`` is the abstraction of an LLM server, exposing two types of generation APIs:
- `OpenAI chat completion <https://platform.openai.com/docs/api-reference/chat>`_: generate response for the given chat conversation.
- Token in token out: generate response ids for the given token ids.
vLLM and SGLang ``AsyncLLMServer`` implementations are officially supported; both implement the two APIs and are well tested.
Other inference engines should be easy to plug in by implementing the ``AsyncServerBase`` class.
.. code:: python

    class AsyncServerBase(ABC):
        @abstractmethod
        async def chat_completion(self, raw_request: Request) -> JSONResponse:
            """OpenAI chat completion API.

            Args:
                raw_request (Request): raw json request

            Returns:
                JSONResponse: json response

            API reference: https://platform.openai.com/docs/api-reference/chat/create
            """
            raise NotImplementedError

        @abstractmethod
        async def generate(self, prompt_ids: list[int], sampling_params: dict[str, Any], request_id: str) -> list[int]:
            """Generate response ids given prompt ids.

            Args:
                prompt_ids (List[int]): prompt ids
                sampling_params (Dict[str, Any]): sampling params
                request_id (str): request id

            Returns:
                List[int]: response ids
            """
            raise NotImplementedError
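As an illustration of what such a plug-in could look like, the skeleton below wires a hypothetical in-house engine into ``AsyncServerBase``. The ``engine`` object and its ``chat``/``async_generate`` methods are placeholders for whatever your engine provides; only the two abstract methods come from the interface above.

.. code:: python

    from typing import Any

    from fastapi import Request
    from fastapi.responses import JSONResponse


    class MyEngineAsyncServer(AsyncServerBase):
        """Skeleton plug-in for a hypothetical inference engine."""

        def __init__(self, engine):
            # Placeholder: an async handle to your engine.
            self.engine = engine

        async def chat_completion(self, raw_request: Request) -> JSONResponse:
            # Delegate the OpenAI-style request body to the engine and return its JSON payload.
            body = await raw_request.json()
            completion = await self.engine.chat(body["messages"])
            return JSONResponse(content=completion)

        async def generate(self, prompt_ids: list[int], sampling_params: dict[str, Any],
                           request_id: str) -> list[int]:
            # Token-in-token-out path used by the agent loop.
            return await self.engine.async_generate(prompt_ids, request_id=request_id,
                                                    **sampling_params)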
Chat completion vs Token in token out
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. warning::
The following conclusion is based on our recent experience and is still open to investigation and discussion.
Almost all agent frameworks (LangGraph, CrewAI, LlamaIndex, etc.) call the LLM with the OpenAI chat completion API and
keep the chat history as messages, so users may expect multi-turn rollout to use the chat completion API as well.
But based on our recent experience with single-turn training on DAPO and multi-turn training on `retool <https://github.com/verl-project/verl-recipe/tree/main/retool>`_,
we found that the token ids from applying the chat template to the final messages may not equal the token ids obtained by concatenating each turn's prompt_ids and response_ids.
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/multi_turn.png?raw=true
**Where does this inconsistency happen?**
First, the tool parser may alter the content. For example:
.. code:: json
{"role": "assistant", "content": "Let me call a <tool_call>...</tool_call> and get the result"}
After tool_calls extraction, the message looks like this:
.. code:: json
{"role": "assistant", "content": "Let me call a and get the result", "tool_calls": [{"name": "foo", "arguments": "{}"}]}
Re-encoding the extracted message does not reproduce the original LLM-generated response_ids.
Second, the decode-encode round trip may also lead to inconsistency: `Agent-R1 issue#30 <https://github.com/0russwest0/Agent-R1/issues/30#issuecomment-2826155367>`_.
**What is the impact of this inconsistency?**
This inconsistency is not a big problem for serving/agent systems, but it is critical for RL training:
it causes the trajectory to deviate from the policy model's distribution. We have observed that applying the chat template
to the final chat history messages makes PPO training fail to converge even in the single-turn setting.
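The toy script below shows the token-in-token-out bookkeeping that avoids this problem: each turn's LLM tokens and tool-response tokens are appended as-is, and ``response_mask`` is built from the same pieces. The token id values are made up; the point is that the trajectory is assembled from what the model actually emitted rather than from re-encoding the final chat history.

.. code:: python

    # Hypothetical two-turn trajectory (token ids are made-up placeholders).
    prompt_ids = [101, 2023, 2003]            # tokenized user prompt
    turn1_llm_ids = [7592, 29958, 102]        # LLM output containing a tool call
    turn1_tool_ids = [27, 3231, 29]           # tokenized tool response
    turn2_llm_ids = [1996, 3437, 102]         # final LLM answer

    response_ids: list[int] = []
    response_mask: list[int] = []
    for llm_ids, tool_ids in [(turn1_llm_ids, turn1_tool_ids), (turn2_llm_ids, [])]:
        response_ids += llm_ids + tool_ids
        response_mask += [1] * len(llm_ids) + [0] * len(tool_ids)

    # The trajectory used for training is exactly prompt_ids + response_ids.
    # Re-applying the chat template to the final messages could yield different ids,
    # because the tool parser rewrites assistant content and decode/encode is lossy.
    assert len(response_ids) == len(response_mask)
    print(response_ids)
    print(response_mask)  # 1 for LLM-generated tokens, 0 for tool-response tokens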
vLLM
^^^^
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_vllm.png?raw=true
For vLLM, the Async LLM Engine runs in the same process as the server, while the ModelRunner runs in the same process as the FSDP/Megatron-LM workers.
The Async LLM Engine communicates with the ModelRunner through ZeroMQ. When the server receives a request, it directly calls the engine to generate response_ids.
SGLang
^^^^^^
.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_sglang.png?raw=true
For SGLang, the Async LLM Engine runs in the same process as FSDP/Megatron-LM worker-0 and spawns multiple subprocesses as ModelRunners.
The Async LLM Engine likewise communicates with the ModelRunners through ZeroMQ. When the server receives a request, it makes a remote call to worker-0 and gets the response_ids.
AsyncLLMServerManager
---------------------
``AsyncLLMServerManager`` serves as a proxy to multiple ``AsyncLLMServer`` instances, routing requests across them (with sticky sessions keyed by ``request_id``, see below).
The ``AsyncLLMServerManager`` is passed to ``AgentLoopBase.__init__``; whenever users want to interact with the LLM inside the agent loop,
they can call ``AsyncLLMServerManager.generate`` to generate response_ids.
.. code:: python

    class AsyncLLMServerManager:
        async def generate(
            self,
            request_id,
            *,
            prompt_ids: list[int],
            sampling_params: dict[str, Any],
        ) -> list[int]:
            """Generate tokens from prompt ids.

            Args:
                request_id (str): request id for sticky session.
                prompt_ids (List[int]): List of prompt token ids.
                sampling_params (Dict[str, Any]): Sampling parameters for the chat completion.

            Returns:
                List[int]: List of generated token ids.
            """
            ...
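As a usage sketch (not part of the verl API), the helper below reuses one ``request_id`` across two turns so both calls stick to the same server; ``tool_response_ids`` stands in for however your loop tokenizes tool output.

.. code:: python

    import uuid
    from typing import Any


    async def generate_two_turns(server_manager, prompt_ids: list[int],
                                 tool_response_ids: list[int],
                                 sampling_params: dict[str, Any]) -> list[int]:
        # One request_id for the whole trajectory: both turns hit the same server
        # (sticky session), which likely lets the server reuse cached prefixes.
        request_id = uuid.uuid4().hex

        turn1 = await server_manager.generate(
            request_id, prompt_ids=prompt_ids, sampling_params=sampling_params)

        # Second turn: feed back everything generated so far plus the tool response tokens.
        turn2 = await server_manager.generate(
            request_id,
            prompt_ids=prompt_ids + turn1 + tool_response_ids,
            sampling_params=sampling_params)

        return turn1 + tool_response_ids + turn2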
- :doc:`Agentic RL Training <../start/agentic_rl>`: Quick start agentic RL training with the gsm8k dataset.
- `LangGraph MathExpression <https://github.com/verl-project/verl-recipe/tree/main/langgraph_agent/example>`_: Demonstrates how to use LangGraph to build an agent loop.
- `Retool <https://github.com/verl-project/verl-recipe/tree/main/retool>`_: End-to-end retool paper reproduction using a tool agent.