docs/blog/v0.7.md
Author: verl team
Last updated: 01/03/2026.
verl adopts a Hybrid-Controller architecture (also known as HybridFlow). Sharing design principles with asynchronous sharded dataflow systems like Google Pathways, verl models Reinforcement Learning (RL) algorithms, such as PPO, GRPO, DAPO, and others, as a multi-stage, multi-model and parallelizable dataflow graph.
To balance flexibility with performance, verl unifies two distinct programming models:
High-Level Single-Controller (MPMD): At the orchestration level, a single-process RLTrainer manages the global computation graph. It handles macro-tasks such as scheduling rollout generation, triggering reward scoring, and dispatching distributed training jobs.
Internal Multi-Controller (SPMD): Internally, the Model Engine operates like standard distributed training: workers execute identical programs, through trainer backends such as FSDP, Megatron, or VeOmni, or through rollout executors (as opposed to rollout servers) such as vLLM, SGLang, or TensorRT-LLM, performing the heavy distributed computation and synchronizing via collective communication.
This hybrid approach offers significant advantages:
Flexible Orchestration: The single-controller design allows verl to dynamically manage complex constraints within the computation graph, including flexible data dependencies, diverse resource allocation and model placement, and fine-grained asynchronous staleness control.
Abstraction of Complexity: We encapsulate complex parallel strategies—such as 5D parallelism (DP, TP, CP, PP, and EP)—strictly within the Model Engine. This allows users to focus entirely on RL algorithm implementation without getting bogged down by the details of distributed training.
Furthermore, leveraging Ray placement groups, verl provides ResourcePool and WorkerGroup abstractions. These enable flexible GPU sharing among the various roles in the RL process—such as actor, critic, reward, and rollout—allowing components to share resources efficiently while remaining isolated.
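As a rough sketch of how these abstractions compose, the snippet below builds one resource pool and binds two colocated roles to it. It is simplified for illustration: `ActorWorker`, `CriticWorker`, and the config objects are placeholders, and the exact constructor arguments of `RayResourcePool`, `RayWorkerGroup`, and `RayClassWithInitArgs` may differ from the released API.

```python
# Simplified sketch; the worker classes and configs are placeholders, and the
# constructor arguments may not match verl's released API exactly.
import ray
from verl.single_controller.ray import RayResourcePool, RayWorkerGroup, RayClassWithInitArgs

# One pool spanning 2 nodes x 8 GPUs, shared by the colocated roles.
pool = RayResourcePool(process_on_nodes=[8, 8])

# Actor and critic are bound to the same pool, so they share GPUs
# while remaining isolated as separate worker processes.
actor_wg = RayWorkerGroup(
    resource_pool=pool,
    ray_cls_with_init=RayClassWithInitArgs(cls=ray.remote(ActorWorker), config=actor_cfg),
)
critic_wg = RayWorkerGroup(
    resource_pool=pool,
    ray_cls_with_init=RayClassWithInitArgs(cls=ray.remote(CriticWorker), config=critic_cfg),
)
```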
As illustrated in the diagram below, the overall architecture of verl is divided into two layers:
The Model Engine serves as verl's core training engine, defining a set of abstract interfaces that support pluggable backends. It operates in SPMD mode:
The abstract interfaces include methods like initialize, forward, optimizer_step, and load/offload. Integrating a new training engine simply requires inheriting and implementing these interfaces. Crucially, because all backends adhere to this unified abstraction, adding a new Model Engine requires absolutely no code modification on the caller side. The RLTrainer remains completely agnostic to the backend's specific parallel strategy when calling these interfaces, while the WorkerGroup automatically handles data dispatch and collection based on the underlying parallelism.
Currently, the Model Engine supports the following backends (more backends may be supported in the future, e.g., torchtitan):
| Backend | Parallelism | Performance | Supported Models | Time to Support a New Model |
|---|---|---|---|---|
| FSDP | FSDP+SP | Medium (dense) / Low (MoE) | All transformer models | Day 0 |
| MCore | DP+TP+PP+EP+CP | High | See the Megatron-Bridge supported model list | A few weeks to a month |
| VeOmni | FSDP+SP+EP | Medium | See the VeOmni supported model list | ~1 week |
```python
from typing import Any, Callable, Generator, Optional

import torch
from tensordict import TensorDict


class BaseEngine:
    def initialize(self):
        """Instantiate or load the model, optimizer, and learning rate scheduler."""
        raise NotImplementedError

    def optimizer_zero_grad(self):
        """Zero the gradients of the optimizer."""
        raise NotImplementedError

    def optimizer_step(self):
        """Perform an optimization step using the optimizer."""
        raise NotImplementedError

    def lr_scheduler_step(self):
        """Advance the learning rate scheduler by one step."""
        raise NotImplementedError

    def forward_backward_batch(self, data: TensorDict, loss_function: Callable, forward_only=False) -> Any:
        """Perform a forward pass and optionally a backward pass on a batch of data."""
        raise NotImplementedError

    def get_per_tensor_param(self) -> tuple[Generator[tuple[str, torch.Tensor], None, None], Optional[dict]]:
        """Get a generator that yields per-tensor parameters and an optional PEFT config."""
        raise NotImplementedError

    def to(self, device: str, model: bool = True, optimizer: bool = True, grad: bool = True):
        """Move model parameters, optimizer states, or both to the specified device."""
        raise NotImplementedError
```
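To make the contract concrete, here is a deliberately minimal, hypothetical engine that implements the interface with a plain single-GPU PyTorch model and no parallelism; real backends such as FSDP, MCore, and VeOmni implement the same methods behind their own parallel strategies. `model_fn` and the batch keys (`input_ids`, `attention_mask`) are assumptions made for this example.

```python
# Hypothetical single-GPU engine, for illustration only (no parallelism).
import torch
from tensordict import TensorDict


class ToyEngine(BaseEngine):
    def __init__(self, model_fn, lr: float = 1e-5):
        self.model_fn = model_fn  # callable returning an nn.Module (placeholder)
        self.lr = lr

    def initialize(self):
        self.model = self.model_fn().cuda()
        self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.lr)

    def optimizer_zero_grad(self):
        self.optimizer.zero_grad()

    def optimizer_step(self):
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
        self.optimizer.step()

    def forward_backward_batch(self, data: TensorDict, loss_function, forward_only=False):
        # Assumes the batch carries `input_ids` and `attention_mask`.
        output = self.model(input_ids=data["input_ids"], attention_mask=data["attention_mask"])
        if forward_only:
            return output
        loss = loss_function(output, data)
        loss.backward()
        return loss.detach()

    def get_per_tensor_param(self):
        # Yield (name, tensor) pairs; no PEFT config in this toy engine.
        return ((name, p.detach()) for name, p in self.model.named_parameters()), None

    def to(self, device: str, model: bool = True, optimizer: bool = True, grad: bool = True):
        if model:
            self.model.to(device)
```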
As LLM reinforcement learning evolves from single-turn, static tasks to multi-turn, dynamic, and interactive agentic tasks, the legacy SPMD rollout mode previously used by verl has become insufficient. Consequently, in verl v0.7, we have removed the SPMD rollout mode and switched to rollout server mode by default.
In server mode, the LLM server performs online serving rather than traditional offline batch inference. Clients send per-sample requests to the server, enabling the engine to use dynamic batching. This significantly improves throughput for multi-turn conversations. Furthermore, the server-based approach eliminates the need for intrusive modifications to the LLM inference engine, allowing seamless integration of modern inference backends such as vLLM, SGLang, and TensorRT-LLM.
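As a rough sketch of what per-sample serving looks like from the client side (outside of verl's own abstractions), the snippet below fires independent asynchronous requests at an OpenAI-compatible endpoint such as the one exposed by vLLM or SGLang, leaving the server free to batch them dynamically. The URL and model name are placeholders.

```python
import asyncio
import aiohttp


async def generate(session: aiohttp.ClientSession, prompt: str) -> str:
    # Each sample is its own request; the server batches requests dynamically.
    payload = {"model": "my-model", "prompt": prompt, "max_tokens": 256}
    async with session.post("http://localhost:8000/v1/completions", json=payload) as resp:
        body = await resp.json()
        return body["choices"][0]["text"]


async def main(prompts):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate(session, p) for p in prompts))


responses = asyncio.run(main(["prompt A", "prompt B", "prompt C"]))
```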
On the client side, verl introduces an extensible AgentLoop abstraction designed to define custom agentic task loops. This abstraction manages the cycle of requesting responses from the LLM server and interacting with external environments to obtain feedback. We provide two default implementations:
SingleTurnAgentLoop: a single request–response round, covering classic single-turn RLHF-style tasks.
ToolAgentLoop: a multi-turn loop in which the model can call tools and incorporate their results into the conversation.
Users can implement custom AgentLoop logic tailored to their specific needs, such as SWEAgentLoop or GUIAgentLoop.
```python
from abc import ABC, abstractmethod
from typing import Any


class AgentLoopBase(ABC):
    @abstractmethod
    async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
        """Run the agent loop to interact with the LLM server and environment.

        Args:
            sampling_params (dict[str, Any]): LLM sampling params.
            **kwargs: dataset fields from `verl.utils.dataset.RLHFDataset`.

        Returns:
            AgentLoopOutput: Agent loop output.
        """
        raise NotImplementedError
```
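For instance, a custom loop might alternate between model calls and an external environment until the task terminates. The sketch below is schematic: `call_llm`, `MyEnv`, `make_output`, and the dataset field names are placeholders for the surrounding harness, not verl's exact API.

```python
# Schematic custom agent loop; helper names and dataset fields are placeholders.
class MyToolAgentLoop(AgentLoopBase):
    async def run(self, sampling_params, **kwargs):
        env = MyEnv(task=kwargs["extra_info"])              # external environment (placeholder)
        messages = [{"role": "user", "content": kwargs["raw_prompt"]}]
        for _ in range(8):                                  # cap the number of turns
            reply = await call_llm(messages, sampling_params)  # request to the rollout server
            messages.append({"role": "assistant", "content": reply})
            observation, done = env.step(reply)             # environment feedback
            if done:
                break
            messages.append({"role": "tool", "content": observation})
        return make_output(messages, reward=env.reward())   # pack into an AgentLoopOutput
```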
As mentioned, verl uses a global single-controller RLTrainer to orchestrate the computation graph. A major limitation in the current implementation is that the RLTrainer handles both control and data flow, creating a bottleneck when dispatching data between components. This issue is amplified by the massive data volumes in multimodal training (images, video, audio) and by complex algorithms like router replay, which require transmitting large tensors per sample. Our earlier attempt to solve this using the Ray object store yielded poor performance due to the lack of tensor optimization and fine-grained column access.
In v0.7, we experimentally introduced TransferQueue to decouple control flow from data flow. The RLTrainer now dispatches only instructions and metadata, while TransferQueue handles data transmission via reference passing. TransferQueue is specifically optimized for PyTorch tensors (supporting zero-copy and RDMA) and allows for backend extensions like ZeroMQ, NIXL, and Ray RDT. We plan to make this the default transmission method in v0.8.
```python
# In PPOTrainer: only BatchMeta (metadata and data references) flows through the trainer.
def fit(self):
    batch = next(dataloader)
    gen_batch: BatchMeta = self.rollout_manager.generate_sequences(batch)
    output: BatchMeta = self.actor_rollout_wg.compute_log_prob(gen_batch)
    gen_batch = gen_batch.union(output)
    output = self.actor_rollout_wg.update_actor(gen_batch)


# In Worker: the actual tensors are fetched from and written back to TransferQueue.
def compute_log_prob(self, batch: BatchMeta) -> BatchMeta:
    data = tq.get(batch)
    output = self.actor.infer_batch(data=data)
    return tq.put(output)
```
With the increase in LLM context lengths and the evolution of agentic tasks, the "long-tail" problem in rollout has become prominent, limiting the overall efficiency of RL training.
To mitigate this, a viable strategy is moving from on-policy synchronous training to off-policy asynchronous training, e.g., Laminar, AReaL, StreamRL, LlamaRL, PipelineRL. This involves separating the rollout and model engines onto different nodes (a disaggregated architecture, as opposed to colocated), with data transmitted via queues. This separation alleviates the rollout long-tail issue and enables elastic scaling of rollout, fault tolerance, and heterogeneous hardware. However, it introduces a new challenge: efficient cross-node parameter synchronization.
To address this, we introduce the Checkpoint Engine: a unified abstraction layer designed to synchronize weights between various training and inference backends.
Currently, we provide two transport backends: NCCL (broadcast-based collective communication) and NIXL (point-to-point communication).
```python
from abc import ABC, abstractmethod
from typing import Generator

import torch


class CheckpointEngine(ABC):
    @abstractmethod
    async def send_weights(self, weights: Generator[tuple[str, torch.Tensor], None, None]):
        """Send the weights of the model.

        Args:
            weights: A generator that yields the name of the weight tensor and the tensor itself.
        """
        raise NotImplementedError

    @abstractmethod
    async def receive_weights(self) -> Generator[tuple[str, torch.Tensor], None, None]:
        """Receive the weights of the model.

        Yields:
            A tuple of the name of the weight tensor and the tensor itself.
        """
        raise NotImplementedError
```
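Putting the two abstractions together, parameter synchronization amounts to streaming the Model Engine's per-tensor generator into the checkpoint engine on the trainer side and consuming the mirrored stream on the rollout side. The sketch below is illustrative glue code: the concrete CheckpointEngine subclasses and the rollout-side `load_weight` hook are assumptions, not verl's exact API.

```python
# Illustrative glue code; concrete backends and the `load_weight` hook are assumptions.
async def sync_weights(engine: BaseEngine, ckpt: CheckpointEngine):
    # Trainer side: stream (name, tensor) pairs straight out of the Model Engine.
    per_tensor_params, _peft_config = engine.get_per_tensor_param()
    await ckpt.send_weights(per_tensor_params)


async def receive_and_load(ckpt: CheckpointEngine, rollout):
    # Rollout side: consume the mirrored stream and load each tensor
    # into the inference engine as it arrives.
    for name, tensor in await ckpt.receive_weights():
        rollout.load_weight(name, tensor)  # hypothetical per-tensor load hook
```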
Building upon the four core components provided by verl-core, verl-trainer constructs several RL training pipelines tailored to specific scenarios. These pipelines are designed to address training efficiency challenges across varying scales and requirements:
- On-policy (Synchronous): Model Engine, Rollout Engine, Reward, Algorithm, Recipe
- One-step-off-policy (Async): Model Engine, Rollout Engine, TransferQueue, Checkpoint Engine, Trainer
- Fully async (Decoupled & Streaming): Model Engine, Rollout Engine, more agentic training recipes
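To give a feel for the fully asynchronous pipeline, below is a deliberately simplified producer/consumer skeleton: rollout workers stream finished trajectories into a queue, the trainer consumes them as they become available, and every few optimizer steps it pushes fresh weights through the Checkpoint Engine. This is a toy illustration of the dataflow only; `train_step` and the surrounding wiring are placeholders, not verl-trainer's implementation.

```python
import asyncio


async def rollout_producer(queue: asyncio.Queue, agent_loop, prompts, sampling_params):
    # Stream each finished trajectory into the queue as soon as it completes.
    for prompt in prompts:
        trajectory = await agent_loop.run(sampling_params, raw_prompt=prompt)
        await queue.put(trajectory)


async def trainer_consumer(queue: asyncio.Queue, engine, ckpt, batch_size=8, sync_every=4):
    step = 0
    while True:
        # Staleness is bounded by how far rollout is allowed to run ahead of training.
        batch = [await queue.get() for _ in range(batch_size)]
        train_step(engine, batch)  # placeholder: forward/backward + optimizer step
        step += 1
        if step % sync_every == 0:
            params, _ = engine.get_per_tensor_param()
            await ckpt.send_weights(params)  # push fresh weights to the rollout servers
```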