README - Verl — ContextQMD

Checkpoint Engine

Overview

Checkpoint Engine is an unified abstract layer to synchronize weights between various training backends and inference backends. It provides three unified APIs:

send_weights: get named tensors from generator and send them in streaming manner.
receive_weights: return a tensor generator that yield named tensors in streaming manner.
get_weights: return a tensor generator that yield named tensors in streaming manner, used for each inference instance update weight independently from local cache (e.g share memory, disk).

Supported Backends

	Comm Library	Topology	Hardware	Performance	Elastic	Use case
naive	torch.distributed	all_gather	NVIDIA/AMD/Ascend	Very High	NA	On-policy training

Trainer/rollout colocated |nccl|NCCL|all_gather+broadcast|NVIDIA GPU & NCCL|Very High|Low: rebuild nccl group|Off-policy training
Trainer/rollout disaggregated
Fixed clusters |hccl|HCCL|all_gather+broadcast|Ascend NPU & HCCL| High|Low: rebuild hccl group|Off-policy training
Trainer/rollout disaggregated
Fixed clusters |nixl|NIXL|all_gather+ring p2p|Various transport backends (D2D, H2H, H2D, etc)
UCX
UCCL
Mooncacke|Medium/High|High: dynamic adjust ring topology|Off-policy training
Trainer/rollout disaggregated
Elastic rollout
Rollout fault tolerance
Heterogeneous hardware rollout |kimi_ckpt_engine|MOONCAKE+NCCL/HCCL|p2p+broadcast|NVIDIA/Ascend|High|Low: rebuild communication group|Off-policy training
Trainer/rollout disaggregated
Save checkpoint each time |mooncake|Mooncake Transfer Engine|all_gather+ring p2p|NVIDIA/Ascend|High|High: dynamic adjust ring topology|Off-policy training
Trainer/rollout disaggregated
Fixed clusters

kimi_ckpt_engine detail:

In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout creates a sub communication group that includes all the cards for the rollout. Then, using Mooncake transfer engine, these weights are transmitted via P2P to a specific worker in the rollout, followed by a broadcast to all other rollout workers.

This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via pip install 'checkpoint-engine[p2p]' and that your version is 0.4.0 or higher.

In addition, during the installation of checkpoint-engine[p2p], the transfer engine will be installed. However, This library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct

Note: Important Configuration for Ascend Devices If you are using CANN version >= 8.5.0 on Ascend devices, you must set the following environment variable to enable intra-node ROCE:

bash

export HCCL_INTRA_ROCE_ENABLE=1

Benchmark

benchmark setup

model: Qwen/Qwen3-30B-A3B-Base
trainer: fsdp world_size=2 (since Ascend 910C has 64GB of HBM, we set world_size=4)
rollout: num_rollout=30 (only receive weight without cuda ipc to vllm/sglang)

bash

pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py

benchmark result

hardware	backend	time cost (s)	Bandwidth(GB/s)
4*8 H100, ConnectX-7 400 Gbps (InfiniBand)	NCCL	~7	8.25
4*8 H100, ConnectX-7 400 Gbps (InfiniBand)	NIXL	~7	8.25
2*16 Ascend 910C, inner suppernode	HCCL	~11	5.3
2*16 Ascend 910C, inner suppernode	kimi_ckpt_engine	offload: 7 update: 3.5	16.5
2*8 H100, ConnectX-7 400 Gbps (InfiniBand)	mooncake	5.93	9.44