Checkpoint Engine

Overview

Checkpoint Engine is a unified abstraction layer for synchronizing weights between various training backends and inference backends. It provides three unified APIs:

  • send_weights: takes named tensors from a generator and sends them in a streaming manner.
  • receive_weights: returns a generator that yields named tensors in a streaming manner.
  • get_weights: returns a generator that yields named tensors in a streaming manner. It is used when each inference instance updates its weights independently from a local cache (e.g. shared memory, disk).

Supported Backends

| Backend | Comm Library | Topology | Hardware | Performance | Elastic | Use case |
|---|---|---|---|---|---|---|
| naive | torch.distributed | all_gather | NVIDIA/AMD/Ascend | Very High | NA | On-policy training; trainer/rollout colocated |
| nccl | NCCL | all_gather+broadcast | NVIDIA GPU & NCCL | Very High | Low: rebuild nccl group | Off-policy training; trainer/rollout disaggregated; fixed clusters |
| hccl | HCCL | all_gather+broadcast | Ascend NPU & HCCL | High | Low: rebuild hccl group | Off-policy training; trainer/rollout disaggregated; fixed clusters |
| nixl | NIXL | all_gather+ring p2p | Various transport backends (D2D, H2H, H2D, etc): UCX, UCCL, Mooncake | Medium/High | High: dynamic adjust ring topology | Off-policy training; trainer/rollout disaggregated; elastic rollout; rollout fault tolerance; heterogeneous hardware rollout |
| kimi_ckpt_engine | MOONCAKE+NCCL/HCCL | p2p+broadcast | NVIDIA/Ascend | High | Low: rebuild communication group | Off-policy training; trainer/rollout disaggregated; save checkpoint each time |
| mooncake | Mooncake Transfer Engine | all_gather+ring p2p | NVIDIA/Ascend | High | High: dynamic adjust ring topology | Off-policy training; trainer/rollout disaggregated; fixed clusters |

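To make the "dynamic adjust ring topology" entry more concrete, here is a minimal sketch of why a ring of point-to-point links can be cheap to adjust: when a worker leaves, only its predecessor needs a new successor. The `build_ring` helper and the worker names are hypothetical, and this is not a description of how the nixl or mooncake backends are actually implemented.

```python
from typing import Dict, List


def build_ring(workers: List[str]) -> Dict[str, str]:
    """Map every worker to the next worker it forwards weights to."""
    n = len(workers)
    return {w: workers[(i + 1) % n] for i, w in enumerate(workers)}


workers = ["trainer", "rollout-0", "rollout-1", "rollout-2"]
ring = build_ring(workers)

# Simulate rollout-1 dropping out, then rebuild the ring from the survivors.
survivors = [w for w in workers if w != "rollout-1"]
new_ring = build_ring(survivors)

# Only workers whose successor changed have to re-establish a connection.
changed = {w for w in survivors if ring[w] != new_ring[w]}
```

By contrast, the "rebuild ... group" entries for the collective-based backends suggest that a membership change there involves re-creating the whole communication group.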
kimi_ckpt_engine detail:

In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout side creates a sub-communication group that includes all rollout devices. The Mooncake transfer engine then transmits the weights via P2P to one specific rollout worker, after which they are broadcast to all other rollout workers.
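The three stages described above (offload, P2P to one worker, broadcast to the rest) can be traced with a small simulation. Everything here is a stand-in: dictionaries of Python lists replace tensors, and the helper names `offload_to_cpu`, `p2p_send`, and `broadcast` are hypothetical and do not correspond to checkpoint-engine, Mooncake, NCCL, or HCCL calls.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # plain lists stand in for tensors


def offload_to_cpu(device_weights: Weights) -> Weights:
    """Step 1: the trainer copies its weights into host memory."""
    return {name: list(values) for name, values in device_weights.items()}


def p2p_send(cpu_weights: Weights, receiver: Weights) -> None:
    """Step 2: point-to-point transfer to one designated rollout worker."""
    receiver.update({name: list(values) for name, values in cpu_weights.items()})


def broadcast(source: Weights, group: List[Weights]) -> None:
    """Step 3: copy the designated worker's weights to every other group member."""
    for worker in group:
        if worker is not source:
            worker.update({name: list(values) for name, values in source.items()})


trainer_weights = {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]}
rollout_group: List[Weights] = [{} for _ in range(4)]  # sub-group of all rollout workers

cpu_copy = offload_to_cpu(trainer_weights)
p2p_send(cpu_copy, rollout_group[0])
broadcast(rollout_group[0], rollout_group)
```

After the last call every entry in `rollout_group` holds its own copy of the trainer's weights.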

This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via pip install 'checkpoint-engine[p2p]' and that your version is 0.4.0 or higher.
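A quick way to confirm the version requirement from Python is sketched below. It uses the standard-library `importlib.metadata` and assumes the distribution name matches the pip name `checkpoint-engine`; the helper names are made up for this example, and pre-release or local suffixes (such as `rc1`) are simply ignored in the comparison.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '0.4.1rc1' -> (0, 4, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.4.0") -> bool:
    """Compare release tuples after padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name: str = "checkpoint-engine") -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    status = "checkpoint-engine is not installed"
elif meets_minimum(found):
    status = f"checkpoint-engine {found} satisfies >= 0.4.0"
else:
    status = f"checkpoint-engine {found} is older than 0.4.0"
```

Note that this only checks the package version; it does not verify that the `p2p` extra was selected at install time.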

In addition, installing checkpoint-engine[p2p] also installs the transfer engine. However, this library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct

Note: Important configuration for Ascend devices. If you are using CANN version 8.5.0 or later on Ascend devices, you must set the following environment variable to enable intra-node RoCE:

```bash
export HCCL_INTRA_ROCE_ENABLE=1
```
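If the variable has to be set from a Python launcher rather than the shell, something like the following sketch could work. The variable name and the 8.5.0 threshold come from the note above; the helper names are hypothetical, how to obtain the CANN version is left to the caller, and it is an assumption that the variable must be in place before the process initializes HCCL.

```python
import os
from typing import MutableMapping, Tuple


def parse_cann_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor.patch'; non-numeric parts (e.g. 'RC1') count as 0."""
    numbers = []
    for piece in version.split(".")[:3]:
        numbers.append(int(piece) if piece.isdigit() else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return (numbers[0], numbers[1], numbers[2])


def configure_intra_node_roce(
    cann_version: str, env: MutableMapping[str, str] = os.environ
) -> bool:
    """Set HCCL_INTRA_ROCE_ENABLE=1 when the CANN version is 8.5.0 or later."""
    if parse_cann_version(cann_version) >= (8, 5, 0):
        env["HCCL_INTRA_ROCE_ENABLE"] = "1"
        return True
    return False


# Demonstrate on a scratch mapping so the real environment is left untouched.
scratch_env = {}
applied_new = configure_intra_node_roce("8.5.0", scratch_env)
applied_old = configure_intra_node_roce("8.3.0", {})
```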

Benchmark

  1. Benchmark setup
  • model: Qwen/Qwen3-30B-A3B-Base
  • trainer: FSDP with world_size=2 (on Ascend 910C, which has 64 GB of HBM, we set world_size=4)
  • rollout: num_rollout=30 (rollout workers only receive weights, without CUDA IPC transfer to vLLM/SGLang)

```bash
pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py
```

  2. Benchmark result

| Hardware | Backend | Time cost (s) | Bandwidth (GB/s) |
|---|---|---|---|
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NCCL | ~7 | 8.25 |
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NIXL | ~7 | 8.25 |
| 2*16 Ascend 910C, intra-supernode | HCCL | ~11 | 5.3 |
| 2*16 Ascend 910C, intra-supernode | kimi_ckpt_engine | offload: 7, update: 3.5 | 16.5 |
| 2*8 H100, ConnectX-7 400 Gbps (InfiniBand) | mooncake | 5.93 | 9.44 |
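As a sanity check on the figures above, the bandwidth column appears to be the transferred payload divided by the time cost. The sketch below multiplies the two columns back together to recover the implied payload per backend. It assumes that relationship holds, reads the mooncake entry as 5.93 s and 9.44 GB/s, and assumes the kimi_ckpt_engine bandwidth refers to the 3.5 s update phase only; none of this is stated explicitly in the benchmark.

```python
# (backend, time in seconds, reported bandwidth in GB/s)
rows = [
    ("NCCL", 7.0, 8.25),
    ("NIXL", 7.0, 8.25),
    ("HCCL", 11.0, 5.3),
    ("kimi_ckpt_engine (update only)", 3.5, 16.5),
    ("mooncake", 5.93, 9.44),
]

# Implied payload in GB if bandwidth = payload / time.
implied_payload = {name: round(seconds * gbps, 2) for name, seconds, gbps in rows}

for name, size in implied_payload.items():
    print(f"{name}: about {size} GB")
```

Under these assumptions every row implies a payload of roughly 56 to 58 GB, which is in the range one would expect for the bf16 weights of a model of about 30B parameters.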