verl/checkpoint_engine/README.md
Checkpoint Engine is a unified abstraction layer for synchronizing weights between various training backends and inference backends. It provides three unified APIs:
| Backend | Comm Library | Topology | Hardware | Performance | Elastic | Use case |
|---|---|---|---|---|---|---|
| naive | torch.distributed | all_gather | NVIDIA/AMD/Ascend | Very High | NA | On-policy training |
In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout side creates a communication sub-group that includes all rollout devices. The weights are then transmitted via P2P, using the Mooncake transfer engine, to one specific rollout worker, which broadcasts them to all other rollout workers.
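The three steps above (offload, P2P transfer to one worker, broadcast to the rest) can be pictured with a toy, single-process simulation. This is only an illustrative sketch: `RolloutWorker`, `offload_to_cpu`, `p2p_send`, `broadcast`, and `sync_weights` are hypothetical names invented here, plain dictionary copies stand in for the Mooncake transfer and the collective broadcast, and none of it reflects the actual checkpoint engine API.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutWorker:
    """Hypothetical stand-in for one rollout device."""
    rank: int
    weights: dict = field(default_factory=dict)


def offload_to_cpu(trainer_weights: dict) -> dict:
    # Step 1: the trainer stages a host-side copy of its weights.
    return dict(trainer_weights)


def p2p_send(cpu_weights: dict, target: RolloutWorker) -> None:
    # Step 2: point-to-point transfer to a single rollout worker
    # (the Mooncake transfer engine in the real workflow; a plain copy here).
    target.weights = dict(cpu_weights)


def broadcast(source: RolloutWorker, group: list) -> None:
    # Step 3: the receiving worker broadcasts inside the rollout sub-group.
    for worker in group:
        if worker.rank != source.rank:
            worker.weights = dict(source.weights)


def sync_weights(trainer_weights: dict, rollout_group: list, receiver_rank: int = 0) -> None:
    # Assumes the list index of each worker equals its rank.
    cpu_weights = offload_to_cpu(trainer_weights)
    receiver = rollout_group[receiver_rank]
    p2p_send(cpu_weights, receiver)
    broadcast(receiver, rollout_group)


trainer_weights = {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]}
rollout_group = [RolloutWorker(rank=r) for r in range(4)]
sync_weights(trainer_weights, rollout_group)
```

After `sync_weights` returns, every worker in `rollout_group` holds a copy of `trainer_weights`, although only the worker with rank 0 received it directly from the trainer side.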
This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via `pip install 'checkpoint-engine[p2p]'` and that your version is 0.4.0 or higher.
In addition, installing `checkpoint-engine[p2p]` also installs the transfer engine. However, this library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct
**Note: Important configuration for Ascend devices.** If you are using CANN version >= 8.5.0 on Ascend devices, you must set the following environment variable to enable intra-node RoCE:
export HCCL_INTRA_ROCE_ENABLE=1
pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py
| Hardware | Backend | Time cost (s) | Bandwidth (GB/s) |
|---|---|---|---|
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NCCL | ~7 | 8.25 |
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NIXL | ~7 | 8.25 |
| 2*16 Ascend 910C, intra-supernode | HCCL | ~11 | 5.3 |
| 2*16 Ascend 910C, intra-supernode | kimi_ckpt_engine | offload: 7, update: 3.5 | 16.5 |
| 2*8 H100, ConnectX-7 400 Gbps (InfiniBand) | mooncake | 5.93 | 9.44 |
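As a rough sanity check on the table, the sketch below multiplies each row's time by its reported bandwidth to get the payload size that row implies. It assumes that bandwidth is defined as payload size divided by transfer time, which the table does not state explicitly. It also treats the "~7" and "~11" entries as exact, so the results are approximate.

```python
# (hardware, backend, transfer time in s, bandwidth in GB/s), copied from the table.
rows = [
    ("4*8 H100", "NCCL", 7.0, 8.25),
    ("4*8 H100", "NIXL", 7.0, 8.25),
    ("2*16 Ascend 910C", "HCCL", 11.0, 5.3),
    ("2*16 Ascend 910C", "kimi_ckpt_engine", 3.5, 16.5),  # update phase only
    ("2*8 H100", "mooncake", 5.93, 9.44),
]


def implied_payload_gb(seconds: float, gb_per_s: float) -> float:
    # Assumed relation: bandwidth = payload / time, so payload = time * bandwidth.
    return seconds * gb_per_s


payloads = {
    backend: round(implied_payload_gb(seconds, bandwidth), 2)
    for _, backend, seconds, bandwidth in rows
}

for backend, gigabytes in payloads.items():
    print(f"{backend:>18}: ~{gigabytes} GB")
```

Under that assumption every row implies a payload of roughly 56 to 58 GB. The kimi_ckpt_engine figure of 16.5 GB/s lines up with the 3.5 s update phase alone. If the 7 s offload were counted as well, the same payload over 10.5 s would correspond to about 5.5 GB/s.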