docs/training/weight_transfer/nccl.md
The NCCL weight transfer engine uses NCCL broadcast operations to transfer weights from the trainer to inference workers. It supports multi-node and multi-GPU setups where the trainer and inference engine run on separate GPUs.
NCCL requires explicit process group setup: the trainer and inference workers must agree on a master address, port, and world size. Under the hood, the engine uses `StatelessProcessGroup`, vLLM's process group abstraction that does not depend on `torch.distributed`.
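The world size counts one rank for the trainer plus one rank per inference worker GPU. A minimal sizing sketch, where `tp_size` and `num_replicas` are hypothetical example values rather than part of the API:

```python
# Hypothetical deployment: one trainer process plus two inference replicas,
# each sharded across 4 GPUs with tensor parallelism.
tp_size = 4       # GPUs per inference replica
num_replicas = 2  # independent inference engine replicas

# One rank for the trainer, plus one rank per inference worker GPU.
world_size = 1 + tp_size * num_replicas
print(world_size)  # 9
```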
On the inference side, initialize the engine with `init_weight_transfer_engine`:

```python
from vllm.distributed.weight_transfer.base import WeightTransferInitRequest

llm.init_weight_transfer_engine(
    WeightTransferInitRequest(
        init_info=dict(
            master_address=master_address,
            master_port=master_port,
            rank_offset=1,  # rank_offset accounts for the trainer occupying rank 0
            world_size=world_size,  # trainer + all inference workers
        )
    )
)
```
On the trainer side, set up the matching process group with `trainer_init`:

```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLWeightTransferEngine,
)

group = NCCLWeightTransferEngine.trainer_init(
    dict(
        master_address=master_address,
        master_port=master_port,
        world_size=world_size,
    )
)
```
!!! note
    `trainer_init` always assigns the trainer to rank 0. Inference workers start at `rank_offset` (typically 1).
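The resulting rank layout can be sketched as follows, where `num_workers` is an assumed example value:

```python
rank_offset = 1   # inference worker ranks start after the trainer
num_workers = 4   # assumed number of inference worker GPUs

trainer_rank = 0  # trainer_init always assigns the trainer rank 0
worker_ranks = [rank_offset + i for i in range(num_workers)]
print(worker_ranks)  # [1, 2, 3, 4]
```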
The trainer then broadcasts its weights with `trainer_send_weights`:

```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLTrainerSendWeightsArgs,
    NCCLWeightTransferEngine,
)

trainer_args = NCCLTrainerSendWeightsArgs(
    group=group,
    packed=True,  # use packed broadcasting for efficiency
)

NCCLWeightTransferEngine.trainer_send_weights(
    iterator=model.named_parameters(),
    trainer_args=trainer_args,
)
```
See `NCCLTrainerSendWeightsArgs` for the full list of configurable fields.
When `packed=True`, multiple weight tensors are packed into large contiguous buffers before broadcasting. This reduces the number of NCCL operations, and the engine uses double or triple buffering with dedicated CUDA streams to overlap packing, broadcasting, and unpacking.

Both the trainer (`NCCLTrainerSendWeightsArgs`) and the inference side (`NCCLWeightTransferUpdateInfo`) must use matching `packed_buffer_size_bytes` and `packed_num_buffers` values.
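Conceptually, packing flattens several tensors into one contiguous buffer so that a single broadcast moves all of them, and the receiver slices them back out in the agreed order. A simplified sketch with NumPy arrays standing in for CUDA tensors; this illustrates the idea only, not the engine's actual implementation:

```python
import numpy as np

# Two "weights" standing in for model parameters.
weights = {
    "layer.0.weight": np.arange(6, dtype=np.float32).reshape(2, 3),
    "layer.1.weight": np.arange(4, dtype=np.float32).reshape(2, 2),
}

# Pack: copy every tensor, flattened, into one contiguous buffer.
total = sum(w.size for w in weights.values())
buffer = np.empty(total, dtype=np.float32)
offset = 0
for w in weights.values():
    buffer[offset:offset + w.size] = w.ravel()
    offset += w.size

# ... a single broadcast of `buffer` now replaces one broadcast per tensor ...

# Unpack: slice each tensor back out using the agreed name/shape order.
unpacked = {}
offset = 0
for name, w in weights.items():
    unpacked[name] = buffer[offset:offset + w.size].reshape(w.shape)
    offset += w.size
```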
The inference side triggers weight reception by calling update_weights:
```python
from vllm.distributed.weight_transfer.base import WeightTransferUpdateRequest

llm.update_weights(
    WeightTransferUpdateRequest(
        update_info=dict(
            names=names,
            dtype_names=dtype_names,
            shapes=shapes,
            packed=True,
        )
    )
)
```
The `names`, `dtype_names`, and `shapes` lists describe each parameter. Their order must match the order in which the trainer iterates over its parameters.
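For example, the metadata lists can be built by walking the trainer's parameters in the same order they will be sent. A sketch using hypothetical stand-in parameters; real code would iterate the torch parameters from `model.named_parameters()`:

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a torch parameter; carries only dtype and shape."""
    dtype: str
    shape: tuple


# Stand-in for model.named_parameters(); shapes here are made up.
named_parameters = [
    ("embed.weight", FakeParam("torch.bfloat16", (8, 16))),
    ("lm_head.weight", FakeParam("torch.bfloat16", (16, 8))),
]

# Build the metadata lists in the exact order the trainer will broadcast.
names = [name for name, _ in named_parameters]
dtype_names = [p.dtype for _, p in named_parameters]
shapes = [p.shape for _, p in named_parameters]

print(names)  # ['embed.weight', 'lm_head.weight']
```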