docs/training/weight_transfer/nccl.md
The NCCL weight transfer engine uses NCCL broadcast operations to transfer weights from the trainer to inference workers. It supports multi-node and multi-GPU setups where the trainer and inference engine run on separate GPUs.
NCCL requires explicit process group setup: the trainer and inference workers must agree on a master address, port, and world size. Under the hood, the engine uses `StatelessProcessGroup`, vLLM's process group abstraction that does not depend on `torch.distributed`.
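The world size counts one rank for the trainer plus one rank per inference worker GPU. A minimal sizing sketch, where `tp_size` and `num_replicas` are hypothetical example values rather than part of the API:

```python
# Hypothetical deployment: one trainer process plus two inference replicas,
# each sharded across 4 GPUs with tensor parallelism.
tp_size = 4       # GPUs per inference replica
num_replicas = 2  # independent inference engine replicas

# One rank for the trainer, plus one rank per inference worker GPU.
world_size = 1 + tp_size * num_replicas
print(world_size)  # 9
```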
On the inference side, initialize the engine with `init_weight_transfer_engine`:

```python
from vllm.distributed.weight_transfer.base import WeightTransferInitRequest

llm.init_weight_transfer_engine(
    WeightTransferInitRequest(
        init_info=dict(
            master_address=master_address,
            master_port=master_port,
            rank_offset=1,  # rank_offset accounts for the trainer occupying rank 0
            world_size=world_size,  # trainer + all inference workers
        )
    )
)
```
On the trainer side, set up the matching process group with `trainer_init`:

```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLWeightTransferEngine,
)

group = NCCLWeightTransferEngine.trainer_init(
    dict(
        master_address=master_address,
        master_port=master_port,
        world_size=world_size,
    )
)
```
!!! note
    `trainer_init` always assigns the trainer to rank 0. Inference workers start at `rank_offset` (typically 1).
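The resulting rank layout can be sketched as follows, where `num_workers` is an assumed example value:

```python
rank_offset = 1   # inference worker ranks start after the trainer
num_workers = 4   # assumed number of inference worker GPUs

trainer_rank = 0  # trainer_init always assigns the trainer rank 0
worker_ranks = [rank_offset + i for i in range(num_workers)]
print(worker_ranks)  # [1, 2, 3, 4]
```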
The trainer then broadcasts its weights with `trainer_send_weights`:

```python
from vllm.distributed.weight_transfer.nccl_engine import (
    NCCLTrainerSendWeightsArgs,
    NCCLWeightTransferEngine,
)

trainer_args = NCCLTrainerSendWeightsArgs(
    group=group,
    packed=True,  # use packed broadcasting for efficiency
)

NCCLWeightTransferEngine.trainer_send_weights(
    iterator=model.named_parameters(),
    trainer_args=trainer_args,
)
```
See `NCCLTrainerSendWeightsArgs` for the full list of configurable fields.
When `packed=True`, multiple weight tensors are packed into large contiguous buffers before broadcasting. This reduces the number of NCCL operations, and the engine uses double or triple buffering with dedicated CUDA streams to overlap packing, broadcasting, and unpacking.

Both the trainer (`NCCLTrainerSendWeightsArgs`) and the inference side (`NCCLWeightTransferUpdateInfo`) must use matching `packed_buffer_size_bytes` and `packed_num_buffers` values.
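Conceptually, packing flattens several tensors into one contiguous buffer so that a single broadcast moves all of them, and the receiver slices them back out in the agreed order. A simplified sketch with NumPy arrays standing in for CUDA tensors; this illustrates the idea only, not the engine's actual implementation:

```python
import numpy as np

# Two "weights" standing in for model parameters.
weights = {
    "layer.0.weight": np.arange(6, dtype=np.float32).reshape(2, 3),
    "layer.1.weight": np.arange(4, dtype=np.float32).reshape(2, 2),
}

# Pack: copy every tensor, flattened, into one contiguous buffer.
total = sum(w.size for w in weights.values())
buffer = np.empty(total, dtype=np.float32)
offset = 0
for w in weights.values():
    buffer[offset:offset + w.size] = w.ravel()
    offset += w.size

# ... a single broadcast of `buffer` now replaces one broadcast per tensor ...

# Unpack: slice each tensor back out using the agreed name/shape order.
unpacked = {}
offset = 0
for name, w in weights.items():
    unpacked[name] = buffer[offset:offset + w.size].reshape(w.shape)
    offset += w.size
```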
The inference side triggers weight reception by calling update_weights:
```python
from vllm.distributed.weight_transfer.base import WeightTransferUpdateRequest

llm.update_weights(
    WeightTransferUpdateRequest(
        update_info=dict(
            names=names,
            dtype_names=dtype_names,
            shapes=shapes,
            packed=True,
        )
    )
)
```
The `names`, `dtype_names`, and `shapes` lists describe each parameter. Their order must match the order in which the trainer iterates over its parameters.
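For example, the metadata lists can be built by walking the trainer's parameters in the same order they will be sent. A sketch using hypothetical stand-in parameters; real code would iterate the torch parameters from `model.named_parameters()`:

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Stand-in for a torch parameter; carries only dtype and shape."""
    dtype: str
    shape: tuple


# Stand-in for model.named_parameters(); shapes here are made up.
named_parameters = [
    ("embed.weight", FakeParam("torch.bfloat16", (8, 16))),
    ("lm_head.weight", FakeParam("torch.bfloat16", (16, 8))),
]

# Build the metadata lists in the exact order the trainer will broadcast.
names = [name for name, _ in named_parameters]
dtype_names = [p.dtype for _, p in named_parameters]
shapes = [p.shape for _, p in named_parameters]

print(names)  # ['embed.weight', 'lm_head.weight']
```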