docs/training/weight_transfer/ipc.md
The IPC weight transfer engine uses CUDA IPC (Inter-Process Communication) handles to share GPU memory directly between the trainer and inference workers running on the same node and the same GPU. Because the inference workers map the trainer's memory directly, no weight data is copied, making IPC an efficient option when colocating training and inference.
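The idea of passing a small handle instead of the data itself can be illustrated with the standard library's POSIX shared memory, which works analogously: only the segment name (the "handle") crosses the process boundary, while both sides view the same underlying memory. This is a CPU-side analogy only, not the actual CUDA IPC mechanism:

```python
from multiprocessing import shared_memory

# Create a buffer and fill it with data (stands in for GPU weights).
src = shared_memory.SharedMemory(create=True, size=8)
src.buf[:8] = bytes(range(8))

# Only the small handle (here: the segment name) is sent to the
# other side -- never the data itself.
handle = src.name

# The receiver attaches to the same memory by handle: a zero-copy view.
dst = shared_memory.SharedMemory(name=handle)
assert bytes(dst.buf[:8]) == bytes(range(8))

dst.close()
src.close()
src.unlink()
```

CUDA IPC follows the same pattern with device memory: the receiving process maps the exporter's allocation into its own address space rather than copying it.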
Handles are produced with `torch.multiprocessing.reductions.reduce_tensor`.

!!! warning
    IPC handles involve sending serialized Python objects. When using HTTP transport, you must set `VLLM_ALLOW_INSECURE_SERIALIZATION=1` on both the server and the client, because IPC handles are pickled and base64-encoded for HTTP transmission.
The IPC backend requires no initialization on either side; the `init_transfer_engine` call is a no-op for IPC.
IPC supports two transport modes for delivering the handles:
**Ray mode** — used when vLLM is running as a Ray actor:
```python
from vllm.distributed.weight_transfer.ipc_engine import (
    IPCTrainerSendWeightsArgs,
    IPCWeightTransferEngine,
)

trainer_args = IPCTrainerSendWeightsArgs(
    mode="ray",
    llm_handle=llm_actor_handle,
)

IPCWeightTransferEngine.trainer_send_weights(
    iterator=model.named_parameters(),
    trainer_args=trainer_args,
)
```
In Ray mode, the engine calls `llm_handle.update_weights.remote(...)` directly, passing the IPC handles via Ray's serialization.
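The `actor.method.remote(...)` call shape can be sketched without a running Ray cluster by standing in a stub for the actor handle. Everything below (`FakeLLMActorHandle`, its method names, the payload shape) is illustrative, not part of the vLLM or Ray API; it only shows the delivery pattern the engine relies on:

```python
# Stub mimicking the shape of a Ray actor handle: actor.method.remote(...)
class _RemoteMethod:
    def __init__(self, fn):
        self._fn = fn

    def remote(self, *args, **kwargs):
        # A real Ray handle would serialize the args (including the IPC
        # handles) and return an ObjectRef; the stub just calls inline.
        return self._fn(*args, **kwargs)


class FakeLLMActorHandle:
    def __init__(self):
        self.received = []
        self.update_weights = _RemoteMethod(self._update_weights)

    def _update_weights(self, handles):
        self.received.append(handles)
        return "ok"


llm_actor_handle = FakeLLMActorHandle()

# Conceptually what the engine does in Ray mode:
result = llm_actor_handle.update_weights.remote({"layer.weight": "<ipc handle>"})
```

With a real handle, Ray's own object serialization carries the pickled IPC handles across the process boundary, so no extra encoding step is needed.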
**HTTP mode** — used when vLLM is running as an HTTP server:
```python
trainer_args = IPCTrainerSendWeightsArgs(
    mode="http",
    url="http://localhost:8000",
)

IPCWeightTransferEngine.trainer_send_weights(
    iterator=model.named_parameters(),
    trainer_args=trainer_args,
)
```
In HTTP mode, IPC handles are pickled, base64-encoded, and sent as JSON to the `/update_weights` endpoint.
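The encoding round trip can be sketched with the standard library alone. The handle tuple and the JSON field names below are illustrative stand-ins (a real handle comes from `torch.multiprocessing.reductions.reduce_tensor`, and the actual request schema is defined by the server):

```python
import base64
import json
import pickle

# Stand-in for an IPC handle tuple; the real object is not JSON-safe,
# which is why pickle + base64 is needed in the first place.
ipc_handle = ("rebuild_func_id", (b"\x00\x01raw-handle-bytes", (1024,)))

# Trainer side: pickle the handle, then base64-encode so it fits in JSON.
encoded = base64.b64encode(pickle.dumps(ipc_handle)).decode("ascii")
payload = json.dumps({"weights": {"layer.weight": encoded}})  # field names illustrative

# Server side: reverse the steps. Unpickling untrusted bytes can execute
# arbitrary code, which is why VLLM_ALLOW_INSECURE_SERIALIZATION must be
# set explicitly to opt in.
decoded = pickle.loads(
    base64.b64decode(json.loads(payload)["weights"]["layer.weight"])
)
assert decoded == ipc_handle
```

The round trip recovers the original handle exactly; the security cost is borne entirely by the `pickle.loads` step on the server.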
See `IPCTrainerSendWeightsArgs` for the full list of configurable fields.