docs/source/en/ddp.md
DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.
┌─────────────────┐
│ training data │
└────────┬────────┘
┌──────────────────┼──────────────────┐
│ shard 0 │ shard 1 │ shard 2
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ model │ │ model │ │ model │
│ (copy 0) │ │ (copy 1) │ │ (copy 2) │
│ GPU 0 │ │ GPU 1 │ │ GPU 2 │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ grads │ grads │ grads
└──────────────────┼──────────────────┘
all-reduce
(average gradients)
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ optimizer │ │ optimizer │ │ optimizer │
│ step │ │ step │ │ step │
└─────────────┘ └─────────────┘ └─────────────┘
(identical) (identical) (identical)
DDP activates automatically when you launch with a multi-process launcher like Accelerate.
# 4 GPUs on one machine
accelerate launch --num_processes 4 train.py
Pass these [TrainingArguments] to control DDP behavior.
~TrainingArguments.gradient_accumulation_steps] determines when to perform the all-reduce. [Trainer] skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with gradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.~TrainingArguments.ddp_find_unused_parameters] traverses the autograd graph at the end of the forward pass for parameters that won't receive a gradient and marks them as ready so they don't block the all-reduce. Don't use with [~TrainingArguments.gradient_checkpointing] because gradient checkpointing discards intermediate activations and recomputes them on the fly.~TrainingArguments.ddp_bucket_cap_mb] is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.~TrainingArguments.ddp_broadcast_buffers] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don't use with [~TrainingArguments.gradient_checkpointing].~TrainingArguments.ddp_backend] sets the communication backend. Use "nccl" for NVIDIA GPUs (default and fastest), "gloo" for CPU training or debugging, and "xccl", "hccl", or "cncl" for other hardware.~TrainingArguments.ddp_timeout] sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.from transformers import TrainingArguments
args = TrainingArguments(
...,
gradient_accumulation_steps=4,
ddp_backend="nccl",
ddp_find_unused_parameters=False,
ddp_bucket_cap_mb=25,
ddp_broadcast_buffers=True,
ddp_timeout=1800,
)