DDP

DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.

text

                         ┌─────────────────┐
                         │  training data  │
                         └────────┬────────┘
               ┌──────────────────┼──────────────────┐
               │ shard 0          │ shard 1          │ shard 2
               ▼                  ▼                  ▼
        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
        │   model     │    │   model     │    │   model     │
        │  (copy 0)   │    │  (copy 1)   │    │  (copy 2)   │
        │   GPU 0     │    │   GPU 1     │    │   GPU 2     │
        └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
               │ grads            │ grads            │ grads
               └──────────────────┼──────────────────┘
                               all-reduce
                          (average gradients)
               ┌──────────────────┼──────────────────┐
               ▼                  ▼                  ▼
        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
        │  optimizer  │    │  optimizer  │    │  optimizer  │
        │    step     │    │    step     │    │    step     │
        └─────────────┘    └─────────────┘    └─────────────┘
          (identical)        (identical)        (identical)

DDP activates automatically when you launch with a multi-process launcher like Accelerate.

cli

# 4 GPUs on one machine
accelerate launch --num_processes 4 train.py

Configure DDP

Pass these [TrainingArguments] to control DDP behavior.

[~TrainingArguments.gradient_accumulation_steps] determines when to perform the all-reduce. [Trainer] skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with gradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.
[~TrainingArguments.ddp_find_unused_parameters] traverses the autograd graph at the end of the forward pass for parameters that won't receive a gradient and marks them as ready so they don't block the all-reduce. Don't use with [~TrainingArguments.gradient_checkpointing] because gradient checkpointing discards intermediate activations and recomputes them on the fly.
[~TrainingArguments.ddp_bucket_cap_mb] is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
[~TrainingArguments.ddp_broadcast_buffers] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don't use with [~TrainingArguments.gradient_checkpointing].
[~TrainingArguments.ddp_backend] sets the communication backend. Use "nccl" for NVIDIA GPUs (default and fastest), "gloo" for CPU training or debugging, and "xccl", "hccl", or "cncl" for other hardware.
[~TrainingArguments.ddp_timeout] sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.

from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    gradient_accumulation_steps=4,
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=25,
    ddp_broadcast_buffers=True,
    ddp_timeout=1800,
)

Next steps

See FSDP for training models too large to fit on a single GPU.
See DeepSpeed for ZeRO optimization and offloading.
Read the Data Parallelism chapter from The Ultra-Scale Playbook for more information about how DDP works.