scientific-skills/pytorch-lightning/references/distributed_training.md
PyTorch Lightning provides several strategies for training large models efficiently across multiple GPUs and nodes. Choose a strategy based on model size and hardware configuration:
- Regular Training (Single Device)
- DDP (Distributed Data Parallel)
- FSDP (Fully Sharded Data Parallel)
- DeepSpeed
# Single GPU
trainer = L.Trainer(accelerator="gpu", devices=1)
# Multi-GPU on single node (automatic DDP)
trainer = L.Trainer(accelerator="gpu", devices=4)
# Explicit DDP strategy
trainer = L.Trainer(strategy="ddp", accelerator="gpu", devices=4)
# On each node, run:
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,      # GPUs per node
    num_nodes=4     # Total number of nodes
)
from lightning.pytorch.strategies import DDPStrategy
trainer = L.Trainer(
    strategy=DDPStrategy(
        process_group_backend="nccl",   # "nccl" for GPU, "gloo" for CPU
        find_unused_parameters=False,   # Set True only if the model has unused parameters
        gradient_as_bucket_view=True    # More memory efficient
    ),
    accelerator="gpu",
    devices=4
)
Use ddp_spawn when ddp causes issues (slower but more compatible):
trainer = L.Trainer(strategy="ddp_spawn", accelerator="gpu", devices=4)
Batch size: the DataLoader batch size is per GPU, so the effective batch size is multiplied by the number of GPUs
# If using 4 GPUs, effective batch size = batch_size * 4
dm = MyDataModule(batch_size=32) # 32 * 4 = 128 effective batch size
Learning rate: Often scaled with batch size
# Linear scaling rule
base_lr = 0.001
num_gpus = 4
lr = base_lr * num_gpus
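Inside a LightningModule, the same rule can be applied automatically by reading the world size from the trainer (a minimal sketch; the base LR of 0.001 and the Adam optimizer are illustrative choices, not prescriptions):
import torch

class MyModel(L.LightningModule):
    def configure_optimizers(self):
        # Linear scaling rule: scale the base LR by the number of processes
        lr = 0.001 * self.trainer.world_size
        return torch.optim.Adam(self.parameters(), lr=lr)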
Synchronization: Use sync_dist=True for metrics
self.log("val_loss", loss, sync_dist=True)
Rank-specific operations: Use decorators for main process only
from lightning.pytorch.utilities import rank_zero_only
@rank_zero_only
def save_results(self):
    # Only runs on the main process (rank 0)
    torch.save(self.results, "results.pt")
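Inside LightningModule hooks you can also branch on the trainer's rank directly (a sketch; the on_validation_end hook and self.results attribute are illustrative):
def on_validation_end(self):
    if self.trainer.is_global_zero:
        # Only the main process writes to disk
        torch.save(self.results, "results.pt")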
trainer = L.Trainer(
    strategy="fsdp",
    accelerator="gpu",
    devices=4
)
from lightning.pytorch.strategies import FSDPStrategy
import torch.nn as nn
trainer = L.Trainer(
    strategy=FSDPStrategy(
        # Sharding strategy
        sharding_strategy="FULL_SHARD",  # or "SHARD_GRAD_OP", "NO_SHARD", "HYBRID_SHARD"
        # Activation checkpointing (save memory)
        activation_checkpointing_policy={nn.TransformerEncoderLayer},
        # CPU offloading (save GPU memory, slower)
        cpu_offload=False,
        # Wrap policy (auto-wrap layers)
        auto_wrap_policy=None
    ),
    accelerator="gpu",
    devices=8,
    precision="bf16-mixed"  # Mixed precision is configured on the Trainer, not the strategy
)
- FULL_SHARD (default): shards parameters, gradients, and optimizer states; most memory-efficient.
- SHARD_GRAD_OP: shards gradients and optimizer states; parameters stay replicated.
- NO_SHARD: no sharding (behaves like DDP).
- HYBRID_SHARD: full sharding within each node, replication across nodes.
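For example, HYBRID_SHARD can reduce inter-node traffic on multi-node runs (a sketch; the devices and num_nodes values are illustrative):
from lightning.pytorch.strategies import FSDPStrategy

# Shard within each node, replicate across nodes (less inter-node communication)
trainer = L.Trainer(
    strategy=FSDPStrategy(sharding_strategy="HYBRID_SHARD"),
    accelerator="gpu",
    devices=8,
    num_nodes=2
)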
Trade computation for memory:
from lightning.pytorch.strategies import FSDPStrategy
import torch.nn as nn
# Checkpoint specific layer types
trainer = L.Trainer(
strategy=FSDPStrategy(
activation_checkpointing_policy={
nn.TransformerEncoderLayer,
nn.TransformerDecoderLayer
}
)
)
Offload parameters to CPU when not in use:
trainer = L.Trainer(
    strategy=FSDPStrategy(
        cpu_offload=True  # Slower but saves GPU memory
    ),
    accelerator="gpu",
    devices=4
)
from lightning.pytorch.strategies import FSDPStrategy
import torch.nn as nn
class LargeTransformer(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.transformer = None

    def configure_model(self):
        # Build the model here so FSDP can shard layers as they are created,
        # instead of materializing the full model in __init__; guard against
        # repeated calls to this hook
        if self.transformer is None:
            self.transformer = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=4096, nhead=32),
                num_layers=48
            )
# Train
trainer = L.Trainer(
    strategy=FSDPStrategy(
        activation_checkpointing_policy={nn.TransformerEncoderLayer},
        cpu_offload=False,
        sharding_strategy="FULL_SHARD"
    ),
    accelerator="gpu",
    devices=8,
    precision="bf16-mixed",
    max_epochs=10
)
model = LargeTransformer()
trainer.fit(model, datamodule=dm)
pip install deepspeed
trainer = L.Trainer(
    strategy="deepspeed_stage_2",  # or "deepspeed_stage_3"
    accelerator="gpu",
    devices=4,
    precision="16-mixed"
)
Stage 1: Optimizer State Sharding
trainer = L.Trainer(strategy="deepspeed_stage_1")
Stage 2: Optimizer + Gradient Sharding
trainer = L.Trainer(strategy="deepspeed_stage_2")
Stage 3: Full Model Sharding (ZeRO-3)
trainer = L.Trainer(strategy="deepspeed_stage_3")
Stage 2 with Offloading
trainer = L.Trainer(strategy="deepspeed_stage_2_offload")
trainer = L.Trainer(strategy="deepspeed_stage_3_offload")
For fine-grained control:
from lightning.pytorch.strategies import DeepSpeedStrategy
# DeepSpeed config as a Python dict (equivalently, save it as ds_config.json)
config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9
    },
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
trainer = L.Trainer(
    strategy=DeepSpeedStrategy(config=config),
    accelerator="gpu",
    devices=8,
    precision="16-mixed"
)
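The config can also live in a JSON file and be passed by path (a sketch assuming the dict above was saved as ds_config.json):
trainer = L.Trainer(
    strategy=DeepSpeedStrategy(config="ds_config.json"),  # Path to the JSON config
    accelerator="gpu",
    devices=8,
    precision="16-mixed"
)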
Tune reduce_bucket_size for communication efficiency.
| Feature | DDP | FSDP | DeepSpeed |
|---|---|---|---|
| Model Size | <500M params | 500M+ params | 500M+ params |
| Memory Efficiency | Low | High | Very High |
| Speed | Fastest | Fast | Fast |
| Setup Complexity | Simple | Medium | Complex |
| Offloading | No | CPU | CPU + Disk |
| Best For | Standard models | Large models | Very large models |
| Configuration | Minimal | Moderate | Extensive |
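As a rough decision rule based on the table (a sketch; the thresholds are heuristics, not hard limits, and model is assumed to be your LightningModule):
def pick_strategy(num_params: int) -> str:
    # Thresholds follow the comparison table above
    if num_params < 500_000_000:
        return "ddp"
    if num_params < 10_000_000_000:
        return "fsdp"  # or "deepspeed_stage_2"
    return "deepspeed_stage_3"

num_params = sum(p.numel() for p in model.parameters())
trainer = L.Trainer(strategy=pick_strategy(num_params), accelerator="gpu", devices=4)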
Use mixed precision to speed up training and save memory:
# FP16 mixed precision
trainer = L.Trainer(precision="16-mixed")
# BFloat16 mixed precision (A100, H100)
trainer = L.Trainer(precision="bf16-mixed")
# Full precision (default)
trainer = L.Trainer(precision="32-true")
# Double precision
trainer = L.Trainer(precision="64-true")
# DDP + FP16
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    precision="16-mixed"
)
# FSDP + BFloat16
trainer = L.Trainer(
    strategy="fsdp",
    accelerator="gpu",
    devices=8,
    precision="bf16-mixed"
)
# DeepSpeed + FP16
trainer = L.Trainer(
    strategy="deepspeed_stage_2",
    accelerator="gpu",
    devices=4,
    precision="16-mixed"
)
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4   # Must match devices (one task per GPU)
#SBATCH --gpus-per-node=4
#SBATCH --time=24:00:00
srun python train.py
# train.py
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    num_nodes=4
)
Node 0 (master):
python train.py --num_nodes=2 --node_rank=0 --master_addr=192.168.1.1 --master_port=12345
Node 1:
python train.py --num_nodes=2 --node_rank=1 --master_addr=192.168.1.1 --master_port=12345
# train.py
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1)
parser.add_argument("--node_rank", type=int, default=0)
parser.add_argument("--master_addr", type=str, default="localhost")
parser.add_argument("--master_port", type=int, default=12345)
args = parser.parse_args()

# Lightning reads the rendezvous info from these environment variables
os.environ["MASTER_ADDR"] = args.master_addr
os.environ["MASTER_PORT"] = str(args.master_port)
os.environ["NODE_RANK"] = str(args.node_rank)

trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    num_nodes=args.num_nodes
)
# Simulate larger batch size
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    accumulate_grad_batches=4  # Effective batch size = batch_size * devices * 4 (e.g. 32 * 4 * 4 = 512)
)
from lightning.pytorch.callbacks import ModelCheckpoint
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    save_top_k=3,
    mode="min"
)
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    callbacks=[checkpoint_callback]
)
import lightning as L
L.seed_everything(42, workers=True)
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    deterministic=True
)
Increase the distributed timeout for slow networks:
from datetime import timedelta
from lightning.pytorch.strategies import DDPStrategy

trainer = L.Trainer(
    strategy=DDPStrategy(timeout=timedelta(hours=1)),
    accelerator="gpu",
    devices=4
)
Solutions:
# Option 1: Gradient checkpointing
class MyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyTransformer()
        # gradient_checkpointing_enable() is a Hugging Face Transformers method;
        # for plain PyTorch modules use torch.utils.checkpoint instead
        self.model.gradient_checkpointing_enable()

# Option 2: Smaller batch size
dm = MyDataModule(batch_size=16)  # Reduce from 32

# Option 3: FSDP with offloading
trainer = L.Trainer(
    strategy=FSDPStrategy(cpu_offload=True),
    precision="bf16-mixed"
)

# Option 4: Gradient accumulation
trainer = L.Trainer(accumulate_grad_batches=4)
Lightning handles DistributedSampler automatically:
# Don't do this
from torch.utils.data import DistributedSampler
sampler = DistributedSampler(dataset) # Lightning does this automatically
# Just use shuffle
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
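If you do need a custom sampler, tell Lightning not to inject its own (a sketch showing only the Trainer flag):
# Keep your own sampler; Lightning will not wrap or replace it
trainer = L.Trainer(
    strategy="ddp",
    accelerator="gpu",
    devices=4,
    use_distributed_sampler=False
)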
Reduce DDP overhead by keeping find_unused_parameters disabled (enable it only if your model genuinely has unused parameters):
trainer = L.Trainer(
    strategy=DDPStrategy(find_unused_parameters=False),
    accelerator="gpu",
    devices=4
)
Test your code on single GPU before scaling:
# Debug on single GPU
trainer = L.Trainer(accelerator="gpu", devices=1, fast_dev_run=True)
# Then scale to multiple GPUs
trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp")
Choose a strategy by model size:
- Under ~500M params: DDP
- 500M-10B params: FSDP or DeepSpeed Stage 2
- Over 10B params: DeepSpeed Stage 3
Always use mixed precision for modern GPUs:
trainer = L.Trainer(precision="bf16-mixed") # A100, H100
trainer = L.Trainer(precision="16-mixed") # V100, T4
Adjust learning rate and batch size when scaling:
# Linear scaling rule
lr = base_lr * num_gpus
Always sync metrics in distributed training:
self.log("val_loss", loss, sync_dist=True)
File I/O and expensive operations on main process only:
from lightning.pytorch.utilities import rank_zero_only
@rank_zero_only
def save_predictions(self):
    torch.save(self.predictions, "predictions.pt")
Save checkpoints to resume from failures:
checkpoint_callback = ModelCheckpoint(
    save_top_k=3,
    save_last=True,   # Always save "last" so training can resume
    every_n_epochs=5
)