docs/user-guide/parallelism-guide.md
Megatron Core supports multiple parallelism strategies that can be combined to efficiently train models from billions to trillions of parameters across thousands of GPUs.
| Strategy | What it parallelizes | Best for |
|---|---|---|
| Data Parallelism (DP) | Batch dimension | Standard training, most common |
| Tensor Parallelism (TP) | Individual layers | Large layers, GPU memory constraints |
| Pipeline Parallelism (PP) | Model depth | Very deep models |
| Context Parallelism (CP) | Sequence length | Long sequences (8K+ tokens) |
| Expert Parallelism (EP) | MoE experts | Mixture-of-Experts models |
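Each row of this table corresponds to one dimension of the process-group grid that Megatron Core builds at startup. Below is a minimal sketch of that mapping, assuming a `torchrun` launch and a recent `megatron.core` release; the sizes are illustrative, not a recommended configuration.

```python
# Sketch: mapping the parallelism dimensions onto Megatron Core process groups.
# Assumes torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; sizes are illustrative.
import os
import torch
from megatron.core import parallel_state

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
torch.distributed.init_process_group(backend="nccl")

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=4,    # TP: split individual layers
    pipeline_model_parallel_size=4,  # PP: split model depth into stages
    context_parallel_size=2,         # CP: split the sequence dimension
    expert_model_parallel_size=1,    # EP: split MoE experts (1 = dense model)
)

# Data parallelism (DP) takes whatever is left of the world size:
# DP = WORLD_SIZE // (TP * PP * CP)
print("DP size:", parallel_state.get_data_parallel_world_size())
```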
Replicate the model across GPUs and split the batch.
torchrun --nproc_per_node=8 pretrain_gpt.py \
--data-parallel-sharding-strategy no_shard
Each GPU has a full copy of the model and processes a portion of the batch.
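A small arithmetic sketch with hypothetical numbers shows how the global batch decomposes into per-rank samples and gradient-accumulation steps under pure data parallelism:

```python
# Hypothetical numbers: how a global batch is split under data parallelism.
dp_size = 8              # number of model replicas (one per GPU here)
global_batch_size = 512  # samples per optimizer step, across all replicas
micro_batch_size = 4     # samples per forward/backward pass on one GPU

samples_per_rank = global_batch_size // dp_size           # 64 samples per GPU
grad_accum_steps = samples_per_rank // micro_batch_size   # 16 micro-batches per step

print(f"{samples_per_rank=} {grad_accum_steps=}")
```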
Shard model parameters, gradients, and optimizer states to reduce memory:
# Megatron FSDP (~15% faster than PyTorch FSDP2)
--use-megatron-fsdp \
--data-parallel-sharding-strategy optim_grads_params
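The sharding strategies listed next trade a little communication for a lot of memory. A rough, illustrative ZeRO-style accounting sketch (bf16 parameters and gradients, fp32 Adam states; not measured numbers) shows how much each level saves per GPU:

```python
# Rough per-GPU memory for weights / grads / optimizer under each sharding level.
# ZeRO-style accounting: bf16 params (2 B), bf16 grads (2 B), fp32 Adam states
# (master params + momentum + variance = 12 B). Illustrative only.
params = 8e9      # e.g. an 8B-parameter model
dp = 8            # data-parallel (sharding) group size
GB = 1024 ** 3

def per_gpu_gb(shard_optim=False, shard_grads=False, shard_params=False):
    optim = 12 * params / (dp if shard_optim else 1)
    grads = 2 * params / (dp if shard_grads else 1)
    weights = 2 * params / (dp if shard_params else 1)
    return (optim + grads + weights) / GB

print(f"no_shard:           {per_gpu_gb():.1f} GB")
print(f"optim (ZeRO-1):     {per_gpu_gb(True):.1f} GB")
print(f"optim_grads (Z-2):  {per_gpu_gb(True, True):.1f} GB")
print(f"optim_grads_params: {per_gpu_gb(True, True, True):.1f} GB")
```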
Sharding strategies:
- `optim` - Shard optimizer states only (ZeRO-1)
- `optim_grads` - Shard gradients + optimizer states (ZeRO-2)
- `optim_grads_params` - Shard parameters + gradients + optimizer states (ZeRO-3)

Split individual model layers across GPUs. Recommended for large hidden dimensions.
--tensor-model-parallel-size 4 # 4-way tensor parallelism
--sequence-parallel # Enable sequence parallelism (recommended)
When to use:
- Individual layers are too large for a single GPU's memory
- Models with very large hidden dimensions
- GPUs connected by a fast intra-node interconnect (e.g., NVLink), since TP communicates at every layer
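The sketch below shows the core idea in plain PyTorch on a single device: the first MLP weight matrix is split by columns and the second by rows, so each TP rank computes a partial result and one all-reduce (simulated here by a sum over ranks) restores the full output. This is a conceptual sketch, not Megatron's implementation.

```python
# Plain-PyTorch sketch of Megatron-style tensor parallelism for a 2-layer MLP.
# Ranks are simulated with a loop on one device; in real TP each slice lives on
# its own GPU and the final sum is an all-reduce across the TP group.
import torch

torch.manual_seed(0)
tp, hidden, ffn = 4, 64, 256
x = torch.randn(8, hidden)            # [tokens, hidden]
w1 = torch.randn(hidden, ffn)         # column-parallel: split the output dim
w2 = torch.randn(ffn, hidden)         # row-parallel: split the input dim

reference = torch.relu(x @ w1) @ w2   # unsplit computation

partial_sum = torch.zeros(8, hidden)
for rank in range(tp):
    cols = slice(rank * ffn // tp, (rank + 1) * ffn // tp)
    h_local = torch.relu(x @ w1[:, cols])   # each rank holds ffn/tp columns of w1
    partial_sum += h_local @ w2[cols, :]    # ... and the matching rows of w2

# Should print True: the summed partial results match the unsplit MLP.
print(torch.allclose(reference, partial_sum, rtol=1e-4, atol=1e-3))
```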
Split model layers across GPUs vertically (by depth).
--pipeline-model-parallel-size 8 # 8 pipeline stages
--num-layers-per-virtual-pipeline-stage 4 # Virtual pipeline for load balancing
When to use:
- Very deep models that do not fit on a single GPU even with tensor parallelism
- Scaling across nodes, since pipeline communication happens only at stage boundaries and tolerates slower interconnects
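The virtual-pipeline flag above exists to shrink the pipeline "bubble", the idle time stages spend waiting for work. A quick sketch of the usual bubble-fraction estimate for Megatron's 1F1B and interleaved schedules, with illustrative numbers:

```python
# Pipeline bubble estimate for Megatron's 1F1B / interleaved schedules.
# p: pipeline stages, m: micro-batches per step, v: virtual stages per GPU.
def bubble_fraction(p, m, v=1):
    # Idle time relative to useful compute; interleaving divides it by v.
    return (p - 1) / (v * m)

p = 8                               # --pipeline-model-parallel-size 8
m = 64                              # micro-batches flowing through the pipeline
print(bubble_fraction(p, m))        # ~0.109 without interleaving
print(bubble_fraction(p, m, v=4))   # ~0.027 with 4 virtual stages per GPU
```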
Split long sequences across GPUs for efficient long-context training.
--context-parallel-size 2 # 2-way context parallelism
--cp-comm-type p2p # Communication type
When to use:
- Long sequences (8K+ tokens), where attention activation memory becomes the bottleneck
- Long-context pretraining or fine-tuning workloads
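As a mental model, context parallelism gives each CP rank a slice of the token dimension. The sketch below uses a plain even split to show the shapes involved; Megatron's causal-attention load balancing assigns chunks in a more elaborate pattern than this.

```python
# Plain-PyTorch sketch of how context parallelism slices activations along the
# sequence dimension. Even split shown for clarity.
import torch

cp_size = 2
seq_len, batch, hidden = 8192, 1, 8192
activations = torch.empty(seq_len, batch, hidden)   # full sequence: [s, b, h]

shards = torch.chunk(activations, cp_size, dim=0)   # one shard per CP rank
for rank, shard in enumerate(shards):
    print(f"CP rank {rank}: {tuple(shard.shape)}")  # (4096, 1, 8192) each
```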
→ Context Parallelism Deep Dive - Detailed guide with performance analysis
Distribute experts across GPUs in Mixture-of-Experts models.
--expert-model-parallel-size 8 # 8-way expert parallelism
--num-experts 64 # 64 experts per MoE layer
--moe-grouped-gemm # Optimize expert computation
Important: When combining EP with TP, you must enable Sequence Parallelism:
--tensor-model-parallel-size 4
--expert-model-parallel-size 8
--sequence-parallel # Required when using TP + EP
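Conceptually, the router picks the top-k experts for each token, and expert parallelism places different experts on different GPUs, so tokens are exchanged in an all-to-all before and after the expert MLPs. Below is a plain-PyTorch routing sketch with hypothetical sizes; it is not Megatron's router code.

```python
# Plain-PyTorch sketch of top-k token routing for expert parallelism.
# Hypothetical sizes: 64 experts spread over 8 EP ranks (8 experts per rank).
import torch

torch.manual_seed(0)
num_experts, ep_size, top_k = 64, 8, 2
experts_per_rank = num_experts // ep_size
tokens = torch.randn(16, 1024)                     # [tokens, hidden]
router = torch.randn(1024, num_experts)            # router weight

logits = tokens @ router
topk_experts = logits.topk(top_k, dim=-1).indices  # expert ids per token
dest_ranks = topk_experts // experts_per_rank      # EP rank that owns each expert

# In real EP these counts drive the all-to-all that moves tokens to their experts.
for rank in range(ep_size):
    print(f"EP rank {rank}: {(dest_ranks == rank).sum().item()} token-expert pairs")
```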
Recommended configurations based on NVIDIA NeMo production setups. The first table covers dense models; the second covers Mixture-of-Experts models:
| Model | Size | GPUs | TP | PP | CP | EP | Configuration Notes |
|---|---|---|---|---|---|---|---|
| LLaMA-3 | 8B | 8 | 1 | 1 | 2 | 1 | CP=2 for long context (8K seqlen) |
| LLaMA-3 | 70B | 64 | 4 | 4 | 2 | 1 | Balanced TP+PP for 70B scale |
| LLaMA-3.1 | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism (TP+PP+CP) |
| GPT-3 | 175B | 128-512 | 4 | 8 | 1 | 1 | Standard large model config |
| Model | Size | GPUs | TP | PP | CP | EP | Configuration Notes |
|---|---|---|---|---|---|---|---|
| Mixtral | 8x7B | 64 | 1 | 4 | 1 | 8 | EP=8 for 8 experts |
| Mixtral | 8x22B | 256 | 4 | 4 | 1 | 8 | TP+PP+EP for large MoE |
| DeepSeek-V3 | 671B | 1024 | 2 | 16 | 1 | 64 | Massive MoE with 256 experts |
The total number of GPUs is calculated as:
Total GPUs = TP × PP × CP × EP × DP
# TP=4, PP=4, CP=2, DP=2 => 4 × 4 × 2 × 2 = 64 GPUs
torchrun --nnodes=8 --nproc_per_node=8 pretrain_gpt.py \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 4 \
--context-parallel-size 2 \
--num-layers 80 \
--hidden-size 8192 \
--num-attention-heads 64 \
--seq-length 8192 \
--micro-batch-size 1 \
--global-batch-size 512 \
--bf16
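A configuration like the one above can be sanity-checked with a few lines of arithmetic. This sketch derives the implied data-parallel size and gradient-accumulation steps from the flags, using the GPU formula above (EP = 1 for this dense model):

```python
# Sanity-check the example configuration above (EP = 1 for a dense model).
total_gpus = 64
tp, pp, cp, ep = 4, 4, 2, 1
micro_batch_size, global_batch_size = 1, 512

dp = total_gpus // (tp * pp * cp * ep)                      # 2
grad_accum = global_batch_size // (micro_batch_size * dp)   # 256 micro-batches/step

assert tp * pp * cp * ep * dp == total_gpus
print(f"{dp=} {grad_accum=}")
```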
Enable overlapping of communication with computation:
--overlap-grad-reduce # Overlap gradient reduction with backward pass
--overlap-param-gather # Overlap parameter gathering with forward pass
--tp-comm-overlap # Overlap TP communication
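These flags change scheduling rather than math: collectives are launched asynchronously so the GPU keeps computing while gradients are in flight. Here is a generic sketch of that pattern in plain `torch.distributed`; it is not Megatron's actual implementation, and the function and its arguments are hypothetical.

```python
# Generic overlap pattern (not Megatron's internals): launch the gradient
# all-reduce asynchronously and keep computing while it runs in the background.
import torch
import torch.distributed as dist

def reduce_bucket_with_overlap(grad_bucket, compute_next_layer_backward):
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)
    compute_next_layer_backward()          # overlap: next layer's backward runs now
    handle.wait()                          # gradients ready before optimizer.step()
    grad_bucket /= dist.get_world_size()   # average after the asynchronous sum
```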
Recommended for all multi-GPU training:
--use-distributed-optimizer
Benefits:
- Shards optimizer states across data-parallel ranks, substantially reducing per-GPU memory
- Negligible throughput cost compared to a fully replicated optimizer
Always enable when using TP:
--sequence-parallel
Reduces activation memory by sharding the sequence dimension in the LayerNorm and Dropout regions that tensor parallelism would otherwise replicate on every TP rank.
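A quick shape sketch of what that sharding buys, with illustrative sizes: in the LayerNorm and Dropout regions, each TP rank keeps only `seq_len / TP` of the activation instead of a full replica.

```python
# Illustrative activation-memory comparison for sequence parallelism, covering
# the regions that TP would otherwise replicate (LayerNorm and Dropout).
seq_len, batch, hidden, tp = 8192, 1, 8192, 4
bytes_per_elem = 2                            # bf16

full_copy = seq_len * batch * hidden * bytes_per_elem
sp_shard = full_copy // tp                    # each rank holds seq_len / TP

print(f"without SP: {full_copy / 1024**2:.0f} MiB per rank (replicated)")
print(f"with SP:    {sp_shard / 1024**2:.0f} MiB per rank (sharded along sequence)")
```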