docs/_tutorials/autotp-training.md
This tutorial covers Automatic Tensor Parallelism for combining tensor parallelism with ZeRO optimization during training. For inference-only tensor parallelism, see Automatic Tensor Parallelism (Inference).
The AutoTP Training API enables hybrid parallelism by combining tensor parallelism (TP) with ZeRO-powered data parallelism. TP splits the computations and parameters of large layers across multiple GPUs so that each rank holds only a shard of each weight matrix. This reduces per-GPU memory pressure while keeping the layer math distributed across the TP group, making it an efficient way to train large-scale transformer models.
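Conceptually, column-parallel sharding of a linear layer can be sketched in plain PyTorch. This is an illustrative sketch only (DeepSpeed performs the sharding and communication internally): each rank multiplies by its slice of the weight, and concatenating the partial outputs recovers the full result.

```python
import torch

torch.manual_seed(0)
tp_size = 4
hidden, out_features = 16, 32

# Full weight of a linear layer: [out_features, hidden]
weight = torch.randn(out_features, hidden)
x = torch.randn(2, hidden)

# Column parallelism: split the output dimension across TP ranks.
# Each "rank" computes a slice of the output.
shards = torch.chunk(weight, tp_size, dim=0)
partial = [x @ w.t() for w in shards]

# Concatenating the per-rank partial outputs recovers the full matmul.
full = torch.cat(partial, dim=-1)
assert torch.allclose(full, x @ weight.t(), atol=1e-5)
```

Row parallelism works the other way around: the input dimension is split, and the per-rank partial products are summed (an all-reduce in a real TP group).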
AutoTP training can be enabled entirely through the DeepSpeed config. When `tensor_parallel` is set in the config, `deepspeed.initialize(...)` applies AutoTP sharding during engine initialization, so the training loop itself does not change.
```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# 1. Create your model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# 2. Define the DeepSpeed config with tensor_parallel settings
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "tensor_parallel": {"autotp_size": 4},
}

# 3. Initialize DeepSpeed with AutoTP + ZeRO
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    mpu=mpu,  # model parallel unit (optional if you provide tp_group elsewhere)
)

# 4. Train as usual
for batch in dataloader:
    outputs = engine(input_ids=batch["input_ids"], labels=batch["labels"])
    engine.backward(outputs.loss)
    engine.step()
```
**Compatibility note:** For backward compatibility, you can still call `set_autotp_mode(training=True)` and `deepspeed.tp_model_init(...)`, but they are not required when the DeepSpeed config provides the necessary `tensor_parallel` settings.
If your model matches a built-in preset, set `tensor_parallel.preset_model` in the DeepSpeed config:

```json
{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 1,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "tensor_parallel": {
    "autotp_size": 4,
    "preset_model": "llama"
  }
}
```
For the list of available presets, see supported models.
Many HuggingFace models (e.g. Llama, Qwen, Gemma2) ship with a built-in `base_model_tp_plan` in their model config that describes how each layer should be partitioned for tensor parallelism. DeepSpeed can automatically detect and use this plan, so you do not need to configure `preset_model` or `partition_config` for these models.
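For intuition, such a plan is a mapping from module-name patterns to partition styles. The entries below are a hypothetical example in the shape of a Llama-style plan, not copied from any specific model's config:

```python
# Hypothetical example of what a HuggingFace base_model_tp_plan looks like:
# module-name patterns mapped to partition styles.
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
```

The typical pattern is visible here: projections that expand into attention heads or the MLP intermediate dimension are column-parallel, while the projections that reduce back to the hidden dimension are row-parallel.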
When `tensor_parallel` is set in the DeepSpeed config, the initialization follows this priority:

1. `partition_config` (highest): user-defined regex patterns.
2. `tp_plan`: automatically extracted from `model._tp_plan` or `model.config.base_model_tp_plan`.

For models that define a `tp_plan`, you only need a minimal config:
```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": { "stage": 2 },
  "bf16": { "enabled": true },
  "tensor_parallel": { "autotp_size": 4 }
}
```
DeepSpeed will read the model's `tp_plan` at initialization and convert it to internal partition rules. Currently the `colwise` and `rowwise` partition types are supported. Additional types defined by HuggingFace (such as `colwise_rep`, `local_colwise`, `local_rowwise`, etc.) are not yet handled and will raise an error if encountered.

If you need to override the model's built-in `tp_plan`, provide a `partition_config` in the DeepSpeed config; it takes precedence.
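The conversion step can be sketched as follows. The helper name and rule format here are hypothetical, chosen only to illustrate the documented behavior (map `colwise`/`rowwise`, error on anything else):

```python
# Hypothetical sketch of converting HuggingFace tp_plan styles into
# internal partition rules; unsupported styles raise an error.
SUPPORTED_STYLES = {
    "colwise": "column",
    "rowwise": "row",
}

def convert_tp_plan(tp_plan: dict) -> dict:
    rules = {}
    for pattern, style in tp_plan.items():
        if style not in SUPPORTED_STYLES:
            raise ValueError(
                f"Unsupported tp_plan style {style!r} for {pattern!r}"
            )
        rules[pattern] = SUPPORTED_STYLES[style]
    return rules

rules = convert_tp_plan({
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
})
```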
If you are training a custom model, define regex-based patterns and partition rules in `tensor_parallel.partition_config`:

```json
{
  "tensor_parallel": {
    "autotp_size": 4,
    "partition_config": {
      "use_default_specs": false,
      "layer_specs": [
        {
          "patterns": [".*\\.o_proj\\.weight$", ".*\\.down_proj\\.weight$"],
          "partition_type": "row"
        },
        {
          "patterns": [".*\\.[qkv]_proj\\.weight$"],
          "partition_type": "column"
        },
        {
          "patterns": [".*\\.gate_up_proj\\.weight$"],
          "partition_type": "column",
          "shape": [2, -1],
          "partition_dim": 0
        }
      ]
    }
  }
}
```
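To see how regex patterns like these select parameters, here is a small matching sketch. The matcher function is hypothetical (for illustration only), but the patterns are the same ones from the config above:

```python
import re

layer_specs = [
    {"patterns": [r".*\.o_proj\.weight$", r".*\.down_proj\.weight$"],
     "partition_type": "row"},
    {"patterns": [r".*\.[qkv]_proj\.weight$"],
     "partition_type": "column"},
]

def find_partition_type(param_name):
    # First matching spec wins.
    for spec in layer_specs:
        if any(re.match(p, param_name) for p in spec["patterns"]):
            return spec["partition_type"]
    return None  # no match: the parameter is replicated, not sharded

assert find_partition_type("model.layers.0.self_attn.q_proj.weight") == "column"
assert find_partition_type("model.layers.0.self_attn.o_proj.weight") == "row"
assert find_partition_type("model.embed_tokens.weight") is None
```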
For Grouped Query Attention with different Q/K/V sizes (replace `q_size` and `kv_size` with your model's actual dimensions):

```json
{
  "tensor_parallel": {
    "partition_config": {
      "layer_specs": [
        {
          "patterns": [".*\\.qkv_proj\\.weight$"],
          "partition_type": "column",
          "shape": [[q_size, kv_size, kv_size], -1],
          "partition_dim": 0
        }
      ]
    }
  }
}
```
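To see why the segmented shape matters, here is an arithmetic sketch of splitting a fused QKV weight along dimension 0 so that each rank gets its slice of Q, K, and V separately (example dimensions chosen for illustration; this is not DeepSpeed's internal code):

```python
import torch

tp_size = 4
hidden = 4096
q_size, kv_size = 4096, 1024  # e.g. 32 query heads, 8 KV heads, head_dim 128

# Fused QKV weight: [q_size + 2 * kv_size, hidden]
qkv = torch.randn(q_size + 2 * kv_size, hidden)

# Split each logical segment (Q, K, V) separately along dim 0,
# then each rank keeps its slice of every segment. A naive equal
# chunking of the fused matrix would mix Q rows into K/V shards.
q, k, v = torch.split(qkv, [q_size, kv_size, kv_size], dim=0)
rank_shards = [
    torch.cat([
        torch.chunk(q, tp_size, dim=0)[r],
        torch.chunk(k, tp_size, dim=0)[r],
        torch.chunk(v, tp_size, dim=0)[r],
    ], dim=0)
    for r in range(tp_size)
]

# Each rank holds (q_size + 2 * kv_size) / tp_size rows.
assert rank_shards[0].shape == ((q_size + 2 * kv_size) // tp_size, hidden)
```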
- **ZeRO Stage 3 not supported:** AutoTP currently works only with ZeRO stages 0, 1, and 2.
- **TP size must divide model dimensions:** The tensor parallel size must evenly divide the attention head count and hidden dimensions.
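A quick divisibility check you can run before launching. The dimensions below are Llama-3.1-8B's (32 attention heads, 8 KV heads, hidden size 4096); substitute your own model's values:

```python
num_attention_heads = 32   # Llama-3.1-8B
num_key_value_heads = 8
hidden_size = 4096
tp_size = 4

# tensor_parallel.autotp_size must evenly divide these dimensions.
for name, dim in [("attention heads", num_attention_heads),
                  ("KV heads", num_key_value_heads),
                  ("hidden size", hidden_size)]:
    assert dim % tp_size == 0, f"tp_size={tp_size} does not divide {name} ({dim})"
```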