docs/code-docs/source/autoep.rst
AutoEP automatically detects MoE layers in Hugging Face models and replaces them with EP-enabled versions, requiring zero model code changes. It follows the pattern of AutoTP (Automatic Tensor Parallelism).
This API is separate from the explicit deepspeed.moe.layer.MoE layer API.
For the explicit DeepSpeed MoE layer API, see :doc:moe.
Built-in AutoEP presets: mixtral (Mixtral), qwen3_moe (Qwen3-MoE),
qwen3_5_moe (Qwen3.5-MoE), deepseek_v2 (DeepSeek-V2), and
deepseek_v3 (DeepSeek-V3).
The preset name means AutoEP knows the router, expert, and weight naming
patterns for that model family. Running a Hugging Face model also requires a
Transformers build that exposes the matching config/model classes,
model.config.model_type value, and fused expert layout.
.. list-table:: AutoEP preset compatibility by Transformers version :header-rows: 1
mixtral5.0.0qwen3_moe5.0.04.51.3,
but the tested 4.x builds do not match the validated AutoEP layout.qwen3_5_moe5.2.0qwen3_5_moe_text model type;
for performance on Qwen3.5's Gated DeltaNet layers, install optimized
kernels. See the Hugging Face Transformers kernel loading docs <https://huggingface.co/docs/transformers/kernel_doc/loading_kernels>__
and the Qwen FlashQLA blog <https://qwen.ai/blog?id=flashqla>__.deepseek_v25.0.0load_balance_coeff / expert-bias auxiliary-loss-free load balancing
is not currently supported; non-null values are rejected.deepseek_v35.0.0load_balance_coeff / expert-bias auxiliary-loss-free load balancing
is not currently supported; non-null values are rejected.ZeRO compatibility: Stages 0, 1, and 2. Stage 3 is not supported.
Usage:
.. code-block:: json
{
"expert_parallel": {
"enabled": true,
"autoep_size": 4,
"preset_model": "mixtral"
}
}
How it works:
deepspeed.initialize(), AutoEP scans the model for MoE layers
using preset-defined patterns (router name, expert name, weight shapes).AutoEPMoELayer, which uses
TorchTitan's grouped GEMM kernels and AllToAll token dispatch.autoep_size.Constraints:
autoep_size must divide num_experts for all detected MoE layers.autoep_size=1 is valid: all experts remain local (no AllToAll), useful
for functional testing on a single GPU.tensor_parallel.autotp_size > 1) or tensor model parallelism from
mpu; support is planned as follow-up work.autoep_size.
To change autoep_size across runs for the same AutoEP-detected model
topology, convert the checkpoint to Universal Checkpoint format and load it
with checkpoint.load_universal; see the
Universal Checkpointing tutorial </tutorials/universal-checkpointing/>__
for the detailed flow and constraints.