Experts backends

All Mixture-of-Experts (MoE) implementations perform the same high-level computation. For each token, a router selects k experts. The token hidden state is then projected through the selected experts' parameters and aggregated with routing weights. The difference between experts backends is how those expert matrix multiplications execute.

The [ExpertsInterface] provides optimized experts backends. It decouples the experts implementation from the model code to simplify experimentation with different functions. Add new backends through the same interface.

experts backend	description	GPU	CPU
`"eager"`	Reference implementation that loops over selected experts and applies projections on their tokens.	Reasonable baseline performance without requiring compilation.	Slower than `grouped_mm` but faster than `batched_mm`.
`"batched_mm"`	Duplicates selected expert parameters for each token and projects all tokens in a single batched GEMM using torch.bmm.	Fastest for small inputs, especially with compilation. Uses more memory due to parameter duplication.	Not recommended (significantly slower than other backends).
`"grouped_mm"`	Orders tokens by selected experts and uses torch.nn.functional.grouped_mm to project all tokens in a single grouped GEMM (requires PyTorch 2.9+).	Best for larger inputs and more memory efficient as it avoids duplicating expert parameters. Fast with compilation.	Most efficient backend for all input sizes.
`"deepgemm"`	Sorts tokens by selected expert and projects all tokens in a single TMA-aligned grouped GEMM using the DeepGEMM kernels from kernels-community/deep-gemm.	Native backend for DeepSeek models on Hopper (SM90+) and Blackwell (SM100+); supports `bfloat16` and FP8/FP4-quantized experts.	Not supported (CUDA-only).
`"deepgemm_megamoe"`	Fuses expert-parallel dispatch, the gated MLP (up projection, SwiGLU, down projection), and the EP combine into a single DeepGEMM Mega MoE kernel, overlapping NVLink transfers with tensor-core compute.	Blackwell (SM100+) only, for FP4-quantized experts run with expert parallelism.	Not supported (CUDA-only).
`"sonicmoe"`	Fuses the routed `bfloat16` MoE forward (router dispatch, gated up projection, activation, down projection) into CuteDSL grouped-GEMM kernels (from the quack library) from kernels-community/sonic-moe.	State-of-the-art throughput on Hopper (SM90+) for `bfloat16` experts with a gated activation (SwiGLU/GeGLU/ReGLU), especially for training.	Not supported (CUDA-only).

The "batched_mm" and "grouped_mm" backends also run FP8 and FP4 (int8-packed) quantized experts through the Triton finegrained-fp8 kernel, reading either float32 or UE8M0 scales. They act as the fallback for quantized checkpoints when the "deepgemm" backend is unavailable.

[!NOTE] When using experts_implementation="grouped_mm" on GPU, the model automatically switches to "batched_mm" during the decode stage of generation (after prefill). This is because batched_mm is significantly faster on lower token count during autoregressive decoding on GPU. On CPU, grouped_mm remains active throughout generation as it is more efficient for all input sizes.

Set an experts backend

Use the experts_implementation argument in [~PreTrainedModel.from_pretrained] to instantiate a model with a specific experts backend.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    dtype="bfloat16",
    experts_implementation="batched_mm",
)

Switch between experts backends at runtime without reloading the model using [~PreTrainedModel.set_experts_implementation].

model.set_experts_implementation("eager")

Backbone-specific experts backend

Multimodal models can have multiple sub-configs (for example, different backbones). You can set a different experts backend per sub-config by passing a dict to experts_implementation at load time.

Keys in the mapping must match sub-config names.

from transformers import AutoModelForImageTextToText

experts_implementation_per_backbone = {
    "text_config": "grouped_mm",
    "vision_config": "eager",
}

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-Moe",
    experts_implementation=experts_implementation_per_backbone,
)

Set the experts backend globally with an empty key.

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    experts_implementation={"": "batched_mm"},
)

DeepGEMM

The "deepgemm" backend routes expert matmuls through the DeepGEMM kernels distributed by kernels-community/deep-gemm. It works with unquantized bfloat16 experts and with FP8/FP4-quantized experts loaded through Fine-grained FP8.

The "deepgemm" backend requires:

CUDA GPU with compute capability ≥ 9.0 (Hopper or newer).
CUDA runtime 12.3 or later on Hopper, 12.9 or later on Blackwell.
nvcc/nvrtc available on the system for the kernel's JIT compilation.
The kernels package.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    dtype="bfloat16",
    experts_implementation="deepgemm",
)

The kernel is loaded lazily on the first forward.

FP8 and FP4 quantized experts

DeepSeek-style checkpoints are usually pre-quantized and carry their own quantization config, so you don't need to pass a [FineGrainedFP8Config]. The "deepgemm" backend automatically picks the FP8 (or FP4 on Blackwell) grouped-GEMM kernel. DeepGEMM requires dynamic per-row activation scales (activation_scheme="dynamic") and rejects static (per-tensor) activation quantization.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    experts_implementation="deepgemm",
)

For FP4-packed expert weights (DeepSeek V4-style), the GPU must be SM100+ (Blackwell). The checkpoint config typically sets expert_dtype="fp4" and scale_fmt="ue8m0".

The main reason to pass a [FineGrainedFP8Config] for a pre-quantized checkpoint is to dequantize it back to bfloat16, in which case the experts run in bfloat16 rather than on the FP8/FP4 DeepGEMM path.

from transformers import AutoModelForCausalLM, FineGrainedFP8Config

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3",
    quantization_config=FineGrainedFP8Config(dequantize=True),
    experts_implementation="deepgemm",
)

Fused Mega MoE on Blackwell

On Blackwell (SM100+), set experts_implementation="deepgemm_megamoe" to run a single fused kernel that combines expert-parallel dispatch, the up projection, SwiGLU, the down projection, and the EP combine, overlapping NVLink transfers with tensor-core compute.

This backend requires:

A Blackwell GPU (compute capability ≥ 10.0) with CUDA runtime 12.9 or later.
FP4-packed expert weights paired with UE8M0 weight scales (the pre-quantized checkpoint typically declares expert_dtype="fp4" and scale_fmt="ue8m0" in its config).
A torch.distributed process group for the expert-parallel group, which the tensor-parallel wrapping supplies automatically.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4",
    experts_implementation="deepgemm_megamoe",
    tp_plan="auto",
)

SonicMoE

The "sonicmoe" backend fuses the routed MoE forward (dispatch, gated up projection, activation, down projection) into a set of highly optimized CuteDSL grouped-GEMM kernels, built on the quack library and distributed by kernels-community/sonic-moe.

The "sonicmoe" backend requires:

CUDA GPU with compute capability ≥ 9.0 (Hopper or newer).
The kernels package and the nvidia-cutlass-dsl package.
Experts with a gated activation (silu, gelu, or relu, mapped to SwiGLU/GeGLU/ReGLU).

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    dtype="bfloat16",
    experts_implementation="sonicmoe",
)

If the requirements aren't met, the forward raises ImportError and you should pick a different experts_implementation.

torch.compile

The "eager", "batched_mm", and "grouped_mm" backends are compatible with torch.compile to varying degrees. The following table summarizes their compatibility. The "deepgemm", "deepgemm_megamoe", and "sonicmoe" backends route through external CUDA kernels and aren't covered by this table.

Implementation	compilation modes	dtypes	`fullgraph=True`
`grouped_mm`	`None`, `max-autotune-no-cudagraphs`	`bfloat16`	Yes
`grouped_mm` (fallback)	`None`, `max-autotune-no-cudagraphs`	`bfloat16`, `float16`, `float32`	Yes
`batched_mm`	all	`bfloat16`, `float16`, `float32`	Yes
`eager`	all	`bfloat16`, `float16`, `float32`	No

Notes:

The grouped_mm experts backend currently only supports bfloat16 when compiled with torch.compile. Additionally, it is not compatible with CUDA graphs, so you must use mode=None or mode="max-autotune-no-cudagraphs" when compiling.
The eager experts backend uses a data-dependent operation to find which experts are used in a forward pass. This operation is not compatible with full graph compilation (fullgraph=True).

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    dtype="bfloat16",
    experts_implementation="grouped_mm",
).eval().cuda()

# Works for grouped_mm (no CUDA graphs)
model.forward = torch.compile(model.forward, mode="max-autotune-no-cudagraphs")

Benchmarks

This benchmark compares different input sizes and experts implementations with and without torch.compile.