docs/design/attention_backends.md
This document is auto-generated by `tools/pre_commit/generate_attention_backend_docs.py`.
It shows the feature support for each registered attention backend based on the checks in `AttentionBackend.validate_configuration()`.
Do not edit this file manually. Run the following command to regenerate it:

```bash
python tools/pre_commit/generate_attention_backend_docs.py
```
There are two ways to specify the backend from the command line:

Option 1: Using `--attention-backend` (simple)

```bash
vllm serve <model> --attention-backend FLASH_ATTN
```

Option 2: Using `--attention-config.backend` / `-ac.backend` (structured config)

```bash
# Dot notation
vllm serve <model> --attention-config.backend FLASH_ATTN
vllm serve <model> -ac.backend FLASH_ATTN

# JSON format
vllm serve <model> --attention-config '{"backend": "FLASH_ATTN"}'
vllm serve <model> -ac '{"backend": "FLASH_ATTN"}'
```

Note: `--attention-backend` and `--attention-config.backend` are mutually exclusive. Use one or the other, not both.
Use `AttentionConfig` with the `LLM` class:

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Method 1: Using AttentionConfig with enum
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(backend=AttentionBackendEnum.FLASH_ATTN),
)

# Method 2: Using attention_backend parameter with string
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_backend="FLASH_ATTN",
)
```
When you explicitly set a backend via `--attention-backend` or `AttentionConfig`, vLLM validates it against the current model and hardware configuration and raises an error if the backend is not compatible.
Example error when selecting an incompatible backend:

```text
ValueError: Selected backend FLASHMLA is not valid for this configuration.
Reason: ['compute capability not supported']
```
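For example, the validation error can be handled like any other `ValueError`. A minimal sketch, assuming validation happens when the engine is constructed; `FLASHMLA` here simply stands in for any backend your hardware rejects:

```python
from vllm import LLM

# Sketch: explicitly requesting a backend that the current GPU or model
# configuration does not support fails with a ValueError like the one above.
try:
    llm = LLM(model="Qwen/Qwen3-0.6B", attention_backend="FLASHMLA")
except ValueError as err:
    # e.g. "Selected backend FLASHMLA is not valid for this configuration.
    #       Reason: ['compute capability not supported']"
    print(f"Backend rejected: {err}")
```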
When no backend is specified (the default), vLLM chooses the first compatible backend from the priority-ordered lists below. Priority 1 is the highest (tried first).
For standard (non-MLA) attention:

Blackwell (SM 10.x):
| Priority | Backend |
|---|---|
| 1 | FLASHINFER |
| 2 | FLASH_ATTN |
| 3 | TRITON_ATTN |
| 4 | FLEX_ATTENTION |
| 5 | TURBOQUANT |
Ampere/Hopper (SM 8.x-9.x):
| Priority | Backend |
|---|---|
| 1 | FLASH_ATTN |
| 2 | FLASHINFER |
| 3 | TRITON_ATTN |
| 4 | FLEX_ATTENTION |
| 5 | TURBOQUANT |
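Automatic selection can be thought of as walking the priority list and taking the first backend whose validation checks pass. The following is an illustrative sketch only, not vLLM's actual selector, using the Blackwell non-MLA list above as an example:

```python
from typing import Callable

def first_compatible(priority_list: list[str],
                     is_valid: Callable[[str], bool]) -> str:
    """Return the highest-priority backend whose validation checks pass."""
    for backend in priority_list:  # priority 1 is tried first
        if is_valid(backend):
            return backend
    raise ValueError("No compatible attention backend found.")

# Blackwell (SM 10.x) non-MLA priority order from the table above.
blackwell_priority = ["FLASHINFER", "FLASH_ATTN", "TRITON_ATTN",
                      "FLEX_ATTENTION", "TURBOQUANT"]
```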
For MLA (Multi-head Latent Attention) models:

Blackwell (SM 10.x):
| Priority | Backend |
|---|---|
| 1 | FLASHINFER_MLA |
| 2 | CUTLASS_MLA |
| 3 | FLASH_ATTN_MLA |
| 4 | FLASHMLA |
| 5 | TRITON_MLA |
| 6 | FLASHINFER_MLA_SPARSE* |
| 7 | FLASHMLA_SPARSE |
Ampere/Hopper (SM 8.x-9.x):
| Priority | Backend |
|---|---|
| 1 | FLASH_ATTN_MLA |
| 2 | FLASHMLA |
| 3 | FLASHINFER_MLA |
| 4 | TRITON_MLA |
| 5 | FLASHMLA_SPARSE |
\* For sparse MLA, FP8 KV cache always prefers `FLASHINFER_MLA_SPARSE`. With BF16 KV cache, `FLASHINFER_MLA_SPARSE` is preferred for low query-head counts (<= 16), while `FLASHMLA_SPARSE` is preferred otherwise.

Note: ROCm and CPU platforms have their own selection logic. See the platform-specific documentation for details.
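The sparse-MLA footnote above can be summarized as a small decision rule. This is an illustrative sketch only, not vLLM's actual code:

```python
def preferred_sparse_mla_backend(kv_cache_dtype: str, num_query_heads: int) -> str:
    """Sketch of the sparse-MLA preference described in the footnote above."""
    if kv_cache_dtype.startswith("fp8"):
        # FP8 KV cache always prefers FLASHINFER_MLA_SPARSE.
        return "FLASHINFER_MLA_SPARSE"
    # BF16 KV cache: FLASHINFER_MLA_SPARSE for low query-head counts,
    # FLASHMLA_SPARSE otherwise.
    return "FLASHINFER_MLA_SPARSE" if num_query_heads <= 16 else "FLASHMLA_SPARSE"
```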
| Column | Description |
|---|---|
| Dtypes | Supported model data types (fp16, bf16, fp32) |
| KV Dtypes | Supported KV cache data types (auto, fp8, fp8_e4m3, etc.) |
| Block Sizes | Supported KV cache block sizes (%N means multiples of N) |
| Head Sizes | Supported attention head sizes |
| Sink | Attention sink support (for StreamingLLM) |
| Sparse | Sparse attention support (MLA only) |
| MM Prefix | Multimodal prefix full attention support |
| DCP | Decode Context Parallelism support (--decode-context-parallel-size) |
| Attention Types | Supported attention patterns (Decoder, Encoder, Enc-Dec) |
| Compute Cap. | Required CUDA compute capability (N/A for non-CUDA backends) |
Symbols: ✅ = Supported, ❌ = Not supported
| Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
|---|---|---|---|---|---|---|---|---|---|---|
| CPU_ATTN | | fp16, bf16, fp32 | auto | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256, 512 | ❌ | ❌ | ❌ | All | N/A |
| FLASHINFER | Native† | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
| FLASHINFER | TRTLLM† | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
| FLASH_ATTN | FA2* | fp16, bf16 | auto, float16, bfloat16 | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
| FLASH_ATTN | FA3* | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
| FLASH_ATTN | FA4* | fp16, bf16 | auto, float16, bfloat16 | %16 | Any | ✅ | ❌ | ✅ | All | ≥10.0 |
| FLASH_ATTN_DIFFKV | | fp16, bf16 | auto | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
| FLEX_ATTENTION | | fp16, bf16, fp32 | auto, float16, bfloat16 | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
| ROCM_AITER_FA | | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder | N/A |
| ROCM_AITER_UNIFIED_ATTN | | fp16, bf16 | auto | %16 | Any | ✅ | ✅ | ❌ | All | N/A |
| ROCM_ATTN | | fp16, bf16, fp32 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | %16 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ❌ | ✅ | ❌ | Decoder, Encoder, Encoder Only | N/A |
| TREE_ATTN | | fp16, bf16 | auto, float16, bfloat16 | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
| TRITON_ATTN | | fp16, bf16, fp32 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2, int8_per_token_head, fp8_per_token_head | %16 | Any | ✅ | ✅ | ❌ | All | Any |
| TURBOQUANT | | fp16, bf16 | turboquant_k8v4, turboquant_4bit_nc, turboquant_k3v4_nc, turboquant_3bit_nc | 16, 32, 64, 128 | Any | ❌ | ❌ | ❌ | Decoder | Any |
† FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.

\* Specify the FlashAttention version via `--attention-config.flash_attn_version=2`, `3`, or `4`. Default is FA4 on SM100+ (Blackwell), FA3 on SM90 (Hopper), FA2 otherwise.
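Since dot notation on `--attention-config` maps to `AttentionConfig` fields, the same option should be settable from Python. A hedged sketch; the `flash_attn_version` field name is assumed here to mirror the CLI flag above:

```python
from vllm import LLM
from vllm.config import AttentionConfig
from vllm.v1.attention.backends.registry import AttentionBackendEnum

# Sketch: pin FlashAttention v3 for the FLASH_ATTN backend.
# Assumption: `flash_attn_version` mirrors --attention-config.flash_attn_version.
llm = LLM(
    model="Qwen/Qwen3-0.6B",
    attention_config=AttentionConfig(
        backend=AttentionBackendEnum.FLASH_ATTN,
        flash_attn_version=3,
    ),
)
```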
MLA uses separate backends for prefill and decode phases.
The prefill backend is selected at runtime based on hardware and configuration.
| Backend | Description | Compute Cap. | Enable | Disable | Notes |
|---|---|---|---|---|---|
| TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | -ac.use_trtllm_ragged_deepseek_prefill=0 | DeepSeek R1 dims only |
| FlashInfer | FlashInfer CUTLASS backend | 10.x | -ac.disable_flashinfer_prefill=0 | -ac.disable_flashinfer_prefill=1 | DeepSeek R1 dims only |
| cuDNN | cuDNN-based attention | 10.x | -ac.use_cudnn_prefill=1 | -ac.use_cudnn_prefill=0 | |
| FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |
‡ TRT-LLM Ragged is the default on Blackwell (SM100). On other GPUs, FlashAttention is used as the default.
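The prefill defaults in the table above amount to a simple compute-capability check. An illustrative sketch only (not vLLM's actual selection code, and ignoring per-model constraints such as the DeepSeek R1 dims restriction):

```python
def default_mla_prefill_backend(compute_capability: tuple[int, int]) -> str:
    """Sketch of the default MLA prefill backend choice described above."""
    major, _minor = compute_capability
    if major == 10:
        return "TRT-LLM Ragged"        # default on Blackwell (SM100)
    if major == 9:
        return "FlashAttention (FA3)"  # FA3 varlen on Hopper (SM90)
    return "FlashAttention (FA2)"      # FA2 varlen fallback elsewhere
```

The table below lists feature support for each MLA backend.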
| Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
|---|---|---|---|---|---|---|---|---|---|---|
| CUTLASS_MLA | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3 | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
| FLASHINFER_MLA | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3 | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
| FLASHINFER_MLA_SPARSE | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3 | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |
| FLASHMLA | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3 | 64 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x-10.x |
| FLASHMLA_SPARSE | bf16 | auto, bfloat16, fp8_ds_mla | 64 | 512, 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 9.x-10.x |
| FLASH_ATTN_MLA | fp16, bf16 | auto, float16, bfloat16 | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 9.x |
| ROCM_AITER_MLA | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3, fp8_e5m2 | %1 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| ROCM_AITER_MLA_SPARSE | fp16, bf16 | auto, float16, bfloat16 | 1 | Any | ❌ | ✅ | ❌ | ❌ | Decoder | N/A |
| ROCM_AITER_TRITON_MLA | fp16, bf16 | auto | Any | Any | ❌ | ❌ | ❌ | ❌ | Decoder | N/A |
| TRITON_MLA | fp16, bf16 | auto, float16, bfloat16, fp8, fp8_e4m3 | %16 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | Any |
| XPU_MLA_SPARSE | fp16, bf16 | auto, float16, bfloat16 | Any | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | Any |