docs/MLA.md
Multi-head Latent Attention (MLA) is an efficient attention mechanism that reduces KV cache memory usage by compressing key-value states into a low-rank latent space. This technique was introduced in DeepSeek V2 and is also used in DeepSeek V3 and GLM-4.7-Flash models.
MLA compresses the key-value cache by:

- Projecting key-value states into a compressed latent representation (kv_lora_rank dimensions) instead of storing full per-head keys and values
- Keeping a small decoupled rotary position embedding component (kpe_head_dim dimensions) alongside the latent

This results in significant memory savings compared to standard multi-head attention, enabling longer context lengths with the same GPU memory.
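To make the shape flow concrete, the sketch below walks through the down-projection and up-projection involved in MLA-style compression for a single token. It is a minimal illustration, not the mistral.rs implementation: only `kv_lora_rank` and `kpe_head_dim` correspond to configuration values mentioned in this document, and the remaining dimensions and names are assumptions chosen for the example.

```rust
// Conceptual sketch of MLA compression (illustrative only; not mistral.rs internals).
const HIDDEN_SIZE: usize = 7168;     // assumption: model hidden size
const KV_LORA_RANK: usize = 512;     // latent dimension stored in the KV cache
const KPE_HEAD_DIM: usize = 64;      // decoupled rotary component stored per token
const NUM_HEADS: usize = 128;        // assumption: number of attention heads
const QK_NOPE_HEAD_DIM: usize = 128; // assumption: non-rope key dim per head
const V_HEAD_DIM: usize = 128;       // assumption: value dim per head

/// Naive matrix-vector product: (out_dim x in_dim) * (in_dim) -> (out_dim).
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    // A single token's hidden state.
    let hidden = vec![0.1f32; HIDDEN_SIZE];

    // Down-projection: hidden state -> compressed latent. Only this latent
    // (plus the small rotary component) needs to be written to the KV cache.
    let w_down = vec![vec![0.01f32; HIDDEN_SIZE]; KV_LORA_RANK];
    let latent = matvec(&w_down, &hidden);
    assert_eq!(latent.len(), KV_LORA_RANK);

    // Up-projections: latent -> full per-head keys and values. This happens at
    // attention time, so the full K/V never has to be cached.
    let w_up_k = vec![vec![0.01f32; KV_LORA_RANK]; NUM_HEADS * QK_NOPE_HEAD_DIM];
    let w_up_v = vec![vec![0.01f32; KV_LORA_RANK]; NUM_HEADS * V_HEAD_DIM];
    let full_k = matvec(&w_up_k, &latent);
    let full_v = matvec(&w_up_v, &latent);

    println!(
        "cached per token: {} (latent) + {} (rope) values",
        latent.len(),
        KPE_HEAD_DIM
    );
    println!(
        "reconstructed at attention time: {} (K) + {} (V) values",
        full_k.len(),
        full_v.len()
    );
}
```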
MLA is automatically enabled for the following model architectures when using PagedAttention on CUDA:
| Model | Architecture | MLA Dimensions |
|---|---|---|
| DeepSeek V2 | deepseekv2 | kv_lora_rank varies |
| DeepSeek V3 | deepseekv3 | kv_lora_rank=512, kpe_head_dim=64 |
| GLM-4.7-Flash | glm4moelite | kv_lora_rank=512, kpe_head_dim=64 |
MLA decode optimization requires:

- A supported model architecture (see the table above)
- PagedAttention enabled as the KV cache backend
- A CUDA device
When these conditions are met, MLA is automatically used during the decode phase for optimal performance.
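The decision reduces to a conjunction of the conditions above. The sketch below only illustrates the shape of that check; the struct, field, and function names are hypothetical and are not the mistral.rs API.

```rust
/// Hypothetical sketch of the enablement check described above.
/// Names are illustrative, not mistral.rs internals.
struct AttentionConfig {
    architecture: String,      // e.g. "deepseekv3" or "glm4moelite"
    paged_attention: bool,     // PagedAttention is the KV cache backend
    device_is_cuda: bool,      // running on a CUDA device
    mla_disabled_by_env: bool, // MISTRALRS_NO_MLA=1 was set (see below)
}

fn mla_decode_enabled(cfg: &AttentionConfig) -> bool {
    const SUPPORTED: &[&str] = &["deepseekv2", "deepseekv3", "glm4moelite"];
    SUPPORTED.contains(&cfg.architecture.as_str())
        && cfg.paged_attention
        && cfg.device_is_cuda
        && !cfg.mla_disabled_by_env
}

fn main() {
    let cfg = AttentionConfig {
        architecture: "deepseekv3".to_string(),
        paged_attention: true,
        device_is_cuda: true,
        mla_disabled_by_env: false,
    };
    println!("MLA decode enabled: {}", mla_decode_enabled(&cfg));
}
```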
MLA provides two key optimizations:

- Reduced KV Cache Memory: The compressed latent representation uses significantly less memory than full key-value states, allowing for:
  - Longer context lengths within the same GPU memory budget
  - More concurrent sequences in the same amount of cache memory
- Optimized Decode Kernels: Custom FlashInfer-based MLA kernels accelerate single-token generation by:
  - Attending directly over the compressed latent cache rather than materializing full per-head keys and values
  - Reading far less KV cache data per generated token, which matters because decode is typically memory-bandwidth bound
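To put rough numbers on the memory point, the sketch below compares the per-token KV cache footprint of a standard full key-value cache with the MLA latent layout, using the kv_lora_rank=512 and kpe_head_dim=64 values from the table above. The head count and head dimensions used for the full-cache baseline are assumptions chosen for illustration, not values taken from any particular model config.

```rust
// Rough per-token KV cache comparison (assumptions marked below).
fn main() {
    // MLA layout values from the table above (DeepSeek V3 / GLM-4.7-Flash).
    let kv_lora_rank = 512usize;
    let kpe_head_dim = 64usize;

    // Assumed baseline for a model caching full per-head keys and values.
    let num_kv_heads = 128usize; // assumption
    let k_head_dim = 192usize;   // assumption
    let v_head_dim = 128usize;   // assumption

    let bytes_per_elem = 2usize; // fp16/bf16 cache

    let mla_per_token = (kv_lora_rank + kpe_head_dim) * bytes_per_elem;
    let full_per_token = num_kv_heads * (k_head_dim + v_head_dim) * bytes_per_elem;

    println!("MLA cache per token : {} bytes", mla_per_token);  // 1152
    println!("full cache per token: {} bytes", full_per_token); // 81920
    println!(
        "reduction           : {:.1}x",
        full_per_token as f64 / mla_per_token as f64
    );
}
```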
If you encounter issues or want to compare performance, you can disable MLA by setting the environment variable:
```bash
MISTRALRS_NO_MLA=1 mistralrs ...
```
When disabled, the model falls back to standard PagedAttention with full KV cache storage.
When MLA is enabled, PagedAttention uses a specialized cache layout: each cached token stores the compressed latent representation (kv_lora_rank dimensions) + rotary position embeddings (kpe_head_dim dimensions).

During single-token generation (decode phase):

- The FlashInfer-based MLA decode kernel attends directly over the compressed cache, so full key-value states are never materialized

During prompt processing (prefill phase):

- Full keys and values are reconstructed from the latent representation for attention over the prompt; only the compressed representation is written to the cache
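The sketch below shows a per-token view of this layout to make the dimensions concrete. The struct and field names are hypothetical; the real cache is stored as contiguous paged blocks, not individual per-token structs.

```rust
/// Hypothetical per-token view of the MLA cache layout described above.
struct MlaCacheEntry {
    /// Compressed latent representation (kv_lora_rank values, e.g. 512).
    latent: Vec<f32>,
    /// Decoupled rotary position embedding component (kpe_head_dim values, e.g. 64).
    rope: Vec<f32>,
}

fn main() {
    let kv_lora_rank = 512;
    let kpe_head_dim = 64;
    let entry = MlaCacheEntry {
        latent: vec![0.0; kv_lora_rank],
        rope: vec![0.0; kpe_head_dim],
    };
    // 576 values per token instead of full per-head keys and values.
    println!(
        "values cached per token: {}",
        entry.latent.len() + entry.rope.len()
    );
}
```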