docs_new/docs/advanced_features/quantized_kv_cache.mdx
Quantized KV cache reduces the memory footprint of key-value cache storage by using lower-precision data types (FP8 or FP4) instead of the model's default precision (typically BF16). During autoregressive generation, LLMs cache previously computed key-value pairs to avoid redundant computation, and this KV cache typically consumes a significant portion of GPU memory, especially for long sequences.
Quantized KV cache is a memory optimization: it primarily benefits throughput by allowing more tokens to be cached, but may introduce a small accuracy degradation depending on the quantization format used.
<Warning> **Performance Warning**: When quantized KV cache must be dequantized before use in attention operations, performance can be extremely slow if dequantization is not fused with the attention kernel. Always verify that your chosen attention backend supports quantized KV cache. Backends without fused support may experience significant throughput degradation, potentially negating the memory benefits.

**Backend Support**: Not all attention backends support quantized KV cache. Refer to Attention Backend for which backends support it. </Warning>
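To confirm what a running server is actually using, you can query its info endpoint and inspect the reported launch configuration. The snippet below is a minimal sketch; it assumes the server is listening on SGLang's default `http://localhost:30000` and that `/get_server_info` returns the configuration as JSON (check the printed output for the KV cache dtype and attention backend fields).

```python
# Minimal sketch: query a running SGLang server for its configuration.
# Assumes the default local endpoint (http://localhost:30000).
import json
import requests

resp = requests.get("http://localhost:30000/get_server_info", timeout=10)
resp.raise_for_status()

# Print the full configuration; look for the KV cache dtype and attention
# backend entries to confirm the expected settings took effect.
print(json.dumps(resp.json(), indent=2))
```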
SGLang supports the following quantized KV cache formats:
OCP (Open Compute Project) specifies two common 8-bit floating point formats:

- `fp8_e4m3` (E4M3): 4 exponent bits and 3 mantissa bits, offering finer precision within a smaller dynamic range.
- `fp8_e5m2` (E5M2): 5 exponent bits and 2 mantissa bits, trading precision for a wider dynamic range.

OCP also specifies MXFP4 (Microscaling FP4), a 4-bit floating-point format:

- `fp4_e2m1` (E2M1): 2 exponent bits and 1 mantissa bit, stored in small blocks that share a common scaling factor.
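The practical difference between the FP8 variants is how they trade mantissa precision against dynamic range. As a quick illustration, the sketch below (assuming PyTorch 2.1+ with native float8 dtypes) prints the numeric limits of each FP8 format; the E2M1 magnitudes are listed by hand since PyTorch has no standard 4-bit float dtype.

```python
# Sketch: compare the numeric limits of the FP8 KV cache formats.
# Assumes PyTorch >= 2.1, which ships torch.float8_e4m3fn and torch.float8_e5m2.
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# FP4 E2M1 has no PyTorch dtype; its representable magnitudes are only
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}, which is why MXFP4 pairs each small block
# of values with a shared scaling factor.
fp4_e2m1_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print("fp4_e2m1 magnitudes:", fp4_e2m1_magnitudes)
```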
To enable quantized KV cache, use the `--kv-cache-dtype` argument when launching the server:
```bash
# Enable FP8 E5M2 KV cache
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --kv-cache-dtype fp8_e5m2
```

```bash
# Enable FP8 E4M3 KV cache
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-0528 \
  --kv-cache-dtype fp8_e4m3
```

```bash
# Enable FP4 E2M1 KV cache
python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-R1-0528-NVFP4 \
  --kv-cache-dtype fp4_e2m1
```
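Once the server is up, requests are made exactly as with an unquantized server; the KV cache dtype is transparent to clients. A minimal sketch, assuming the default port 30000 and the OpenAI-compatible `/v1/chat/completions` route:

```python
# Sketch: send a request to a server launched with --kv-cache-dtype.
# Assumes SGLang's default port (30000) and its OpenAI-compatible API.
import requests

payload = {
    "model": "deepseek-ai/DeepSeek-R1-0528",  # should match the served model
    "messages": [
        {"role": "user", "content": "Why does the KV cache grow with sequence length?"}
    ],
    "max_tokens": 128,
}
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```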
FP8 quantization requires scaling factors to properly quantize and dequantize the KV cache.
<Note> Currently, only per-tensor (scalar) scaling factors are supported. </Note>

Scaling factors can be:

- Loaded automatically from `k_scale` and `v_scale` parameters stored in the model checkpoint.
- Provided in a JSON file via `--quantization-param-path`.

The JSON file should follow this format:
```json
{
  "kv_cache": {
    "dtype": "float8_e4m3fn",
    "scaling_factor": {
      "0": {
        "0": 1.0,
        "1": 1.0
      }
    }
  }
}
```
Where the outer keys in `scaling_factor` are tensor parallel ranks and the inner keys are layer indices.
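If you need to produce such a file yourself, it is just nested JSON keyed by TP rank and layer index. A minimal sketch, using placeholder scale values and a hypothetical layer count (real scales would come from calibrating the model's KV activations), assuming a single tensor-parallel rank:

```python
# Sketch: write a scaling-factor JSON file for --quantization-param-path.
# The layer count and scale values are placeholders for illustration only.
import json

num_layers = 61           # placeholder: set to the model's layer count
tp_rank = "0"             # single tensor-parallel rank in this example

params = {
    "kv_cache": {
        "dtype": "float8_e4m3fn",
        "scaling_factor": {
            tp_rank: {str(layer): 1.0 for layer in range(num_layers)},
        },
    }
}

with open("kv_cache_scales.json", "w") as f:
    json.dump(params, f, indent=2)
```

The resulting file is then passed at launch with `--quantization-param-path kv_cache_scales.json`.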
Quantized KV cache provides significant memory savings:

- FP8 (E4M3 or E5M2) stores each cached element in 1 byte instead of 2, roughly halving KV cache memory compared to BF16.
- FP4 (E2M1) stores each element in 4 bits, reducing KV cache memory to roughly a quarter of BF16 (plus a small overhead for block scaling factors).
This enables longer context lengths or more concurrent requests within the same memory budget.
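To get a feel for the numbers, the back-of-the-envelope sketch below estimates per-token KV cache size for a hypothetical dense model configuration (the layer and head dimensions are illustrative, not taken from any specific model) and how many tokens fit in a fixed KV cache budget at each precision:

```python
# Back-of-the-envelope KV cache sizing for a hypothetical dense model.
# All model dimensions below are illustrative placeholders.
num_layers = 64
num_kv_heads = 8          # grouped-query attention: KV heads, not query heads
head_dim = 128
bytes_per_elem = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

kv_budget_gib = 40        # hypothetical memory reserved for the KV cache

for name, nbytes in bytes_per_elem.items():
    # 2x for keys and values, summed over all layers.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * nbytes
    max_tokens = int(kv_budget_gib * 1024**3 / per_token_bytes)
    print(f"{name}: {per_token_bytes / 1024:.1f} KiB/token, "
          f"~{max_tokens:,} cacheable tokens in {kv_budget_gib} GiB")
```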
FP8 E4M3 quantization typically introduces minimal accuracy degradation. The impact depends on model architecture, sequence length, and quantization format (generally, E4M3 has better accuracy than E5M2).
FP4 (MXFP4) quantization provides significant memory savings with varying accuracy impact depending on model size and dataset complexity. Preliminary accuracy test results from PR #10078 (MLA) and PR #12612 (MHA) show:
**Large Models (e.g., Qwen3-235B-A22B, DeepSeek-R1-0528)**
On large-scale models, FP4 maintains accuracy close to FP8/BF16, especially on simpler datasets:
| Model | Dataset | KV16 | KV8 (FP8 E4M3) | KV4 (FP4 E2M1) |
| --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | gsm8k | 0.9168 | 0.9181 | 0.9186 |
| Qwen3-235B-A22B | aime25 | 0.7733 | 0.7333 | 0.6000 |
| Qwen3-235B-A22B | gpqa_diamond | 0.7010 | 0.6899 | 0.6778 |
| DeepSeek-R1-0528 | gsm8k | 0.9157 | 0.9154 | 0.9124 |
| DeepSeek-R1-0528 | aime25 | 0.5067 | 0.4934 | 0.4000 |
| DeepSeek-R1-0528 | gpqa_diamond | 0.7707 | 0.7697 | 0.7273 |

**Smaller Models (e.g., GPT-OSS-120B)**
On smaller models, FP4 shows more pronounced accuracy drops, particularly on challenging datasets:
| Model | Dataset | KV16 | KV8 (FP8 E4M3) | KV4 (FP4 E2M1) |
| --- | --- | --- | --- | --- |
| GPT-OSS-120B | gsm8k | 0.9161 | 0.9163 | 0.9152 |
| GPT-OSS-120B | aime25 | 0.7533 | 0.7667 | 0.3533 |
| GPT-OSS-120B | gpqa_diamond | 0.5081 | 0.5434 | 0.3202 |

**Key Observations:**

- FP8 E4M3 stays within noise of BF16 across all models and datasets tested.
- FP4 E2M1 is close to BF16 on simpler datasets such as gsm8k, but degrades on harder benchmarks (aime25, gpqa_diamond).
- The FP4 degradation is much larger on smaller models: GPT-OSS-120B drops sharply on aime25 and gpqa_diamond, while the larger models lose only a few points.
When choosing a KV cache dtype, use `fp8_e4m3` for better accuracy (recommended), `fp8_e5m2` for a larger dynamic range, or `fp4_e2m1` for maximum memory savings (experimental).