# FP8 ViT Encoder Attention

For visual understanding workloads with large images (e.g. QHD, 4K) and relatively short text prompts/generation, the ViT encoder attention can become a significant bottleneck, especially when the text model is quantized (e.g. NVFP4). vLLM supports optional FP8 quantization for the ViT encoder attention via the FlashInfer cuDNN backend. Q/K/V are quantized on-the-fly to FP8 before the cuDNN attention call.
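
Conceptually, the on-the-fly step is per-tensor FP8 (E4M3) casting with a scale derived from each tensor's absolute maximum. The sketch below is illustrative only; the helper name, tensor shapes, and plumbing are assumptions, not vLLM's internal API:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_fp8(x: torch.Tensor):
    """Per-tensor FP8 quantization: scale derived from the tensor's amax."""
    scale = x.abs().amax().float() / FP8_MAX
    x_fp8 = (x.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Toy Q/K/V activations standing in for the ViT encoder attention inputs.
q, k, v = (torch.randn(1024, 16, 128, dtype=torch.bfloat16) for _ in range(3))
q_fp8, q_scale = quantize_fp8(q)
k_fp8, k_scale = quantize_fp8(k)
v_fp8, v_scale = quantize_fp8(v)
# The FP8 tensors and their scales are then handed to the cuDNN attention call,
# which dequantizes internally.
```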

!!! note
    - Currently supports Qwen3-VL family models only (qwen3_vl, qwen3_vl_moe, qwen3_5, qwen3_5_moe, and other models using the Qwen3 ViT).
    - Dynamic scaling is not compatible with ViT full CUDA graphs.
    - Performance gains are mostly visible at QHD/4K resolutions or multi-image requests. Smaller images may see no speedup due to quantization overhead (3 quantization kernel launches + un-padding).
    - FP8 tensor-core speedup is more pronounced on GB300 than on GB200.

## Requirements

  • FlashInfer cuDNN backend with cuDNN >= 9.17.1.
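
As a quick sanity check, the cuDNN bundled with PyTorch can be queried as shown below; note that FlashInfer may link against a different cuDNN, and the integer encoding used here (major * 10000 + minor * 100 + patch) is an assumption:

```python
import torch

# cuDNN reports its version as a single integer; assuming the usual
# major * 10000 + minor * 100 + patch encoding, 9.17.1 -> 91701.
have = torch.backends.cudnn.version()
need = 91701
ok = have is not None and have >= need
print(f"cuDNN version: {have} ({'OK' if ok else 'older than 9.17.1'})")
```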

## Usage

Enable FP8 ViT attention by passing `--mm-encoder-attn-dtype fp8` together with `--mm-encoder-attn-backend FLASHINFER`:

```bash
vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8
```

By default (no scale file), dynamic scaling is used: a 16-entry circular buffer of observed Q/K/V amax values drives per-forward scale updates. This matches BF16 accuracy without any calibration but adds a small per-forward overhead.
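
A minimal sketch of the dynamic-scaling idea, assuming a fixed-size ring buffer whose running maximum sets the scale for the next attention call; the class and shapes are illustrative, only the 16-entry size comes from the description above:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

class AmaxRingBuffer:
    """Keep the last `size` observed amax values and derive a scale from their max."""

    def __init__(self, size: int = 16):
        self.buf = torch.zeros(size)
        self.idx = 0

    def update(self, x: torch.Tensor) -> torch.Tensor:
        self.buf[self.idx % len(self.buf)] = x.abs().amax().float()
        self.idx += 1
        return self.buf.max() / FP8_MAX  # scale used for this forward pass

q_buffer = AmaxRingBuffer()                # one buffer each for Q, K and V
for _ in range(20):                        # successive encoder forward passes
    q = torch.randn(1024, 16, 128)
    q_scale = q_buffer.update(q)           # per-forward scale update
```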

For production, calibrate static scales on a representative dataset once and reuse them to avoid the dynamic overhead:

```bash
# Step 1: calibrate and save scales (runs dynamic scaling for 16 passes,
# then dumps the learned scales to JSON).
vllm bench mm-processor \
    --model $MODEL --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-save-path /path/to/scales.json \
    --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat \
    --num-prompts 100

# Step 2: serve with static scales (no dynamic overhead).
vllm serve $MODEL \
    --mm-encoder-attn-backend FLASHINFER \
    --mm-encoder-attn-dtype fp8 \
    --mm-encoder-fp8-scale-path /path/to/scales.json
```

Saved scales are multiplied by `--mm-encoder-fp8-scale-save-margin` (default 1.5) to leave headroom against activation outliers not present in the calibration set. The default has been validated to generalize across datasets (e.g. calibrating on VisionArena-Chat maintains BF16-level accuracy on ChartQA).
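
For illustration, the margin is just a multiplier applied when the calibrated scales are written out; the values and helper below are hypothetical, only the 1.5x factor mirrors the flag's documented default:

```python
import json

margin = 1.5  # --mm-encoder-fp8-scale-save-margin (default)

# Hypothetical calibrated values: with the default margin, a learned
# q scale of 200.0 is saved as 200.0 * 1.5 = 300.0.
learned = {"visual.blocks.0.attn.attn": {"q": 200.0, "k": 180.0, "v": 190.0}}
saved = {
    layer: {name: value * margin for name, value in entry.items()}
    for layer, entry in learned.items()
}
with open("scales.json", "w") as f:
    json.dump(saved, f, indent=4)
```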

## Scale File Format

```json
{
    "visual.blocks.0.attn.attn": {"q": 224.0, "k": 198.0, "v": 210.0},
    "visual.blocks.1.attn.attn": {"q": 218.0, "k": 195.0, "v": 207.0}
}
```

Keys q_scale / k_scale / v_scale are accepted as aliases.
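
A minimal sketch of consuming such a file while normalizing the aliases; the loader itself is an assumption, not vLLM's implementation:

```python
import json

ALIASES = {"q_scale": "q", "k_scale": "k", "v_scale": "v"}

def load_scales(path: str) -> dict:
    """Read a scale file, mapping q_scale/k_scale/v_scale keys onto q/k/v."""
    with open(path) as f:
        raw = json.load(f)
    return {
        layer: {ALIASES.get(key, key): float(val) for key, val in entry.items()}
        for layer, entry in raw.items()
    }

scales = load_scales("/path/to/scales.json")
print(scales["visual.blocks.0.attn.attn"]["q"])
```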

## Performance

Core cuDNN attention kernel time, measured with the PyTorch profiler (`cudnn_generated_fort_native_sdpa_sm100_flash_fprop`, head_dim=128, seq_len=8192):

| Hardware | BF16   | FP8    | Speedup |
|----------|--------|--------|---------|
| GB200    | 350 us | 312 us | 1.12x   |
| GB300    | 300 us | 211 us | 1.42x   |

End-to-end encoder forward time (Qwen3-VL-30B-A3B-Instruct on GB200, 3 images/request):

| Resolution         | BF16 median | FP8 median | Speedup |
|--------------------|-------------|------------|---------|
| HD (720x1280)      | 31.77 ms    | 36.39 ms   | 0.87x   |
| FullHD (1080x1920) | 57.99 ms    | 58.73 ms   | ~same   |
| QHD (1440x2560)    | 131.83 ms   | 122.30 ms  | 1.08x   |
| 4K (2160x3840)     | 543.44 ms   | 460.31 ms  | 1.18x   |

The crossover point is around FullHD at 3 images/request; at QHD and above, FP8 is faster.

## Accuracy

ChartQA, Qwen3-VL-8B-Instruct, 500 samples. FP8 static uses scales calibrated on VisionArena-Chat (with default 1.5x margin):

| Metric            | BF16  | FP8 dynamic | FP8 static |
|-------------------|-------|-------------|------------|
| relaxed_accuracy  | 0.780 | 0.776       | 0.780      |
| anywhere_accuracy | 0.806 | 0.816       | 0.814      |
| exact_match       | 0.584 | 0.582       | 0.578      |

All three configurations match within statistical noise, confirming that static scales calibrated on one dataset generalize to another.