Back to Vllm

IndexCache

docs/features/index_cache.md

0.21.02.5 KB
Original Source

IndexCache

IndexCache reduces redundant top-k computation in DeepSeek-V3.2 (DSA) models by caching and reusing top-k indices across layers.

Background

DeepSeek-V3.2 uses a DeepSeek Sparse Attention (DSA) mechanism where top-k token selection is computed per layer. For deep models with many layers, this computation can be expensive. IndexCache allows skipping redundant top-k computations by reusing indices from previous layers.

See: IndexCache Paper

Usage

CLI

bash
vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...

Configuration Reference

ParameterTypeDefaultDescription
use_index_cacheboolfalseEnable IndexCache. Must be set to true to use this feature
index_topk_freqint1Frequency (in layers) at which top-k is computed. 1 = compute on every layer (disabled), 4 = compute on 1/4 of layers
index_topk_patternstrnullPer-layer F/S pattern. Overrides index_topk_freq if set. Each character maps to one DSA layer: F = Full, S = Shared

Configuration Examples

Using index_topk_freq (compute every N layers):

bash
vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}' ...

Using index_topk_pattern (explicit per-layer control):

bash
# custom pattern for 61 layers: F = compute, S = reuse
vllm serve deepseek-ai/DeepSeek-V3.2 \
    --hf-overrides '{"use_index_cache": true, "index_topk_pattern": "FFSFSSSFSSFFFSSSFFFSFSSSSSSFFSFFSFFSSFFFFFFSFFFFFSFFSSSSSSFSF"}'

How It Works

  1. When IndexCache is enabled, layers marked with "F" (Full) calculate and store top-k indices
  2. Subsequent layers marked with "S" (Shared) receive the cached indices from the previous layer instead of recomputing
  3. The cached indices are passed through the layer stack, reducing total computation

Requirements

  • DeepSeek-V3.2 or compatible DSA model
  • use_index_cache: true via --hf-overrides