docs_new/docs/advanced_features/hisparse_guide.mdx
HiSparse reduces per-request GPU memory consumption during the decode phase by maintaining only a small "hot" KV buffer on GPU while keeping complete KV data in CPU pinned memory. Combined with PD disaggregation, it enables significantly higher decode concurrency.
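As a rough illustration of why a bounded hot buffer raises decode concurrency, the sketch below counts GPU token slots per request with and without HiSparse. It is a back-of-the-envelope model, not SGLang code: the helper names and the `gpu_token_budget` value are hypothetical, and it assumes the hot buffer holds at most `device_buffer_size` token slots per request.

```python
from typing import Optional


def gpu_slots_per_request(seq_len: int, device_buffer_size: Optional[int]) -> int:
    """Token slots one request occupies on GPU (assumed model, not SGLang code)."""
    if device_buffer_size is None:
        # Without HiSparse the full-length KV cache stays on GPU.
        return seq_len
    # With HiSparse only the hot buffer lives on GPU; the full KV is in host memory.
    return min(seq_len, device_buffer_size)


def max_concurrency(gpu_token_budget: int, seq_len: int,
                    device_buffer_size: Optional[int] = None) -> int:
    """How many requests of length seq_len fit into a fixed GPU token budget."""
    return gpu_token_budget // gpu_slots_per_request(seq_len, device_buffer_size)


gpu_token_budget = 1_000_000  # hypothetical GPU KV pool size, in token slots
seq_len = 60_000              # illustrative long-context request

baseline = max_concurrency(gpu_token_budget, seq_len)             # 16
with_hisparse = max_concurrency(gpu_token_budget, seq_len, 6144)  # 162
print(baseline, with_hisparse)
```

The real gain depends on per-token KV size, other GPU allocations, and host memory capacity, so treat these numbers only as an order-of-magnitude intuition.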
Prerequisites: HiSparse works with models that use DeepSeek Sparse Attention (DSA) architectures (e.g., DeepSeek-V3.2, GLM-5.1) and DeepSeek V4. These models natively select a subset of tokens for attention, making it possible to keep only the top-k KV on GPU while storing the full KV in host memory — without accuracy loss. Additionally, HiSparse currently requires PD disaggregation mode and is enabled on the decode instance only.
In long-context LLM inference, each decoding request holds a full-length KV cache on GPU, limiting the number of concurrent requests a decode instance can serve. HiSparse addresses this by:
Each decode step follows this flow:
Short sequences (seq_len ≤ device_buffer_size): fast path, all KV already in buffer
In PD disaggregation mode, the prefill instance transfers KV cache directly into the decode instance's host pool via RDMA, bypassing the GPU entirely on the decode side. This eliminates the transient GPU memory spike during KV transfer and removes the staging DMA step.
Prefill GPU ──RDMA──▶ Decode Host Pool (CPU pinned memory)
│
▼
alloc device buffer (4KB)
│
▼
swap-in kernel (on-demand top-k)
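The per-step decision (fast path when the sequence fits in the device buffer, otherwise an on-demand swap-in of the selected top-k entries) can be sketched in plain Python. This is a conceptual model only: `plan_decode_step` is a hypothetical name, the real work happens in a GPU kernel, and the eviction order used here is an arbitrary placeholder because the source does not describe the actual policy.

```python
from typing import List, Set, Tuple


def plan_decode_step(seq_len: int, selected: List[int], resident: Set[int],
                     device_buffer_size: int) -> Tuple[List[int], List[int]]:
    """Return (positions to copy host->device, positions to evict) for one step."""
    if seq_len <= device_buffer_size:
        # Fast path: the whole sequence already fits in the device buffer.
        return [], []

    selected_set = set(selected)
    # Only top-k positions that are not yet on GPU need a swap-in.
    to_load = [pos for pos in selected if pos not in resident]

    # Free just enough slots, never evicting a position selected this step.
    # (Placeholder policy: lowest position first.)
    overflow = len(resident) + len(to_load) - device_buffer_size
    evictable = sorted(resident - selected_set)
    to_evict = evictable[:max(overflow, 0)]
    return to_load, to_evict


# Sequence of 10 tokens, buffer of 4 slots, positions 0..3 currently resident.
print(plan_decode_step(10, [0, 5, 9], {0, 1, 2, 3}, 4))  # ([5, 9], [1, 2])
```

Because consecutive decode steps tend to select overlapping positions, most selected entries are already resident and only the misses cross the host-device link.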
For DeepSeek V4, the direct-to-host path writes only C4 KV into the decode host pool. The c4_indexer and C128 KV remain device-to-device transfers.
Pass as a JSON string via --hisparse-config:
Example: --hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}'
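Hand-writing JSON inside shell quotes is easy to get wrong, so it can help to generate the flag programmatically. The sketch below is a small helper of our own (`build_hisparse_flag` is hypothetical, not part of SGLang); the check that `top_k` does not exceed `device_buffer_size` is an assumed sanity rule based on the idea that the selected top-k entries must fit in the device buffer.

```python
import json
import shlex


def build_hisparse_flag(top_k: int, device_buffer_size: int,
                        host_to_device_ratio: int) -> str:
    """Render the --hisparse-config flag with shell-safe quoting."""
    # Assumed sanity check: the top-k selection has to fit in the hot buffer.
    if top_k > device_buffer_size:
        raise ValueError("top_k must not exceed device_buffer_size")
    config = {
        "top_k": top_k,
        "device_buffer_size": device_buffer_size,
        "host_to_device_ratio": host_to_device_ratio,
    }
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return "--hisparse-config=" + shlex.quote(json.dumps(config))


flag = build_hisparse_flag(2048, 6144, 10)
print(flag)
```

The printed flag matches the example above and can be pasted directly into the launch command.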
HiSparse currently requires PD disaggregation mode and is enabled only on the decode instance.
python3 -m sglang.launch_server \
--model-path /path/to/model \
--trust-remote-code \
--port 8000 --host 0.0.0.0 \
--context-length 81920 \
--chunked-prefill-size 65536 \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--mem-fraction-static 0.85 \
--disaggregation-mode prefill \
--disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
--nnodes 1 --node-rank 0
python3 -m sglang.launch_server \
--model-path /path/to/model \
--trust-remote-code \
--port 8000 --host 0.0.0.0 \
--context-length 81920 \
--tp-size 8 --dp-size 8 --enable-dp-attention \
--mem-fraction-static 0.85 \
--disable-radix-cache \
--disaggregation-mode decode \
--disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
--dist-init-addr 127.0.0.1:5757 \
--nnodes 1 --node-rank 0 \
--enable-hisparse \
--hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10}'
Note: For DSA models, --kv-cache-dtype defaults to auto, which resolves to fp8_e4m3 on SM100+ (Blackwell) and bfloat16 on older architectures. The DSA decode backend is automatically selected based on KV dtype (bfloat16 → flashmla_sparse, fp8_e4m3 → flashmla_kv). DSA backend flags apply only to DSA models; DeepSeek V4 uses its own dsv4 attention backend.
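The selection rule in this note can be written as a small lookup. The sketch below only restates the rule for DSA models; the function names and the integer encoding of the SM version (100 for SM100) are assumptions for illustration and do not mirror SGLang's internal code.

```python
# KV cache dtype -> DSA decode backend, as stated in the note.
DSA_DECODE_BACKEND = {
    "bfloat16": "flashmla_sparse",
    "fp8_e4m3": "flashmla_kv",
}


def resolve_kv_cache_dtype(requested: str, sm_version: int) -> str:
    """Resolve 'auto' to a concrete dtype; explicit values pass through."""
    if requested != "auto":
        return requested
    # SM100+ (Blackwell) defaults to FP8, older architectures to bfloat16.
    return "fp8_e4m3" if sm_version >= 100 else "bfloat16"


def resolve_dsa_decode_backend(requested: str, sm_version: int) -> str:
    """Pick the DSA decode backend from the resolved KV cache dtype."""
    return DSA_DECODE_BACKEND[resolve_kv_cache_dtype(requested, sm_version)]


print(resolve_dsa_decode_backend("auto", 100))     # flashmla_kv
print(resolve_dsa_decode_backend("auto", 90))      # flashmla_sparse
print(resolve_dsa_decode_backend("bfloat16", 100)) # flashmla_sparse
```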
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset-name random \
--random-input 40000 \
--random-output 20000 \
--num-prompts 200 \
--max-concurrency 200 \
--request-rate 40 \
--random-range-ratio 1.0 \
--host 127.0.0.1 \
--port 20000 \
--model /path/to/model \
--flush-cache
The prefill instance does not need --enable-hisparse; it is unaware of HiSparse.
On the decode instance, --enable-hisparse and --hisparse-config are required for HiSparse.
For DSA models, --kv-cache-dtype bfloat16 uses flashmla_sparse, and --kv-cache-dtype fp8_e4m3 uses flashmla_kv.
DeepSeek V4 uses the dsv4 attention backend and fp8_e4m3 KV cache by default.
host_to_device_ratio should be configured based on the host machine's available memory. For example:
host_to_device_ratio: 5
host_to_device_ratio: 10
We would like to thank the SGLang team and community for the implementation and generous support, especially Zhiqiang Xie, Zhangheng Huang, Tingwei Huang, Shangming Cai, Teng Ma, and many others. We also thank the Alibaba Cloud TairKVCache team and the AntGroup SCT Inference team for their valuable contributions.