# DeepSeek-V4-Flash
This tutorial demonstrates how to run DeepSeek-V4-Flash model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. The hybrid path splits the MXFP4 routed experts between the CPU (KT-Kernel's cpuinfer) and the GPU (controlled by SGLang's --kt-num-gpu-experts), enabling deployment on consumer-grade hardware.
Validated Configuration (this tutorial): Consumer Blackwell (RTX 5090), the row marked ✓ in the table below.

Supported GPU architectures (auto-detected at startup; non-validated configurations should work but have not been benchmarked end-to-end):
| Arch | Compute Capability | MXFP4 MoE backend | NSA sparse MLA backend | Validated |
|---|---|---|---|---|
| Hopper (H100 / H200) | SM_90 | triton_kernels | flash_mla wheel | — |
| Datacenter Blackwell (B100 / B200) | SM_100 | trtllm-fp4 | Triton fallback | — |
| Consumer Blackwell (RTX 5090) | SM_120 | triton_kernels | Triton fallback | ✓ |
| Ada Lovelace (RTX 4090 / L20 / L40) | SM_89 | triton_kernels | Triton fallback | — |
| Ampere (A100 / A6000) | SM_80 / SM_86 | triton_kernels | Triton fallback | — |
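To confirm which row of the table applies to your machine, you can query the compute capability directly. This is a minimal check assuming a recent NVIDIA driver whose nvidia-smi supports the compute_cap query field:

```bash
# Print each GPU's name and compute capability (e.g. 12.0 -> SM_120, 9.0 -> SM_90).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```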
KT-Kernel installed:
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
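As a quick sanity check that the extension built and installed, you can try importing it. The module name kt_kernel is an assumption here; use whatever name install.sh reports if it differs:

```bash
# Verify the KT-Kernel Python extension is importable (module name assumed to be kt_kernel).
python -c "import kt_kernel; print(kt_kernel.__file__)"
```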
SGLang installed (kvcache-ai fork):
```bash
./install.sh   # run from the ktransformers repository root
```
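A hedged way to confirm the fork is the one on your path is to import sglang and check that the launch entry point used later in this tutorial resolves (the reported version will vary with the fork):

```bash
# Confirm sglang is importable and the launch_server module is present.
python -c "import sglang; print(sglang.__version__)"
python -m sglang.launch_server --help | head -n 5
```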
CUDA 12.8+ and flashinfer ≥ 0.6.9 (flashinfer-python and flashinfer-cubin must be the same version):
```bash
pip install --upgrade flashinfer-python flashinfer-cubin
```
This upgrade is required (even though sglang-kt pins flashinfer_python==0.6.3) because V4-Flash's MXFP4 MoE module imports mxfp8_quantize, trtllm_fp4_block_scale_routed_moe, etc., which only exist in flashinfer ≥ 0.6.9.
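To confirm both packages ended up on the same version ≥ 0.6.9 (package names as used in the pip command above), a quick check is:

```bash
# Both packages should report the same version, >= 0.6.9.
pip show flashinfer-python flashinfer-cubin | grep -E '^(Name|Version)'
python -c "import flashinfer; print(flashinfer.__version__)"
```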
```bash
mkdir -p /path/to/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /path/to/models/DeepSeek-V4-Flash
```
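Optionally confirm the download is complete before launching; this sketch assumes the usual Hugging Face layout of sharded safetensors files plus config.json:

```bash
# Rough completeness check: config present and all weight shards downloaded.
ls /path/to/models/DeepSeek-V4-Flash/config.json
ls /path/to/models/DeepSeek-V4-Flash/*.safetensors | wc -l
```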
```bash
numactl --interleave=all python -m sglang.launch_server \
  --host 127.0.0.1 \
  --port 30000 \
  --model /path/to/models/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/models/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 144 \
  --kt-cpuinfer 8 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 4096 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 8 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 2048 \
  --max-running-requests 4 \
  --max-total-tokens 32768 \
  --watchdog-timeout 3000 \
  --disable-shared-experts-fusion \
  --cuda-graph-bs 1 2 4 \
  --cuda-graph-max-bs 4 \
  --trust-remote-code
```
It takes about 4-5 minutes to start the server (weight load + CUDA Graph capture).
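Once startup finishes, you can confirm readiness before sending traffic; /health and /get_model_info are the standard SGLang server endpoints:

```bash
# Block until the server answers its health check, then print the loaded model info.
until curl -sf http://127.0.0.1:30000/health > /dev/null; do sleep 10; done
curl -s http://127.0.0.1:30000/get_model_info
```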
See KT-Kernel Parameters for detailed parameter tuning guidelines.
```bash
curl -s -X POST http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in detail:",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 256}
  }'
```
The kt CLI ships with an OpenAI-compatible chat client that talks to the SGLang server's /v1/chat/completions endpoint:
```bash
kt chat --host 127.0.0.1 --port 30000 --temperature 0.7 --max-tokens 2048
```
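If you prefer to call the OpenAI-compatible endpoint directly instead of going through kt chat, a raw request looks like the sketch below; the value of the model field is an assumption, as SGLang typically accepts the served model name or path:

```bash
curl -s -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Explain quantum computing in detail."}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```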