# DeepSeek-V4-Flash
This tutorial demonstrates how to run DeepSeek-V4-Flash model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. The hybrid path splits the MXFP4 routed experts between the CPU (KT-Kernel's cpuinfer) and the GPU (controlled by SGLang's --kt-num-gpu-experts), enabling deployment on consumer-grade hardware.
Validated Configuration (this tutorial): Consumer Blackwell (RTX 5090), the row marked ✓ in the table below.

Supported GPU architectures (auto-detected at startup; non-validated configurations should work but have not been benchmarked end-to-end):
| Arch | Compute Capability | MXFP4 MoE backend | NSA sparse MLA backend | Validated |
|---|---|---|---|---|
| Hopper (H100 / H200) | SM_90 | triton_kernels | flash_mla wheel | — |
| Datacenter Blackwell (B100 / B200) | SM_100 | trtllm-fp4 | Triton fallback | — |
| Consumer Blackwell (RTX 5090) | SM_120 | triton_kernels | Triton fallback | ✓ |
| Ada Lovelace (RTX 4090 / L20 / L40) | SM_89 | triton_kernels | Triton fallback | — |
| Ampere (A100 / A6000) | SM_80 / SM_86 | triton_kernels | Triton fallback | — |
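To confirm which row of the table applies to your machine, you can query the compute capability directly. This is a minimal check assuming a recent NVIDIA driver whose nvidia-smi supports the compute_cap query field:

```bash
# Print each GPU's name and compute capability (e.g. 12.0 -> SM_120, 9.0 -> SM_90).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```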
KT-Kernel installed:
```bash
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
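As a quick sanity check that the extension built and installed, you can try importing it. The module name kt_kernel is an assumption here; use whatever name install.sh reports if it differs:

```bash
# Verify the KT-Kernel Python extension is importable (module name assumed to be kt_kernel).
python -c "import kt_kernel; print(kt_kernel.__file__)"
```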
SGLang installed (kvcache-ai fork):
```bash
./install.sh   # run from the ktransformers repository root
```
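A hedged way to confirm the fork is the one on your path is to import sglang and check that the launch entry point used later in this tutorial resolves (the reported version will vary with the fork):

```bash
# Confirm sglang is importable and the launch_server module is present.
python -c "import sglang; print(sglang.__version__)"
python -m sglang.launch_server --help | head -n 5
```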
CUDA 12.8+ and flashinfer ≥ 0.6.9 (flashinfer-python and flashinfer-cubin must be the same version):
```bash
pip install --upgrade flashinfer-python flashinfer-cubin
```
This upgrade is required (even though sglang-kt pins flashinfer_python==0.6.3) because V4-Flash's MXFP4 MoE module imports mxfp8_quantize, trtllm_fp4_block_scale_routed_moe, etc., which only exist in flashinfer ≥ 0.6.9.
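To confirm both packages ended up on the same version ≥ 0.6.9 (package names as used in the pip command above), a quick check is:

```bash
# Both packages should report the same version, >= 0.6.9.
pip show flashinfer-python flashinfer-cubin | grep -E '^(Name|Version)'
python -c "import flashinfer; print(flashinfer.__version__)"
```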
```bash
mkdir -p /path/to/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /path/to/models/DeepSeek-V4-Flash
```
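Optionally confirm the download is complete before launching; this sketch assumes the usual Hugging Face layout of sharded safetensors files plus config.json:

```bash
# Rough completeness check: config present and all weight shards downloaded.
ls /path/to/models/DeepSeek-V4-Flash/config.json
ls /path/to/models/DeepSeek-V4-Flash/*.safetensors | wc -l
```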
```bash
numactl --interleave=all python -m sglang.launch_server \
  --host 127.0.0.1 \
  --port 30000 \
  --model /path/to/models/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/models/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 144 \
  --kt-cpuinfer 8 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 4096 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 8 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 2048 \
  --max-running-requests 4 \
  --max-total-tokens 32768 \
  --watchdog-timeout 3000 \
  --disable-shared-experts-fusion \
  --cuda-graph-bs 1 2 4 \
  --cuda-graph-max-bs 4 \
  --trust-remote-code
```
It takes about 4-5 minutes to start the server (weight load + CUDA Graph capture).
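Once startup finishes, you can confirm readiness before sending traffic; /health and /get_model_info are the standard SGLang server endpoints:

```bash
# Block until the server answers its health check, then print the loaded model info.
until curl -sf http://127.0.0.1:30000/health > /dev/null; do sleep 10; done
curl -s http://127.0.0.1:30000/get_model_info
```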
See KT-Kernel Parameters for detailed parameter tuning guidelines.
```bash
curl -s -X POST http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in detail:",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 256}
  }'
```
The kt CLI ships with an OpenAI-compatible chat client that talks to the SGLang server's /v1/chat/completions endpoint:
```bash
kt chat --host 127.0.0.1 --port 30000 --temperature 0.7 --max-tokens 2048
```
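If you prefer to call the OpenAI-compatible endpoint directly instead of going through kt chat, a raw request looks like the sketch below; the value of the model field is an assumption, as SGLang typically accepts the served model name or path:

```bash
curl -s -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Explain quantum computing in detail."}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```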