Running DeepSeek-V4-Flash with SGLang and KT-Kernel

This tutorial demonstrates how to run DeepSeek-V4-Flash model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. The hybrid path splits the MXFP4 routed experts between the CPU (KT-Kernel's cpuinfer) and the GPU (SGLang, controlled by --kt-num-gpu-experts), enabling deployment on consumer-grade hardware.

Table of Contents

  • Hardware Requirements
  • Prerequisites
  • Step 1: Download Model Weights
  • Step 2: Launch SGLang Server
  • Step 3: Send Inference Requests

Hardware Requirements

Validated Configuration (this tutorial):

  • GPU: 8× NVIDIA RTX 5090 (32GB VRAM each, SM_120)
  • CPU: x86 CPU with AVX512 support
  • RAM: ≥256GB system memory
  • Storage: ~340GB for model weights
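
A quick sanity check of the CPU, memory, and storage requirements above (standard Linux tools; adjust the model path to your setup):

bash
# CPU must expose AVX512 for KT-Kernel's cpuinfer path
lscpu | grep -o 'avx512[a-z_]*' | sort -u

# Total system memory (should be >= 256 GB)
free -h | grep Mem

# Free space on the volume that will hold the ~340 GB of weights
df -h /path/to/models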

Supported GPU architectures (auto-detected at startup; non-validated configurations should work but have not been benchmarked end-to-end):

| Arch | Compute Cap | MXFP4 MoE | NSA sparse MLA | Validated |
|------|-------------|-----------|----------------|-----------|
| Hopper (H100 / H200) | SM_90 | triton_kernels | flash_mla wheel | |
| Datacenter Blackwell (B100 / B200) | SM_100 | trtllm-fp4 | Triton fallback | |
| Consumer Blackwell (RTX 5090) | SM_120 | triton_kernels | Triton fallback | ✓ (this tutorial) |
| Ada Lovelace (RTX 4090 / L20 / L40) | SM_89 | triton_kernels | Triton fallback | |
| Ampere (A100 / A6000) | SM_80 / SM_86 | triton_kernels | Triton fallback | |
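
To confirm which row applies to your GPUs, query the compute capability directly (the compute_cap field requires a reasonably recent NVIDIA driver):

bash
# One "name, compute_cap" line per GPU, e.g. "NVIDIA GeForce RTX 5090, 12.0"
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader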

Prerequisites

  1. KT-Kernel installed:

    bash
    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git submodule update --init --recursive
    cd kt-kernel && ./install.sh
    
  2. SGLang installed (kvcache-ai fork):

    bash
    ./install.sh   # from ktransformers root
    
  3. CUDA 12.8+ and flashinfer ≥ 0.6.9 (flashinfer-python and flashinfer-cubin must be the same version):

    bash
    pip install --upgrade flashinfer-python flashinfer-cubin
    

     This upgrade is required (even though sglang-kt pins flashinfer_python==0.6.3) because V4-Flash's MXFP4 MoE module imports mxfp8_quantize, trtllm_fp4_block_scale_routed_moe, and related kernels, which only exist in flashinfer ≥ 0.6.9.
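
     To confirm that both packages ended up on a matching version ≥ 0.6.9 (exact output formatting may vary by pip version):

    bash
    # Both packages should report the same version, >= 0.6.9
    pip show flashinfer-python flashinfer-cubin | grep -E '^(Name|Version)'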

Step 1: Download Model Weights

bash
mkdir -p /path/to/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /path/to/models/DeepSeek-V4-Flash
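
After the download finishes, it is worth confirming the checkpoint is complete before launching the server (the total size should be roughly the ~340GB noted above):

bash
# Rough size check of the downloaded checkpoint
du -sh /path/to/models/DeepSeek-V4-Flash

# Config and safetensors shards should all be present
ls /path/to/models/DeepSeek-V4-Flash | head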

Step 2: Launch SGLang Server

Launch Command (8× RTX 5090 Example)

bash
numactl --interleave=all python -m sglang.launch_server \
  --host 127.0.0.1 \
  --port 30000 \
  --model /path/to/models/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/models/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 144 \
  --kt-cpuinfer 8 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 4096 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 8 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.80 \
  --chunked-prefill-size 2048 \
  --max-running-requests 4 \
  --max-total-tokens 32768 \
  --watchdog-timeout 3000 \
  --disable-shared-experts-fusion \
  --cuda-graph-bs 1 2 4 \
  --cuda-graph-max-bs 4 \
  --trust-remote-code

Server startup takes about 4-5 minutes (weight loading plus CUDA graph capture).
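
Once startup completes, a quick liveness check against the server (endpoint names follow upstream SGLang and may differ slightly between forks):

bash
# Returns HTTP 200 once the server is ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:30000/health

# Reports the loaded model path and server configuration
curl -s http://127.0.0.1:30000/get_model_info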

See KT-Kernel Parameters for detailed parameter tuning guidelines.

Step 3: Send Inference Requests

Decode

bash
curl -s -X POST http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in detail:",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 256}
  }'
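
For interactive use, the same endpoint can stream tokens as they are generated (streaming follows SGLang's /generate API and returns server-sent events; treat this as a sketch):

bash
curl -s -N -X POST http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in detail:",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 256},
    "stream": true
  }'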

Interactive Chat (kt chat)

The kt CLI ships with an OpenAI-compatible chat client that talks to the SGLang server's /v1/chat/completions endpoint:

bash
kt chat --host 127.0.0.1 --port 30000 --temperature 0.7 --max-tokens 2048
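
Any OpenAI-compatible client can hit the same endpoint without the kt CLI; a plain curl equivalent (the model field can usually be set to the served model name or left as a placeholder):

bash
curl -s -X POST http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Explain quantum computing briefly."}],
    "temperature": 0.7,
    "max_tokens": 2048
  }'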