# KT-Kernel Weight Conversion Scripts
KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). Both tools work together to enable heterogeneous expert placement:
- **CPU weights** (`convert_cpu_weights.py`): Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- **GPU weights** (`convert_gpu_weights.py`): Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts

## convert_cpu_weights.py

Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These quantized weights are used for "cold" experts (less frequently accessed) that run on CPU in hybrid inference scenarios.
> ⚠️ **Precision warning**: Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original BF16 model as the source for INT4/INT8 quantization.
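If you are unsure which `--input-type` matches your checkpoint, a minimal sketch along these lines can read it from the model's `config.json` (the `detect_input_type` helper is illustrative, not part of the scripts):

```python
# Sketch (not part of the scripts): pick --input-type from the source
# checkpoint's config.json. FP8 checkpoints (e.g., DeepSeek releases)
# typically declare a quantization_config; BF16/FP16 ones a torch_dtype.
import json
import os

def detect_input_type(model_dir: str) -> str:
    with open(os.path.join(model_dir, "config.json")) as f:
        cfg = json.load(f)
    if "quantization_config" in cfg:
        return "fp8"
    return {"bfloat16": "bf16", "float16": "fp16"}.get(cfg.get("torch_dtype"), "bf16")

print(detect_input_type("/path/to/model"))  # e.g., "bf16"
```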
### Usage

**BF16 → INT4:**

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/bf16/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4
```
**FP16 → INT8:**

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/fp16/model \
    --input-type fp16 \
    --output /path/to/output \
    --quant-method int8
```
**FP8 → INT4** (see the precision warning above):

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/fp8/model \
    --input-type fp8 \
    --output /path/to/output \
    --quant-method int4
```
### Output format

By default, the converted weights are saved in SafeTensors format with a NUMA-aware layout:
```
output_dir/
├── model-00001-of-00050.safetensors
├── model-00002-of-00050.safetensors
├── ...
├── config.json
└── tokenizer files...
```
Each expert's weights are split across NUMA nodes for optimal memory access:

- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.weight`: Quantized weights
- `blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.scale`: Quantization scales
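To see this layout concretely, here is a short inspection sketch, assuming the `safetensors` package is installed (the shard file name is illustrative):

```python
# Sketch: list NUMA-split expert tensors in one converted shard and count
# tensors per NUMA node. Shard file name is illustrative.
from collections import defaultdict
from safetensors import safe_open

by_numa = defaultdict(list)
with safe_open("output_dir/model-00001-of-00050.safetensors", framework="pt") as f:
    for name in f.keys():
        # e.g., blk.3.ffn_gate_exps.17.numa.0.weight
        if ".numa." in name:
            numa_idx = name.split(".numa.")[1].split(".")[0]
            by_numa[numa_idx].append(name)

for numa_idx, names in sorted(by_numa.items()):
    print(f"NUMA node {numa_idx}: {len(names)} tensors")
```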
### Low-memory mode

For systems with insufficient memory to complete full-model quantization, use the `--no-merge-safetensor` flag to keep weights in a per-layer folder structure instead of merging them into safetensor files:

```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4 \
    --no-merge-safetensor
```
This will save quantized weights in the following folder structure:
```
output_dir/
├── _layer_0/
│   ├── _numa_0/
│   │   ├── INT4_down_0_*.kt
│   │   ├── INT4_gate_0_*.kt
│   │   └── INT4_up_0_*.kt
│   └── _numa_1/
│       └── ...
├── _layer_1/
│   └── ...
└── ...
```
**When to use `--no-merge-safetensor`:** on memory-constrained systems where quantizing and merging the full model at once would exhaust available memory. The per-layer folders also act as checkpoints for resuming an interrupted run.

### Resuming an interrupted conversion

If quantization still cannot complete even in this low-memory mode, restart the script with the `--resume-layer` argument to specify the layer from which to continue. In the example below, layers 0-11 are skipped and conversion resumes at layer 12:
```bash
python scripts/convert_cpu_weights.py \
    --input-path /path/to/model \
    --input-type bf16 \
    --output /path/to/output \
    --quant-method int4 \
    --no-merge-safetensor \
    --resume-layer 12
```
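If you are unsure where a crashed run stopped, a small sketch like this can suggest a `--resume-layer` value by scanning the partial output for the `_layer_N` folders shown above (the `suggest_resume_layer` helper is hypothetical):

```python
# Sketch: choose a --resume-layer value from a partial --no-merge-safetensor
# output directory. Re-converts the highest layer found, since the run may
# have died partway through it.
import os
import re

def suggest_resume_layer(output_dir: str) -> int:
    layers = [int(m.group(1))
              for entry in os.listdir(output_dir)
              if (m := re.fullmatch(r"_layer_(\d+)", entry))]
    return max(layers) if layers else 0

print(suggest_resume_layer("/path/to/output"))  # e.g., 12
```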
### Examples

**DeepSeek-V3.1 (FP8 source → INT4):**

```bash
python scripts/convert_cpu_weights.py \
    --input-path /mnt/data/models/DeepSeek-V3.1 \
    --input-type fp8 \
    --output /mnt/data/models/DeepSeek-V3.1-INT4 \
    --quant-method int4 \
    --cpuinfer-threads 60 \
    --threadpool-count 2
```
**Qwen3-Next-80B-A3B-Instruct (BF16 source → INT4, low-memory mode):**

```bash
python scripts/convert_cpu_weights.py \
    --input-path /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
    --input-type bf16 \
    --output /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-INT4 \
    --quant-method int4 \
    --cpuinfer-threads 60 \
    --threadpool-count 2 \
    --no-merge-safetensor
```
## convert_gpu_weights.py

### Prerequisites

GPU weight quantization requires additional dependencies. Install them before proceeding:

```bash
pip install accelerate transformers llmcompressor datasets
```
Required packages:

- `accelerate`: distributed model loading and device mapping
- `transformers`: model and tokenizer loading
- `llmcompressor`: quantization (supports GPTQ and RTN methods)
- `datasets`: calibration data loading (GPTQ only)

> **Documentation**: This tool is based on llmcompressor. For more details, see the llmcompressor quantization guide.
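As an optional sanity check before starting a long conversion, a few lines of Python can confirm all four packages import cleanly:

```python
# Sketch: verify the GPU-quantization dependencies are importable.
import importlib

for pkg in ("accelerate", "transformers", "llmcompressor", "datasets"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: ok")
    except ImportError:
        print(f"{pkg}: missing (pip install {pkg})")
```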
### Overview

Apply weight quantization to model weights for GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with `convert_cpu_weights.py` to enable heterogeneous expert placement, which maximizes throughput and resource utilization by distributing experts across CPUs and GPUs.
**Quantization methods:**

**GPTQ** (calibration-based):

Pros:

- Better accuracy recovery, thanks to Hessian-based calibration

Cons:

- Slower; requires calibration data and roughly twice the per-layer VRAM

**RTN** (round-to-nearest):

Pros:

- Fast; no calibration data needed and lower VRAM usage

Cons:

- Lower accuracy than GPTQ

(See the memory-requirements tables and method comparison below for concrete numbers.)
### Usage

**GPTQ (recommended for production):**

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16
```
**RTN (fast, no calibration):**

```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method RTN \
    --quant_type W4A16
```
### Memory requirements

Understanding memory requirements is crucial for successful quantization; they differ significantly between RTN and GPTQ.
**RTN**: beyond the model weights themselves, RTN only needs extra memory for the quantization parameters (scales/zero-points):
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters |
Example: for DeepSeek-R1-0528-BF16 (684B parameters), that means roughly 1.4 TB of DRAM for the BF16 weights (684B × 2 bytes) and, per the comparison table below, about 22 GB of VRAM for a single layer.
**GPTQ**: requires additional memory for Hessian matrices during calibration:
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters × 2 |
The Hessian matrix is approximately the same size as the layer weights and is used to improve accuracy recovery.

Example: for DeepSeek-R1-0528-BF16 (684B parameters), that is again roughly 1.4 TB of DRAM, plus about 45 GB of VRAM (single layer × 2 for the Hessian; see the comparison table below).

**Method comparison:**
| Method | Speed | VRAM | Accuracy | Use Case |
|---|---|---|---|---|
| RTN | Fast | Low (~22GB) | Good | Testing, prototyping |
| GPTQ | Slow | High (~45GB) | Better | Production deployment |
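As a back-of-envelope check, the rules in the tables above can be turned into a small estimator. The parameter counts below are illustrative, not measured values for any specific model:

```python
# Sketch: rough DRAM/VRAM estimates for quantizing a BF16 checkpoint,
# following the rules above: DRAM >= total params, VRAM >= one layer
# (x2 for GPTQ's Hessian matrices).
def memory_estimate_gib(total_params: float, layer_params: float, method: str):
    bytes_per_param = 2  # BF16 source weights
    dram_gib = total_params * bytes_per_param / 2**30
    vram_gib = layer_params * bytes_per_param / 2**30
    if method.upper() == "GPTQ":
        vram_gib *= 2  # Hessian is roughly the same size as the layer weights
    return dram_gib, vram_gib

# e.g., a 684B-parameter model with ~11B parameters per layer (illustrative)
dram, vram = memory_estimate_gib(684e9, 11e9, "GPTQ")
print(f"DRAM ~ {dram:,.0f} GiB, VRAM ~ {vram:,.0f} GiB")  # ~1274 GiB, ~41 GiB
```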
### Calibration options (GPTQ)

For GPTQ quantization, you can control the calibration process for better quantization quality:
```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --dataset HuggingFaceH4/ultrachat_200k \
    --dataset_split train_sft
```
Options (GPTQ only):

- `--num_calibration_samples`: Number of calibration samples (default: 512)
- `--max_sequence_length`: Maximum sequence length (default: 2048)
- `--dataset`: HuggingFace dataset for calibration
- `--dataset_split`: Dataset split to use
- `--dampening_frac`: Dampening fraction to reduce quantization noise (default: 0.1)

### Limiting GPU memory

Use `--max_gpu_memory` to limit GPU memory usage and offload remaining layers to CPU:
```bash
python scripts/convert_gpu_weights.py \
    --model_id /path/to/model \
    --output_dir /path/to/output \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --max_gpu_memory "40GiB"
```
Recommended settings for GPTQ:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |
Recommended settings for RTN:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |
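The ratios in these tables can also be approximated programmatically; the 0.5/0.85 fractions below are read off the tables above and are only a starting point, not a guarantee:

```python
# Sketch: derive a --max_gpu_memory value from total VRAM, mirroring the
# recommended-settings tables (GPTQ reserves ~half of VRAM for Hessians).
def suggest_max_gpu_memory(total_vram_gib: int, method: str) -> str:
    frac = 0.5 if method.upper() == "GPTQ" else 0.85
    return f"{int(total_vram_gib * frac)}GiB"

print(suggest_max_gpu_memory(48, "GPTQ"))  # "24GiB"
print(suggest_max_gpu_memory(48, "RTN"))   # "40GiB"
```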
Options:

- `--max_gpu_memory`: Maximum GPU memory for model weights per device (e.g., '40GiB')
- `--max_cpu_memory`: Maximum CPU memory (default: 1000GiB when `--max_gpu_memory` is set)

> **Important**: llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:
>
> - Reduce `--num_calibration_samples` (GPTQ only, e.g., 256)
> - Reduce `--max_sequence_length` (GPTQ only, e.g., 1024)
> - Use `--force_cpu` to run entirely on CPU (slower but avoids GPU OOM)

### Complete examples

**Qwen3-Next-80B-A3B-Instruct (GPTQ W4A16):**

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
    --output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
    --quant_method GPTQ \
    --quant_type W4A16 \
    --num_calibration_samples 512 \
    --max_sequence_length 2048 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```
**DeepSeek-R1-0528 (RTN W4A16):**

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
    --output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
    --quant_method RTN \
    --quant_type W4A16 \
    --max_gpu_memory "70GiB" \
    --trust_remote_code
```
**GLM-4.5-Air (GPTQ W8A16, custom calibration dataset):**

```bash
python scripts/convert_gpu_weights.py \
    --model_id /mnt/data/models/GLM-4.5-Air \
    --output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
    --quant_method GPTQ \
    --quant_type W8A16 \
    --dataset "tatsu-lab/alpaca" \
    --dataset_split "train" \
    --num_calibration_samples 256 \
    --max_gpu_memory "40GiB" \
    --trust_remote_code
```
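After any of these runs, a quick sanity check is to confirm that quantization metadata landed in the output's `config.json` (the key name may vary across llmcompressor versions; the path below is illustrative):

```python
# Sketch: confirm the converted checkpoint carries quantization metadata.
import json
import os

out_dir = "/mnt/data/models/GLM-4.5-Air-GPTQ-W8A16"
with open(os.path.join(out_dir, "config.json")) as f:
    cfg = json.load(f)
print("quantization_config present:", "quantization_config" in cfg)
```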