kt-kernel/scripts/README.md
KT-Kernel provides weight conversion tools for CPU-GPU hybrid inference (e.g., integrating KTransformers with SGLang). The CPU and GPU weight converters work together to enable heterogeneous expert placement, and a third tool converts LoRA adapters for serving:
- convert_cpu_weights.py: Quantize weights to INT4/INT8 with AMX optimization for CPU-resident "cold" experts
- convert_gpu_weights.py: Apply GPTQ/RTN quantization (W4A16/W8A16) for GPU-resident "hot" experts
- convert_kt_to_sglang_adapter.py: Convert KT SFT fused expert LoRA checkpoints into adapter-only SafeTensors directories

KT SFT fused expert LoRA saves MoE expert LoRA tensors in fused_expert_lora.safetensors using compact 3D tensors:
layers.{L}.experts.gate_lora_a
layers.{L}.experts.gate_lora_b
layers.{L}.experts.up_lora_a
layers.{L}.experts.up_lora_b
layers.{L}.experts.down_lora_a
layers.{L}.experts.down_lora_b
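To make the fused layout concrete, here is a minimal sketch of how one fused tensor could be expanded into per-expert matrices. It assumes that axis 0 of each 3D tensor indexes the expert, and the per-expert key format it emits is hypothetical; the names the converter actually writes may differ. Nested lists stand in for real tensors so the example runs without model files.

```python
from typing import Dict, List

Matrix = List[List[float]]

PROJECTIONS = ("gate", "up", "down")
FACTORS = ("lora_a", "lora_b")


def fused_key(layer: int, proj: str, factor: str) -> str:
    """Build a fused key in the layout listed above."""
    if proj not in PROJECTIONS or factor not in FACTORS:
        raise ValueError(f"unknown projection/factor: {proj}/{factor}")
    return f"layers.{layer}.experts.{proj}_{factor}"


def expand_fused(layer: int, proj: str, factor: str,
                 fused: List[Matrix]) -> Dict[str, Matrix]:
    """Split one fused 3D tensor into one 2D matrix per expert.

    ASSUMPTION: axis 0 indexes experts.
    ASSUMPTION: the output key format is hypothetical, for illustration only.
    """
    return {
        f"layers.{layer}.experts.{expert}.{proj}_{factor}": matrix
        for expert, matrix in enumerate(fused)
    }


# Two experts, each with a 1x2 matrix.
fused_tensor = [[[0.1, 0.2]], [[0.3, 0.4]]]
per_expert = expand_fused(0, "gate", "lora_a", fused_tensor)
```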
Use convert_kt_to_sglang_adapter.py to convert raw KT SFT output into one merged SGLang adapter directory:
python scripts/convert_kt_to_sglang_adapter.py /path/to/kt_adapter /path/to/sglang_adapter \
--base-model-name-or-path /path/to/base_model \
--lora-alpha 16 \
--overwrite
Output:
sglang_adapter/
├── adapter_config.json
└── adapter_model.safetensors
The converter merges the existing non-expert adapter_model.safetensors with expanded expert tensors from fused_expert_lora.safetensors. Pass this merged directory to SGLang with:
--enable-lora \
--lora-paths my_lora=/path/to/sglang_adapter
The KTransformers SGLang fork will auto-split the merged adapter internally at server startup. Users do not need to pass separate expert and non-expert adapter paths in the normal workflow.
Optional split outputs for debugging:
python scripts/convert_kt_to_sglang_adapter.py /path/to/kt_adapter /path/to/sglang_adapter \
--base-model-name-or-path /path/to/base_model \
--expert-output-dir /path/to/expert_adapter \
--nonexpert-output-dir /path/to/nonexpert_adapter \
--overwrite
Existing PEFT prefixes such as base_model.model. are stripped to match SGLang's loader. Scaling is not folded into the LoRA B tensors. Runtime scaling remains lora_alpha / r; if the input directory has no adapter_config.json, pass --lora-alpha explicitly.
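As a worked example of the scaling rule above, the sketch below computes the runtime factor lora_alpha / r. The function name and the rank value are illustrative assumptions; only the formula comes from the text.

```python
def lora_runtime_scale(lora_alpha: float, r: int) -> float:
    """Return the runtime LoRA scaling factor lora_alpha / r."""
    if r <= 0:
        raise ValueError("LoRA rank r must be positive")
    return lora_alpha / r


# With --lora-alpha 16 and an illustrative rank of 8, the LoRA update
# is multiplied by 16 / 8 = 2.0 at runtime.
scale = lora_runtime_scale(16, 8)
```

Because the converter leaves the LoRA B tensors unscaled, a wrong or missing lora_alpha changes this factor directly, which is why --lora-alpha must be passed when no adapter_config.json is available.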
This script only converts adapter files. Serving compatibility depends on the KTransformers SGLang runtime branch being used.
The unit tests use synthetic tensors and run without model files. To validate a real KT adapter directory, set these environment variables:
export KT_LORA_ADAPTER_DIR=/path/to/kt_adapter
export KT_LORA_BASE_MODEL=/path/to/base_model
export KT_LORA_ALPHA=16 # required only if the input has no adapter_config.json
Then run:
python -m pytest kt-kernel/test/per_commit/test_convert_kt_to_sglang_adapter_integration.py -q
To run a large adapter conversion smoke test, also set:
export KT_LORA_LARGE_ADAPTER_DIR=/path/to/large_kt_adapter
These integration tests check real fused tensor splitting, optional adapter_model.safetensors merging, adapter_config.json compatibility with sglang.srt.lora.lora_config.LoRAConfig, and large-file readability. They intentionally do not start an SGLang server or validate runtime FusedMoE LoRA application.
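Before running the integration tests, it can help to confirm that the variables are set. The helper below is a hypothetical preflight check, not part of the test suite; it only inspects the two variables the text lists without an "optional" qualifier.

```python
import os
from typing import List, Mapping

# Variables the integration tests read, per the instructions above.
REQUIRED_VARIABLES = ("KT_LORA_ADAPTER_DIR", "KT_LORA_BASE_MODEL")


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


# Check the current process environment.
missing = missing_variables(os.environ)
if missing:
    print("Set these variables first:", ", ".join(missing))
```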
Convert weights to INT4/INT8 format optimized for AMX inference on CPU. These quantized weights are used for "cold" experts (less frequently accessed) that run on CPU in hybrid inference scenarios.
⚠️ Precision Warning: Quantizing directly from FP8 to INT4/INT8 may cause significant accuracy degradation. For best results, use the original BF16 model as the source for INT4/INT8 quantization.
python scripts/convert_cpu_weights.py \
--input-path /path/to/bf16/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4
python scripts/convert_cpu_weights.py \
--input-path /path/to/fp16/model \
--input-type fp16 \
--output /path/to/output \
--quant-method int8
python scripts/convert_cpu_weights.py \
--input-path /path/to/fp8/model \
--input-type fp8 \
--output /path/to/output \
--quant-method int4
By default, the converted weights are saved in SafeTensors format with NUMA-aware layout:
output_dir/
├── model-00001-of-00050.safetensors
├── model-00002-of-00050.safetensors
├── ...
├── config.json
└── tokenizer files...
Each expert's weights are split across NUMA nodes for optimal memory access:
- blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.weight: Quantized weights
- blk.{layer}.ffn_{proj}_exps.{expert}.numa.{numa_idx}.scale: Quantization scales

For systems with insufficient memory to complete full model quantization, use the --no-merge-safetensor flag to keep weights in a per-layer folder structure without merging them into SafeTensors files:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor
This will save quantized weights in the following folder structure:
output_dir/
├── _layer_0/
│ ├── _numa_0/
│ │ ├── INT4_down_0_*.kt
│ │ ├── INT4_gate_0_*.kt
│ │ └── INT4_up_0_*.kt
│ └── _numa_1/
│ └── ...
├── _layer_1/
│ └── ...
└── ...
When to use --no-merge-safetensor:
If a memory-constrained system still cannot complete quantization with --no-merge-safetensor enabled, restart the script with the --resume-layer argument to specify the layer from which to continue. The example below skips layers 0-11 and resumes conversion at layer 12.
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4 \
--no-merge-safetensor \
--resume-layer 12
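One way to pick a value for --resume-layer is to look at which _layer_N folders already exist in the output directory. The helper below is a hypothetical sketch based only on the folder layout shown earlier. It assumes the highest-numbered folder may be incomplete after an interrupted run, so it returns that index as a conservative resume point rather than the next one.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the per-layer folders shown in the output layout, e.g. "_layer_12".
LAYER_DIR = re.compile(r"^_layer_(\d+)$")


def last_layer_on_disk(output_dir: str) -> Optional[int]:
    """Return the highest layer index with a folder on disk, or None."""
    indices = []
    for entry in Path(output_dir).iterdir():
        match = LAYER_DIR.match(entry.name)
        if entry.is_dir() and match:
            indices.append(int(match.group(1)))
    return max(indices) if indices else None
```

For example, if the folders _layer_0 through _layer_12 exist, the helper returns 12, which corresponds to --resume-layer 12 in the command above.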
python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/DeepSeek-V3.1 \
--input-type fp8 \
--output /mnt/data/models/DeepSeek-V3.1-INT4 \
--quant-method int4 \
--cpuinfer-threads 60 \
--threadpool-count 2
python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
--input-type bf16 \
--output /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-INT4 \
--quant-method int4 \
--cpuinfer-threads 60 \
--threadpool-count 2 \
--no-merge-safetensor
GPU weight quantization requires additional dependencies. Install them before proceeding:
pip install accelerate transformers llmcompressor datasets
Required packages:
- accelerate: For distributed model loading and device mapping
- transformers: For model and tokenizer loading
- llmcompressor: For quantization (supports GPTQ and RTN methods)
- datasets: For calibration data loading (GPTQ only)

Documentation: This tool is based on llmcompressor. For more details, see the llmcompressor quantization guide.
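Apply weight quantization to model weights for GPU-resident "hot" experts (frequently accessed) in CPU-GPU hybrid inference. This tool works together with convert_cpu_weights.py to enable heterogeneous expert placement: hot experts run on GPU with the GPTQ/RTN-quantized weights produced here, while cold experts run on CPU with the INT4/INT8 weights produced by convert_cpu_weights.py.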
This approach maximizes throughput and resource utilization by intelligently distributing experts across CPUs and GPUs.
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method RTN \
--quant_type W4A16
Understanding memory requirements is crucial for successful quantization. The requirements differ significantly between RTN and GPTQ methods.
RTN only requires memory for quantization parameters (scales/zero-points):
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters |
Example: DeepSeek-R1-0528-BF16 (684B parameters)
GPTQ requires additional memory for Hessian matrices during calibration:
| Component | Requirement |
|---|---|
| DRAM (CPU Memory) | ≥ Total model parameters |
| VRAM (GPU Memory) | ≥ Single layer parameters × 2 |
The Hessian matrix is approximately the same size as the layer weights and is used to improve accuracy recovery.
Example: DeepSeek-R1-0528-BF16 (684B parameters)
| Method | Speed | VRAM | Accuracy | Use Case |
|---|---|---|---|---|
| RTN | Fast | Low (~22GB) | Good | Testing, prototyping |
| GPTQ | Slow | High (~45GB) | Better | Production deployment |
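The two requirement tables can be turned into a rough estimate. The sketch below assumes 2 bytes per parameter (BF16), counts weights only, and reads "≥ Total model parameters" as the memory needed to hold all of them. The per-layer parameter count is an illustrative figure chosen to line up with the ~22GB and ~45GB values in the comparison table, not a measured number.

```python
from typing import Tuple

# VRAM multiplier per method: GPTQ needs roughly one extra layer-sized
# buffer for the Hessian, per the tables above.
VRAM_FACTOR = {"RTN": 1, "GPTQ": 2}


def min_memory_gb(total_params: float, layer_params: float,
                  method: str, bytes_per_param: float = 2.0) -> Tuple[float, float]:
    """Return (DRAM, VRAM) lower bounds in GB for weights only."""
    factor = VRAM_FACTOR[method.upper()]
    dram_gb = total_params * bytes_per_param / 1e9
    vram_gb = layer_params * bytes_per_param * factor / 1e9
    return dram_gb, vram_gb


# 684B total parameters (from the example above); 11B parameters per
# layer is an ASSUMED, illustrative value.
rtn = min_memory_gb(684e9, 11e9, "RTN")    # about (1368 GB, 22 GB)
gptq = min_memory_gb(684e9, 11e9, "GPTQ")  # about (1368 GB, 44 GB)
```

These are lower bounds; activations, calibration batches, and framework overhead add to the real footprint.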
For GPTQ quantization, control the calibration process for better quantization quality:
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16 \
--num_calibration_samples 512 \
--max_sequence_length 2048 \
--dataset HuggingFaceH4/ultrachat_200k \
--dataset_split train_sft
Options (GPTQ only):
- --num_calibration_samples: Number of samples for calibration (default: 512)
- --max_sequence_length: Maximum sequence length (default: 2048)
- --dataset: HuggingFace dataset for calibration
- --dataset_split: Dataset split to use
- --dampening_frac: Dampening fraction to reduce quantization noise (default: 0.1)

Use --max_gpu_memory to limit GPU memory usage and offload remaining layers to CPU:
python scripts/convert_gpu_weights.py \
--model_id /path/to/model \
--output_dir /path/to/output \
--quant_method GPTQ \
--quant_type W4A16 \
--max_gpu_memory "40GiB"
Recommended settings for GPTQ:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 10-12 GiB | Reserve ~50% for Hessian |
| 48 GiB | 24-30 GiB | Reserve ~40% for Hessian |
| 80 GiB | 40-50 GiB | Reserve ~40% for Hessian |
Recommended settings for RTN:
| GPU VRAM | Suggested --max_gpu_memory | Notes |
|---|---|---|
| 24 GiB | 18-20 GiB | No Hessian needed |
| 48 GiB | 40-45 GiB | No Hessian needed |
| 80 GiB | 70-75 GiB | No Hessian needed |
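The recommendations in the two tables above can be encoded as a small lookup. This helper is a hypothetical convenience, not part of convert_gpu_weights.py. It covers only the three tabulated VRAM sizes and returns the lower end of each suggested range as the conservative choice.

```python
from typing import Dict, Tuple

# (low, high) suggested --max_gpu_memory in GiB, copied from the tables.
SUGGESTED_GIB: Dict[str, Dict[int, Tuple[int, int]]] = {
    "GPTQ": {24: (10, 12), 48: (24, 30), 80: (40, 50)},
    "RTN": {24: (18, 20), 48: (40, 45), 80: (70, 75)},
}


def suggest_max_gpu_memory(method: str, vram_gib: int) -> str:
    """Return a conservative --max_gpu_memory value such as '40GiB'.

    Raises KeyError for methods or VRAM sizes not listed in the tables.
    """
    low, _high = SUGGESTED_GIB[method.upper()][vram_gib]
    return f"{low}GiB"


flag_value = suggest_max_gpu_memory("GPTQ", 80)
```

For an 80 GiB GPU with GPTQ, this yields the same "40GiB" value used in the command above.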
Options:
- --max_gpu_memory: Maximum GPU memory for model weights per device (e.g., '40GiB')
- --max_cpu_memory: Maximum CPU memory (default: 1000GiB when --max_gpu_memory is set)

Important: llmcompressor does not support disk offloading. Ensure your machine has enough GPU + CPU memory to load the entire model. If you still encounter OOM:

- Reduce --num_calibration_samples (GPTQ only, e.g., 256)
- Reduce --max_sequence_length (GPTQ only, e.g., 1024)
- Use --force_cpu to run entirely on CPU (slower but avoids GPU OOM)

python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/Qwen3-Next-80B-A3B-Instruct \
--output_dir /mnt/data/models/Qwen3-Next-80B-A3B-Instruct-GPTQ-W4A16 \
--quant_method GPTQ \
--quant_type W4A16 \
--num_calibration_samples 512 \
--max_sequence_length 2048 \
--max_gpu_memory "40GiB" \
--trust_remote_code
python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/DeepSeek-R1-0528-BF16 \
--output_dir /mnt/data/models/DeepSeek-R1-0528-RTN-W4A16 \
--quant_method RTN \
--quant_type W4A16 \
--max_gpu_memory "70GiB" \
--trust_remote_code
python scripts/convert_gpu_weights.py \
--model_id /mnt/data/models/GLM-4.5-Air \
--output_dir /mnt/data/models/GLM-4.5-Air-GPTQ-W8A16 \
--quant_method GPTQ \
--quant_type W8A16 \
--dataset "tatsu-lab/alpaca" \
--dataset_split "train" \
--num_calibration_samples 256 \
--max_gpu_memory "40GiB" \
--trust_remote_code