# SFT Installation Guide: Kimi-K2.5
This tutorial demonstrates how to perform LoRA Supervised Fine-Tuning (SFT) on Kimi-K2.5 using LlamaFactory with KTransformers as the backend, and then serve the fine-tuned model using SGLang.
The workflow is:
KTransformers + LlamaFactory LoRA SFT → (Optional) LlamaFactory Verification → SGLang Serving
We recommend using two separate conda environments:
| Environment | Purpose |
|---|---|
| `kt-kernel` | Inference & serving (KTransformers + SGLang) |
| `kt-sft` | Training (LlamaFactory + KTransformers SFT backend) |
### kt-kernel

```bash
conda create -n kt-kernel python=3.11
conda activate kt-kernel
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git checkout kimi_k2.5
git submodule update --init --recursive
cd kt-kernel && ./install.sh
```
Install SGLang (recommended setup for Kimi-K2.5):

```bash
# Option A: One-click install (from the ktransformers root; installs sglang + kt-kernel)
./install.sh

# Option B: pip install
pip install sglang-kt
```
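An optional sanity check that the environment imports cleanly (a minimal sketch; it assumes `sglang` exposes `__version__`, which recent releases do):

```bash
# Verify sglang imports and CUDA is visible to PyTorch
python -c "import sglang; print(sglang.__version__)"
python -c "import torch; print(torch.cuda.is_available())"  # should print True on a GPU machine
```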
### kt-sft

```bash
conda create -n kt-sft python=3.11
conda activate kt-sft
git clone https://github.com/hiyouga/LlamaFactory.git
cd LlamaFactory
pip install -e .
```
```bash
conda install -y -c conda-forge libstdcxx-ng gcc_impl_linux-64
conda install -y -c nvidia/label/cuda-11.8.0 cuda-runtime

# Install matching wheels (recommended), from https://github.com/kvcache-ai/ktransformers/releases
pip install ktransformers-<matching-version>.whl
pip install flash_attn-<matching-version>.whl
```
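An optional import check for the training environment (a sketch; the `ktransformers` module name is assumed to match the wheel, and `flash_attn.__version__` exists in upstream flash-attention):

```bash
# Verify the installed wheels import without errors
python -c "import ktransformers"                              # module name assumed to match the wheel
python -c "import flash_attn; print(flash_attn.__version__)"
```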
KTransformers requires BF16 weights for SFT.
```bash
# Download Kimi-K2.5 (RAW-INT4, used for both CPU and GPU)
huggingface-cli download moonshotai/Kimi-K2.5 \
  --local-dir /path/to/kimi-k2.5
```
The Kimi-K2.5 base model is distributed in INT4 format; convert it to BF16 before running SFT.
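The exact conversion command is not listed here; as a hedged illustration only, the step might look like the following (the script name `convert_bf16_weights.py` and its flags are hypothetical; check `kt-kernel/scripts/` in the ktransformers repo for the actual tool):

```bash
# Hypothetical script name and flags; verify against kt-kernel/scripts/ before use
python ktransformers/kt-kernel/scripts/convert_bf16_weights.py \
  --base_path /path/to/kimi-k2.5 \
  --output_dir /path/to/kimi-k2.5-bf16
```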
Example file: `examples/train_lora/kimik2_lora_sft_kt.yaml`
Required fields:

```yaml
stage: sft
finetuning_type: lora
bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
```
Other fields (dataset, output_dir, learning rate, epochs) can be adjusted as usual.
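For orientation, here is a minimal sketch of a complete training YAML; the dataset name, paths, and hyperparameters are placeholders, and the `lora_target` and `template` values are assumptions, so adjust them to your setup. Only the fields listed above are specific to the KT backend:

```yaml
# Sketch only: placeholder values, adjust for your setup
model_name_or_path: /path/to/kimi-k2.5-bf16
stage: sft
do_train: true
finetuning_type: lora
lora_target: all              # assumption: LlamaFactory's "all" shortcut
dataset: identity             # placeholder dataset name
template: default             # assumption: use the template matching Kimi-K2.5
bf16: true
use_kt: true
kt_optimize_rule: <rule.yaml>
cpu_infer: 32
chunk_size: 8192
output_dir: /path/to/output_dir
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```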
Key requirements for the verification YAML (`examples/inference/kimik2_lora_sft_kt.yaml`, used in the optional check below):

- `adapter_name_or_path`: the LoRA output directory
- `infer_backend: ktransformers`
- `use_kt` and `kt_optimize_rule`: same values as in training

This YAML is used only for quick verification, not production serving.
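A minimal sketch of such a verification YAML (paths are placeholders; `template` is an assumption):

```yaml
# Sketch only: quick-verification config, not for production serving
model_name_or_path: /path/to/kimi-k2.5-bf16
adapter_name_or_path: /path/to/output_dir   # LoRA output directory from training
template: default                           # assumption: match the training template
infer_backend: ktransformers
use_kt: true
kt_optimize_rule: <rule.yaml>               # same rule file as in training
```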
Run training:

```bash
conda activate kt-sft
cd LlamaFactory
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
```
After training, the LoRA adapter is saved to `output_dir`.
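You can spot-check the output directory; assuming LlamaFactory's usual PEFT-style layout (an assumption, not stated in this guide), it should contain the adapter weights and config:

```bash
ls /path/to/output_dir
# Expect files such as adapter_config.json and adapter_model.safetensors (PEFT-style layout)
```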
Before production deployment, a lightweight sanity check is recommended.
```bash
conda activate kt-sft
cd LlamaFactory
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
```
Purpose: confirm that the LoRA adapter loads correctly and the fine-tuned model produces reasonable responses before moving to production serving.

For serving, the fine-tuned model runs on SGLang with the kt-kernel runtime. First, convert the LoRA adapter into the format expected by the KTransformers backend:
```bash
python ktransformers/kt-kernel/scripts/convert_lora.py \
  --base_path /path/to/kimi-base-model \
  --lora_path /path/to/llamafactory/output_dir \
  --output_path /path/to/lora_converted
```
To reduce CPU memory usage:

```bash
python ktransformers/kt-kernel/scripts/convert_cpu_weights.py \
  --base_path /path/to/kimi-base-model \
  --output_dir /path/to/kimi-base-model-int8
```
This produces `/path/to/kimi-base-model-int8/int8`.
Then launch the SGLang server:

```bash
conda activate kt-kernel
python -m sglang.launch_server \
  --enable-lora \
  --lora-paths lora1=/path/to/lora_converted \
  --lora-backend triton \
  --model-path /path/to/kimi-base-model \
  --tp 1 \
  --trust-remote-code \
  --context-length 4096 \
  --kt-weight-path /path/to/kimi-base-model-int8/int8 \
  --mem-fraction-static 0.9
```
Notes:

- `--kt-weight-path` points to the CPU INT8 weights
- Adjust `--tp`, `--context-length`, and the memory parameters per machine
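Once the server is up, you can send a test request against SGLang's native `/generate` endpoint, selecting the adapter by the name registered via `--lora-paths` (port 30000 is SGLang's default; adjust if you passed `--port`):

```bash
# Test request against the fine-tuned adapter "lora1"
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Briefly introduce yourself.",
    "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
    "lora_path": "lora1"
  }'
```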