doc/en/SFT/Qwen3.5-SGLang-LoRA-Serving.md
Last updated: 2026-06-01
This guide documents the current KT-FT loop for Qwen3.5 MoE: train with KT SFT, convert the output once, and serve the fine-tuned result through SGLang with a single merged adapter path.
KT SFT raw output
-> convert_kt_to_sglang_adapter.py
-> <MERGED_ADAPTER_DIR>
-> sglang --lora-paths <name>=<MERGED_ADAPTER_DIR>
-> server auto-splits expert / non-expert internally
-> request model=<served_model>:<name>
Training-side KT SFT docs remain separate. This page focuses on the bridge from trained LoRA artifacts to online inference.
Currently supported and validated workflow:
Qwen3.5-35B-A3B
After LLaMA-Factory + KT training, the output directory contains two LoRA artifacts:
<KT_SFT_OUTPUT_DIR>/
adapter_model.safetensors # non-expert LoRA
fused_expert_lora.safetensors # expert LoRA in KT fused format
adapter_config.json
Do not pass this raw directory directly to SGLang serving.
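Before converting, it can help to confirm that the raw output directory really has the layout listed above. The following is a minimal Python sketch that assumes only the three file names from that listing; the helper name `missing_kt_artifacts` is hypothetical and is not part of KT or SGLang.

```python
from pathlib import Path

# File names taken from the directory listing above.
RAW_KT_FILES = (
    "adapter_model.safetensors",      # non-expert LoRA
    "fused_expert_lora.safetensors",  # expert LoRA in KT fused format
    "adapter_config.json",
)


def missing_kt_artifacts(output_dir):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in RAW_KT_FILES if not (root / name).is_file()]
```

An empty list means all three files are present; any returned name points at a training run that did not finish writing its artifacts.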
Run the converter once to produce the serving input:
<MERGED_ADAPTER_DIR>/
adapter_config.json
adapter_model.safetensors
This merged directory contains both expert and non-expert LoRA tensors in one PEFT-style adapter. Pass only this directory to --lora-paths.
python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
<KT_SFT_OUTPUT_DIR> \
<MERGED_ADAPTER_DIR> \
--base-model-name-or-path /path/to/Qwen3.5-35B-A3B \
--overwrite
Example:
python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
saves/KT_FT_qwen35B_Moe_nekoqa_eod_240 \
saves/KT_FT_qwen35B_Moe_nekoqa_eod_240_sglang \
--base-model-name-or-path /mnt/data3/models/Qwen3.5-35B-A3B \
--overwrite
The converter reads fused_expert_lora.safetensors and the existing non-expert adapter_model.safetensors, then writes one merged adapter directory.
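To sanity-check that the merged file holds both kinds of tensors, one option is to read only the safetensors header, which lists tensor names without loading any weights. This sketch relies on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header). The `"experts"` substring used to classify keys is an assumption about naming, not something the converter documents, and both helper names are hypothetical.

```python
import json
import struct


def list_safetensors_keys(path):
    """Return tensor names from a .safetensors file without loading weights."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by a JSON header that maps tensor names to metadata.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(k for k in header if k != "__metadata__")


def count_expert_keys(keys, marker="experts"):
    """Return (expert, non_expert) key counts using a substring match.

    The default marker is an assumption; adjust it after inspecting
    the real key names in your adapter.
    """
    expert = sum(1 for k in keys if marker in k)
    return expert, len(keys) - expert
```

If either count comes back as zero for `<MERGED_ADAPTER_DIR>/adapter_model.safetensors`, print a few keys first to check whether the marker simply does not match the actual naming before concluding that the merge failed.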
Optional split outputs for debugging:
python kt-kernel/scripts/convert_kt_to_sglang_adapter.py \
<KT_SFT_OUTPUT_DIR> \
<MERGED_ADAPTER_DIR> \
--base-model-name-or-path /path/to/Qwen3.5-35B-A3B \
--expert-output-dir <EXPERT_ADAPTER_DIR> \
--nonexpert-output-dir <NONEXPERT_ADAPTER_DIR> \
--overwrite
For normal serving, only <MERGED_ADAPTER_DIR> is needed.
Use the KTransformers SGLang fork from this repository and point PYTHONPATH at both kt-kernel/python and third_party/sglang/python.
cd /path/to/ktransformers
PYTHONPATH=/path/to/ktransformers/kt-kernel/python:/path/to/ktransformers/third_party/sglang/python:$PYTHONPATH \
python -m sglang.launch_server \
--host 127.0.0.1 \
--port 30006 \
--model-path /path/to/Qwen3.5-35B-A3B \
--tokenizer-path /path/to/Qwen3.5-35B-A3B \
--kt-weight-path /path/to/Qwen3.5-35B-A3B-AMXINT4 \
--kt-method AMXINT4 \
--kt-cpuinfer 60 \
--kt-threadpool-count 2 \
--kt-numa-nodes 0 1 \
--kt-num-gpu-experts 0 \
--attention-backend flashinfer \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 2 \
--max-total-tokens 32000 \
--served-model-name qwen3.5-kt-ft \
--enable-mixed-chunk \
--tensor-parallel-size 4 \
--enable-p2p-check \
--disable-cuda-graph \
--disable-custom-all-reduce \
--enable-lora \
--lora-backend triton \
--lora-paths qwen35b_neko=/path/to/KT_FT_qwen35B_Moe_nekoqa_eod_240_sglang \
--log-level info
Important points:
Pass the merged adapter directory through --lora-paths.
Do not pass --kt-expert-lora-path in the normal user workflow.
Runtime cache: $TMPDIR/sglang_kt_lora_cache/ (or $SGLANG_KT_LORA_CACHE_DIR if set).
Use --lora-backend triton for Qwen3.5 full-LoRA generation.
Current constraints:
--kt-num-gpu-experts 0
--kt-enable-dynamic-expert-update
--kt-gpu-prefill-token-threshold
AMXINT4, AMXINT8, AMXBF16, or BF16
The OpenAI-compatible request model field uses names, not paths.
--served-model-name qwen3.5-kt-ft
--lora-paths qwen35b_neko=/path/to/merged_adapter
Request behavior in the current single-adapter implementation:
model=qwen3.5-kt-ft
=> base + KT expert LoRA
model=qwen3.5-kt-ft:qwen35b_neko
=> base + KT expert LoRA + SGLang non-expert LoRA
The suffix after : must match the left-side name in --lora-paths.
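The naming rule can be expressed as two small helpers: one splits a `--lora-paths` entry into its name and directory, the other builds the request `model` string from the served model name and the adapter name. This is an illustrative sketch only; both function names are hypothetical and SGLang does not expose them.

```python
def parse_lora_path(entry):
    """Split a --lora-paths entry of the form <name>=<dir> into (name, dir)."""
    name, sep, path = entry.partition("=")
    if not sep or not name or not path:
        raise ValueError(f"expected <name>=<dir>, got {entry!r}")
    return name, path


def request_model(served_model_name, lora_name=None):
    """Build the request "model" field: <served_model> or <served_model>:<name>."""
    if lora_name is None:
        return served_model_name
    return f"{served_model_name}:{lora_name}"
```

For the launch flags above, `request_model("qwen3.5-kt-ft", parse_lora_path("qwen35b_neko=/path/to/merged_adapter")[0])` yields `qwen3.5-kt-ft:qwen35b_neko`.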
curl -sS http://127.0.0.1:30006/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3.5-kt-ft:qwen35b_neko",
"messages": [{"role": "user", "content": "我回来了,你在干嘛?"}],
"temperature": 0.7,
"max_tokens": 160,
"chat_template_kwargs": {"enable_thinking": false}
}'
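The same request can be sent from Python using only the standard library. The sketch below mirrors the curl body field by field; the helper names are hypothetical, and the host and port are the ones from the launch example.

```python
import json
import urllib.request


def build_chat_payload(model, user_text, max_tokens=160, temperature=0.7):
    """Assemble the same JSON body as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"enable_thinking": False},
    }


def post_chat(base_url, payload, timeout=60):
    """POST the payload to /v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call would look like `post_chat("http://127.0.0.1:30006", build_chat_payload("qwen3.5-kt-ft:qwen35b_neko", "Hello"))`. Assuming the server follows the usual OpenAI chat completion response shape, the generated text is under `choices[0].message.content`.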
Startup logs should include lines similar to:
Prepared merged KT LoRA adapter ... for runtime: expert=... nonexpert=...
Loaded KT expert LoRA for layer ...
Using triton as backend of LoRA kernels.
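If the server log is captured to a file, a quick scripted check can report which of these lines never appeared. The substrings below are copied from the example lines above; since the real log wording is only described as similar, treat them as assumptions to adjust. The helper name is hypothetical.

```python
# Substrings taken from the example startup lines above.
EXPECTED_MARKERS = (
    "Prepared merged KT LoRA adapter",
    "Loaded KT expert LoRA for layer",
    "Using triton as backend of LoRA kernels.",
)


def missing_startup_markers(log_text):
    """Return the expected substrings that never appear in log_text."""
    return [m for m in EXPECTED_MARKERS if m not in log_text]
```

An empty result suggests the merged adapter was prepared, expert LoRA was loaded, and the triton LoRA backend is active.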
The older split-runtime contract is still available for debugging:
--kt-expert-lora-path <EXPERT_ADAPTER_DIR> \
--enable-lora \
--lora-paths <NONEXPERT_LORA_NAME>=<NONEXPERT_ADAPTER_DIR>
This is not the recommended user-facing path. Normal users should pass one merged adapter directory through --lora-paths only.
Got LoRA adapter that has never been loaded: lora0
The adapter name in the request must match the left side of --lora-paths. If you launched with qwen35b_neko=..., request model=qwen3.5-kt-ft:qwen35b_neko, not :lora0.
Make sure you are serving the intended merged adapter directory. For example, use the Neko adapter at ..._nekoqa_eod_240_sglang, not a generic sanity adapter such as ..._Moe_sglang.
connection refused
Check that the server is listening on the port you curl, and remember that the example above binds to 127.0.0.1, not 0.0.0.0.
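A plain TCP probe separates "nothing is listening" from application-level errors. This is a generic standard-library sketch and not specific to SGLang; the helper name is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run `port_is_open("127.0.0.1", 30006)` on the server machine itself. If it succeeds there but fails from another host, the 127.0.0.1 bind address is the likely cause.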
python - <<'PY'
import inspect
import sglang.srt.models.qwen3_5 as qwen3_5
print(inspect.getfile(qwen3_5))
PY
The path should come from this repository's third_party/sglang.