doc/en/balance-serve.md
We are excited to announce the official release of the long-awaited KTransformers v0.2.4! In this version, we have added the community's most-requested feature, multi-concurrency support, through a major refactor of the entire architecture, updating more than 10,000 lines of code. Drawing inspiration from the excellent architecture of sglang, we implemented high-performance asynchronous concurrent scheduling in C++, including continuous batching, chunked prefill, and more. Thanks to GPU sharing under concurrent workloads, overall throughput also improves to a certain extent. The following is a demonstration:
https://github.com/user-attachments/assets/faa3bda2-928b-45a7-b44f-21e12ec84b8a
- Implemented custom_flashinfer @Atream @ovowei @qiyuxinlin
- Implemented the balance_serve engine based on FlashInfer @qiyuxinlin @ovowei
- Implemented a continuous batching scheduler in C++ @ErvinXie
- release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao
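To make those scheduling terms concrete, here is a minimal, illustrative Python sketch of continuous batching with chunked prefill. It is not the actual C++ scheduler; the `Request` class and the `CHUNK_SIZE`/`MAX_BATCH` constants are hypothetical, chosen only to mirror the ideas above:

```python
# Toy model of continuous batching + chunked prefill (illustrative only).
from collections import deque
from dataclasses import dataclass

CHUNK_SIZE = 256  # max prompt tokens prefilled per engine step (hypothetical)
MAX_BATCH = 4     # max requests co-scheduled in one step (hypothetical)

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0
    generated: int = 0

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens

def step(batch: list[Request]) -> None:
    """One engine iteration: spend a chunk budget on prefill, then decode."""
    budget = CHUNK_SIZE
    for r in batch:
        if r.prefilled < r.prompt_len:  # chunked prefill: prompts advance
            take = min(budget, r.prompt_len - r.prefilled)  # piece by piece
            r.prefilled += take
            budget -= take
        else:                           # decode: one new token per step
            r.generated += 1

def serve(pending: deque[Request]) -> None:
    batch: list[Request] = []
    while pending or batch:
        # Continuous batching: admit new requests the moment a slot frees
        # up, instead of waiting for the whole batch to finish.
        while pending and len(batch) < MAX_BATCH:
            batch.append(pending.popleft())
        step(batch)
        batch = [r for r in batch if not r.done]  # retire finished requests

serve(deque([Request(prompt_len=1000, max_new_tokens=8),
             Request(prompt_len=100, max_new_tokens=4)]))
```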
Pull the image from Docker Hub, using v0.2.4-AVX512 as an example:
```bash
docker pull approachingai/ktransformers:v0.2.4-AVX512
docker run -it --gpus all --privileged --shm-size 64g --name ktrans --network=host -v /mnt:/mnt approachingai/ktransformers:v0.2.4-AVX512 /bin/bash
# Open a new terminal
docker exec -it ktrans bash
```
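Optionally, verify that the GPUs are visible inside the container. This quick check assumes PyTorch is available in the image (if not, run it after the installation steps below):

```python
# Quick GPU visibility check (assumes PyTorch is installed).
import torch

print(torch.cuda.is_available())      # expect: True
print(torch.cuda.device_count())      # number of GPUs exposed by --gpus all
print(torch.cuda.get_device_name(0))  # model name of the first visible GPU
```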
⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!
We recommend using Miniconda3/Anaconda3 for environment management:
```bash
# Download Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Create environment
conda create --name ktransformers python=3.11
conda activate ktransformers
# Install required libraries
conda install -c conda-forge libstdcxx-ng
# Verify GLIBCXX version (should include 3.4.32)
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```
Note: Adjust the path if your installation directory differs from `~/anaconda3` (for example, Miniconda installs to `~/miniconda3` by default).
```bash
# System dependencies
sudo apt install libtbb-dev libssl-dev libcurl4-openssl-dev libaio1 libaio-dev libfmt-dev libgflags-dev zlib1g-dev patchelf
# Python build dependencies
pip3 install packaging ninja cpufeature numpy openai
# PyTorch with CUDA 12.6
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```
```bash
# Clone repository
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive

# Install (single NUMA)
USE_BALANCE_SERVE=1 bash ./install.sh
# For dual-socket machines with 1 TB of RAM (dual NUMA):
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
```
Use our optimized configuration for constrained VRAM:
```bash
python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve \
  --force_think  # useful for R1
```
The key arguments are:

- `--max_new_tokens`: Maximum number of tokens generated per request.
- `--cache_lens`: Total length of the KV cache allocated by the scheduler. All requests share a single KV cache space (here, 32768 tokens), and a request's share is released once it completes.
- `--max_batch_size`: Maximum number of requests (prefill + decode) processed in a single engine run. (Supported only by balance_serve.)
- `--chunk_size`: Maximum number of tokens processed in a single engine run.
- `--backend_type`: `balance_serve` is the multi-concurrency backend engine introduced in v0.2.4; the original single-concurrency engine is `ktransformers`.
- `--model_path`: Path to the directory holding the model's configuration files (only the config is required, not the safetensors weights). As of v0.2.4, this must be a local directory; Hugging Face links (e.g., deepseek-ai/DeepSeek-R1) are not supported at the moment.
- `--force_think`: Force the response to include DeepSeek R1's reasoning tag.

The values of `max_batch_size`, `cache_lens`, and `max_new_tokens` should satisfy `cache_lens > max_batch_size * max_new_tokens`; otherwise, concurrency will decrease. See the sanity check below.
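With the example values above, the inequality holds with plenty of headroom; a quick check in plain Python (variable names are illustrative):

```python
# Sanity check for the example flag values above.
max_batch_size = 4
max_new_tokens = 1024
cache_lens = 32768

needed = max_batch_size * max_new_tokens  # 4096 tokens in the worst case
assert cache_lens > needed, "KV cache too small for full concurrency"
print(f"{needed} tokens needed vs. {cache_lens} available")
```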
```bash
curl -X POST http://localhost:10002/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "DeepSeek-R1",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
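The same endpoint can also be driven programmatically with the `openai` package installed earlier, since the API is OpenAI-compatible. A minimal streaming sketch, assuming the server does not check API keys (the `api_key` value is a placeholder):

```python
# Stream a chat completion from the local balance_serve endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.3,
    top_p=1.0,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```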