KT-Kernel

High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML and blis (amd library) support.

Note
Features
Installation
- Option 1: Install from PyPI (Recommended for Most Users)
- Option 2: Install from Source (For Local Use or Custom Builds)
Verification
KT CLI Overview
Integration with SGLang
Direct Python API Usage
- Advanced Options
- Manual Configuration (Advanced)
Build Configuration
- Manual Installation (Without install.sh)
Error Troubleshooting
- CUDA Not Found
- hwloc Not Found
Weight Quantization
Before Commit!

Note

Current Support Status:

✅ Native Precision with AVX512/AMX: Supported with AVX512 CPUs in FP8, BF16 and RAWINT4 format - Guide
✅ Intel CPUs with AMX: Fully supported (using weights converted to INT4/INT8 format)
✅ Universal CPU (llamafile backend): Supported (using GGUF-format weights)
✅ AMD CPUs with BLIS: Supported (for int8 prefill & decode) - Guide

KT-CLI

We are developing a simpler way to use KTransformers. Check out the KT-CLI Guide for more details.

Features

CPU-Optimized MoE Kernels: High-throughput MoE expert kernels optimized for instruction sets.
AVX512 Native Precision Backend: FP8 / BF16 / INT4 native MoE backend for AVX512-capable servers.
AMX INT4/INT8 Backend: INT4 / INT8 quantized expert inference backend for AMX-capable servers.
Llamafile CPU Backend: AVX2/AVX512-based MoE backend built on Llamafile for universal CPU deployment.
NUMA-Aware Execution: Thread pool and memory layout designed for multi-socket / multi-NUMA machines.

Installation

Option 1: Install from PyPI (Recommended for Most Users)

Install the latest version with a single command:

bash

pip install kt-kernel

Note: Check the latest version on PyPI

Features:

✅ Automatic CPU detection: Detects your CPU and loads the optimal kernel variant
✅ CPU multi-variant support: Includes AMX, AVX512 (Base/VNNI/VBMI/BF16), and AVX2 variants
✅ CUDA support included: GPU acceleration for NVIDIA GPUs (SM 80, 86, 89, 90)
✅ No compilation needed: Pre-built wheels for Python 3.10, 3.11, 3.12
✅ Static CUDA runtime: No CUDA toolkit installation required
✅ Works on CPU-only systems: CUDA features automatically disabled when GPU not available

Requirements:

Python 3.10, 3.11, or 3.12
Linux x86-64 (manylinux_2_17 compatible)
CPU with AVX2 support (Intel Haswell 2013+, AMD Zen+)
Optional: NVIDIA GPU with compute capability 8.0+ for CUDA features

CUDA Installation (GPU Acceleration)

For NVIDIA GPU-accelerated inference:

bash

pip install kt-kernel

Features:

✅ Multi-architecture support: Single wheel supports SM 80/86/89/90 (Ampere, Ada, Hopper)
✅ Static CUDA runtime: No CUDA toolkit installation required
✅ Broad compatibility: Works with CUDA 11.8+ and 12.x drivers
✅ PyTorch compatible: Works with any PyTorch CUDA variant (cu118, cu121, cu124)

Requirements:

Python 3.10, 3.11, or 3.12
Linux x86-64 (manylinux_2_17 compatible)
NVIDIA GPU with compute capability 8.0+ (Ampere or newer)
- ✅ Supported: A100, RTX 3000/4000 series, H100
- ❌ Not supported: V100, P100, GTX 1000/2000 series (too old)
NVIDIA driver with CUDA 11.8+ or 12.x support (no CUDA toolkit needed)

GPU Compatibility Matrix:

GPU Architecture	Compute Capability	Supported	Example GPUs
Hopper	9.0	✅	H100, H200
Ada Lovelace	8.9	✅	RTX 4090, 4080, 4070
Ampere	8.6	✅	RTX 3090, 3080, 3070, 3060
Ampere	8.0	✅	A100, A30
Turing	7.5	❌	RTX 2080, T4
Volta	7.0	❌	V100

CUDA Driver Compatibility (for GPU features):

CUDA 11.8, 11.9, 12.0-12.6+: Full support
CUDA 11.0-11.7: Not supported (upgrade driver or use CPU-only)

CPU Variants Included:

The wheel includes 6 optimized variants that are automatically selected at runtime based on your CPU:

Variant	CPU Support	Performance	Auto-Selected When
AMX	Intel Sapphire Rapids+ (2023+)	⚡⚡⚡ Best	AMX instructions detected
AVX512+BF16	Ice Lake server, Zen 4+ (2021+)	⚡⚡⚡ Excellent	AVX512 + BF16 detected
AVX512+VBMI	Ice Lake client (2019+)	⚡⚡ Great	AVX512 + VBMI detected
AVX512+VNNI	Cascade Lake+ (2019+)	⚡⚡ Great	AVX512 + VNNI detected
AVX512 Base	Skylake-X+ (2017+)	⚡⚡ Good	AVX512 base detected
AVX2	Haswell+ (2013+), AMD Zen+	⚡ Good	Fallback for maximum compatibility

Verify installation:

python

import kt_kernel

# Check which CPU variant was loaded
print(f"CPU variant: {kt_kernel.__cpu_variant__}")
print(f"Version: {kt_kernel.__version__}")

# Check CUDA support
from kt_kernel import kt_kernel_ext
cpu_infer = kt_kernel_ext.CPUInfer(4)
has_cuda = hasattr(cpu_infer, 'submit_with_cuda_stream')
print(f"CUDA support: {has_cuda}")

print("✓ kt-kernel installed successfully!")

Environment Variables:

bash

# Override automatic CPU detection (for testing or debugging)
export KT_KERNEL_CPU_VARIANT=avx2  # Force specific variant

# Enable debug output to see detection process
export KT_KERNEL_DEBUG=1
python -c "import kt_kernel"

Option 2: Install from Source (For Local Use or Custom Builds)

Build from source for local installation or when you need AMD (BLIS), ARM (KML), or custom CUDA versions.

Prerequisites

First, initialize git submodules and create a conda environment:

bash

git submodule update --init --recursive
conda create -n kt-kernel python=3.11 -y
conda activate kt-kernel

Quick Installation (Recommended)

Simply run the install script - it will auto-detect your CPU and optimize for best performance:

bash

./install.sh

What happens automatically:

Auto-detects CPU capabilities (AMX, AVX512_VNNI, AVX512_BF16)
Installs system dependencies (cmake, libhwloc-dev, pkg-config)
Builds optimized binary for your CPU only (using -march=native)
Software fallbacks: Automatically enabled for CPUs without VNNI/BF16

Optional: Two-step installation

bash

./install.sh deps   # Install dependencies only
./install.sh build  # Build and install kt-kernel

CPU Requirements by Backend:

Backend	Minimum CPU Requirement	Example CPUs	Notes
LLAMAFILE	AVX2	Intel Haswell (2013+), AMD Zen+	Universal compatibility
RAWINT4	AVX512F + AVX512BW	Intel Skylake-X (2017+), Ice Lake, Cascade Lake	Software fallbacks for VNNI/BF16
AMXINT4/INT8	AMX	Intel Sapphire Rapids (2023+)	Best performance, requires AMX hardware
FP8	AVX512F + AVX512BW + AVX512_BF16 + AVX512_VBMI	Intel Cooper Lake (2020+), Sapphire Rapids (2023+); AMD Zen 4+ (e.g., EPYC 9355)	Native Precision (e.g., DeepSeek V3.2, MiniMax M2.1)
BF16	AVX512F + AVX512BW + AVX512_BF16	Intel Cooper Lake (2020+), Sapphire Rapids (2023+); AMD Zen 4+ (e.g., EPYC 9355)	Native Precision (e.g., Qwen3-235B-A22B, GLM-4.7)

Software Fallback Support (AVX512 backends):

✅ VNNI fallback: Uses AVX512BW instructions
✅ BF16 fallback: Uses AVX512F instructions
✅ Older AVX512 CPUs (Skylake-X, Cascade Lake) can run RAWINT4 with fallbacks

⚠️ Portability Note: The default build is optimized for your specific CPU and may not work on different/older CPUs. For portable builds or binary distribution, see Manual Configuration below.

⚠️ AMD BLIS backend users: See installation guide for AMD-specific setup.

Verification

After installation, verify that the CLI is working:

bash

kt version

Expected output:

KTransformers CLI v0.x.x

  Python:        3.11.x
  Platform:      Linux 5.15.0-xxx-generic
  CUDA:          12.x
  kt-kernel:     0.x.x (amx)
  sglang:        0.x.x

You can also verify the Python module directly:

bash

python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"

KT CLI Overview

The kt command-line tool provides a unified interface for running and managing KTransformers models:

Command	Description
`kt run <model>`	Start model inference server with auto-optimized parameters
`kt chat`	Interactive chat with a running model server
`kt model`	Manage models and storage paths
`kt doctor`	Diagnose environment issues and check system compatibility
`kt config`	Manage CLI configuration
`kt version`	Show version information

Quick Start Example:

bash

# Start a model server (auto-detects hardware and applies optimal settings)
kt run m2

# In another terminal, chat with the model
kt chat

# Check system compatibility
kt doctor

Run kt --help for more options, or kt <command> --help for command-specific help.

Integration with SGLang

KT-Kernel can be used standalone via Direct Python API or integrated with SGLang for production deployment. This section describes SGLang integration to enable CPU-GPU heterogeneous inference, where "hot" experts run on GPU and "cold" experts run on CPU for optimal resource utilization.

Installation Steps

1. Install SGLang

Install the kvcache-ai fork of SGLang (required for kt-kernel support):

bash

# Option A: One-click install (from ktransformers root, installs sglang + kt-kernel)
./install.sh

# Option B: pip install
pip install kt-kernel sglang-kt

# Option C: From source (editable mode)
git clone --recursive https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
pip install -e "third_party/sglang/python[all]"

Important: Use sglang-kt (kvcache-ai fork), not the official sglang package. If you have the official version installed, uninstall it first: pip uninstall sglang -y

2. Prepare Weights

You need both GPU weights and CPU-side expert weights for heterogeneous inference. The exact format depends on the backend:

GPU Weights (for all backends):
Use the model weights required by SGLang for GPU inference (for example, the original or already-quantized model directory from Hugging Face).

CPU Weights (AMX backend: AMXINT4 / AMXINT8): Quantize weights to AMX-optimized INT4/INT8 format using the provided script:

bash

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/cpu-weights \
  --quant-method int8  # or int4 or moe_int8 (for amd now)

--input-path: Path to GPU-side original weights
--input-type: Depends on your GPU weights type (fp8, fp16, or bf16)

In SGLang integration, --kt-weight-path should point to this converted CPU weights directory.

Supported input formats: FP8, FP16, BF16 → INT4/INT8.

CPU Weights (LLAMAFILE backend: LLAMAFILE): LLAMAFILE uses pre-quantized GGUF weights on the CPU side directly, without running convert_cpu_weights.py. You need to:

Download a GGUF model directly from the web (e.g., GGUF repos on Hugging Face / Modelscope);
In SGLang integration, use that GGUF directory as --kt-weight-path. KT-Kernel supports multiple GGUF quantization formats such as Q4_KM, Q4_K, Q5_K, etc. Choose based on your latency and accuracy requirements.

3. Launch SGLang Server

Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:

KT-Kernel Parameters to Add:

--kt-method: Backend method (AMXINT4, AMXINT8, or LLAMAFILE)
--kt-weight-path: Path to the converted CPU weights
--kt-cpuinfer: Number of CPU inference threads (set to physical cores)
--kt-threadpool-count: Number of thread pools (set to NUMA node count)
--kt-num-gpu-experts: Number of experts to keep on GPU
--kt-max-deferred-experts-per-token: Deferred experts for pipelined execution

Example:

bash

python -m sglang.launch_server \
  [your normal SGLang parameters...] \
  --kt-method AMXINT8 \
  --kt-weight-path /path/to/cpu-weights \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

See KT-Kernel Parameters section below for detailed parameter tuning guidelines.

Complete Example: Qwen3-30B-A3B

This example demonstrates the full workflow from downloading weights to launching the server, showing Native backend, AMX backend and LLAMAFILE backend options.

Hardware Configuration:

GPU: NVIDIA RTX 4090 24GB
CPU: 2x Intel Xeon Gold 6454S (64 physical cores total, 128 threads, 2 NUMA nodes)
Model: Qwen3-30B-A3B

How to verify your system configuration:

bash

# Check CPU configuration
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)"
# Expected output example:
CPU(s):                                  128
Thread(s) per core:                      2
Socket(s):                               2
NUMA node(s):                            2
# → Physical cores = CPU(s) / Thread(s) per core = 128 / 2 = 64

Parameter Rationale:

--kt-cpuinfer 64: Set to physical cores (64), not hyperthreads (128)
--kt-threadpool-count 2: 2 NUMA nodes detected (dual-socket system)
--kt-num-gpu-experts 32: With 24GB GPU memory, we can fit ~32 experts on GPU for this model (varies by model architecture and actual memory usage)
--kt-max-deferred-experts-per-token 2: Enable pipelined execution; allows CPU to process next batch while GPU completes current batch
--kt-gpu-prefill-token-threshold 2048: Use layerwise prefill strategy when token count exceeds 2048 (for native backends only)

Option A: Native Backend (BF16)

For AVX512 CPUs with BF16 support.

Step 1: Download model weights

bash

# Install huggingface-cli if not already installed
pip install huggingface-hub
# Download model from Hugging Face  
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Launch SGLang server

bash

python -m sglang.launch_server \
    --host 0.0.0.0 \
    --port 30000 \
    --model /mnt/data/models/Qwen3-30B-A3B \
    --kt-weight-path /mnt/data/models/Qwen3-30B-A3B \
    --kt-cpuinfer 64 \
    --kt-threadpool-count 2 \
    --kt-num-gpu-experts 32 \
    --kt-method BF16 \
    --attention-backend flashinfer \
    --trust-remote-code \
    --mem-fraction-static 0.80 \
    --chunked-prefill-size 16384 \
    --max-running-requests 4 \
    --served-model-name Qwen3 \
    --enable-mixed-chunk \
    --tensor-parallel-size 1 \
    --enable-p2p-check \
    --disable-shared-experts-fusion \
    --kt-gpu-prefill-token-threshold 4096 \
    --kt-enable-dynamic-expert-update

Option B: AMX Backend (AMXINT8)

For Intel CPUs with AMX instruction set support.

Step 1: Download model weights

bash

# Install huggingface-cli if not already installed
pip install huggingface-hub

# Download model from Hugging Face
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Convert to CPU weights (AMXINT8)

bash

python scripts/convert_cpu_weights.py \
  --input-path /mnt/data/models/Qwen3-30B-A3B \
  --input-type bf16 \
  --output /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --quant-method int8

Step 3: Launch SGLang server

bash

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method AMXINT8 \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-INT8 \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

Option C: LLAMAFILE Backend (GGUF)

For universal CPUs (no AMX required), using pre-quantized GGUF weights directly.

Step 1: Download GPU weights (original model)

bash

pip install huggingface-hub

huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Download CPU weights (GGUF format)

bash

huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_K_M.gguf \
  --local-dir /mnt/data/models/Qwen3-30B-A3B-Q4_K_M

Step 3: Launch SGLang server

bash

python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model /mnt/data/models/Qwen3-30B-A3B \
  --trust-remote-code \
  --mem-fraction-static 0.92 \
  --chunked-prefill-size 4096 \
  --served-model-name Qwen3-30B-A3B \
  --enable-mixed-chunk \
  --kt-method LLAMAFILE \
  --kt-weight-path /mnt/data/models/Qwen3-30B-A3B-Q4_K_M \
  --kt-cpuinfer 64 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 32 \
  --kt-max-deferred-experts-per-token 2

KT-Kernel Parameters

Parameter	Description	Example Value
`--kt-method`	CPU inference backend method	`AMXINT4`, `AMXINT8`, `RAWINT4`, `FP8`, `FP8_PERCHANNEL`, `BF16` or `LLAMAFILE`
`--kt-weight-path`	Path to quantized CPU weights	`/path/to/cpu-weights`
`--kt-cpuinfer`	Number of CPU inference threads	`64` (adjust based on CPU cores)
`--kt-threadpool-count`	Number of thread pools for parallel execution	`2` (typically 1-4)
`--kt-num-gpu-experts`	Number of experts to keep on GPU	`32` (remaining experts go to CPU)
`--kt-max-deferred-experts-per-token`	Number of experts per token to defer for pipelined execution	`2` (0 to disable, 1-4 recommended)
`--kt-gpu-prefill-token-threshold`	Token count threshold for prefill strategy (native backend only)	~`1024-4096`
`--kt-enable-dynamic-expert-update`	Enable dynamic expert placement updates during prefill based on actual routing statistics	(flag, no value needed)
`--kt-expert-placement-strategy`	Strategy for initial GPU expert placement	`uniform`, `frequency`, `front-loading`, or `random`

Parameter Guidelines:

kt-method: Choose based on your CPU and weight format:
- AMXINT4: Best performance on AMX CPUs with INT4 quantized weights (May cause huge accuracy drop for some models, e.g., Qwen3-30B-A3B)
- AMXINT8: Higher accuracy with INT8 quantized weights on AMX CPUs
- RAWINT4: Native INT4 weights shared by CPU and GPU (currently supports Kimi-K2-Thinking model). See Kimi-K2-Thinking Native Tutorial for details.
- FP8, FP8_PERCHANNEL: FP8 weights shared by CPU and GPU
- BF16: BF16 weights shared by CPU and GPU
- LLAMAFILE: GGUF-based backend
kt-cpuinfer: Set to the number of physical CPU cores (not hyperthreads).
- Check physical cores: lscpu | grep -E "^CPU$s$|Thread$s$ per core"
- Physical cores = CPU(s) / Thread(s) per core
- Example: If CPU(s)=128 and Thread(s) per core=2, then physical cores = 64
- Important: Do NOT set to hyperthread count - this will degrade performance
kt-threadpool-count: Set to the number of NUMA nodes.
- Check NUMA count: lscpu | grep "NUMA node(s)"
- Or use: numactl --hardware | grep "available"
- Note: NUMA node count is NOT necessarily the number of physical CPUs
  - It represents memory domains, which may be divided within a single CPU or across multiple CPUs
  - Use the NUMA node count from lscpu, regardless of physical CPU count
- Typical values: 1-2 for single-socket, 2-4 for dual-socket systems
- This enables better memory bandwidth utilization across NUMA domains
kt-num-gpu-experts: Determine based on GPU memory and profiling:
- More GPU experts = lower latency but higher GPU memory usage (May cause OOM)
kt-max-deferred-experts-per-token: Enables pipelined execution:
- 0: Synchronous execution (simpler, higher latency)
- 1-4: Deferred execution (recommended range; good latency/quality balance, requires tuning)
- 5-7: Highest latency reduction but may introduce noticeable accuracy loss; use with care
kt-gpu-prefill-token-threshold (FP8 and RAWINT4 only): Controls prefill strategy for native FP8 and INT4 inference:
- ≤ threshold: Uses hybrid CPU+GPU prefill. No extra VRAM needed, but performance degrades slowly as token count increases.
- > threshold: Uses layerwise GPU prefill. Performance scales better with longer sequences, but requires one MoE layer extra VRAM (e.g., ~9GB+ for Kimi-K2-Thinking and ~3.6GB for MiniMax-M2.1).
- Only applicable when --kt-method RAWINT4 or --kt-method FP8 is used.
kt-enable-dynamic-expert-update: Enables dynamic expert placement updates during inference.
- During layerwise prefill, the system collects actual routing statistics and redistributes GPU experts accordingly.
- Requires --kt-gpu-prefill-token-threshold to be set, and prefill length must be ≥ the threshold value.
- Particularly effective at lower GPU expert ratios (10%-70%), where it can significantly outperform static strategies.
- See Expert Scheduling Tutorial for benchmarks and details.
kt-expert-placement-strategy: Determines which experts are placed on GPU at server startup.
- uniform: Distributes GPU experts evenly across all MoE layers. Default option, no prior statistics needed.
- frequency: Places the most frequently activated experts on GPU. Best performance when activation statistics are available; requires --init-expert-location pointing to a .pt statistics file.
- front-loading: Fills GPU experts from the first MoE layer onwards.
- random: Randomly selects experts with a fixed seed (42).
- See Expert Scheduling Tutorial for strategy comparison.

Direct Python API Usage

For standalone usage without SGLang, you can use KT-Kernel directly via Python API:

python

from kt_kernel import KTMoEWrapper

# Initialize the MoE wrapper
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4"  # Options: "AMXINT4", "AMXINT8", "LLAMAFILE"
)

# Load weights (from disk - pre-quantized)
wrapper.load_weights(physical_to_logical_map)

# Or load weights from tensors (online quantization)
wrapper.load_weights_from_tensors(gate_proj, up_proj, down_proj, physical_to_logical_map)

# Run inference
output = wrapper.forward(hidden_states, topk_ids, topk_weights, cuda_stream)

# Or use async API for better performance
wrapper.submit_forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# ... do other work ...
output = wrapper.sync_forward(hidden_states, cuda_stream)

Advanced Options

python

# Initialize with additional options
wrapper = KTMoEWrapper(
    layer_idx=0,
    num_experts=8,
    num_experts_per_tok=2,
    hidden_size=4096,
    moe_intermediate_size=14336,
    num_gpu_experts=2,
    cpuinfer_threads=32,
    threadpool_count=2,
    weight_path="/path/to/weights",
    chunked_prefill_size=512,
    method="AMXINT4",
    cpu_save=False,  # Keep weights in CPU memory after loading
    max_deferred_experts_per_token=0  # Number of experts to defer (for pipelined execution)
)

# Pre-allocate buffers for specific batch sizes (improves performance)
KTMoEWrapper.set_capture_batch_sizes([1, 2, 4, 8, 16])

# Query captured batch sizes
batch_sizes = KTMoEWrapper.get_capture_batch_sizes()

# Clear buffer cache to free memory
KTMoEWrapper.clear_buffer_cache()

Manual Configuration (Advanced)

For portable builds, binary distribution, or cross-machine deployment, you need to manually specify target instruction sets:

bash

# General distribution (works on any AVX512 CPU from 2017+)
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual

# Maximum compatibility (works on any CPU from 2013+)
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual

# Modern CPUs only (Ice Lake+, Zen 4+)
export CPUINFER_CPU_INSTRUCT=FANCY
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual

Optional: Override VNNI/BF16 detection

bash

# Force enable/disable VNNI and BF16 (for testing fallbacks)
export CPUINFER_ENABLE_AVX512_VNNI=OFF
export CPUINFER_ENABLE_AVX512_BF16=OFF
./install.sh

See ./install.sh --help for all available options.

Build Configuration

Manual Installation (Without install.sh)

If you prefer manual installation without the install.sh script:

1. Install System Dependencies

Prerequisites:

cmake (recommended: conda install -y cmake)
libhwloc-dev and pkg-config

2. Set Build Configuration

Core Options:

Variable	Options	Description
`CPUINFER_CPU_INSTRUCT`	`NATIVE`, `AVX512`, `AVX2`, `FANCY`	CPU instruction set to use
`CPUINFER_ENABLE_AMX`	`ON`, `OFF`	Enable Intel AMX support
`CPUINFER_BUILD_TYPE`	`Release`, `Debug`, `RelWithDebInfo`	Build type (default: `Release`)
`CPUINFER_PARALLEL`	Number	Parallel build jobs (default: auto-detect)
`CPUINFER_VERBOSE`	`0`, `1`	Verbose build output (default: `0`)

Instruction Set Details:

Option	Target CPUs	Use Case
`NATIVE`	Your specific CPU only	Local builds (best performance, default)
`AVX512`	Skylake-X, Ice Lake, Cascade Lake, Zen 4+	General distribution
`AVX2`	Haswell (2013) and newer	Maximum compatibility
`FANCY`	Ice Lake+, Zen 4+	Modern CPUs with full AVX512 extensions

Example Configurations:

bash

# Local use - maximum performance (default behavior)
export CPUINFER_CPU_INSTRUCT=NATIVE
export CPUINFER_ENABLE_AMX=ON  # or OFF

# Distribution build - works on any AVX512 CPU
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF

# Maximum compatibility - works on CPUs since 2013
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF

# Debug build
export CPUINFER_BUILD_TYPE=Debug
export CPUINFER_VERBOSE=1

3. Build and Install

bash

# Editable installation (for development)
pip install -e .

# Standard installation
pip install .

Error Troubleshooting

CUDA Not Found

 -- Looking for a CUDA compiler - NOTFOUND
  CMake Error at CMakeLists.txt:389 (message):
    KTRANSFORMERS_USE_CUDA=ON but CUDA compiler not found

Make sure you have the CUDA toolkit installed and nvcc is in your system PATH.

Try export CMAKE_ARGS="-D CMAKE_CUDA_COMPILER=$(which nvcc)" and reinstall again.

hwloc Not Found

Run sudo apt install libhwloc-dev if on a Debian-based system or build from source: https://www.open-mpi.org/projects/hwloc/.

wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.2.tar.gz
tar -xzf hwloc-2.12.2.tar.gz
cd hwloc-2.12.2
./configure
make
sudo make install

Weight Quantization

For AMX backends (AMXINT4 / AMXINT8), CPU-side experts must be converted to AMX-friendly INT4/INT8 format using the provided script:

bash

python scripts/convert_cpu_weights.py \
  --input-path /path/to/model \
  --input-type bf16 \
  --output /path/to/output \
  --quant-method int4

Supported formats: FP8, FP16, BF16 → INT4/INT8

For LLAMAFILE backend (LLAMAFILE), CPU-side experts are loaded directly from GGUF weights. You do not need to run the AMX conversion script; instead, download a GGUF model from the web (e.g., a GGUF repo on Hugging Face) and point weight_path / SGLang --kt-weight-path (or --model when appropriate) to that GGUF directory. KT-Kernel supports multiple GGUF quantization types such as Q4_KM, Q4_K, Q5_K, etc.

For detailed documentation, advanced options, and low-memory mode, see scripts/README.md.

Before Commit!

Commit messages should follow the Conventional Commits specification: https://www.conventionalcommits.org/

Please format your code before committing:

shell

cmake -B build
cd build
make format

You may need a newer clang-format (at least version 18). In a conda environment:

shell

conda install -c conda-forge clang-format=18
rm -rf build

It's also recommended to install black for Python code formatting:

shell

conda install black