# sglang-kernel

Kernel library for LLM inference engines.

sglang-kernel provides optimized compute primitives for LLM inference engines, enabling efficient inference for large language models and vision-language models through custom kernel operations. The source tree remains under the `sgl-kernel/` directory and the Python import path remains `sgl_kernel`.
## Installation

Requires `torch == 2.11.0`.
```bash
# Latest version
pip3 install sglang-kernel --upgrade
```
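To confirm the install, note that the wheel is named `sglang-kernel` while the module is `sgl_kernel` (as stated above):

```bash
# Import check: the module name differs from the package name
python -c "import sgl_kernel; print(sgl_kernel.__file__)"
```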
## Build from source

```bash
make build
```
By default, `make build` uses all available CPU cores. You can override the build parallelism and the number of NVCC compile threads:

```bash
# Limit parallel jobs (controls both make and cmake parallelism)
make build MAX_JOBS=2

# Additionally limit NVCC internal threads (reduces CPU and peak memory usage)
make build MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```
## Development tips

Register the operator schema with `m.def`, and bind the device implementation with `m.impl`. For how to write a schema, see the Schema reference:
```cpp
// We need def with schema here for torch.compile
m.def(
    "bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, "
    "int cublas_handle) -> ()");
m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
```
Third-party C++ libraries often use `int` and `float` in their signatures, but PyTorch bindings require `int64_t` and `double` due to Python's type mapping. Use `make_pytorch_shim` from `sgl_kernel_torch_shim.h` to handle the conversions automatically:
```cpp
// Add type conversion for int -> int64_t
template <>
struct pytorch_library_compatible_type<int> {
  using type = int64_t;
  static int convert_from_type(int64_t arg) {
    TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "value too large");
    TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "value too small");
    return arg;
  }
};
```
```cpp
// Wrap your function
m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
```
## Testing & Benchmarking

Add pytest tests in `tests/`. If a test requires a specific GPU architecture, skip it gracefully with `@pytest.mark.skipif`:

```python
@pytest.mark.skipif(
    skip_condition, reason="Nvfp4 Requires compute capability of 10 or above."
)
```
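`skip_condition` is defined by the test itself. One way to express the condition named in the reason string (an assumption, not code from the repo):

```python
import torch

# True when no GPU is present or its compute capability is below 10.0
skip_condition = (
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (10, 0)
)
```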
Add benchmarks using the Triton benchmark utilities in `benchmark/`. We recommend `triton.testing.do_bench_cudagraph` for kernel benchmarking. Compared to `triton.testing.do_bench`, `do_bench_cudagraph`:

- times kernel replays from a captured CUDA graph, so per-launch CPU overhead does not pollute the measurement;
- better reflects steady-state performance in inference engines that run with CUDA graphs enabled.
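A minimal benchmark sketch (the matmul here is a stand-in for an `sgl_kernel` op; shapes and dtype are illustrative assumptions):

```python
import torch
from triton.testing import do_bench_cudagraph

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

# The callable is captured into a CUDA graph once, then timed over replays,
# so Python and launch overhead are excluded from the reported milliseconds.
ms = do_bench_cudagraph(lambda: a @ b)
print(f"fp16 matmul: {ms:.3f} ms")
```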
Run the test suite before submitting a change; a typical invocation is sketched below.
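This assumes pytest-based tests live under `sgl-kernel/tests/` (adjust to your checkout):

```bash
cd sgl-kernel
pytest tests/
```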
## Kernel size analysis

Analyze CUDA kernel sizes in compiled wheel files to identify oversized kernels and template-instantiation bloat. The analysis script depends on `cubloaty`:
```bash
# Install cubloaty
pip install cubloaty

# Analyze a wheel file
python analyze_whl_kernel_sizes.py path/to/sglang_kernel-*.whl

# Custom output file
python analyze_whl_kernel_sizes.py path/to/sglang_kernel-*.whl --output my_analysis.txt
```
The tool writes a report of per-kernel compiled sizes, to the default output file or to the path given with `--output`. Use it to identify large kernels and potential template-instantiation bloat.