# SGLang
SGLang is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.
Set `model_impl="transformers"` to load a model with the Transformers modeling backend.

```python
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
```
Pass `--model-impl transformers` to the `sglang.launch_server` command for online serving.

```shell
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --host 0.0.0.0 \
  --port 30000
```
Setting `model_impl="transformers"` tells SGLang to skip its native model matching and use the Transformers model directly.

1. [`PreTrainedConfig.from_pretrained`] loads the model's `config.json` from the Hub or your Hugging Face cache.
2. [`AutoModel.from_config`] resolves the model class based on the config.
3. `_attn_implementation` is set to `"sglang"`, which routes attention calls through SGLang's RadixAttention kernels.
4. The model benefits from all SGLang optimizations while using the Transformers model structure.
> [!WARNING]
> Compatible models require `_supports_attention_backend=True` so SGLang can control attention execution. See the Building a compatible model backend for inference guide for details.