Before starting, ensure the Jetson is set to its maximum-performance power mode (MAXN):

```shell
sudo nvpmodel -m 0
```
Clone the jetson-containers GitHub repository:

```shell
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:

```shell
bash jetson-containers/install.sh
```
Build the container image:

```shell
jetson-containers build sglang
```
Run the container:

```shell
jetson-containers run $(autotag sglang)
```
Alternatively, run a container manually, replacing `IMAGE_NAME` with the image built above:

```shell
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
Launch the server:

```shell
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --dtype half \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192
```
Half precision and the limited context length (`--dtype half --context-length 8192`) accommodate the limited memory and compute of the NVIDIA Jetson kit. A detailed explanation can be found in Server Arguments.
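To see why the context length is capped, a back-of-the-envelope KV-cache estimate helps. The architecture values below are assumptions taken from the Llama-3.1-8B base of this distill (32 layers, 8 KV heads, head dimension 128), not from SGLang itself:

```python
# Rough KV-cache footprint for an 8192-token context at FP16.
# Assumed architecture values (Llama-3.1-8B base of the distill):
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2      # fp16, matching --dtype half
context_length = 8192

# Key + value tensors per token, summed over all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
kv_cache_gib = bytes_per_token * context_length / 1024**3

print(f"{bytes_per_token // 1024} KiB per token")   # → 128 KiB
print(f"{kv_cache_gib:.2f} GiB for the full context")  # → 1.00 GiB
```

Each additional 8K tokens of context therefore costs on the order of a gibibyte of GPU memory on top of the model weights, which is significant on a shared-memory Jetson board.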
After launching the engine, refer to Chat Completions to test usability.
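As a quick smoke test, you can send a request to the OpenAI-compatible endpoint the server exposes. This sketch assumes SGLang's default port 30000; adjust the URL if you passed `--port` to `launch_server`:

```python
import json
from urllib import request

# Build a minimal chat-completions request for the local server.
# Port 30000 is SGLang's default; change it if you overrode --port.
url = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [{"role": "user", "content": "What is an NVIDIA Jetson?"}],
    "max_tokens": 128,
}

req = request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```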
TorchAO is recommended for the NVIDIA Jetson Orin.
```shell
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --device cuda \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192 \
  --torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128. The `--torchao-config int4wo-128` option is likewise chosen for memory efficiency.
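A rough estimate shows why int4 weight-only quantization matters on a memory-constrained board. The numbers below are approximations (parameter count taken as ~8 billion, one bf16 scale and zero point assumed per 128-weight group), not measured values:

```python
# Approximate weight memory: bf16 vs int4 weight-only, group size 128.
params = 8e9                              # assumed ~8B parameters

bf16_gib = params * 2 / 1024**3           # 2 bytes per weight
int4_gib = params * 0.5 / 1024**3         # 4 bits per weight
# Assumed quantization metadata: ~4 bytes (scale + zero point)
# per group of 128 weights.
scales_gib = params / 128 * 4 / 1024**3

print(f"bf16 weights : {bf16_gib:.1f} GiB")            # ~14.9 GiB
print(f"int4wo-128   : {int4_gib + scales_gib:.1f} GiB")  # ~4.0 GiB
```

Cutting the weights from roughly 15 GiB to roughly 4 GiB leaves room for the KV cache and runtime buffers within the Orin's shared CPU/GPU memory.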
Please refer to the SGLang Structured Outputs documentation.
Thanks to the support from Nurgaliyev Shakhizat, Dustin Franklin, and Johnny Núñez Cano.