This document describes how to set up the SGLang environment and run LLM inference on Intel GPUs; see the PyTorch documentation for more context on Intel GPU (XPU) support within the PyTorch ecosystem.
Specifically, SGLang is optimized for Intel® Arc™ Pro B-Series Graphics and Intel® Arc™ B-Series Graphics.
The following LLMs have been optimized on Intel GPU, with more on the way:
<table style={{width: "100%", borderCollapse: "collapse", tableLayout: "fixed"}}> <colgroup> <col style={{width: "50%"}} /> <col style={{width: "50%"}} /> </colgroup> <thead> <tr style={{borderBottom: "2px solid #d55816"}}> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.02)"}}>Model Name</th> <th style={{textAlign: "left", padding: "10px 12px", fontWeight: 700, whiteSpace: "nowrap", backgroundColor: "rgba(255,255,255,0.05)"}}>BF16 Model ID</th> </tr> </thead> <tbody> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.2-3B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Llama-3.1-8B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)</td> </tr> <tr> <td style={{padding: "9px 12px", fontWeight: 500, backgroundColor: "rgba(255,255,255,0.02)"}}>Qwen2.5-1.5B</td> <td style={{padding: "9px 12px", backgroundColor: "rgba(255,255,255,0.05)"}}>[Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B)</td> </tr> </tbody> </table>

Note: The model identifiers listed in the table above have been verified on Intel® Arc™ B580 Graphics.
Currently, SGLang on XPU only supports installation from source. Please refer to "Getting Started on Intel GPU" to install the XPU dependencies.
```bash
# Create and activate a conda environment
conda create -n sgl-xpu python=3.12 -y
conda activate sgl-xpu

# Set PyTorch XPU as the primary pip index to avoid installing the larger
# CUDA-enabled build and prevent potential runtime issues.
pip3 install torch==2.11.0+xpu torchao torchvision torchaudio==2.11.0+xpu --index-url https://download.pytorch.org/whl/xpu
pip3 install xgrammar --no-deps  # xgrammar pulls in a CUDA-enabled triton, which may conflict with XPU

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use the dedicated XPU toml file
cd python
cp pyproject_xpu.toml pyproject.toml

# Install SGLang's dependent libs, and build the SGLang main package
pip install --upgrade pip setuptools
pip install -v . --extra-index-url https://download.pytorch.org/whl/xpu
```
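Before building on top of this environment, it can be worth confirming that pip resolved the XPU wheel rather than a CUDA build. A minimal sketch of such a check (run inside the `sgl-xpu` environment; `torch.xpu.is_available()` reports whether an Intel GPU is visible):

```python
# Sanity check: confirm the XPU build of PyTorch is picked up (a sketch;
# guarded so it also runs where torch is absent).
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is not installed in this environment")
else:
    import torch
    print(torch.__version__)  # an XPU build carries a "+xpu" suffix
    # True only when an Intel GPU is visible to the XPU runtime
    print(hasattr(torch, "xpu") and torch.xpu.is_available())
```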
An SGLang XPU Dockerfile is provided to facilitate installation. Replace `<secret>` below with your Hugging Face access token.
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-xpu:latest -f xpu.Dockerfile .

# Start a docker container
docker run \
  -it \
  --privileged \
  --ipc=host \
  --network=host \
  --group-add $(getent group video | cut -d: -f3) \
  --device /dev/dri \
  -v /dev/dri/by-path:/dev/dri/by-path \
  -v /dev/shm:/dev/shm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 30000:30000 \
  -e "HF_TOKEN=<secret>" \
  sglang-xpu:latest /bin/bash
```
Example command to launch SGLang serving:
```bash
sglang serve \
  --model-path <MODEL_ID_OR_PATH> \
  --trust-remote-code \
  --disable-overlap-schedule \
  --device xpu \
  --host 0.0.0.0 \
  --tp 2 \
  --attention-backend intel_xpu \
  --page-size <PAGE_SIZE>
```

Here `--tp 2` shards the model across two GPUs, `--attention-backend intel_xpu` selects the Intel-optimized XPU attention backend, and with that backend `--page-size` supports 32, 64, or 128.
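Once the server is up, a quick liveness probe confirms it is reachable. A minimal sketch, assuming the default port 30000 and SGLang's `/health` endpoint (the probe prints a diagnostic instead of failing when the server is down):

```python
# Probe the SGLang server's /health endpoint (assumes default port 30000).
from urllib import error, request

try:
    with request.urlopen("http://localhost:30000/health", timeout=5) as resp:
        print("server up, HTTP", resp.status)
except (error.URLError, OSError) as exc:
    print("server not reachable:", exc)
```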
You can benchmark performance via the bench_serving script. Run the following command in another terminal:
```bash
python -m sglang.bench_serving \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 1 \
  --request-rate inf \
  --random-range-ratio 1.0
```
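As a rough sanity check on the numbers the benchmark reports, output token throughput is simply the generated tokens divided by the wall-clock duration. An illustrative calculation with the settings above and a made-up duration:

```python
# Back-of-the-envelope throughput for a bench_serving run.
num_prompts = 1
output_len = 1024     # tokens generated per request (--random-output-len)
duration_s = 32.0     # hypothetical end-to-end benchmark duration in seconds

output_throughput = num_prompts * output_len / duration_s
print(output_throughput)  # tokens/s -> 32.0
```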
Detailed explanations of the parameters can be found with:

```bash
python -m sglang.bench_serving -h
```
Additionally, requests can be formed with the OpenAI Completions API and sent via the command line (e.g. using curl) or from your own script.
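As a sketch of forming such a request from Python using only the standard library (assumes the server above on its default port; `build_completion_request` is a hypothetical helper, not part of SGLang):

```python
import json
from urllib import request

# Hypothetical helper: form an OpenAI-style Completions request for the
# locally served model (default SGLang port 30000 assumed).
def build_completion_request(prompt, model, max_tokens=32):
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return request.Request(
        "http://localhost:30000/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_completion_request("The capital of France is", "<MODEL_ID_OR_PATH>")
print(req.get_full_url())  # http://localhost:30000/v1/completions
# With the server running: json.load(request.urlopen(req))["choices"][0]["text"]
```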