docs_new/docs/developer_guide/bench_serving.mdx
This guide explains how to benchmark online serving throughput and latency using python -m sglang.bench_serving. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.
sglang / sglang-native: POST /generate
sglang-oai, vllm, lmdeploy: POST /v1/completions
sglang-oai-chat, vllm-chat, lmdeploy-chat: POST /v1/chat/completions
trt (TensorRT-LLM): POST /v2/models/ensemble/generate_stream
gserver: custom server (not implemented yet in this script)
truss: POST /v1/models/model:predict
If --base-url is provided, requests are sent to it. Otherwise, --host and --port are used. When --model is not provided, the script attempts to query GET /v1/models for an available model ID (OpenAI-compatible endpoints).
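The backend-to-endpoint mapping and the --base-url versus --host/--port rule above can be sketched in a few lines. This is an illustrative sketch, not the script's actual code: the names ENDPOINTS and resolve_url are hypothetical, and the default host and port are assumptions taken from the examples in this guide.

```python
from typing import Optional

# Endpoint paths copied from the list above. The dict name is hypothetical.
ENDPOINTS = {
    "sglang": "/generate",
    "sglang-native": "/generate",
    "sglang-oai": "/v1/completions",
    "vllm": "/v1/completions",
    "lmdeploy": "/v1/completions",
    "sglang-oai-chat": "/v1/chat/completions",
    "vllm-chat": "/v1/chat/completions",
    "lmdeploy-chat": "/v1/chat/completions",
    "trt": "/v2/models/ensemble/generate_stream",
    "truss": "/v1/models/model:predict",
}


def resolve_url(
    backend: str,
    base_url: Optional[str] = None,
    host: str = "127.0.0.1",  # assumed default, matches the examples
    port: int = 30000,  # assumed default, matches the examples
) -> str:
    """Return the full request URL for a backend (hypothetical helper)."""
    if backend not in ENDPOINTS:
        raise ValueError(f"unsupported backend: {backend}")
    # A base URL, when given, takes precedence over host and port.
    root = base_url.rstrip("/") if base_url else f"http://{host}:{port}"
    return root + ENDPOINTS[backend]
```

For example, resolve_url("vllm", base_url="http://127.0.0.1:8000") targets the completions route on port 8000, while resolve_url("sglang") falls back to the host and port defaults.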
Dependencies: aiohttp, numpy, requests, tqdm, transformers, and, for some datasets, datasets, pillow, and pybase64. Install as needed.
Optional API key: OPENAI_API_KEY (sent as Authorization: Bearer <key>)
Run a basic benchmark against an sglang server exposing /generate:
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--num-prompts 1000 \
--model meta-llama/Llama-3.1-8B-Instruct
Or, using an OpenAI-compatible endpoint (completions):
python3 -m sglang.bench_serving \
--backend vllm \
--base-url http://127.0.0.1:8000 \
--num-prompts 1000 \
--model meta-llama/Llama-3.1-8B-Instruct
Select with --dataset-name:
sharegpt (default): loads ShareGPT-style pairs; optionally restrict with --sharegpt-context-len and override outputs with --sharegpt-output-len
random: random text lengths; sampled from ShareGPT token space
random-ids: random token ids (can lead to gibberish)
image: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
generated-shared-prefix: synthetic dataset with shared long system prompts and short questions
mmmu: samples from MMMU (Math split) and includes images
speed-bench: SPEED-Bench (SPEculative Evaluation Dataset), a unified benchmark for evaluating Speculative Decoding (SD) algorithms. Uses the Throughput split, which provides fixed-length input sequences (1K–32K tokens) grouped into three output-entropy categories (low_entropy, mixed, high_entropy). Requires a pre-downloaded JSONL file passed via --dataset-path.
Common dataset flags:
--num-prompts N: number of requests
--random-input-len, --random-output-len, --random-range-ratio: for random/random-ids/image
--image-count: Number of images per request (for image dataset).
--apply-chat-template: apply tokenizer chat template when constructing prompts
--dataset-path PATH: path to the ShareGPT JSON file; if the flag is left blank and no local copy is found, the dataset is downloaded and cached
Generated Shared Prefix flags (for generated-shared-prefix):
--gsp-num-groups
--gsp-prompts-per-group
--gsp-system-prompt-len
--gsp-question-len
--gsp-output-len
--gsp-group-distribution {uniform,zipf}: per-request prefix-group sampling distribution (default: uniform). With zipf, each request's group is sampled by rank with p(rank) = (1/rank**alpha) / sum_k(1/k**alpha); rank starts at 1 and group index 0 is the hottest. The on-disk dataset cache uses a distinct key per (group_distribution, zipf_alpha), so uniform-mode caches are never mixed with zipf-mode caches.
--gsp-zipf-alpha FLOAT: Zipf exponent for --gsp-group-distribution=zipf. Must be a finite float strictly greater than 0; larger values concentrate requests on lower-ranked (hotter) groups. Required when the distribution is zipf; must be omitted otherwise.
Image dataset flags (for image):
--image-count: Number of images per request
--image-resolution: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
--image-format: Image format (jpeg or png)
--image-content: Image content type (random or blank)
SPEED-Bench flags (for speed-bench):
--dataset-path PATH: path to the pre-downloaded SPEED-Bench Throughput JSONL (e.g., throughput_1k.jsonl). Use the SPEED-Bench measurement framework to generate it.
--speed-bench-category: filter to one entropy category: low_entropy, mixed, or high_entropy (default: all)
--speed-bench-output-len: fixed number of output tokens per request (default: 512)
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name image \
--num-prompts 500 \
--image-count 3 \
--image-resolution 720p \
--random-input-len 512 \
--random-output-len 512
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 3000 \
--random-input-len 1024 \
--random-output-len 1024 \
--random-range-ratio 0.5
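To make the effect of --random-range-ratio concrete, here is a minimal sketch of one plausible interpretation: each length is drawn uniformly from [ratio * len, len]. That sampling rule and the helper name sample_lengths are assumptions for illustration; the script's exact sampling may differ.

```python
import random
from typing import List


def sample_lengths(target_len: int, range_ratio: float, n: int, seed: int = 0) -> List[int]:
    """Draw n lengths uniformly from [target_len * range_ratio, target_len].

    Hypothetical helper: the uniform rule is an assumption, not taken
    from the benchmark script.
    """
    if not 0.0 <= range_ratio <= 1.0:
        raise ValueError("range_ratio must be within [0, 1]")
    rng = random.Random(seed)
    low = max(int(target_len * range_ratio), 1)  # never sample a zero length
    # random.Random.randint includes both endpoints.
    return [rng.randint(low, target_len) for _ in range(n)]
```

Under this assumption, --random-input-len 1024 with --random-range-ratio 0.5 yields prompts between 512 and 1024 tokens, and a ratio of 1.0 fixes every prompt at 1024 tokens.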
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
--speculative-algorithm EAGLE --speculative-draft-model-path <draft-model-path>
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name speed-bench \
--dataset-path /path/to/throughput_1k.jsonl \
--speed-bench-category mixed \
--speed-bench-output-len 512 \
--num-prompts 512
--model is required unless the backend exposes GET /v1/models, in which case the first model ID is auto-selected.
--tokenizer defaults to --model. Both can be HF model IDs or local paths.
SGLANG_USE_MODELSCOPE=true enables fetching via ModelScope (weights are skipped for speed).
--request-rate: requests per second. inf sends all immediately (burst). Non-infinite rate uses a Poisson process for arrival times.
--max-concurrency: caps concurrent in-flight requests regardless of arrival rate.
--disable-stream: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
--output-file FILE.jsonl: append JSONL results to file; auto-named if unspecified
--output-details: include per-request arrays (generated texts, errors, ttfts, itls, input/output lens)
--extra-request-body '{"top_p":0.9,"temperature":0.6}': merged into payload (sampling params, etc.)
--disable-ignore-eos: pass through EOS behavior (varies by backend)
--warmup-requests N: run warmup requests with short output first (default 1)
--flush-cache: call /flush_cache (sglang) before main run
--profile: call /start_profile and /stop_profile (requires server to enable profiling, e.g., SGLANG_TORCH_PROFILER_DIR)
--lora-name name1 name2 ...: randomly pick one per request and pass to backend (e.g., lora_path for sglang)
--tokenize-prompt: send integer IDs instead of text (currently supports --backend sglang only)
If your target endpoint requires OpenAI-style auth, set:
export OPENAI_API_KEY=sk-...yourkey...
The script will add Authorization: Bearer $OPENAI_API_KEY automatically for OpenAI-compatible routes.
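The --request-rate option uses a Poisson process for finite rates, which means the gaps between consecutive requests are exponentially distributed with mean 1/rate. The sketch below illustrates that idea only; arrival_offsets is a hypothetical helper and is not taken from the script.

```python
import math
import random
from typing import List


def arrival_offsets(num_requests: int, request_rate: float, seed: int = 0) -> List[float]:
    """Return send times in seconds relative to the start of the run.

    Hypothetical helper illustrating Poisson arrivals; an infinite rate
    sends every request at time zero (burst mode).
    """
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    if math.isinf(request_rate):
        return [0.0] * num_requests
    rng = random.Random(seed)
    offsets: List[float] = []
    now = 0.0
    for _ in range(num_requests):
        offsets.append(now)
        # Exponential inter-arrival gap with mean 1 / request_rate.
        now += rng.expovariate(request_rate)
    return offsets
```

At a rate of 100 requests per second the average gap is about 10 ms, but individual gaps vary, so short bursts above the nominal rate are expected. A cap such as --max-concurrency bounds how many of those requests are in flight at once.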
Printed after each run:
Time per output token after the first token is computed as (latency - ttft)/(tokens-1).
The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
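As a worked example of the (latency - ttft)/(tokens-1) expression, the sketch below computes it for a single request. The function name and the choice to return 0.0 when there are fewer than two output tokens are assumptions for illustration; the script may handle that edge case differently.

```python
def time_per_output_token(latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Average time per token after the first one, in seconds.

    Hypothetical helper. With fewer than two output tokens the expression
    is undefined, so this sketch returns 0.0 (an assumption).
    """
    if output_tokens <= 1:
        return 0.0
    return (latency_s - ttft_s) / (output_tokens - 1)


# 2.5 s total latency, 0.5 s to first token, 101 output tokens:
# (2.5 - 0.5) / (101 - 1) = 0.02 s, i.e. 20 ms per token.
example_tpot = time_per_output_token(2.5, 0.5, 101)
```

The first token is excluded because its cost is dominated by prefill, which TTFT already captures.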
When --output-file is set, one JSON object is appended per run. Base fields:
accept_length when available (sglang)
With --output-details, an extended object also includes arrays:
input_lens, output_lens
ttfts, itls (per request: ITL arrays)
generated_texts, errors
Native sglang /generate (streaming):
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
--num-prompts 2000 \
--request-rate 100 \
--max-concurrency 512 \
--output-file sglang_random.jsonl --output-details
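A file written with --output-file holds one JSON object per run, so it can be read line by line. The sketch below is a hedged example of post-processing such a file: load_runs and median_ttft are hypothetical helpers, and the only field it relies on is the ttfts array that --output-details adds.

```python
import json
import statistics
from pathlib import Path
from typing import Dict, List, Optional


def load_runs(path: str) -> List[Dict]:
    """Parse a JSONL results file into a list of run objects."""
    runs: List[Dict] = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between appended runs
                runs.append(json.loads(line))
    return runs


def median_ttft(run: Dict) -> Optional[float]:
    """Median of the per-request ttfts array, or None if it is absent or empty."""
    ttfts = run.get("ttfts")
    if not ttfts:
        return None
    return statistics.median(ttfts)
```

Because results are appended, a file reused across several invocations contains several runs; load_runs("sglang_random.jsonl")[-1] would then be the most recent one.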
python3 -m sglang.bench_serving \
--backend vllm \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000 \
--sharegpt-output-len 256
python3 -m sglang.bench_serving \
--backend vllm-chat \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--num-prompts 500 \
--apply-chat-template
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 --random-output-len 256 \
--num-prompts 200 \
--apply-chat-template
4a) Images with custom resolution:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 512x768 \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
4b) 1080p images with PNG format and blank content:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 1080p \
--image-format png \
--image-content blank \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
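The --image-resolution values used in these examples are either a preset name or a custom 'heightxwidth' string. The sketch below shows one way such a value could be parsed. It is not the script's parser: parse_resolution is a hypothetical name, and the pixel sizes assigned to each preset are assumed from the conventional definitions of those labels.

```python
from typing import Dict, Tuple

# Assumed (height, width) per preset; the script's actual values may differ.
PRESETS: Dict[str, Tuple[int, int]] = {
    "4k": (2160, 3840),
    "1080p": (1080, 1920),
    "720p": (720, 1280),
    "360p": (360, 640),
}


def parse_resolution(value: str) -> Tuple[int, int]:
    """Return (height, width) for a preset or a 'heightxwidth' string."""
    key = value.strip().lower()
    if key in PRESETS:
        return PRESETS[key]
    height_text, sep, width_text = key.partition("x")
    if not sep or not height_text.isdigit() or not width_text.isdigit():
        raise ValueError(f"invalid resolution: {value!r}")
    height, width = int(height_text), int(width_text)
    if height == 0 or width == 0:
        raise ValueError("height and width must be positive")
    return height, width
```

Note the ordering: a custom value such as 512x768 is read as height 512 and width 768, following the 'heightxwidth' format described for the flag.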
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name generated-shared-prefix \
--gsp-num-groups 64 --gsp-prompts-per-group 16 \
--gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
--num-prompts 1024
Zipfian / power-law prefix popularity (opt-in via --gsp-group-distribution=zipf):
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name generated-shared-prefix \
--gsp-num-groups 64 --gsp-prompts-per-group 16 \
--gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
--gsp-group-distribution zipf --gsp-zipf-alpha 1.2 \
--seed 42
zipf mode samples each request's prefix group from the rank-based distribution p(rank) = (1/rank**alpha) / sum_k(1/k**alpha) with rank starting at 1, so group index 0 is the hottest. The total request count stays num_groups * prompts_per_group — identical to uniform mode — and only the per-request group assignment changes. alpha must be a finite float strictly greater than 0; larger values concentrate requests on lower-ranked (hotter) groups.
The on-disk dataset cache at ~/.cache/sglang/benchmark/gen_shared_prefix_*.pkl includes group_distribution and zipf_alpha in its key, so uniform-mode and zipf-mode runs (or two zipf runs with different alpha) never share a cache file. Uniform-mode filenames are unchanged from the legacy format, so existing caches remain valid.
This flag controls prefix-popularity shape only. It does not by itself reproduce any production trace or guarantee an observed cache-hit rate for a given engine.
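The rank-based formula p(rank) = (1/rank**alpha) / sum_k(1/k**alpha) can be checked numerically with a short sketch. The helpers zipf_probabilities and sample_groups are hypothetical and only illustrate the distribution; they make no claim about how the script draws its samples or seeds its generator.

```python
import math
import random
from typing import List


def zipf_probabilities(num_groups: int, alpha: float) -> List[float]:
    """Return p(rank) for rank 1..num_groups; index 0 is the hottest group."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    if not math.isfinite(alpha) or alpha <= 0:
        raise ValueError("alpha must be a finite float strictly greater than 0")
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_groups + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_groups(num_requests: int, num_groups: int, alpha: float, seed: int = 42) -> List[int]:
    """Assign each request a group index drawn from the Zipf distribution."""
    rng = random.Random(seed)
    probs = zipf_probabilities(num_groups, alpha)
    return rng.choices(range(num_groups), weights=probs, k=num_requests)
```

Calling zipf_probabilities(64, 1.2) shows how quickly the mass concentrates on the first few groups, and raising alpha steepens that curve further. The number of sampled requests is unchanged by alpha; only the group assignment shifts.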
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--tokenize-prompt \
--random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--profile \
--flush-cache
python3 -m sglang.bench_serving \
--backend trt \
--base-url http://127.0.0.1:8000 \
--model your-trt-llm-model \
--dataset-name random \
--num-prompts 100 \
--disable-ignore-eos
# --mooncake-workload accepts one of: conversation, mooncake, agent, synthetic
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model model-name \
--dataset-name mooncake \
--mooncake-slowdown-factor 1.0 \
--mooncake-num-rounds 1000 \
--mooncake-workload conversation \
--use-trace-timestamps true \
--random-output-len 256
When benchmarking pure decode performance in a PD disaggregation setup, you can bypass the prefill node entirely by using --fake-prefill. This requires the decode server to be started with --disaggregation-transfer-backend fake:
# Step 1: Start a decode-only server with fake transfer backend
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--disaggregation-transfer-backend fake \
--port 30001
# Step 2: Run bench_serving with --fake-prefill
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30001 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--num-prompts 500 \
--random-input-len 1024 --random-output-len 256 \
--fake-prefill
Similarly, bench_one_batch_server also supports --fake-prefill:
python3 -m sglang.bench_one_batch_server \
--base-url http://127.0.0.1:30001 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--batch-size 32 --input-len 1024 --output-len 256 \
--fake-prefill
The --fake-prefill flag automatically injects special sentinel values into each request, telling the decode server to skip real KV transfer and generate fake KV data locally.
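If all requests fail, verify --backend, server URL/port, --model, and authentication. Check warmup errors printed by the script.
If throughput is lower than expected, adjust --request-rate and --max-concurrency; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
For image datasets, install the extra dependencies (pillow, datasets, pybase64).
For authentication errors, set OPENAI_API_KEY or disable auth on your server.
The script attempts to raise the open-file limit (RLIMIT_NOFILE) to help with many concurrent connections.
For sglang, /server_info is queried post-run to report speculative decoding accept length when available.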