docs_new/docs/developer_guide/bench_serving.mdx
This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.
- `sglang` / `sglang-native`: POST `/generate`
- `sglang-oai`, `vllm`, `lmdeploy`: POST `/v1/completions`
- `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: POST `/v1/chat/completions`
- `trt` (TensorRT-LLM): POST `/v2/models/ensemble/generate_stream`
- `gserver`: custom server (not implemented yet in this script)
- `truss`: POST `/v1/models/model:predict`

If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script will attempt to query GET `/v1/models` for an available model ID (OpenAI-compatible endpoints).
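If you want to check manually which model IDs an endpoint exposes (the same lookup the script performs when `--model` is omitted), a minimal sketch using `requests`; the base URL is an assumption for a local OpenAI-compatible server:

```python
import requests

# Assumed local OpenAI-compatible endpoint; adjust host/port for your server.
base_url = "http://127.0.0.1:8000"

resp = requests.get(f"{base_url}/v1/models", timeout=10)
resp.raise_for_status()

# OpenAI-compatible servers return {"object": "list", "data": [{"id": ...}, ...]}.
model_ids = [m["id"] for m in resp.json().get("data", [])]
print("Available model IDs:", model_ids)
```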
Dependencies: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.

Environment: `OPENAI_API_KEY` (used as `Authorization: Bearer <key>`).

Run a basic benchmark against an sglang server exposing `/generate`:
```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
```

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--num-prompts 1000 \
--model meta-llama/Llama-3.1-8B-Instruct
```
Or, using an OpenAI-compatible endpoint (completions):
```bash
python3 -m sglang.bench_serving \
--backend vllm \
--base-url http://127.0.0.1:8000 \
--num-prompts 1000 \
--model meta-llama/Llama-3.1-8B-Instruct
```
Select a dataset with `--dataset-name`:

- `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
- `random`: random text lengths, sampled from the ShareGPT token space
- `random-ids`: random token ids (can lead to gibberish)
- `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
- `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
- `mmmu`: samples from MMMU (Math split) and includes images

Common dataset flags:
- `--num-prompts N`: number of requests
- `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image
- `--image-count`: number of images per request (for the image dataset)
- `--apply-chat-template`: apply the tokenizer chat template when constructing prompts
- `--dataset-path PATH`: file path for the ShareGPT JSON; if left blank and the file is missing, it will be downloaded and cached
Generated Shared Prefix flags (for `generated-shared-prefix`):

- `--gsp-num-groups`
- `--gsp-prompts-per-group`
- `--gsp-system-prompt-len`
- `--gsp-question-len`
- `--gsp-output-len`

Image dataset flags (for `image`):

- `--image-count`: number of images per request
- `--image-resolution`: image resolution; supports presets (4k, 1080p, 720p, 360p) or a custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
- `--image-format`: image format (jpeg or png)
- `--image-content`: image content type (random or blank)

```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
```
```bash
python -m sglang.bench_serving \
--backend sglang-oai-chat \
--dataset-name image \
--num-prompts 500 \
--image-count 3 \
--image-resolution 720p \
--random-input-len 512 \
--random-output-len 512
```
```bash
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
```

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 3000 \
--random-input-len 1024 \
--random-output-len 1024 \
--random-range-ratio 0.5
```
- `--model` is required unless the backend exposes GET `/v1/models`, in which case the first model ID is auto-selected.
- `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
- `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
- `--request-rate`: requests per second. `inf` sends all requests immediately (burst). A non-infinite rate uses a Poisson process for arrival times (see the sketch below).
- `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
- `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
- `--output-file FILE.jsonl`: append JSONL results to the file; auto-named if unspecified.
- `--output-details`: include per-request arrays (generated texts, errors, ttfts, itls, input/output lens).
- `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into the request payload (sampling params, etc.).
- `--disable-ignore-eos`: pass through EOS behavior (varies by backend).
- `--warmup-requests N`: run warmup requests with short outputs first (default 1).
- `--flush-cache`: call `/flush_cache` (sglang) before the main run.
- `--profile`: call `/start_profile` and `/stop_profile` (requires the server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`).
- `--lora-name name1 name2 ...`: randomly pick one per request and pass it to the backend (e.g., `lora_path` for sglang).
- `--tokenize-prompt`: send integer IDs instead of text (currently supports `--backend sglang` only).

If your target endpoint requires OpenAI-style auth, set:
```bash
export OPENAI_API_KEY=sk-...yourkey...
```
The script will add Authorization: Bearer $OPENAI_API_KEY automatically for OpenAI-compatible routes.
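The arrival-time behavior behind `--request-rate` can be pictured with a short sketch (an illustration of the idea, not the script's exact code): inter-arrival gaps drawn from an exponential distribution with mean `1 / request_rate` form a Poisson process, while `inf` collapses all gaps to zero.

```python
import numpy as np

# Illustration only: how Poisson arrivals relate to --request-rate.
request_rate = 100.0   # requests per second
num_prompts = 10

rng = np.random.default_rng(0)
if np.isinf(request_rate):
    gaps = np.zeros(num_prompts)  # --request-rate inf: burst, send everything at once
else:
    # Exponential inter-arrival gaps with mean 1/rate give a Poisson arrival process.
    gaps = rng.exponential(1.0 / request_rate, num_prompts)

send_times = np.cumsum(gaps)
print("Relative send times (s):", np.round(send_times, 4))
```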
Printed after each run: throughput and latency metrics, including a per-output-token latency computed as `(latency - ttft) / (tokens - 1)`.

The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
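As a quick worked example of that formula (the numbers are made up): a request with 2.0 s total latency, 0.15 s TTFT, and 200 output tokens comes out to roughly 9.3 ms per output token.

```python
# Illustrative numbers only; the script computes this per request from its own measurements.
latency_s = 2.0       # end-to-end request latency
ttft_s = 0.15         # time to first token
output_tokens = 200   # generated tokens

per_token_s = (latency_s - ttft_s) / (output_tokens - 1)
print(f"{per_token_s * 1000:.2f} ms per output token")  # ~9.30
```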
When `--output-file` is set, one JSON object is appended per run. Base fields include:

- `accept_length` when available (sglang)

With `--output-details`, an extended object also includes arrays:

- `input_lens`, `output_lens`
- `ttfts`, `itls` (per request: ITL arrays)
- `generated_texts`, `errors`
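To post-process these JSONL results, each line is an independent JSON object; a minimal reader sketch (the file name and the specific field names are assumptions, inspect `run.keys()` for your version of the script):

```python
import json

# Assumed file name; use whatever you passed to --output-file (or the auto-generated name).
path = "sglang_random.jsonl"

with open(path) as f:
    for i, line in enumerate(f):
        run = json.loads(line)  # one benchmark run per line
        # Field names below are illustrative; check run.keys() for the actual schema.
        print(f"run {i}: "
              f"output_throughput={run.get('output_throughput')} "
              f"mean_ttft_ms={run.get('mean_ttft_ms')}")
```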
`/generate` (streaming):

```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
--num-prompts 2000 \
--request-rate 100 \
--max-concurrency 512 \
--output-file sglang_random.jsonl --output-details
```
```bash
python3 -m sglang.bench_serving \
--backend vllm \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000 \
--sharegpt-output-len 256
```
```bash
python3 -m sglang.bench_serving \
--backend vllm-chat \
--base-url http://127.0.0.1:8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--num-prompts 500 \
--apply-chat-template
```
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 2 \
--image-resolution 720p \
--random-input-len 128 --random-output-len 256 \
--num-prompts 200 \
--apply-chat-template
```
4a) Images with custom resolution:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 512x768 \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
4b) 1080p images with PNG format and blank content:
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model your-vlm-model \
--dataset-name image \
--image-count 1 \
--image-resolution 1080p \
--image-format png \
--image-content blank \
--random-input-len 64 --random-output-len 128 \
--num-prompts 100 \
--apply-chat-template
```
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name generated-shared-prefix \
--gsp-num-groups 64 --gsp-prompts-per-group 16 \
--gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
--num-prompts 1024
```
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--tokenize-prompt \
--random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
```
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--profile \
--flush-cache
```
```bash
python3 -m sglang.bench_serving \
--backend trt \
--base-url http://127.0.0.1:8000 \
--model your-trt-llm-model \
--dataset-name random \
--num-prompts 100 \
--disable-ignore-eos
```
```bash
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30000 \
--model model-name \
--dataset-name mooncake \
--mooncake-slowdown-factor 1.0 \
--mooncake-num-rounds 1000 \
--mooncake-workload <conversation|mooncake|agent|synthetic> \
--use-trace-timestamps true \
--random-output-len 256
```
When benchmarking pure decode performance in a PD disaggregation setup, you can bypass the prefill node entirely by using `--fake-prefill`. This requires the decode server to be started with `--disaggregation-transfer-backend fake`:
```bash
# Step 1: Start a decode-only server with fake transfer backend
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--disaggregation-transfer-backend fake \
--port 30001
```
```bash
# Step 2: Run bench_serving with --fake-prefill
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 --port 30001 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--num-prompts 500 \
--random-input-len 1024 --random-output-len 256 \
--fake-prefill
```
Similarly, `bench_one_batch_server` also supports `--fake-prefill`:
```bash
python3 -m sglang.bench_one_batch_server \
--base-url http://127.0.0.1:30001 \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--batch-size 32 --input-len 1024 --output-len 256 \
--fake-prefill
```
The `--fake-prefill` flag automatically injects special sentinel values into each request, telling the decode server to skip real KV transfer and generate fake KV data locally.
- Verify `--backend`, the server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
- Tune `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
- Install optional dependencies as needed (`pillow`, `datasets`, `pybase64`).
- Set `OPENAI_API_KEY` or disable auth on your server.
- Raise the open-file limit (`RLIMIT_NOFILE`) to help with many concurrent connections (see the sketch below).
- `/server_info` is queried post-run to report the speculative decoding accept length when available.
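For the open-file limit item above, one way to inspect and raise the soft limit from Python on Unix (equivalent in effect to `ulimit -n` in the shell); the target of 65536 is an arbitrary example:

```python
import resource

# Check the current per-process open-file limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit (soft, hard):", soft, hard)

# Raise the soft limit toward the hard limit before running with high concurrency.
target = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```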