experimental/sgl-router/README.md
Slim, KV-aware, OpenAI-compatible router for SGLang workers.
Serves a single model and routes across its workers. Exposes
/v1/tokenize, /v1/detokenize, /v1/models, /v1/chat/completions
(buffered and SSE), plus /healthz / /readyz and /metrics. Worker
pools come from either a static URL list or Kubernetes EndpointSlice
discovery.
cd experimental/sgl-router
cargo build --release
The router is configured entirely through CLI flags (run
sgl-router --help for the full list). It serves exactly one model, so
--model-id is required, along with exactly one discovery backend.
--tokenizer-path is optional: give it a local tokenizer.json path or a
HuggingFace repo id, and when omitted the router downloads the tokenizer
for --model-id from HuggingFace (honoring HF_TOKEN / HF_HOME).
Static worker list:
sgl-router \
--host 0.0.0.0 --port 30000 \
--model-id qwen3 \
--tokenizer-path /models/qwen3/tokenizer.json \
--worker-urls http://10.0.0.1:30000 http://10.0.0.2:30000
Kubernetes EndpointSlice discovery:
sgl-router \
--host 0.0.0.0 --port 30000 \
--model-id qwen3 \
--tokenizer-path /models/qwen3/tokenizer.json \
--service-discovery \
--service-discovery-namespace prod \
--selector app=engines-qwen3
Omit --service-discovery-namespace to watch all namespaces (requires
cluster-wide RBAC). For prefill/decode disaggregation, replace --selector
with --prefill-selector and --decode-selector.
Apache-2.0.