docs_new/docs/advanced_features/llm-d.mdx
llm-d is a Kubernetes-native distributed inference framework for serving large language models at scale across a fleet of inference servers. SGLang is a supported inference engine in llm-d: llm-d coordinates a fleet of SGLang instances across a cluster so that performance holds up under real production traffic. The project's stated goal is the fastest "time to state-of-the-art (SOTA) performance" for key OSS models across most hardware accelerators.
llm-d is a CNCF Sandbox project founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA.
A single SGLang server is fast, and RadixAttention already maximizes cache reuse within each replica. At scale the picture changes. Across many replicas, round-robin load balancing scatters related requests, so cache locality breaks and radix-cache hit rates collapse; long prompts inflate time-to-first-token (TTFT); and accelerators sit underused. llm-d adds the cluster-level layer that the engine does not aim to provide on its own: intelligent inference scheduling, precise prefix-cache routing, tiered KV-cache management, prefill/decode disaggregation, flow control, and autoscaling.
These capabilities are composable. Most teams start by adding prefix-aware routing over an existing SGLang pool, then layer in the rest as specific bottlenecks appear.
llm-d builds on the Gateway API Inference Extension, so SGLang pools are managed through standard Kubernetes resources (Gateway, HTTPRoute, InferencePool) and work with supported gateway providers such as Istio, GKE Inference Gateway, and agentgateway, rather than a bespoke routing tier.
llm-d publishes reproducible benchmarks from production-scale deployments on Prism. One representative result: prefix-aware routing delivered 3x higher output throughput and 2x faster TTFT than round-robin load balancing (Llama 3.1 70B). The mechanism carries over directly to SGLang, where RadixAttention makes the cluster-level cache hit rate a function of routing.
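To make the routing effect concrete, here is a minimal toy simulation, not llm-d's scheduler and not a benchmark. It assumes each replica keeps a small LRU cache of prefix IDs and compares round-robin against a simple prefix-affinity rule (`prefix_id % n_replicas`). The names `Replica`, `hit_rate`, `round_robin`, and `prefix_affinity` are illustrative only, as are the cache size and workload.

```python
import random
from collections import OrderedDict


class Replica:
    """Toy stand-in for one engine replica with a small LRU prefix cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def serve(self, prefix_id):
        """Return True on a cache hit; on a miss, cache the prefix and evict the LRU entry if full."""
        if prefix_id in self.cache:
            self.cache.move_to_end(prefix_id)
            return True
        self.cache[prefix_id] = None
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)
        return False


def hit_rate(route, requests, n_replicas=4, capacity=8):
    """Replay the requests through `route` and return the cluster-wide cache hit rate."""
    replicas = [Replica(capacity) for _ in range(n_replicas)]
    hits = 0
    for i, prefix_id in enumerate(requests):
        hits += replicas[route(i, prefix_id, n_replicas)].serve(prefix_id)
    return hits / len(requests)


def round_robin(i, prefix_id, n_replicas):
    """Ignore the prefix and rotate through replicas."""
    return i % n_replicas


def prefix_affinity(i, prefix_id, n_replicas):
    """Always send the same prefix to the same replica (assumed toy rule)."""
    return prefix_id % n_replicas


rng = random.Random(0)
requests = [rng.randrange(32) for _ in range(10_000)]  # 32 distinct shared prefixes

rr = hit_rate(round_robin, requests)
pa = hit_rate(prefix_affinity, requests)
print(f"round-robin hit rate:     {rr:.2%}")
print(f"prefix-affinity hit rate: {pa:.2%}")
```

In this setup every replica under round-robin sees all 32 prefixes but can hold only 8, so its hit rate should sit near 8/32. Under affinity routing each replica owns 8 prefixes that fit in its cache, so only the first request for each prefix misses. A real scheduler also has to balance load and queue depth, which this sketch ignores.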
Questions and contributions are welcome on GitHub and Slack.
SGLang is supported across the well-lit paths, including intelligent inference scheduling, precise prefix-cache routing (SGLang publishes KV-cache events that the llm-d Router subscribes to), tiered KV-cache management, prefill/decode disaggregation, flow control, and autoscaling. The one current exception is Multi-Node Wide Expert Parallelism, which is vLLM-specific today. See the llm-d documentation for the latest per-engine support status.
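The event-driven routing idea can be sketched as a router-side index that tracks which replica holds which cached blocks, then prefers the replica with the longest matching prefix. This is an illustration under stated assumptions: the event dictionary shape (`replica`, `type`, `hashes`), the `PrefixIndex` class, and the replica names are hypothetical and do not reflect SGLang's actual event format or the llm-d Router's implementation.

```python
from collections import defaultdict


class PrefixIndex:
    """Router-side view of which replica caches which blocks (hypothetical schema)."""

    def __init__(self):
        self.blocks = defaultdict(set)  # replica name -> set of cached block hashes

    def apply(self, event):
        """Apply one assumed event: {"replica": str, "type": "stored" | "removed", "hashes": [...]}."""
        held = self.blocks[event["replica"]]
        if event["type"] == "stored":
            held.update(event["hashes"])
        elif event["type"] == "removed":
            held.difference_update(event["hashes"])

    def matched_prefix(self, replica, request_hashes):
        """Count the leading request blocks that the replica already caches."""
        held = self.blocks.get(replica, set())
        matched = 0
        for block_hash in request_hashes:
            if block_hash not in held:
                break
            matched += 1
        return matched

    def pick(self, replicas, request_hashes):
        """Choose the replica with the longest cached prefix for this request."""
        return max(replicas, key=lambda r: self.matched_prefix(r, request_hashes))


index = PrefixIndex()
index.apply({"replica": "sglang-0", "type": "stored", "hashes": ["a", "b", "c"]})
index.apply({"replica": "sglang-1", "type": "stored", "hashes": ["a"]})
index.apply({"replica": "sglang-0", "type": "removed", "hashes": ["c"]})  # eviction

request = ["a", "b", "c", "d"]  # block hashes of the incoming prompt, in order
chosen = index.pick(["sglang-0", "sglang-1"], request)
print(chosen)
```

Scoring only on prefix length keeps the sketch short. A production scorer would presumably combine this signal with load and queue depth rather than use it alone.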