doc/source/serve/llm/architecture/serving-patterns/prefill-decode.md
(serve-llm-architecture-prefill-decode)=

# Prefill-decode disaggregation
Prefill-decode (PD) disaggregation is a serving pattern that separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This pattern was first pioneered in DistServe and optimizes resource utilization by scaling each phase independently based on its specific requirements.
*Figure (pd-architecture): Prefill-decode disaggregation architecture with `PDDecodeServer` orchestrating remote prefill and local decode.*
In prefill-decode disaggregation:

- **Prefill server** (`PDPrefillServer`): Processes input prompts and generates the initial KV cache.
- **Decode server** (`PDDecodeServer`): Orchestrates the flow. Initiates prefill remotely, then runs decode locally on its own engine using the transferred KV cache.

Prefill and decode have different computational patterns:
| Phase | Characteristics | Resource Needs |
|---|---|---|
| Prefill | Processes the entire prompt at once | High compute, lower memory |
| | Parallel token processing | Benefits from high FLOPS |
| | Short duration per request | Can use fewer replicas when decode-limited |
| Decode | Generates one token at a time | Lower compute, high memory |
| | Auto-regressive generation | Benefits from large batch sizes |
| | Long duration (many tokens) | Needs more replicas |
Disaggregation enables:

- Scaling prefill and decode replicas independently based on each phase's load.
- Matching resources to each phase: compute-heavy replicas for prefill, memory- and batch-oriented replicas for decode.
- Isolating the two phases so long prefills don't stall ongoing token generation.
PDDecodeServer is the decode-side LLM server that orchestrates the disaggregated flow. It owns a real engine and holds a handle to the prefill deployment:
```python
class PDDecodeServer(PDOrchestratorMixin, LLMServer):
    """Decode-side server with orchestration."""

    def __init__(
        self,
        llm_config: LLMConfig,
        prefill_server: DeploymentHandle,
    ):
        self._prefill_handle = prefill_server
        # Initialize the real decode engine.
        super().__init__(llm_config)

    async def chat(
        self,
        request: ChatCompletionRequest,
    ) -> AsyncGenerator[str, None]:
        """Handle chat completion with PD flow.

        Flow:
        1. Send request to prefill deployment (remote)
        2. Prefill processes prompt, returns KV metadata
        3. Run decode locally on own engine with KV metadata
        4. Stream tokens to client
        """
        ...
```
Key responsibilities:

- Own and run the real decode engine.
- Hold a handle to the prefill deployment and initiate prefill remotely for each request.
- Run decode locally using the KV metadata returned by prefill.
- Stream generated tokens back to the client.
`PDPrefillServer` extends `LLMServer` for the prefill side. It is a standard LLM server with an additional `prewarm_prefill` method for optional connector warm-up.

KV transfer connectors (such as NIXL) require a handshake between each prefill and decode replica, which otherwise happens on the first request. Under high traffic this can cause queueing. Pre-warming mitigates this cold-start problem by sending a tiny dummy request through the full prefill-to-decode path for every prefill replica, so that the connector establishes its connections eagerly at startup before the replica is marked healthy. Enable it by setting `experimental_configs={"_prewarm_prefill_decode": True}` in the decode `LLMConfig`.
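For example, a decode-side config with pre-warming enabled might look like the following sketch. The `experimental_configs` flag comes from the text above; the import path and the remaining fields mirror the decode configuration shown later on this page.

```python
from ray.serve.llm import LLMConfig

# Sketch: decode-side LLMConfig with prefill-decode pre-warming enabled.
# Other fields (engine_kwargs with kv_transfer_config) are the same as the
# decode config shown below and are omitted here for brevity.
decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    experimental_configs={"_prewarm_prefill_decode": True},
)
```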
Standard `LLMServer` configured for prefill:

```python
prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    engine_kwargs=dict(
        kv_transfer_config={
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    ),
)
```
Standard `LLMServer` configured for decode:

```python
decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    engine_kwargs=dict(
        kv_transfer_config={
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    ),
)
```
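The following is a minimal wiring sketch, not the full builder API: it assumes `PDPrefillServer` and `PDDecodeServer` are importable (import paths omitted) and reuses `prefill_config` and `decode_config` from above. The bound prefill deployment is passed into the decode deployment's constructor, where Ray Serve resolves it to a `DeploymentHandle` at runtime.

```python
from ray import serve

# Sketch only: wrap both servers as Serve deployments and wire them together.
# PDPrefillServer / PDDecodeServer imports are omitted; prefill_config and
# decode_config are the LLMConfig objects defined above.
prefill = serve.deployment(PDPrefillServer).bind(prefill_config)
decode = serve.deployment(PDDecodeServer).bind(decode_config, prefill_server=prefill)

# In a real deployment, an OpenAI-compatible router sits in front; this only
# shows how the prefill handle reaches the decode server.
serve.run(decode)
```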
*Figure (pd-flow): Prefill-decode request flow showing KV cache transfer between phases.*
Detailed request flow:

1. The client sends a chat request to `/v1/chat/completions`.
2. The request is routed to `PDDecodeServer`.
3. `PDDecodeServer` calls the prefill deployment.
4. The prefill server processes the prompt and returns KV transfer metadata.
5. `PDDecodeServer` runs decode on its own engine with the KV metadata.
6. Generated tokens stream back to the client.
:::{note}
The KV cache transfer is transparent to the client. From the client's perspective, it's a standard OpenAI API call.
:::
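For instance, a client can stream tokens with the standard OpenAI Python SDK unchanged. The base URL, API key, and prompt below are illustrative assumptions; the model ID matches the configs on this page.

```python
from openai import OpenAI

# Point the standard OpenAI client at the Serve endpoint (URL assumes a local
# deployment); the prefill-decode split is invisible to the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain prefill-decode disaggregation."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```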
Prefill-decode disaggregation works best when:

- Prompts are long relative to the number of generated tokens, so prefill accounts for a large share of the work.
- You want to tune time-to-first-token and inter-token latency independently by scaling each phase on its own.
- You have enough GPUs to dedicate separate replicas to prefill and decode.

Consider alternatives when:

- Traffic is low or the model is small, so a single pool of combined replicas keeps up.
- GPU capacity is limited and splitting it across two phases would leave replicas underutilized.
- Inter-node network bandwidth is limited, making KV cache transfer a bottleneck.
The latency of KV cache transfer between prefill and decode adds to overall request latency and is mostly determined by network bandwidth. NIXL supports several backend plugins, but its performance across different network stacks isn't mature yet, so inspect your deployment to verify that NIXL uses the right network backend for your environment.
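For a rough sense of scale, here is a back-of-envelope sketch of the KV cache volume a single request transfers. The model dimensions are those of Llama-3.1-8B; the prompt length and effective bandwidth are assumptions.

```python
# Back-of-envelope sketch: KV cache volume transferred per request and the
# resulting transfer time. Dimensions are for Llama-3.1-8B; prompt length and
# effective bandwidth are illustrative assumptions.
num_layers = 32
num_kv_heads = 8          # GQA
head_dim = 128
bytes_per_value = 2       # fp16 / bf16

kv_bytes_per_token = 2 * num_kv_heads * head_dim * bytes_per_value * num_layers  # K and V
prompt_tokens = 2048
total_bytes = kv_bytes_per_token * prompt_tokens

effective_bandwidth = 25e9  # ~200 Gbps link, assumed
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")
print(f"{total_bytes / 2**20:.0f} MiB for a {prompt_tokens}-token prompt")
print(f"~{total_bytes / effective_bandwidth * 1e3:.1f} ms transfer time")
```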
The system must balance load between prefill and decode phases. Mismatched scaling can lead to:

- Too few prefill replicas: requests queue at prefill and time-to-first-token grows.
- Too few decode replicas: prefill results wait for decode capacity and inter-token latency grows.
- Too many replicas of either phase: idle GPUs and wasted capacity.
Monitor both phases and adjust replica counts and autoscaling policies accordingly.
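As a sketch of what that looks like, each phase's `LLMConfig` can carry its own `deployment_config` with an independent autoscaling policy. The replica bounds below are illustrative assumptions, not recommendations, and the KV transfer settings from earlier are omitted for brevity.

```python
from ray.serve.llm import LLMConfig

# Illustrative only: give prefill and decode separate autoscaling policies
# through each LLMConfig's deployment_config. Replica counts are placeholders;
# kv_transfer_config (shown earlier) is omitted here.
prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
    ),
)

decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=2, max_replicas=8),
    ),
)
```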
- `../overview` - High-level architecture overview
- `../core` - Core components and protocols
- `data-parallel` - Data parallel attention architecture
- `../../user-guides/prefill-decode` - Practical deployment guide