doc/source/serve/llm/architecture/serving-patterns/prefill-decode.md
(serve-llm-architecture-prefill-decode)=
# Prefill-decode disaggregation
Prefill-decode (PD) disaggregation is a serving pattern that separates the prefill phase (processing the input prompt) from the decode phase (generating output tokens). Pioneered by DistServe, this pattern optimizes resource utilization by scaling each phase independently based on its specific requirements.
*Figure: Prefill-decode disaggregation architecture with PDProxyServer coordinating prefill and decode deployments.*
In prefill-decode disaggregation:

- A prefill deployment processes input prompts and produces the KV cache.
- A decode deployment consumes that KV cache and generates output tokens.
- A proxy coordinates the two deployments and streams results to the client.

Prefill and decode have different computational patterns:
| Phase | Characteristics | Resource Needs |
|---|---|---|
| Prefill | Processes the entire prompt at once<br>Parallel token processing<br>Short duration per request | High compute, lower memory<br>Benefits from high FLOPS<br>Can use fewer replicas when decode-limited |
| Decode | Generates one token at a time<br>Auto-regressive generation<br>Long duration (many tokens) | Lower compute, high memory<br>Benefits from large batch sizes<br>Needs more replicas |
Disaggregation enables:

- Independent scaling of prefill and decode replicas to match the workload.
- Isolation between phases, so long prefills don't stall in-flight token generation.
- Phase-specific tuning, such as different batch sizes or parallelism per phase.

PDProxyServer orchestrates the disaggregated serving:

```python
from typing import AsyncGenerator

from ray.serve.handle import DeploymentHandle
# ChatCompletionRequest is the OpenAI-protocol request type; Ray Serve LLM
# uses vLLM's definition.
from vllm.entrypoints.openai.protocol import ChatCompletionRequest


class PDProxyServer:
    """Proxy server for prefill-decode disaggregation."""

    def __init__(
        self,
        prefill_handle: DeploymentHandle,
        decode_handle: DeploymentHandle,
    ):
        self.prefill_handle = prefill_handle
        self.decode_handle = decode_handle

    async def chat(
        self,
        request: ChatCompletionRequest,
    ) -> AsyncGenerator[str, None]:
        """Handle chat completion with PD flow.

        Flow:
        1. Send request to prefill deployment
        2. Prefill processes prompt, transfers KV to decode
        3. Decode generates tokens, streams to client
        """
        # Prefill phase: process the full prompt and populate the KV cache.
        prefill_result = await self.prefill_handle.chat.remote(request)

        # Extract the KV cache metadata that tells decode where to pull from.
        kv_metadata = prefill_result["kv_metadata"]

        # Decode phase: stream generated tokens, passing the KV reference.
        # `stream=True` makes the handle return an async generator.
        async for chunk in self.decode_handle.options(stream=True).chat.remote(
            request,
            kv_metadata=kv_metadata,
        ):
            yield chunk
```

Key responsibilities:

- Route each request to the prefill deployment first.
- Pass the resulting KV cache metadata to the decode deployment.
- Stream generated tokens back to the client.

Standard LLMServer configured for prefill:

```python
from ray.serve.llm import LLMConfig

prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    engine_kwargs=dict(
        # Prefill-specific configuration. NixlConnector uses kv_both on both
        # sides; the transfer direction is determined at request time.
        kv_transfer_config={
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    ),
)
```

Standard LLMServer configured for decode:

```python
decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    engine_kwargs=dict(
        # Decode-specific configuration
        kv_transfer_config={
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    ),
)
```
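
To make the wiring concrete, here's a minimal sketch that composes the two configs with the proxy. It assumes `build_llm_deployment` from `ray.serve.llm` (available in recent Ray releases) to turn each `LLMConfig` into a deployment; adapt the names to your setup.

```python
from ray import serve
from ray.serve.llm import build_llm_deployment

# Build one LLMServer deployment per phase from the configs above.
prefill_deployment = build_llm_deployment(prefill_config)
decode_deployment = build_llm_deployment(decode_config)

# Bind the proxy to both deployments; Serve resolves the bound
# deployments into DeploymentHandles at runtime.
app = serve.deployment(PDProxyServer).bind(
    prefill_handle=prefill_deployment,
    decode_handle=decode_deployment,
)

serve.run(app)
```
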
*Figure: Prefill-decode request flow showing KV cache transfer between phases.*
Detailed request flow:

1. The client sends a request to `/v1/chat/completions`, which routes to `PDProxyServer`.
2. `PDProxyServer` calls the prefill deployment.
3. The prefill engine processes the prompt and transfers the KV cache to decode.
4. `PDProxyServer` calls the decode deployment with the KV metadata.
5. The decode engine generates tokens and streams them back to the client.

:::{note}
The KV cache transfer is transparent to the client. From the client's perspective, it's a standard OpenAI API call.
:::
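
For example, a client can stream a response with the standard `openai` package; the base URL, API key, and model ID below are illustrative.

```python
from openai import OpenAI

# Illustrative endpoint and credentials; point these at your Serve deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# The client never sees the prefill/decode split or the KV cache transfer.
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Summarize PD disaggregation."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
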
Prefill-decode disaggregation works best when:

- Prompts are long relative to generated outputs, so prefill is compute-heavy.
- You have strict latency targets and need to tune time-to-first-token and inter-token latency independently.
- Traffic is high enough to keep both phases busy.
- Nodes have fast interconnects for KV cache transfer.

Consider alternatives when:

- Traffic is low, so co-located prefill and decode already keep GPUs utilized.
- Network bandwidth between nodes is limited and KV transfer would dominate latency.
- Prompts are short, so prefill causes little interference with decode.
The latency of KV cache transfer between prefill and decode adds to overall request latency and is determined mostly by network bandwidth. NIXL supports different backend plugins, but its performance across network stacks isn't yet mature. Inspect your deployment to verify that NIXL uses the right network backend for your environment.
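
As a starting point for that inspection, NIXL's default UCX backend honors standard UCX environment variables, which you can set through the deployment's runtime environment. This is a hedged sketch: `UCX_TLS` and `UCX_NET_DEVICES` are standard UCX knobs, but the values are examples to adapt, not recommendations.

```python
from ray.serve.llm import LLMConfig

prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    # UCX_TLS restricts the transports UCX may use; UCX_NET_DEVICES pins the
    # NIC. Example values only; verify against your cluster's hardware.
    runtime_env=dict(
        env_vars={
            "UCX_TLS": "rc,cuda_copy,cuda_ipc",
            "UCX_NET_DEVICES": "mlx5_0:1",
        }
    ),
    engine_kwargs=dict(
        kv_transfer_config={
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    ),
)
```
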
The system must balance load between prefill and decode phases. Mismatched scaling can lead to:

- Idle prefill replicas while decode saturates, or the reverse.
- Queueing at the overloaded phase, which inflates end-to-end latency.
- Wasted GPU capacity at the underutilized phase.

Monitor both phases and adjust replica counts and autoscaling policies accordingly.
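
As a sketch of phase-specific scaling, `LLMConfig` accepts a `deployment_config` with a standard Serve `autoscaling_config`; the replica counts below are illustrative, not tuned values.

```python
from ray.serve.llm import LLMConfig

# Decode usually needs more replicas: each request occupies it for many
# generation steps, while prefill finishes in a single pass.
prefill_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    # engine_kwargs with kv_transfer_config omitted for brevity; see above.
)

decode_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3.1-8b",
        model_source="meta-llama/Llama-3.1-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=2, max_replicas=8),
    ),
)
```
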
Related documentation:

- [High-level architecture overview](../overview)
- [Core components and protocols](../core)
- [Data parallel attention architecture](data-parallel)
- [Practical deployment guide](../../user-guides/prefill-decode)