doc/source/serve/llm/user-guides/prefill-decode.md
(prefill-decode-guide)=

# Prefill/decode disaggregation

Deploy LLMs with separated prefill and decode phases for better resource utilization and cost optimization.
:::{warning}
This feature requires vLLM v1, which is the default engine. For legacy deployments using vLLM v0, upgrade to v1 first.
:::
Prefill/decode disaggregation separates the prefill phase (processing input prompts) from the decode phase (generating tokens). This separation provides:

- Independent scaling and hardware choices for each phase.
- Better resource utilization, because compute-bound prefill work no longer competes with memory-bound decode work.
- More predictable latency, because long prompt processing doesn't interrupt in-flight token generation.
vLLM provides several KV transfer backends for disaggregated serving; this guide covers NIXLConnector and LMCacheConnectorV1 (with NIXL or Mooncake storage backends).
Consider this pattern when:

- Your workload mixes long prompts with comparatively short generations.
- You need to tune time-to-first-token and inter-token latency independently.
- You want to scale prefill and decode capacity separately as traffic shifts.
## NIXLConnector

NIXLConnector provides network-based KV cache transfer between prefill and decode servers with minimal configuration.
If you use ray-project/ray-llm Docker images, NIXL is already installed. Otherwise, install it:

```bash
uv pip install nixl
```
The NIXL wheel comes bundled with its supported backends (UCX, libfabric, EFA, etc.). These shared binaries may not be the latest version for your hardware and network stack. If you need the latest versions, install NIXL from source against the target backend library. See the NIXL installation guide for details.
The following example shows how to deploy with NIXLConnector:
```python
from ray import serve
from ray.serve.llm import LLMConfig, build_pd_openai_app

# Configure the prefill instance.
prefill_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    },
)

# Configure the decode instance.
decode_config = LLMConfig(
    model_loading_config={
        "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    },
    engine_kwargs={
        "kv_transfer_config": {
            "kv_connector": "NixlConnector",
            "kv_role": "kv_both",
        },
    },
)

pd_config = dict(
    prefill_config=prefill_config,
    decode_config=decode_config,
)

app = build_pd_openai_app(pd_config)
serve.run(app)
```
For production deployments, use a YAML configuration file such as `nixl_config.yaml`.
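A minimal sketch of such a file, mirroring the Python example above; the application name and route prefix are placeholders:

```yaml
# nixl_config.yaml (sketch)
applications:
- name: pd-llm-app
  route_prefix: /
  import_path: ray.serve.llm:build_pd_openai_app
  args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: NixlConnector
          kv_role: kv_both
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: NixlConnector
          kv_role: kv_both
```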
Deploy with:

```bash
serve deploy nixl_config.yaml
```
Key parameters for NIXLConnector:

- `kv_connector`: Set to `"NixlConnector"` to use NIXL.
- `kv_role`: Set to `"kv_both"` for both prefill and decode instances.

## LMCacheConnectorV1

LMCacheConnectorV1 provides advanced caching with support for multiple storage backends.
Install LMCache:

```bash
uv pip install lmcache
```
### LMCache with NIXL

This configuration uses LMCache with a NIXL-based storage backend for network communication.
The following is an example Ray Serve configuration for LMCache with NIXL.
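A sketch of such a configuration; it assumes the `LMCACHE_CONFIG_FILE` variable can be set through each instance's `runtime_env`, and the file paths and buffer size are illustrative:

```yaml
# Sketch: prefill produces KV blocks, decode consumes them over LMCache/NIXL.
applications:
- name: pd-llm-app
  route_prefix: /
  import_path: ray.serve.llm:build_pd_openai_app
  args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_producer
          kv_buffer_size: 1000000000  # 1 GB, illustrative
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: /path/to/lmcache_prefiller.yaml
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_consumer
          kv_buffer_size: 1000000000
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: /path/to/lmcache_decoder.yaml
```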
Create the LMCache configuration for the prefill instance (`lmcache_prefiller.yaml`).
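A sketch modeled on LMCache's disaggregated-prefill examples; the host, port, and buffer size are illustrative, and exact keys may vary across LMCache versions:

```yaml
# lmcache_prefiller.yaml (sketch)
local_cpu: False
max_local_cpu_size: 0

# NIXL transfer settings: the prefill side sends KV blocks.
enable_nixl: True
nixl_role: "sender"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824  # 1 GiB staging buffer
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```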
Create the LMCache configuration for the decode instance (`lmcache_decoder.yaml`).
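A matching sketch for the decode side; it mirrors the prefiller config with the receiver role, and the same caveats apply:

```yaml
# lmcache_decoder.yaml (sketch)
local_cpu: False
max_local_cpu_size: 0

# NIXL transfer settings: the decode side receives KV blocks.
enable_nixl: True
nixl_role: "receiver"
nixl_receiver_host: "localhost"
nixl_receiver_port: 55555
nixl_buffer_size: 1073741824  # 1 GiB staging buffer
nixl_buffer_device: "cuda"
nixl_enable_gc: True
```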
:::{note}
The `LMCACHE_CONFIG_FILE` environment variable must point to an existing configuration file that's accessible within the Ray Serve container or worker environment. Ensure these configuration files are properly mounted or available in your deployment environment.
:::
### LMCache with Mooncake

This configuration uses LMCache with Mooncake store, a high-performance distributed storage system.
The following is an example Ray Serve configuration for LMCache with Mooncake.
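A sketch of such a configuration; it mirrors the NIXL-backend Serve config above, but both roles point at the same Mooncake LMCache file (the path is a placeholder):

```yaml
# Sketch: both instances share one LMCache config pointing at the Mooncake store.
applications:
- name: pd-llm-app
  route_prefix: /
  import_path: ray.serve.llm:build_pd_openai_app
  args:
    prefill_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_producer
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: /path/to/lmcache_mooncake.yaml
    decode_config:
      model_loading_config:
        model_id: meta-llama/Llama-3.1-8B-Instruct
      engine_kwargs:
        kv_transfer_config:
          kv_connector: LMCacheConnectorV1
          kv_role: kv_consumer
      runtime_env:
        env_vars:
          LMCACHE_CONFIG_FILE: /path/to/lmcache_mooncake.yaml
```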
Create the LMCache configuration for Mooncake (`lmcache_mooncake.yaml`).
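A sketch modeled on the LMCache Mooncake documentation linked below; the addresses match the etcd and `mooncake_master` commands in the next step, and the sizes and protocol are illustrative:

```yaml
# lmcache_mooncake.yaml (sketch)
chunk_size: 256
local_cpu: False
max_local_cpu_size: 5
remote_url: "mooncakestore://localhost:49999/"
remote_serde: "naive"
extra_config:
  local_hostname: "localhost"
  metadata_server: "etcd://localhost:2379"
  protocol: "tcp"                   # use "rdma" with a device_name on RDMA-capable networks
  master_server_address: "localhost:49999"
  global_segment_size: 3355443200   # ~3.1 GiB registered with the store
  local_buffer_size: 1073741824     # 1 GiB local staging buffer
```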
:::{warning}
For Mooncake deployments, make sure the LMCache configuration file is readable by the Ray Serve workers (for example, `chmod 644`).
:::

Key parameters for LMCacheConnectorV1:

- `kv_connector`: Set to `"LMCacheConnectorV1"`.
- `kv_role`: Set to `"kv_producer"` for prefill, `"kv_consumer"` for decode.
- `kv_buffer_size`: Size of the KV cache buffer.
- `LMCACHE_CONFIG_FILE`: Environment variable that specifies the configuration file path.

### Start required services

Before deploying with LMCacheConnectorV1, start the required services:
```bash
# Start an etcd server if not already running.
docker run -d --name etcd-server \
    -p 2379:2379 -p 2380:2380 \
    quay.io/coreos/etcd:latest \
    etcd --listen-client-urls http://0.0.0.0:2379 \
         --advertise-client-urls http://localhost:2379

# For the Mooncake backend, also start the Mooncake master.
# See https://docs.lmcache.ai/kv_cache/mooncake.html for details.
mooncake_master --port 49999
```
## Test the deployment

Test with a chat completion request:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain the benefits of prefill/decode disaggregation"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
If a deployment fails to start, verify that the `LMCACHE_CONFIG_FILE` environment variable points to an existing file.

## See also

- {doc}`Quickstart <../quick-start>` - Basic LLM deployment examples