# Disaggregated Prefilling (experimental)
This page introduces you to the disaggregated prefilling feature in vLLM.
!!! note
    This feature is experimental and subject to change.
## Why disaggregated prefilling?

Two main reasons:

- **Tuning TTFT and ITL separately**: Disaggregated prefilling puts the prefill and decode phases of LLM inference inside separate vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**: Without disaggregated prefilling, vLLM may insert chunks of a prefill job during the decoding of another request, which raises tail ITL. Disaggregated prefilling avoids this interference and keeps tail ITL under control.

!!! note
    Disaggregated prefill DOES NOT improve throughput.
## Usage example

Please refer to `examples/online_serving/disaggregated_prefill.sh` for example usage of disaggregated prefilling.
Now supports 8 types of connectors:
- **ExampleConnector**: refer to `examples/offline_inference/disaggregated-prefill-v1/run.sh` for example usage of ExampleConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to `examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh` for example usage of LMCacheConnectorV1 disaggregated prefilling, which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to `tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh` for example usage of NixlConnector disaggregated prefilling, which supports fully asynchronous send/recv. For a detailed usage guide, see the NixlConnector Usage Guide; for feature compatibility details, see the NixlConnector Compatibility Matrix.
- **P2pNcclConnector**: refer to `examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh` for example usage of P2pNcclConnector disaggregated prefilling.
- **MooncakeConnector**: refer to `examples/online_serving/disaggregated_serving/mooncake_connector/run_mooncake_connector.sh` for example usage of MooncakeConnector disaggregated prefilling. For a detailed usage guide, see the MooncakeConnector Usage Guide.
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as:

    ```bash
    --kv-transfer-config '{"kv_connector":"MultiConnector","kv_role":"kv_both","kv_connector_extra_config":{"connectors":[{"kv_connector":"NixlConnector","kv_role":"kv_both"},{"kv_connector":"ExampleConnector","kv_role":"kv_both","kv_connector_extra_config":{"shared_storage_path":"local_storage"}}]}}'
    ```
    For `NixlConnector`, you may also specify one or multiple NIXL backends, for example:

    ```bash
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_buffer_device":"cuda", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'
    ```
- **OffloadingConnector**: enables offloading of KV data to CPU memory, with a customizable CPU block size (in tokens) and total CPU memory to allocate (in bytes):

    ```bash
    --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "cpu_bytes_to_use": 1000000000}}'
    ```
- **FlexKVConnectorV1**: refer to `examples/offline_inference/prefix_caching_flexkv.py` for example usage of FlexKVConnectorV1. FlexKV is a distributed KV store and multi-level cache management system for ultra-large-scale LLM inference.

    ```bash
    --kv-transfer-config '{"kv_connector":"FlexKVConnectorV1","kv_role":"kv_both"}'
    ```
## Benchmarks

Please refer to `benchmarks/disagg_benchmarks` for disaggregated prefilling benchmarks.
## Development

We implement disaggregated prefilling by running two vLLM instances: one for prefill (the prefill instance) and one for decode (the decode instance). A connector then transfers the prefill KV caches and results from the prefill instance to the decode instance.
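As a rough sketch of this flow (reusing the `ExampleConnector` shared-storage config shown earlier; the shipped example scripts differ in details), the prefill instance persists its KV caches through the connector, and the decode instance then loads them instead of recomputing the prefill:

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

prompts = ["San Francisco is a"]

# Connector config reused from the ExampleConnector JSON above: KV caches
# are exchanged through a shared local directory.
ktc = KVTransferConfig(
    kv_connector="ExampleConnector",
    kv_role="kv_both",
    kv_connector_extra_config={"shared_storage_path": "local_storage"},
)

# Prefill instance: max_tokens=1 so it effectively only runs the prefill
# pass; the connector saves the computed KV caches.
prefill = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_transfer_config=ktc)
prefill.generate(prompts, SamplingParams(max_tokens=1))
del prefill  # release the GPU before starting the decode instance

# Decode instance: with the same connector config it loads the stored KV
# caches and skips recomputing the prefill.
decode = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_transfer_config=ktc)
for out in decode.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```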
All disaggregated prefilling implementation is under `vllm/distributed/kv_transfer`.
Key abstractions for disaggregated prefilling:
- **Connector**: allows the **KV consumer** to retrieve the KV caches of a batch of requests from the **KV producer**.
- **LookupBuffer**: provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL: `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: a single-direction FIFO pipe for tensor transmission, supporting `send_tensor` and `recv_tensor`.

!!! note
    `insert` is a non-blocking operation but `drop_select` is a blocking operation.
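To make these semantics concrete, here is a toy, single-process stand-in for a LookupBuffer (not vLLM's actual implementation, which transfers tensors between processes): `insert` enqueues a KV cache under a key without blocking, while `drop_select` blocks until a matching entry arrives, then removes and returns it.

```python
import threading

class ToyLookupBuffer:
    """Toy illustration of insert/drop_select semantics only."""

    def __init__(self):
        self._store = {}                # key -> list of pending KV caches
        self._cond = threading.Condition()

    def insert(self, key, kv_cache):
        # Non-blocking: stash the KV cache and wake any waiting consumer.
        with self._cond:
            self._store.setdefault(key, []).append(kv_cache)
            self._cond.notify_all()

    def drop_select(self, key):
        # Blocking: wait until an entry matching `key` exists, then remove
        # it from the buffer and return it (like SQL DELETE ... RETURNING).
        with self._cond:
            while not self._store.get(key):
                self._cond.wait()
            return self._store[key].pop(0)
```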
Here is a figure illustrating how the above 3 abstractions are organized:
The workflow of disaggregated prefilling is as follows:
The `buffer` corresponds to the `insert` API in LookupBuffer, and the `drop_select` corresponds to the `drop_select` API in LookupBuffer.
Now every process in vLLM will have a corresponding connector. Specifically, we have:

- **Scheduler connector**: runs inside the scheduler process; it decides which KV caches to transfer and builds the transfer metadata (`KVConnectorMetadata`) for each batch.
- **Worker connector**: runs inside the worker process; it stores and loads the KV caches from / to the paged GPU memory according to the `KVConnectorMetadata`.
Here is a figure illustrating how the above 2 connectors are organized:
The figure below shows how the worker connector works with the attention module to achieve layer-by-layer KV cache store and load:
## Third-party contributions

Disaggregated prefilling is highly related to infrastructure, so vLLM relies on third-party connectors for production-level disaggregated prefilling (and the vLLM team will actively review and merge new PRs for third-party connectors).
We recommend three ways of implementation:

- **Fully-customized connector**: Implement your own `Connector`, and call third-party libraries to send and receive KV caches, and much more (like editing vLLM's model input to perform customized prefilling, etc.). This approach gives you the most control, but at the risk of being incompatible with future vLLM versions.
- **Database-like connector**: Implement your own `LookupBuffer` and support the `insert` and `drop_select` APIs just like SQL.
- **Distributed P2P connector**: Implement your own `Pipe` and support the `send_tensor` and `recv_tensor` APIs, just like `torch.distributed`. A minimal sketch follows this list.
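For the third option, a Pipe is essentially a single-direction FIFO tensor channel. A minimal sketch over `torch.distributed` (assuming a process group is already initialized and both sides agree on the tensor shape and dtype; a real Pipe also has to ship that metadata and handle `None` tensors):

```python
import torch
import torch.distributed as dist

class ToyPipe:
    """Toy single-direction tensor pipe; sketch only."""

    def __init__(self, peer_rank):
        self.peer_rank = peer_rank

    def send_tensor(self, tensor):
        # Blocking point-to-point send to the peer rank.
        dist.send(tensor, dst=self.peer_rank)

    def recv_tensor(self, shape, dtype):
        # The receiver pre-allocates a buffer of the agreed shape/dtype.
        buf = torch.empty(shape, dtype=dtype)
        dist.recv(buf, src=self.peer_rank)
        return buf
```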