docs/features/mooncake_store_connector_usage.md
MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector, which performs direct point-to-point KV transfer between prefiller and decoder, MooncakeStoreConnector offloads KV cache to an external distributed store. This supports both extending the effective cache size with CPU memory and sharing prefix cache across instances.
Install mooncake through pip:
uv pip install mooncake-transfer-engine
Refer to the official Mooncake repository for further installation instructions, including how to build from source.
The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:
mooncake_master --port 50051
Default ports:
Multiple vLLM instances can share the same master server.
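Because the master has to be up before vLLM starts, a launch script can wait for the master port before continuing. The following is a minimal sketch, not part of Mooncake or vLLM: `wait_for_port` is a hypothetical helper, and it assumes `python3` is on the PATH. An accepted TCP connection only shows that something is listening on the port, not that the master is fully initialized.

```shell
#!/bin/sh
# Hypothetical helper (not provided by Mooncake or vLLM): poll a TCP port
# until it accepts a connection or the timeout (in seconds) expires.
wait_for_port() {
    host=$1
    port=$2
    timeout=${3:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # python3 exits 0 only if the TCP connect succeeds.
        if python3 -c 'import socket, sys
socket.create_connection((sys.argv[1], int(sys.argv[2])), timeout=1).close()' \
            "$host" "$port" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Example usage (commented out so the snippet has no side effects):
# mooncake_master --port 50051 &
# wait_for_port 127.0.0.1 50051 30 || { echo "master not reachable" >&2; exit 1; }
```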
Create a JSON configuration file (e.g., mooncake_config.json):
{
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": "80GB",
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": ""
}
- protocol: Use "rdma" for best performance. "tcp" works as a fallback.
- global_segment_size: CPU memory contributed to the distributed pool (per GPU).
- local_buffer_size: Private buffer for this node's own operations (per GPU).

Set the config path via environment variable:
export MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json
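The configuration and export steps above can be combined into one script that also catches malformed JSON before vLLM is launched. This is a sketch under a few assumptions: the temporary directory and the `CONFIG_FILE` variable are illustrative choices, and validation relies on `python3 -m json.tool`, so `python3` must be on the PATH. It checks JSON syntax only, not whether Mooncake accepts the values.

```shell
#!/bin/sh
# Write the example configuration to a scratch directory (illustrative path).
CONFIG_DIR=$(mktemp -d)
CONFIG_FILE="$CONFIG_DIR/mooncake_config.json"

cat > "$CONFIG_FILE" <<'EOF'
{
  "metadata_server": "P2PHANDSHAKE",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": "80GB",
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": ""
}
EOF

# Export the path only if the file parses as JSON.
if python3 -m json.tool "$CONFIG_FILE" > /dev/null; then
    export MOONCAKE_CONFIG_PATH="$CONFIG_FILE"
    echo "MOONCAKE_CONFIG_PATH=$MOONCAKE_CONFIG_PATH"
else
    echo "invalid JSON in $CONFIG_FILE" >&2
fi
```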
Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store.
Prefiller Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8100 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_producer"
}
]
}
}'
Decoder Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8200 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_consumer"
}
]
}
}'
Proxy:
A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.
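The prefiller and decoder commands above pass the same MultiConnector structure and differ only in the role (and in their ports). One way to avoid keeping two copies of the JSON in sync is to generate it from the role. The sketch below assumes nothing beyond the configuration already shown; `multi_connector_config` is a hypothetical helper name, not a vLLM utility.

```shell
#!/bin/sh
# Hypothetical helper: print the MultiConnector JSON for a given role
# (kv_producer for the prefiller, kv_consumer for the decoder).
multi_connector_config() {
    role=$1
    case "$role" in
        kv_producer|kv_consumer) ;;
        *) echo "unsupported role: $role" >&2; return 1 ;;
    esac
    cat <<EOF
{
  "kv_connector": "MultiConnector",
  "kv_role": "$role",
  "kv_connector_extra_config": {
    "connectors": [
      {"kv_connector": "MooncakeConnector", "kv_role": "$role"},
      {"kv_connector": "MooncakeStoreConnector", "kv_role": "$role"}
    ]
  }
}
EOF
}

PREFILL_CONFIG=$(multi_connector_config kv_producer)
DECODE_CONFIG=$(multi_connector_config kv_consumer)
echo "$PREFILL_CONFIG"

# Example usage, mirroring the prefiller command (commented out):
# MOONCAKE_CONFIG_PATH=mooncake_config.json \
# VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
# vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --port 8100 \
#     --kv-transfer-config "$PREFILL_CONFIG"
```

The decoder command would pass `"$DECODE_CONFIG"` in the same way, with its own port and bootstrap port.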
| Variable | Description | Default |
|---|---|---|
| MOONCAKE_CONFIG_PATH | Path to Mooncake JSON config file | (required) |
| VLLM_MOONCAKE_BOOTSTRAP_PORT | Bootstrap port for MooncakeConnector P2P transfer (disagg mode only) | 8998 |
- load_async (bool): Enable asynchronous loading for better compute-I/O overlap. Default: true.
- enable_cross_layers_blocks (bool): Enable cross-layer block packing for reduced store operations. Default: false.
- discard_partial_chunks (bool): Discard partial block chunks during store. Default: true.
- lookup_rpc_port (int): Custom port for the ZMQ lookup RPC socket. Default: 0.

When running with data parallelism, set a fixed PYTHONHASHSEED so that block hashes are consistent across DP ranks:
PYTHONHASHSEED=0 vllm serve ...
Without this, identical prompts may produce different block hashes on different DP ranks, preventing cross-instance prefix cache hits.
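The underlying Python behavior can be checked directly: string hashes from the built-in `hash()` are randomized per process unless PYTHONHASHSEED is fixed. The sketch below illustrates only that interpreter behavior, not how vLLM computes block hashes; `hash_with_seed` is a hypothetical helper and `python3` is assumed to be on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: hash a string in a fresh Python process
# started with the given PYTHONHASHSEED value.
hash_with_seed() {
    PYTHONHASHSEED=$1 python3 -c 'import sys; print(hash(sys.argv[1]))' "$2"
}

# Two separate processes with the same fixed seed agree on the hash.
FIXED_A=$(hash_with_seed 0 "same prompt")
FIXED_B=$(hash_with_seed 0 "same prompt")

# With randomization enabled, each process picks its own seed.
RANDOM_A=$(hash_with_seed random "same prompt")
RANDOM_B=$(hash_with_seed random "same prompt")

echo "fixed seed:  $FIXED_A $FIXED_B"
echo "random seed: $RANDOM_A $RANDOM_B"
```

With the fixed seed the two values match, while the randomized values will in practice differ from run to run, which is the mismatch that prevents cache hits across ranks.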