docs/features/mooncake_store_connector_usage.md
MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector, which performs direct point-to-point KV transfer between prefiller and decoder, MooncakeStoreConnector offloads KV cache to an external distributed store.
Install mooncake through pip:
uv pip install mooncake-transfer-engine
Refer to the Mooncake official repository for more installation instructions and building from source.
The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:
mooncake_master --port 50051
Default ports:
Multiple vLLM instances can share the same master server.
Create a JSON configuration file (e.g., mooncake_config.json):
{
"mode": "embedded",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": "80GB",
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": "",
"enable_offload": false
}
- mode: Topology selection. "embedded" (default, PR-40900 baseline) has each
  vLLM rank contribute global_segment_size to the pool in-process.
  "standalone-store" makes ranks pure requesters: an external
  mooncake_client process owns the CPU pool and (optionally) the SSD tier.
- protocol: Use "rdma" for best performance. "tcp" works as a fallback.
- global_segment_size: CPU memory contributed to the distributed pool (per
  GPU). Must be > 0 in embedded mode and 0 in standalone-store mode.
- local_buffer_size: Private buffer for this node's own operations (per GPU).
- enable_offload: When true, vLLM allocates a DirectIO staging buffer so
  large prefills do not exceed the owner's SSD-write budget. Set this together
  with the matching --enable_offload=true flag on mooncake_master and on
  the external mooncake_client (if any).

Set the config path via environment variable:
export MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json
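The constraint between mode and global_segment_size described above is easy to get wrong when switching topologies. The following is a minimal pre-flight sketch, not part of vLLM or Mooncake: parse_size and check_config are hypothetical helpers, and the binary interpretation of size suffixes (1GB = 1024**3 bytes) and the fallback of 0 for a missing global_segment_size are assumptions made for illustration only.

```python
import json
import re

# Assumption: size suffixes are binary multiples (1GB = 1024**3 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(value):
    """Convert 0, 1024, "4GB" or "512 MB" into a byte count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)?\s*", str(value), re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[(unit or "B").upper()])


def check_config(config):
    """Return a list of problems found in a Mooncake config dict."""
    problems = []
    mode = config.get("mode", "embedded")  # "embedded" is the documented default
    # Assumption: a missing global_segment_size is treated as 0 here.
    segment = parse_size(config.get("global_segment_size", 0))
    if mode == "embedded" and segment <= 0:
        problems.append("embedded mode requires global_segment_size > 0")
    elif mode == "standalone-store" and segment != 0:
        problems.append("standalone-store mode requires global_segment_size == 0")
    elif mode not in ("embedded", "standalone-store"):
        problems.append(f"unknown mode: {mode!r}")
    return problems


if __name__ == "__main__":
    sample = json.loads('{"mode": "embedded", "global_segment_size": "80GB"}')
    print(check_config(sample))
```

Running a check like this against the file referenced by MOONCAKE_CONFIG_PATH before starting vllm serve can surface a mismatched mode early, rather than at connector start-up.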
Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store.
Prefiller Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8100 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_both"
}
]
}
}'
Decoder Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8200 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_consumer"
}
]
}
}'
Proxy:
A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.
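The prefiller and decoder payloads above differ only in their roles, so they can be generated rather than hand-edited. This is a small sketch under that assumption; multi_connector_config is a hypothetical helper, and it simply mirrors the two examples in this guide (the prefiller's store connector uses kv_both, the decoder's uses kv_consumer).

```python
import json


def multi_connector_config(role):
    """Build the --kv-transfer-config JSON for a prefiller or decoder node."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError("role must be 'kv_producer' or 'kv_consumer'")
    # Mirrors the examples in this guide: the prefiller's store connector
    # uses kv_both, the decoder's store connector uses kv_consumer.
    store_role = "kv_both" if role == "kv_producer" else "kv_consumer"
    config = {
        "kv_connector": "MultiConnector",
        "kv_role": role,
        "kv_connector_extra_config": {
            "connectors": [
                {"kv_connector": "MooncakeConnector", "kv_role": role},
                {"kv_connector": "MooncakeStoreConnector", "kv_role": store_role},
            ]
        },
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Pass the printed string as the value of --kv-transfer-config.
    print(multi_connector_config("kv_producer"))
    print(multi_connector_config("kv_consumer"))
```

Generating both strings from one function keeps the connector order identical on both nodes, which is how the examples above are written.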
Disk offloading is most commonly run in standalone-store mode: an external
mooncake_client process owns the CPU pool and the SSD tier, and each vLLM
rank is a pure requester. This avoids per-rank duplication of the SSD pool
and keeps DirectIO budget tracking on a single process.
Three things need to be aligned for end-to-end disk offloading:
1. mooncake_master is started with --enable_offload=true.
2. mooncake_client (the owner) is started with --enable_offload=true
   plus an SSD path via MOONCAKE_OFFLOAD_FILE_STORAGE_PATH.
3. "enable_offload": true is set in the JSON config file (this is
   read by the connector and is not an environment variable).

Example mooncake_config.json for the vLLM side:
{
"mode": "standalone-store",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": 0,
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": "mlx5_0",
"enable_offload": true
}
Steer this rank to the local owner segment with:
export MOONCAKE_PREFERRED_SEGMENT=127.0.0.1:50053
The owner's SSD directory, on-disk eviction policy, and the DirectIO staging
buffer size are controlled on the mooncake_client side via the standard
Mooncake environment variables (MOONCAKE_OFFLOAD_FILE_STORAGE_PATH,
MOONCAKE_BUCKET_EVICTION_POLICY, MOONCAKE_USE_URING,
MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES,
MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES, etc.). Those are independent of
the vLLM JSON config.
| Variable | Description | Default |
|---|---|---|
| MOONCAKE_CONFIG_PATH | Path to Mooncake JSON config file | (required) |
| VLLM_MOONCAKE_BOOTSTRAP_PORT | Bootstrap port for MooncakeConnector P2P transfer (disagg mode only) | 8998 |
| MOONCAKE_PREFERRED_SEGMENT | Pin this rank's replicas to a specific owner segment (host:port); used in standalone-store mode | (none) |
| MOONCAKE_REQUESTER_LOCAL_HOSTNAME | Override the hostname the vLLM rank registers with Mooncake as a requester. Defaults to the rank's resolved IP. | (none) |
| VLLM_MOONCAKE_STORE_TIER_LOG | When 1, logs a per-batch tier summary (memory vs disk hits) for observability | disabled |
| VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO | Fraction of the owner's DirectIO staging buffer that the requester will fill in a single batch_get_into_multi_buffers call. Lower values give a more conservative pre-split at the cost of more round trips. | 0.9 |
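To make the trade-off behind VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO concrete, the sketch below groups reads greedily so that each batch stays within the usable share of a staging buffer. It is an illustration only: presplit is a hypothetical function, the greedy strategy and the error raised for an oversized single read are assumptions, and the connector's actual splitting logic may differ.

```python
def presplit(sizes, staging_bytes, usable_ratio=0.9):
    """Greedily group read sizes so each batch fits the usable staging budget."""
    budget = int(staging_bytes * usable_ratio)
    batches, current, used = [], [], 0
    for size in sizes:
        if size > budget:
            # Assumption: a read larger than the budget cannot be served at all.
            raise ValueError(f"single read of {size} bytes exceeds budget {budget}")
        if current and used + size > budget:
            # Close the current batch; this read starts the next round trip.
            batches.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    reads = [40, 40, 40]
    print(len(presplit(reads, staging_bytes=100, usable_ratio=0.9)))
    print(len(presplit(reads, staging_bytes=100, usable_ratio=0.5)))
```

With the same reads and buffer, the lower ratio yields more batches, which corresponds to the extra round trips mentioned in the table.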
- load_async (bool): Enable asynchronous loading for better compute-I/O overlap. Default: true.
- enable_cross_layers_blocks (bool): Enable cross-layer block packing for reduced store operations. Default: false.
- lookup_rpc_port (int): Custom port for the ZMQ lookup RPC socket. Default: 0.

The MooncakeStoreConnector relies on consistent block hashes across all vLLM processes sharing the distributed store. Because Python randomizes its hash seed per process by default, identical prompts can produce different block hashes on different processes, which prevents cross-process prefix cache hits.
Set a fixed PYTHONHASHSEED on every instance that shares the store (DP ranks, separate prefiller/decoder nodes, and any other vLLM process pointed at the same Mooncake store):
PYTHONHASHSEED=0 vllm serve ...
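The effect of per-process hash randomization can be observed with plain Python, independent of vLLM. The sketch below starts fresh interpreters and compares the built-in hash() of the same string; hash_in_fresh_process is a hypothetical helper, and it demonstrates only Python's str hashing, not how vLLM computes block hashes.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, seed=None):
    """Return hash(text) as computed by a newly started Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from the default (randomized) state
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # Without a fixed seed the two values will almost always differ.
    print(hash_in_fresh_process("same prompt"), hash_in_fresh_process("same prompt"))
    # With PYTHONHASHSEED=0 both processes agree.
    print(hash_in_fresh_process("same prompt", 0), hash_in_fresh_process("same prompt", 0))
```

Any fixed value works as long as every process sharing the store uses the same one; 0 is the value used in this guide.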