Back to Vllm

MooncakeStoreConnector Usage Guide

docs/features/mooncake_store_connector_usage.md

0.21.05.5 KB
Original Source

MooncakeStoreConnector Usage Guide

MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector which does direct point-to-point KV transfer between prefiller and decoder, MooncakeStoreConnector enables KV cache offloading to an external distributed store, supporting:

  • CPU offloading: Extend effective KV cache capacity by offloading to CPU memory via Mooncake's transfer engine.
  • Prefix caching across instances: Hash-based deduplication allows multiple vLLM instances to share cached KV blocks through the store.
  • Single-node and multi-node deployment: Works both as a standalone KV cache extension and in disaggregated prefill-decode setups.

Prerequisites

Install Mooncake

Install mooncake through pip:

bash
uv pip install mooncake-transfer-engine

Refer to the Mooncake official repository for more installation instructions and building from source.

Start the Mooncake Master Server

The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:

bash
mooncake_master --port 50051

Default ports:

  • RPC: 50051

Multiple vLLM instances can share the same master server.

Configure Mooncake

Create a JSON configuration file (e.g., mooncake_config.json):

json
{
  "metadata_server": "P2PHANDSHAKE",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": "80GB",
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": ""
}
  • protocol: Use "rdma" for best performance. "tcp" works as a fallback.
  • global_segment_size: CPU memory contributed to the distributed pool (per GPU).
  • local_buffer_size: Private buffer for this node's own operations (per GPU).

Set the config path via environment variable:

bash
export MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json

Usage

Single-Node KV Cache Offloading

Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:

bash
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

Disaggregated Prefill-Decode (XpYd)

In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store. Prefiller Node:

bash
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8100 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_producer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_producer"
                }
            ]
        }
    }'

Decoder Node:

bash
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8200 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_consumer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_consumer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_consumer"
                }
            ]
        }
    }'

Proxy:

A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.

Environment Variables

VariableDescriptionDefault
MOONCAKE_CONFIG_PATHPath to Mooncake JSON config file(required)
VLLM_MOONCAKE_BOOTSTRAP_PORTBootstrap port for MooncakeConnector P2P transfer (disagg mode only)8998

KV Transfer Config

KV Role Options

  • kv_producer: For prefiller instances that store KV caches to the pool.
  • kv_consumer: For decoder instances that load KV caches from the pool.
  • kv_both: The instance both stores and loads KV caches. Use this for single-node CPU offloading.

kv_connector_extra_config

  • load_async (bool): Enable asynchronous loading for better compute-I/O overlap. Default: true.
  • enable_cross_layers_blocks (bool): Enable cross-layer block packing for reduced store operations. Default: false.
  • discard_partial_chunks (bool): Discard partial block chunks during store. Default: true.
  • lookup_rpc_port (int): Custom port for the ZMQ lookup RPC socket. Default: 0.

Notes

Cross-DP Prefix Cache Hits

When running with data parallelism, set a fixed PYTHONHASHSEED so that block hashes are consistent across DP ranks:

bash
PYTHONHASHSEED=0 vllm serve ...

Without this, identical prompts may produce different block hashes on different DP ranks, preventing cross-instance prefix cache hits.