LMCache Examples

This folder demonstrates how to use LMCache with vLLM v1 for KV cache offloading, disaggregated prefilling, and KV cache sharing.

Integration modes

LMCache integrates with vLLM v1 in two ways:

In-process mode (LMCacheConnectorV1): LMCache runs inside the vLLM process and is configured through environment variables or a YAML config file (LMCACHE_CONFIG_FILE). This is the simplest way to add single-node CPU/disk offloading.
Multi-process (MP) mode (LMCacheMPConnector): LMCache runs as a standalone server (lmcache server) that owns the KV cache storage; one or more vLLM instances connect to it. This is the recommended mode for distributed KV storage and for sharing KV cache across instances. See the LMCache docs for the full MP setup.
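The sketch below shows the two pieces of configuration the modes above refer to. The first is the connector name that selects the mode in vLLM's `--kv-transfer-config` JSON. The second, for in-process mode, is a YAML file exposed through LMCACHE_CONFIG_FILE. It is a minimal sketch under stated assumptions. The YAML keys (chunk_size, local_cpu, max_local_cpu_size) and their values are illustrative and should be checked against the LMCache configuration reference for your version. The file name lmcache-cpu-offload.yaml is hypothetical. MP mode will typically also need the lmcache server's address, and that setting is not shown here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

# Connector names come from this README. "kv_both" is the vLLM KV-transfer
# role for an instance that both stores and loads KV cache.
CONNECTORS = {"in-process": "LMCacheConnectorV1", "mp": "LMCacheMPConnector"}

# Assumed LMCache config keys for CPU offloading (max_local_cpu_size in GB);
# verify them against the LMCache configuration reference.
CPU_OFFLOAD_YAML = "chunk_size: 256\nlocal_cpu: true\nmax_local_cpu_size: 5.0\n"


def kv_transfer_config(mode: str, kv_role: str = "kv_both") -> str:
    """Return the JSON string to pass to `vllm serve --kv-transfer-config`."""
    if mode not in CONNECTORS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(CONNECTORS)}")
    return json.dumps({"kv_connector": CONNECTORS[mode], "kv_role": kv_role})


def in_process_env(config_dir: str) -> Dict[str, str]:
    """Write an LMCache YAML config and return the env var that points to it."""
    # Hypothetical file name; any path readable by the vLLM process works.
    path = Path(config_dir) / "lmcache-cpu-offload.yaml"
    path.write_text(CPU_OFFLOAD_YAML)
    return {"LMCACHE_CONFIG_FILE": str(path)}


if __name__ == "__main__":
    config_dir = tempfile.mkdtemp()  # persists after the script exits
    print(in_process_env(config_dir))
    print(kv_transfer_config("in-process"))
    print(kv_transfer_config("mp"))
```

Export the returned variable in the environment of the vLLM process, and pass the JSON string to `--kv-transfer-config`.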

CPU offloading

python cpu_offload_lmcache.py - CPU offloading with LMCacheConnectorV1 (in-process mode) for vLLM v1.

bash cpu_offload_lmcache_mp.sh - CPU offloading with LMCacheMPConnector, using a standalone lmcache server. vLLM provides a built-in shortcut for this setup via --kv-offloading-backend lmcache and --kv-offloading-size <GiB>.
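The built-in shortcut flags named above can be assembled into a `vllm serve` command as follows. This is a minimal sketch. The model name is a placeholder, and the hypothetical helper lmcache_offload_command only builds the argument list. It does not start vLLM or an lmcache server.

```python
import shlex
from typing import List


def lmcache_offload_command(model: str, offload_gib: float) -> List[str]:
    """Build a `vllm serve` argv that uses vLLM's LMCache offloading shortcut."""
    if offload_gib <= 0:
        raise ValueError("offload size must be a positive number of GiB")
    return [
        "vllm", "serve", model,
        "--kv-offloading-backend", "lmcache",
        # Size of the KV offloading space, in GiB.
        "--kv-offloading-size", str(offload_gib),
    ]


if __name__ == "__main__":
    # Placeholder model name; substitute the model you actually serve.
    argv = lmcache_offload_command("meta-llama/Llama-3.1-8B-Instruct", 20)
    print(shlex.join(argv))
    # subprocess.run(argv) would launch the server with these arguments.
```

Building the command as a list lets it go straight to subprocess.run without shell quoting. shlex.join produces a copy-pasteable form for a terminal.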

Disaggregated prefill

This example demonstrates how to run LMCache with disaggregated prefill using NIXL on a single node.

Change into the disagg_prefill_lmcache_v1 folder and run the main script, which launches disaggregated prefill and benchmarks its performance:

cd disagg_prefill_lmcache_v1
bash disagg_example_nixl.sh

disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh - Launches the individual vLLM servers for prefill and decode, and also launches the proxy server.
disagg_prefill_lmcache_v1/disagg_proxy_server.py - FastAPI proxy server that coordinates between the prefiller and the decoder.
disagg_prefill_lmcache_v1/disagg_example_nixl.sh - Main script that runs the example.
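The coordination done by disagg_proxy_server.py can be pictured as a two-step route per request. The sketch below is a simplified, framework-free model of that flow. It assumes the common disaggregated-prefill pattern: the proxy first sends the request to the prefiller with max_tokens set to 1, then forwards the original request to the decoder. The prefill and decode callables are hypothetical stand-ins for the HTTP calls the real FastAPI server makes. HTTP handling, errors, and timeouts are omitted.

```python
from typing import Callable, Dict, Iterator

Request = Dict[str, object]
# Hypothetical backends: `prefill` returns once the prefiller has computed the
# prompt's KV cache; `decode` yields generated text chunks from the decoder.
PrefillFn = Callable[[Request], None]
DecodeFn = Callable[[Request], Iterator[str]]


def proxy_completion(request: Request, prefill: PrefillFn, decode: DecodeFn) -> Iterator[str]:
    """Route one completion request through the prefiller, then the decoder."""
    # Step 1: a copy of the request capped at one output token makes the
    # prefiller compute the prompt's KV cache, which LMCache hands off to the
    # decoder (over NIXL in this example). The caller's dict is not modified.
    prefill(dict(request, max_tokens=1))
    # Step 2: the decoder receives the unmodified request, is expected to reuse
    # the transferred KV cache instead of recomputing the prompt, and streams.
    yield from decode(request)


if __name__ == "__main__":
    def fake_prefill(req: Request) -> None:
        print("prefill with max_tokens =", req["max_tokens"])

    def fake_decode(req: Request) -> Iterator[str]:
        print("decode with max_tokens =", req["max_tokens"])
        yield from ["Hello", ", world"]

    request = {"prompt": "Hi", "max_tokens": 32}
    print("".join(proxy_completion(request, fake_prefill, fake_decode)))
```

Because proxy_completion is a generator, the prefill call runs only when the caller starts iterating. This fits a streaming HTTP response that begins once the prefiller has finished.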

disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml - LMCache configuration for the prefiller server.
disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml - LMCache configuration for the decoder server.

The main script generates several log files:

KV cache sharing

The kv_cache_sharing_lmcache_v1.py example demonstrates how to share KV caches between vLLM v1 instances through a centralized LMCache server.
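A toy model makes the sharing idea concrete. The sketch below is not LMCache's API or storage format. It assumes a prompt is split into fixed-size token chunks, and that each chunk is keyed by a hash of the entire prefix up to and including it. Under that assumption, a second instance that sends the same prefix finds those chunks already in the central store and only computes the rest. SharedKVStore, Instance, chunk_keys, and the chunk size of 4 are illustrative names and values. Placeholder strings stand in for KV tensors.

```python
import hashlib
from typing import Dict, List, Optional, Sequence, Tuple

CHUNK_SIZE = 4  # tiny for illustration; real deployments use larger chunks


def chunk_keys(tokens: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Key each full chunk by a hash of the whole prefix ending at that chunk."""
    keys = []
    running = hashlib.sha256()
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        running.update(repr(list(tokens[start:start + chunk_size])).encode())
        keys.append(running.copy().hexdigest())
    return keys  # a trailing partial chunk is not cached in this toy


class SharedKVStore:
    """Stands in for the centralized LMCache server: chunk key -> KV data."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, kv: str) -> None:
        self._data[key] = kv

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class Instance:
    """A toy 'vLLM instance' that reuses prefix chunks already in the store."""

    def __init__(self, name: str, store: SharedKVStore) -> None:
        self.name = name
        self.store = store

    def prefill(self, tokens: Sequence[int]) -> Tuple[int, int]:
        """Return (chunks reused from the shared store, chunks computed here)."""
        keys = chunk_keys(tokens)
        reused = 0
        # Reuse the longest run of leading chunks already in the store.
        while reused < len(keys) and self.store.get(keys[reused]) is not None:
            reused += 1
        # "Compute" the remaining chunks and publish them for other instances.
        for i in range(reused, len(keys)):
            self.store.put(keys[i], f"kv[{self.name}:{i}]")
        return reused, len(keys) - reused


if __name__ == "__main__":
    store = SharedKVStore()
    a, b = Instance("A", store), Instance("B", store)
    system_prompt = list(range(8))  # two chunks shared by both requests
    print("A:", a.prefill(system_prompt + [100, 101, 102, 103]))  # A: (0, 3)
    print("B:", b.prefill(system_prompt + [200, 201, 202, 203]))  # B: (2, 1)
```

Each chunk is keyed by the whole prefix rather than by its own tokens, so a chunk is reused only when everything before it also matches. KV values for a token depend on all preceding tokens, so this matching rule is required for reuse to be correct.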