(prefix-aware-routing-guide)=
# Prefix-aware request routing

Optimize LLM inference with cache locality using prefix-aware request routing.
:::{warning}
This API is in alpha and may change before becoming stable.
:::
LLM inference can benefit significantly from cache locality optimization. When one replica processes multiple prompts that share a prefix, the engine can reuse previously computed KV-cache entries, reducing computation overhead and improving response times. This technique is known as Automatic Prefix Caching (APC) in vLLM.
The `PrefixCacheAffinityRouter` routes requests with similar prefixes to the same replicas, maximizing KV cache hit rates.
Use prefix-aware routing when:

- Many of your requests share a common prompt prefix, such as a shared system prompt or few-shot examples.
- Your engine has Automatic Prefix Caching enabled (`enable_prefix_caching=True` in `engine_kwargs`).
- You want to maximize KV cache hit rates without sacrificing load balancing.
## How it works

The `PrefixCacheAffinityRouter` implements a multi-tier routing strategy that balances cache locality with load distribution:
First, it evaluates whether the current load is balanced across replicas by comparing queue lengths. If the difference between the highest and lowest queue lengths is below the `imbalanced_threshold`, it proceeds with prefix cache-aware routing.
When load is balanced, the router uses a prefix tree to find the replicas that have previously processed similar input text, and routes the request to the replica with the strongest prefix match.
When load is imbalanced (queue length difference exceeds threshold), the router prioritizes load balancing over cache locality and falls back to the standard Power of Two Choices algorithm.
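The three-tier decision above can be sketched in a few lines. This is a simplified stand-in for illustration, not Ray's implementation; the function name and its inputs are hypothetical:

```python
import random

def choose_replica(queue_lens, match_rates, imbalanced_threshold, match_rate_threshold):
    """Pick a replica using the multi-tier strategy described above.

    queue_lens: dict of replica_id -> current queue length.
    match_rates: dict of replica_id -> prefix match rate (0.0-1.0) for this request.
    """
    # Tier 1: only consider cache affinity when queues are roughly balanced.
    spread = max(queue_lens.values()) - min(queue_lens.values())
    if spread < imbalanced_threshold:
        # Tier 2: route to the best prefix match if it is strong enough.
        best_replica, best_rate = max(match_rates.items(), key=lambda kv: kv[1])
        if best_rate >= match_rate_threshold:
            return best_replica
    # Tier 3: fall back to Power of Two Choices - sample two replicas
    # and pick the one with the shorter queue.
    a, b = random.sample(list(queue_lens), 2)
    return a if queue_lens[a] <= queue_lens[b] else b
```

With balanced queues, a strong match wins; with imbalanced queues, the shorter queue wins regardless of match rate.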
The router maintains a distributed prefix tree actor that tracks which prompt prefixes each replica has processed, so that match rates can be computed for incoming requests and shared across router instances.
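A toy stand-in for that bookkeeping is sketched below. It matches character-level common prefixes against remembered prompts; the real component uses a prefix tree and runs as a distributed Ray actor, and the class and method names here are illustrative:

```python
from collections import defaultdict

class SimplePrefixStore:
    """Toy stand-in for the distributed prefix tree actor: remembers the
    prompts each replica has served and reports, per replica, what fraction
    of a new prompt's characters match a previously seen prefix."""

    def __init__(self):
        self._texts = defaultdict(list)  # replica_id -> list of past prompts

    def insert(self, replica_id, text):
        self._texts[replica_id].append(text)

    @staticmethod
    def _common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def match_rates(self, text):
        """Return replica_id -> prefix match rate (0.0-1.0) for this prompt."""
        rates = {}
        for replica_id, texts in self._texts.items():
            best = max((self._common_prefix_len(t, text) for t in texts), default=0)
            rates[replica_id] = best / len(text) if text else 0.0
        return rates
```

A replica that previously served a prompt with the same system prefix scores close to 1.0; unrelated replicas score near 0.0.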
## Example

The following example shows how to deploy an LLM with prefix-aware routing:
:::{literalinclude}
:start-after: __prefix_aware_example_start__
:end-before: __prefix_aware_example_end__
:language: python
:::
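As a minimal configuration sketch, the fragment below shows where `enable_prefix_caching` belongs. The model id and source are placeholders, and the exact mechanism for attaching `PrefixCacheAffinityRouter` to the deployment is part of the alpha API, so refer to the example file above for the current wiring:

```python
# Sketch, not a drop-in script: model id/source are placeholders, and running
# it requires a Ray cluster with resources for the chosen model.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-model",                        # placeholder
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder
    ),
    # Prefix-aware routing only helps when the engine caches prefixes.
    engine_kwargs=dict(enable_prefix_caching=True),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```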
## Configuration parameters

The `PrefixCacheAffinityRouter` provides several configuration parameters to tune its behavior:
- `imbalanced_threshold` (default: infinity): Queue length difference threshold for considering load balanced. Lower values prioritize load balancing over cache locality.
- `match_rate_threshold` (default: 0.1): Minimum prefix match rate (0.0-1.0) required to use prefix cache-aware routing. Higher values require stronger prefix matches before routing for cache locality.
- `do_eviction` (default: `False`): Enable automatic eviction of old prefix tree entries to approximate the LLM engine's eviction policy.
- `eviction_threshold_chars` (default: 400,000): Maximum number of characters in the prefix tree before the router triggers an eviction.
- `eviction_target_chars` (default: 360,000): Target number of characters to reduce the prefix tree to during eviction.
- `eviction_interval_secs` (default: 10): Interval in seconds between eviction checks when eviction is enabled.
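The eviction parameters interact as sketched below. This is a simplification: real eviction operates on prefix tree nodes, while here each entry is a flat string tracked in insertion order, and the function name is hypothetical:

```python
def evict_if_needed(entries, threshold_chars=400_000, target_chars=360_000):
    """Character-budget eviction: when the tree holds more than
    `threshold_chars` characters, drop the oldest entries until the
    total is at or below `target_chars`.

    In the router this check runs every `eviction_interval_secs` seconds
    when `do_eviction=True`.

    entries: list of (insertion_order, text) tuples, oldest first.
    """
    total = sum(len(text) for _, text in entries)
    if total <= threshold_chars:
        return entries  # under budget, nothing to do
    while entries and total > target_chars:
        _, text = entries.pop(0)  # evict oldest first
        total -= len(text)
    return entries
```

Evicting down to `eviction_target_chars` rather than just below `eviction_threshold_chars` leaves headroom, so eviction doesn't run on every check.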
Set `enable_prefix_caching=True` in your `engine_kwargs` for the router to have any effect, and tune `imbalanced_threshold` and `match_rate_threshold` based on your workload characteristics.

For more details, see {doc}`Architecture: Request routing <../architecture/routing-policies>`.