backend/onyx/llm/prompt_cache/README.md
A prompt-caching framework that reduces LLM costs across multiple providers by leveraging provider-side prompt token caching.
The prompt caching framework provides a unified interface for enabling prompt caching across different LLM providers. It supports both implicit caching (automatic provider-side caching) and explicit caching (with cache control parameters).
`str` and `Sequence[ChatCompletionMessage]` inputs

from onyx.llm.prompt_cache import process_with_prompt_cache
from onyx.llm.models import SystemMessage, UserMessage
# Assume you have an LLM instance with a config property
# llm = get_your_llm_instance()
# Define cacheable prefix (static context) using Pydantic message models
cacheable_prefix = [
    SystemMessage(role="system", content="You are a helpful assistant."),
    UserMessage(role="user", content="Context: ..."),  # Static context
]
# Define suffix (dynamic user input)
suffix = [UserMessage(role="user", content="What is the weather?")]
# Process with caching - pass llm_config, not the llm instance
processed_prompt, cache_metadata = process_with_prompt_cache(
    llm_config=llm.config,
    cacheable_prefix=cacheable_prefix,
    suffix=suffix,
    continuation=False,
)
# Make LLM call with processed prompt
response = llm.invoke(processed_prompt)
# Both prefix and suffix can be strings
cacheable_prefix = "You are a helpful assistant. Context: ..."
suffix = "What is the weather?"
processed_prompt, cache_metadata = process_with_prompt_cache(
    llm_config=llm.config,
    cacheable_prefix=cacheable_prefix,
    suffix=suffix,
    continuation=False,
)
response = llm.invoke(processed_prompt)
When `continuation=True`, the suffix is appended to the last message of the cacheable prefix:
# Without continuation (default)
# Result: [system_msg, prefix_user_msg, suffix_user_msg]
# With continuation=True
# Result: [system_msg, prefix_user_msg + suffix_user_msg]
processed_prompt, _ = process_with_prompt_cache(
    llm_config=llm.config,
    cacheable_prefix=cacheable_prefix,
    suffix=suffix,
    continuation=True,  # Merge suffix into last prefix message
)
Note: If `cacheable_prefix` is a string, it remains in its own content block even when `continuation=True`.
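
The following is a minimal sketch of one plausible reading of the `continuation` flag, not the framework's actual implementation. Messages are modeled as plain dicts, and `combine_prefix_and_suffix` and `as_blocks` are hypothetical helpers introduced only for illustration. The sketch assumes that merging keeps the prefix text in its own content block, in line with the note above.

```python
from typing import Any

Message = dict[str, Any]


def as_blocks(content: Any) -> list[dict[str, Any]]:
    """Normalize string content to a one-element list of text blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return list(content)


def combine_prefix_and_suffix(
    prefix: list[Message], suffix: list[Message], continuation: bool = False
) -> list[Message]:
    """Hypothetical helper illustrating the two `continuation` modes."""
    if not continuation or not prefix or not suffix:
        # Default: suffix messages simply follow the prefix messages.
        return [*prefix, *suffix]

    # continuation=True: fold the first suffix message into the last prefix
    # message, keeping the prefix content in its own block (assumption).
    last, first = prefix[-1], suffix[0]
    merged = {
        **last,
        "content": as_blocks(last["content"]) + as_blocks(first["content"]),
    }
    return [*prefix[:-1], merged, *suffix[1:]]
```

Keeping the prefix in a separate block means the cached portion of the merged message stays byte-identical across requests, which is what provider-side prefix caching relies on.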
- Anthropic (explicit caching, `cache_control` parameter): adds `cache_control={"type": "ephemeral"}` to the last message of the cacheable prefix.
- Vertex AI (explicit caching, `cache_control` parameter): adds `cache_control={"type": "ephemeral"}` to all content blocks in cacheable messages. String content is converted to array format with the cache control attached.

`ENABLE_PROMPT_CACHING`: Enable/disable prompt caching (default: true)
export ENABLE_PROMPT_CACHING=false # Disable caching
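
As a rough illustration of how such a flag might be read at runtime, the sketch below parses `ENABLE_PROMPT_CACHING` with a default of true. The function name `prompt_caching_enabled` and the set of accepted "false" spellings are assumptions, not the framework's actual parsing rules.

```python
import os
from typing import Mapping, Optional


def prompt_caching_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical reader for ENABLE_PROMPT_CACHING (defaults to enabled)."""
    env = os.environ if environ is None else environ
    raw = env.get("ENABLE_PROMPT_CACHING", "true")
    # Assumption: only these spellings disable caching; anything else enables it.
    return raw.strip().lower() not in {"false", "0", "no", "off"}
```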
- `processor.py`: Main entry point (`process_with_prompt_cache`)
- `cache_manager.py`: Cache metadata storage and retrieval
- `models.py`: Pydantic models for cache metadata (`CacheMetadata`)
- `providers/`: Provider-specific adapters
- `utils.py`: Shared utility functions

Each provider has its own adapter in `providers/`:
| File | Class | Description |
|---|---|---|
| `base.py` | `PromptCacheProvider` | Abstract base class for all providers |
| `openai.py` | `OpenAIPromptCacheProvider` | Implicit caching (no transformation) |
| `anthropic.py` | `AnthropicPromptCacheProvider` | Explicit caching with `cache_control` on last message |
| `vertex.py` | `VertexAIPromptCacheProvider` | Explicit caching with `cache_control` on all content blocks |
| `noop.py` | `NoOpPromptCacheProvider` | Fallback for unsupported providers |
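
To make the difference between the two explicit strategies in the table concrete, here is a small sketch using plain dict messages. The helper names `mark_last_message` and `mark_all_blocks` are hypothetical, and placing `cache_control` as a message-level key in the first helper is an assumption; the real adapters may structure the payload differently.

```python
import copy
from typing import Any

Message = dict[str, Any]
EPHEMERAL = {"type": "ephemeral"}


def mark_last_message(prefix: list[Message]) -> list[Message]:
    """'Last message' strategy: tag only the final cacheable message."""
    marked = copy.deepcopy(prefix)  # never mutate the caller's messages
    if marked:
        marked[-1]["cache_control"] = dict(EPHEMERAL)
    return marked


def mark_all_blocks(prefix: list[Message]) -> list[Message]:
    """'All content blocks' strategy: tag every block of every message."""
    marked = copy.deepcopy(prefix)
    for message in marked:
        content = message["content"]
        if isinstance(content, str):
            # String content is converted to array format first.
            content = [{"type": "text", "text": content}]
        message["content"] = [
            {**block, "cache_control": dict(EPHEMERAL)} for block in content
        ]
    return marked
```

Both helpers copy their input, so the same prefix list can be reused across requests without accumulating cache markers.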
Each adapter implements:
- `supports_caching()`: Whether caching is supported
- `prepare_messages_for_caching()`: Transform messages for caching
- `extract_cache_metadata()`: Extract metadata from responses
- `get_cache_ttl_seconds()`: Cache TTL

- Cache Static Content: Use cacheable prefix for system prompts, static context, and instructions that don't change between requests.
- Keep Dynamic Content in Suffix: User queries, search results, and other dynamic content should be in the suffix.
- Monitor Cache Effectiveness: Check logs for cache hits/misses and adjust your caching strategy accordingly.
- Provider Selection: Different providers have different caching characteristics - choose based on your use case.
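
For readers writing a new adapter, the four adapter methods listed earlier could be sketched as an abstract base class like the one below. The class names carry a `Sketch` suffix to make clear they are illustrative. The parameter lists, return types, and the no-op TTL of `0` are all assumptions, since only the method names are documented here.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional

Message = dict[str, Any]


class PromptCacheProviderSketch(ABC):
    """Illustrative interface; signatures are assumed, not the real ones."""

    @abstractmethod
    def supports_caching(self) -> bool: ...

    @abstractmethod
    def prepare_messages_for_caching(
        self, prefix: list[Message], suffix: list[Message], continuation: bool = False
    ) -> list[Message]: ...

    @abstractmethod
    def extract_cache_metadata(self, response: Any) -> Optional[dict[str, Any]]: ...

    @abstractmethod
    def get_cache_ttl_seconds(self) -> int: ...


class NoOpSketch(PromptCacheProviderSketch):
    """Fallback adapter: passes messages through untouched."""

    def supports_caching(self) -> bool:
        return False

    def prepare_messages_for_caching(
        self, prefix: list[Message], suffix: list[Message], continuation: bool = False
    ) -> list[Message]:
        return [*prefix, *suffix]

    def extract_cache_metadata(self, response: Any) -> Optional[dict[str, Any]]:
        return None

    def get_cache_ttl_seconds(self) -> int:
        return 0  # assumption: nothing is cached, so no TTL applies
```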
The framework is best-effort: if caching fails, it falls back gracefully to non-cached behavior.
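
A minimal sketch of what best-effort behavior can look like, assuming the provider-specific transformation is passed in as a callable. `prepare_best_effort` and its `transform` parameter are hypothetical names, and the real framework's error handling and log messages may differ.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)
Message = dict[str, Any]
Transform = Callable[[list[Message], list[Message]], list[Message]]


def prepare_best_effort(
    prefix: list[Message], suffix: list[Message], transform: Transform
) -> list[Message]:
    """Try the caching transform; on any failure, return the plain prompt."""
    try:
        return transform(prefix, suffix)
    except Exception:
        # Caching is an optimization, so a failure here must not block the call.
        logger.warning("Prompt caching failed; sending uncached prompt", exc_info=True)
        return [*prefix, *suffix]
```

The key design point is that the fallback produces the same prompt the caller would have sent without caching, so a caching failure costs money but never correctness.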
See backend/tests/external_dependency_unit/llm/test_prompt_caching.py for detailed integration test examples.