docs/gateway/guides/inference-caching.mdx
The TensorZero Gateway supports caching of inference responses to improve latency and reduce costs. When caching is enabled, identical requests will be served from the cache instead of being sent to the model provider, resulting in faster response times and lower token usage.
The TensorZero Gateway supports the following cache modes:
- `off` (default): Disable caching completely
- `on`: Both read from and write to the cache
- `write_only`: Only write to the cache but don't serve cached responses
- `read_only`: Only read from the cache but don't write new entries

You can also optionally specify a maximum age (in seconds) for cache entries on inference reads. This parameter is ignored for inference writes.
See the API Reference for more details.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")

response = client.chat.completions.create(
    model="tensorzero::model_name::openai::gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of Japan?",
        }
    ],
    extra_body={
        "tensorzero::cache_options": {
            "enabled": "on",  # read and write to the cache
            "max_age_s": 3600,  # optional: entries older than 1 hour (3600s) are disregarded for reads
        },
    },
)

print(response)
```
If ClickHouse is the primary data store for the gateway, we store cache data in ClickHouse.
If Postgres is configured to be the primary data store for the gateway, and Valkey is available (i.e. TENSORZERO_VALKEY_URL is set), we store cache data in Valkey.
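For example, you might point the gateway at a Valkey deployment via that environment variable (the URL below is a placeholder for illustration):

```shell
# Placeholder URL: substitute your own Valkey host and port
export TENSORZERO_VALKEY_URL="redis://valkey:6379"
```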
Valkey cache entries have a configurable TTL (time-to-live) that defaults to 24 hours (86400 seconds).
You can change this in tensorzero.toml:
```toml
[gateway.cache.valkey]
ttl_s = 86400 # 24 hours (default)
```
If you use a single Valkey instance for both rate limiting and caching, we recommend keeping the cache TTL under 48 hours.
Rate limiting keys have a minimum TTL of 48 hours, so under memory pressure the `volatile-ttl` eviction policy will evict cache entries before rate limiting keys.
See Deploy Valkey / Redis for more details on eviction policies.
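As a sketch, a standalone Valkey instance configured for this setup might use the following directives in `valkey.conf` (the memory limit is illustrative; `volatile-ttl` evicts the keys with the shortest remaining TTL first among keys that have a TTL set):

```
maxmemory 2gb
maxmemory-policy volatile-ttl
```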
The `max_age_s` parameter applies only to the retrieval of cached responses.
When using ClickHouse, old entries are not automatically deleted.
When using Valkey, entries expire according to the configured TTL (`cache.valkey.ttl_s`), which defaults to 24 hours.

This guide focuses on caching performed by TensorZero.
Separately, many model providers support some form of caching themselves. Some of these are enabled automatically (e.g. OpenAI), whereas others require manual configuration (e.g. Anthropic).
See the guides for Anthropic and AWS Bedrock to learn more about enabling prompt caching at the model provider level.