docs/reference/prompt-caching.md
Prompt caching means the model provider can reuse unchanged prompt prefixes (usually system/developer instructions and other stable context) across turns instead of re-processing them every time. OpenClaw normalizes provider usage into cacheRead and cacheWrite where the upstream API exposes those counters directly.
Status surfaces can also recover cache counters from the most recent transcript
usage log when the live session snapshot is missing them, so /status can keep
showing a cache line after partial session metadata loss. Existing nonzero live
cache values still take precedence over transcript fallback values.
Why this matters: lower token cost, faster responses, and more predictable performance for long-running sessions. Without caching, repeated prompts pay the full prompt cost on every turn even when most input did not change.
The sections below cover every cache-related knob that affects prompt reuse and token cost.
Provider references:
## `cacheRetention` (global default, model, and per-agent)

Set cache retention as a global default for all models:
```yaml
agents:
  defaults:
    params:
      cacheRetention: "long" # none | short | long
```
Override per-model:
```yaml
agents:
  defaults:
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "short" # none | short | long
```
Per-agent override:
```yaml
agents:
  list:
    - id: "alerts"
      params:
        cacheRetention: "none"
```
Config merge order:
1. `agents.defaults.params` (global default, applies to all models)
2. `agents.defaults.models["provider/model"].params` (per-model override)
3. `agents.list[].params` (matching agent id; overrides by key)
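As a rough illustration of that precedence, later levels override earlier ones key by key. The sketch below is illustrative TypeScript, not OpenClaw's merge implementation; the `resolveParams` helper and its types are hypothetical.

```ts
// Hypothetical helper illustrating the documented merge order; not OpenClaw source.
type Params = { cacheRetention?: "none" | "short" | "long" };

function resolveParams(
  globalDefaults: Params, // agents.defaults.params
  perModel?: Params,      // agents.defaults.models["provider/model"].params
  perAgent?: Params,      // agents.list[].params for the matching agent id
): Params {
  // Later spreads win key by key, matching the precedence listed above.
  return { ...globalDefaults, ...perModel, ...perAgent };
}

// "long" globally, "short" for the model, "none" on the agent: the agent value wins.
console.log(resolveParams({ cacheRetention: "long" }, { cacheRetention: "short" }, { cacheRetention: "none" }));
```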
## `contextPruning.mode: "cache-ttl"`

Prunes old tool-result context after cache TTL windows so post-idle requests do not re-cache oversized history.

```yaml
agents:
  defaults:
    contextPruning:
      mode: "cache-ttl"
      ttl: "1h"
```
See Session Pruning for full behavior.
## heartbeat

Heartbeat can keep cache windows warm and reduce repeated cache writes after idle gaps.
```yaml
agents:
  defaults:
    heartbeat:
      every: "55m"
```
Per-agent heartbeat is supported at agents.list[].heartbeat.
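When pairing heartbeat with a retention window, the interval should stay shorter than the cache TTL, otherwise the cache can expire between beats. A minimal sketch of that check; the `parseMinutes` helper is hypothetical, not an OpenClaw API.

```ts
// Hypothetical check: a 55m heartbeat keeps a 1h cache window warm; a 90m one would not.
function parseMinutes(v: string): number {
  const m = /^(\d+)(m|h)$/.exec(v.trim());
  if (!m) throw new Error(`unsupported duration: ${v}`);
  return m[2] === "h" ? Number(m[1]) * 60 : Number(m[1]);
}

const heartbeatEvery = "55m"; // agents.defaults.heartbeat.every
const cacheTtl = "1h";        // e.g. Anthropic cacheRetention: "long"

if (parseMinutes(heartbeatEvery) >= parseMinutes(cacheTtl)) {
  console.warn("heartbeat interval is not shorter than the cache TTL; the window can go cold");
}
```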
## Provider support

### Anthropic (direct API)

- `cacheRetention` is supported.
- OpenClaw defaults to `cacheRetention: "short"` for Anthropic model refs when unset.
- Usage reports `cache_read_input_tokens` and `cache_creation_input_tokens`, so OpenClaw can show both `cacheRead` and `cacheWrite`.
- `cacheRetention: "short"` maps to the default 5-minute ephemeral cache, and `cacheRetention: "long"` upgrades to the 1-hour TTL only on direct `api.anthropic.com` hosts.

### OpenAI and OpenAI-compatible

- OpenClaw sets `prompt_cache_key` to keep cache routing stable across turns and uses `prompt_cache_retention: "24h"` only when `cacheRetention: "long"` is selected on direct OpenAI hosts.
- OpenAI-compatible providers receive `prompt_cache_key` only when their model config explicitly sets `compat.supportsPromptCacheKey: true`; `cacheRetention: "none"` still suppresses it.
- Cache hits surface through `usage.prompt_tokens_details.cached_tokens` (or `input_tokens_details.cached_tokens` on Responses API events). OpenClaw maps that to `cacheRead`.
- `cacheWrite` stays 0 on OpenAI paths even when the provider is warming a cache.
- Responses include headers such as `x-request-id`, `openai-processing-ms`, and `x-ratelimit-*`, but cache-hit accounting should come from the usage payload, not from headers.
- Text-only repeats reach a 4864 cached-token plateau in current live probes, while tool-heavy or MCP-style transcripts often plateau near 4608 cached tokens even on exact repeats.

### Anthropic on Vertex AI

- Vertex Anthropic model refs (`anthropic-vertex/*`) support `cacheRetention` the same way as direct Anthropic.
- `cacheRetention: "long"` maps to the real 1-hour prompt-cache TTL on Vertex AI endpoints.
- Default retention for `anthropic-vertex` matches direct Anthropic defaults.

### Amazon Bedrock

- Bedrock Anthropic model refs (`amazon-bedrock/*` pointing at `anthropic.claude*`) support explicit `cacheRetention` pass-through.
- It can be disabled with `cacheRetention: "none"` at runtime.

### OpenRouter

For `openrouter/anthropic/*` model refs, OpenClaw injects Anthropic `cache_control` on system/developer prompt blocks to improve prompt-cache reuse only when the request is still targeting a verified OpenRouter route (`openrouter` on its default endpoint, or any provider/base URL that resolves to `openrouter.ai`).
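For reference, an injected Anthropic `cache_control` marker on a system prompt block looks roughly like the sketch below. The mapping helper is hypothetical; it only restates the retention mapping described earlier (5-minute ephemeral cache by default, 1-hour TTL for `"long"` on direct Anthropic hosts).

```ts
// Hypothetical mapping from cacheRetention to an Anthropic cache_control block.
type CacheRetention = "none" | "short" | "long";

function cacheControlFor(retention: CacheRetention, directAnthropicHost: boolean) {
  if (retention === "none") return undefined;
  if (retention === "long" && directAnthropicHost) {
    return { type: "ephemeral", ttl: "1h" }; // extended 1-hour TTL
  }
  return { type: "ephemeral" };              // default 5-minute ephemeral cache
}

// A system prompt block carrying the marker:
const systemBlock = {
  type: "text",
  text: "…stable system instructions…",
  cache_control: cacheControlFor("short", false),
};
```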
For openrouter/deepseek/*, openrouter/moonshot*/*, and openrouter/zai/*
model refs, contextPruning.mode: "cache-ttl" is allowed because OpenRouter
handles provider-side prompt caching automatically. OpenClaw does not inject
Anthropic cache_control markers into those requests.
DeepSeek cache construction is best-effort and can take a few seconds. An
immediate follow-up may still show cached_tokens: 0; verify with a repeated
same-prefix request after a short delay and use usage.prompt_tokens_details.cached_tokens
as the cache-hit signal.
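A quick way to check this from outside OpenClaw is to send the same long prefix twice and read the cached-token counter from the usage payload. The sketch below assumes an OpenRouter key in `OPENROUTER_API_KEY` and uses an illustrative model id; only `usage.prompt_tokens_details.cached_tokens` comes from this doc.

```ts
// Illustrative probe: repeat an identical long prefix and inspect cached_tokens.
const body = {
  model: "deepseek/deepseek-chat", // illustrative OpenRouter model id
  messages: [
    { role: "system", content: "…a long, stable system prompt (1024+ tokens)…" },
    { role: "user", content: "ping" },
  ],
};

async function cachedTokens(): Promise<number> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  return json.usage?.prompt_tokens_details?.cached_tokens ?? 0;
}

await cachedTokens();                            // first call warms the cache, often reports 0
await new Promise((r) => setTimeout(r, 5_000));  // cache construction can take a few seconds
console.log("cached_tokens on repeat:", await cachedTokens());
```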
If you repoint the model at an arbitrary OpenAI-compatible proxy URL, OpenClaw stops injecting those OpenRouter-specific Anthropic cache markers.
If the provider does not support this cache mode, cacheRetention has no effect.
### Google Gemini (direct)

- Direct Gemini (`api: "google-generative-ai"`) reports cache hits through upstream `cachedContentTokenCount`; OpenClaw maps that to `cacheRead`.
- When `cacheRetention` is set on a direct Gemini model, OpenClaw automatically creates, reuses, and refreshes `cachedContents` resources for system prompts on Google AI Studio runs. This means you no longer need to pre-create a cached-content handle manually.
- The cached-content handle lives at `params.cachedContent` (or legacy `params.cached_content`) on the configured model.
- Gemini caching works through the `cachedContents` resource rather than injecting cache markers into the request.

For providers whose usage arrives as a `stats` block:

- Cached tokens are reported as `stats.cached`; OpenClaw maps that to `cacheRead`.
- When there is no separate `stats.input` value, OpenClaw derives input tokens from `stats.input_tokens - stats.cached`.
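Putting the provider notes above together, the normalization into `cacheRead`/`cacheWrite` amounts to a per-provider field mapping. The sketch below is illustrative rather than OpenClaw's actual code; the provider keys and the loosely typed usage object are placeholders.

```ts
// Illustrative normalization of provider usage payloads into cacheRead/cacheWrite.
interface CacheUsage {
  cacheRead: number;
  cacheWrite: number;
}

function normalizeCacheUsage(
  provider: "anthropic" | "openai" | "gemini" | "stats",
  usage: any,
): CacheUsage {
  switch (provider) {
    case "anthropic":
      // Anthropic exposes both read and write counters.
      return {
        cacheRead: usage.cache_read_input_tokens ?? 0,
        cacheWrite: usage.cache_creation_input_tokens ?? 0,
      };
    case "openai":
      // Chat Completions and Responses API spell the cached-token field differently.
      return {
        cacheRead:
          usage.prompt_tokens_details?.cached_tokens ??
          usage.input_tokens_details?.cached_tokens ??
          0,
        cacheWrite: 0, // no cache-write counter is published
      };
    case "gemini":
      return { cacheRead: usage.cachedContentTokenCount ?? 0, cacheWrite: 0 };
    case "stats":
      // Providers that report a stats block: cached tokens count as reads.
      return { cacheRead: usage.stats?.cached ?? 0, cacheWrite: 0 };
  }
}
```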
## System prompt cache boundary

OpenClaw splits the system prompt into a stable prefix and a volatile suffix separated by an internal cache-prefix boundary. Content above the boundary (tool definitions, skills metadata, workspace files, and other relatively static context) is ordered so it stays byte-identical across turns. Content below the boundary (for example `HEARTBEAT.md`, runtime timestamps, and other per-turn metadata) is allowed to change without invalidating the cached prefix.
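A minimal sketch of that split, assuming a literal boundary marker string; the marker value and helper below are hypothetical, since OpenClaw's internal marker is not documented here.

```ts
// Hypothetical boundary marker and split helper; illustrates the prefix/suffix idea only.
const CACHE_PREFIX_BOUNDARY = "<!-- cache-prefix-boundary -->";

function splitSystemPrompt(systemPrompt: string): { stable: string; volatile: string } {
  const idx = systemPrompt.indexOf(CACHE_PREFIX_BOUNDARY);
  if (idx === -1) return { stable: systemPrompt, volatile: "" };
  return {
    stable: systemPrompt.slice(0, idx), // tool definitions, skills metadata, workspace files
    volatile: systemPrompt.slice(idx),  // HEARTBEAT.md, runtime timestamps, per-turn metadata
  };
}
```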
Key design choices:

- The volatile suffix carries `HEARTBEAT.md` so heartbeat churn does not bust the stable prefix.

If you see unexpected `cacheWrite` spikes after a config or workspace change, check whether the change lands above or below the cache boundary. Moving volatile content below the boundary (or stabilizing it) often resolves the issue.
OpenClaw also keeps several cache-sensitive payload shapes deterministic before the request reaches the provider:
- Tool definitions are serialized in a stable order so `listTools()` order changes do not churn the tools block and bust prompt-cache prefixes.
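For example, sorting tool definitions before serialization keeps the tools block byte-identical regardless of enumeration order. The sketch below is illustrative only; the `ToolDef` shape and helper are not OpenClaw APIs.

```ts
// Illustrative deterministic serialization of a tools block.
interface ToolDef {
  name: string;
  description: string;
  parameters: unknown;
}

function stableToolsBlock(tools: ToolDef[]): string {
  const sorted = [...tools].sort((a, b) => a.name.localeCompare(b.name));
  // Rebuilding the objects fixes key order, so the JSON is stable across runs.
  return JSON.stringify(
    sorted.map((t) => ({ name: t.name, description: t.description, parameters: t.parameters })),
  );
}
```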
Keep a long-lived baseline on your main agent and disable caching on bursty notifier agents:

```yaml
agents:
  defaults:
    model:
      primary: "anthropic/claude-opus-4-6"
    models:
      "anthropic/claude-opus-4-6":
        params:
          cacheRetention: "long"
  list:
    - id: "research"
      default: true
      heartbeat:
        every: "55m"
    - id: "alerts"
      params:
        cacheRetention: "none"
```
Related knobs for variants of this recipe: `cacheRetention: "short"` and `contextPruning.mode: "cache-ttl"` (see the sections above).

## Cache diagnostics

OpenClaw exposes dedicated cache-trace diagnostics for embedded agent runs.
For normal user-facing diagnostics, /status and other usage summaries can use
the latest transcript usage entry as a fallback source for cacheRead /
cacheWrite when the live session entry does not have those counters.
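The fallback rule reduces to: prefer nonzero live counters, otherwise use the latest transcript usage entry. A sketch with hypothetical types, not OpenClaw's real session or transcript shapes:

```ts
// Hypothetical types; illustrates the precedence described above.
interface CacheCounters {
  cacheRead?: number;
  cacheWrite?: number;
}

function cacheCountersForStatus(
  live: CacheCounters,
  latestTranscriptUsage?: CacheCounters,
): CacheCounters {
  const hasLive = (live.cacheRead ?? 0) > 0 || (live.cacheWrite ?? 0) > 0;
  if (hasLive) return live;             // nonzero live values take precedence
  return latestTranscriptUsage ?? live; // otherwise fall back to the transcript log
}
```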
## Live cache regression gate

OpenClaw keeps one combined live cache regression gate for repeated prefixes, tool turns, image turns, MCP-style tool transcripts, and an Anthropic no-cache control.

- Test: `src/agents/live-cache-regression.live.test.ts`
- Baseline: `src/agents/live-cache-regression-baseline.ts`

Run the narrow live gate with:

```bash
OPENCLAW_LIVE_TEST=1 OPENCLAW_LIVE_CACHE_TEST=1 pnpm test:live:cache
```
The baseline file stores the most recent observed live numbers plus the provider-specific regression floors used by the test. The runner also uses fresh per-run session IDs and prompt namespaces so previous cache state does not pollute the current regression sample.
These tests intentionally do not use identical success criteria across providers.
- Anthropic scenarios assert on `cacheWrite`.
- OpenAI scenarios assert on `cacheRead` only; `cacheWrite` remains 0.
- Regression floors for `gpt-5.4-mini`:
  - cacheRead >= 4608, hit rate >= 0.90
  - cacheRead >= 4096, hit rate >= 0.85
  - cacheRead >= 3840, hit rate >= 0.82
  - cacheRead >= 4096, hit rate >= 0.85

Fresh combined live verification on 2026-04-04 landed at:

- cacheRead=4864, hit rate 0.966
- cacheRead=4608, hit rate 0.896
- cacheRead=4864, hit rate 0.954
- cacheRead=4608, hit rate 0.891

Recent local wall-clock time for the combined gate was about 88s.
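The baseline file defines the exact hit-rate metric; one plausible reading is cached prompt tokens over total prompt tokens, which is what the hypothetical helper below computes. This is an assumption for illustration, not confirmed by this doc.

```ts
// Hypothetical hit-rate definition: cached prompt tokens / total prompt tokens.
function hitRate(cacheRead: number, promptTokens: number): number {
  return promptTokens > 0 ? cacheRead / promptTokens : 0;
}

// Under that definition, 4864 cached tokens out of roughly 5035 prompt tokens gives about 0.966.
```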
Why the assertions differ: Anthropic usage exposes both cache reads and cache writes while OpenAI-style usage reports only cached (read) tokens, and the cached-token plateau depends on the scenario, with text-only turns plateauing higher than tool-heavy or MCP-style transcripts.
## `diagnostics.cacheTrace` config

```yaml
diagnostics:
  cacheTrace:
    enabled: true
    filePath: "~/.openclaw/logs/cache-trace.jsonl" # optional
    includeMessages: false # default true
    includePrompt: false # default true
    includeSystem: false # default true
```
Defaults:
- `filePath`: `$OPENCLAW_STATE_DIR/logs/cache-trace.jsonl`
- `includeMessages: true`
- `includePrompt: true`
- `includeSystem: true`

Environment variable overrides:

- `OPENCLAW_CACHE_TRACE=1` enables cache tracing.
- `OPENCLAW_CACHE_TRACE_FILE=/path/to/cache-trace.jsonl` overrides the output path.
- `OPENCLAW_CACHE_TRACE_MESSAGES=0|1` toggles full message payload capture.
- `OPENCLAW_CACHE_TRACE_PROMPT=0|1` toggles prompt text capture.
- `OPENCLAW_CACHE_TRACE_SYSTEM=0|1` toggles system prompt capture.

Trace entries are recorded at `session:loaded`, `prompt:before`, `stream:context`, and `session:after`.

What to expect:

- Usage surfaces show `cacheRead` and `cacheWrite` (for example `/usage full` and session usage summaries).
- Anthropic shows both `cacheRead` and `cacheWrite` when caching is active.
- On OpenAI, expect `cacheRead` on cache hits and `cacheWrite` to remain 0; OpenAI does not publish a separate cache-write token field.

Troubleshooting:

- `cacheWrite` on most turns: check for volatile system-prompt inputs and verify the model/provider supports your cache settings.
- `cacheWrite` on Anthropic: often means the cache breakpoint is landing on content that changes every request.
- No `cacheRead`: verify the stable prefix is at the front, the repeated prefix is at least 1024 tokens, and the same `prompt_cache_key` is reused for turns that should share a cache.
- `cacheRetention` seems ignored: confirm the model key matches `agents.defaults.models["provider/model"]` and that the effective value is not `none`.

Related docs: