backend/core/agentpress/PROMPT_CACHING.md
AgentPress implements mathematically optimized prompt caching for Anthropic Claude models to achieve 70-90% cost and latency savings in long conversations. The system uses dynamic token-based thresholds that adapt to conversation length, context window size, and message density.
Optimal Threshold = Base × Stage × Context × Density
Where:

- Base = 2.5% of the context window
- Stage = conversation-length multiplier
- Context = context-window multiplier
- Density = token-density multiplier
| Stage | Messages | Multiplier | Strategy |
|---|---|---|---|
| Early | ≤20 | 0.3x | Aggressive caching for quick wins |
| Growing | 21-100 | 0.6x | Balanced approach |
| Mature | 101-500 | 1.0x | Larger chunks, preserve blocks |
| Very Long | 500+ | 1.8x | Conservative, maximum efficiency |
| Context Window | Multiplier | Example Models |
|---|---|---|
| 200k tokens | 1.0x | Claude 3.7 Sonnet |
| 500k tokens | 1.2x | GPT-4 variants |
| 1M tokens | 1.5x | Claude Sonnet 4 |
| 2M+ tokens | 2.0x | Gemini 2.5 Pro |
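Putting these together, here is a minimal sketch of the threshold calculation. The function name is hypothetical, and treating the context and density multipliers as pass-in values defaulting to 1.0 is an assumption for illustration, not the actual AgentPress code:

```python
def calculate_cache_threshold(
    context_window: int,
    message_count: int,
    context_multiplier: float = 1.0,
    density_multiplier: float = 1.0,
) -> int:
    """Optimal Threshold = Base × Stage × Context × Density (illustrative sketch)."""
    base = context_window * 0.025  # Base = 2.5% of the context window

    # Stage multiplier from the conversation-length table above.
    if message_count <= 20:
        stage = 0.3
    elif message_count <= 100:
        stage = 0.6
    elif message_count <= 500:
        stage = 1.0
    else:
        stage = 1.8

    return int(base * stage * context_multiplier * density_multiplier)
```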
| Model | Context | Early (≤20) | Growing (≤100) | Mature (≤500) | Very Long (500+) |
|---|---|---|---|---|---|
| Claude 3.7 | 200k | 1.5k tokens | 3k tokens | 5k tokens | 9k tokens |
| GPT-5 | 400k | 3k tokens | 6k tokens | 10k tokens | 18k tokens |
| Claude Sonnet 4 | 1M | 7.5k tokens | 15k tokens | 25k tokens | 45k tokens |
| Gemini 2.5 Pro | 2M | 15k tokens | 30k tokens | 50k tokens | 90k tokens |
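As a quick check against the table, the Claude 3.7 mature-stage entry falls straight out of the formula: 200,000 × 2.5% × 1.0 = 5,000 tokens.

```python
# Claude 3.7 Sonnet, ~200 messages, 200k context window:
calculate_cache_threshold(context_window=200_000, message_count=200)  # -> 5000
```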
Token counting uses LiteLLM's model-specific tokenizers:

```python
from litellm import token_counter

tokens = token_counter(model=model_name, text=content)
```
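For chunk-level decisions, the same counter can total tokens over whole messages. A minimal sketch, assuming a hypothetical helper name (`token_counter` also accepts a `messages` list):

```python
from litellm import token_counter

def count_message_tokens(model_name: str, messages: list[dict]) -> int:
    """Total tokens across chat messages (hypothetical helper, not the AgentPress API)."""
    # token_counter accepts either raw text or a list of {"role", "content"} messages.
    return token_counter(model=model_name, messages=messages)

total = count_message_tokens(
    "claude-3-7-sonnet-20250219",
    [{"role": "user", "content": "Summarize the last deployment."}],
)
```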
```mermaid
graph TD
    A[New Message] --> B{Anthropic Model?}
    B -->|No| C[Standard Processing]
    B -->|Yes| D[Get Context Window from Registry]
    D --> E[Calculate Optimal Threshold]
    E --> F[Count Existing Tokens]
    F --> G{Threshold Reached?}
    G -->|No| H[Add to Current Chunk]
    G -->|Yes| I[Create Cache Block]
    I --> J{Max Blocks Reached?}
    J -->|No| K[Continue Chunking]
    J -->|Yes| L[Add Remaining Uncached]
    H --> M[Send to LLM]
    K --> M
    L --> M
    C --> M
```
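A minimal sketch of the chunking loop in this flow (names and structure are illustrative assumptions, not the actual AgentPress implementation; Anthropic currently allows at most 4 cache breakpoints per request):

```python
from litellm import token_counter

MAX_CACHE_BLOCKS = 4  # Anthropic permits up to four cache breakpoints per request

def chunk_messages(messages: list[dict], model_name: str, threshold: int):
    """Group messages into cache blocks of roughly `threshold` tokens (illustrative sketch)."""
    blocks, current, current_tokens = [], [], 0
    for msg in messages:
        current.append(msg)
        current_tokens += token_counter(model=model_name, messages=[msg])
        # When the running chunk crosses the threshold, close it out as a cache block.
        if current_tokens >= threshold and len(blocks) < MAX_CACHE_BLOCKS:
            blocks.append(current)
            current, current_tokens = [], 0
    # Whatever remains (or overflows past the block limit) is sent uncached.
    return blocks, current
```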
The caching system is automatically applied in ThreadManager.run_thread():
```python
# Auto-detects context window and calculates optimal thresholds
prepared_messages = apply_anthropic_caching_strategy(
    system_prompt,
    conversation_messages,
    model_name  # e.g., "claude-sonnet-4"
)
```
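Under the hood, Anthropic prompt caching is driven by `cache_control` markers on content blocks. A hedged sketch of the shape the prepared messages take (field layout shown for illustration; the exact structure AgentPress emits may differ):

```python
prepared_messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a helpful agent...",
                # Marks the end of a cached prefix; Anthropic caches everything up to here.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": "What changed in the last release?"},
]
```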
Track cache performance via logs:
```
🔥 Block X: Cached chunk (Y tokens, Z messages)
🎯 Total cache blocks used: X/4
📊 Processing N messages (X tokens)
🧮 Calculated optimal cache threshold: X tokens
```

The result: 70-90% cost and latency savings in long conversations, scaling efficiently across all context window sizes and conversation lengths.