AgentPress Prompt Caching System

Overview

AgentPress implements mathematically optimized prompt caching for Anthropic Claude models to achieve 70-90% cost and latency savings in long conversations. The system uses dynamic token-based thresholds that adapt to conversation length, context window size, and message density.

How It Works

1. Dynamic Context Detection

  • Auto-detects context window from model registry (200k-2M+ tokens)
  • Supports all models: Claude 3.7 (200k), Claude Sonnet 4 (1M), Gemini 2.5 Pro (2M)
  • Falls back to 200k default if model not found

2. Mathematical Threshold Calculation

Optimal Threshold = Base × Stage × Context × Density

Where:
• Base = 2.5% of context window
• Stage = Conversation length multiplier
• Context = Context window multiplier  
• Density = Token density multiplier

3. Conversation Stage Scaling

| Stage | Messages | Multiplier | Strategy |
|---|---|---|---|
| Early | ≤20 | 0.3x | Aggressive caching for quick wins |
| Growing | 21-100 | 0.6x | Balanced approach |
| Mature | 101-500 | 1.0x | Larger chunks, preserve blocks |
| Very Long | 500+ | 1.8x | Conservative, maximum efficiency |

4. Context Window Scaling

| Context Window | Multiplier | Example Models |
|---|---|---|
| 200k tokens | 1.0x | Claude 3.7 Sonnet |
| 500k tokens | 1.2x | GPT-4 variants |
| 1M tokens | 1.5x | Claude Sonnet 4 |
| 2M+ tokens | 2.0x | Gemini 2.5 Pro |
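As a sketch, the formula and the two multiplier tables above can be combined into a single function. This is an illustration, not the actual AgentPress implementation: the function name is hypothetical and the density multiplier is omitted (assumed 1.0).

```python
def calculate_cache_threshold(context_window: int, message_count: int) -> int:
    """Optimal Threshold = Base x Stage x Context (density assumed 1.0)."""
    base = context_window * 0.025  # Base = 2.5% of the context window

    # Conversation stage multiplier (from the stage table)
    if message_count <= 20:
        stage = 0.3
    elif message_count <= 100:
        stage = 0.6
    elif message_count <= 500:
        stage = 1.0
    else:
        stage = 1.8

    # Context window multiplier (from the context table)
    if context_window >= 2_000_000:
        context = 2.0
    elif context_window >= 1_000_000:
        context = 1.5
    elif context_window >= 500_000:
        context = 1.2
    else:
        context = 1.0

    return round(base * stage * context)
```

For a 200k-token model early in a conversation this yields 200,000 × 2.5% × 0.3 × 1.0 = 1,500 tokens, matching the threshold examples below.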

Cache Threshold Examples

Real-World Thresholds by Model & Conversation Length

| Model | Context | Early (≤20) | Growing (≤100) | Mature (≤500) | Very Long (500+) |
|---|---|---|---|---|---|
| Claude 3.7 | 200k | 1.5k tokens | 3k tokens | 5k tokens | 9k tokens |
| GPT-5 | 400k | 3k tokens | 6k tokens | 10k tokens | 18k tokens |
| Claude Sonnet 4 | 1M | 7.5k tokens | 15k tokens | 25k tokens | 45k tokens |
| Gemini 2.5 Pro | 2M | 15k tokens | 30k tokens | 50k tokens | 90k tokens |

Cache Block Strategy

4-Block Distribution

  1. Block 1: System prompt (cached if ≥1024 tokens)
  2. Blocks 2-4: Conversation chunks (automatic management)

Cache Management

  • Early blocks: Stable, reused longest
  • Recent blocks: Dynamic, optimized for conversation flow
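The 4-block distribution maps onto Anthropic's `cache_control` breakpoints (the Messages API allows up to four). Below is an illustrative sketch of how the blocks could be marked; the function name and chunk boundaries are hypothetical, not the actual AgentPress code.

```python
def mark_cache_blocks(system_prompt: str, chunks: list[list[dict]]) -> dict:
    """Place a cache breakpoint after the system prompt and each chunk."""
    # Block 1: system prompt as a cacheable content block
    system = [{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"},
    }]

    messages = []
    for chunk in chunks[:3]:  # Blocks 2-4: conversation chunks
        messages.extend(chunk)
        # Attach the breakpoint to the last content block of the chunk
        last = messages[-1]
        if isinstance(last["content"], str):
            last["content"] = [{
                "type": "text",
                "text": last["content"],
                "cache_control": {"type": "ephemeral"},
            }]
    return {"system": system, "messages": messages}
```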

Token Counting

Uses LiteLLM's accurate tokenizers:

```python
from litellm import token_counter

tokens = token_counter(model=model_name, text=content)
```
  • Anthropic models: Uses Anthropic's actual tokenizer
  • OpenAI models: Uses tiktoken
  • Other models: Model-specific tokenizers
  • Fallback: Word-based estimation (1.3x words)
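Combining the accurate and fallback paths above might look like this (a sketch; the wrapper function name is illustrative):

```python
def count_tokens(model_name: str, text: str) -> int:
    """Accurate model-specific count when available, word-based estimate otherwise."""
    try:
        from litellm import token_counter  # model-specific tokenizers
        return token_counter(model=model_name, text=text)
    except Exception:
        # Fallback: word-based estimation (1.3x words)
        return int(len(text.split()) * 1.3)
```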

Cost Benefits

Pricing Structure

  • Cache writes: 1.25x base input-token cost
  • Cache hits: 0.1x base cost (90% savings)
  • Break-even: 2-3 reuses for most chunks
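The pricing multipliers above give a back-of-the-envelope cost model (a sketch; the function is illustrative, costs are in units of the uncached base cost per request):

```python
def relative_cost(reuses: int, cached: bool) -> float:
    """Cost of one chunk across (1 + reuses) requests, in base-cost units."""
    if cached:
        return 1.25 + 0.1 * reuses  # one cache write, then cheap hits
    return 1.0 * (1 + reuses)       # full price on every request
```

For example, a chunk reused three times costs 1.25 + 0.3 = 1.55 units cached versus 4.0 units uncached, roughly a 60% saving on that chunk alone.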

Example Savings

  • 200k context conversation: 70-85% cost reduction
  • 1M context conversation: 80-90% cost reduction
  • 500+ message threads: Up to 95% latency reduction

Implementation Flow

```mermaid
graph TD
    A[New Message] --> B{Anthropic Model?}
    B -->|No| C[Standard Processing]
    B -->|Yes| D[Get Context Window from Registry]
    D --> E[Calculate Optimal Threshold]
    E --> F[Count Existing Tokens]
    F --> G{Threshold Reached?}
    G -->|No| H[Add to Current Chunk]
    G -->|Yes| I[Create Cache Block]
    I --> J{Max Blocks Reached?}
    J -->|No| K[Continue Chunking]
    J -->|Yes| L[Add Remaining Uncached]
    H --> M[Send to LLM]
    K --> M
    L --> M
    C --> M
```
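The chunking steps of the flow above can be sketched as a loop (a simplified illustration, not the actual ThreadManager code; one of the four cache blocks is reserved for the system prompt, leaving three conversation chunks):

```python
MAX_CACHE_BLOCKS = 4  # Anthropic's cache breakpoint limit

def chunk_for_caching(messages, threshold, count_tokens):
    """Accumulate messages into fixed chunks; return (cached blocks, uncached tail)."""
    blocks, current, current_tokens = [], [], 0
    for msg in messages:
        current.append(msg)
        current_tokens += count_tokens(msg)
        # Freeze a cache block once the threshold is reached,
        # leaving block 1 for the system prompt (so at most 3 here)
        if current_tokens >= threshold and len(blocks) < MAX_CACHE_BLOCKS - 1:
            blocks.append(current)  # frozen blocks never change
            current, current_tokens = [], 0
    return blocks, current  # remaining messages stay uncached
```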

Key Features

Prevents Cache Invalidation

  • Fixed-size chunks never change once created
  • New messages go into new chunks or remain uncached
  • No more cache invalidation on every new message

Scales Efficiently

  • Handles 20-message conversations to 1000+ message threads
  • Adapts chunk sizes to context window (200k-2M tokens)
  • Preserves cache blocks for maximum reuse

Cost Optimized

  • Mathematical break-even analysis
  • Early aggressive caching for quick wins
  • Late conservative caching to preserve blocks

Context Window Aware

  • Prevents cache blocks from being consumed prematurely in large contexts
  • Reserves 20% of context for new messages/outputs
  • Handles oversized conversations gracefully

Usage

The caching system is automatically applied in ThreadManager.run_thread():

```python
# Auto-detects context window and calculates optimal thresholds
prepared_messages = apply_anthropic_caching_strategy(
    system_prompt,
    conversation_messages,
    model_name  # e.g., "claude-sonnet-4"
)
```

Monitoring

Track cache performance via logs:

  • 🔥 Block X: Cached chunk (Y tokens, Z messages)
  • 🎯 Total cache blocks used: X/4
  • 📊 Processing N messages (X tokens)
  • 🧮 Calculated optimal cache threshold: X tokens

Result

70-90% cost and latency savings in long conversations while scaling efficiently across all context window sizes and conversation lengths.