
Anthropic Prompt Caching

docs/examples/llm/anthropic_prompt_caching.ipynb


In this notebook, we demonstrate how to use Anthropic Prompt Caching with LlamaIndex abstractions.

Prompt Caching is enabled by setting a cache_control marker in the messages request.

How Prompt Caching works

When you send a request with Prompt Caching enabled:

  1. The system checks if the prompt prefix is already cached from a recent query.
  2. If found, it uses the cached version, reducing processing time and costs.
  3. Otherwise, it processes the full prompt and caches the prefix for future use.
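Under the hood, this works by attaching a cache_control field to a content block in the raw Messages API request; everything up to and including that block becomes the cacheable prefix. A minimal sketch of what such a payload looks like (the document text and question are placeholders; field names follow Anthropic's documented request format):

```python
# Sketch of a raw Messages API payload with prompt caching enabled.
# The "cache_control" marker on a content block tells Anthropic to cache
# everything up to and including that block for reuse by later requests.
document_text = "<large document text>"  # placeholder

payload = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": "You are a helpful AI Assistant.",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": document_text,
                    # Prefix up to here is checked against / written to the cache.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "\n\nWhy did Paul Graham start YC?"},
            ],
        }
    ],
}
```

As shown below, LlamaIndex lets you express the same thing declaratively with a CachePoint block instead of building this payload by hand.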

Note:

A. Prompt caching works with Claude 4 Opus, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku and Claude 3 Opus models.

B. The minimum cacheable prompt length is:

1. 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
2. 1024 tokens for all other models.

C. Shorter prompts cannot be cached, even if marked with cache_control.
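If you want a quick sanity check before marking a prompt for caching, you can estimate whether it clears the minimum length. The sketch below uses a rough ~4 characters-per-token heuristic, which is an assumption, not Anthropic's real tokenizer; use the API's token-counting facilities for exact numbers:

```python
# Rough pre-check against the minimum cacheable prompt length.
# The chars/4 ratio is a crude heuristic, not the real tokenizer.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-haiku": 2048,
    "claude-3-haiku": 2048,
}
DEFAULT_MIN_TOKENS = 1024


def likely_cacheable(text: str, model: str) -> bool:
    threshold = DEFAULT_MIN_TOKENS
    for prefix, minimum in MIN_CACHEABLE_TOKENS.items():
        if model.startswith(prefix):
            threshold = minimum
    estimated_tokens = len(text) // 4  # crude estimate
    return estimated_tokens >= threshold


print(likely_cacheable("short prompt", "claude-3-5-sonnet-20240620"))  # False
```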

Setup API Keys

python
import os

os.environ[
    "ANTHROPIC_API_KEY"
] = "sk-ant-..."  # replace with your Anthropic API key

Setup LLM

python
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-3-5-sonnet-20240620")

Download Data

In this demonstration, we will use the text from the Paul Graham Essay. We will cache the text and run some queries based on it.

python
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'

Load Data

python
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./paul_graham_essay.txt"],
).load_data()

document_text = documents[0].text

Prompt Caching

To enable prompt caching, use the CachePoint block within LlamaIndex: everything before that block will be cached.

We can verify if the text is cached by checking the following parameters:

cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.

cache_read_input_tokens: Number of tokens retrieved from the cache for this request.

input_tokens: Number of input tokens which were not read from or used to create a cache.
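To make checking these counters easier, here is a small helper that pulls the cache-related usage fields out of a raw response. It assumes resp.raw behaves like a dict with a "usage" entry (depending on the LlamaIndex version, resp.raw may instead be an object, in which case access the fields as attributes); the sample values below are illustrative, not real output:

```python
# Helper to summarize the cache-related usage counters from a raw
# Anthropic response dict. Missing fields default to zero.
def cache_usage(raw: dict) -> dict:
    usage = raw.get("usage", {})
    return {
        "cache_creation_input_tokens": usage.get(
            "cache_creation_input_tokens", 0
        ),
        "cache_read_input_tokens": usage.get("cache_read_input_tokens", 0),
        "input_tokens": usage.get("input_tokens", 0),
    }


# Illustrative stand-in for a real resp.raw:
sample_raw = {
    "usage": {
        "cache_creation_input_tokens": 17530,
        "cache_read_input_tokens": 0,
        "input_tokens": 12,
    }
}
print(cache_usage(sample_raw))
```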

python
from llama_index.core.llms import (
    ChatMessage,
    TextBlock,
    CachePoint,
    CacheControl,
)

messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=f"{document_text}",
                type="text",
            ),
            TextBlock(
                text="\n\nWhy did Paul Graham start YC?",
                type="text",
            ),
            CachePoint(cache_control=CacheControl(type="ephemeral")),
        ],
    ),
]

resp = llm.chat(messages)

Let's examine the raw response.

python
resp.raw

As you can see, since this notebook has been run a few times, cache_creation_input_tokens and cache_read_input_tokens are both greater than zero, indicating that the text was cached properly.

Now, let’s run another query on the same document. It should retrieve the document text from the cache, which will be reflected in cache_read_input_tokens.

python
messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=f"{document_text}",
                type="text",
            ),
            TextBlock(
                text="\n\nWhat did Paul Graham do growing up?",
                type="text",
            ),
            CachePoint(cache_control=CacheControl(type="ephemeral")),
        ],
    ),
]

resp = llm.chat(messages)
python
resp.raw

As you can see, the response was generated using cached text, as indicated by cache_read_input_tokens.

With Anthropic, the default cache lasts 5 minutes. You can also create longer-lived caches, for instance 1 hour, by passing the ttl argument to CacheControl.

python
messages = [
    ChatMessage(role="system", content="You are a helpful AI Assistant."),
    ChatMessage(
        role="user",
        content=[
            TextBlock(
                text=f"{document_text}",
                type="text",
            ),
            TextBlock(
                text="\n\nWhat did Paul Graham do growing up?",
                type="text",
            ),
            CachePoint(
                cache_control=CacheControl(type="ephemeral", ttl="1h"),
            ),
        ],
    ),
]

resp = llm.chat(messages)