docs/examples/llm/anthropic_prompt_caching.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/anthropic_prompt_caching.ipynb" target="_parent">Open In Colab</a>
In this Notebook, we will demonstrate the usage of Anthropic Prompt Caching with LlamaIndex abstractions.
Prompt Caching is enabled by adding a cache_control breakpoint to the request messages.
When you send a request with Prompt Caching enabled:
1. The system checks whether the prompt prefix up to the cache breakpoint is already cached from a recent request.
2. If found, it uses the cached prefix, reducing processing time and cost.
3. Otherwise, it processes the full prompt and caches the prefix for subsequent requests.
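The flow above can be sketched as a toy, dict-based model (an illustration only, not Anthropic's implementation): the first request with a given prefix writes it to the cache, and later requests with the same prefix read it instead of reprocessing it.

```python
import hashlib

# Toy prefix cache keyed by a hash of the cached portion of the prompt.
cache = {}


def send(prefix_tokens: list[str], suffix_tokens: list[str]) -> dict:
    """Simulate the usage fields returned for a request with a cache breakpoint."""
    key = hashlib.sha256("".join(prefix_tokens).encode()).hexdigest()
    if key in cache:
        # Cache hit: the prefix is read from the cache.
        return {
            "cache_creation_input_tokens": 0,
            "cache_read_input_tokens": len(prefix_tokens),
            "input_tokens": len(suffix_tokens),
        }
    # Cache miss: the prefix is processed in full and written to the cache.
    cache[key] = True
    return {
        "cache_creation_input_tokens": len(prefix_tokens),
        "cache_read_input_tokens": 0,
        "input_tokens": len(suffix_tokens),
    }


doc = ["tok"] * 2048
first = send(doc, ["why", "?"])  # writes the prefix to the cache
second = send(doc, ["what", "?"])  # reads the prefix from the cache
```

The real usage fields in the API response follow the same pattern, as shown later in this notebook.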
Note:
A. Prompt caching works with Claude 4 Opus, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Haiku and Claude 3 Opus models.
B. The minimum cacheable prompt length is:
1. 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
2. 1024 tokens for all other models.
C. Shorter prompts cannot be cached, even if marked with cache_control.
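A quick way to sanity-check these minimums before sending a request is a rough token estimate. The ~4 characters per token ratio below is a heuristic, not Anthropic's actual tokenizer, and the model names in the lookup table are illustrative.

```python
# Minimum cacheable prompt lengths per the note above; models not listed
# fall back to the 1024-token minimum.
MIN_CACHEABLE_TOKENS = {
    "claude-3-5-haiku": 2048,
    "claude-3-haiku": 2048,
}


def likely_cacheable(text: str, model: str) -> bool:
    """Estimate whether a prompt meets the minimum cacheable length."""
    estimated_tokens = len(text) // 4  # rough heuristic: ~4 chars per token
    return estimated_tokens >= MIN_CACHEABLE_TOKENS.get(model, 1024)


likely_cacheable("short prompt", "claude-3-5-sonnet")  # too short to cache
```

For an exact count, tokenize with the provider's own tooling rather than this heuristic.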
import os
os.environ[
"ANTHROPIC_API_KEY"
] = "sk-ant-..." # replace with your Anthropic API key
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-3-5-sonnet-20240620")
In this demonstration, we will use the text from the Paul Graham Essay. We will cache the text and run some queries based on it.
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader(
input_files=["./paul_graham_essay.txt"],
).load_data()
document_text = documents[0].text
To enable prompt caching, you can use the CachePoint block within LlamaIndex: everything before that block will be cached.
We can verify if the text is cached by checking the following parameters:
cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
input_tokens: Number of input tokens that were neither read from nor written to the cache.
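A small helper can make these checks explicit. This is a hypothetical convenience function (not part of LlamaIndex or the Anthropic SDK) that inspects a usage dict like the one found in the raw responses below; the exact shape of resp.raw may vary by version.

```python
def cache_summary(usage: dict) -> str:
    """Describe what the cache-related usage fields indicate for one request."""
    created = usage.get("cache_creation_input_tokens", 0) or 0
    read = usage.get("cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read from cache"
    if created:
        return f"cache write: {created} tokens written to cache"
    return "no caching occurred (prompt may be below the minimum length)"


# Example with made-up numbers resembling a first (cache-writing) request:
cache_summary({"cache_creation_input_tokens": 17530, "cache_read_input_tokens": 0})
```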
from llama_index.core.llms import (
ChatMessage,
TextBlock,
CachePoint,
CacheControl,
)
messages = [
ChatMessage(role="system", content="You are a helpful AI assistant."),
ChatMessage(
role="user",
content=[
TextBlock(
text=document_text,
type="text",
),
TextBlock(
text="\n\nWhy did Paul Graham start YC?",
type="text",
),
CachePoint(cache_control=CacheControl(type="ephemeral")),
],
),
]
resp = llm.chat(messages)
Let's examine the raw response.
resp.raw
As you can see, since I've run this a few times, cache_creation_input_tokens and cache_read_input_tokens are both greater than zero, indicating that the text was cached properly.
Now, let’s run another query on the same document. It should retrieve the document text from the cache, which will be reflected in cache_read_input_tokens.
messages = [
ChatMessage(role="system", content="You are a helpful AI assistant."),
ChatMessage(
role="user",
content=[
TextBlock(
text=document_text,
type="text",
),
TextBlock(
text="\n\nWhat did Paul Graham do growing up?",
type="text",
),
CachePoint(cache_control=CacheControl(type="ephemeral")),
],
),
]
resp = llm.chat(messages)
resp.raw
As you can see, the response was generated using cached text, as indicated by cache_read_input_tokens.
With Anthropic, the default cache lasts 5 minutes. You can also create longer-lived caches, for instance 1 hour, by specifying the ttl argument in CacheControl.
messages = [
ChatMessage(role="system", content="You are a helpful AI assistant."),
ChatMessage(
role="user",
content=[
TextBlock(
text=document_text,
type="text",
),
TextBlock(
text="\n\nWhat did Paul Graham do growing up?",
type="text",
),
CachePoint(
cache_control=CacheControl(type="ephemeral", ttl="1h"),
),
],
),
]
resp = llm.chat(messages)