docs/examples/llm/anthropic.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/anthropic.ipynb" target="_parent"></a>
Anthropic offers many state-of-the-art models from the Haiku, Sonnet, and Opus families.
Read on to learn how to use these models with LlamaIndex!
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-anthropic
First we want to set the tokenizer, which is slightly different from tiktoken. This ensures that token counting is accurate throughout the library.
NOTE: Anthropic recently updated their token-counting API. Older models like claude-2.1 are no longer supported for token counting in the latest versions of the Anthropic Python client.
from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings
tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer
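With the tokenizer set, you can count tokens the same way the rest of the library does. A minimal sketch (assuming the tokenizer exposes an encode method, which is what Settings.tokenizer expects):
# encode returns a token list; its length is the token count
token_count = len(tokenizer.encode("Who is Paul Graham?"))
print(token_count)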
import os
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
You can call complete with a prompt:
from llama_index.llms.anthropic import Anthropic
# To customize your API key, pass it in directly;
# otherwise it is looked up from the ANTHROPIC_API_KEY environment variable
# llm = Anthropic(api_key="<api_key>")
llm = Anthropic(model="claude-sonnet-4-0")
resp = llm.complete("Who is Paul Graham?")
print(resp)
You can also call chat with a list of chat messages:
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="Tell me a story"),
]
llm = Anthropic(model="claude-sonnet-4-0")
resp = llm.chat(messages)
print(resp)
Every method supports streaming through the stream_ prefix.
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-0")
resp = llm.stream_complete("Who is Paul Graham?")
for r in resp:
print(r.delta, end="")
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
Every synchronous method has an async counterpart.
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-0")
resp = await llm.astream_complete("Who is Paul Graham?")
async for r in resp:
print(r.delta, end="")
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = await llm.achat(messages)
print(resp)
By providing the region and project_id parameters (either through environment variables or directly), you can use an Anthropic model through Vertex AI.
import os
os.environ["ANTHROPIC_PROJECT_ID"] = "YOUR PROJECT ID HERE"
os.environ["ANTHROPIC_REGION"] = "YOUR PROJECT REGION HERE"
Keep in mind that setting region and project_id will route requests through the Vertex AI client.
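For example (a sketch -- the model identifier below is illustrative, since Vertex AI uses @-versioned model names; check which Claude models are available in your region):
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(
    model="claude-3-5-sonnet-v2@20241022",  # illustrative Vertex AI model name
    region=os.environ["ANTHROPIC_REGION"],
    project_id=os.environ["ANTHROPIC_PROJECT_ID"],
)
resp = llm.complete("Who is Paul Graham?")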
LlamaIndex also supports Anthropic models through AWS Bedrock.
from llama_index.llms.anthropic import Anthropic
# Note: this assumes you have standard AWS credentials configured in your environment
llm = Anthropic(
model="anthropic.claude-3-7-sonnet-20250219-v1:0",
aws_region="us-east-1",
)
resp = llm.complete("Who is Paul Graham?")
print(resp)
Using ChatMessage objects, you can pass in images and text to the LLM.
!wget https://cdn.pixabay.com/photo/2021/12/12/20/00/play-6865967_640.jpg -O image.jpg
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-0")
messages = [
ChatMessage(
role="user",
blocks=[
ImageBlock(path="image.jpg"),
TextBlock(text="What is in this image?"),
],
)
]
resp = llm.chat(messages)
print(resp)
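ImageBlock can also reference a remote image by URL instead of a local path (a sketch, assuming the url field on ImageBlock; behavior matches the local-file example above):
messages = [
    ChatMessage(
        role="user",
        blocks=[
            # same image as above, fetched by URL instead of from disk
            ImageBlock(
                url="https://cdn.pixabay.com/photo/2021/12/12/20/00/play-6865967_640.jpg"
            ),
            TextBlock(text="What is in this image?"),
        ],
    )
]
resp = llm.chat(messages)
print(resp)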
Anthropic models support prompt caching -- if a prompt is repeated across requests, or the start of a prompt is repeated, the LLM can reuse pre-computed attention results to speed up the response and lower costs.
To enable prompt caching, you can set cache_control on your ChatMessage objects, or set cache_idx on the LLM to always cache the first X messages (with -1 being all messages).
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-0")
# cache individual message(s)
messages = [
ChatMessage(
role="user",
content="<some very long prompt>",
additional_kwargs={"cache_control": {"type": "ephemeral"}},
),
]
resp = llm.chat(messages)
# cache first X messages (with -1 being all messages)
llm = Anthropic(model="claude-sonnet-4-0", cache_idx=-1)
resp = llm.chat(messages)
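To confirm that a cache write or cache hit actually happened, you can inspect the raw API response -- Anthropic reports cache activity in the usage metadata (a sketch; the exact shape of resp.raw depends on the client version):
# look for cache_creation_input_tokens / cache_read_input_tokens in the usage info
print(resp.raw)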
LlamaIndex provides an intuitive interface for converting any Anthropic LLM into a structured LLM through structured_predict -- simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
from llama_index.llms.anthropic import Anthropic
from llama_index.core.prompts import PromptTemplate
from llama_index.core.bridge.pydantic import BaseModel
from typing import List
class MenuItem(BaseModel):
"""A menu item in a restaurant."""
course_name: str
is_vegetarian: bool
class Restaurant(BaseModel):
"""A restaurant with name, city, and cuisine."""
name: str
city: str
cuisine: str
menu_items: List[MenuItem]
llm = Anthropic(model="claude-sonnet-4-0")
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
llm.as_structured_llm(Restaurant)
.complete(prompt_tmpl.format(city_name="Miami"))
.raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
restaurant_obj
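structured_predict also has an async counterpart (a sketch mirroring the sync call above):
restaurant_obj = await llm.astructured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
)
restaurant_obj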
Any LLM wrapped with as_structured_llm supports streaming through stream_chat.
from llama_index.core.llms import ChatMessage
from IPython.display import clear_output
from pprint import pprint
input_msg = ChatMessage.from_str("Generate a restaurant in San Francisco")
sllm = llm.as_structured_llm(Restaurant)
stream_output = sllm.stream_chat([input_msg])
for partial_output in stream_output:
clear_output(wait=True)
    pprint(partial_output.raw.model_dump())
restaurant_obj = partial_output.raw
restaurant_obj
With Claude Sonnet 3.7 and newer models, you can enable the model to "think" harder about a task, generating a chain-of-thought response before writing out the final answer.
You can enable this by passing the thinking_dict parameter to the constructor, specifying the number of tokens to reserve for the thinking process.
from llama_index.llms.anthropic import Anthropic
from llama_index.core.llms import ChatMessage
llm = Anthropic(
model="claude-sonnet-4-0",
# max_tokens must be greater than budget_tokens
max_tokens=64000,
# temperature must be 1.0 for thinking to work
temperature=1.0,
thinking_dict={"type": "enabled", "budget_tokens": 1600},
)
messages = [
ChatMessage(role="user", content="(1234 * 3421) / (231 + 2341) = ?")
]
resp_gen = llm.stream_chat(messages)
for r in resp_gen:
print(r.delta, end="")
print()
print(r.message.content)
print(r.message.additional_kwargs["thinking"]["signature"])
We can also expose the exact thinking process:
print(r.message.additional_kwargs["thinking"]["thinking"])
Anthropic supports direct tool/function calling through the API. Using LlamaIndex, we can implement some core agentic tool calling patterns.
from llama_index.core.tools import FunctionTool
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic
from datetime import datetime
llm = Anthropic(model="claude-sonnet-4-0")
def get_current_time() -> dict:
"""Get the current time"""
return {"time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
# uses the tool name, any type annotations, and docstring to describe the tool
tool = FunctionTool.from_defaults(fn=get_current_time)
We can simply do a single pass to call the tool and get the result:
resp = llm.predict_and_call([tool], "What is the current time?")
print(resp)
We can also use lower-level APIs to implement an agentic tool-calling loop!
chat_history = [ChatMessage(role="user", content="What is the current time?")]
tools_by_name = {t.metadata.name: t for t in [tool]}
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
if not tool_calls:
print(resp)
else:
while tool_calls:
# add the LLM's response to the chat history
chat_history.append(resp.message)
for tool_call in tool_calls:
tool_name = tool_call.tool_name
tool_kwargs = tool_call.tool_kwargs
print(f"Calling {tool_name} with {tool_kwargs}")
            tool_output = tools_by_name[tool_name].call(**tool_kwargs)
print("Tool output: ", tool_output)
chat_history.append(
ChatMessage(
role="tool",
content=str(tool_output),
# most LLMs like Anthropic, OpenAI, etc. need to know the tool call id
additional_kwargs={"tool_call_id": tool_call.tool_id},
)
)
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print("Final response: ", resp.message.content)
Anthropic also supports server-side tool calling in the latest API versions.
Here's an example of how to use it:
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(
model="claude-sonnet-4-0",
max_tokens=1024,
tools=[
{
"type": "web_search_20250305",
"name": "web_search",
"max_uses": 3, # Limit to 3 searches
}
],
)
# Get response with citations
response = llm.complete("What are the latest AI research trends?")
# Access the main response content
print(response.text)
# Access citations if available
for citation in response.citations:
print(f"Source: {citation.get('url')} - {citation.get('cited_text')}")
In llama-index-core>=0.12.46 and llama-index-llms-anthropic>=0.7.6, we've added support for outputting citable tool results!
Using Anthropic, you can now utilize server-side citations to cite specific parts of your tool results.
If the LLM cites a tool result, the citation will appear in the output as a CitationBlock, containing the source, title, and cited content.
Let's cover a few ways to do this in practice.
First, let's define a dummy tool/function that returns a citable block.
from llama_index.core import Document
from llama_index.core.llms import CitableBlock, TextBlock
from llama_index.core.tools import FunctionTool
dummy_text = Document.example().text
async def search_fn(query: str):
"""Useful for searching the web to answer questions."""
return CitableBlock(
content=[TextBlock(text=dummy_text)],
title="Facts about LLMs and LlamaIndex",
source="https://docs.llamaindex.ai",
)
search_tool = FunctionTool.from_defaults(search_fn)
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(
model="claude-sonnet-4-0",
# api_key="sk-...",
)
You can use these tools directly in pre-built agents, like the FunctionAgent, to get the same citations in the output.
from llama_index.core.agent.workflow import FunctionAgent
agent = FunctionAgent(
tools=[search_tool],
llm=llm,
# Since we have a fake tool that returns a static result, we don't want to waste LLM tokens
system_prompt="Only make one search query per user message.",
timeout=None,
)
output = await agent.run("How do LlamaIndex and LLMs work together?")
from llama_index.core.llms import CitationBlock
print(output.response.content)
print("----" * 20)
for block in output.response.blocks:
if isinstance(block, CitationBlock):
print("Source: ", block.source)
print("Title: ", block.title)
print("Cited Content:\n", block.cited_content.text)
print("----" * 20)
Alternatively, using our tool that returns a citable block, we can call the LLM ourselves in a manual agent loop.
Once the LLM stops making tool calls, we can return the final response and parse the citations from it.
from llama_index.core.llms import ChatMessage, CitationBlock
chat_history = [
ChatMessage(
role="system",
# Since we have a fake tool that returns a static result, we don't want to waste LLM tokens
content="Only make one search query per user message.",
),
ChatMessage(
role="user", content="How do LlamaIndex and LLMs work together?"
),
]
resp = llm.chat_with_tools([search_tool], chat_history=chat_history)
chat_history.append(resp.message)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
while tool_calls:
for tool_call in tool_calls:
if tool_call.tool_name == "search_fn":
            tool_result = await search_tool.acall(**tool_call.tool_kwargs)
chat_history.append(
ChatMessage(
role="tool",
blocks=tool_result.blocks,
additional_kwargs={"tool_call_id": tool_call.tool_id},
)
)
resp = llm.chat_with_tools([search_tool], chat_history=chat_history)
chat_history.append(resp.message)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print(resp.message.content)
print("----" * 20)
for block in resp.message.blocks:
if isinstance(block, CitationBlock):
print("Source: ", block.source)
print("Title: ", block.title)
print("Cited Content:\n", block.cited_content.text)
print("----" * 20)