Back to Llama Index

Anthropic

docs/examples/llm/anthropic.ipynb

0.14.2114.6 KB
Original Source

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/anthropic.ipynb" target="_parent"></a>

Anthropic

Anthropic offers many state-of-the-art models from the haiku, sonnet, and opus families.

Read on to learn how to use these models with LlamaIndex!

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-anthropic

Set Tokenizer

First we want to set the tokenizer, which is slightly different than TikToken. This ensures that token counting is accurate throughout the library.

NOTE: Anthropic recently updated their token counting API. Older models like claude-2.1 are no longer supported for token counting in the latest versions of the Anthropic python client.

python
from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer

Basic Usage

python
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-..."

You can call complete with a prompt:

python
from llama_index.llms.anthropic import Anthropic

# To customize your API key, do this
# otherwise it will lookup ANTHROPIC_API_KEY from your env variable
# llm = Anthropic(api_key="<api_key>")
llm = Anthropic(model="claude-sonnet-4-0")

resp = llm.complete("Who is Paul Graham?")
python
print(resp)

You can also call chat with a list of chat messages:

python
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="Tell me a story"),
]
llm = Anthropic(model="claude-sonnet-4-0")
resp = llm.chat(messages)

print(resp)

Streaming Support

Every method supports streaming through the stream_ prefix.

python
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-0")

resp = llm.stream_complete("Who is Paul Graham?")
for r in resp:
    print(r.delta, end="")
python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="user", content="Who is Paul Graham?"),
]

resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")

Async Usage

Every synchronous method has an async counterpart.

python
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-0")

resp = await llm.astream_complete("Who is Paul Graham?")
async for r in resp:
    print(r.delta, end="")
python
messages = [
    ChatMessage(role="user", content="Who is Paul Graham?"),
]

resp = await llm.achat(messages)
print(resp)

Vertex AI Support

By providing the region and project_id parameters (either through environment variables or directly), you can use an Anthropic model through Vertex AI.

python
import os

os.environ["ANTHROPIC_PROJECT_ID"] = "YOUR PROJECT ID HERE"
os.environ["ANTHROPIC_REGION"] = "YOUR PROJECT REGION HERE"

Do keep in mind that setting region and project_id here will make Anthropic use the Vertex AI client

Bedrock Support

LlamaIndex also supports Anthropic models through AWS Bedrock.

python
from llama_index.llms.anthropic import Anthropic

# Note: this assumes you have standard AWS credentials configured in your environment
llm = Anthropic(
    model="anthropic.claude-3-7-sonnet-20250219-v1:0",
    aws_region="us-east-1",
)

resp = llm.complete("Who is Paul Graham?")

Multi-Modal Support

Using ChatMessage objects, you can pass in images and text to the LLM.

python
!wget https://cdn.pixabay.com/photo/2021/12/12/20/00/play-6865967_640.jpg -O image.jpg
python
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-0")

messages = [
    ChatMessage(
        role="user",
        blocks=[
            ImageBlock(path="image.jpg"),
            TextBlock(text="What is in this image?"),
        ],
    )
]

resp = llm.chat(messages)
print(resp)

Prompt Caching

Anthropic models support the idea of prompt cahcing -- wherein if a prompt is repeated multiple times, or the start of a prompt is repeated, the LLM can reuse pre-calculated attention results to speed up the response and lower costs.

To enable prompt caching, you can set cache_control on your ChatMessage objects, or set cache_idx on the LLM to always cache the first X messages (with -1 being all messages).

python
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(model="claude-sonnet-4-0")

# cache individual message(s)
messages = [
    ChatMessage(
        role="user",
        content="<some very long prompt>",
        additional_kwargs={"cache_control": {"type": "ephemeral"}},
    ),
]

resp = llm.chat(messages)

# cache first X messages (with -1 being all messages)
llm = Anthropic(model="claude-sonnet-4-0", cache_idx=-1)

resp = llm.chat(messages)

Structured Prediction

LlamaIndex provides an intuitive interface for converting any Anthropic LLMs into a structured LLM through structured_predict - simply define the target Pydantic class (can be nested), and given a prompt, we extract out the desired object.

python
from llama_index.llms.anthropic import Anthropic
from llama_index.core.prompts import PromptTemplate
from llama_index.core.bridge.pydantic import BaseModel
from typing import List


class MenuItem(BaseModel):
    """A menu item in a restaurant."""

    course_name: str
    is_vegetarian: bool


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str
    menu_items: List[MenuItem]


llm = Anthropic(model="claude-sonnet-4-0")
prompt_tmpl = PromptTemplate(
    "Generate a restaurant in a given city {city_name}"
)

# Option 1: Use `as_structured_llm`
restaurant_obj = (
    llm.as_structured_llm(Restaurant)
    .complete(prompt_tmpl.format(city_name="Miami"))
    .raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
python
restaurant_obj

Structured Prediction with Streaming

Any LLM wrapped with as_structured_llm supports streaming through stream_chat.

python
from llama_index.core.llms import ChatMessage
from IPython.display import clear_output
from pprint import pprint

input_msg = ChatMessage.from_str("Generate a restaurant in San Francisco")

sllm = llm.as_structured_llm(Restaurant)
stream_output = sllm.stream_chat([input_msg])
for partial_output in stream_output:
    clear_output(wait=True)
    pprint(partial_output.raw.dict())
    restaurant_obj = partial_output.raw

restaurant_obj

Model Thinking

With claude-3.7 Sonnet, you can enable the model to "think" harder about a task, generating a chain-of-thought response before writing out the final answer.

You can enable this by passing in the thinking_dict parameter to the constructor, specififying the amount of tokens to reserve for the thinking process.

python
from llama_index.llms.anthropic import Anthropic
from llama_index.core.llms import ChatMessage

llm = Anthropic(
    model="claude-sonnet-4-0",
    # max_tokens must be greater than budget_tokens
    max_tokens=64000,
    # temperature must be 1.0 for thinking to work
    temperature=1.0,
    thinking_dict={"type": "enabled", "budget_tokens": 1600},
)
python
messages = [
    ChatMessage(role="user", content="(1234 * 3421) / (231 + 2341) = ?")
]

resp_gen = llm.stream_chat(messages)

for r in resp_gen:
    print(r.delta, end="")

print()
print(r.message.content)
python
print(r.message.additional_kwargs["thinking"]["signature"])

We can also expose the exact thinking process:

python
print(r.message.additional_kwargs["thinking"]["thinking"])

Tool/Function Calling

Anthropic supports direct tool/function calling through the API. Using LlamaIndex, we can implement some core agentic tool calling patterns.

python
from llama_index.core.tools import FunctionTool
from llama_index.core.llms import ChatMessage
from llama_index.llms.anthropic import Anthropic
from datetime import datetime

llm = Anthropic(model="claude-sonnet-4-0")


def get_current_time() -> dict:
    """Get the current time"""
    return {"time": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}


# uses the tool name, any type annotations, and docstring to describe the tool
tool = FunctionTool.from_defaults(fn=get_current_time)

We can simply do a single pass to call the tool and get the result:

python
resp = llm.predict_and_call([tool], "What is the current time?")
print(resp)

We can also use lower-level APIs to implement an agentic tool-calling loop!

python
chat_history = [ChatMessage(role="user", content="What is the current time?")]
tools_by_name = {t.metadata.name: t for t in [tool]}

resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)

if not tool_calls:
    print(resp)
else:
    while tool_calls:
        # add the LLM's response to the chat history
        chat_history.append(resp.message)

        for tool_call in tool_calls:
            tool_name = tool_call.tool_name
            tool_kwargs = tool_call.tool_kwargs

            print(f"Calling {tool_name} with {tool_kwargs}")
            tool_output = tool.call(**tool_kwargs)
            print("Tool output: ", tool_output)
            chat_history.append(
                ChatMessage(
                    role="tool",
                    content=str(tool_output),
                    # most LLMs like Anthropic, OpenAI, etc. need to know the tool call id
                    additional_kwargs={"tool_call_id": tool_call.tool_id},
                )
            )

            resp = llm.chat_with_tools([tool], chat_history=chat_history)
            tool_calls = llm.get_tool_calls_from_response(
                resp, error_on_no_tool_call=False
            )
    print("Final response: ", resp.message.content)

Server-Side Tool Calling

Anthropic now also supports server-side tool calling in latest versions.

Here's an example of how to use it:

python
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(
    model="claude-sonnet-4-0",
    max_tokens=1024,
    tools=[
        {
            "type": "web_search_20250305",
            "name": "web_search",
            "max_uses": 3,  # Limit to 3 searches
        }
    ],
)

# Get response with citations
response = llm.complete("What are the latest AI research trends?")

# Access the main response content
print(response.text)

# Access citations if available
for citation in response.citations:
    print(f"Source: {citation.get('url')} - {citation.get('cited_text')}")

Tool Calling + Citations

In llama-index-core>=0.12.46 + llama-index-llms-anthropic>=0.7.6, we've added support for outputting citable tool results!

Using Anthropic, you can now utilize server-side citations to cite specific parts of your tool results.

If the LLM cites a tool result, the citation will appear in the output as a CitationBlock, containing the source, title, and cited content.

Let's cover a few ways to do this in practice.

First, let's define a dummy tool/function that returns a citable block.

python
from llama_index.core import Document
from llama_index.core.llms import CitableBlock, TextBlock
from llama_index.core.tools import FunctionTool

dummy_text = Document.example().text


async def search_fn(query: str):
    """Useful for searching the web to answer questions."""
    return CitableBlock(
        content=[TextBlock(text=dummy_text)],
        title="Facts about LLMs and LlamaIndex",
        source="https://docs.llamaindex.ai",
    )


search_tool = FunctionTool.from_defaults(search_fn)
python
from llama_index.llms.anthropic import Anthropic

llm = Anthropic(
    model="claude-sonnet-4-0",
    # api_key="sk-...",
)

Agents + Citable Tools

You can also use these tools directly in pre-built agents, like the FunctionAgent, to get the same citations in the output.

python
from llama_index.core.agent.workflow import FunctionAgent

agent = FunctionAgent(
    tools=[search_tool],
    llm=llm,
    # Since we have a fake tool that returns a static result, we don't want to waste LLM tokens
    system_prompt="Only make one search query per user message.",
    timeout=None,
)
python
output = await agent.run("How do LlamaIndex and LLMs work together?")
python
from llama_index.core.llms import CitationBlock

print(output.response.content)
print("----" * 20)
for block in output.response.blocks:
    if isinstance(block, CitationBlock):
        print("Source: ", block.source)
        print("Title: ", block.title)
        print("Cited Content:\n", block.cited_content.text)
        print("----" * 20)

Manual Tool Calling + Citations

Using our tool that returns a citable block, we can manually call the LLM with the given tool in a manual agent loop.

Once the LLM stops making tool calls, we can return the final response and parse the citations from the response.

python
from llama_index.core.llms import ChatMessage, CitationBlock

chat_history = [
    ChatMessage(
        role="system",
        # Since we have a fake tool that returns a static result, we don't want to waste LLM tokens
        content="Only make one search query per user message.",
    ),
    ChatMessage(
        role="user", content="How do LlamaIndex and LLMs work together?"
    ),
]
resp = llm.chat_with_tools([search_tool], chat_history=chat_history)
chat_history.append(resp.message)

tool_calls = llm.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)
while tool_calls:
    for tool_call in tool_calls:
        if tool_call.tool_name == "search_fn":
            tool_result = search_tool.call(tool_call.tool_kwargs)
            chat_history.append(
                ChatMessage(
                    role="tool",
                    blocks=tool_result.blocks,
                    additional_kwargs={"tool_call_id": tool_call.tool_id},
                )
            )

    resp = llm.chat_with_tools([search_tool], chat_history=chat_history)
    chat_history.append(resp.message)
    tool_calls = llm.get_tool_calls_from_response(
        resp, error_on_no_tool_call=False
    )

print(resp.message.content)
print("----" * 20)
for block in resp.message.blocks:
    if isinstance(block, CitationBlock):
        print("Source: ", block.source)
        print("Title: ", block.title)
        print("Cited Content:\n", block.cited_content.text)
        print("----" * 20)