Back to Llama Index

Multi-Modal LLM using Anthropic model for image reasoning

docs/examples/multi_modal/anthropic_multi_modal.ipynb

0.14.217.0 KB
Original Source

<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/multi_modal/anthropic_multi_modal.ipynb" target="_parent"></a>

Multi-Modal LLM using Anthropic model for image reasoning

Anthropic has recently released its latest Multi modal models: Claude 3 Opus, Claude 3 Sonnet.

  1. Claude 3 Opus - claude-3-opus-20240229

  2. Claude 3 Sonnet - claude-3-sonnet-20240229

In this notebook, we show how to use Anthropic MultiModal LLM class/abstraction for image understanding/reasoning.

We also show several functions we are now supporting for Anthropic MultiModal LLM:

  • complete (both sync and async): for a single prompt and list of images
  • chat (both sync and async): for multiple chat messages
  • stream complete (both sync and async): for steaming output of complete
  • stream chat (both sync and async): for steaming output of chat
python
!pip install llama-index-multi-modal-llms-anthropic
!pip install llama-index-vector-stores-qdrant
!pip install matplotlib

Use Anthropic to understand Images from Local directory

python
import os

os.environ["ANTHROPIC_API_KEY"] = ""  # Your ANTHROPIC API key here
python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("../data/images/prometheus_paper_card.png")
plt.imshow(img)
python
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal

# put your local directore here
image_documents = SimpleDirectoryReader(
    input_files=["../data/images/prometheus_paper_card.png"]
).load_data()

# Initiated Anthropic MultiModal class
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)
python
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

print(response)

Use AnthropicMultiModal to reason images from URLs

python
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls

image_urls = [
    "https://venturebeat.com/wp-content/uploads/2024/03/Screenshot-2024-03-04-at-12.49.41%E2%80%AFAM.png",
    # Add yours here!
]

img_response = requests.get(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)

image_url_documents = load_image_urls(image_urls)
python
response = anthropic_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_url_documents,
)

print(response)

Structured Output Parsing from an Image

In this section, we use our multi-modal Pydantic program to generate structured output from an image.

python
from llama_index.core import SimpleDirectoryReader

# put your local directore here
image_documents = SimpleDirectoryReader(
    input_files=["../data/images/ark_email_sample.PNG"]
).load_data()
python
from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("../data/images/ark_email_sample.PNG")
plt.imshow(img)
python
from pydantic import BaseModel
from typing import List


class TickerInfo(BaseModel):
    """List of ticker info."""

    direction: str
    ticker: str
    company: str
    shares_traded: int
    percent_of_total_etf: float


class TickerList(BaseModel):
    """List of stock tickers."""

    fund: str
    tickers: List[TickerInfo]
python
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser

prompt_template_str = """\
Can you get the stock information in the image \
and return the answer? Pick just one fund. 

Make sure the answer is a JSON format corresponding to a Pydantic schema. The Pydantic schema is given below.

"""

# Initiated Anthropic MultiModal class
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)


llm_program = MultiModalLLMCompletionProgram.from_defaults(
    output_cls=TickerList,
    image_documents=image_documents,
    prompt_template_str=prompt_template_str,
    multi_modal_llm=anthropic_mm_llm,
    verbose=True,
)
python
response = llm_program()
python
print(str(response))

Index into a Vector Store

In this section we show you how to use Claude 3 to build a RAG pipeline over image data. We first use Claude to extract text from a set of images. We then index the text with an embedding model. Finally, we build a query pipeline over the data.

python
# !wget "https://www.dropbox.com/scl/fi/pvxgohp5ts5mcj2js8drk/mixed_wiki_images_small.zip?rlkey=3zf0z0n2etsjp19tofasaf4vy&dl=1" -O mixed_wiki_images_small.zip
# !wget "https://www.dropbox.com/scl/fi/vg2h92owduqmarwj7fxnc/mixed_wiki_images_small.zip?rlkey=fejq570ehhil3qgv3gibaliqu&dl=1" -O mixed_wiki_images_small.zip
!wget "https://www.dropbox.com/scl/fi/c1ec6osn0r2ggnitijqhl/mixed_wiki_images_small.zip?rlkey=swwxc7h4qtwlnhmby5fsnderd&dl=1" -O mixed_wiki_images_small.zip
!unzip mixed_wiki_images_small.zip
python
from llama_index.multi_modal_llms.anthropic import AnthropicMultiModal

anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)
python
from llama_index.core.schema import TextNode
from pathlib import Path
from llama_index.core import SimpleDirectoryReader

nodes = []
for img_file in Path("mixed_wiki_images_small").glob("*.png"):
    print(img_file)
    # put your local directore here
    image_documents = SimpleDirectoryReader(input_files=[img_file]).load_data()
    response = anthropic_mm_llm.complete(
        prompt="Describe the images as an alternative text",
        image_documents=image_documents,
    )
    metadata = {"img_file": img_file}
    nodes.append(TextNode(text=str(response), metadata=metadata))
python
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.anthropic import Anthropic
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import Settings
from llama_index.core import StorageContext
import qdrant_client


# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mixed_img")

vector_store = QdrantVectorStore(client=client, collection_name="collection")

# Using the embedding model to Gemini
embed_model = OpenAIEmbedding()
anthropic_mm_llm = AnthropicMultiModal(max_tokens=300)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)
python
from llama_index.llms.anthropic import Anthropic

query_engine = index.as_query_engine(llm=Anthropic())
response = query_engine.query("Tell me more about the porsche")
python
print(str(response))
python
from llama_index.core.response.notebook_utils import display_source_node

for n in response.source_nodes:
    display_source_node(n, metadata_mode="all")