docs/examples/llm/google_genai.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/gemini.ipynb" target="_parent"></a>
In this notebook, we show how to use the google-genai Python SDK with LlamaIndex to interact with Google GenAI models.
If you're opening this notebook on Colab, you will need to install LlamaIndex 🦙 and the google-genai Python SDK.
%pip install llama-index-llms-google-genai llama-index
You will need to get an API key from Google AI Studio. Once you have one, you can either pass it explicitly to the model or use the GOOGLE_API_KEY environment variable.
import os
os.environ["GOOGLE_API_KEY"] = "..."
You can call complete with a prompt:
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(
model="gemini-2.5-flash",
# api_key="some key", # uses GOOGLE_API_KEY env var by default
)
resp = llm.complete("Who is Paul Graham?")
print(resp)
You can also call chat with a list of chat messages:
from llama_index.core.llms import ChatMessage
from llama_index.llms.google_genai import GoogleGenAI
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="Tell me a story"),
]
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = llm.chat(messages)
print(resp)
Every method supports streaming through the stream_ prefix.
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = llm.stream_complete("Who is Paul Graham?")
for r in resp:
print(r.delta, end="")
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
Every synchronous method has an async counterpart.
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = await llm.astream_complete("Who is Paul Graham?")
async for r in resp:
print(r.delta, end="")
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = await llm.achat(messages)
print(resp)
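The remaining async variants follow the same pattern. As a minimal sketch, acomplete and astream_chat reuse the same model and messages from above:
resp = await llm.acomplete("Who is Paul Graham?")
print(resp)

resp = await llm.astream_chat(messages)
async for r in resp:
    print(r.delta, end="")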
By providing a project and location (either through environment variables or directly via vertexai_config), you can route requests through Vertex AI.
# Set environment variables (use os.environ so they are visible to the kernel)
import os

os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-id"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
from llama_index.llms.google_genai import GoogleGenAI
# or set the parameters directly
llm = GoogleGenAI(
model="gemini-2.5-flash",
vertexai_config={"project": "your-project-id", "location": "us-central1"},
# you should set the context window to the max input tokens for the model
context_window=200000,
max_tokens=512,
)
Google GenAI supports cached content for improved performance and cost efficiency when reusing large contexts across multiple requests. This is particularly useful for RAG applications, document analysis, and multi-turn conversations with consistent context.
First, create cached content using the Google GenAI SDK:
from google import genai
from google.genai.types import CreateCachedContentConfig, Content, Part
import time
client = genai.Client(api_key="your-api-key")
# For Vertex AI
# from google.genai.types import HttpOptions
# client = genai.Client(
#     http_options=HttpOptions(api_version="v1"),
#     project="your-project-id",
#     location="us-central1",
#     vertexai=True,
# )
Option 1: Upload Local Files
# Upload and process local PDF files
pdf_file = client.files.upload(file="./your_document.pdf")
while pdf_file.state.name == "PROCESSING":
print("Waiting for PDF to be processed.")
time.sleep(2)
pdf_file = client.files.get(name=pdf_file.name)
# Create cache with uploaded file
cache = client.caches.create(
model="gemini-2.5-flash",
config=CreateCachedContentConfig(
display_name="Document Analysis Cache",
system_instruction=(
"You are an expert document analyzer. Answer questions "
"based on the provided documents with accuracy and detail."
),
contents=[pdf_file], # Direct file reference
ttl="3600s", # Cache for 1 hour
),
)
Option 2: Multiple Files with Content Structure
# For multiple files or Cloud Storage files with VertexAI
contents = [
Content(
role="user",
parts=[
Part.from_uri(
# file_uri=pdf_file.uri, # you can use the uploaded file's URI too
file_uri="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
mime_type="application/pdf",
),
Part.from_uri(
file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
mime_type="application/pdf",
),
],
)
]
cache = client.caches.create(
model="gemini-2.5-flash",
config=CreateCachedContentConfig(
display_name="Multi-Document Cache",
system_instruction=(
"You are an expert researcher. Analyze and compare "
"information across the provided documents."
),
contents=contents,
ttl="3600s",
),
)
print(f"Cache created: {cache.name}")
print(f"Cached tokens: {cache.usage_metadata.total_token_count}")
Using Cached Content with LlamaIndex
Once you have created the cache, use it with LlamaIndex:
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
llm = GoogleGenAI(
model="gemini-2.5-flash",
api_key="your-api-key",
cached_content=cache.name,
)
# For VertexAI
# llm = GoogleGenAI(
# model="gemini-2.5-flash",
# vertexai_config={"project": "your-project-id", "location": "us-central1"},
# cached_content=cache.name
# )
# Use the cached content
message = ChatMessage(
role="user", content="Summarize the key findings from Chapter 4."
)
response = llm.chat([message])
print(response)
Using Cached Content in Generation Config
For request-level caching control:
import google.genai.types as types
# Specify cached content per request
config = types.GenerateContentConfig(
cached_content=cache.name, temperature=0.1, max_output_tokens=1024
)
llm = GoogleGenAI(model="gemini-2.5-flash", generation_config=config)
response = llm.complete("List the first five chapters of the document")
print(response)
Cache Management
# List all caches
caches = client.caches.list()
for cache_item in caches:
print(f"Cache: {cache_item.display_name} ({cache_item.name})")
print(f"Tokens: {cache_item.usage_metadata.total_token_count}")
# Get cache details
cache_info = client.caches.get(name=cache.name)
print(f"Created: {cache_info.create_time}")
print(f"Expires: {cache_info.expire_time}")
# Delete cache when done
client.caches.delete(name=cache.name)
print("Cache deleted")
Using ChatMessage objects, you can pass in images and text to the LLM.
!wget https://cdn.pixabay.com/photo/2021/12/12/20/00/play-6865967_640.jpg -O image.jpg
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
messages = [
ChatMessage(
role="user",
blocks=[
ImageBlock(path="image.jpg", image_mimetype="image/jpeg"),
TextBlock(text="What is in this image?"),
],
)
]
resp = llm.chat(messages)
print(resp)
You can also pass in documents.
from llama_index.core.llms import DocumentBlock
messages = [
ChatMessage(
role="user",
blocks=[
DocumentBlock(
path="/path/to/your/test.pdf",
document_mimetype="application/pdf",
),
TextBlock(text="Describe the document in a sentence."),
],
)
]
resp = llm.chat(messages)
print(resp)
Finally, you can also pass videos.
from llama_index.core.llms import VideoBlock
messages = [
ChatMessage(
role="user",
blocks=[
VideoBlock(
path="/path/to/your/video.mp4", video_mimetype="video/mp4"
),
TextBlock(text="Describe this video in a sentence."),
],
)
]
resp = llm.chat(messages)
print(resp)
LlamaIndex provides an intuitive interface for converting any LLM into a structured LLM through structured_predict: simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.prompts import PromptTemplate
from llama_index.core.bridge.pydantic import BaseModel
from typing import List
class MenuItem(BaseModel):
"""A menu item in a restaurant."""
course_name: str
is_vegetarian: bool
class Restaurant(BaseModel):
"""A restaurant with name, city, and cuisine."""
name: str
city: str
cuisine: str
menu_items: List[MenuItem]
llm = GoogleGenAI(model="gemini-2.5-flash")
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
llm.as_structured_llm(Restaurant)
.complete(prompt_tmpl.format(city_name="Miami"))
.raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
print(restaurant_obj)
Any LLM wrapped with as_structured_llm supports streaming through stream_chat.
from llama_index.core.llms import ChatMessage
from IPython.display import clear_output
from pprint import pprint
input_msg = ChatMessage.from_str("Generate a restaurant in San Francisco")
sllm = llm.as_structured_llm(Restaurant)
stream_output = sllm.stream_chat([input_msg])
for partial_output in stream_output:
clear_output(wait=True)
pprint(partial_output.raw.dict())
restaurant_obj = partial_output.raw
restaurant_obj
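Structured prediction also has async counterparts; a minimal sketch using astructured_predict with the same prompt template:
restaurant_obj = await llm.astructured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
)
print(restaurant_obj)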
Google GenAI supports direct tool/function calling through the API. Using LlamaIndex, we can implement some core agentic tool calling patterns.
from llama_index.core.tools import FunctionTool
from llama_index.core.llms import ChatMessage
from llama_index.llms.google_genai import GoogleGenAI
from datetime import datetime
llm = GoogleGenAI(model="gemini-2.5-flash")
def get_current_time(timezone: str) -> dict:
"""Get the current time"""
return {
"time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
"timezone": timezone,
}
# uses the tool name, any type annotations, and docstring to describe the tool
tool = FunctionTool.from_defaults(fn=get_current_time)
We can simply do a single pass to call the tool and get the result:
resp = llm.predict_and_call([tool], "What is the current time in New York?")
print(resp)
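The async variant works the same way; a minimal sketch with apredict_and_call:
resp = await llm.apredict_and_call(
    [tool], "What is the current time in New York?"
)
print(resp)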
We can also use lower-level APIs to implement an agentic tool-calling loop!
chat_history = [
ChatMessage(role="user", content="What is the current time in New York?")
]
tools_by_name = {t.metadata.name: t for t in [tool]}
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
if not tool_calls:
print(resp)
else:
while tool_calls:
# add the LLM's response to the chat history
chat_history.append(resp.message)
for tool_call in tool_calls:
tool_name = tool_call.tool_name
tool_kwargs = tool_call.tool_kwargs
print(f"Calling {tool_name} with {tool_kwargs}")
tool_output = tools_by_name[tool_name].call(**tool_kwargs)
print("Tool output: ", tool_output)
chat_history.append(
ChatMessage(
role="tool",
content=str(tool_output),
# most LLMs like Gemini, Anthropic, OpenAI, etc. need to know the tool call id
additional_kwargs={"tool_call_id": tool_call.tool_id},
)
)
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print("Final response: ", resp.message.content)
We can also call multiple tools simultaneously in a single request, making it efficient for complex queries that require different types of information.
# Define another tool for temperature
def get_temperature(city: str) -> dict:
"""Get the current temperature for a city"""
return {
"city": city,
"temperature": "25°C",
}
# Create tools from functions
tool1 = FunctionTool.from_defaults(fn=get_current_time)
tool2 = FunctionTool.from_defaults(fn=get_temperature)
# Ask a question that requires both tools
chat_history = [
ChatMessage(
role="user",
content="What is the current time and temperature in New York?",
)
]
# The model will intelligently decide which tools to call
resp = llm.chat_with_tools([tool1, tool2], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print(f"Model made {len(tool_calls)} tool calls:")
for i, tool_call in enumerate(tool_calls, 1):
print(f"{i}. {tool_call.tool_name} with args: {tool_call.tool_kwargs}")
Google Gemini 2.0 and 2.5 models support Google Search grounding, which allows the model to search for real-time information and ground its responses with web search results. This is particularly useful for getting up-to-date information.
The built_in_tool parameter accepts Google Search tools that enable the model to ground its responses with real-world data from Google Search results.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
from google.genai import types
# Create Google Search grounding tool
grounding_tool = types.Tool(google_search=types.GoogleSearch())
llm = GoogleGenAI(
model="gemini-2.5-flash",
built_in_tool=grounding_tool,
)
resp = llm.complete("When is the next total solar eclipse in the US?")
print(resp)
The Google Search grounding tool lets the model pull in up-to-date information and attach grounding metadata, such as the search queries it ran and the sources it used, to the response.
You can also use the grounding tool with chat messages:
# Using Google Search with chat messages
messages = [ChatMessage(role="user", content="Who won the Euro 2024?")]
resp = llm.chat(messages)
print(resp)
# You can access grounding metadata from the raw response
if hasattr(resp, "raw") and "grounding_metadata" in resp.raw:
print(resp.raw["grounding_metadata"])
else:
print("\nNo grounding metadata in this response")
The built_in_tool parameter also accepts code execution tools that enable the model to write and execute Python code to solve problems, perform calculations, and analyze data. This is particularly useful for mathematical computations, data analysis, and generating visualizations.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
from google.genai import types
# Create code execution tool
code_execution_tool = types.Tool(code_execution=types.ToolCodeExecution())
llm = GoogleGenAI(
model="gemini-2.5-flash",
built_in_tool=code_execution_tool,
)
resp = llm.complete("Calculate 20th fibonacci number.")
print(resp)
When the model uses code execution, you can access the executed code, its results, and other metadata through the raw response. This includes the model's explanatory text, the generated code (executable_code), and its output (code_execution_result).
Let's see this in action:
# Request a calculation that will likely use code execution
messages = [
ChatMessage(
role="user", content="What is the sum of the first 50 prime numbers?"
)
]
resp = llm.chat(messages)
# Access the raw response to see code execution details
if hasattr(resp, "raw") and "content" in resp.raw:
parts = resp.raw["content"].get("parts", [])
for i, part in enumerate(parts):
print(f"Part {i+1}:")
if "text" in part and part["text"]:
print(f" Text: {part['text'][:100]}", end="")
print(" ..." if len(part["text"]) > 100 else "")
if "executable_code" in part and part["executable_code"]:
print(f" Executable Code: {part['executable_code']}")
if "code_execution_result" in part and part["code_execution_result"]:
print(f" Code Result: {part['code_execution_result']}")
else:
print("No detailed parts found in raw response")
Select models also support image outputs, as well as image inputs. Using the response_modalities config, we can generate and edit images with a Gemini model!
from llama_index.llms.google_genai import GoogleGenAI
import google.genai.types as types
config = types.GenerateContentConfig(
temperature=0.1, response_modalities=["Text", "Image"]
)
llm = GoogleGenAI(
model="gemini-2.5-flash-image-preview", generation_config=config
)
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
messages = [
ChatMessage(role="user", content="Please generate an image of a cute dog")
]
resp = llm.chat(messages)
from PIL import Image
from IPython.display import display
for block in resp.message.blocks:
if isinstance(block, ImageBlock):
image = Image.open(block.resolve_image())
display(image)
elif isinstance(block, TextBlock):
print(block.text)
We can also edit the image!
messages.append(resp.message)
messages.append(
ChatMessage(
role="user",
content="Please edit the image to make the dog a mini-schnauzer, but keep the same overall pose, framing, background, and art style.",
)
)
resp = llm.chat(messages)
for block in resp.message.blocks:
if isinstance(block, ImageBlock):
image = Image.open(block.resolve_image())
display(image)
elif isinstance(block, TextBlock):
print(block.text)
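If you want to keep the results, the returned image blocks can also be written to disk; a small sketch:
# optionally save the generated/edited image(s) to disk
for i, block in enumerate(resp.message.blocks):
    if isinstance(block, ImageBlock):
        Image.open(block.resolve_image()).save(f"dog_{i}.png")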