docs/examples/llm/google_genai.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/gemini.ipynb" target="_parent"></a>
In this notebook, we show how to use the google-genai Python SDK with LlamaIndex to interact with Google GenAI models.
If you're opening this notebook on Colab, you will need to install LlamaIndex 🦙 and the google-genai Python SDK.
%pip install llama-index-llms-google-genai llama-index
You will need to get an API key from Google AI Studio. Once you have one, you can either pass it explicitly to the model or use the GOOGLE_API_KEY environment variable.
import os
os.environ["GOOGLE_API_KEY"] = "..."
You can call complete with a prompt:
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(
model="gemini-2.5-flash",
# api_key="some key", # uses GOOGLE_API_KEY env var by default
)
resp = llm.complete("Who is Paul Graham?")
print(resp)
You can also call chat with a list of chat messages:
from llama_index.core.llms import ChatMessage
from llama_index.llms.google_genai import GoogleGenAI
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="Tell me a story"),
]
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = llm.chat(messages)
print(resp)
Every method supports streaming through the stream_ prefix.
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = llm.stream_complete("Who is Paul Graham?")
for r in resp:
print(r.delta, end="")
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = llm.stream_chat(messages)
for r in resp:
print(r.delta, end="")
Every synchronous method has an async counterpart.
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
resp = await llm.astream_complete("Who is Paul Graham?")
async for r in resp:
print(r.delta, end="")
messages = [
ChatMessage(role="user", content="Who is Paul Graham?"),
]
resp = await llm.achat(messages)
print(resp)
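The remaining async variants follow the same pattern. As a minimal sketch, acomplete and astream_chat reuse the same model and messages from above:
resp = await llm.acomplete("Who is Paul Graham?")
print(resp)

resp = await llm.astream_chat(messages)
async for r in resp:
    print(r.delta, end="")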
By providing a project and location (either through environment variables or directly via vertexai_config), you can route requests through Vertex AI.
# Set environment variables (use os.environ so they are visible to the kernel)
import os

os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"
os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-id"
os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
from llama_index.llms.google_genai import GoogleGenAI
# or set the parameters directly
llm = GoogleGenAI(
model="gemini-2.5-flash",
vertexai_config={"project": "your-project-id", "location": "us-central1"},
# you should set the context window to the max input tokens for the model
context_window=200000,
max_tokens=512,
)
Google GenAI supports cached content for improved performance and cost efficiency when reusing large contexts across multiple requests. This is particularly useful for RAG applications, document analysis, and multi-turn conversations with consistent context.
First, create cached content using the Google GenAI SDK:
from google import genai
from google.genai.types import CreateCachedContentConfig, Content, Part
import time
client = genai.Client(api_key="your-api-key")
# For Vertex AI
# from google.genai.types import HttpOptions
# client = genai.Client(
#     http_options=HttpOptions(api_version="v1"),
#     project="your-project-id",
#     location="us-central1",
#     vertexai=True,
# )
Option 1: Upload Local Files
# Upload and process local PDF files
pdf_file = client.files.upload(file="./your_document.pdf")
while pdf_file.state.name == "PROCESSING":
print("Waiting for PDF to be processed.")
time.sleep(2)
pdf_file = client.files.get(name=pdf_file.name)
# Create cache with uploaded file
cache = client.caches.create(
model="gemini-2.5-flash",
config=CreateCachedContentConfig(
display_name="Document Analysis Cache",
system_instruction=(
"You are an expert document analyzer. Answer questions "
"based on the provided documents with accuracy and detail."
),
contents=[pdf_file], # Direct file reference
ttl="3600s", # Cache for 1 hour
),
)
Option 2: Multiple Files with Content Structure
# For multiple files or Cloud Storage files with VertexAI
contents = [
Content(
role="user",
parts=[
Part.from_uri(
# file_uri=pdf_file.uri, # you can use the uploaded file's URI too
file_uri="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
mime_type="application/pdf",
),
Part.from_uri(
file_uri="gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
mime_type="application/pdf",
),
],
)
]
cache = client.caches.create(
model="gemini-2.5-flash",
config=CreateCachedContentConfig(
display_name="Multi-Document Cache",
system_instruction=(
"You are an expert researcher. Analyze and compare "
"information across the provided documents."
),
contents=contents,
ttl="3600s",
),
)
print(f"Cache created: {cache.name}")
print(f"Cached tokens: {cache.usage_metadata.total_token_count}")
Using Cached Content with LlamaIndex
Once you have created the cache, use it with LlamaIndex:
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
llm = GoogleGenAI(
model="gemini-2.5-flash",
api_key="your-api-key",
cached_content=cache.name,
)
# For VertexAI
# llm = GoogleGenAI(
# model="gemini-2.5-flash",
# vertexai_config={"project": "your-project-id", "location": "us-central1"},
# cached_content=cache.name
# )
# Use the cached content
message = ChatMessage(
role="user", content="Summarize the key findings from Chapter 4."
)
response = llm.chat([message])
print(response)
Using Cached Content in Generation Config
For request-level caching control:
import google.genai.types as types
# Specify cached content per request
config = types.GenerateContentConfig(
cached_content=cache.name, temperature=0.1, max_output_tokens=1024
)
llm = GoogleGenAI(model="gemini-2.5-flash", generation_config=config)
response = llm.complete("List the first five chapters of the document")
print(response)
Cache Management
# List all caches
caches = client.caches.list()
for cache_item in caches:
print(f"Cache: {cache_item.display_name} ({cache_item.name})")
print(f"Tokens: {cache_item.usage_metadata.total_token_count}")
# Get cache details
cache_info = client.caches.get(name=cache.name)
print(f"Created: {cache_info.create_time}")
print(f"Expires: {cache_info.expire_time}")
# Delete cache when done
client.caches.delete(name=cache.name)
print("Cache deleted")
Using ChatMessage objects, you can pass in images and text to the LLM.
!wget https://cdn.pixabay.com/photo/2021/12/12/20/00/play-6865967_640.jpg -O image.jpg
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.google_genai import GoogleGenAI
llm = GoogleGenAI(model="gemini-2.5-flash")
messages = [
ChatMessage(
role="user",
blocks=[
ImageBlock(path="image.jpg", image_mimetype="image/jpeg"),
TextBlock(text="What is in this image?"),
],
)
]
resp = llm.chat(messages)
print(resp)
You can also pass in documents.
from llama_index.core.llms import DocumentBlock
messages = [
ChatMessage(
role="user",
blocks=[
DocumentBlock(
path="/path/to/your/test.pdf",
document_mimetype="application/pdf",
),
TextBlock(text="Describe the document in a sentence."),
],
)
]
resp = llm.chat(messages)
print(resp)
Finally, you can also pass videos.
from llama_index.core.llms import VideoBlock
messages = [
ChatMessage(
role="user",
blocks=[
VideoBlock(
path="/path/to/your/video.mp4", video_mimetype="video/mp4"
),
TextBlock(text="Describe this video in a sentence."),
],
)
]
resp = llm.chat(messages)
print(resp)
LlamaIndex provides an intuitive interface for converting any LLM into a structured LLM through structured_predict: simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.prompts import PromptTemplate
from llama_index.core.bridge.pydantic import BaseModel
from typing import List
class MenuItem(BaseModel):
"""A menu item in a restaurant."""
course_name: str
is_vegetarian: bool
class Restaurant(BaseModel):
"""A restaurant with name, city, and cuisine."""
name: str
city: str
cuisine: str
menu_items: List[MenuItem]
llm = GoogleGenAI(model="gemini-2.5-flash")
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
llm.as_structured_llm(Restaurant)
.complete(prompt_tmpl.format(city_name="Miami"))
.raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
print(restaurant_obj)
Any LLM wrapped with as_structured_llm supports streaming through stream_chat.
from llama_index.core.llms import ChatMessage
from IPython.display import clear_output
from pprint import pprint
input_msg = ChatMessage.from_str("Generate a restaurant in San Francisco")
sllm = llm.as_structured_llm(Restaurant)
stream_output = sllm.stream_chat([input_msg])
for partial_output in stream_output:
clear_output(wait=True)
pprint(partial_output.raw.dict())
restaurant_obj = partial_output.raw
restaurant_obj
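Structured prediction also has async counterparts; a minimal sketch using astructured_predict with the same prompt template:
restaurant_obj = await llm.astructured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
)
print(restaurant_obj)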
Google GenAI supports direct tool/function calling through the API. Using LlamaIndex, we can implement some core agentic tool calling patterns.
from llama_index.core.tools import FunctionTool
from llama_index.core.llms import ChatMessage
from llama_index.llms.google_genai import GoogleGenAI
from datetime import datetime
llm = GoogleGenAI(model="gemini-2.5-flash")
def get_current_time(timezone: str) -> dict:
"""Get the current time"""
return {
"time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
"timezone": timezone,
}
# uses the tool name, any type annotations, and docstring to describe the tool
tool = FunctionTool.from_defaults(fn=get_current_time)
We can simply do a single pass to call the tool and get the result:
resp = llm.predict_and_call([tool], "What is the current time in New York?")
print(resp)
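The async variant works the same way; a minimal sketch with apredict_and_call:
resp = await llm.apredict_and_call(
    [tool], "What is the current time in New York?"
)
print(resp)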
We can also use lower-level APIs to implement an agentic tool-calling loop!
chat_history = [
ChatMessage(role="user", content="What is the current time in New York?")
]
tools_by_name = {t.metadata.name: t for t in [tool]}
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
if not tool_calls:
print(resp)
else:
while tool_calls:
# add the LLM's response to the chat history
chat_history.append(resp.message)
for tool_call in tool_calls:
tool_name = tool_call.tool_name
tool_kwargs = tool_call.tool_kwargs
print(f"Calling {tool_name} with {tool_kwargs}")
tool_output = tools_by_name[tool_name].call(**tool_kwargs)
print("Tool output: ", tool_output)
chat_history.append(
ChatMessage(
role="tool",
content=str(tool_output),
# most LLMs like Gemini, Anthropic, OpenAI, etc. need to know the tool call id
additional_kwargs={"tool_call_id": tool_call.tool_id},
)
)
resp = llm.chat_with_tools([tool], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print("Final response: ", resp.message.content)
We can also call multiple tools simultaneously in a single request, making it efficient for complex queries that require different types of information.
# Define another tool for temperature
def get_temperature(city: str) -> dict:
"""Get the current temperature for a city"""
return {
"city": city,
"temperature": "25°C",
}
# Create tools from functions
tool1 = FunctionTool.from_defaults(fn=get_current_time)
tool2 = FunctionTool.from_defaults(fn=get_temperature)
# Ask a question that requires both tools
chat_history = [
ChatMessage(
role="user",
content="What is the current time and temperature in New York?",
)
]
# The model will intelligently decide which tools to call
resp = llm.chat_with_tools([tool1, tool2], chat_history=chat_history)
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
print(f"Model made {len(tool_calls)} tool calls:")
for i, tool_call in enumerate(tool_calls, 1):
print(f"{i}. {tool_call.tool_name} with args: {tool_call.tool_kwargs}")
Google Gemini 2.0 and 2.5 models support Google Search grounding, which allows the model to search for real-time information and ground its responses with web search results. This is particularly useful for getting up-to-date information.
The built_in_tool parameter accepts Google Search tools that enable the model to ground its responses with real-world data from Google Search results.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
from google.genai import types
# Create Google Search grounding tool
grounding_tool = types.Tool(google_search=types.GoogleSearch())
llm = GoogleGenAI(
model="gemini-2.5-flash",
built_in_tool=grounding_tool,
)
resp = llm.complete("When is the next total solar eclipse in the US?")
print(resp)
The Google Search grounding tool lets the model pull in up-to-date information and attach grounding metadata, such as the search queries it ran and the sources it used, to the response.
You can also use the grounding tool with chat messages:
# Using Google Search with chat messages
messages = [ChatMessage(role="user", content="Who won the Euro 2024?")]
resp = llm.chat(messages)
print(resp)
# You can access grounding metadata from the raw response
if hasattr(resp, "raw") and "grounding_metadata" in resp.raw:
print(resp.raw["grounding_metadata"])
else:
print("\nNo grounding metadata in this response")
The built_in_tool parameter also accepts code execution tools that enable the model to write and execute Python code to solve problems, perform calculations, and analyze data. This is particularly useful for mathematical computations, data analysis, and generating visualizations.
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.llms import ChatMessage
from google.genai import types
# Create code execution tool
code_execution_tool = types.Tool(code_execution=types.ToolCodeExecution())
llm = GoogleGenAI(
model="gemini-2.5-flash",
built_in_tool=code_execution_tool,
)
resp = llm.complete("Calculate 20th fibonacci number.")
print(resp)
When the model uses code execution, you can access the executed code, its results, and other metadata through the raw response. This includes the model's explanatory text, the generated code (executable_code), and its output (code_execution_result).
Let's see this in action:
# Request a calculation that will likely use code execution
messages = [
ChatMessage(
role="user", content="What is the sum of the first 50 prime numbers?"
)
]
resp = llm.chat(messages)
# Access the raw response to see code execution details
if hasattr(resp, "raw") and "content" in resp.raw:
parts = resp.raw["content"].get("parts", [])
for i, part in enumerate(parts):
print(f"Part {i+1}:")
if "text" in part and part["text"]:
print(f" Text: {part['text'][:100]}", end="")
print(" ..." if len(part["text"]) > 100 else "")
if "executable_code" in part and part["executable_code"]:
print(f" Executable Code: {part['executable_code']}")
if "code_execution_result" in part and part["code_execution_result"]:
print(f" Code Result: {part['code_execution_result']}")
else:
print("No detailed parts found in raw response")
Select models also support image outputs, as well as image inputs. Using the response_modalities config, we can generate and edit images with a Gemini model!
from llama_index.llms.google_genai import GoogleGenAI
import google.genai.types as types
config = types.GenerateContentConfig(
temperature=0.1, response_modalities=["Text", "Image"]
)
llm = GoogleGenAI(
model="gemini-2.5-flash-image-preview", generation_config=config
)
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
messages = [
ChatMessage(role="user", content="Please generate an image of a cute dog")
]
resp = llm.chat(messages)
from PIL import Image
from IPython.display import display
for block in resp.message.blocks:
if isinstance(block, ImageBlock):
image = Image.open(block.resolve_image())
display(image)
elif isinstance(block, TextBlock):
print(block.text)
We can also edit the image!
messages.append(resp.message)
messages.append(
ChatMessage(
role="user",
content="Please edit the image to make the dog a mini-schnauzer, but keep the same overall pose, framing, background, and art style.",
)
)
resp = llm.chat(messages)
for block in resp.message.blocks:
if isinstance(block, ImageBlock):
image = Image.open(block.resolve_image())
display(image)
elif isinstance(block, TextBlock):
print(block.text)
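If you want to keep the results, the returned image blocks can also be written to disk; a small sketch:
# optionally save the generated/edited image(s) to disk
for i, block in enumerate(resp.message.blocks):
    if isinstance(block, ImageBlock):
        Image.open(block.resolve_image()).save(f"dog_{i}.png")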