docs/examples/llm/openai_responses.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/openai_responses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
This notebook shows how to use the OpenAI Responses LLM.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index llama-index-llms-openai
import os
os.environ["OPENAI_API_KEY"] = "..."
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini",
# api_key="some key", # uses OPENAI_API_KEY env var by default
)
Call complete with a prompt.
resp = llm.complete("Paul Graham is ")
print(resp)
Call chat with a list of messages.
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)
Using stream_complete endpoint
resp = llm.stream_complete("Paul Graham is ")
for r in resp:
    print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
The Responses API supports many configuration options:
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini",
temperature=0.5, # default is 0.1
max_output_tokens=100, # default is None
top_p=0.95, # default is 1.0
)
The Responses API supports built-in tool calling, which you can read more about in the OpenAI documentation.
Configuring this means that the LLM will automatically call the tool and use it to augment the response.
Tools are defined as a list of dictionaries, each containing settings for a tool.
Below is an example of using the built-in web search tool.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4o-mini",
built_in_tools=[{"type": "web_search_preview"}],
)
resp = llm.chat(
[ChatMessage(role="user", content="What is the weather in San Francisco?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)
For O-series models, you can set the reasoning effort to control the amount of time the model will spend reasoning.
See the OpenAI API docs for more information.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="o3-mini",
reasoning_options={"effort": "high"},
)
resp = llm.chat(
[ChatMessage(role="user", content="What is the meaning of life?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)
OpenAI has support for images in the input of chat messages for many models.
Using the content blocks feature of chat messages, you can easily combine text and images in a single LLM prompt.
!wget https://cdn.pixabay.com/photo/2016/07/07/16/46/dice-1502706_640.jpg -O image.png
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o")
messages = [
ChatMessage(
role="user",
blocks=[
ImageBlock(path="image.png"),
TextBlock(text="Describe the image in a few sentences."),
],
)
]
resp = llm.chat(messages)
print(resp.message.content)
OpenAI models have native support for function calling. This conveniently integrates with LlamaIndex tool abstractions, letting you plug in any arbitrary Python function to the LLM.
In the example below, we define a function to generate a Song object.
from pydantic import BaseModel
from llama_index.core.tools import FunctionTool
class Song(BaseModel):
    """A song with name and artist."""

    name: str
    artist: str


def generate_song(name: str, artist: str) -> Song:
    """Generates a song with provided name and artist."""
    return Song(name=name, artist=artist)
tool = FunctionTool.from_defaults(fn=generate_song)
The strict parameter tells OpenAI whether or not to use constrained sampling when generating tool calls/structured outputs. This means that the generated tool call schema will always contain the expected fields.
Since this seems to increase latency, it defaults to false.
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o-mini", strict=True)
response = llm.predict_and_call(
[tool],
"Write a random song for me",
# strict=True # can also be set at the function level to override the class
)
print(str(response))
We can also do multiple function calling.
llm = OpenAIResponses(model="gpt-4o-mini")
response = llm.predict_and_call(
[tool],
"Generate five songs from the Beatles",
allow_parallel_tool_calls=True,
)
for s in response.sources:
    print(f"Name: {s.tool_name}, Input: {s.raw_input}, Output: {str(s)}")
If you want to control how a tool is called, you can also split the tool calling and tool selection into their own steps.
First, let's select a tool.
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o-mini")
chat_history = [ChatMessage(role="user", content="Write a random song for me")]
resp = llm.chat_with_tools([tool], chat_history=chat_history)
Now, let's call the tool the LLM selected (if any).
If there was a tool call, we should send the results to the LLM to generate the final response (or another tool call!).
tools_by_name = {t.metadata.name: t for t in [tool]}
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
while tool_calls:
    # add the LLM's response to the chat history
    chat_history.append(resp.message)

    for tool_call in tool_calls:
        tool_name = tool_call.tool_name
        tool_kwargs = tool_call.tool_kwargs

        print(f"Calling {tool_name} with {tool_kwargs}")
        # look up the selected tool by name and call it
        tool_output = tools_by_name[tool_name](**tool_kwargs)
        chat_history.append(
            ChatMessage(
                role="tool",
                content=str(tool_output),
                # most LLMs like OpenAI need to know the tool call id
                additional_kwargs={"call_id": tool_call.tool_id},
            )
        )

    resp = llm.chat_with_tools([tool], chat_history=chat_history)
    tool_calls = llm.get_tool_calls_from_response(
        resp, error_on_no_tool_call=False
    )
Now, we should have a final response!
print(resp.message.content)
An important use case for function calling is extracting structured objects. LlamaIndex provides an intuitive interface for converting any LLM into a structured LLM: simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel
from typing import List
class MenuItem(BaseModel):
    """A menu item in a restaurant."""

    course_name: str
    is_vegetarian: bool


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str
    menu_items: List[MenuItem]
llm = OpenAIResponses(model="gpt-4o-mini")
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
llm.as_structured_llm(Restaurant)
.complete(prompt_tmpl.format(city_name="Dallas"))
.raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
restaurant_obj
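Structured prediction can also be streamed, yielding partially-populated objects as they are generated. Below is a minimal sketch using LlamaIndex's generic stream_structured_predict interface with the same Restaurant model and prompt template defined above; each yielded item is a partial Restaurant object.
# Option 3 (sketch): stream partial Restaurant objects as they are generated,
# using the generic LlamaIndex `stream_structured_predict` interface
for partial_obj in llm.stream_structured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
):
    print(partial_obj)
Async endpoints are also supported.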
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o")
resp = await llm.acomplete("Paul Graham is ")
print(resp)
resp = await llm.astream_complete("Paul Graham is ")
async for delta in resp:
    print(delta.delta, end="")
Async function calling is also supported.
llm = OpenAIResponses(model="gpt-4o-mini")
response = await llm.apredict_and_call([tool], "Generate a random song")
print(str(response))
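The chat endpoints have async counterparts as well. Here is a minimal sketch of achat and astream_chat; the message content is just an illustrative example.
from llama_index.core.llms import ChatMessage

messages = [ChatMessage(role="user", content="Tell me a joke about llamas")]

# async chat
resp = await llm.achat(messages)
print(resp.message.content)

# async streaming chat
resp = await llm.astream_chat(messages)
async for r in resp:
    print(r.delta, end="")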
If there are additional kwargs not present in the constructor, you can set them at a per-instance level with additional_kwargs.
These will be passed into every call to the LLM.
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini", additional_kwargs={"user": "your_user_id"}
)
resp = llm.complete("Paul Graham is ")
print(resp)
You can use image generation by passing {'type': 'image_generation'} as a built-in tool or, if you want to enable streaming of partial images, {'type': 'image_generation', 'partial_images': 2}:
import base64
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock
# run without streaming
llm = OpenAIResponses(
model="gpt-4.1-mini", built_in_tools=[{"type": "image_generation"}]
)
messages = [
ChatMessage.from_str(
content="A llama dancing with a cat in a meadow", role="user"
)
]
response = llm.chat(
messages
) # response = await llm.achat(messages) for an async implementation
for block in response.message.blocks:
    if isinstance(block, ImageBlock):
        with open("llama_and_cat_dancing.png", "wb") as f:
            f.write(base64.b64decode(block.image))
    elif isinstance(block, TextBlock):
        print(block.text)
# run with streaming
llm_stream = OpenAIResponses(
model="gpt-4.1-mini",
built_in_tools=[{"type": "image_generation", "partial_images": 2}],
)
response = llm_stream.stream_chat(
messages
) # response = await llm_stream.astream_chat(messages) for an async implementation
for event in response:
    for block in event.message.blocks:
        if isinstance(block, ImageBlock):
            # block.detail contains the ID of the image
            with open(f"llama_and_cat_dancing_{block.detail}.png", "wb") as f:
                f.write(base64.b64decode(block.image))
        elif isinstance(block, TextBlock):
            print(block.text)
You can call any remote MCP server through the OpenAI Responses API just by passing the MCP server details as a built-in tool to the LLM.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4.1",
built_in_tools=[
{
"type": "mcp",
"server_label": "deepwiki",
"server_url": "https://mcp.deepwiki.com/mcp",
"require_approval": "never",
}
],
)
messages = [
ChatMessage.from_str(
content="What transport protocols are supported in the 2025-03-26 version of the MCP spec?",
role="user",
)
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the MCP tool call
print(response.raw.output[0])
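The raw Responses output is a list of items (the MCP tool listing, MCP calls, and the final message), so you can loop over all of them rather than inspecting just the first. A small sketch, assuming each output item in the OpenAI SDK exposes a type field:
# print the type of every output item returned by the Responses API
for item in response.raw.output:
    print(item.type)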
You can use the Code Interpreter by passing {"type": "code_interpreter", "container": {"type": "auto"}} as a built-in tool.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4.1",
built_in_tools=[
{
"type": "code_interpreter",
"container": {"type": "auto"},
}
],
)
messages = [
ChatMessage.from_str(
content="I need to solve the equation 3x + 11 = 14. Can you help me?",
role="user",
)
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the code interpreter tool call
print(response.raw.output[0])