docs/examples/llm/openai_responses.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/openai_responses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
This notebook shows how to use the OpenAI Responses LLM.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index llama-index-llms-openai
import os
os.environ["OPENAI_API_KEY"] = "..."
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini",
# api_key="some key", # uses OPENAI_API_KEY env var by default
)
Call complete with a prompt.
resp = llm.complete("Paul Graham is ")
print(resp)
Call chat with a list of messages.
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)
Using stream_complete endpoint
resp = llm.stream_complete("Paul Graham is ")
for r in resp:
    print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
ChatMessage(
role="system", content="You are a pirate with a colorful personality"
),
ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
The Responses API supports many configuration options:
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini",
temperature=0.5, # default is 0.1
max_output_tokens=100, # default is None
top_p=0.95, # default is 1.0
)
The Responses API supports built-in tool calling, which you can read more about in the OpenAI documentation.
Configuring this means that the LLM will automatically call the tool and use it to augment the response.
Tools are defined as a list of dictionaries, each containing settings for a tool.
Below is an example of using the built-in web search tool.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4o-mini",
built_in_tools=[{"type": "web_search_preview"}],
)
resp = llm.chat(
[ChatMessage(role="user", content="What is the weather in San Francisco?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)
For O-series models, you can set the reasoning effort to control the amount of time the model will spend reasoning.
See the OpenAI API docs for more information.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="o3-mini",
reasoning_options={"effort": "high"},
)
resp = llm.chat(
[ChatMessage(role="user", content="What is the meaning of life?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)
OpenAI has support for images in the input of chat messages for many models.
Using the content blocks feature of chat messages, you can easily combine text and images in a single LLM prompt.
!wget https://cdn.pixabay.com/photo/2016/07/07/16/46/dice-1502706_640.jpg -O image.png
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o")
messages = [
ChatMessage(
role="user",
blocks=[
ImageBlock(path="image.png"),
TextBlock(text="Describe the image in a few sentences."),
],
)
]
resp = llm.chat(messages)
print(resp.message.content)
OpenAI models have native support for function calling. This conveniently integrates with LlamaIndex tool abstractions, letting you plug in any arbitrary Python function to the LLM.
In the example below, we define a function to generate a Song object.
from pydantic import BaseModel
from llama_index.core.tools import FunctionTool
class Song(BaseModel):
    """A song with name and artist."""

    name: str
    artist: str


def generate_song(name: str, artist: str) -> Song:
    """Generates a song with provided name and artist."""
    return Song(name=name, artist=artist)
tool = FunctionTool.from_defaults(fn=generate_song)
The strict parameter tells OpenAI whether or not to use constrained sampling when generating tool calls/structured outputs. This means that the generated tool call schema will always contain the expected fields.
Since this seems to increase latency, it defaults to false.
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o-mini", strict=True)
response = llm.predict_and_call(
[tool],
"Write a random song for me",
# strict=True # can also be set at the function level to override the class
)
print(str(response))
We can also do multiple function calling.
llm = OpenAIResponses(model="gpt-4o-mini")
response = llm.predict_and_call(
[tool],
"Generate five songs from the Beatles",
allow_parallel_tool_calls=True,
)
for s in response.sources:
    print(f"Name: {s.tool_name}, Input: {s.raw_input}, Output: {str(s)}")
If you want to control how a tool is called, you can also split the tool calling and tool selection into their own steps.
First, let's select a tool.
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o-mini")
chat_history = [ChatMessage(role="user", content="Write a random song for me")]
resp = llm.chat_with_tools([tool], chat_history=chat_history)
Now, let's call the tool the LLM selected (if any).
If there was a tool call, we should send the results to the LLM to generate the final response (or another tool call!).
tools_by_name = {t.metadata.name: t for t in [tool]}
tool_calls = llm.get_tool_calls_from_response(
resp, error_on_no_tool_call=False
)
while tool_calls:
    # add the LLM's response to the chat history
    chat_history.append(resp.message)

    for tool_call in tool_calls:
        tool_name = tool_call.tool_name
        tool_kwargs = tool_call.tool_kwargs

        print(f"Calling {tool_name} with {tool_kwargs}")
        # look up the selected tool by name and call it
        tool_output = tools_by_name[tool_name](**tool_kwargs)
        chat_history.append(
            ChatMessage(
                role="tool",
                content=str(tool_output),
                # most LLMs like OpenAI need to know the tool call id
                additional_kwargs={"call_id": tool_call.tool_id},
            )
        )

    resp = llm.chat_with_tools([tool], chat_history=chat_history)
    tool_calls = llm.get_tool_calls_from_response(
        resp, error_on_no_tool_call=False
    )
Now, we should have a final response!
print(resp.message.content)
An important use case for function calling is extracting structured objects. LlamaIndex provides an intuitive interface for converting any LLM into a structured LLM: simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel
from typing import List
class MenuItem(BaseModel):
    """A menu item in a restaurant."""

    course_name: str
    is_vegetarian: bool


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str
    menu_items: List[MenuItem]
llm = OpenAIResponses(model="gpt-4o-mini")
prompt_tmpl = PromptTemplate(
"Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
llm.as_structured_llm(Restaurant)
.complete(prompt_tmpl.format(city_name="Dallas"))
.raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
restaurant_obj
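Structured prediction can also be streamed, yielding partially-populated objects as they are generated. Below is a minimal sketch using LlamaIndex's generic stream_structured_predict interface with the same Restaurant model and prompt template defined above; each yielded item is a partial Restaurant object.
# Option 3 (sketch): stream partial Restaurant objects as they are generated,
# using the generic LlamaIndex `stream_structured_predict` interface
for partial_obj in llm.stream_structured_predict(
    Restaurant, prompt_tmpl, city_name="Miami"
):
    print(partial_obj)
Async endpoints are also supported.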
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(model="gpt-4o")
resp = await llm.acomplete("Paul Graham is ")
print(resp)
resp = await llm.astream_complete("Paul Graham is ")
async for delta in resp:
    print(delta.delta, end="")
Async function calling is also supported.
llm = OpenAIResponses(model="gpt-4o-mini")
response = await llm.apredict_and_call([tool], "Generate a random song")
print(str(response))
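The chat endpoints have async counterparts as well. Here is a minimal sketch of achat and astream_chat; the message content is just an illustrative example.
from llama_index.core.llms import ChatMessage

messages = [ChatMessage(role="user", content="Tell me a joke about llamas")]

# async chat
resp = await llm.achat(messages)
print(resp.message.content)

# async streaming chat
resp = await llm.astream_chat(messages)
async for r in resp:
    print(r.delta, end="")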
If there are additional kwargs not present in the constructor, you can set them at a per-instance level with additional_kwargs.
These will be passed into every call to the LLM.
from llama_index.llms.openai import OpenAIResponses
llm = OpenAIResponses(
model="gpt-4o-mini", additional_kwargs={"user": "your_user_id"}
)
resp = llm.complete("Paul Graham is ")
print(resp)
You can use image generation by passing {'type': 'image_generation'} as a built-in tool or, if you want to enable streaming of partial images, {'type': 'image_generation', 'partial_images': 2}:
import base64
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock
# run without streaming
llm = OpenAIResponses(
model="gpt-4.1-mini", built_in_tools=[{"type": "image_generation"}]
)
messages = [
ChatMessage.from_str(
content="A llama dancing with a cat in a meadow", role="user"
)
]
response = llm.chat(
messages
) # response = await llm.achat(messages) for an async implementation
for block in response.message.blocks:
    if isinstance(block, ImageBlock):
        with open("llama_and_cat_dancing.png", "wb") as f:
            f.write(base64.b64decode(block.image))
    elif isinstance(block, TextBlock):
        print(block.text)
# run with streaming
llm_stream = OpenAIResponses(
model="gpt-4.1-mini",
built_in_tools=[{"type": "image_generation", "partial_images": 2}],
)
response = llm_stream.stream_chat(
messages
) # response = await llm_stream.astream_chat(messages) for an async implementation
for event in response:
    for block in event.message.blocks:
        if isinstance(block, ImageBlock):
            # block.detail contains the ID of the image
            with open(f"llama_and_cat_dancing_{block.detail}.png", "wb") as f:
                f.write(base64.b64decode(block.image))
        elif isinstance(block, TextBlock):
            print(block.text)
You can call any remote MCP server through the OpenAI Responses API just by passing the MCP server details as a built-in tool to the LLM.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4.1",
built_in_tools=[
{
"type": "mcp",
"server_label": "deepwiki",
"server_url": "https://mcp.deepwiki.com/mcp",
"require_approval": "never",
}
],
)
messages = [
ChatMessage.from_str(
content="What transport protocols are supported in the 2025-03-26 version of the MCP spec?",
role="user",
)
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the MCP tool call
print(response.raw.output[0])
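The raw Responses output is a list of items (the MCP tool listing, MCP calls, and the final message), so you can loop over all of them rather than inspecting just the first. A small sketch, assuming each output item in the OpenAI SDK exposes a type field:
# print the type of every output item returned by the Responses API
for item in response.raw.output:
    print(item.type)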
You can use the Code Interpreter by passing {"type": "code_interpreter", "container": {"type": "auto"}} as a built-in tool.
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage
llm = OpenAIResponses(
model="gpt-4.1",
built_in_tools=[
{
"type": "code_interpreter",
"container": {"type": "auto"},
}
],
)
messages = [
ChatMessage.from_str(
content="I need to solve the equation 3x + 11 = 14. Can you help me?",
role="user",
)
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the code interpreter tool call
print(response.raw.output[0])