
OpenAI Responses API

This notebook shows how to use the OpenAI Responses LLM.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index llama-index-llms-openai

Basic Usage

python
import os

os.environ["OPENAI_API_KEY"] = "..."
python
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(
    model="gpt-4o-mini",
    # api_key="some key",  # uses OPENAI_API_KEY env var by default
)

Call complete with a prompt

python
resp = llm.complete("Paul Graham is ")
python
print(resp)

Call chat with a list of messages

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
python
print(resp)

Streaming

Using stream_complete endpoint

python
resp = llm.stream_complete("Paul Graham is ")
python
for r in resp:
    print(r.delta, end="")

Using stream_chat endpoint

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
python
for r in resp:
    print(r.delta, end="")

Configure Parameters

The Responses API supports many options:

  • Setting the model name
  • Generation parameters like temperature, top_p, max_output_tokens
  • Enabling built-in tool calling
  • Setting the reasoning effort for O-series models
  • Tracking previous responses for automatic conversation history (see the sketch after this list)
  • and more!
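
For the conversation-history option above, here is a minimal sketch. The track_previous_responses flag is an assumption taken from the option list, so confirm the parameter name against the constructor in your installed llama-index-llms-openai version.

python
from llama_index.llms.openai import OpenAIResponses

# NOTE: `track_previous_responses` is an assumed flag name based on the
# option list above -- verify it exists in your installed version.
llm = OpenAIResponses(
    model="gpt-4o-mini",
    track_previous_responses=True,
)

# With response tracking enabled, each call is linked to the previous
# response, so the model keeps conversational context between calls.
print(llm.complete("My favorite color is blue."))
print(llm.complete("What is my favorite color?"))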

Basic Parameters

python
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(
    model="gpt-4o-mini",
    temperature=0.5,  # default is 0.1
    max_output_tokens=100,  # default is None
    top_p=0.95,  # default is 1.0
)

Built-in Tool Calling

The Responses API supports built-in tool calling, which you can read more about in the OpenAI documentation.

Configuring this means that the LLM will automatically call the tool and use it to augment the response.

Tools are defined as a list of dictionaries, each containing settings for a tool.

Below is an example of using the built-in web search tool.

python
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage

llm = OpenAIResponses(
    model="gpt-4o-mini",
    built_in_tools=[{"type": "web_search_preview"}],
)

resp = llm.chat(
    [ChatMessage(role="user", content="What is the weather in San Francisco?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)

Reasoning Effort

For O-series models, you can set the reasoning effort to control the amount of time the model will spend reasoning.

See the OpenAI API docs for more information.

python
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage

llm = OpenAIResponses(
    model="o3-mini",
    reasoning_options={"effort": "high"},
)

resp = llm.chat(
    [ChatMessage(role="user", content="What is the meaning of life?")]
)
print(resp)
print("========" * 2)
print(resp.additional_kwargs)

Image Support

OpenAI supports images in the input of chat messages for many models.

Using the content blocks feature of chat messages, you can easily combine text and images in a single LLM prompt.

python
!wget https://cdn.pixabay.com/photo/2016/07/07/16/46/dice-1502706_640.jpg -O image.png
python
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(model="gpt-4o")

messages = [
    ChatMessage(
        role="user",
        blocks=[
            ImageBlock(path="image.png"),
            TextBlock(text="Describe the image in a few sentences."),
        ],
    )
]

resp = llm.chat(messages)
print(resp.message.content)

Using Function/Tool Calling

OpenAI models have native support for function calling. This conveniently integrates with LlamaIndex tool abstractions, letting you plug in any arbitrary Python function to the LLM.

In the example below, we define a function to generate a Song object.

python
from pydantic import BaseModel
from llama_index.core.tools import FunctionTool


class Song(BaseModel):
    """A song with name and artist"""

    name: str
    artist: str


def generate_song(name: str, artist: str) -> Song:
    """Generates a song with provided name and artist."""
    return Song(name=name, artist=artist)


tool = FunctionTool.from_defaults(fn=generate_song)

The strict parameter tells OpenAI whether or not to use constrained sampling when generating tool calls/structured outputs. This means that the generated tool call schema will always contain the expected fields.

Since this seems to increase latency, it defaults to False.

python
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(model="gpt-4o-mini", strict=True)
response = llm.predict_and_call(
    [tool],
    "Write a random song for me",
    # strict=True  # can also be set at the function level to override the class
)
print(str(response))

We can also make multiple function calls in parallel.

python
llm = OpenAIResponses(model="gpt-4o-mini")
response = llm.predict_and_call(
    [tool],
    "Generate five songs from the Beatles",
    allow_parallel_tool_calls=True,
)
for s in response.sources:
    print(f"Name: {s.tool_name}, Input: {s.raw_input}, Output: {str(s)}")

Manual Tool Calling

If you want to control how a tool is called, you can also split the tool calling and tool selection into their own steps.

First, let's select a tool.

python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(model="gpt-4o-mini")

chat_history = [ChatMessage(role="user", content="Write a random song for me")]

resp = llm.chat_with_tools([tool], chat_history=chat_history)

Now, let's call the tool the LLM selected (if any).

If there was a tool call, we should send the results to the LLM to generate the final response (or another tool call!).

python
tools_by_name = {t.metadata.name: t for t in [tool]}
tool_calls = llm.get_tool_calls_from_response(
    resp, error_on_no_tool_call=False
)

while tool_calls:
    # add the LLM's response to the chat history
    chat_history.append(resp.message)

    for tool_call in tool_calls:
        tool_name = tool_call.tool_name
        tool_kwargs = tool_call.tool_kwargs

        print(f"Calling {tool_name} with {tool_kwargs}")
        tool_output = tool(**tool_kwargs)
        chat_history.append(
            ChatMessage(
                role="tool",
                content=str(tool_output),
                # most LLMs like OpenAI need to know the tool call id
                additional_kwargs={"call_id": tool_call.tool_id},
            )
        )

        resp = llm.chat_with_tools([tool], chat_history=chat_history)
        tool_calls = llm.get_tool_calls_from_response(
            resp, error_on_no_tool_call=False
        )

Now, we should have a final response!

python
print(resp.message.content)

Structured Prediction

An important use case for function calling is extracting structured objects. LlamaIndex provides an intuitive interface for converting any LLM into a structured LLM: simply define the target Pydantic class (it can be nested), and given a prompt, we extract the desired object.

python
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.prompts import PromptTemplate
from pydantic import BaseModel
from typing import List


class MenuItem(BaseModel):
    """A menu item in a restaurant."""

    course_name: str
    is_vegetarian: bool


class Restaurant(BaseModel):
    """A restaurant with name, city, and cuisine."""

    name: str
    city: str
    cuisine: str
    menu_items: List[MenuItem]


llm = OpenAIResponses(model="gpt-4o-mini")
prompt_tmpl = PromptTemplate(
    "Generate a restaurant in a given city {city_name}"
)
# Option 1: Use `as_structured_llm`
restaurant_obj = (
    llm.as_structured_llm(Restaurant)
    .complete(prompt_tmpl.format(city_name="Dallas"))
    .raw
)
# Option 2: Use `structured_predict`
# restaurant_obj = llm.structured_predict(Restaurant, prompt_tmpl, city_name="Miami")
python
restaurant_obj

Async

python
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(model="gpt-4o")
python
resp = await llm.acomplete("Paul Graham is ")
python
print(resp)
python
resp = await llm.astream_complete("Paul Graham is ")
python
async for delta in resp:
    print(delta.delta, end="")

Async function calling is also supported.

python
llm = OpenAIResponses(model="gpt-4o-mini")
response = await llm.apredict_and_call([tool], "Generate a random song")
print(str(response))

Additional kwargs

If there are additional kwargs not present in the constructor, you can set them at a per-instance level with additional_kwargs.

These will be passed into every call to the LLM.

python
from llama_index.llms.openai import OpenAIResponses

llm = OpenAIResponses(
    model="gpt-4o-mini", additional_kwargs={"user": "your_user_id"}
)
resp = llm.complete("Paul Graham is ")
print(resp)

Image generation

You can use image generation by passing {'type': 'image_generation'} as a built-in tool or, if you want to enable streaming, {'type': 'image_generation', 'partial_images': 2}:

python
import base64
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock

# run without streaming
llm = OpenAIResponses(
    model="gpt-4.1-mini", built_in_tools=[{"type": "image_generation"}]
)
messages = [
    ChatMessage.from_str(
        content="A llama dancing with a cat in a meadow", role="user"
    )
]
response = llm.chat(
    messages
)  # response = await llm.achat(messages) for an async implementation
for block in response.message.blocks:
    if isinstance(block, ImageBlock):
        with open("llama_and_cat_dancing.png", "wb") as f:
            f.write(base64.b64decode(block.image))
    elif isinstance(block, TextBlock):
        print(block.text)

# run with streaming
llm_stream = OpenAIResponses(
    model="gpt-4.1-mini",
    built_in_tools=[{"type": "image_generation", "partial_images": 2}],
)
response = llm_stream.stream_chat(
    messages
)  # response = await llm_stream.astream_chat(messages) for an async implementation
for event in response:
    for block in event.message.blocks:
        if isinstance(block, ImageBlock):
            # block.detail contains the ID of the image
            with open(f"llama_and_cat_dancing_{block.detail}.png", "wb") as f:
                f.write(base64.b64decode(block.image))
        elif isinstance(block, TextBlock):
            print(block.text)

MCP Remote calls

You can call any remote MCP server through the OpenAI Responses API just by passing the MCP server details as a built-in tool to the LLM.

python
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage

llm = OpenAIResponses(
    model="gpt-4.1",
    built_in_tools=[
        {
            "type": "mcp",
            "server_label": "deepwiki",
            "server_url": "https://mcp.deepwiki.com/mcp",
            "require_approval": "never",
        }
    ],
)
messages = [
    ChatMessage.from_str(
        content="What transport protocols are supported in the 2025-03-26 version of the MCP spec?",
        role="user",
    )
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the MCP tool call
print(response.raw.output[0])

Code interpreter

You can use the Code Interpreter by passing {"type": "code_interpreter", "container": {"type": "auto"}} as a built-in tool.

python
from llama_index.llms.openai import OpenAIResponses
from llama_index.core.llms import ChatMessage

llm = OpenAIResponses(
    model="gpt-4.1",
    built_in_tools=[
        {
            "type": "code_interpreter",
            "container": {"type": "auto"},
        }
    ],
)
messages = [
    ChatMessage.from_str(
        content="I need to solve the equation 3x + 11 = 14. Can you help me?",
        role="user",
    )
]
response = llm.chat(messages)
# see the textual output
print(response.message.content)
# see the code interpreter tool call
print(response.raw.output[0])