docs/examples/cookbooks/ollama_gpt_oss_cookbook.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/cookbooks/ollama_gpt_oss_cookbook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
# gpt-oss Cookbook

OpenAI's latest open-source models, gpt-oss, have been released. They come in two sizes:

- `gpt-oss:20b`
- `gpt-oss:120b`
These models are Apache 2.0 licensed, and can be run locally on your machine. In this cookbook, we will use Ollama to demonstrate capabilities and test some claims of agentic and chain-of-thought behavior.
First, follow the [Ollama README](https://github.com/ollama/ollama) to set up and run a local Ollama instance.

When the Ollama app is running on your local machine:

- All of your local models are automatically served on `localhost:11434`
- Select your model when creating the `Ollama` instance
- Optionally, set `context_window` to limit memory usage

If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
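With the Ollama server running, pull the model used in this cookbook (we use the 20B variant below; swap in `gpt-oss:120b` if your machine has the memory for it):

```shell
# Download the 20B gpt-oss model into the local Ollama library
ollama pull gpt-oss:20b
```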
%pip install llama-index-llms-ollama
## Thinking with gpt-oss

Ollama supports configuring thinking when using gpt-oss models. Let's test this out with a few examples.
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=True,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)
resp_gen = await llm.astream_complete("What is 1234 * 5678?")

still_thinking = True
print("====== THINKING ======")
async for chunk in resp_gen:
    if still_thinking and chunk.additional_kwargs.get("thinking_delta"):
        print(chunk.additional_kwargs["thinking_delta"], end="", flush=True)
    elif still_thinking:
        still_thinking = False
        print("\n====== ANSWER ======")
    if not still_thinking:
        print(chunk.delta, end="", flush=True)
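The stream-splitting logic above can be exercised without a running model by feeding it fake chunks. This is a minimal sketch: `FakeChunk` is a stand-in for the real streaming response objects, which expose `delta` and `additional_kwargs` the same way.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Stand-in for a streaming chunk: a text delta plus extra metadata."""

    delta: str = ""
    additional_kwargs: dict = field(default_factory=dict)


def split_stream(chunks):
    """Separate thinking deltas from answer deltas, mirroring the loop above."""
    thinking, answer = [], []
    still_thinking = True
    for chunk in chunks:
        if still_thinking and chunk.additional_kwargs.get("thinking_delta"):
            thinking.append(chunk.additional_kwargs["thinking_delta"])
        elif still_thinking:
            # First chunk without a thinking delta marks the start of the answer.
            still_thinking = False
        if not still_thinking:
            answer.append(chunk.delta)
    return "".join(thinking), "".join(answer)


chunks = [
    FakeChunk(additional_kwargs={"thinking_delta": "Multiply step by step. "}),
    FakeChunk(delta="1234 * 5678 "),
    FakeChunk(delta="= 7,006,652"),
]
thinking, answer = split_stream(chunks)
```

Note that the chunk that first lacks a `thinking_delta` is itself part of the answer, which is why the final `if` is not an `elif`.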
## Agents with gpt-oss

While getting a response from a single prompt is fine, we can also incorporate tools to get more precise results, and build an agent.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.ollama import Ollama
def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b
llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=False,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)
agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply and add numbers. Always rely on tools for math operations.",
)
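The agent builds a tool schema for the LLM from the Python function itself. A quick stdlib sketch of the information it reads (name, parameters, type hints, docstring) — this is illustrative introspection, not the actual LlamaIndex code:

```python
import inspect
from typing import get_type_hints


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


# The pieces a tool-calling layer can extract from a plain function:
params = list(inspect.signature(multiply).parameters)  # parameter names
hints = get_type_hints(multiply)  # parameter and return types
description = inspect.getdoc(multiply)  # docstring -> tool description
```

This is why a clear docstring and accurate type hints matter: they are all the model sees when deciding whether and how to call the tool.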
from llama_index.core.agent.workflow import (
    ToolCall,
    ToolCallResult,
    AgentStream,
)
handler = agent.run("What is 1234 * 5678?")
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"\nTool call: {ev.tool_name}({ev.tool_kwargs})")
    elif isinstance(ev, ToolCallResult):
        print(
            f"\nTool call: {ev.tool_name}({ev.tool_kwargs}) -> {ev.tool_output}"
        )
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)
resp = await handler
By default, agent runs do not remember past events. However, using the Context, we can maintain state between calls.
from llama_index.core.workflow import Context
ctx = Context(agent)
resp = await agent.run("What is 1234 * 5678?", ctx=ctx)
resp = await agent.run("What was the last question/answer pair?", ctx=ctx)
print(resp.response.content)
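Conceptually, the `Context` acts as shared state that each run reads and appends to. A toy stand-in in plain Python (not the LlamaIndex API) illustrates why the second question can see the first:

```python
class ToyContext:
    """Toy stand-in for shared agent state: a history that persists across runs."""

    def __init__(self):
        self.history = []  # (question, answer) pairs


def toy_run(question: str, answer: str, ctx: ToyContext) -> None:
    # A real agent would generate `answer`; here we pass it in.
    ctx.history.append((question, answer))


ctx = ToyContext()
toy_run("What is 1234 * 5678?", "7006652", ctx)
# A second run on the same ctx can look back at the earlier exchange:
toy_run("What was the last question/answer pair?", repr(ctx.history[-1]), ctx)
```

Running each question with a fresh context (or no context at all) would instead make every run start from a blank history, which is the default behavior described above.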