Ollama + gpt-oss Cookbook

OpenAI's latest open-source models, gpt-oss, have been released.

They come in two sizes:

  • 20 billion parameter model
  • 120 billion parameter model

These models are Apache 2.0 licensed and can be run locally on your machine. In this cookbook, we will use Ollama to run gpt-oss locally, demonstrate its capabilities, and test its agentic and chain-of-thought behavior.

Setup

First, follow the Ollama README to set up and run a local Ollama instance; a command to pull the gpt-oss weights is shown after the list below.

When the Ollama app is running on your local machine:

  • All of your local models are automatically served on localhost:11434
  • Select your model when setting llm = Ollama(..., model="<model family>:<version>")
  • Increase the default timeout (30 seconds) if needed by setting Ollama(..., request_timeout=300.0)
  • If you set llm = Ollama(..., model="<model family>") without a version, it will simply look for the latest
  • By default, the maximum context window for your model is used. You can manually set the context_window to limit memory usage.
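
Before running the examples, make sure the model weights are available locally. Assuming the standard Ollama CLI is on your PATH, you can pull the 20B variant like this (the download is several gigabytes):

python
# Pull the 20B gpt-oss weights once via the Ollama CLI.
# The leading "!" shells out from a notebook cell; drop it in a regular terminal.
!ollama pull gpt-oss:20b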

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-ollama

Chain-of-thought / Thinking with gpt-oss

Ollama supports a thinking option for gpt-oss models, which streams the model's chain-of-thought separately from its final answer. Let's test this out with a few examples.

python
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=True,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)
python
resp_gen = await llm.astream_complete("What is 1234 * 5678?")

still_thinking = True
print("====== THINKING ======")
async for chunk in resp_gen:
    if still_thinking and chunk.additional_kwargs.get("thinking_delta"):
        print(chunk.additional_kwargs["thinking_delta"], end="", flush=True)
    elif still_thinking:
        still_thinking = False
        print("\n====== ANSWER ======")

    if not still_thinking:
        print(chunk.delta, end="", flush=True)
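
The reasoning trace can also be retrieved without streaming. Here is a minimal sketch using acomplete; the exact key under which the thinking text is surfaced in additional_kwargs is an assumption and may differ between llama-index-llms-ollama versions:

python
# Non-streaming variant: the completion response carries the final text, and
# (assumption) the reasoning trace under additional_kwargs["thinking"].
resp = await llm.acomplete("What is 1234 * 5678?")

print("====== THINKING ======")
print(resp.additional_kwargs.get("thinking", "<no thinking returned>"))
print("====== ANSWER ======")
print(resp.text)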

Creating agents with gpt-oss

While a direct response to a prompt is fine, we can also incorporate tools to get more precise results and build an agent.

python
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.ollama import Ollama


def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b


llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=360,
    thinking=False,
    temperature=1.0,
    # Supports up to 130K tokens, lowering to save memory
    context_window=8000,
)

agent = FunctionAgent(
    tools=[multiply],
    llm=llm,
    system_prompt="You are a helpful assistant that can multiply numbers. Always rely on tools for math operations.",
)
python
from llama_index.core.agent.workflow import (
    ToolCall,
    ToolCallResult,
    AgentStream,
)

handler = agent.run("What is 1234 * 5678?")
async for ev in handler.stream_events():
    if isinstance(ev, ToolCall):
        print(f"\nTool call: {ev.tool_name}({ev.tool_kwargs}")
    elif isinstance(ev, ToolCallResult):
        print(
            f"\nTool call: {ev.tool_name}({ev.tool_kwargs}) -> {ev.tool_output}"
        )
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

resp = await handler
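
Once the stream is exhausted, awaiting the handler gives the agent's final output. Printing it, and (assuming the output object records the tool calls it made) inspecting those calls, is a quick way to verify the agent actually used the multiply tool:

python
# Final output of the run; str() is expected to render the assistant's final message.
print(str(resp))

# (Assumption) the output also records which tools were invoked and with what arguments.
for tool_call in resp.tool_calls:
    print(tool_call.tool_name, tool_call.tool_kwargs)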

Remembering past events with Agents

By default, agent runs do not remember past events. However, using the Context, we can maintain state between calls.

python
from llama_index.core.workflow import Context

ctx = Context(agent)

resp = await agent.run("What is 1234 * 5678?", ctx=ctx)
resp = await agent.run("What was the last question/answer pair?", ctx=ctx)
python
print(resp.response.content)
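
If the conversation state should survive beyond the current process, the Context can be serialized and restored. A minimal sketch, assuming the JsonSerializer-based to_dict/from_dict helpers available in recent llama_index releases:

python
from llama_index.core.workflow import JsonSerializer

# Snapshot the conversation state into a JSON-safe dict (persist it however you like).
ctx_dict = ctx.to_dict(serializer=JsonSerializer())

# Later: rebuild the Context for the same agent and continue the conversation.
restored_ctx = Context.from_dict(agent, ctx_dict, serializer=JsonSerializer())
resp = await agent.run("What did we compute earlier?", ctx=restored_ctx)
print(resp.response.content)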