docs/examples/llm/ollama.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/ollama.ipynb" target="_parent"></a>
First, follow the readme to set up and run a local Ollama instance.
When the Ollama app is running on your local machine, all of your local models are automatically served on localhost:11434. Select the model when constructing the Ollama instance; by default the model's full context window is used, but you can manually set context_window to limit memory usage.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-ollama
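If you haven't downloaded the model used below yet, you can pull it first; a quick sketch assuming the ollama CLI is available on your PATH:
!ollama pull llama3.1:latest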
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
resp = llm.complete("Who is Paul Graham?")
print(resp)
Call chat with a list of messages
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)
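If you only need the reply text, it is available on the response's message attribute:
print(resp.message.content)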
Using stream_complete endpoint
response = llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
Ollama also supports a JSON mode, which tries to ensure all responses are valid JSON.
This is particularly useful when trying to run tools that need to parse structured outputs.
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    json_mode=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
response = llm.complete(
    "Who is Paul Graham? Output as a structured JSON object."
)
print(str(response))
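Since the response should be valid JSON, you can parse it directly; a minimal sketch using the standard library (the exact keys depend on what the model returns):
import json

data = json.loads(response.text)
print(data)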
We can also attach a Pydantic class to the LLM to ensure structured outputs. This will use Ollama's built-in structured output capabilities for a given Pydantic class.
from llama_index.core.bridge.pydantic import BaseModel


class Song(BaseModel):
    """A song with name and artist."""

    name: str
    artist: str
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
sllm = llm.as_structured_llm(Song)
from llama_index.core.llms import ChatMessage
response = sllm.chat([ChatMessage(role="user", content="Name a random song!")])
print(response.message.content)
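In recent LlamaIndex versions the parsed Pydantic object is also exposed on the response's raw attribute; a small sketch, treating this attribute as an assumption if your version differs:
song = response.raw  # assumed: the parsed Song instance
print(song.name, "-", song.artist)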
Or with async
response = await sllm.achat(
    [ChatMessage(role="user", content="Name a random song!")]
)
print(response.message.content)
You can also stream structured outputs! Streaming a structured output works a little differently from streaming a normal string: the stream yields a generator of the most up-to-date structured object.
response_gen = sllm.stream_chat(
    [ChatMessage(role="user", content="Name a random song!")]
)

for r in response_gen:
    print(r.message.content)
Ollama supports multi-modal models, and the Ollama LLM class natively supports images out of the box.
This leverages the content blocks feature of the chat messages.
Here, we leverage the llama3.2-vision model to answer a question about an image. If you don't have this model yet, you'll want to run ollama pull llama3.2-vision.
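For example, you can pull it directly from the notebook:
!ollama pull llama3.2-vision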
!wget "https://pbs.twimg.com/media/GVhGD1PXkAANfPV?format=jpg&name=4096x4096" -O ollama_image.jpg
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="llama3.2-vision",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
messages = [
    ChatMessage(
        role="user",
        blocks=[
            TextBlock(text="What type of animal is this?"),
            ImageBlock(path="ollama_image.jpg"),
        ],
    ),
]
resp = llm.chat(messages)
print(resp)
Close enough ;)
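An ImageBlock can also reference an image by URL rather than a local path (assuming your llama_index version supports the url field), which would make the wget step above optional; a hypothetical variant of the message:
messages = [
    ChatMessage(
        role="user",
        blocks=[
            TextBlock(text="What type of animal is this?"),
            # assumption: ImageBlock accepts a url field in current versions
            ImageBlock(
                url="https://pbs.twimg.com/media/GVhGD1PXkAANfPV?format=jpg&name=4096x4096"
            ),
        ],
    ),
]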
Models in Ollama support "thinking" -- the process of reasoning and reflecting on a response before returning a final answer.
Below we show how to enable thinking in Ollama models in both streaming and non-streaming modes using the thinking parameter and the qwen3:8b model.
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="qwen3:8b",
    request_timeout=360,
    thinking=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
resp = llm.complete("What is 434 / 22?")
print(resp.additional_kwargs["thinking"])
print(resp.text)
That's a lot of thinking!
Now, let's try a streaming example to make the wait less painful:
resp_gen = llm.stream_complete("What is 434 / 22?")

thinking_started = False
response_started = False
for resp in resp_gen:
    if resp.additional_kwargs.get("thinking_delta", None):
        if not thinking_started:
            print("\n\n-------- Thinking: --------\n")
            thinking_started = True
            response_started = False
        print(resp.additional_kwargs["thinking_delta"], end="", flush=True)
    if resp.delta:
        if not response_started:
            print("\n\n-------- Response: --------\n")
            response_started = True
            thinking_started = False
        print(resp.delta, end="", flush=True)