Ollama LLM

Setup

First, follow the Ollama README to set up and run a local Ollama instance.

When the Ollama app is running on your local machine:

  • All of your local models are automatically served on localhost:11434
  • Select a model by setting llm = Ollama(..., model="<model family>:<version>")
  • If needed, increase the default timeout (30 seconds) by setting Ollama(..., request_timeout=300.0)
  • If you set llm = Ollama(..., model="<model family>") without a version, it looks for the latest tag
  • By default, the maximum context window for your model is used. You can manually set the context_window to limit memory usage.
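
The points in the list above can be checked before constructing the LLM. What follows is a minimal sketch, not part of LlamaIndex: it assumes the local server answers Ollama's GET /api/tags request on localhost:11434, and the helper names (with_default_tag, list_local_models, model_is_available) are hypothetical and introduced only for illustration.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from the list above


def with_default_tag(model: str) -> str:
    """Append ':latest' when no version is given (illustrative helper)."""
    # Only inspect the part after the last '/', so a registry host with a
    # port (host:5000/name) is not mistaken for a version tag.
    last_segment = model.rsplit("/", 1)[-1]
    return model if ":" in last_segment else f"{model}:latest"


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the names of locally served models, or None if unreachable."""
    try:
        url = f"{base_url}/api/tags"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    return [entry.get("name", "") for entry in payload.get("models", [])]


def model_is_available(model: str, served) -> bool:
    """Check a requested model name against the list returned by the server."""
    return served is not None and with_default_tag(model) in served
```

In a notebook you might call model_is_available("llama3.1", list_local_models()) and run ollama pull for the model if it returns False.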

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

python
%pip install llama-index-llms-ollama
python
from llama_index.llms.ollama import Ollama
python
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
python
resp = llm.complete("Who is Paul Graham?")
python
print(resp)

Call chat with a list of messages

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
python
print(resp)

Streaming

Using the stream_complete endpoint

python
response = llm.stream_complete("Who is Paul Graham?")
python
for r in response:
    print(r.delta, end="")

Using the stream_chat endpoint

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
python
for r in resp:
    print(r.delta, end="")
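
Both streaming loops above print each delta and then discard it. If the full text is needed afterwards, the deltas can be accumulated while they are printed. This is a minimal sketch: collect_stream is a hypothetical helper, and the chunks are stand-in objects that only assume a delta attribute, as in the loops above.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print each delta as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        # Treat a missing or None delta as an empty string.
        delta = getattr(chunk, "delta", None) or ""
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)


# Stand-in chunks; in the notebook you would pass llm.stream_complete(...)
# or llm.stream_chat(...) instead.
fake_chunks = [SimpleNamespace(delta=d) for d in ("Paul ", "Graham ", "is...")]
full_text = collect_stream(fake_chunks, echo=False)
```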

JSON Mode

Ollama also supports a JSON mode, which tries to ensure all responses are valid JSON.

This is particularly useful when trying to run tools that need to parse structured outputs.

python
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    json_mode=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
python
response = llm.complete(
    "Who is Paul Graham? Output as a structured JSON object."
)
print(str(response))
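
Because JSON mode only tries to produce valid JSON, downstream code may still want to parse defensively. This is a minimal sketch using only the standard library; parse_json_response is a hypothetical helper and the sample string is a made-up stand-in for str(response).

```python
import json


def parse_json_response(text: str):
    """Return the decoded object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # For example, output truncated by a token limit.
        return None


# Stand-in for str(response); the content is illustrative only.
sample = '{"name": "Paul Graham"}'
parsed = parse_json_response(sample)
```

Returning None rather than raising lets the caller decide whether to retry the request or fall back to the raw text.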

Structured Outputs

We can also attach a Pydantic class to the LLM to ensure structured outputs. This uses Ollama's built-in structured output capabilities for the given Pydantic class.

python
from llama_index.core.bridge.pydantic import BaseModel


class Song(BaseModel):
    """A song with name and artist."""

    name: str
    artist: str
python
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)

sllm = llm.as_structured_llm(Song)
python
from llama_index.core.llms import ChatMessage

response = sllm.chat([ChatMessage(role="user", content="Name a random song!")])
print(response.message.content)

Or with async

python
response = await sllm.achat(
    [ChatMessage(role="user", content="Name a random song!")]
)
print(response.message.content)

You can also stream structured outputs! Streaming a structured output is a little different from streaming a normal string: the generator yields the most up-to-date version of the structured object at each step.

python
response_gen = sllm.stream_chat(
    [ChatMessage(role="user", content="Name a random song!")]
)
for r in response_gen:
    print(r.message.content)
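
The examples above print response.message.content as text. If that text needs to become a typed object again, it can be parsed and checked. This is a minimal sketch under the assumption that the content is a JSON string with the Song fields; it uses a standard-library dataclass (SongRecord, a hypothetical stand-in) so that it runs without extra packages. With the Pydantic class itself, Song.model_validate_json(...) in Pydantic v2 would play the same role.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = {"name", "artist"}


@dataclass
class SongRecord:
    """Stand-in for the Song class above, using only the standard library."""

    name: str
    artist: str


def song_from_json(content: str) -> SongRecord:
    """Parse a JSON string and check that both Song fields are present."""
    data = json.loads(content)  # raises ValueError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return SongRecord(name=str(data["name"]), artist=str(data["artist"]))
```

When streaming, intermediate objects may be incomplete, so a check like this is most useful on the final item from the generator.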

Multi-Modal Support

Ollama supports multi-modal models, and the Ollama LLM class supports images out of the box.

This leverages the content blocks feature of the chat messages.

Here, we leverage the llama3.2-vision model to answer a question about an image. If you don't have this model yet, you'll want to run ollama pull llama3.2-vision.

python
!wget "https://pbs.twimg.com/media/GVhGD1PXkAANfPV?format=jpg&name=4096x4096" -O ollama_image.jpg
python
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3.2-vision",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)

messages = [
    ChatMessage(
        role="user",
        blocks=[
            TextBlock(text="What type of animal is this?"),
            ImageBlock(path="ollama_image.jpg"),
        ],
    ),
]

resp = llm.chat(messages)
print(resp)
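
A download can silently save an error page instead of an image, so it can be worth checking the file before passing it to ImageBlock. This is a minimal sketch, assuming the file is expected to be a JPEG as in the wget command above; looks_like_jpeg is a hypothetical helper that only inspects the file signature.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def looks_like_jpeg(path) -> bool:
    """Return True if the file exists and starts with the JPEG signature."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(3) == JPEG_MAGIC
```

For example, looks_like_jpeg("ollama_image.jpg") could be called right after the download cell.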

Close enough ;)

Thinking

Some models in Ollama support "thinking" -- the process of reasoning and reflecting on a response before returning a final answer.

Below we show how to enable thinking in Ollama models in both streaming and non-streaming modes using the thinking parameter and the qwen3:8b model.

python
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="qwen3:8b",
    request_timeout=360,
    thinking=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
python
resp = llm.complete("What is 434 / 22?")
python
print(resp.additional_kwargs["thinking"])
python
print(resp.text)

That's a lot of thinking!

Now, let's try a streaming example to make the wait less painful:

python
resp_gen = llm.stream_complete("What is 434 / 22?")

thinking_started = False
response_started = False

for resp in resp_gen:
    if resp.additional_kwargs.get("thinking_delta", None):
        if not thinking_started:
            print("\n\n-------- Thinking: --------\n")
            thinking_started = True
            response_started = False
        print(resp.additional_kwargs["thinking_delta"], end="", flush=True)
    if resp.delta:
        if not response_started:
            print("\n\n-------- Response: --------\n")
            response_started = True
            thinking_started = False
        print(resp.delta, end="", flush=True)
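
The loop above prints the two streams as they arrive. When the thinking and the answer are needed as separate strings afterwards, the same checks can be used to accumulate them. This is a minimal sketch: split_stream is a hypothetical helper, and the chunks are stand-in objects that only assume the delta attribute and the "thinking_delta" key used in the loop above.

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Accumulate thinking deltas and response deltas into two strings."""
    thinking, answer = [], []
    for chunk in chunks:
        kwargs = getattr(chunk, "additional_kwargs", None) or {}
        thinking_delta = kwargs.get("thinking_delta")
        if thinking_delta:
            thinking.append(thinking_delta)
        if getattr(chunk, "delta", None):
            answer.append(chunk.delta)
    return "".join(thinking), "".join(answer)


# Stand-in chunks; in the notebook you would pass
# llm.stream_complete("What is 434 / 22?") instead.
fake_chunks = [
    SimpleNamespace(delta="", additional_kwargs={"thinking_delta": "434 / 22 "}),
    SimpleNamespace(delta="", additional_kwargs={"thinking_delta": "is about 19.7"}),
    SimpleNamespace(delta="Roughly ", additional_kwargs={}),
    SimpleNamespace(delta="19.73", additional_kwargs={}),
]
thinking_text, answer_text = split_stream(fake_chunks)
```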