docs/examples/llm/ollama.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/ollama.ipynb" target="_parent"></a>
First, follow the readme to set up and run a local Ollama instance.
When the Ollama app is running on your local machine, all of your local models are automatically served on localhost:11434. Select the model when constructing the Ollama instance; by default the model's full context window is used, but you can manually set context_window to limit memory usage.
If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-ollama
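If you haven't downloaded the model used below yet, you can pull it first; a quick sketch assuming the ollama CLI is available on your PATH:
!ollama pull llama3.1:latest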
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
resp = llm.complete("Who is Paul Graham?")
print(resp)
Call chat with a list of messages
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.chat(messages)
print(resp)
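If you only need the reply text, it is available on the response's message attribute:
print(resp.message.content)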
Using stream_complete endpoint
response = llm.stream_complete("Who is Paul Graham?")
for r in response:
    print(r.delta, end="")
Using stream_chat endpoint
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")
Ollama also supports a JSON mode, which tries to ensure all responses are valid JSON.
This is particularly useful when trying to run tools that need to parse structured outputs.
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    json_mode=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
response = llm.complete(
    "Who is Paul Graham? Output as a structured JSON object."
)
print(str(response))
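Since the response should be valid JSON, you can parse it directly; a minimal sketch using the standard library (the exact keys depend on what the model returns):
import json

data = json.loads(response.text)
print(data)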
We can also attach a Pydantic class to the LLM to ensure structured outputs. This will use Ollama's built-in structured output capabilities for a given Pydantic class.
from llama_index.core.bridge.pydantic import BaseModel


class Song(BaseModel):
    """A song with name and artist."""

    name: str
    artist: str
llm = Ollama(
    model="llama3.1:latest",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
sllm = llm.as_structured_llm(Song)
from llama_index.core.llms import ChatMessage
response = sllm.chat([ChatMessage(role="user", content="Name a random song!")])
print(response.message.content)
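In recent LlamaIndex versions the parsed Pydantic object is also exposed on the response's raw attribute; a small sketch, treating this attribute as an assumption if your version differs:
song = response.raw  # assumed: the parsed Song instance
print(song.name, "-", song.artist)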
Or with async
response = await sllm.achat(
    [ChatMessage(role="user", content="Name a random song!")]
)
print(response.message.content)
You can also stream structured outputs! Streaming a structured output works a little differently from streaming a normal string: the stream yields a generator of the most up-to-date structured object.
response_gen = sllm.stream_chat(
    [ChatMessage(role="user", content="Name a random song!")]
)

for r in response_gen:
    print(r.message.content)
Ollama supports multi-modal models, and the Ollama LLM class natively supports images out of the box.
This leverages the content blocks feature of the chat messages.
Here, we leverage the llama3.2-vision model to answer a question about an image. If you don't have this model yet, you'll want to run ollama pull llama3.2-vision.
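For example, you can pull it directly from the notebook:
!ollama pull llama3.2-vision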
!wget "https://pbs.twimg.com/media/GVhGD1PXkAANfPV?format=jpg&name=4096x4096" -O ollama_image.jpg
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="llama3.2-vision",
    request_timeout=120.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
messages = [
    ChatMessage(
        role="user",
        blocks=[
            TextBlock(text="What type of animal is this?"),
            ImageBlock(path="ollama_image.jpg"),
        ],
    ),
]
resp = llm.chat(messages)
print(resp)
Close enough ;)
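An ImageBlock can also reference an image by URL rather than a local path (assuming your llama_index version supports the url field), which would make the wget step above optional; a hypothetical variant of the message:
messages = [
    ChatMessage(
        role="user",
        blocks=[
            TextBlock(text="What type of animal is this?"),
            # assumption: ImageBlock accepts a url field in current versions
            ImageBlock(
                url="https://pbs.twimg.com/media/GVhGD1PXkAANfPV?format=jpg&name=4096x4096"
            ),
        ],
    ),
]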
Models in Ollama support "thinking" -- the process of reasoning and reflecting on a response before returning a final answer.
Below we show how to enable thinking in Ollama models in both streaming and non-streaming modes using the thinking parameter and the qwen3:8b model.
from llama_index.llms.ollama import Ollama
llm = Ollama(
    model="qwen3:8b",
    request_timeout=360,
    thinking=True,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
resp = llm.complete("What is 434 / 22?")
print(resp.additional_kwargs["thinking"])
print(resp.text)
That's a lot of thinking!
Now, let's try a streaming example to make the wait less painful:
resp_gen = llm.stream_complete("What is 434 / 22?")

thinking_started = False
response_started = False
for resp in resp_gen:
    if resp.additional_kwargs.get("thinking_delta", None):
        if not thinking_started:
            print("\n\n-------- Thinking: --------\n")
            thinking_started = True
            response_started = False
        print(resp.additional_kwargs["thinking_delta"], end="", flush=True)
    if resp.delta:
        if not response_started:
            print("\n\n-------- Response: --------\n")
            response_started = True
            thinking_started = False
        print(resp.delta, end="", flush=True)