docs/examples/cookbooks/local_rag_with_chroma_and_ollama.ipynb
No API key required. This notebook runs entirely on your local machine using Ollama for both the LLM and embeddings, and ChromaDB as the vector store.
| Step | Concept |
|---|---|
| 1 | Configure the pipeline — all tunables in one place |
| 2 | Ingest and chunk a sample document with SentenceSplitter |
| 3 | Embed chunks with OllamaEmbedding and persist in ChromaDB |
| 4 | Query the index with a local LLM (llama3.2:3b via Ollama) |
| 5 | Evaluate retrieval quality against a gold Q&A set (hit-rate & MRR) |
| 6 | Explore failure modes: empty context, long queries, hallucination guard |
Pull both models in a terminal before running the notebook:
ollama pull llama3.2:3b
ollama pull nomic-embed-text
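Before building anything, it can save time to confirm the Ollama server is actually up. A minimal sketch using only the standard library; `ollama_reachable` is a helper introduced here (not part of any library), and `/api/tags` is Ollama's model-listing endpoint:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError


def ollama_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama HTTP API responds with valid JSON at `base_url`."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means the server answered properly
        return True
    except (URLError, ValueError, OSError):
        return False


if not ollama_reachable():
    print("Ollama not reachable — start it with `ollama serve` and re-run.")
```

If this prints the warning, the later `Ollama(...)` and `OllamaEmbedding(...)` calls will fail with connection errors.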
%pip install -q llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma llama-index-readers-file chromadb
import json
import logging
import shutil
from pathlib import Path
import chromadb
from IPython.display import Markdown, display
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
logger = logging.getLogger("local_rag")
logger.info("Imports loaded successfully.")
All tunables are defined here. Edit this cell to change model names, chunk size, top-k, etc.
cfg = {
"llm": {
"model": "llama3.2:3b",
"base_url": "http://localhost:11434",
"temperature": 0.0,
"request_timeout": 120.0,
},
"embedding": {
"model": "nomic-embed-text",
"base_url": "http://localhost:11434",
},
"splitter": {
"chunk_size": 512,
"chunk_overlap": 50,
},
"chroma": {
"persist_dir": "./chroma_db",
"collection_name": "ai_safety_rag",
},
"retrieval": {
"similarity_top_k": 3,
},
}
print(json.dumps(cfg, indent=2))
llm = Ollama(
model=cfg["llm"]["model"],
base_url=cfg["llm"]["base_url"],
temperature=cfg["llm"]["temperature"],
request_timeout=cfg["llm"]["request_timeout"],
)
embed_model = OllamaEmbedding(
model_name=cfg["embedding"]["model"],
base_url=cfg["embedding"]["base_url"],
)
logger.info(
"LLM: %s | Embedding: %s", cfg["llm"]["model"], cfg["embedding"]["model"]
)
We use SentenceSplitter with the chunk size and overlap from config.
The corpus is defined inline — replace CORPUS_TEXT with your own content or load from a file.
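SentenceSplitter keeps sentences intact, but the effect of chunk size and overlap is easiest to see with a simplified character-window version. This is an illustrative sketch, not LlamaIndex's actual algorithm:

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share an overlap."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


demo = "abcdefghij" * 10  # 100 characters
chunks = window_chunks(demo, chunk_size=40, chunk_overlap=10)
print(len(chunks), "chunks")                # 4 chunks (starts at 0, 30, 60, 90)
print(chunks[0][-10:] == chunks[1][:10])    # True: 10-char overlap between neighbors
```

The overlap ensures a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.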
CORPUS_TEXT = """# AI Safety Primer
## What is AI Safety?
AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.
## Key Concepts
### Alignment
Alignment refers to the challenge of ensuring that an AI system's goals and behaviors
match the intentions of its designers and the broader interests of humanity. A misaligned
AI might pursue its programmed objective in ways that are harmful or unintended.
### RLHF (Reinforcement Learning from Human Feedback)
RLHF is a training technique where human evaluators rank model outputs, and those
rankings are used to train a reward model. The AI is then fine-tuned using reinforcement
learning to maximize this reward signal, steering it toward outputs humans prefer.
### Reward Hacking
Reward hacking occurs when an AI system finds unintended ways to maximize its reward
signal without achieving the true underlying goal. For example, a robot trained to run
fast might learn to make itself very tall and then fall forward repeatedly.
### Constitutional AI
Constitutional AI (CAI) is a technique developed by Anthropic to make AI systems more
helpful, harmless, and honest. It uses a set of explicit principles (a "constitution")
to guide the model's behavior. The model critiques and revises its own outputs against
these principles, reducing reliance on human labelers for harmful content.
### Red Teaming
Red teaming in AI involves deliberately trying to find failure modes, vulnerabilities,
or harmful outputs in AI systems. Red teamers act as adversaries, probing the system
with edge cases, jailbreaks, and adversarial prompts to expose weaknesses before
deployment.
### Deceptive Alignment
Deceptive alignment is a hypothetical failure mode where an AI system behaves safely
during training and evaluation but pursues different goals once deployed. The system
"knows" it is being evaluated and acts accordingly to pass safety checks.
### Interpretability
Interpretability (or explainability) research aims to understand what is happening
inside AI models — which features they use, how they represent concepts, and why they
produce specific outputs. Tools like mechanistic interpretability try to reverse-engineer
neural network computations.
## Why AI Safety Matters Now
The rapid pace of AI development means that safety considerations must be integrated
early into the design and training process. Several organizations are actively working
on AI safety research:
- **Anthropic** — founded by former OpenAI researchers, focuses on Constitutional AI
and interpretability
- **OpenAI** — safety team works on alignment, red teaming, and policy
- **DeepMind** — conducts research on agent safety and specification gaming
- **MIRI (Machine Intelligence Research Institute)** — focuses on long-term
theoretical alignment problems
- **Center for AI Safety (CAIS)** — coordinates safety research across academia
and industry
"""
documents = [
Document(text=CORPUS_TEXT, metadata={"source": "ai_safety_primer"})
]
logger.info("Loaded %d document(s) from inline corpus", len(documents))
splitter = SentenceSplitter(
chunk_size=cfg["splitter"]["chunk_size"],
chunk_overlap=cfg["splitter"]["chunk_overlap"],
)
nodes = splitter.get_nodes_from_documents(documents)
logger.info(
"Split into %d nodes (chunk_size=%d, overlap=%d)",
len(nodes),
cfg["splitter"]["chunk_size"],
cfg["splitter"]["chunk_overlap"],
)
print(f"\nFirst chunk preview ({len(nodes[0].text)} chars):")
print("-" * 60)
print(nodes[0].text[:400], "...")
If the collection already exists in the persisted Chroma database, we load it directly — no re-embedding.
Delete the chroma_db/ folder to force a full re-index.
PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
COLLECTION = cfg["chroma"]["collection_name"]
chroma_client = chromadb.PersistentClient(path=str(PERSIST_DIR))
existing = [c.name for c in chroma_client.list_collections()]
if COLLECTION in existing:
logger.info(
"Cache hit — loading existing collection '%s' from %s",
COLLECTION,
PERSIST_DIR,
)
chroma_collection = chroma_client.get_collection(COLLECTION)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
vector_store,
embed_model=embed_model,
)
else:
logger.info(
"Cache miss — embedding %d nodes into new collection '%s'",
len(nodes),
COLLECTION,
)
chroma_collection = chroma_client.create_collection(COLLECTION)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
nodes,
storage_context=storage_context,
embed_model=embed_model,
)
logger.info("Index built and persisted to %s", PERSIST_DIR)
print(f"Collection '{COLLECTION}' has {chroma_collection.count()} vectors.")
The query engine retrieves the top-k most relevant chunks and passes them as context to the local LLM to generate a grounded answer.
query_engine = index.as_query_engine(
llm=llm,
similarity_top_k=cfg["retrieval"]["similarity_top_k"],
)
QUERY = "What is Constitutional AI and who developed it?"
logger.info("Running query: %s", QUERY)
response = query_engine.query(QUERY)
display(Markdown(f"**Query:** {QUERY}\n\n**Answer:** {response}"))
print("\n--- Retrieved source nodes ---")
for i, node in enumerate(response.source_nodes, 1):
    score = node.score if node.score is not None else float("nan")
    print(f"[{i}] score={score:.4f} | {node.text[:120].strip()}...")
We loop over the gold Q&A set; for each question we retrieve the top-k chunks, check whether any chunk contains all of the expected keywords, and record the rank of the first matching chunk.
This is CI-friendly — no extra LLM calls, so it runs in seconds.
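Hit-rate is the fraction of questions with at least one matching chunk; MRR (Mean Reciprocal Rank) averages 1/rank of the first hit, counting misses as 0. A tiny worked example with hypothetical ranks, separate from the pipeline below:

```python
def hit_rate_and_mrr(first_hit_ranks):
    """Compute hit-rate and MRR from per-question first-hit ranks.
    A rank of None means no retrieved chunk matched the expected keywords."""
    hits = sum(r is not None for r in first_hit_ranks)
    rr = [1 / r if r is not None else 0.0 for r in first_hit_ranks]
    return hits / len(first_hit_ranks), sum(rr) / len(rr)


# e.g. q1 hit at rank 1, q2 missed entirely, q3 hit at rank 2:
hit_rate, mrr = hit_rate_and_mrr([1, None, 2])
print(f"hit-rate={hit_rate:.2%}  MRR={mrr:.4f}")  # hit-rate=66.67%  MRR=0.5000
```

MRR rewards putting the right chunk first: two hits at ranks 1 and 2 score 0.75, while the same two hits at ranks 2 and 3 score only about 0.42.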
gold_qa = [
{
"id": "q1",
"question": "What is alignment in the context of AI safety?",
"expected_keywords": ["alignment", "goals"],
},
{
"id": "q2",
"question": "What is RLHF and how does it work?",
"expected_keywords": ["rlhf", "reward"],
},
{
"id": "q3",
"question": "What is reward hacking?",
"expected_keywords": ["reward hacking", "unintended"],
},
{
"id": "q4",
"question": "What is Constitutional AI and who developed it?",
"expected_keywords": ["constitutional ai", "anthropic"],
},
{
"id": "q5",
"question": "What is red teaming in AI?",
"expected_keywords": ["red teaming", "failure"],
},
{
"id": "q6",
"question": "What is deceptive alignment?",
"expected_keywords": ["deceptive alignment", "training"],
},
{
"id": "q7",
"question": "What is interpretability in AI systems?",
"expected_keywords": ["interpretability", "neural"],
},
{
"id": "q8",
"question": "Which organizations are working on AI safety?",
"expected_keywords": ["anthropic", "openai", "deepmind"],
},
]
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=cfg["retrieval"]["similarity_top_k"],
embed_model=embed_model,
)
hits = 0
reciprocal_ranks = []
results = []
for item in gold_qa:
retrieved_nodes = retriever.retrieve(item["question"])
keywords = [kw.lower() for kw in item["expected_keywords"]]
first_hit_rank = None
for rank, node in enumerate(retrieved_nodes, 1):
text_lower = node.text.lower()
if all(kw in text_lower for kw in keywords):
first_hit_rank = rank
break
hit = first_hit_rank is not None
hits += int(hit)
reciprocal_ranks.append(1 / first_hit_rank if hit else 0.0)
results.append(
{
"id": item["id"],
"hit": hit,
"rank": first_hit_rank,
"question": item["question"][:60],
}
)
logger.info(
"[%s] hit=%s rank=%s | %s",
item["id"],
hit,
first_hit_rank,
item["question"][:50],
)
hit_rate = hits / len(gold_qa)
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print("\n" + "=" * 50)
print(
f" Retrieval Evaluation Results (top-k={cfg['retrieval']['similarity_top_k']})"
)
print("=" * 50)
print(f" Hit-Rate : {hit_rate:.2%} ({hits}/{len(gold_qa)} questions)")
print(f" MRR : {mrr:.4f}")
print("=" * 50)
print("\nPer-question breakdown:")
for r in results:
status = "✅" if r["hit"] else "❌"
print(f" {status} [{r['id']}] rank={r['rank']} {r['question']}")
Understanding where a RAG pipeline breaks is as important as knowing where it works. We demonstrate three common failure modes.
When the query has no semantic content, retrieval returns low-relevance chunks and the LLM is forced to hallucinate or admit it doesn't know.
empty_query = "asdfjkl qwerty zzz"
logger.info("[Failure Mode 1] Empty/nonsense query: '%s'", empty_query)
response_empty = query_engine.query(empty_query)
print("Query :", empty_query)
print("Answer :", str(response_empty))
print(
    "\nTop retrieved node score:",
    f"{response_empty.source_nodes[0].score:.4f}"
    if response_empty.source_nodes
    and response_empty.source_nodes[0].score is not None
    else "none",
)
print(
"\n⚠️ Note: Low retrieval score indicates the context is not relevant to the query."
)
The document covers AI safety. A query about an unrelated topic will retrieve the least-bad chunks, but the answer will be unreliable.
ood_query = "What is the recipe for making sourdough bread?"
logger.info("[Failure Mode 2] Out-of-domain query: '%s'", ood_query)
response_ood = query_engine.query(ood_query)
print("Query :", ood_query)
print("Answer :", str(response_ood))
print(
"\n⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,"
)
print(
" return 'I don't have information about this topic' instead of hallucinating."
)
A simple but effective guardrail: if the best retrieval score is below a threshold, refuse to answer rather than hallucinate.
SCORE_THRESHOLD = 0.40
def safe_query(
query_engine, retriever, question: str, threshold: float = SCORE_THRESHOLD
) -> str:
"""Run RAG query with a relevance score guardrail.
Returns the LLM answer if the best retrieved chunk exceeds `threshold`,
otherwise returns a fallback message to prevent hallucination.
"""
nodes = retriever.retrieve(question)
if not nodes:
return "[GUARDRAIL] No documents retrieved."
    best_score = max(
        (n.score for n in nodes if n.score is not None), default=0.0
    )
logger.info(
"[safe_query] best_score=%.4f threshold=%.2f", best_score, threshold
)
if best_score < threshold:
return (
f"[GUARDRAIL] Best retrieval score ({best_score:.4f}) is below "
f"threshold ({threshold}). Cannot answer reliably."
)
return str(query_engine.query(question))
# In-domain question — should pass the guardrail
q_in = "What is reward hacking?"
# Out-of-domain question — should be blocked
q_out = "What is the capital of France?"
print("=" * 55)
print(f"Q (in-domain) : {q_in}")
print(f"A : {safe_query(query_engine, retriever, q_in)}")
print()
print(f"Q (out-domain): {q_out}")
print(f"A : {safe_query(query_engine, retriever, q_out)}")
print("=" * 55)
Run this cell to delete the persisted Chroma database and start fresh. Useful for testing the full pipeline from scratch.
# Uncomment to reset the vector store
# PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
# if PERSIST_DIR.exists():
# shutil.rmtree(PERSIST_DIR)
# logger.info("Deleted %s — re-run Cell 6 to rebuild the index.", PERSIST_DIR)
# else:
# logger.info("%s does not exist, nothing to clean up.", PERSIST_DIR)
print(
"Cleanup cell ready. Uncomment the lines above to reset the vector store."
)
| Component | Choice | Why |
|---|---|---|
| LLM | llama3.2:3b via Ollama | Free, local, no API key |
| Embeddings | nomic-embed-text via Ollama | High quality, 274 MB, fully local |
| Vector store | ChromaDB (persistent) | Simple, file-based, no server needed |
| Chunking | SentenceSplitter | Respects sentence boundaries |
| Eval | Keyword hit-rate + MRR | CI-friendly, zero LLM cost |
| Guardrail | Score threshold | Prevents hallucination on OOD queries |
Next steps:
- Swap `llama3.2:3b` for `mistral` or `gemma3` in the config cell and re-run
- Replace `CORPUS_TEXT` with your own documents
- Tune `similarity_top_k` and observe the effect on MRR
- Add a reranker (e.g. `llama-index-postprocessor-cohere-rerank`) after retrieval