docs/examples/cookbooks/local_rag_with_chroma_and_ollama.ipynb
No API key required. This notebook runs entirely on your local machine using Ollama for both the LLM and embeddings, and ChromaDB as the vector store.
| Step | Concept |
|---|---|
| 1 | Configure the pipeline — all tunables in one place |
| 2 | Ingest and chunk a sample document with SentenceSplitter |
| 3 | Embed chunks with OllamaEmbedding and persist in ChromaDB |
| 4 | Query the index with a local LLM (llama3.2:3b via Ollama) |
| 5 | Evaluate retrieval quality against a gold Q&A set (hit-rate & MRR) |
| 6 | Explore failure modes: empty context, long queries, hallucination guard |
Pull both models in a terminal before running the notebook:
ollama pull llama3.2:3b
ollama pull nomic-embed-text
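Before building anything, it can save time to confirm the Ollama server is actually up. A minimal sketch using only the standard library; `ollama_reachable` is a helper introduced here (not part of any library), and `/api/tags` is Ollama's model-listing endpoint:

```python
import json
from urllib.request import urlopen
from urllib.error import URLError


def ollama_reachable(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama HTTP API responds with valid JSON at `base_url`."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # valid JSON means the server answered properly
        return True
    except (URLError, ValueError, OSError):
        return False


if not ollama_reachable():
    print("Ollama not reachable — start it with `ollama serve` and re-run.")
```

If this prints the warning, the later `Ollama(...)` and `OllamaEmbedding(...)` calls will fail with connection errors.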
%pip install -q llama-index-core llama-index-llms-ollama llama-index-embeddings-ollama llama-index-vector-stores-chroma llama-index-readers-file chromadb
import json
import logging
import shutil
from pathlib import Path
import chromadb
from IPython.display import Markdown, display
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.chroma import ChromaVectorStore
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
datefmt="%H:%M:%S",
)
logger = logging.getLogger("local_rag")
logger.info("Imports loaded successfully.")
All tunables are defined here. Edit this cell to change model names, chunk size, top-k, etc.
cfg = {
"llm": {
"model": "llama3.2:3b",
"base_url": "http://localhost:11434",
"temperature": 0.0,
"request_timeout": 120.0,
},
"embedding": {
"model": "nomic-embed-text",
"base_url": "http://localhost:11434",
},
"splitter": {
"chunk_size": 512,
"chunk_overlap": 50,
},
"chroma": {
"persist_dir": "./chroma_db",
"collection_name": "ai_safety_rag",
},
"retrieval": {
"similarity_top_k": 3,
},
}
print(json.dumps(cfg, indent=2))
llm = Ollama(
model=cfg["llm"]["model"],
base_url=cfg["llm"]["base_url"],
temperature=cfg["llm"]["temperature"],
request_timeout=cfg["llm"]["request_timeout"],
)
embed_model = OllamaEmbedding(
model_name=cfg["embedding"]["model"],
base_url=cfg["embedding"]["base_url"],
)
logger.info(
"LLM: %s | Embedding: %s", cfg["llm"]["model"], cfg["embedding"]["model"]
)
We use SentenceSplitter with the chunk size and overlap from config.
The corpus is defined inline — replace CORPUS_TEXT with your own content or load from a file.
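SentenceSplitter keeps sentences intact, but the effect of chunk size and overlap is easiest to see with a simplified character-window version. This is an illustrative sketch, not LlamaIndex's actual algorithm:

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share an overlap."""
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


demo = "abcdefghij" * 10  # 100 characters
chunks = window_chunks(demo, chunk_size=40, chunk_overlap=10)
print(len(chunks), "chunks")                # 4 chunks (starts at 0, 30, 60, 90)
print(chunks[0][-10:] == chunks[1][:10])    # True: 10-char overlap between neighbors
```

The overlap ensures a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.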
CORPUS_TEXT = """# AI Safety Primer
## What is AI Safety?
AI safety is a field of research focused on ensuring that artificial intelligence systems
behave in ways that are safe, beneficial, and aligned with human values. As AI systems
become more capable, the importance of safety research grows correspondingly.
## Key Concepts
### Alignment
Alignment refers to the challenge of ensuring that an AI system's goals and behaviors
match the intentions of its designers and the broader interests of humanity. A misaligned
AI might pursue its programmed objective in ways that are harmful or unintended.
### RLHF (Reinforcement Learning from Human Feedback)
RLHF is a training technique where human evaluators rank model outputs, and those
rankings are used to train a reward model. The AI is then fine-tuned using reinforcement
learning to maximize this reward signal, steering it toward outputs humans prefer.
### Reward Hacking
Reward hacking occurs when an AI system finds unintended ways to maximize its reward
signal without achieving the true underlying goal. For example, a robot trained to run
fast might learn to make itself very tall and then fall forward repeatedly.
### Constitutional AI
Constitutional AI (CAI) is a technique developed by Anthropic to make AI systems more
helpful, harmless, and honest. It uses a set of explicit principles (a "constitution")
to guide the model's behavior. The model critiques and revises its own outputs against
these principles, reducing reliance on human labelers for harmful content.
### Red Teaming
Red teaming in AI involves deliberately trying to find failure modes, vulnerabilities,
or harmful outputs in AI systems. Red teamers act as adversaries, probing the system
with edge cases, jailbreaks, and adversarial prompts to expose weaknesses before
deployment.
### Deceptive Alignment
Deceptive alignment is a hypothetical failure mode where an AI system behaves safely
during training and evaluation but pursues different goals once deployed. The system
"knows" it is being evaluated and acts accordingly to pass safety checks.
### Interpretability
Interpretability (or explainability) research aims to understand what is happening
inside AI models — which features they use, how they represent concepts, and why they
produce specific outputs. Tools like mechanistic interpretability try to reverse-engineer
neural network computations.
## Why AI Safety Matters Now
The rapid pace of AI development means that safety considerations must be integrated
early into the design and training process. Several organizations are actively working
on AI safety research:
- **Anthropic** — founded by former OpenAI researchers, focuses on Constitutional AI
and interpretability
- **OpenAI** — safety team works on alignment, red teaming, and policy
- **DeepMind** — conducts research on agent safety and specification gaming
- **MIRI (Machine Intelligence Research Institute)** — focuses on long-term
theoretical alignment problems
- **Center for AI Safety (CAIS)** — coordinates safety research across academia
and industry
"""
documents = [
Document(text=CORPUS_TEXT, metadata={"source": "ai_safety_primer"})
]
logger.info("Loaded %d document(s) from inline corpus", len(documents))
splitter = SentenceSplitter(
chunk_size=cfg["splitter"]["chunk_size"],
chunk_overlap=cfg["splitter"]["chunk_overlap"],
)
nodes = splitter.get_nodes_from_documents(documents)
logger.info(
"Split into %d nodes (chunk_size=%d, overlap=%d)",
len(nodes),
cfg["splitter"]["chunk_size"],
cfg["splitter"]["chunk_overlap"],
)
print(f"\nFirst chunk preview ({len(nodes[0].text)} chars):")
print("-" * 60)
print(nodes[0].text[:400], "...")
If the collection already exists in the persisted Chroma database, we load it directly — no re-embedding.
Delete the chroma_db/ folder to force a full re-index.
PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
COLLECTION = cfg["chroma"]["collection_name"]
chroma_client = chromadb.PersistentClient(path=str(PERSIST_DIR))
existing = [c.name for c in chroma_client.list_collections()]
if COLLECTION in existing:
logger.info(
"Cache hit — loading existing collection '%s' from %s",
COLLECTION,
PERSIST_DIR,
)
chroma_collection = chroma_client.get_collection(COLLECTION)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(
vector_store,
embed_model=embed_model,
)
else:
logger.info(
"Cache miss — embedding %d nodes into new collection '%s'",
len(nodes),
COLLECTION,
)
chroma_collection = chroma_client.create_collection(COLLECTION)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
nodes,
storage_context=storage_context,
embed_model=embed_model,
)
logger.info("Index built and persisted to %s", PERSIST_DIR)
print(f"Collection '{COLLECTION}' has {chroma_collection.count()} vectors.")
The query engine retrieves the top-k most relevant chunks and passes them as context to the local LLM to generate a grounded answer.
query_engine = index.as_query_engine(
llm=llm,
similarity_top_k=cfg["retrieval"]["similarity_top_k"],
)
QUERY = "What is Constitutional AI and who developed it?"
logger.info("Running query: %s", QUERY)
response = query_engine.query(QUERY)
display(Markdown(f"**Query:** {QUERY}\n\n**Answer:** {response}"))
print("\n--- Retrieved source nodes ---")
for i, node in enumerate(response.source_nodes, 1):
    score = node.score if node.score is not None else float("nan")
    print(f"[{i}] score={score:.4f} | {node.text[:120].strip()}...")
We loop over the gold Q&A set; for each question we retrieve the top-k chunks, check whether any chunk contains all of the expected keywords, and record the rank of the first matching chunk.
This is CI-friendly — no extra LLM calls, so it runs in seconds.
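Hit-rate is the fraction of questions with at least one matching chunk; MRR (Mean Reciprocal Rank) averages 1/rank of the first hit, counting misses as 0. A tiny worked example with hypothetical ranks, separate from the pipeline below:

```python
def hit_rate_and_mrr(first_hit_ranks):
    """Compute hit-rate and MRR from per-question first-hit ranks.
    A rank of None means no retrieved chunk matched the expected keywords."""
    hits = sum(r is not None for r in first_hit_ranks)
    rr = [1 / r if r is not None else 0.0 for r in first_hit_ranks]
    return hits / len(first_hit_ranks), sum(rr) / len(rr)


# e.g. q1 hit at rank 1, q2 missed entirely, q3 hit at rank 2:
hit_rate, mrr = hit_rate_and_mrr([1, None, 2])
print(f"hit-rate={hit_rate:.2%}  MRR={mrr:.4f}")  # hit-rate=66.67%  MRR=0.5000
```

MRR rewards putting the right chunk first: two hits at ranks 1 and 2 score 0.75, while the same two hits at ranks 2 and 3 score only about 0.42.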
gold_qa = [
{
"id": "q1",
"question": "What is alignment in the context of AI safety?",
"expected_keywords": ["alignment", "goals"],
},
{
"id": "q2",
"question": "What is RLHF and how does it work?",
"expected_keywords": ["rlhf", "reward"],
},
{
"id": "q3",
"question": "What is reward hacking?",
"expected_keywords": ["reward hacking", "unintended"],
},
{
"id": "q4",
"question": "What is Constitutional AI and who developed it?",
"expected_keywords": ["constitutional ai", "anthropic"],
},
{
"id": "q5",
"question": "What is red teaming in AI?",
"expected_keywords": ["red teaming", "failure"],
},
{
"id": "q6",
"question": "What is deceptive alignment?",
"expected_keywords": ["deceptive alignment", "training"],
},
{
"id": "q7",
"question": "What is interpretability in AI systems?",
"expected_keywords": ["interpretability", "neural"],
},
{
"id": "q8",
"question": "Which organizations are working on AI safety?",
"expected_keywords": ["anthropic", "openai", "deepmind"],
},
]
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=cfg["retrieval"]["similarity_top_k"],
embed_model=embed_model,
)
hits = 0
reciprocal_ranks = []
results = []
for item in gold_qa:
retrieved_nodes = retriever.retrieve(item["question"])
keywords = [kw.lower() for kw in item["expected_keywords"]]
first_hit_rank = None
for rank, node in enumerate(retrieved_nodes, 1):
text_lower = node.text.lower()
if all(kw in text_lower for kw in keywords):
first_hit_rank = rank
break
hit = first_hit_rank is not None
hits += int(hit)
reciprocal_ranks.append(1 / first_hit_rank if hit else 0.0)
results.append(
{
"id": item["id"],
"hit": hit,
"rank": first_hit_rank,
"question": item["question"][:60],
}
)
logger.info(
"[%s] hit=%s rank=%s | %s",
item["id"],
hit,
first_hit_rank,
item["question"][:50],
)
hit_rate = hits / len(gold_qa)
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print("\n" + "=" * 50)
print(
f" Retrieval Evaluation Results (top-k={cfg['retrieval']['similarity_top_k']})"
)
print("=" * 50)
print(f" Hit-Rate : {hit_rate:.2%} ({hits}/{len(gold_qa)} questions)")
print(f" MRR : {mrr:.4f}")
print("=" * 50)
print("\nPer-question breakdown:")
for r in results:
status = "✅" if r["hit"] else "❌"
print(f" {status} [{r['id']}] rank={r['rank']} {r['question']}")
Understanding where a RAG pipeline breaks is as important as knowing where it works. We demonstrate three common failure modes.
When the query has no semantic content, retrieval returns low-relevance chunks and the LLM is forced to hallucinate or admit it doesn't know.
empty_query = "asdfjkl qwerty zzz"
logger.info("[Failure Mode 1] Empty/nonsense query: '%s'", empty_query)
response_empty = query_engine.query(empty_query)
print("Query :", empty_query)
print("Answer :", str(response_empty))
print(
    "\nTop retrieved node score:",
    f"{response_empty.source_nodes[0].score:.4f}"
    if response_empty.source_nodes
    and response_empty.source_nodes[0].score is not None
    else "none",
)
print(
"\n⚠️ Note: Low retrieval score indicates the context is not relevant to the query."
)
The document covers AI safety. A query about an unrelated topic will retrieve the least-bad chunks, but the answer will be unreliable.
ood_query = "What is the recipe for making sourdough bread?"
logger.info("[Failure Mode 2] Out-of-domain query: '%s'", ood_query)
response_ood = query_engine.query(ood_query)
print("Query :", ood_query)
print("Answer :", str(response_ood))
print(
"\n⚠️ Guardrail tip: Add a relevance score threshold. If max(node.score) < 0.4,"
)
print(
" return 'I don't have information about this topic' instead of hallucinating."
)
A simple but effective guardrail: if the best retrieval score is below a threshold, refuse to answer rather than hallucinate.
SCORE_THRESHOLD = 0.40
def safe_query(
query_engine, retriever, question: str, threshold: float = SCORE_THRESHOLD
) -> str:
"""Run RAG query with a relevance score guardrail.
Returns the LLM answer if the best retrieved chunk exceeds `threshold`,
otherwise returns a fallback message to prevent hallucination.
"""
nodes = retriever.retrieve(question)
if not nodes:
return "[GUARDRAIL] No documents retrieved."
    best_score = max(
        (n.score for n in nodes if n.score is not None), default=0.0
    )
logger.info(
"[safe_query] best_score=%.4f threshold=%.2f", best_score, threshold
)
if best_score < threshold:
return (
f"[GUARDRAIL] Best retrieval score ({best_score:.4f}) is below "
f"threshold ({threshold}). Cannot answer reliably."
)
return str(query_engine.query(question))
# In-domain question — should pass the guardrail
q_in = "What is reward hacking?"
# Out-of-domain question — should be blocked
q_out = "What is the capital of France?"
print("=" * 55)
print(f"Q (in-domain) : {q_in}")
print(f"A : {safe_query(query_engine, retriever, q_in)}")
print()
print(f"Q (out-domain): {q_out}")
print(f"A : {safe_query(query_engine, retriever, q_out)}")
print("=" * 55)
Run this cell to delete the persisted Chroma database and start fresh. Useful for testing the full pipeline from scratch.
# Uncomment to reset the vector store
# PERSIST_DIR = Path(cfg["chroma"]["persist_dir"])
# if PERSIST_DIR.exists():
# shutil.rmtree(PERSIST_DIR)
# logger.info("Deleted %s — re-run Cell 6 to rebuild the index.", PERSIST_DIR)
# else:
# logger.info("%s does not exist, nothing to clean up.", PERSIST_DIR)
print(
"Cleanup cell ready. Uncomment the lines above to reset the vector store."
)
| Component | Choice | Why |
|---|---|---|
| LLM | llama3.2:3b via Ollama | Free, local, no API key |
| Embeddings | nomic-embed-text via Ollama | High quality, 274 MB, fully local |
| Vector store | ChromaDB (persistent) | Simple, file-based, no server needed |
| Chunking | SentenceSplitter | Respects sentence boundaries |
| Eval | Keyword hit-rate + MRR | CI-friendly, zero LLM cost |
| Guardrail | Score threshold | Prevents hallucination on OOD queries |
Next steps:
- Swap `llama3.2:3b` for `mistral` or `gemma3` in the config cell and re-run
- Replace `CORPUS_TEXT` with your own documents
- Tune `similarity_top_k` and observe the effect on MRR
- Add a reranker (e.g. `llama-index-postprocessor-cohere-rerank`) after retrieval