docs/examples/rag_mongodb.ipynb
| Step | Tech | Execution |
|---|---|---|
| Embedding | Voyage AI | 🌐 Remote |
| Vector store | MongoDB | 🌐 Remote |
| Gen AI | Azure OpenAI | 🌐 Remote |
This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) pipeline using MongoDB as a vector store and Voyage AI embedding models for semantic search. The workflow involves extracting and chunking text from documents, generating embeddings with Voyage AI, storing vectors in MongoDB, and leveraging Azure OpenAI for generative responses.
By combining these technologies, you can build scalable, production-ready RAG systems for advanced document understanding and question answering.
First, we'll install the necessary libraries and configure our environment. These packages enable document processing, database connections, embedding generation, and AI model interaction. We're using Docling for document handling, PyMongo for MongoDB integration, VoyageAI for embeddings, and the Azure OpenAI client for generation capabilities.
%%capture
%pip install "docling~=2.7.0"
%pip install "pymongo[srv]"
%pip install voyageai
%pip install openai
import logging
import warnings
warnings.filterwarnings("ignore")
logging.getLogger("pymongo").setLevel(logging.ERROR)
Part of what makes Docling so remarkable is that it can run on commodity hardware, so this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with Apple Silicon, Docling integrates seamlessly with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, offers energy-efficient performance on Apple Silicon, and is compatible with all Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
import torch
# Check if GPU or MPS is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA GPU is enabled: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS GPU is enabled.")
else:
    raise OSError(
        "No GPU or MPS device found. Please check your environment and ensure GPU or MPS support is configured."
    )
To begin, we will focus on a single seminal paper and treat it as the entire knowledge base. Building a Retrieval-Augmented Generation (RAG) pipeline on just one document serves as a clear, controlled baseline before scaling to multiple sources. This helps validate each stage of the workflow (parsing, chunking, embedding, retrieval, generation) without confounding factors introduced by inter-document noise.
# Influential machine learning papers
source_urls = [
"https://arxiv.org/pdf/1706.03762" # Attention is All You Need
]
Convert the source URL with Docling's DocumentConverter. The converter downloads the PDF and parses it into a structured DoclingDocument that we can chunk in the next step. Other conversion methods are available as well; a sketch for converting multiple sources at once follows the code below.
from pprint import pprint
from docling.document_converter import DocumentConverter
# Instantiate the doc converter
doc_converter = DocumentConverter()
# Since we want to use a single document, we will convert just the first URL. For multiple documents, you can use the convert_all() method and then iterate over the converted results.
pdf_doc = source_urls[0]
converted_doc = doc_converter.convert(pdf_doc).document
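If you later want to process several papers at once, here is a minimal sketch of the multi-document variant; it assumes convert_all() yields results in the same order as source_urls and uses the document's export_to_markdown() method to obtain Markdown text.
# Optional sketch: convert every source URL and keep a dict mapping
# each URL to its Markdown export.
conv_results = doc_converter.convert_all(source_urls)
markdown_by_url = {
    url: res.document.export_to_markdown()
    for url, res in zip(source_urls, conv_results)
}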
We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of the converted document. This preserves some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
from docling_core.transforms.chunker import HierarchicalChunker
# Initialize the chunker
chunker = HierarchicalChunker()
# Perform hierarchical chunking on the converted document and get text from chunks
chunks = list(chunker.chunk(converted_doc))
chunk_texts = [chunk.text for chunk in chunks]
chunk_texts[:20] # Display a few chunk texts
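To see what "hierarchy-aware" means in practice, you can inspect a chunk's metadata. The attribute names below follow docling-core's chunk metadata (e.g. meta.headings) and may differ slightly between versions.
# Peek at the hierarchy metadata attached to one chunk
sample_chunk = chunks[min(10, len(chunks) - 1)]
print("Section headings:", sample_chunk.meta.headings)  # headings the chunk falls under
print("Text preview:", sample_chunk.text[:200])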
We will use a Voyage AI embedding model to convert the chunks above into embeddings and then push them to MongoDB for further consumption.
Voyage AI offers a range of embedding models; for best results here we use voyage-context-3, a contextualized chunk embedding model in which each chunk embedding encodes not only the chunk's own content but also contextual information from the full document.
You can go through the Voyage AI blog post to understand how it performs in comparison to other embedding models.
Create an account on Voyage AI and get your API key.
import voyageai
# Voyage API key
VOYAGE_API_KEY = "**********************"
# Initialize the VoyageAI client
vo = voyageai.Client(VOYAGE_API_KEY)
result = vo.contextualized_embed(inputs=[chunk_texts], model="voyage-context-3")
contextualized_chunk_embds = [emb for r in result.results for emb in r.embeddings]
# Check lengths to ensure they match
print("Chunk Texts Length:", chunk_texts.__len__())
print("Contextualized Chunk Embeddings Length:", contextualized_chunk_embds.__len__())
# Combine chunks with their embeddings
chunk_data = [
{"text": text, "embedding": emb}
for text, emb in zip(chunk_texts, contextualized_chunk_embds)
]
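It is also worth confirming the embedding dimensionality, since it must match the numDimensions value used when we create the vector index below (1024, which should correspond to voyage-context-3's default output size).
# The vector index below is defined with numDimensions=1024; the embeddings must match.
print("Embedding dimension:", len(contextualized_chunk_embds[0]))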
With the generated embeddings prepared, we now insert them into MongoDB so they can be leveraged in the RAG pipeline.
MongoDB is an ideal vector store for RAG applications thanks to its native Atlas Vector Search support.
The chunks and their embeddings will be stored in a MongoDB collection, allowing us to perform similarity searches when responding to user queries.
# Insert to MongoDB
from pymongo import MongoClient
client = MongoClient(
"mongodb+srv://*******.mongodb.net/"
) # Replace with your MongoDB connection string
db = client["rag_db"] # Database name
collection = db["documents"] # Collection name
# Insert chunk data into MongoDB
response = collection.insert_many(chunk_data)
print(f"Inserted {len(response.inserted_ids)} documents into MongoDB.")
Using pymongo we can create a vector index, that will help us search through our vectors and respond to user queries. This index is crucial for efficient similarity searches between user questions and our document chunks. MongoDB Atlas Vector Search provides fast and accurate retrieval of semantically related content, which forms the foundation of our RAG pipeline.
from pymongo.operations import SearchIndexModel
# Create your index model, then create the search index
search_index_model = SearchIndexModel(
definition={
"fields": [
{
"type": "vector",
"path": "embedding",
"numDimensions": 1024,
"similarity": "dotProduct",
}
]
},
name="vector_index",
type="vectorSearch",
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")
To perform a query on the vectorized data stored in MongoDB, we can use the $vectorSearch aggregation pipeline. This powerful feature of MongoDB Atlas enables semantic search capabilities by finding documents based on vector similarity.
When executing a vector search query, MongoDB compares the query embedding against the stored chunk embeddings and returns the closest matches.
This enables us to find semantically related content rather than relying on exact keyword matches. The similarity metric we're using (dot product) is equivalent to cosine similarity when the embeddings are normalized to unit length, allowing us to identify content that is conceptually similar even if it uses different terminology.
For RAG applications, this vector search capability is crucial as it allows us to retrieve the most relevant context from our document collection based on the semantic meaning of a user's query, providing the foundation for generating accurate and contextually appropriate responses.
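As a quick standalone illustration (not part of the pipeline) of why the dot product and cosine similarity coincide for unit-length embeddings:
import numpy as np
# For unit-length vectors, the dot product equals cosine similarity.
a = np.array([0.6, 0.8])  # length 1
b = np.array([1.0, 0.0])  # length 1
print(float(a @ b))  # dot product -> 0.6
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))  # cosine -> 0.6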
We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
import os
from openai import AzureOpenAI
from rich.console import Console
from rich.panel import Panel
# Define the query we want to answer about "Attention is All You Need"
prompt = "Give me top 3 learning points from `Attention is All You Need`, using only the retrieved context."
# Generate embedding for the query using VoyageAI (vo already initialized earlier)
query_embd_context = (
vo.contextualized_embed(
inputs=[[prompt]], model="voyage-context-3", input_type="query"
)
.results[0]
.embeddings[0]
)
# Vector search pipeline
search_pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "embedding",
"queryVector": query_embd_context,
"numCandidates": 10,
"limit": 10,
}
},
{"$project": {"text": 1, "_id": 0, "score": {"$meta": "vectorSearchScore"}}},
]
results = list(collection.aggregate(search_pipeline))
if not results:
raise ValueError(
"No vector search results returned. Verify the index is built before querying."
)
context_texts = [doc["text"] for doc in results]
combined_context = "\n\n".join(context_texts)
# Azure OpenAI settings: replace with your own values (better yet, load them
# from environment variables rather than hardcoding secrets).
# The endpoint must look like https://your-resource-name.openai.azure.com/
AZURE_OPENAI_API_KEY = "**********************"
AZURE_OPENAI_ENDPOINT = "**********************"
AZURE_OPENAI_API_VERSION = "**********************"
# Initialize Azure OpenAI client (endpoint must NOT include path segments)
client = AzureOpenAI(
api_key=AZURE_OPENAI_API_KEY,
azure_endpoint=AZURE_OPENAI_ENDPOINT.rstrip("/"),
api_version=AZURE_OPENAI_API_VERSION,
)
# Chat completion using retrieved context
response = client.chat.completions.create(
model="gpt-4o-mini", # Azure deployment name
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Use only the provided context to answer questions. If the context is insufficient, say so.",
},
{
"role": "user",
"content": f"Context:\n{combined_context}\n\nQuestion: {prompt}",
},
],
temperature=0.2,
)
response_text = response.choices[0].message.content
console = Console()
console.print(Panel(f"{prompt}", title="Prompt", border_style="bold red"))
console.print(
Panel(response_text, title="Generated Content", border_style="bold green")
)
This notebook demonstrated a powerful RAG pipeline using MongoDB, VoyageAI, and Azure OpenAI. By combining MongoDB's vector search capabilities with VoyageAI's embeddings and Azure OpenAI's language models, we created an intelligent document retrieval system.
Start building your own intelligent document retrieval system today!