
Simple Vectorless RAG with PageIndex

cookbook/pageindex_RAG_simple.ipynb


<p align="center"><i>Reasoning-based RAG&nbsp; ✧ &nbsp;No Vector DB&nbsp; ✧ &nbsp;No Chunking&nbsp; ✧ &nbsp;Human-like Retrieval</i></p> <p align="center"> <a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp; <a href="https://dash.pageindex.ai">🖥️ Dashboard</a>&nbsp; • &nbsp; <a href="https://docs.pageindex.ai/quickstart">📚 API Docs</a>&nbsp; • &nbsp; <a href="https://github.com/VectifyAI/PageIndex">📦 GitHub</a>&nbsp; • &nbsp; <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp; <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>&nbsp; </p>

PageIndex Introduction

PageIndex is a new reasoning-based, vectorless RAG framework that performs retrieval in two steps:

  1. Generate a tree structure index of documents
  2. Perform reasoning-based retrieval through tree search
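The two-step flow above can be sketched in a few lines of plain Python. This is an illustrative toy, not the PageIndex API: the nested-dict tree, the field names, and the keyword match (which stands in for the LLM reasoning step) are all assumptions for the sake of the sketch.

```python
# Hypothetical tree index for a document (Step 1 output).
tree = {
    "node_id": "0000", "title": "Report", "summary": "Annual report",
    "nodes": [
        {"node_id": "0001", "title": "Introduction",
         "summary": "Background and goals", "nodes": []},
        {"node_id": "0002", "title": "Conclusion",
         "summary": "Key findings and conclusions", "nodes": []},
    ],
}

def flatten(node):
    """Walk the tree index depth-first, yielding every node."""
    yield node
    for child in node.get("nodes", []):
        yield from flatten(child)

def search(tree, query_terms):
    """Step 2: select nodes whose summary mentions a query term.
    (In the real pipeline an LLM performs this reasoning step.)"""
    return [n["node_id"] for n in flatten(tree)
            if any(t in n["summary"].lower() for t in query_terms)]

print(search(tree, ["conclusion"]))  # → ['0002']
```

The notebook below performs the same two steps against a real document, with an LLM doing the node selection.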

Compared to traditional vector-based RAG, PageIndex features:

  • No Vectors Needed: Uses document structure and LLM reasoning for retrieval.
  • No Chunking Needed: Documents are organized into natural sections rather than artificial chunks.
  • Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
  • Transparent Retrieval Process: Retrieval based on reasoning — say goodbye to approximate semantic search ("vibe retrieval").

📝 Notebook Overview

This notebook demonstrates a simple, minimal example of vectorless RAG with PageIndex. You will learn how to:

  • Build a PageIndex tree structure of a document
  • Perform reasoning-based retrieval with tree search
  • Generate answers based on the retrieved context

⚡ Note: This is a minimal example that illustrates PageIndex's core idea, not its full capabilities. More advanced examples are coming soon.


Step 0: Preparation

0.1 Install PageIndex

python
%pip install -q --upgrade pageindex

0.2 Setup PageIndex

python
from pageindex import PageIndexClient
import pageindex.utils as utils

# Get your PageIndex API key from https://dash.pageindex.ai/api-keys
PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY"
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

0.3 Setup LLM

Choose your preferred LLM for reasoning-based retrieval. In this example, we use OpenAI’s GPT-4.1.

python
import openai
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

async def call_llm(prompt, model="gpt-4.1", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content.strip()
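Because `call_llm` is a coroutine, `await call_llm(...)` works at the top level of a Jupyter notebook, which runs its own event loop. In a plain Python script there is no running loop, so you would wrap the call with `asyncio.run`. A sketch of that pattern, using a stub in place of the real API call:

```python
import asyncio

async def call_llm_stub(prompt):
    # Stand-in for call_llm; the real function awaits the OpenAI client.
    return f"echo: {prompt}"

# In a script (outside a notebook), drive the coroutine like this:
answer = asyncio.run(call_llm_stub("hello"))
print(answer)  # → echo: hello
```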

Step 1: PageIndex Tree Generation

1.1 Submit a document to generate a PageIndex tree

python
import os, requests

# You can also use our GitHub repo to generate PageIndex tree
# https://github.com/VectifyAI/PageIndex

pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url)
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"Downloaded {pdf_url}")

doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)

1.2 Get the generated PageIndex tree structure

python
if pi_client.is_retrieval_ready(doc_id):
    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    print('Simplified Tree Structure of the Document:')
    utils.print_tree(tree)
else:
    print("Processing document, please try again later...")
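Since `is_retrieval_ready` returns immediately, a fresh submission may still be processing on the first check. A small generic polling helper can wait for readiness; the `interval` and `timeout` values here are arbitrary choices, and `check` stands in for `pi_client.is_retrieval_ready`.

```python
import time

def wait_until_ready(check, interval=1.0, timeout=300.0):
    """Call `check()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# With the real client this would be:
#   wait_until_ready(lambda: pi_client.is_retrieval_ready(doc_id))
```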

Step 2: Reasoning-Based Retrieval

2.1 Use an LLM for tree search to identify nodes that may contain relevant context

python
import json

query = "What are the conclusions in this document?"

import copy
# Deep-copy the tree: dict.copy() is shallow, so removing 'text' from a
# shallow copy could mutate the nested nodes of the original tree as well.
tree_without_text = utils.remove_fields(copy.deepcopy(tree), fields=['text'])

search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node id, node title, and a corresponding summary.
Your task is to find all nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "<Your thinking process on which nodes are relevant to the question>",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}
Directly return the final JSON structure. Do not output anything else.
"""

tree_search_result = await call_llm(search_prompt)
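Despite the "Directly return the final JSON" instruction, models occasionally wrap their reply in a markdown code fence, which would make a plain `json.loads` call fail. A defensive parser is a small safeguard; `parse_llm_json` is a hypothetical helper added here, not part of the PageIndex SDK.

```python
import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM reply, stripping a surrounding
    markdown code fence (``` or ```json) if one is present."""
    text = text.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

result = parse_llm_json('```json\n{"node_list": ["0002"]}\n```')
print(result["node_list"])  # → ['0002']
```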

2.2 Print retrieved nodes and reasoning process

python
node_map = utils.create_node_mapping(tree)
tree_search_result_json = json.loads(tree_search_result)

print('Reasoning Process:')
utils.print_wrapped(tree_search_result_json['thinking'])

print('\nRetrieved Nodes:')
for node_id in tree_search_result_json["node_list"]:
    node = node_map[node_id]
    print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}")

Step 3: Answer Generation

3.1 Extract relevant context from retrieved nodes

python
node_list = tree_search_result_json["node_list"]
relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

print('Retrieved Context:\n')
utils.print_wrapped(relevant_content[:1000] + '...')
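Joining the full text of every retrieved node can exceed the model's context window for long documents. One simple mitigation is to keep whole sections until a budget is reached; the sketch below uses a character budget as a rough proxy for a token limit, and both the helper and its default budget are assumptions, not part of the notebook's pipeline.

```python
def assemble_context(texts, max_chars=20000):
    """Concatenate section texts in order, dropping whole sections
    once the character budget would be exceeded."""
    parts, used = [], 0
    for text in texts:
        if used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text) + 2  # account for the "\n\n" separator
    return "\n\n".join(parts)

# e.g. assemble_context([node_map[n]["text"] for n in node_list])
```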

3.2 Generate answer based on retrieved context

python
answer_prompt = f"""
Answer the question based on the context:

Question: {query}
Context: {relevant_content}

Provide a clear, concise answer based only on the context provided.
"""

print('Generated Answer:\n')
answer = await call_llm(answer_prompt)
utils.print_wrapped(answer)

🎯 What's Next

This notebook demonstrated a minimal example of reasoning-based, vectorless RAG with PageIndex. The workflow illustrates the core idea:

Generating a hierarchical tree structure from a document, reasoning over that tree structure, and extracting relevant context, without relying on a vector database or top-k similarity search.

While this notebook highlights a minimal workflow, the PageIndex framework is built to support far more advanced use cases. In upcoming tutorials, we will introduce:

  • Multi-Node Reasoning with Content Extraction — Scale tree search to extract and select relevant content from multiple nodes.
  • Multi-Document Search — Enable reasoning-based navigation across large document collections, extending beyond a single file.
  • Efficient Tree Search — Improve tree search efficiency for long documents with a large number of nodes.
  • Expert Knowledge Integration and Preference Alignment — Incorporate user preferences or expert insights by adding knowledge directly into the LLM tree search, without the need for fine-tuning.

🔎 Learn More About PageIndex

<a href="https://vectify.ai">🏠 Homepage</a>  •   <a href="https://dash.pageindex.ai">🖥️ Dashboard</a>  •   <a href="https://docs.pageindex.ai/quickstart">📚 API Docs</a>  •   <a href="https://github.com/VectifyAI/PageIndex">📦 GitHub</a>  •   <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>  •   <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact</a>

© 2025 Vectify AI