README.md
π₯ Releases:
π Articles:
π§ͺ Cookbooks:
Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity β relevance β what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
Inspired by AlphaGo, we propose PageIndex β a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval. It simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. PageIndex performs retrieval in two steps:
Compared to traditional vector-based RAG, PageIndex features:
PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis (see our blog post for details).
To learn more, please see a detailed introduction of the PageIndex framework. Check out this GitHub repo for open-source code, and the cookbooks, tutorials, and blog for additional usage guides and examples.
The PageIndex service is available as a ChatGPT-style chat platform, or can be integrated via MCP or API.
PageIndex can transform lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for use with Large Language Models (LLMs). It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.
Below is an example PageIndex tree structure. Also see more example documents and generated tree structures.
...
{
"title": "Financial Stability",
"node_id": "0006",
"start_index": 21,
"end_index": 22,
"summary": "The Federal Reserve ...",
"nodes": [
{
"title": "Monitoring Financial Vulnerabilities",
"node_id": "0007",
"start_index": 22,
"end_index": 28,
"summary": "The Federal Reserve's monitoring ..."
},
{
"title": "Domestic and International Cooperation and Coordination",
"node_id": "0008",
"start_index": 28,
"end_index": 31,
"summary": "In 2023, the Federal Reserve collaborated ..."
}
]
}
...
You can generate the PageIndex tree structure with this open-source repo, or use our API
You can follow these steps to generate a PageIndex tree from a PDF document.
pip3 install --upgrade -r requirements.txt
Create a .env file in the root directory and add your API key:
CHATGPT_API_KEY=your_openai_key_here
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
You can customize the processing with additional optional arguments:
--model OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages Pages to check for table of contents (default: 20)
--max-pages-per-node Max pages per node (default: 10)
--max-tokens-per-node Max tokens per node (default: 20000)
--if-add-node-id Add node ID (yes/no, default: yes)
--if-add-node-summary Add node summary (yes/no, default: yes)
--if-add-doc-description Add doc description (yes/no, default: yes)
We also provide markdown support for PageIndex. You can use the -md_path flag to generate a tree structure for a markdown file.
python3 run_pageindex.py --md_path /path/to/your/document.md
</details> <!-- # βοΈ Improved Tree Generation with PageIndex OCR This repo is designed for generating PageIndex tree structure for simple PDFs, but many real-world use cases involve complex PDFs that are hard to parse by classic Python tools. However, extracting high-quality text from PDF documents remains a non-trivial challenge. Most OCR tools only extract page-level content, losing the broader document context and hierarchy. To address this, we introduced PageIndex OCR β the first long-context OCR model designed to preserve the global structure of documents. PageIndex OCR significantly outperforms other leading OCR tools, such as those from Mistral and Contextual AI, in recognizing true hierarchy and semantic relationships across document pages. - Experience next-level OCR quality with PageIndex OCR at our [Dashboard](https://dash.pageindex.ai/). - Integrate PageIndex OCR seamlessly into your stack via our [API](https://docs.pageindex.ai/quickstart). <p align="center"> </p> -->Note: in this function, we use "#" to determine node heading and their levels. For example, "##" is level 2, "###" is level 3, etc. Make sure your markdown file is formatted correctly. If your Markdown file was converted from a PDF or HTML, we don't recommend using this function, since most existing conversion tools cannot preserve the original hierarchy. Instead, use our PageIndex OCR, which is designed to preserve the original hierarchy, to convert the PDF to a markdown file and then use this function.
Mafin 2.5 is a reasoning-based RAG system for financial document analysis, powered by PageIndex. It achieved a state-of-the-art 98.7% accuracy on the FinanceBench benchmark, significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing and reasoning-driven retrieval enable precise navigation and extraction of relevant context from complex financial reports, such as SEC filings and earnings disclosures.
Explore the full benchmark results and our blog post for detailed comparisons and performance metrics.
<div align="center"> <a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench"> </a> </div>Please cite this work as:
Mingtian Zhang, Yu Tang and PageIndex Team,
"PageIndex: Next-Generation Vectorless, Reasoning-based RAG",
PageIndex Blog, Sep 2025.
Or use the BibTeX citation:
@article{zhang2025pageindex,
author = {Mingtian Zhang and Yu Tang and PageIndex Team},
title = {PageIndex: Next-Generation Vectorless, Reasoning-based RAG},
journal = {PageIndex Blog},
year = {2025},
month = {September},
note = {https://pageindex.ai/blog/pageindex-intro},
}
Leave us a star π if you like our project. Thank you!
<p> </p>Β© 2025 Vectify AI