examples/python/rag/README.md
Working examples demonstrating how to use OpenDataLoader PDF in RAG (Retrieval-Augmented Generation) pipelines.
Examples use samples/pdf/1901.03003.pdf, a multi-page academic paper (arXiv:1901.03003).
basic_chunking.py demonstrates PDF-to-chunks conversion using only opendataloader-pdf and the Python standard library. It has no external embedding or vector store dependencies.
Features:
Run:
pip install opendataloader-pdf
python basic_chunking.py
langchain_example.py shows integration with the official LangChain loader.
Features:
Run:
pip install -r requirements.txt
python langchain_example.py
Processing: 1901.03003.pdf
==================================================
Document: 1901.03003.pdf
Pages: 9
Elements: 187
--- Strategy 1: Chunk by Element ---
Created 156 chunks
[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
Source: 1901.03003.pdf, Page 1, Position (108, 655)
[2] Yinhan Liu† Myle Ott† Naman Goyal† Jingfei Du† ...
Source: 1901.03003.pdf, Page 1, Position (142, 603)
--- Strategy 2: Chunk by Section ---
Created 12 chunks
Section: RoBERTa: A Robustly Optimized BERT Pretraining Approach
Section: 1 Introduction
Section: 2 Background
...
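The two strategies in the output above can be sketched with the standard library alone. This is a minimal illustration, not the code in basic_chunking.py: it assumes the parsed document has already been flattened into a list of dicts with the hypothetical keys `type`, `page`, `bbox`, and `content`, and the helper names `chunk_by_element` and `chunk_by_section` are invented for this sketch.

```python
from typing import Any, Dict, List, Optional

Element = Dict[str, Any]
Chunk = Dict[str, Any]

# Hand-written elements standing in for parsed PDF output.
# The key names ("type", "page", "bbox", "content") are assumptions.
ELEMENTS: List[Element] = [
    {"type": "heading", "page": 1, "bbox": [100.0, 700.0, 300.0, 720.0], "content": "1 Introduction"},
    {"type": "paragraph", "page": 1, "bbox": [100.0, 600.0, 300.0, 690.0], "content": "First paragraph."},
    {"type": "paragraph", "page": 1, "bbox": [100.0, 580.0, 300.0, 590.0], "content": "   "},
    {"type": "heading", "page": 2, "bbox": [100.0, 700.0, 300.0, 720.0], "content": "2 Background"},
    {"type": "paragraph", "page": 2, "bbox": [100.0, 600.0, 300.0, 690.0], "content": "Second section text."},
]


def chunk_by_element(elements: List[Element], source: str) -> List[Chunk]:
    """Strategy 1: one chunk per non-empty element, keeping page and bbox."""
    chunks = []
    for el in elements:
        text = el.get("content", "").strip()
        if not text:
            continue  # skip whitespace-only elements
        chunks.append({
            "text": text,
            "metadata": {
                "type": el["type"],
                "page": el["page"],
                "bbox": el["bbox"],
                "source": source,
            },
        })
    return chunks


def chunk_by_section(elements: List[Element], source: str) -> List[Chunk]:
    """Strategy 2: group body text under the most recent heading."""
    sections: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    for el in elements:
        text = el.get("content", "").strip()
        if not text:
            continue
        if el["type"] == "heading":
            # A heading starts a new section.
            current = {"section": text, "page": el["page"], "parts": []}
            sections.append(current)
        else:
            if current is None:
                # Text before the first heading goes into an untitled section.
                current = {"section": "", "page": el["page"], "parts": []}
                sections.append(current)
            current["parts"].append(text)
    return [
        {
            "text": "\n".join(s["parts"]),
            "metadata": {"section": s["section"], "page": s["page"], "source": source},
        }
        for s in sections
    ]


element_chunks = chunk_by_element(ELEMENTS, "example.pdf")
section_chunks = chunk_by_section(ELEMENTS, "example.pdf")
```

Element-level chunks keep precise positions for citation, while section-level chunks trade that precision for more surrounding context per chunk.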
After chunking, pass the chunks to your preferred embedding model and vector store.
Each chunk includes text and metadata ready for embedding:
{
  "text": "Language model pretraining has led to significant...",
  "metadata": {
    "type": "paragraph",
    "page": 1,
    "bbox": [108.0, 526.2, 286.5, 592.8],
    "source": "1901.03003.pdf"
  }
}
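Because each chunk is a plain dict, it can be handed to whatever embedding and retrieval layer you choose. The sketch below uses a bag-of-words cosine similarity as a stand-in for a real embedding model, purely to show how `text` is scored and how `metadata` travels with the result for citation. The function names (`embed`, `cosine`, `search`, `cite`) and the sample chunks are hypothetical and not part of opendataloader-pdf.

```python
import math
import re
from collections import Counter
from typing import Any, Dict, List

Chunk = Dict[str, Any]

# Invented sample chunks in the shape shown above.
CHUNKS: List[Chunk] = [
    {
        "text": "Pretraining improves downstream accuracy.",
        "metadata": {"type": "paragraph", "page": 1,
                     "bbox": [100.0, 600.0, 300.0, 690.0], "source": "example.pdf"},
    },
    {
        "text": "Results are reported on a held-out test set.",
        "metadata": {"type": "paragraph", "page": 5,
                     "bbox": [100.0, 400.0, 300.0, 490.0], "source": "example.pdf"},
    },
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, chunks: List[Chunk], top_k: int = 1) -> List[Chunk]:
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:top_k]


def cite(chunk: Chunk) -> str:
    """Build a source citation from the chunk metadata."""
    meta = chunk["metadata"]
    return f'{meta["source"]}, Page {meta["page"]}'


best = search("pretraining accuracy", CHUNKS)[0]
citation = cite(best)
```

In a real pipeline, `embed` would call your embedding model and `search` would be replaced by a vector store query; the chunk dict itself would not need to change.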