# LayoutIR Reader (`llama-index-readers-layoutir`)
LayoutIR Reader uses LayoutIR, a production-grade document ingestion and canonicalization engine with a compiler-like architecture. Unlike simple PDF-to-Markdown converters, LayoutIR processes documents through an Intermediate Representation (IR) layer, enabling precise preservation of complex layouts, tables, and multi-column structures.
## Installation

```bash
pip install llama-index-readers-layoutir
```
For GPU acceleration, first install PyTorch with CUDA support, then install the reader:

```bash
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install llama-index-readers-layoutir
```
## Usage

### Basic Usage

Load a PDF document with default settings:

```python
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader()
documents = reader.load_data(file_path="document.pdf")

# Each document preserves block structure and metadata
for doc in documents:
    print(f"Block Type: {doc.metadata['block_type']}")
    print(f"Page: {doc.metadata['page_number']}")
    print(f"Content: {doc.text[:100]}...")
```
### GPU Acceleration

Enable GPU processing for faster performance:

```python
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader(use_gpu=True)
documents = reader.load_data(file_path="large_document.pdf")
```
### Semantic Chunking

Use semantic section-based chunking:

```python
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader(
    chunk_strategy="semantic",
    max_heading_level=2,  # Split at h1 and h2 headings
)
documents = reader.load_data(file_path="structured_document.pdf")
```
### Batch Processing

Process a batch of documents:

```python
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader(use_gpu=True)
file_paths = ["report_2024.pdf", "technical_spec.pdf", "user_manual.pdf"]
documents = reader.load_data(file_path=file_paths)
print(f"Loaded {len(documents)} document blocks from {len(file_paths)} files")
```
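After a batch load, each returned block records its originating file in metadata (the metadata fields include `file_name`). A minimal sketch of tallying blocks per source file — using plain dicts in place of `Document` objects so it runs standalone; real code would read `doc.metadata["file_name"]`:

```python
from collections import Counter

# Stand-ins for Document.metadata from a batch load_data() call;
# LayoutIR documents expose the same keys on `doc.metadata`.
blocks = [
    {"file_name": "report_2024.pdf", "block_type": "table"},
    {"file_name": "report_2024.pdf", "block_type": "paragraph"},
    {"file_name": "user_manual.pdf", "block_type": "heading"},
]

# Count how many blocks each source file contributed
per_file = Counter(block["file_name"] for block in blocks)
print(per_file)  # Counter({'report_2024.pdf': 2, 'user_manual.pdf': 1})
```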
### Building a Vector Index

Build a searchable index from LayoutIR-processed documents:

```python
from llama_index.readers.layoutir import LayoutIRReader
from llama_index.core import VectorStoreIndex

# Load documents with preserved layout structure
reader = LayoutIRReader(
    use_gpu=True, chunk_strategy="semantic", max_heading_level=2
)
documents = reader.load_data(file_path="company_knowledge_base.pdf")

# Create index
index = VectorStoreIndex.from_documents(documents)

# Query with layout-aware context
query_engine = index.as_query_engine()
response = query_engine.query("What are the key financial metrics in Q4?")
print(response)
```
### Use with SimpleDirectoryReader

Integrate LayoutIR for PDF processing in directory operations:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader(use_gpu=True)
dir_reader = SimpleDirectoryReader(
    input_dir="/path/to/documents",
    file_extractor={".pdf": reader},
)
documents = dir_reader.load_data()
print(f"Processed {len(documents)} blocks")
```
## Configuration

Full configuration example:

```python
from llama_index.readers.layoutir import LayoutIRReader

reader = LayoutIRReader(
    use_gpu=True,               # Enable GPU acceleration
    chunk_strategy="semantic",  # Use semantic chunking
    max_heading_level=3,        # Split up to h3 level
    model_name="custom_model",  # Optional: specify model
    api_key="your_api_key",     # Optional: for remote processing
)
documents = reader.load_data(
    file_path="complex_layout.pdf",
    extra_info={"department": "research", "year": 2026},
)

# Access rich metadata
for doc in documents:
    print(f"ID: {doc.doc_id}")
    print(f"Type: {doc.metadata['block_type']}")
    print(f"Page: {doc.metadata['page_number']}")
    print(f"Department: {doc.metadata['department']}")
```
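Because every block carries a `block_type`, downstream code can route tables or headings differently from body text. A minimal sketch of such filtering — plain dicts stand in for the returned `Document` objects so it runs standalone; real code would test `doc.metadata["block_type"]`:

```python
# Stand-ins for doc.metadata on LayoutIR-produced Documents.
blocks = [
    {"block_type": "heading", "page_number": 1},
    {"block_type": "table", "page_number": 2},
    {"block_type": "paragraph", "page_number": 2},
    {"block_type": "table", "page_number": 3},
]

# Keep only table blocks, e.g. to feed a table-aware summarizer
tables = [b for b in blocks if b["block_type"] == "table"]
print(len(tables))  # 2
```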
## Metadata

Each `Document` includes the following metadata:

- `file_path`: Source file path
- `file_name`: Source file name
- `block_type`: Type of content block (table, paragraph, heading, etc.)
- `block_index`: Index of the block in the document
- `page_number`: Page number where the block appears
- `source`: Always `"layoutir"`
- Any `extra_info` passed to `load_data()`

## License

MIT
## Troubleshooting

If you encounter CUDA/GPU issues (e.g. cuBLAS version mismatches or missing CUDA drivers), set `CUDA_VISIBLE_DEVICES=""` before running to force CPU-only processing:

```bash
CUDA_VISIBLE_DEVICES="" python your_script.py
```
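The same override can be applied from inside Python, provided it runs before torch or any other CUDA-backed library is imported (a sketch; `CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable, not a LayoutIR setting):

```python
import os

# Hide all GPUs from CUDA-backed libraries; must run before they are imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Subsequent imports (torch, the LayoutIR reader, ...) will fall back to CPU.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints an empty line
```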