examples/multi_format_indexing/README.md
No OCR, no text extraction, no brittle per-format parsers — in plain async Python.
</p>
<p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/multi-format-indexing/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p>
<div align="center"> </div>

Real document sets are a mix: scanned reports, slide exports, screenshots, and PDFs all jumbled together. Parsing each format into clean text is brittle and loses the layout (tables, charts, figures) that is often the answer itself. This pipeline sidesteps parsing entirely: it renders every PDF page to an image, embeds pages and standalone images alike with the multi-vector ColPali model, and stores them in one Qdrant collection. You declare the transformation in native Python as `target_state = transformation(source_state)`; the slow per-page inference runs on a GPU runner, and the Rust engine handles incremental processing, so adding a document embeds only its pages.
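To make the "adding a document embeds only its pages" claim concrete, here is a minimal sketch of content-hash memoization in plain Python. It is not how the CocoIndex engine is implemented (its internals are not shown in this README); `MEMO`, `embed_pages`, and `index_file` are hypothetical names used only for illustration.

```python
import hashlib

# Hypothetical in-memory memo table: content fingerprint -> cached result.
MEMO: dict[str, list[str]] = {}
CALLS = 0  # counts how often the expensive step actually runs


def embed_pages(content: bytes) -> list[str]:
    """Stand-in for slow per-page inference (hypothetical)."""
    global CALLS
    CALLS += 1
    return [f"embedding-of-{len(content)}-bytes"]


def index_file(content: bytes) -> list[str]:
    """Run the expensive step only when this content has not been seen before."""
    key = hashlib.sha256(content).hexdigest()
    if key not in MEMO:
        MEMO[key] = embed_pages(content)
    return MEMO[key]


index_file(b"report-v1")  # first run: computed
index_file(b"report-v1")  # unchanged file: served from the memo table
index_file(b"new-paper")  # newly added file: only this one is computed
print(CALLS)  # prints 2
```

The second call for the unchanged file does no work, which is the behavior the README attributes to `@coco.fn(memo=True)`.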
A file fans out to pages, so the shape is file → N pages → N points:
Supported inputs are `.pdf` / `.jpg` / `.jpeg` / `.png`. A PDF is rendered page by page with `pdf2image`; a standalone image is a single page; anything else is skipped.

One file-splitting function handles every format, and `process_file` fans each page out with `coco.map`. Read it in `main.py`:
```python
@coco.fn.as_async(runner=coco.GPU)
def file_to_pages(filename: str, content: bytes) -> list[Page]:
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type == "application/pdf":
        return [Page(page_number=i + 1, image=_to_png(img))
                for i, img in enumerate(convert_from_bytes(content, dpi=PDF_RENDER_DPI))]
    if mime_type and mime_type.startswith("image/"):
        return [Page(page_number=None, image=content)]
    return []


@coco.fn(memo=True)  # unchanged file is never re-rendered or re-embedded
async def process_file(file: FileLike, target: qdrant.CollectionTarget) -> None:
    filename = str(file.file_path.path)
    pages = await file_to_pages(filename, await file.read())
    await coco.map(process_page, pages, filename, target)  # one point per page
```
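Each page becomes its own point keyed by `(filename, page)`. One way to get a key that stays stable across runs is a deterministic UUID, sketched below. This is an assumption for illustration, not necessarily how `main.py` derives its IDs; `NAMESPACE` and `point_id` are hypothetical names. Qdrant accepts unsigned integers or UUIDs as point IDs, so a UUID string fits.

```python
import uuid
from typing import Optional

# Hypothetical fixed namespace for this example; any constant UUID works.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "multi-format-indexing-example")


def point_id(filename: str, page_number: Optional[int]) -> str:
    """Derive a stable ID from (filename, page).

    page_number is None for a standalone image, matching the Page objects
    returned by file_to_pages above.
    """
    page = "image" if page_number is None else str(page_number)
    return str(uuid.uuid5(NAMESPACE, f"{filename}#{page}"))


# Same inputs always give the same ID, so re-indexing overwrites the
# existing point instead of creating a duplicate.
print(point_id("paper.pdf", 1) == point_id("paper.pdf", 1))  # prints True
print(point_id("paper.pdf", 1) == point_id("paper.pdf", 2))  # prints False
```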
The Qdrant collection is declared with a `MultiVectorSchema` and `multivector_comparator="max_sim"`, so a text query is scored against the best-matching patch of each page, and the same query reaches pages from PDFs and standalone images alike.
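For readers new to late interaction, MaxSim can be sketched in a few lines of plain Python: for each query vector, take its best similarity against any patch vector of the page, then sum those maxima. The dot product below is an assumption (the similarity actually used depends on the distance configured for the collection), and the two-dimensional vectors are made-up toy values, not ColPali output.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Sum, over query vectors, of the best match among the page's patch vectors."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy data: two query token vectors, two pages with two patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.5, 0.5]]
page_b = [[0.25, 0.75], [0.5, 0.0]]

print(max_sim(query, page_a))  # prints 1.5
print(max_sim(query, page_b))  # prints 1.25
```

Here `page_a` ranks first because one of its patches matches the first query vector exactly, even though its other patch is only a partial match.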
Step-by-step walkthrough with the file-to-pages split, the GPU runner, the multi-vector MaxSim collection, and cross-format search.
</p>

- PDFs and standalone images share one `file_to_pages` path, so a query reaches them all, with no per-format retrievers.
- `MultiVectorSchema` + `max_sim` comparator scores a query against each page's best-matching patches, late-interaction style.
- Pages fan out with `coco.map`, each its own point keyed by `(filename, page)`, so re-running reconciles cleanly instead of duplicating.
- Per-page inference runs on `coco.GPU`; `@coco.fn(memo=True)` means adding a document embeds only its pages and leaves the rest untouched.

Needs Qdrant plus the ColPali deps (`torch`, `transformers`, `pdf2image`). `pdf2image` needs poppler installed for PDF rendering (`brew install poppler` / `apt install poppler-utils`).
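Since `pdf2image` shells out to poppler's command-line tools, a quick preflight check can save a confusing failure at indexing time. The sketch below only looks for the `pdftoppm` and `pdfinfo` binaries on `PATH`; `poppler_available` is a hypothetical helper, not part of this example's code.

```python
import shutil


def poppler_available() -> bool:
    """Return True if the poppler tools pdf2image relies on are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))


if not poppler_available():
    print("poppler not found: install it before indexing PDFs")
```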
1. Start Qdrant:

   ```sh
   docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
   ```

2. Configure & install:

   ```sh
   cp .env.example .env   # QDRANT_URL (defaults to the local container above)
   pip install -e .
   ```

3. Build the index. The example ships a `source_files/` folder mixing PDFs (papers) and images (financial report pages). A PDF expands to one point per page (the sample BERT paper alone is 16 pages):

   ```sh
   cocoindex update main   # or: cocoindex update -L main (keep watching the folder)
   ```

4. Search across formats. Embed a text query with ColPali; the same query reaches pages from PDFs and standalone images alike:

   ```sh
   python main.py "revenue growth"
   ```
On the sample set, "revenue growth" ranks the two financial-report images at the top (Sweetgreen, then Restaurant Brands), above an unrelated healthcare page — MaxSim matching the query against the most relevant patches of each page, with zero text extraction.
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/multi-format-indexing/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>