examples/pdf_embedding/README.md
Papers, RFCs, manuals, contracts — searchable in plain English, in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/pdf-embedding/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

Take a folder of PDFs and turn it into a vector index you can search in plain English. What PDFs add over plain text is a parsing step: they have to be converted before they can be chunked. This pipeline uses docling to convert each PDF to clean Markdown, then chunks the Markdown, embeds the chunks, and stores the vectors in Postgres. You declare the transformation in native Python with your own types, as `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so only changed PDFs get re-parsed and re-embedded.
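The incremental behavior described above can be pictured with a small sketch. It is not how the CocoIndex engine is implemented: `IncrementalIndex`, `fingerprint`, and `transform` are hypothetical names, and the stand-in `transform` simply upper-cases text where the real pipeline would parse, chunk, and embed. The sketch fingerprints file bytes only, whereas the example's memoization also takes code changes into account.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to decide whether a source file changed."""
    return hashlib.sha256(content).hexdigest()


def transform(content: bytes) -> str:
    """Hypothetical stand-in for the expensive parse/chunk/embed step."""
    return content.decode("utf-8").upper()


class IncrementalIndex:
    """Toy model of target_state = transformation(source_state)."""

    def __init__(self) -> None:
        self.seen: dict[str, str] = {}    # path -> fingerprint of last processed bytes
        self.target: dict[str, str] = {}  # path -> derived output
        self.transform_calls = 0          # counts how often the expensive step ran

    def update(self, source: dict[str, bytes]) -> None:
        # Delete derived rows whose source file is gone.
        for path in list(self.target):
            if path not in source:
                del self.target[path]
                del self.seen[path]
        # Re-run the transformation only for new or changed files.
        for path, content in source.items():
            fp = fingerprint(content)
            if self.seen.get(path) == fp:
                continue  # unchanged bytes: skip the expensive step
            self.transform_calls += 1
            self.target[path] = transform(content)
            self.seen[path] = fp


if __name__ == "__main__":
    index = IncrementalIndex()
    index.update({"a.pdf": b"alpha", "b.pdf": b"beta"})
    index.update({"a.pdf": b"alpha", "b.pdf": b"beta v2"})  # only b.pdf is reprocessed
    index.update({"a.pdf": b"alpha"})                       # b.pdf's output is removed
    print(index.transform_calls, sorted(index.target))
```

The point of the sketch is the shape of the loop: the caller always describes the full desired state, and the bookkeeping decides what work is actually needed.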
The one genuinely expensive step is PDF parsing, so it runs on a GPU runner and the docling converter is built once with `@functools.cache`. `process_file` converts the PDF to Markdown, splits it into overlapping chunks, and maps each chunk to `process_chunk` for embedding. Read it in `main.py`:
```python
# Excerpt from main.py; imports and helpers such as pdf_converter,
# _splitter, and process_chunk are defined there.

@coco.fn.as_async(runner=coco.GPU)
def pdf_to_markdown(content: bytes) -> str:
    # Synchronous docling parse, run on the GPU runner.
    source = DocumentStream(name="input.pdf", stream=io.BytesIO(content))
    return pdf_converter().convert(source).document.export_to_markdown()


@coco.fn(memo=True)
async def process_file(file: FileLike, table: postgres.TableTarget[PdfEmbedding]) -> None:
    markdown = await pdf_to_markdown(await file.read())
    # Overlapping chunks: 2000 characters each, 500 shared with the neighbor.
    chunks = _splitter.split(markdown, chunk_size=2000, chunk_overlap=500, language="markdown")
    id_gen = IdGenerator()
    await coco.map(process_chunk, chunks, file.file_path.path, id_gen, table)
```
`@coco.fn.as_async(runner=coco.GPU)` wraps the synchronous, GPU-heavy parse so it runs off the async event loop. Each chunk's row id is derived from its text, so a chunk that survives a re-parse keeps its row.
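To see why text-derived ids keep rows stable, here is a minimal sketch. It does not reproduce the example's splitter or `IdGenerator`: `split_with_overlap` is a naive character-window splitter (the real one is Markdown-aware), and `chunk_ids` is a hypothetical scheme that hashes each chunk's text and adds a counter so repeated chunks still get distinct ids.

```python
import hashlib


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width splitter; consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_ids(chunks: list[str]) -> list[str]:
    """Derive an id from each chunk's text; a counter separates identical chunks."""
    seen: dict[str, int] = {}
    ids = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        occurrence = seen.get(digest, 0)
        seen[digest] = occurrence + 1
        ids.append(f"{digest}-{occurrence}")
    return ids


if __name__ == "__main__":
    before = split_with_overlap("0123456789ABCDEF", chunk_size=8, chunk_overlap=4)
    after = split_with_overlap("0123456789ABCxyz", chunk_size=8, chunk_overlap=4)
    # Only the chunk whose text changed gets a new id.
    for old_id, new_id in zip(chunk_ids(before), chunk_ids(after)):
        print(old_id, "kept" if old_id == new_id else "replaced")
```

Because the id depends only on the chunk's content, an edit near the end of a document leaves the ids of earlier chunks untouched, so only the changed rows need to be rewritten.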
Step-by-step walkthrough with the docling converter, the GPU runner, the row schema, and the query.
</p>

- `@coco.fn.as_async(runner=coco.GPU)` offloads PDF parsing to a dedicated GPU runner; `@functools.cache` loads the docling model once, not per file.
- `@coco.fn(memo=True)` skips a PDF whose bytes and code are unchanged, so docling never re-parses a file you've already converted; `mount_table_target` upserts only changed rows and deletes rows whose source is gone.
- `live=True`: pass `-L` and added, replaced, or deleted PDFs are picked up as they change.
- `all-MiniLM-L6-v2` embedder, no API key; swap `EMBED_MODEL` for any of the 12k+ sentence-transformer models on Hugging Face.

1. Start Postgres + pgvector (the repo ships a compose file):
```sh
docker compose -f ../../dev/postgres.yaml up -d
```
2. Configure & install (docling pulls in the PDF parser):
```sh
cp .env.example .env   # set POSTGRES_URL
pip install -e .
```
3. Build the index. The example ships a `pdf_files/` folder of sample papers and RFCs; run a one-time catch-up or keep the index live:
```sh
cocoindex update main      # catch-up
cocoindex update -L main   # live: keep watching for file changes
```
4. Search from the command line:
```sh
python main.py "what is attention?"
```
With the sample papers indexed, the most semantically similar passages come back ranked, even when they share none of the words in your query. This example keeps it minimal and doesn't declare a vector index, so every query does a sequential scan over all stored embeddings. For a larger corpus, add `target_table.declare_vector_index(column="embedding")` exactly as Semantic Search 101 does.
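What a sequential scan means for vector search can be sketched in a few lines: score every stored row against the query and sort. This is an illustration only, not the query in `main.py`; the function names are hypothetical, cosine similarity is an assumed metric, and the three-dimensional vectors are invented toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sequential_scan(
    query: list[float],
    rows: list[tuple[str, list[float]]],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Score every row against the query, then return the top_k best matches."""
    scored = [(cosine_similarity(query, embedding), text) for text, embedding in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy rows: (chunk text, made-up embedding).
    rows = [
        ("attention weights over tokens", [0.9, 0.1, 0.0]),
        ("tcp three-way handshake", [0.0, 0.2, 0.9]),
        ("self-attention layers", [0.8, 0.3, 0.1]),
    ]
    for score, text in sequential_scan([1.0, 0.0, 0.0], rows, top_k=2):
        print(f"{score:.3f}  {text}")
```

The cost grows linearly with the number of rows, since each query touches every embedding. A vector index trades some exactness or build time for avoiding that full pass, which is why it becomes worthwhile as the corpus grows.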
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/pdf-embedding/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>