examples/paper_metadata/README.md
One PDF fans out into three Postgres tables, and CocoIndex keeps all three in sync as the folder changes.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/paper-metadata/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

The first page of a paper holds almost everything you'd want to query (title, authors, abstract), but it's locked in PDF prose. This pipeline reads just that page, hands the text to an LLM with a strict schema, and gets back clean typed JSON; the same metadata is then embedded so you can search by meaning, not exact words. You declare the transformation in native Python with your own types, `target_state = transformation(source_state)`, and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so only changed PDFs get re-extracted and re-embedded.
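To make the "only changed PDFs get re-extracted" idea concrete, here is a minimal standard-library sketch of fingerprint-based incremental sync. It is not CocoIndex's engine: `fingerprint`, `sync`, and the in-memory `cache` are hypothetical names invented for illustration, and the sketch keys only on file bytes, whereas the real memoization also accounts for the function's code.

```python
import hashlib
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Stable identifier for a file's bytes (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def sync(
    source: dict[str, bytes],
    cache: dict[str, tuple[str, dict]],
    transform: Callable[[bytes], dict],
) -> dict[str, dict]:
    """Bring the target rows in line with the source, recomputing only changed files.

    `cache` maps filename -> (fingerprint, row) and is mutated in place.
    """
    # Drop rows whose source file no longer exists.
    for name in list(cache):
        if name not in source:
            del cache[name]

    # Recompute a row only when the file is new or its bytes changed.
    for name, content in source.items():
        fp = fingerprint(content)
        cached = cache.get(name)
        if cached is None or cached[0] != fp:
            cache[name] = (fp, transform(content))

    return {name: row for name, (_, row) in cache.items()}
```

Calling `sync` twice with the same folder contents invokes `transform` zero times on the second pass, which is the property that keeps LLM and embedding costs proportional to what actually changed.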
One PDF flows through three small functions and fans into three tables:
- `extract_basic_info` slices the first page out of the PDF and counts the pages; `pdf_to_markdown` pulls the text off that page with pypdf.
- `extract_metadata` hands that text to gpt-4o (via the openai SDK) with `response_format={"type": "json_object"}` and `temperature=0`, then `model_validate_json` parses it into a typed `PaperMetadataModel`, so a malformed response fails loudly instead of writing junk.
- `process_file` declares the rows: one metadata row, one author-index row per author, one embedding row for the title plus one per abstract chunk.

Read it in `main.py`:
```python
@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    metadata_table: postgres.TableTarget[PaperMetadataRow],
    author_table: postgres.TableTarget[AuthorPaperRow],
    embedding_table: postgres.TableTarget[MetadataEmbeddingRow],
) -> None:
    content = await file.read()
    basic_info = extract_basic_info(content)
    metadata = extract_metadata(pdf_to_markdown(basic_info.first_page))

    metadata_table.declare_row(row=PaperMetadataRow(
        filename=str(file.file_path.path), title=metadata.title,
        authors=[a.model_dump() for a in metadata.authors],
        abstract=metadata.abstract, num_pages=basic_info.num_pages,
    ))

    for author in metadata.authors:
        if author.name:
            author_table.declare_row(row=AuthorPaperRow(
                author_name=author.name, filename=str(file.file_path.path)))

    title_embedding = await coco.use_context(EMBEDDER).embed(metadata.title)
    embedding_table.declare_row(row=MetadataEmbeddingRow(
        id=uuid.uuid4(), filename=str(file.file_path.path),
        location="title", text=metadata.title, embedding=title_embedding))

    for chunk in _abstract_splitter.split(metadata.abstract, chunk_size=500, ...):
        embedding_table.declare_row(row=MetadataEmbeddingRow(
            id=uuid.uuid4(), filename=str(file.file_path.path), location="abstract",
            text=chunk.text, embedding=await coco.use_context(EMBEDDER).embed(chunk.text)))
```
`embedding: Annotated[NDArray, EMBEDDER]` ties the vector column to the embedder, so its dimensions are inferred automatically. `app_main` mounts the three tables (with different primary keys), walks the source for `*.pdf`, and runs one `process_file` component per file with `mount_each`.
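The "fails loudly" behaviour of the extraction step can be illustrated without any dependencies. The sketch below is a standard-library stand-in for Pydantic's `model_validate_json`, not the example's actual code: `Author`, `PaperMetadata`, and `parse_metadata` are hypothetical names, and the schema is assumed to contain only the `title`, `authors[].name`, and `abstract` fields used in the listing above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    name: str


@dataclass(frozen=True)
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_metadata(raw: str) -> PaperMetadata:
    """Parse an LLM reply, raising ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    # Required string fields.
    for key in ("title", "abstract"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"{key!r} must be a string")

    # Required list of author objects, each with a string name.
    authors = data.get("authors")
    if not isinstance(authors, list):
        raise ValueError("'authors' must be a list")
    parsed = []
    for item in authors:
        if not isinstance(item, dict) or not isinstance(item.get("name"), str):
            raise ValueError("each author needs a string 'name'")
        parsed.append(Author(name=item["name"]))

    return PaperMetadata(title=data["title"], authors=parsed, abstract=data["abstract"])
```

Because the parser raises before any row is built, a truncated or mis-shaped reply stops the pipeline for that file instead of producing a partially filled row.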
Step-by-step walkthrough with the Pydantic schema, the three-table fan-out, the abstract splitter, and the pgvector query.
</p>

- `mount_table_target` upserts only what changed and removes rows whose PDF is gone, across all three.
- gpt-4o returns JSON, and `PaperMetadataModel.model_validate_json` rejects anything off-schema, so junk never reaches Postgres.
- `@coco.fn(memo=True)` skips a PDF entirely when its bytes and the function's code are unchanged, so you never re-pay for the LLM call or the embeddings on a file you've seen.
- `EMBEDDER` is declared with `detect_change=True`, so swapping the embedding model re-embeds everything with no cache to clear by hand.

1. Start Postgres (with pgvector):

```sh
docker compose -f ../../dev/postgres.yaml up -d
```
2. Configure & install — the example ships a papers/ folder of well-known papers:
```sh
cp .env.example .env   # set POSTGRES_URL and OPENAI_API_KEY
pip install -e .
```
3. Build the index — catch-up (scan, sync, exit) or live (catch up, then keep watching):
```sh
cocoindex update main      # catch-up run
cocoindex update -L main   # live run: watch the papers/ folder for changes
```
This reads each PDF's first page, extracts the metadata with the LLM, embeds the title and abstract chunks, and writes the three tables in the `coco_examples_v1` schema.
4. Search by meaning — a plain SQL query over pgvector's cosine distance, reusing the same embedder:
```sh
python main.py "graph neural networks"
```
The most semantically similar titles and abstracts come back ranked — even when they share none of the query's words. Note: to keep the example minimal it declares no vector index, so queries do a sequential scan (fine for a handful of papers).
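For intuition about what the ranking computes, here is a small pure-Python sketch of cosine-distance search. pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity); the code below reproduces that measure and the sequential scan in memory. The function names and the toy two-dimensional vectors are illustrative assumptions, not part of the example's `main.py`.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, the measure pgvector's <=> operator uses."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_cosine(
    query: Sequence[float],
    rows: Sequence[tuple[str, Sequence[float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Sequentially scan (label, embedding) rows and return the k nearest labels."""
    scored = [(label, cosine_distance(query, vec)) for label, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]
```

A sequential scan like this costs one distance computation per stored row, which is why it is acceptable for a handful of papers and why a vector index becomes worthwhile as the table grows.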
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/paper-metadata/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>