README - Cocoindex — ContextQMD

<a href="https://cocoindex.io/docs/examples/paper-metadata/" title="Index academic PDFs into structured metadata with CocoIndex — LLM-extract title/authors/abstract, embed for semantic search, store in Postgres pgvector, in plain async Python"> </a> <h1 align="center">Turn a folder of papers into structured metadata.</h1> Read just the first page, have an LLM extract the title, authors, and abstract into typed rows, then embed the metadata so you can search papers by meaning, all in plain async Python.

One PDF fans out into three Postgres tables, and CocoIndex keeps all three in sync as the folder changes.

Star us ❤️ → <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub">Star CocoIndex on GitHub</a>  ·  <a href="https://cocoindex.io/docs/examples/paper-metadata/" title="Read the full walkthrough">Read the full walkthrough</a>  ·  <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord">Join the CocoIndex Discord</a>

The first page of a paper holds almost everything you'd want to query — title, authors, abstract — but it's locked in PDF prose. This pipeline reads just that page, hands the text to an LLM with a strict schema, and gets back clean typed JSON; the same metadata is then embedded so you can search by meaning, not exact words. You declare the transformation in native Python and your own types — target_state = transformation(source_state) — and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so only changed PDFs get re-extracted and re-embedded.
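
To make the target_state = transformation(source_state) idea concrete, here is a minimal sketch of the reconciliation step in plain Python. It is not CocoIndex's engine: `reconcile`, the dict-of-rows representation, and the sample filenames are all assumptions made for illustration. It only shows why declaring the desired rows is enough to derive the upserts and deletes.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def reconcile(
    previous: Dict[str, Row], declared: Dict[str, Row]
) -> Tuple[List[str], List[str]]:
    """Return (keys to upsert, keys to delete) that move previous to declared."""
    # A row is upserted when it is new or its content differs from last run.
    upserts = sorted(
        key for key, row in declared.items() if previous.get(key) != row
    )
    # A row is deleted when nothing declares it any more (its PDF is gone).
    deletes = sorted(key for key in previous if key not in declared)
    return upserts, deletes


# Hypothetical target states, keyed by filename.
previous = {
    "a.pdf": {"title": "Paper A", "num_pages": 10},
    "b.pdf": {"title": "Paper B", "num_pages": 8},
}
declared = {
    "a.pdf": {"title": "Paper A", "num_pages": 10},  # unchanged: no write
    "c.pdf": {"title": "Paper C", "num_pages": 12},  # new: upsert
}

upserts, deletes = reconcile(previous, declared)
print("upsert:", upserts, "delete:", deletes)
```

Unchanged rows produce no work at all, which is the property the pipeline relies on to avoid repeating LLM and embedding calls.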

How it works

One PDF flows through three small functions and fans into three tables:

extract_basic_info slices the first page out of the PDF and counts the pages; pdf_to_markdown pulls the text off that page with pypdf.
extract_metadata hands that text to gpt-4o (via the openai SDK) with response_format={"type": "json_object"} and temperature=0, then model_validate_json parses it into a typed PaperMetadataModel — a malformed response fails loudly instead of writing junk.
process_file declares the rows: one metadata row, one author-index row per author, one embedding row for the title plus one per abstract chunk.
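
The "fails loudly" behaviour in the extraction step can be sketched without any dependencies. The example itself uses a Pydantic model and model_validate_json; the stdlib stand-in below (`parse_metadata`, `PaperMetadata`, `Author`) is hypothetical and only mirrors the fields used by process_file (title, authors with a name, abstract), so treat it as an illustration of the idea, not the example's code.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Author:
    name: str


@dataclass(frozen=True)
class PaperMetadata:
    title: str
    authors: List[Author]
    abstract: str


def parse_metadata(raw: str) -> PaperMetadata:
    """Parse an LLM response, raising ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    for key in ("title", "abstract"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"field {key!r} must be a string")
    authors = data.get("authors")
    if not isinstance(authors, list):
        raise ValueError("field 'authors' must be a list")
    parsed: List[Author] = []
    for item in authors:
        if not isinstance(item, dict) or not isinstance(item.get("name"), str):
            raise ValueError("each author must be an object with a string 'name'")
        parsed.append(Author(name=item["name"]))
    return PaperMetadata(title=data["title"], authors=parsed, abstract=data["abstract"])


good = parse_metadata('{"title": "T", "authors": [{"name": "Ada"}], "abstract": "A"}')
print(good.title, [a.name for a in good.authors])

try:
    # "authors" is a string here, so nothing is returned and nothing is written.
    parse_metadata('{"title": "T", "authors": "Ada", "abstract": "A"}')
except ValueError as exc:
    print("rejected:", exc)
```

Raising instead of returning a partially filled object is the design choice that matters: a bad response stops the row from being declared at all.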

Read it in main.py:


@coco.fn(memo=True)
async def process_file(
    file: FileLike,
    metadata_table: postgres.TableTarget[PaperMetadataRow],
    author_table: postgres.TableTarget[AuthorPaperRow],
    embedding_table: postgres.TableTarget[MetadataEmbeddingRow],
) -> None:
    content = await file.read()
    basic_info = extract_basic_info(content)
    metadata = extract_metadata(pdf_to_markdown(basic_info.first_page))

    metadata_table.declare_row(row=PaperMetadataRow(
        filename=str(file.file_path.path), title=metadata.title,
        authors=[a.model_dump() for a in metadata.authors],
        abstract=metadata.abstract, num_pages=basic_info.num_pages,
    ))
    for author in metadata.authors:
        if author.name:
            author_table.declare_row(row=AuthorPaperRow(
                author_name=author.name, filename=str(file.file_path.path)))

    title_embedding = await coco.use_context(EMBEDDER).embed(metadata.title)
    embedding_table.declare_row(row=MetadataEmbeddingRow(
        id=uuid.uuid4(), filename=str(file.file_path.path),
        location="title", text=metadata.title, embedding=title_embedding))
    for chunk in _abstract_splitter.split(metadata.abstract, chunk_size=500, ...):
        embedding_table.declare_row(row=MetadataEmbeddingRow(
            id=uuid.uuid4(), filename=str(file.file_path.path), location="abstract",
            text=chunk.text, embedding=await coco.use_context(EMBEDDER).embed(chunk.text)))

embedding: Annotated[NDArray, EMBEDDER] ties the vector column to the embedder, so its dimensions are inferred automatically. app_main mounts the three tables (with different primary keys), walks the source for *.pdf, and runs one process_file component per file with mount_each.
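
How an Annotated field can carry enough information to infer a vector dimension is easy to show with the standard typing module. This is a sketch of the general mechanism, not CocoIndex's implementation: `EmbedderSpec`, its `dim` field, the model name, and `vector_dims` are hypothetical, and the `EMBEDDER` below is a stand-in for the example's own object.

```python
from dataclasses import dataclass
from typing import Annotated, Dict, List, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class EmbedderSpec:
    """Hypothetical marker carrying an embedding model's output size."""
    model: str
    dim: int


# Stand-in for the example's EMBEDDER; the name and values are assumptions.
EMBEDDER = EmbedderSpec(model="example-embedding-model", dim=384)


@dataclass
class MetadataEmbeddingRow:
    filename: str
    text: str
    embedding: Annotated[List[float], EMBEDDER]


def vector_dims(row_type: type) -> Dict[str, int]:
    """Map each annotated vector field to the dimension of its embedder."""
    dims: Dict[str, int] = {}
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(row_type, include_extras=True)
    for name, hint in hints.items():
        if get_origin(hint) is Annotated:
            for extra in get_args(hint)[1:]:
                if isinstance(extra, EmbedderSpec):
                    dims[name] = extra.dim
    return dims


print(vector_dims(MetadataEmbeddingRow))
```

Because the dimension travels with the type, a table definition derived from the row class never needs a hand-written vector size.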

📘 <a href="https://cocoindex.io/docs/examples/paper-metadata/">Full Tutorial →</a>

Step-by-step walkthrough with the Pydantic schema, the three-table fan-out, the abstract splitter, and the pgvector query.

Why it's worth a star ⭐

One file, three tables, kept in sync. Paper metadata, an author-to-paper index, and embeddings — mount_table_target upserts only what changed and removes rows whose PDF is gone, across all three.
First page only, capped at 4000 chars. That's almost always enough for the title block and abstract, and it keeps token cost flat regardless of paper length.
Typed extraction, validated loudly. gpt-4o returns JSON and PaperMetadataModel.model_validate_json rejects anything off-schema, so junk never reaches Postgres.
Incremental by default. @coco.fn(memo=True) skips a PDF entirely when its bytes and the function's code are unchanged, so you never re-pay for the LLM call or the embeddings on a file you've seen.
Honest cache busting. EMBEDDER is declared with detect_change=True, so swapping the embedding model re-embeds everything with no cache to clear by hand.
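
The "skip when bytes and code are unchanged" rule can be sketched as a cache keyed on a hash of both. This is a simplified in-memory illustration, not how @coco.fn(memo=True) is implemented: `memo_by_content`, `code_fingerprint`, and `extract_title` are hypothetical names, and the bytecode hash is a rough proxy for "the function's code".

```python
import hashlib
from typing import Callable, Dict, List


def code_fingerprint(fn: Callable) -> str:
    """Hash a function's bytecode and constants as a cheap 'code changed?' signal."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def memo_by_content(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
    """Skip fn when both the input bytes and fn's code are unchanged."""
    cache: Dict[str, str] = {}
    fingerprint = code_fingerprint(fn)

    def wrapper(content: bytes) -> str:
        # The key covers the code and the input, so editing either one misses.
        key = hashlib.sha256(fingerprint.encode() + content).hexdigest()
        if key not in cache:
            cache[key] = fn(content)
        return cache[key]

    return wrapper


calls: List[int] = []


@memo_by_content
def extract_title(content: bytes) -> str:
    calls.append(len(content))  # stands in for the paid LLM call
    # Cap the input at 4000 characters, as the example does for the first page.
    return content[:4000].decode(errors="replace").splitlines()[0]


extract_title(b"Paper A\nbody")
extract_title(b"Paper A\nbody")     # same bytes: served from the cache
extract_title(b"Paper A v2\nbody")  # changed bytes: recomputed
print("expensive calls:", len(calls))
```

Folding the code fingerprint into the key is what removes the need to clear a cache by hand after changing the function.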

Run it

1. Start Postgres (with pgvector):

docker compose -f ../../dev/postgres.yaml up -d

2. Configure & install — the example ships a papers/ folder of well-known papers:

cp .env.example .env     # set POSTGRES_URL and OPENAI_API_KEY
pip install -e .

3. Build the index — catch-up (scan, sync, exit) or live (catch up, then keep watching):

cocoindex update main       # catch-up run
cocoindex update -L main    # live run — watch the papers/ folder for changes

This reads each PDF's first page, LLM-extracts the metadata, embeds the title and abstract chunks, and writes the three tables in the coco_examples_v1 schema.
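
As a rough picture of what gets embedded per paper, the sketch below produces one (location, text) pair for the title and one per abstract chunk. The splitter is a simple greedy word-boundary one written for this illustration; it is not the example's _abstract_splitter, and `split_text` and `embedding_inputs` are hypothetical names.

```python
from typing import List, Tuple


def split_text(text: str, chunk_size: int = 500) -> List[str]:
    """Greedy word-boundary splitter: chunks are at most chunk_size characters,
    except that a single word longer than chunk_size becomes its own chunk."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embedding_inputs(
    title: str, abstract: str, chunk_size: int = 500
) -> List[Tuple[str, str]]:
    """Return (location, text) pairs: one for the title, one per abstract chunk."""
    rows = [("title", title)]
    rows.extend(("abstract", chunk) for chunk in split_text(abstract, chunk_size))
    return rows


# A synthetic 250-word abstract, just to exercise the splitter.
rows = embedding_inputs("Example title", "word " * 250, chunk_size=500)
for location, text in rows:
    print(location, len(text))
```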

4. Search by meaning — a plain SQL query over pgvector's cosine distance, reusing the same embedder:

python main.py "graph neural networks"

The most semantically similar titles and abstracts come back ranked — even when they share none of the query's words. Note: to keep the example minimal it declares no vector index, so queries do a sequential scan (fine for a handful of papers).
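
For intuition about what the query computes, here is a pure-Python sketch of ranking by cosine distance, the measure behind pgvector's `<=>` operator. Looping over every row is the in-memory analogue of the sequential scan mentioned above. The three-dimensional vectors and the texts are toy values made up for illustration; real embeddings come from the embedder and have many more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def search(
    query: Sequence[float], rows: Dict[str, List[float]], top_k: int = 5
) -> List[Tuple[float, str]]:
    """Score every row (a sequential scan) and return the closest top_k."""
    scored = [(cosine_distance(query, vector), text) for text, vector in rows.items()]
    return sorted(scored)[:top_k]


# Toy embeddings; the values are invented for this sketch.
rows = {
    "Graph attention networks": [0.9, 0.1, 0.0],
    "Protein folding": [0.0, 0.2, 0.9],
    "Message passing on graphs": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

results = search(query, rows, top_k=3)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

Smaller distances rank first, so results are ordered by direction in embedding space instead of by shared words.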

If this turned your PDFs into searchable rows, <a href="https://github.com/cocoindex-io/cocoindex">give CocoIndex a star ⭐</a> — it helps a lot.

<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/paper-metadata/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples">See all examples →</a>