examples/patient_intake_extraction_baml/README.md
The hard part isn't reading the PDF; it's getting data that matches a schema so downstream code can trust it.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/patient-intake-baml/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>Intake forms are messy, multi-section PDFs — demographics, insurance, medications, allergies, consent — and the real challenge is getting back data that matches a schema every time. This pipeline uses BAML to declare that schema and run a single type-safe extraction per form against a Gemini vision model. You declare the transformation in native Python and your own types — target_state = transformation(source_state) — and the heavy lifting (incremental processing, change tracking, managed targets) runs in a Rust engine underneath, so only changed PDFs get re-extracted and the LLM call is skipped entirely for forms you've already processed.
The schema lives in baml_src/patient.baml, not in Python. You describe the Patient record as BAML classes; the same file holds the extraction function and the Gemini client (pointing at gemini-2.5-flash), which reads PDFs natively as vision input — no separate parse or OCR step. Running baml generate compiles this into a baml_client/ package you import from Python: b.ExtractPatientInfo(...) and the Patient Pydantic model.
The CocoIndex side is two short functions — wrap BAML in a @coco.fn, then declare one JSON file per form. Read it in main.py:
@coco.fn
async def extract_patient_info(content: bytes) -> Patient:
"""Extract patient information from PDF content using BAML."""
pdf = baml_py.Pdf.from_base64(base64.b64encode(content).decode("utf-8"))
return await b.ExtractPatientInfo(pdf)
@coco.fn(memo=True)
async def process_patient_form(file: FileLike, outdir: pathlib.Path) -> None:
content = await file.read()
patient_info = await extract_patient_info(content)
patient_json = patient_info.model_dump_json(indent=2)
output_filename = file.file_path.path.stem + ".json"
localfs.declare_file(outdir / output_filename, patient_json, create_parent_dirs=True)
The return type is Patient — the actual Pydantic class BAML generated, not a dict — so everything downstream is typed and validated. There's no prompt engineering or response parsing here; that all lives in the BAML schema, and the LLM call is one await. app_main walks data/patient_forms/ for *.pdf and runs one process_patient_form component per file with mount_each.
Step-by-step walkthrough with the BAML schema, the type-safe extraction function, the per-form component, and what happens on each kind of change.
</p>ExtractPatientInfo(intake_form: pdf) -> Patient is the whole spec — BAML forces the model's output to conform, so every record has the same shape, ready to load into a database or chart.Gemini client reads the PDF directly as vision input — checkboxes, hand-filled fields, tables — with no separate parse or Markdown step.b.ExtractPatientInfo returns a generated Pydantic Patient, not a string to parse — model_dump_json serializes the validated model straight to disk.@coco.fn(memo=True) skips a form entirely when its bytes and the function's code are unchanged, so you never pay for a second Gemini call on a PDF you've already extracted.1. Install:
pip install -e .
2. Generate the BAML client — compiles baml_src/patient.baml into the baml_client/ package that main.py imports (required):
baml generate
3. Configure — the extraction uses a Gemini vision model:
cp .env.example .env # set GEMINI_API_KEY
4. Run the pipeline — a catch-up run scans the forms, extracts, writes, and exits:
cocoindex update main.py
This reads each PDF in data/patient_forms/, extracts a Patient, and writes one JSON file per form to output_patients/:
ls output_patients/
# Patient_Intake_Form_David_Artificial.json
# Patient_Intake_Form_Emily_Artificial.json
# ...one .json per intake PDF
Each file is a fully populated, schema-validated patient record — the same shape every time. Edit the BAML schema or swap the model, run baml generate again, and the next cocoindex update main.py re-extracts only what changed.
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/patient-intake-baml/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>