cookbook/data_labeling/_16_document_extraction/README.md
Multipage PDF → typed Pydantic object. This is the closest neighbor to production labeling work: invoice fields, contract clauses, statement line items, lab report fields.
- basic.py: extract top-level document metadata.
- with_line_items.py: extract a list of nested sub-objects (the line-item shape: invoice line items, recipe steps, contract clauses).
- with_confidence.py: adds per-field confidence.

These examples use a public recipe book PDF as the demo input so the
cookbook runs out of the box. For the production case, swap the URL or
provide a local File(filepath="...") to your own invoice / contract /
report PDF, and adapt the schema.
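
As a rough illustration of what adapting the schema could look like for an invoice, here is a minimal Pydantic sketch that combines the nested line-item shape with per-field confidence. Everything in it is an assumption made for illustration: the class names (`LineItem`, `ScoredText`, `Invoice`), the field names, the `low_confidence_fields` helper, the 0.8 threshold, and the `raw` dictionary, which stands in for an extraction result and is not real output. The cookbook scripts may define their schemas differently.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    """One row of the invoice table (hypothetical fields)."""

    description: str
    quantity: float = Field(ge=0)
    unit_price: float = Field(ge=0)


class ScoredText(BaseModel):
    """A text value paired with a confidence score between 0 and 1."""

    value: Optional[str] = None  # None when the field is absent from the PDF
    confidence: float = Field(ge=0.0, le=1.0)


class Invoice(BaseModel):
    """Top-level metadata plus a list of nested line items."""

    invoice_number: ScoredText
    vendor_name: ScoredText
    line_items: List[LineItem] = Field(default_factory=list)


def low_confidence_fields(invoice: Invoice, threshold: float = 0.8) -> List[str]:
    """Return the names of scored fields whose confidence is below the threshold."""
    scored = {
        "invoice_number": invoice.invoice_number,
        "vendor_name": invoice.vendor_name,
    }
    return [name for name, field in scored.items() if field.confidence < threshold]


# Hand-written stand-in for an extraction result (illustrative, not real output).
raw = {
    "invoice_number": {"value": "INV-001", "confidence": 0.95},
    "vendor_name": {"value": "Acme", "confidence": 0.55},
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 4.5},
        {"description": "Setup fee", "quantity": 1, "unit_price": 10.0},
    ],
}

# Nested dictionaries are validated into the nested models.
invoice = Invoice(**raw)
total = sum(item.quantity * item.unit_price for item in invoice.line_items)
needs_review = low_confidence_fields(invoice)
```

Keeping the confidence next to each value, rather than in a separate parallel structure, makes it straightforward to route only the uncertain fields to a reviewer while accepting the rest.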
If you only need a document-type label, use
_15_document_classification/. For multi-agent
quality control on top of this primitive, see
_18_quality_review/.
python cookbook/data_labeling/_16_document_extraction/basic.py
python cookbook/data_labeling/_16_document_extraction/with_line_items.py
python cookbook/data_labeling/_16_document_extraction/with_confidence.py
Requires OPENAI_API_KEY.