Document Extraction

cookbook/data_labeling/_16_document_extraction/README.md

Multipage PDF → typed Pydantic object. This is the closest analog to production labeling work: extracting invoice fields, contract clauses, statement line items, and lab report fields.

Files

  • basic.py: extract top-level document metadata.
  • with_line_items.py: extract a list of nested sub-objects (the line-item shape: invoice line items, recipe steps, contract clauses).
  • with_confidence.py: add a confidence score to each extracted field.
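
To make the three shapes above concrete, here is a minimal sketch of what such a schema could look like. The class and field names (ScoredText, LineItem, ExtractedDocument) are illustrative assumptions and are not taken from the cookbook files; only the general pattern (top-level fields, a nested list, a per-field confidence) comes from the descriptions above.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ScoredText(BaseModel):
    """Hypothetical wrapper: an extracted text value plus a confidence score."""

    value: Optional[str] = Field(None, description="Extracted text, or None if absent")
    confidence: float = Field(..., ge=0.0, le=1.0, description="0 = guess, 1 = certain")


class LineItem(BaseModel):
    """One repeated sub-object: an invoice line, a recipe step, a contract clause."""

    description: str
    quantity: Optional[float] = None
    amount: Optional[float] = None


class ExtractedDocument(BaseModel):
    """Top-level metadata plus a list of nested line items."""

    title: ScoredText
    issuer: ScoredText
    line_items: List[LineItem] = Field(default_factory=list)


# Made-up sample data standing in for a model response.
raw = {
    "title": {"value": "Invoice 001", "confidence": 0.97},
    "issuer": {"value": None, "confidence": 0.2},
    "line_items": [
        {"description": "Widget", "quantity": 2, "amount": 10.0},
        {"description": "Shipping", "amount": 4.5},
    ],
}

# Nested dicts are validated and converted into the nested models.
doc = ExtractedDocument(**raw)

# Out-of-range confidence values are rejected at validation time.
try:
    ScoredText(value="x", confidence=1.5)
    rejected = False
except ValidationError:
    rejected = True
```

Wrapping each scored field in its own small model keeps the value and its confidence together, at the cost of a more deeply nested schema. Whether the cookbook uses this layout or a flatter one is not stated here.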

These examples use a public recipe book PDF as the demo input so the cookbook runs out of the box. For production use, swap the URL or pass a local File(filepath="...") that points to your own invoice, contract, or report PDF, and adapt the schema to match.

When to use

  • Lift invoice / receipt / statement fields into a database row.
  • Extract contract clauses for review queues.
  • Build a structured index over a PDF corpus.

If you only need a document-type label, use _15_document_classification/. For multi-agent quality control on top of this primitive, see _18_quality_review/.
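
As a rough illustration of the first use case above (lifting invoice fields into a database row), the sketch below flattens an already-extracted result into a SQLite table using only the standard library. The table layout, the field names, and the sample values are assumptions made for the example; in practice the dict would come from the extracted Pydantic object.

```python
import json
import sqlite3

# Hypothetical extraction result as a plain dict (made-up values).
extracted = {
    "invoice_number": "INV-001",
    "vendor": "Acme Corp",
    "total": 129.5,
    "line_items": [
        {"description": "Widget", "quantity": 2, "amount": 100.0},
        {"description": "Shipping", "quantity": 1, "amount": 29.5},
    ],
}


def to_row(doc: dict) -> tuple:
    """Map scalar fields to columns and keep the nested list as JSON text."""
    return (
        doc["invoice_number"],
        doc["vendor"],
        doc["total"],
        json.dumps(doc["line_items"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "invoice_number TEXT PRIMARY KEY, vendor TEXT, total REAL, line_items_json TEXT)"
)
# Parameterized insert: values are bound, never formatted into the SQL string.
conn.execute("INSERT INTO invoices VALUES (?, ?, ?, ?)", to_row(extracted))
conn.commit()

stored = conn.execute(
    "SELECT vendor, total FROM invoices WHERE invoice_number = ?", ("INV-001",)
).fetchone()
```

Storing line items as a JSON column keeps the example short. A separate line-items table keyed by invoice number would be the more conventional choice if the items need to be queried individually.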

Run

```bash
python cookbook/data_labeling/_16_document_extraction/basic.py
python cookbook/data_labeling/_16_document_extraction/with_line_items.py
python cookbook/data_labeling/_16_document_extraction/with_confidence.py
```

Requires the OPENAI_API_KEY environment variable to be set.