Information extraction

docs/examples/extraction.ipynb

Docling can extract information, i.e. structured data, from unstructured documents.

The user provides the desired data schema (also known as a template) as a string, a dictionary, or a Pydantic model, and Docling returns the extracted data as a standardized output, organized by page.

Check out the subsections below for different usage scenarios.

python
%pip install -q docling[vlm]  # Install the Docling package with VLM support

python
from IPython import display
from pydantic import BaseModel, Field
from rich import print

In this notebook, we will work with an example input image. Let's quickly inspect it:

python
file_path = (
    "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg"
)
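# Render the image inline; the <img> tag was lost from the original f-string
display.HTML(f"<img src='{file_path}'>")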

Defining the extractor

Let's first define our extractor:

python
from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor

extractor = DocumentExtractor(allowed_formats=[InputFormat.IMAGE, InputFormat.PDF])

Next, we look at different ways to define the data template.

Using a string template

python
result = extractor.extract(
    source=file_path,
    template='{"bill_no": "string", "total": "float"}',
)
print(result.pages)
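
The string template above looks like the JSON serialization of a dictionary template. Assuming that is how it is interpreted (the source does not state this explicitly), such a string can be generated from a dictionary with the standard library instead of being written by hand. A minimal sketch:

```python
import json

# Field names mapped to type hints, as in the string template above.
template_dict = {"bill_no": "string", "total": "float"}

# Serialize the dictionary to a JSON string (assumption: the string
# template is expected to be valid JSON).
template_str = json.dumps(template_dict)
print(template_str)

# Round-trip check: parsing the string gives back the same dictionary.
assert json.loads(template_str) == template_dict
```

Generating the string this way avoids quoting mistakes that are easy to make when typing JSON inside a Python string literal.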

Using a dict template

python
result = extractor.extract(
    source=file_path,
    template={
        "bill_no": "string",
        "total": "float",
    },
)
print(result.pages)

Using a Pydantic model template

First, we define the Pydantic model we want to use:

python
from typing import Optional


class Invoice(BaseModel):
    bill_no: str = Field(
        examples=["A123", "5414"]
    )  # provide some examples, but no default value
    total: float = Field(
        default=10, examples=[20]
    )  # provide some examples and a default value
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])

The class itself can then be used directly as the template:

python
result = extractor.extract(
    source=file_path,
    template=Invoice,
)
print(result.pages)
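
To see what information a model class carries as a template, it can help to inspect its JSON schema. The sketch below uses only Pydantic v2 and redefines the Invoice model so that it runs on its own. How Docling consumes the model internally is not described here, so treat this as a way to check your own template rather than as a description of the extractor.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])


# Pydantic v2: export the model as a JSON schema dictionary.
schema = Invoice.model_json_schema()

# Fields without a default value are listed as required.
print(schema["required"])

# Each property carries its default (if any) and its examples.
for name, spec in schema["properties"].items():
    print(name, spec.get("default"), spec.get("examples"))
```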

Alternatively, a Pydantic model instance can be passed as the template, which allows overriding the default values.

This is useful when we have context that is more relevant than the defaults predefined in the model definition.

For instance, in the example below:

  • bill_no and total are set from the values extracted from the document,
  • there was no tax_id to extract, so the updated default we provided was applied.

python
result = extractor.extract(
    source=file_path,
    template=Invoice(
        bill_no="41",
        total=100,
        tax_id="42",
    ),
)
print(result.pages)
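
The fallback behavior described above can be pictured as a dictionary merge in which extracted values take precedence over the values of the template instance. The sketch below only illustrates that idea with Pydantic; it is not Docling's implementation, and the `extracted` dictionary holds made-up values.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])


# Template instance carrying the overridden defaults.
template = Invoice(bill_no="41", total=100, tax_id="42")

# Hypothetical extraction output in which no tax_id was found.
extracted = {"bill_no": "A-0042", "total": 250.0}

# Extracted values win; the instance fills in whatever is missing.
merged = {**template.model_dump(), **extracted}
invoice = Invoice.model_validate(merged)
print(invoice)
```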

Advanced Pydantic model

Besides a flat template, we can in principle use any Pydantic model, which lets us reuse model definitions and capture hierarchies:

python
class Contact(BaseModel):
    name: Optional[str] = Field(default=None, examples=["Smith"])
    address: str = Field(default="123 Main St", examples=["456 Elm St"])
    postal_code: str = Field(default="12345", examples=["67890"])
    city: str = Field(default="Anytown", examples=["Othertown"])
    country: Optional[str] = Field(default=None, examples=["Canada"])


class ExtendedInvoice(BaseModel):
    bill_no: str = Field(
        examples=["A123", "5414"]
    )  # provide some examples, but not the actual value of the test sample
    total: float = Field(
        default=10, examples=[20]
    )  # provide a default value and some examples
    garden_work_hours: int = Field(default=1, examples=[2])
    sender: Contact = Field(default=Contact(), examples=[Contact()])
    receiver: Contact = Field(default=Contact(), examples=[Contact()])

python
result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)
print(result.pages)

Validating and loading the extracted data

The generated response data can be easily validated and loaded via Pydantic:

python
invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)
print(invoice)
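
Since the output is organized by page, validation may need to run over several pages, and some pages may not match the model. The sketch below assumes only that each page exposes an `extracted_data` attribute, as used above. The `validate_pages` helper and the stand-in page objects are hypothetical and use made-up values.

```python
from types import SimpleNamespace

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])


def validate_pages(pages):
    """Validate each page's extracted_data; return (invoices, failures)."""
    invoices, failures = [], []
    for index, page in enumerate(pages):
        try:
            invoices.append(Invoice.model_validate(page.extracted_data))
        except ValidationError as exc:
            # Record the page index and the number of validation errors.
            failures.append((index, exc.error_count()))
    return invoices, failures


# Stand-ins for result.pages (hypothetical data, not real extractor output).
pages = [
    SimpleNamespace(extracted_data={"bill_no": "A123", "total": "19.5"}),
    SimpleNamespace(extracted_data={"total": "not a number"}),
]

invoices, failures = validate_pages(pages)
print(invoices)
print(failures)
```

Collecting failures instead of raising on the first one keeps the valid pages usable while still reporting which pages need attention.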

This takes us from completely unstructured data to a structured, developer-friendly representation:

python
print(
    f"Invoice #{invoice.bill_no} was sent by {invoice.sender.name} "
    f"to {invoice.receiver.name} at {invoice.receiver.address}."
)