docs/examples/serialization.ipynb
In this notebook we showcase the usage of Docling serializers.
%pip install -qU pip docling docling-core~=2.29 rich
DOC_SOURCE = "https://arxiv.org/pdf/2311.18481"
# we set some start-stop cues for defining an excerpt to print
start_cue = "Copyright © 2024"
stop_cue = "Application of NLP to ESG"
from rich.console import Console
from rich.panel import Panel
console = Console(width=210) # for preventing Markdown table wrapped rendering
def print_in_console(text):
    console.print(Panel(text))
We first convert the document:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
doc = converter.convert(source=DOC_SOURCE).document
We can now apply any BaseDocSerializer on the produced document.
👉 Note that, to keep the shown output brief, we only print an excerpt.
E.g. below we apply an HTMLDocSerializer:
from docling_core.transforms.serializer.html import HTMLDocSerializer
serializer = HTMLDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text
# print only an excerpt to keep the output brief:
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
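Note that `str.find` returns `-1` when a cue is absent, in which case the slice above would silently select the wrong range. A more defensive variant could look like the minimal sketch below. The name `excerpt` is hypothetical (it is not part of Docling), and falling back to the start or end of the text is an assumption made for illustration:

```python
def excerpt(text: str, start_cue: str, stop_cue: str) -> str:
    """Return text from start_cue up to (not including) stop_cue.

    Falls back to the start/end of the text when a cue is missing
    (an illustrative choice, not Docling behavior).
    """
    start = text.find(start_cue)
    if start == -1:
        start = 0  # start cue not found: begin at the top
    stop = text.find(stop_cue, start)
    if stop == -1:
        stop = len(text)  # stop cue not found: run to the end
    return text[start:stop]


# made-up sample text, for illustration only
sample = "intro | Copyright 2024 | body | Application of NLP | tail"
print(excerpt(sample, "Copyright", "Application"))
```

Searching for the stop cue only from `start` onward also avoids an empty result when the stop cue happens to occur before the start cue.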
In the following example, we use a MarkdownDocSerializer:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer
serializer = MarkdownDocSerializer(doc=doc)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Let's now assume we would like to reconfigure the Markdown serialization such that it uses a different table serialization strategy and a custom image placeholder text.
Check out the following configuration and notice the serialization differences in the output further below:
from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer
from docling_core.transforms.serializer.markdown import MarkdownParams
serializer = MarkdownDocSerializer(
    doc=doc,
    table_serializer=TripletTableSerializer(),
    params=MarkdownParams(
        image_placeholder="<!-- demo picture placeholder -->",
        # ...
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
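To get a feel for what a triplet-style table serialization produces, here is a standalone sketch of the general idea: each cell is emitted as a "row label, column header = value" statement. This is an illustration only and not the `TripletTableSerializer` implementation; the function `to_triplets` and the sample table are hypothetical:

```python
def to_triplets(table: list[list[str]]) -> str:
    """Serialize a table as "row, column = value" statements.

    Assumes the first row holds column headers and the first column
    holds row labels.
    """
    header, *body = table
    parts = [
        f"{row[0].strip()}, {header[j].strip()} = {row[j].strip()}"
        for row in body
        for j in range(1, len(row))
    ]
    return ". ".join(parts)


# made-up data, for illustration only
table = [
    ["Metric", "2022", "2023"],
    ["Revenue", "10", "12"],
    ["Cost", "7", "8"],
]
print(to_triplets(table))
# → Revenue, 2022 = 10. Revenue, 2023 = 12. Cost, 2022 = 7. Cost, 2023 = 8
```

Compared with a Markdown table, such a flat form keeps each value next to its row and column context, which can be convenient for chunking.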
In the examples above, we were able to reuse existing implementations for our desired serialization strategy. Let's now assume we want to define custom serialization logic, e.g. we would like picture serialization to include any available picture description (captioning) annotations.
To that end, we first need to revisit our conversion and include all pipeline options needed for picture description enrichment.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmOptions(
        repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
        prompt="Describe this picture in three to five sentences. Be precise and concise.",
    ),
    generate_picture_images=True,
    images_scale=2,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert(source=DOC_SOURCE).document
We can then define our custom picture serializer:
from typing import Any, Optional
from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    SerializationResult,
)
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownParams,
    MarkdownPictureSerializer,
)
from docling_core.types.doc.document import (
    DoclingDocument,
    ImageRefMode,
    PictureDescriptionData,
    PictureItem,
)
from typing_extensions import override
class AnnotationPictureSerializer(MarkdownPictureSerializer):
    @override
    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        separator: Optional[str] = None,
        **kwargs: Any,
    ) -> SerializationResult:
        text_parts: list[str] = []

        # reusing the existing result:
        parent_res = super().serialize(
            item=item,
            doc_serializer=doc_serializer,
            doc=doc,
            **kwargs,
        )
        text_parts.append(parent_res.text)

        # appending annotations:
        if item.meta is not None and item.meta.description is not None:
            text_parts.append(
                f"<!-- Picture description: {item.meta.description.text} -->"
            )

        text_res = (separator or "\n").join(text_parts)
        return create_ser_result(text=text_res, span_source=item)
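The text assembly at the end of `serialize` can be tried out in isolation. The sketch below mirrors only the joining step, with plain strings standing in for the parent result and the description. The helper `join_picture_parts` and the sample strings are hypothetical and exist purely for illustration:

```python
from typing import Optional


def join_picture_parts(
    base_text: str,
    description: Optional[str],
    separator: Optional[str] = None,
) -> str:
    """Append a description comment to base_text when one is available."""
    parts = [base_text]
    if description is not None:
        parts.append(f"<!-- Picture description: {description} -->")
    # a missing (or empty) separator falls back to a newline
    return (separator or "\n").join(parts)


# made-up inputs, for illustration only
print(join_picture_parts("<!-- image -->", "A bar chart."))
print(join_picture_parts("<!-- image -->", None))
```

A picture without a description therefore serializes exactly as the parent serializer would render it, with nothing appended.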
Last but not least, we define a new doc serializer which leverages our custom picture serializer.
Notice the picture description annotations in the output below:
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=AnnotationPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.PLACEHOLDER,
        image_placeholder="",
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])
Another common use case is to uniquely identify each picture in the serialized output,
e.g. for downstream matching or cross-referencing with the original DoclingDocument.
The example below derives a per-picture index from self_ref and resolves an
{index} token inside the placeholder string:
from docling_core.transforms.serializer.base import SerializationResult
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownPictureSerializer,
)
from docling_core.types.doc.base import ImageRefMode
from docling_core.types.doc.document import DoclingDocument, PictureItem
class IndexedMarkdownPictureSerializer(MarkdownPictureSerializer):
    """Custom picture serializer that supports {index} in the placeholder."""

    def _serialize_image_part(
        self,
        item: PictureItem,
        doc: DoclingDocument,
        image_mode: ImageRefMode,
        image_placeholder: str,
        **kwargs,
    ) -> SerializationResult:
        # derive the picture index from the last segment of self_ref
        pic_idx = item.self_ref.rsplit("/", 1)[-1]
        resolved_placeholder = image_placeholder.replace("{index}", pic_idx)

        # for non-placeholder modes, defer to the parent implementation
        if image_mode != ImageRefMode.PLACEHOLDER:
            return super()._serialize_image_part(
                item=item,
                doc=doc,
                image_mode=image_mode,
                image_placeholder=resolved_placeholder,
                **kwargs,
            )

        return create_ser_result(text=resolved_placeholder, span_source=item)
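The index derivation itself can be checked without a converted document. The sketch below assumes that `self_ref` values have the form `#/pictures/<n>`. The helper name `resolve_placeholder` and the sample references are hypothetical:

```python
def resolve_placeholder(self_ref: str, placeholder: str) -> str:
    """Substitute {index} with the last path segment of self_ref."""
    pic_idx = self_ref.rsplit("/", 1)[-1]
    return placeholder.replace("{index}", pic_idx)


# assumed reference format, for illustration only
for ref in ["#/pictures/0", "#/pictures/12"]:
    print(resolve_placeholder(ref, "<!-- image_{index} -->"))
# → <!-- image_0 -->
# → <!-- image_12 -->
```

Using `str.replace` rather than `str.format` keeps any other braces in the placeholder untouched and leaves placeholders without `{index}` unchanged.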
We can now use this serializer with a placeholder containing {index}.
Each picture will receive its own unique identifier in the output:
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=IndexedMarkdownPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.PLACEHOLDER,
        image_placeholder="<!-- image_{index} -->",
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text
print_in_console(ser_text[ser_text.find(start_cue) : ser_text.find(stop_cue)])