EPUB Document Conversion

This example demonstrates how to convert EPUB (Electronic Publication) files using Docling's EPUB backend.

EPUB is a widely used open standard for e-books and digital publications. It is based on XHTML and packages text, images, and metadata in a structured ZIP archive.

What you'll learn

How to convert EPUB files to structured DoclingDocument format
How to extract and handle images from EPUB archives
How to access EPUB metadata (title, author, language, etc.)
How to export EPUB content to various formats (Markdown, JSON, etc.)
How EPUB files are structured and which conversion features the backend provides

Setup

Install Docling:

python

%pip install -q docling

Download Sample EPUB File

For this example, we'll use a public domain EPUB file from Standard Ebooks, a volunteer-driven project that produces high-quality, carefully formatted public domain ebooks.

The book we'll use is "Poetry" by Sarah Louisa Forten Purvis, available at: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry

Standard Ebooks dedicates its ebook files to the public domain via the CC0 1.0 Universal Public Domain Dedication.

python

import urllib.request
from pathlib import Path

# Create directory for EPUB data
data_dir = Path("epub_data")
data_dir.mkdir(exist_ok=True)

# Download sample EPUB file from Standard Ebooks
# Note: We use the Docling test data mirror for reliable downloads in notebooks
# Original source: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
epub_file = data_dir / "sarah-louisa-forten-purvis_poetry.epub"
if not epub_file.exists():
    print("Downloading sample EPUB file...")
    print("Source: 'Poetry' by Sarah Louisa Forten Purvis from Standard Ebooks")
    # Using Docling test data for reliable notebook execution
    epub_url = "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/epub/epub_purvis_poetry.epub"
    urllib.request.urlretrieve(epub_url, epub_file)
    print(f"Downloaded: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
else:
    print(f"Using existing file: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")

Basic EPUB Conversion

Let's start with a simple conversion using the default settings:

python

from docling.document_converter import DocumentConverter

# Create converter instance
converter = DocumentConverter()

# Convert the EPUB file
print(f"Converting EPUB document: {epub_file}")
result = converter.convert(epub_file)
doc = result.document

print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")

Inspect Document Structure

Let's examine the structure of the converted document:

python

from docling_core.types.doc import DocItemLabel

# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1

print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {label.value}: {count}")

View Sample Content

Let's look at some of the extracted content:

python

# Display the first five text items
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label != DocItemLabel.TEXT:
        continue
    # Truncate long passages to a 150-character preview
    preview = f"{item.text[:150]}..." if len(item.text) > 150 else item.text
    print(f"- {preview}\n")
    text_count += 1
    if text_count == 5:
        break  # Stop iterating once five items have been shown

Export to Markdown (Basic)

Export the document to Markdown format without images:

python

# Export to Markdown without images
markdown_content = doc.export_to_markdown()

# Display first 1500 characters
print("Markdown export (first 1500 characters):\n")
print(markdown_content[:1500])
print("\n...")

# Save to file using save_as_markdown
output_md = data_dir / "output_basic.md"
doc.save_as_markdown(output_md)
print(f"\nFull markdown saved to: {output_md}")

EPUB Conversion with Image Extraction

Now let's configure the converter to extract images from the EPUB archive:

python

from docling.datamodel.backend_options import EpubBackendOptions
from docling.document_converter import DocumentConverter, EpubFormatOption

# Configure EPUB options to extract images
epub_options = EpubBackendOptions(
    fetch_images=True,  # Extract images from EPUB archive
    enable_local_fetch=True,  # Allow reading local image files
    enable_remote_fetch=False,  # Disable fetching remote images
)

# Create converter with EPUB options
converter_with_images = DocumentConverter(
    format_options={"epub": EpubFormatOption(backend_options=epub_options)}
)

# Convert the EPUB with image extraction
print("Converting EPUB with image extraction...")
result_with_images = converter_with_images.convert(epub_file)
doc_with_images = result_with_images.document

print("\nConversion with images successful!")
print(f"Number of items: {len(list(doc_with_images.iterate_items()))}")

Export with Embedded Images

Export the document with images embedded as base64 data URIs:

python

# Export with embedded images (base64-encoded)
markdown_with_images = doc_with_images.export_to_markdown(image_mode="embedded")

# Display first 1500 characters
print("Markdown with embedded images (first 1500 characters):\n")
print(markdown_with_images[:1500])
print("\n...")

# Save to file using save_as_markdown
output_md_images = data_dir / "output_with_images.md"
doc_with_images.save_as_markdown(output_md_images, image_mode="embedded")
print(f"\nMarkdown with embedded images saved to: {output_md_images}")

Check for Images in Document

Let's check if the EPUB contains any images:

python

# Check for pictures in the document
from docling_core.types.doc import PictureItem

pictures = [
    item for item, _ in doc_with_images.iterate_items() if isinstance(item, PictureItem)
]

if pictures:
    print(f"Found {len(pictures)} image(s) in the EPUB:")
    for i, pic in enumerate(pictures[:5], 1):  # Show first 5
        print(f"  {i}. Image at position {pic.self_ref}")
        if hasattr(pic, "image") and pic.image:
            print(
                f"     Size: {pic.image.size if hasattr(pic.image, 'size') else 'unknown'}"
            )
else:
    print("No images found in this EPUB.")
    print("Note: This particular EPUB (poetry collection) may not contain images.")

Export to JSON

Export the complete document structure to JSON:

python

import json

# Export to JSON
output_json = data_dir / "output.json"
doc_with_images.save_as_json(output_json)

print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")

# Display a sample of the JSON structure
with open(output_json, encoding="utf-8") as f:
    json_data = json.load(f)
    print("\nJSON structure (top-level keys):")
    for key in json_data.keys():
        print(f"  - {key}")

Understanding EPUB Features

The EPUB backend provides several key features:

Structure Parsing

Parses EPUB structure: Reads the container.xml and content.opf files to understand the book's organization
Preserves reading order: Processes content files in the order specified by the spine element
Handles internal links: Automatically fixes cross-file references (e.g., footnote links) when combining XHTML files
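
To make the first two points concrete, here is a minimal standard-library sketch of how a reader can locate the OPF file through container.xml and resolve the spine into an ordered list of content files. It illustrates the EPUB container layout and is not Docling's actual implementation; the in-memory archive, the file names (OEBPS/content.opf, ch1.xhtml, ch2.xhtml), and the helper spine_order are made up for this example.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical sample files; a real EPUB contains many more entries.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF_XML = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c2" href="ch2.xhtml" media-type="application/xhtml+xml"/>
    <item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="c1"/>
    <itemref idref="c2"/>
  </spine>
</package>"""


def spine_order(epub: zipfile.ZipFile) -> list[str]:
    """Return the archive paths of the content files in spine (reading) order."""
    # container.xml always lives at this fixed location and points to the OPF file
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    opf_path = rootfile.attrib["full-path"]

    # The manifest maps item ids to hrefs; the spine lists ids in reading order
    package = ET.fromstring(epub.read(opf_path))
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    opf_dir = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    return [
        posixpath.join(opf_dir, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]


# Build a tiny in-memory archive so the sketch runs without any files on disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("META-INF/container.xml", CONTAINER_XML)
    zf.writestr("OEBPS/content.opf", OPF_XML)

with zipfile.ZipFile(buffer) as zf:
    reading_order = spine_order(zf)

print(reading_order)
```

Note that the manifest lists ch2.xhtml first, while the spine puts ch1.xhtml first. The reading order comes from the spine, not from the manifest or the order of entries in the ZIP archive.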

Metadata Extraction

Retrieves title, author, language, and other Dublin Core metadata from the OPF file
Metadata is accessible through the DoclingDocument structure
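
For orientation, the sketch below shows what Dublin Core metadata looks like inside an OPF file and how it can be read with the standard library. It does not show how Docling exposes these fields on the DoclingDocument; the helper read_dublin_core and the sample OPF snippet are hypothetical, and the language value in particular is an assumed placeholder.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample; the language code is an assumption for illustration.
OPF_XML = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Poetry</dc:title>
    <dc:creator>Sarah Louisa Forten Purvis</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>"""


def read_dublin_core(opf_xml: str) -> dict[str, list[str]]:
    """Collect Dublin Core elements from an OPF package document.

    Values are lists because elements such as dc:creator may repeat.
    """
    package = ET.fromstring(opf_xml)
    metadata = package.find("opf:metadata", NS)
    fields: dict[str, list[str]] = {}
    if metadata is None:
        return fields
    dc_prefix = "{" + NS["dc"] + "}"
    for element in metadata:
        # Keep only elements in the Dublin Core namespace that carry text
        if element.tag.startswith(dc_prefix) and element.text:
            name = element.tag[len(dc_prefix):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


meta = read_dublin_core(OPF_XML)
for name, values in meta.items():
    print(f"{name}: {', '.join(values)}")
```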

Image Handling

Can extract and embed images from the EPUB archive when fetch_images=True
Supports multiple export modes:
- image_mode='placeholder' (default): Replaces images with <!-- image --> comments
- image_mode='embedded': Embeds images as base64 data URIs in the markdown
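
To see what the two modes mean for the resulting Markdown, here is a small standard-library sketch that renders one image either as a placeholder comment or as a base64 data URI. The function image_markdown, the alt text, and the eight-byte sample payload are assumptions for illustration; the exact markup Docling emits may differ.

```python
import base64

PLACEHOLDER = "<!-- image -->"


def image_markdown(data: bytes, mimetype: str, embedded: bool) -> str:
    """Render image bytes as a Markdown snippet (hypothetical helper)."""
    if not embedded:
        # Placeholder mode: the image bytes are dropped from the output
        return PLACEHOLDER
    # Embedded mode: the bytes travel inside the Markdown as a data URI
    encoded = base64.b64encode(data).decode("ascii")
    return f"![Image](data:{mimetype};base64,{encoded})"


# The 8-byte PNG signature stands in for a real image file
fake_png = b"\x89PNG\r\n\x1a\n"

placeholder_md = image_markdown(fake_png, "image/png", embedded=False)
embedded_md = image_markdown(fake_png, "image/png", embedded=True)

print(placeholder_md)
print(embedded_md)
```

Base64 encoding grows the payload by roughly a third, so embedded mode produces self-contained but noticeably larger Markdown files when a book contains many illustrations.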

HTML Backend Integration

Leverages the existing HTML backend for robust XHTML content processing
Ensures consistent handling of HTML elements across different document types

Batch Conversion Example

Using Python API

Here's how you would convert multiple EPUB files in a directory using Python:

python

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert all EPUB files in a directory
epub_dir = Path("path/to/epub/directory")
for epub_file in epub_dir.glob("*.epub"):
    print(f"Converting {epub_file.name}...")
    result = converter.convert(str(epub_file))

    # Save as Markdown next to the source file; images are only embedded
    # if the converter was configured with fetch_images=True
    output_path = epub_file.with_suffix(".md")
    result.document.save_as_markdown(output_path, image_mode="embedded")
    print(f"Saved to {output_path}")

Using CLI

Alternatively, you can use the Docling CLI for batch conversion, which is even simpler:

bash

docling --to md --from epub path/to/epub/directory

Known Limitations

Internal Anchor Links

Internal anchor links (such as footnote references) are partially supported:

Links are converted: References like [1](#note-1) will appear in the output
Anchor targets are not preserved: The corresponding anchor IDs (e.g., id="note-1") are lost during HTML-to-DoclingDocument conversion
Impact: Clicking on footnote links in the exported Markdown won't jump to the footnote location

This is a limitation of the underlying HTML backend's conversion process, which focuses on extracting content structure rather than preserving HTML anchor IDs.

Example:

markdown

<!-- In the text -->
...five versts [1](#note-1) from Durnovka...

<!-- At the end (footnote section) -->
1. A verst is two-thirds of a mile. [↩︎](#noteref-1)

The links [1](#note-1) and [↩︎](#noteref-1) will be present, but the anchor targets they reference won't be accessible in the Markdown output.
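
If you want to find out which links in an exported file are affected, a simple check is to collect every fragment link and compare it against the anchor IDs present in the text. The following is a rough sketch using regular expressions; the helper dangling_anchor_links is hypothetical and only recognizes inline Markdown links and literal id="..." attributes.

```python
import re

# Matches the target of an inline fragment link such as [1](#note-1)
LINK_RE = re.compile(r"\]\(#([^)\s]+)\)")
# Matches explicit anchor targets such as <a id="note-1"></a>
ANCHOR_RE = re.compile(r"""\bid=["']([^"']+)["']""")


def dangling_anchor_links(markdown: str) -> list[str]:
    """Return fragment identifiers that are linked to but never defined."""
    targets = set(ANCHOR_RE.findall(markdown))
    missing: list[str] = []
    for fragment in LINK_RE.findall(markdown):
        if fragment not in targets and fragment not in missing:
            missing.append(fragment)
    return missing


sample = (
    "...five versts [1](#note-1) from Durnovka...\n\n"
    "1. A verst is two-thirds of a mile. [↩︎](#noteref-1)\n"
)

missing_targets = dangling_anchor_links(sample)
print(missing_targets)
```

For the sample above, both note-1 and noteref-1 are reported, because the text contains links to them but no matching anchor.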

Technical Details

EPUB files are ZIP archives containing:

XHTML content files
Metadata (OPF file)
Navigation structure
Images and other resources

The backend processing workflow:

Extracts the ZIP archive
Parses the container.xml to locate the OPF file
Reads the OPF file to get metadata and reading order
Combines all XHTML content files in spine order
Fixes internal cross-file links
Delegates to the HTML backend for final processing
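
The link-fixing step deserves a short illustration. Once several XHTML files are merged into one document, a link such as endnotes.xhtml#note-1 no longer points to a separate file, so it has to become a same-document fragment link. The sketch below shows one possible way to do this with a regular expression; it is an assumption about the general technique rather than the backend's actual code, and localize_links and the file names are invented for the example.

```python
import posixpath
import re

# Matches href attributes that carry both a file part and a fragment part
HREF_RE = re.compile(r'href="([^"#]+)#([^"]+)"')


def localize_links(xhtml: str, spine_files: set[str]) -> str:
    """Rewrite links into other spine files as same-document fragment links."""

    def replace(match: re.Match) -> str:
        target_file = posixpath.basename(match.group(1))
        if target_file in spine_files:
            # The target file is part of the combined document: keep the fragment only
            return f'href="#{match.group(2)}"'
        # External or unknown targets are left untouched
        return match.group(0)

    return HREF_RE.sub(replace, xhtml)


chapter = (
    '<p>five versts <a href="endnotes.xhtml#note-1">1</a> from Durnovka</p>'
    '<p><a href="https://example.com/page#top">external</a></p>'
)

fixed = localize_links(chapter, {"chapter-1.xhtml", "endnotes.xhtml"})
print(fixed)
```

A production implementation would more likely rewrite attributes through an HTML parser and guard against duplicate IDs across files; the regular expression keeps the sketch short.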

Supported EPUB Versions

The backend supports EPUB 2 and EPUB 3 formats, which are the most common versions used for e-books.

Summary

In this example, we demonstrated:

✅ How to convert EPUB files to DoclingDocument format
✅ How to extract and handle images from EPUB archives
✅ How to export EPUB content to Markdown and JSON formats
✅ The image export modes used in this example (placeholder and embedded)
✅ Understanding EPUB structure and conversion features

Key Points

Simple conversion: Basic EPUB conversion works out of the box with DocumentConverter()
Image extraction: Enable with fetch_images=True in EpubBackendOptions
Flexible export: Choose between embedded images or placeholders
Metadata preservation: EPUB metadata is extracted and accessible in the document
Reading order: Content is processed in the correct reading order as specified in the EPUB

Next Steps

Try converting your own EPUB files
Experiment with different image export modes
Combine EPUB conversion with other Docling features like chunking for RAG applications
Explore the DoclingDocument API for more advanced document manipulation