EPUB Document Conversion

This example demonstrates how to convert EPUB (Electronic Publication) files using Docling's EPUB backend.

EPUB is a widely used open standard format for e-books and digital publications. An EPUB file is a ZIP archive that packages XHTML content, images, and metadata in a defined structure.

What you'll learn

  • How to convert EPUB files to structured DoclingDocument format
  • How to extract and handle images from EPUB archives
  • How to access EPUB metadata (title, author, language, etc.)
  • How to export EPUB content to various formats (Markdown, JSON, etc.)
  • How the EPUB backend reads EPUB structure and which conversion features it provides

Setup

Install Docling:

python
%pip install -q docling

Download Sample EPUB File

For this example, we'll use a public domain EPUB file from Standard Ebooks, a volunteer-driven project that produces high-quality, carefully formatted public domain ebooks.

The book we'll use is "Poetry" by Sarah Louisa Forten Purvis, available at: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry

Standard Ebooks dedicates their ebook files to the public domain via the CC0 1.0 Universal Public Domain Dedication.

python
import urllib.request
from pathlib import Path

# Create directory for EPUB data
data_dir = Path("epub_data")
data_dir.mkdir(exist_ok=True)

# Download sample EPUB file from Standard Ebooks
# Note: We use the Docling test data mirror for reliable downloads in notebooks
# Original source: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
epub_file = data_dir / "sarah-louisa-forten-purvis_poetry.epub"
if not epub_file.exists():
    print("Downloading sample EPUB file...")
    print("Source: 'Poetry' by Sarah Louisa Forten Purvis from Standard Ebooks")
    epub_url = "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/epub/epub_purvis_poetry.epub"
    urllib.request.urlretrieve(epub_url, epub_file)
    print(f"Downloaded: {epub_file}")
else:
    print(f"Using existing file: {epub_file}")

# Report the file size in both cases
print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")

Basic EPUB Conversion

Let's start with a simple conversion using the default settings:

python
from docling.document_converter import DocumentConverter

# Create converter instance
converter = DocumentConverter()

# Convert the EPUB file
print(f"Converting EPUB document: {epub_file}")
result = converter.convert(epub_file)
doc = result.document

print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")

Inspect Document Structure

Let's examine the structure of the converted document:

python
from docling_core.types.doc import DocItemLabel

# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1

print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {label.value}: {count}")

View Sample Content

Let's look at some of the extracted content:

python
# Display the first five text items, truncating long ones to 150 characters
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label != DocItemLabel.TEXT:
        continue
    snippet = f"{item.text[:150]}..." if len(item.text) > 150 else item.text
    print(f"- {snippet}\n")
    text_count += 1
    if text_count == 5:
        break  # Stop once five text items have been shown

Export to Markdown (Basic)

Export the document to Markdown. With the default image mode, images are replaced by placeholders rather than included:

python
# Export to Markdown without images
markdown_content = doc.export_to_markdown()

# Display first 1500 characters
print("Markdown export (first 1500 characters):\n")
print(markdown_content[:1500])
print("\n...")

# Save the full export to a file with save_as_markdown
output_md = data_dir / "output_basic.md"
doc.save_as_markdown(output_md)
print(f"\nFull markdown saved to: {output_md}")

EPUB Conversion with Image Extraction

Now let's configure the converter to extract images from the EPUB archive:

python
from docling.datamodel.backend_options import EpubBackendOptions
from docling.document_converter import DocumentConverter, EpubFormatOption

# Configure EPUB options to extract images
epub_options = EpubBackendOptions(
    fetch_images=True,  # Extract images from EPUB archive
    enable_local_fetch=True,  # Allow reading local image files
    enable_remote_fetch=False,  # Disable fetching remote images
)

# Create converter with EPUB options
converter_with_images = DocumentConverter(
    format_options={"epub": EpubFormatOption(backend_options=epub_options)}
)

# Convert the EPUB with image extraction
print("Converting EPUB with image extraction...")
result_with_images = converter_with_images.convert(epub_file)
doc_with_images = result_with_images.document

print("\nConversion with images successful!")
print(f"Number of items: {len(list(doc_with_images.iterate_items()))}")

Export with Embedded Images

Export the document with images embedded as base64 data URIs:

python
# Export with embedded images (base64-encoded)
markdown_with_images = doc_with_images.export_to_markdown(image_mode="embedded")

# Display first 1500 characters
print("Markdown with embedded images (first 1500 characters):\n")
print(markdown_with_images[:1500])
print("\n...")

# Save to file using save_as_markdown
output_md_images = data_dir / "output_with_images.md"
doc_with_images.save_as_markdown(output_md_images, image_mode="embedded")
print(f"\nMarkdown with embedded images saved to: {output_md_images}")

Check for Images in Document

Let's check if the EPUB contains any images:

python
# Check for pictures in the document
from docling_core.types.doc import PictureItem

pictures = [
    item for item, _ in doc_with_images.iterate_items() if isinstance(item, PictureItem)
]

if pictures:
    print(f"Found {len(pictures)} image(s) in the EPUB:")
    for i, pic in enumerate(pictures[:5], 1):  # Show first 5
        print(f"  {i}. Image at position {pic.self_ref}")
        if hasattr(pic, "image") and pic.image:
            print(
                f"     Size: {pic.image.size if hasattr(pic.image, 'size') else 'unknown'}"
            )
else:
    print("No images found in this EPUB.")
    print("Note: This particular EPUB (poetry collection) may not contain images.")

Export to JSON

Export the complete document structure to JSON:

python
import json

# Export to JSON
output_json = data_dir / "output.json"
doc_with_images.save_as_json(output_json)

print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")

# Display the top-level keys of the JSON structure
with open(output_json, encoding="utf-8") as f:
    json_data = json.load(f)

print("\nJSON structure (top-level keys):")
for key in json_data:
    print(f"  - {key}")

Understanding EPUB Features

The EPUB backend provides several key features:

Structure Parsing

  • Parses EPUB structure: Reads container.xml to locate the OPF package file (commonly named content.opf), then reads that file to understand the book's organization
  • Preserves reading order: Processes content files in the order specified by the spine element
  • Handles internal links: Automatically fixes cross-file references (e.g., footnote links) when combining XHTML files
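
To make the container.xml, OPF, and spine relationship concrete, here is a minimal standard-library sketch. It is not Docling's implementation. The function names `find_opf_path` and `spine_hrefs` and the inline sample XML are hypothetical and exist only for illustration; the namespaces are the standard OCF and OPF ones.

```python
import posixpath
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_opf_path(container_xml: str) -> str:
    """Return the archive path of the OPF file named in container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


def spine_hrefs(opf_xml: str, opf_path: str) -> list[str]:
    """Return content-file paths in spine (reading) order."""
    root = ET.fromstring(opf_xml)
    # The manifest maps item ids to hrefs; the spine lists ids in reading order.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib.get("idref") in manifest
    ]


# Hand-written sample data (not taken from the sample book)
CONTAINER = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch2" href="text/chapter-2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="text/chapter-1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""

opf_path = find_opf_path(CONTAINER)
print(spine_hrefs(OPF, opf_path))
# ['epub/text/chapter-1.xhtml', 'epub/text/chapter-2.xhtml']
```

Note that the manifest lists chapter 2 first, yet the result follows the spine. This is why reading order has to come from the spine rather than from file names or manifest order.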

Metadata Extraction

  • Retrieves title, author, language, and other Dublin Core metadata from the OPF file
  • Metadata is accessible through the DoclingDocument structure
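
This example does not show how the metadata surfaces on the DoclingDocument. As an independent cross-check, the sketch below reads the Dublin Core fields straight from OPF text using only the standard library. The helper name `read_dublin_core` is hypothetical, and the sample OPF is hand-written (the `en` language value is a placeholder, not taken from the sample book).

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_dublin_core(opf_xml: str) -> dict[str, list[str]]:
    """Collect Dublin Core elements (title, creator, language, ...) from OPF text."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    fields: dict[str, list[str]] = {}
    if metadata is None:
        return fields
    dc_prefix = "{" + NS["dc"] + "}"
    for child in metadata:
        text = (child.text or "").strip()
        # Keep only non-empty elements in the Dublin Core namespace.
        if child.tag.startswith(dc_prefix) and text:
            fields.setdefault(child.tag[len(dc_prefix):], []).append(text)
    return fields


# Hand-written sample OPF for illustration
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Poetry</dc:title>
    <dc:creator>Sarah Louisa Forten Purvis</dc:creator>
    <dc:language>en</dc:language>
    <meta name="cover" content="cover-image"/>
  </metadata>
</package>"""

for name, values in read_dublin_core(SAMPLE_OPF).items():
    print(f"{name}: {', '.join(values)}")
```

Values are stored as lists because a field such as `creator` may legitimately appear more than once.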

Image Handling

  • Can extract and embed images from the EPUB archive when fetch_images=True
  • Supports multiple export modes:
    • image_mode='placeholder' (default): Replaces images with <!-- image --> comments
    • image_mode='embedded': Embeds images as base64 data URIs in the markdown
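
For readers unfamiliar with base64 data URIs, the sketch below shows the general idea behind an embedded image in Markdown. It uses only the standard library and is not Docling's export code; `to_data_uri` and `markdown_image` are hypothetical helpers, and the exact Markdown that Docling emits may differ.

```python
import base64


def to_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def markdown_image(data: bytes, alt_text: str = "Image", mime_type: str = "image/png") -> str:
    """Build a Markdown image tag whose target is the data itself, not a file path."""
    return f"![{alt_text}]({to_data_uri(data, mime_type)})"


print(to_data_uri(b"hello", "text/plain"))
# data:text/plain;base64,aGVsbG8=
```

Because the bytes travel inside the Markdown text, embedded output is self-contained but roughly a third larger than the original image data.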

HTML Backend Integration

  • Leverages the existing HTML backend for robust XHTML content processing
  • Ensures consistent handling of HTML elements across different document types

Batch Conversion Example

Using Python API

Here's how you would convert multiple EPUB files in a directory using Python:

python
from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert all EPUB files in a directory
epub_dir = Path("path/to/epub/directory")
for epub_file in epub_dir.glob("*.epub"):
    print(f"Converting {epub_file.name}...")
    result = converter.convert(str(epub_file))

    # Save to markdown with embedded images
    output_path = epub_file.with_suffix(".md")
    result.document.save_as_markdown(output_path, image_mode="embedded")
    print(f"Saved to {output_path}")
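
In a large batch, one malformed file should not stop the remaining conversions. The sketch below shows one possible pattern. `convert_batch` and `convert_one` are hypothetical names: `convert_one` stands for any callable that converts a single path, for example a small function wrapping the `converter.convert(...)` and `save_as_markdown(...)` calls from the loop above.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_batch(
    paths: Iterable[Path],
    convert_one: Callable[[Path], None],
) -> tuple[list[Path], dict[Path, str]]:
    """Run convert_one on every path and collect failures instead of raising."""
    succeeded: list[Path] = []
    failed: dict[Path, str] = {}
    for path in paths:
        try:
            convert_one(path)
        except Exception as exc:  # deliberately broad: record the error and continue
            failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            succeeded.append(path)
    return succeeded, failed


# Stand-in converter for demonstration only
def fake_convert(path: Path) -> None:
    if path.suffix != ".epub":
        raise ValueError("not an EPUB file")


ok, errors = convert_batch([Path("a.epub"), Path("b.txt")], fake_convert)
print(f"Converted: {len(ok)}, failed: {len(errors)}")
for path, message in errors.items():
    print(f"  {path}: {message}")
```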

Using CLI

Alternatively, you can use the Docling CLI for batch conversion, which is even simpler:

bash
docling --to md --from epub path/to/epub/directory

Known Limitations

Internal anchor links (such as footnote references) are partially supported:

  • Links are converted: References like [1](#note-1) will appear in the output
  • Anchor targets are not preserved: The corresponding anchor IDs (e.g., id="note-1") are lost during HTML-to-DoclingDocument conversion
  • Impact: Clicking on footnote links in the exported Markdown won't jump to the footnote location

This is a limitation of the underlying HTML backend's conversion process, which focuses on extracting content structure rather than preserving HTML anchor IDs.

Example:

markdown
<!-- In the text -->
...five versts [1](#note-1) from Durnovka...

<!-- At the end (footnote section) -->
1. A verst is two-thirds of a mile. [↩︎](#noteref-1)

The links [1](#note-1) and [↩︎](#noteref-1) will be present, but the anchor targets they reference won't be accessible in the Markdown output.
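
To find out which links in an exported file are affected, a rough check like the following can list fragment links that have no matching target. This is a regex-based sketch, and the helper name `dangling_anchor_links` is hypothetical. It assumes targets would appear as HTML `id="..."` or `name="..."` attributes and ignores heading anchors that some Markdown renderers generate automatically.

```python
import re

# Markdown links of the form [text](#fragment)
LINK_RE = re.compile(r"\]\(#([^)\s]+)\)")
# HTML-style anchor targets such as <a id="note-1"></a>
TARGET_RE = re.compile(r"""\b(?:id|name)\s*=\s*["']([^"']+)["']""")


def dangling_anchor_links(markdown: str) -> list[str]:
    """Return fragment names that are linked to but never defined, in first-seen order."""
    targets = set(TARGET_RE.findall(markdown))
    dangling: list[str] = []
    for fragment in LINK_RE.findall(markdown):
        if fragment not in targets and fragment not in dangling:
            dangling.append(fragment)
    return dangling


sample = (
    "...five versts [1](#note-1) from Durnovka...\n\n"
    "1. A verst is two-thirds of a mile. [↩︎](#noteref-1)\n"
)
print(dangling_anchor_links(sample))
# ['note-1', 'noteref-1']
```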

Technical Details

EPUB files are ZIP archives containing:

  • XHTML content files
  • Metadata (OPF file)
  • Navigation structure
  • Images and other resources
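
Because an EPUB is a ZIP archive, a quick sanity check is possible before conversion. The EPUB container format requires a `mimetype` entry containing `application/epub+zip` and a `META-INF/container.xml` entry. The sketch below tests for both using only the standard library; `looks_like_epub` is a hypothetical helper, not part of Docling, and it does not validate the rest of the archive.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(source) -> bool:
    """Return True if source (a path or binary file object) has the basic EPUB layout."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        if "mimetype" not in names or "META-INF/container.xml" not in names:
            return False
        # Tolerate stray whitespace around the declared media type.
        return archive.read("mimetype").strip() == EPUB_MIMETYPE


# Build a tiny in-memory archive to demonstrate the check
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")

print(looks_like_epub(buffer))
# True
```

With the sample file from this example, the same call would be `looks_like_epub(epub_file)`.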

The backend processing workflow:

  1. Extracts the ZIP archive
  2. Parses the container.xml to locate the OPF file
  3. Reads the OPF file to get metadata and reading order
  4. Combines all XHTML content files in spine order
  5. Fixes internal cross-file links
  6. Delegates to the HTML backend for final processing
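
Step 5 is the least obvious part of the workflow. Once several XHTML files are merged into one document, a link such as `endnotes.xhtml#note-1` has to become `#note-1`. The sketch below illustrates that idea with a regular expression. It is an assumption about the general technique, not Docling's actual code, and `localize_links` is a hypothetical helper.

```python
import posixpath
import re

# Matches href="<file>#<fragment>"; the file part is empty for same-file links.
HREF_RE = re.compile(r'href="([^"#]*)#([^"]*)"')


def localize_links(xhtml: str, merged_files: set[str]) -> str:
    """Rewrite links into merged files so that only the fragment remains."""

    def rewrite(match: re.Match) -> str:
        target, fragment = match.group(1), match.group(2)
        if "://" in target:
            return match.group(0)  # external URL: leave untouched
        if target == "" or posixpath.basename(target) in merged_files:
            return f'href="#{fragment}"'
        return match.group(0)  # points to a file that was not merged

    return HREF_RE.sub(rewrite, xhtml)


chapter = '<p>five versts <a href="endnotes.xhtml#note-1">1</a> from Durnovka</p>'
print(localize_links(chapter, {"endnotes.xhtml"}))
# <p>five versts <a href="#note-1">1</a> from Durnovka</p>
```

This simplified version matches files by base name only and does not handle two merged files that define the same ID.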

Supported EPUB Versions

The backend supports EPUB 2 and EPUB 3 formats, which are the most common versions used for e-books.
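
To check which version a given book uses, you can read the `version` attribute on the root `package` element of its OPF file (`2.0` for EPUB 2, `3.0` for EPUB 3). A minimal standard-library sketch, with `epub_major_version` as a hypothetical helper name:

```python
import xml.etree.ElementTree as ET
from typing import Optional


def epub_major_version(opf_xml: str) -> Optional[int]:
    """Return the major EPUB version declared in OPF text, or None if absent or malformed."""
    root = ET.fromstring(opf_xml)
    major = root.attrib.get("version", "").split(".")[0]
    return int(major) if major.isdigit() else None


print(epub_major_version('<package xmlns="http://www.idpf.org/2007/opf" version="3.0"/>'))
# 3
```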

Summary

In this example, we demonstrated:

✅ How to convert EPUB files to DoclingDocument format
✅ How to extract and handle images from EPUB archives
✅ How to export EPUB content to Markdown and JSON formats
✅ How to use different image export modes (placeholder and embedded)
✅ How the EPUB backend reads EPUB structure and which conversion features it provides

Key Points

  • Simple conversion: Basic EPUB conversion works out of the box with DocumentConverter()
  • Image extraction: Enable with fetch_images=True in EpubBackendOptions
  • Flexible export: Choose between embedded images or placeholders
  • Metadata preservation: EPUB metadata is extracted and accessible in the document
  • Reading order: Content is processed in the correct reading order as specified in the EPUB

Next Steps

  • Try converting your own EPUB files
  • Experiment with different image export modes
  • Combine EPUB conversion with other Docling features like chunking for RAG applications
  • Explore the DoclingDocument API for more advanced document manipulation