docs/examples/epub_conversion.ipynb
This example demonstrates how to convert EPUB (Electronic Publication) files using Docling's EPUB backend.
EPUB is a widely-used open standard format for e-books and digital publications. It's based on XHTML and can contain text, images, and metadata in a structured ZIP archive.
Install Docling:
%pip install -q docling
For this example, we'll use a public domain EPUB file from Standard Ebooks, a volunteer-driven project that produces high-quality, carefully formatted public domain ebooks.
The book we'll use is "Poetry" by Sarah Louisa Forten Purvis, available at: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
Standard Ebooks dedicates their ebook files to the public domain via the CC0 1.0 Universal Public Domain Dedication.
import urllib.request
from pathlib import Path
# Create directory for EPUB data
data_dir = Path("epub_data")
data_dir.mkdir(exist_ok=True)
# Download sample EPUB file from Standard Ebooks
# Note: We use the Docling test data mirror for reliable downloads in notebooks
# Original source: https://standardebooks.org/ebooks/sarah-louisa-forten-purvis/poetry
epub_file = data_dir / "sarah-louisa-forten-purvis_poetry.epub"
if not epub_file.exists():
    print("Downloading sample EPUB file...")
    print("Source: 'Poetry' by Sarah Louisa Forten Purvis from Standard Ebooks")
    # Using Docling test data for reliable notebook execution
    epub_url = "https://raw.githubusercontent.com/docling-project/docling/main/tests/data/epub/epub_purvis_poetry.epub"
    urllib.request.urlretrieve(epub_url, epub_file)
    print(f"Downloaded: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
else:
    print(f"Using existing file: {epub_file}")
    print(f"File size: {epub_file.stat().st_size / 1024:.1f} KB")
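Before converting, it can be useful to confirm that the download really is an EPUB container. The following is a minimal sketch using only the standard library. The helper name `looks_like_epub` is hypothetical and not part of Docling. It relies on the EPUB container convention that the archive holds a `mimetype` entry containing `application/epub+zip`, and it runs against an in-memory stand-in archive rather than the sample book.

```python
import io
import zipfile


def looks_like_epub(source) -> bool:
    """Return True if `source` (a path or file-like object) is a ZIP archive
    whose `mimetype` entry declares an EPUB.

    Hypothetical helper for illustration; not a Docling API.
    """
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        if "mimetype" not in archive.namelist():
            return False
        declared = archive.read("mimetype").decode("ascii", errors="replace").strip()
    return declared == "application/epub+zip"


# Stand-in archive built in memory (not the sample book)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
buffer.seek(0)
print(looks_like_epub(buffer))
```

With the notebook's variables, the same check could be run as `looks_like_epub(epub_file)`.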
Let's start with a simple conversion using the default settings:
from docling.document_converter import DocumentConverter
# Create converter instance
converter = DocumentConverter()
# Convert the EPUB file
print(f"Converting EPUB document: {epub_file}")
result = converter.convert(epub_file)
doc = result.document
print("\nConversion successful!")
print(f"Document name: {doc.name}")
print(f"Number of items: {len(list(doc.iterate_items()))}")
Let's examine the structure of the converted document:
from docling_core.types.doc import DocItemLabel
# Count items by type
item_counts = {}
for item, _ in doc.iterate_items():
    label = item.label
    item_counts[label] = item_counts.get(label, 0) + 1
print("Document structure:")
for label, count in sorted(item_counts.items(), key=lambda x: x[1], reverse=True):
    print(f" {label.value}: {count}")
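The same tally can be written more compactly with `collections.Counter`. This is a sketch on stand-in data: the label strings below are illustrative and were not taken from the converted book.

```python
from collections import Counter

# Stand-in label values for illustration only; with a real document these
# would come from `item.label.value` for each item in doc.iterate_items().
sample_labels = ["text", "section_header", "text", "list_item", "text", "section_header"]

label_counts = Counter(sample_labels)

# most_common() yields (label, count) pairs sorted by descending count
for label, count in label_counts.most_common():
    print(f" {label}: {count}")
```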
Let's look at some of the extracted content:
# Display first few text items
print("Sample text content:\n")
text_count = 0
for item, _ in doc.iterate_items():
    if item.label == DocItemLabel.TEXT and text_count < 5:
        print(f"- {item.text[:150]}..." if len(item.text) > 150 else f"- {item.text}")
        print()
        text_count += 1
Export the document to Markdown format without images:
# Export to Markdown without images
markdown_content = doc.export_to_markdown()
# Display first 1500 characters
print("Markdown export (first 1500 characters):\n")
print(markdown_content[:1500])
print("\n...")
# Save the full document to a file using save_as_markdown
output_md = data_dir / "output_basic.md"
doc.save_as_markdown(output_md)
print(f"\nFull markdown saved to: {output_md}")
Now let's configure the converter to extract images from the EPUB archive:
from docling.datamodel.backend_options import EpubBackendOptions
from docling.document_converter import DocumentConverter, EpubFormatOption
# Configure EPUB options to extract images
epub_options = EpubBackendOptions(
    fetch_images=True,  # Extract images from EPUB archive
    enable_local_fetch=True,  # Allow reading local image files
    enable_remote_fetch=False,  # Disable fetching remote images
)
# Create converter with EPUB options
converter_with_images = DocumentConverter(
    format_options={"epub": EpubFormatOption(backend_options=epub_options)}
)
# Convert the EPUB with image extraction
print("Converting EPUB with image extraction...")
result_with_images = converter_with_images.convert(epub_file)
doc_with_images = result_with_images.document
print("\nConversion with images successful!")
print(f"Number of items: {len(list(doc_with_images.iterate_items()))}")
Export the document with images embedded as base64 data URIs:
# Export with embedded images (base64-encoded)
markdown_with_images = doc_with_images.export_to_markdown(image_mode="embedded")
# Display first 1500 characters
print("Markdown with embedded images (first 1500 characters):\n")
print(markdown_with_images[:1500])
print("\n...")
# Save to file using save_as_markdown
output_md_images = data_dir / "output_with_images.md"
doc_with_images.save_as_markdown(output_md_images, image_mode="embedded")
print(f"\nMarkdown with embedded images saved to: {output_md_images}")
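To make "base64 data URI" concrete, here is a small standard-library sketch of how raw image bytes become an inline Markdown image. The helper `to_data_uri`, the alt text, and the stand-in payload are assumptions for illustration. This is not Docling's internal implementation.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI (hypothetical helper for illustration)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in payload: the 8-byte PNG signature, not a complete image
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)

# An inline Markdown image that carries its own data instead of a file path
markdown_image = f"![Image]({uri})"
print(markdown_image)
```

Because the image data travels inside the Markdown text, embedded exports are self-contained but can grow large for image-heavy books.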
Let's check if the EPUB contains any images:
# Check for pictures in the document
from docling_core.types.doc import PictureItem
pictures = [
    item for item, _ in doc_with_images.iterate_items() if isinstance(item, PictureItem)
]
if pictures:
    print(f"Found {len(pictures)} image(s) in the EPUB:")
    for i, pic in enumerate(pictures[:5], 1):  # Show first 5
        print(f" {i}. Image at position {pic.self_ref}")
        if hasattr(pic, "image") and pic.image:
            print(
                f" Size: {pic.image.size if hasattr(pic.image, 'size') else 'unknown'}"
            )
else:
    print("No images found in this EPUB.")
    print("Note: This particular EPUB (poetry collection) may not contain images.")
Export the complete document structure to JSON:
import json
# Export to JSON
output_json = data_dir / "output.json"
doc_with_images.save_as_json(output_json)
print(f"Document exported to JSON: {output_json}")
print(f"File size: {output_json.stat().st_size / 1024:.2f} KB")
# Display a sample of the JSON structure
with open(output_json) as f:
    json_data = json.load(f)
print("\nJSON structure (top-level keys):")
for key in json_data.keys():
    print(f" - {key}")
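Listing the keys alone says little about what each one holds. A possible next step is to report the type and size of each top-level value. The sketch below uses a hypothetical stand-in dictionary, not the actual export schema, and the helper name `summarize_top_level` is an assumption.

```python
import json


def summarize_top_level(data: dict) -> dict:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


# Hypothetical stand-in data; a real export will have different keys
sample = json.loads(
    '{"name": "example", "texts": [{"text": "a"}, {"text": "b"}], "pictures": []}'
)
for key, description in summarize_top_level(sample).items():
    print(f" - {key}: {description}")
```

In the notebook, passing `json_data` instead of `sample` would summarize the exported document.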
The EPUB backend provides several key features:
- Reads the container.xml and content.opf files to understand the book's organization
- Extracts images from the archive when fetch_images=True
- image_mode='placeholder' (default): Replaces images with <!-- image --> comments
- image_mode='embedded': Embeds images as base64 data URIs in the Markdown

Here's how you would convert multiple EPUB files in a directory using Python:
from pathlib import Path
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
# Convert all EPUB files in a directory
epub_dir = Path("path/to/epub/directory")
for epub_file in epub_dir.glob("*.epub"):
    print(f"Converting {epub_file.name}...")
    result = converter.convert(str(epub_file))
    # Save to markdown with embedded images
    output_path = epub_file.with_suffix(".md")
    result.document.save_as_markdown(output_path, image_mode="embedded")
    print(f"Saved to {output_path}")
Alternatively, you can use the Docling CLI for batch conversion, which is even simpler:
docling --to md --from epub path/to/epub/directory
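For larger batches in Python, it may help to decide up front which books still need converting. The sketch below only plans output paths and does not call Docling. The helper `plan_outputs`, the separate output directory, and the placeholder files are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def plan_outputs(epub_dir: Path, output_dir: Path) -> dict[Path, Path]:
    """Map each .epub in `epub_dir` to a .md path in `output_dir`,
    skipping books whose Markdown output already exists."""
    plan = {}
    for epub_path in sorted(epub_dir.glob("*.epub")):
        target = output_dir / f"{epub_path.stem}.md"
        if not target.exists():
            plan[epub_path] = target
    return plan


# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    books = Path(tmp) / "books"
    out = Path(tmp) / "markdown"
    books.mkdir()
    out.mkdir()
    (books / "a.epub").touch()
    (books / "b.epub").touch()
    (out / "a.md").touch()  # pretend a.epub was converted earlier
    pending = plan_outputs(books, out)
    print([path.name for path in pending])
```

Each pair in the resulting plan could then be passed to the conversion loop, so that an interrupted batch can resume without redoing finished books.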
Internal anchor links (such as footnote references) are partially supported:
- Link syntax is preserved: links such as [1](#note-1) will appear in the output
- Anchor targets (for example, id="note-1") are lost during HTML-to-DoclingDocument conversion

This is a limitation of the underlying HTML backend's conversion process, which focuses on extracting content structure rather than preserving HTML anchor IDs.
Example:
<!-- In the text -->
...five versts [1](#note-1) from Durnovka...
<!-- At the end (footnote section) -->
1. A verst is two-thirds of a mile. [↩︎](#noteref-1)
The links [1](#note-1) and [↩︎](#noteref-1) will be present, but the anchor targets they reference won't be accessible in the Markdown output.
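If the dangling links are a problem downstream, they can be located in the exported Markdown with a regular expression. This is a minimal sketch: the pattern covers only simple `[text](#fragment)` links, and the helper name `find_internal_links` is hypothetical.

```python
import re

# Matches Markdown links whose target is an in-document fragment, e.g. [1](#note-1)
INTERNAL_LINK = re.compile(r"\[([^\]]*)\]\(#([^)\s]+)\)")


def find_internal_links(markdown: str) -> list[tuple[str, str]]:
    """Return (link text, fragment) pairs for every internal link found."""
    return INTERNAL_LINK.findall(markdown)


sample = (
    "...five versts [1](#note-1) from Durnovka...\n"
    "1. A verst is two-thirds of a mile. [↩︎](#noteref-1)\n"
)
for text, fragment in find_internal_links(sample):
    print(f"link text {text!r} -> #{fragment}")
```

The resulting list could drive a post-processing step, such as replacing each link with its plain link text.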
EPUB files are ZIP archives containing XHTML content, images, and metadata.
The backend processing workflow includes reading container.xml to locate the OPF file.
The backend supports EPUB 2 and EPUB 3 formats, which are the most common versions used for e-books.
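The lookup of the OPF file through container.xml can be sketched with the standard library. This illustrates the EPUB container convention and is not Docling's actual implementation. The helper `find_opf_path` and the OPF path in the stand-in archive are assumptions.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by container.xml in the EPUB Open Container Format
CONTAINER_NS = {"ocf": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(source) -> str:
    """Return the archive path of the OPF package file named in container.xml."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("ocf:rootfiles/ocf:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml does not declare a rootfile")
    return rootfile.attrib["full-path"]


# In-memory stand-in archive; the OPF path below is illustrative
container_xml = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", container_xml)
buffer.seek(0)
print(find_opf_path(buffer))
```

The OPF file found this way describes the book's contents, which is why it has to be located before the content files can be read.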
In this example, we demonstrated:
✅ How to convert EPUB files to DoclingDocument format
✅ How to extract and handle images from EPUB archives
✅ How to export EPUB content to Markdown and JSON formats
✅ Different image export modes (placeholder and embedded)
✅ Understanding EPUB structure and conversion features
- Basic conversion works with the default DocumentConverter()
- Image extraction requires fetch_images=True in EpubBackendOptions