docs/releasenotes/version17.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
- Fixed `PIL.Image.MAX_IMAGE_PIXELS` being overridden when the caller did not explicitly set `max_image_mpixels`. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking `ocrmypdf.ocr()` now have their setting respected. The CLI default of 250 megapixels is unchanged. {issue}`1665`, {issue}`1666`, {issue}`1655`
- Fixed `work_folder` not being set in `PdfContext` options when using the Python API. Thanks @bluebox-steven. {issue}`1613`
- Added `--no-overwrite` / `-n` option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (`OutputFileAccessError`). {issue}`1642`, {issue}`1635`
- Fixed `optimize=2` or `optimize=3` crash when using the Python API without explicitly setting `jpg_quality` or `png_quality`. {issue}`1641`
- Fixed `verapdf` availability check crashing with `NotADirectoryError` on some platforms. {issue}`1638`
- Fixed the Python API ignoring the `language` parameter, always defaulting to `eng`. The API now correctly maps `language` to `OcrOptions.languages` and splits `+`-separated codes (e.g. `eng+deu`) to match CLI behavior. {issue}`1640`
- Fixed `tesseract_timeout` defaulting to 0, which caused Tesseract to time out immediately. The default is now `None`, falling back to the plugin's 180-second timeout. {issue}`1636`, {issue}`1630`
- Added `--image` option for the `hocrtransform` tool, enabling sandwich PDF output with the fpdf2 renderer. {issue}`1634`, {issue}`1632`, {issue}`1631`
- `--redo-ocr` mode would shift text vertically on these files. {issue}`1630`, {issue}`1612`
- Added `--tagged-pdf-mode` to allow skipping the TaggedPDF error message, if desired.
- `--mode force` in particular).

### Breaking changes
- Plugins now receive `OcrOptions` objects instead of `argparse.Namespace` objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from `Namespace` to `OcrOptions`.
- Lossy JBIG2 compression, formerly enabled by the `--jbig2-lossy` and `--jbig2-page-group-size` options, has been removed due to well-documented risks of character substitution errors. These options are now deprecated and emit a warning if used. Only lossless JBIG2 compression is supported.
- When neither Ghostscript nor verapdf is available, `--output-type auto` (the new default) will produce a standard PDF instead of PDF/A. This is a change from previous versions, where Ghostscript was required and PDF/A was always produced. This configuration is rare, but users should be aware of the change.

### New features
- **pypdfium2 rasterizer**: Added an optional pypdfium2-based PDF rasterization plugin as an
  alternative to Ghostscript for page rendering. Use `--rasterizer pypdfium` to enable it
  (requires `pip install pypdfium2`). The default `--rasterizer auto` prefers pypdfium when
  available and falls back to Ghostscript.
- **Pluggable OCR engines**: New `--ocr-engine` option allows selecting OCR engines:
  - `auto` (default): Uses Tesseract
  - `tesseract`: Explicit Tesseract selection
  - `none`: Skip OCR entirely for PDF processing-only workflows

  This prepares the foundation for future third-party OCR engine plugins.
- **Smart PDF/A conversion**: New `--output-type auto` (now the default) produces best-effort
  PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back
  to traditional Ghostscript conversion when needed.
- **verapdf integration**: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.
- **Optional Ghostscript**: As a consequence of the changes above, Ghostscript is no longer a required dependency; it is now optional.
- **fpdf2 text renderer**: Replaced the legacy hOCR text renderer with a new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.
- **Improved Occulta glyphless font**: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.
- **Expanded multilingual font support**: Added `FontProvider` infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.
- **Simplified mode selection**: New `--mode` (`-m`) argument consolidates processing options:
  - `default`: Error if text is found (standard behavior)
  - `force`: Rasterize all content and run OCR (replaces `--force-ocr`)
  - `skip`: Skip pages with existing text (replaces `--skip-text`)
  - `redo`: Re-OCR pages, stripping old text layer (replaces `--redo-ocr`)

  Legacy flags remain as silent aliases for backward compatibility.
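
For scripts that still pass the legacy flags, the flag-to-mode mapping listed above can be applied mechanically. The following is a minimal sketch of a hypothetical migration helper; `LEGACY_FLAG_TO_MODE` and `modernize_args` are illustrative names and are not part of OCRmyPDF, and the sketch does not describe how OCRmyPDF itself implements the aliases.

```python
# Hypothetical helper: rewrite legacy OCRmyPDF flags to the --mode form.
# The flag-to-mode pairs come from the release notes; the names are illustrative.
LEGACY_FLAG_TO_MODE = {
    "--force-ocr": "force",
    "--skip-text": "skip",
    "--redo-ocr": "redo",
}


def modernize_args(argv: list[str]) -> list[str]:
    """Return a copy of argv with each legacy flag replaced by --mode <value>."""
    result: list[str] = []
    for arg in argv:
        mode = LEGACY_FLAG_TO_MODE.get(arg)
        if mode is None:
            result.append(arg)  # not a legacy flag; pass through unchanged
        else:
            result.extend(["--mode", mode])
    return result


if __name__ == "__main__":
    print(modernize_args(["--skip-text", "input.pdf", "output.pdf"]))
```

Because the legacy flags remain accepted, a rewrite like this is optional; it is only useful for keeping wrapper scripts aligned with the new spelling.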
### API improvements

- `OcrOptions` Pydantic model
- `OcrElement`, `OcrClass`, and `BoundingBox` exports for OCR engine plugin developers
- `OcrEngine` ABC with a `generate_ocr()` method for direct OCR tree output, removing the need to translate a modern engine's output to hOCR or to write directly to PDF.

### Bug fixes
### Documentation

- `--tesseract-timeout 0` with `--ocr-engine none`.

### Dependency changes
- `pypdfium2` or `ghostscript` for PDF rasterization (PDF to image)
- `verapdf` or `ghostscript` for PDF/A generation
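
One way to check these requirements before running a job is to probe for each tool. The sketch below is an illustration under stated assumptions, not OCRmyPDF's own detection logic: the executable names `gs` and `verapdf` are assumed (Ghostscript is named differently on Windows), `choose_rasterizer` and `probe_tools` are hypothetical helpers, and the returned strings are labels only. The preference order follows the `--rasterizer auto` rule in these notes, which prefers pypdfium and falls back to Ghostscript.

```python
import importlib.util
import shutil


def choose_rasterizer(has_pypdfium: bool, has_ghostscript: bool) -> str:
    """Mirror the documented 'auto' preference: pypdfium first, then Ghostscript."""
    if has_pypdfium:
        return "pypdfium"
    if has_ghostscript:
        return "ghostscript"
    raise RuntimeError("need pypdfium2 or Ghostscript for PDF rasterization")


def probe_tools() -> dict[str, bool]:
    """Report which tools appear to be installed (executable names are assumptions)."""
    return {
        "pypdfium2": importlib.util.find_spec("pypdfium2") is not None,
        "ghostscript": shutil.which("gs") is not None,
        "verapdf": shutil.which("verapdf") is not None,
    }


if __name__ == "__main__":
    tools = probe_tools()
    print(tools)
    if not (tools["verapdf"] or tools["ghostscript"]):
        print("PDF/A generation unavailable: install verapdf or Ghostscript")
```
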
- `pypdfium2` for PDF rasterization (new dependency)
- `ghostscript` (used to be required)
- `verapdf` for fast PDF/A validation (new dependency)
- `fpdf2` for text layer rendering (new dependency)
- Replaced `typer` with `cyclopts` in misc scripts (new dependency)

### Migration guide for plugin developers
- Import the new type: `from ocrmypdf._options import OcrOptions`
- Use `def check_options(options: OcrOptions)` instead of `options: Namespace`
- Read option values as attributes: `options.languages`, `options.output_type`, etc.
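
The duck-typing compatibility noted under breaking changes can be illustrated without OCRmyPDF installed. In this sketch, `StandInOptions` is a hypothetical stand-in for `OcrOptions` (a plain dataclass with only two fields, whereas the real class is a Pydantic model), and `describe` is an illustrative function, not a real plugin hook. It only shows why code that reads options by attribute keeps working when the object type changes.

```python
from argparse import Namespace
from dataclasses import dataclass, field


@dataclass
class StandInOptions:
    """Hypothetical stand-in for OcrOptions; only the two fields used below."""

    languages: list[str] = field(default_factory=lambda: ["eng"])
    output_type: str = "auto"


def describe(options) -> str:
    """Use attribute access only, so any object exposing these names works."""
    return f"{'+'.join(options.languages)} -> {options.output_type}"


if __name__ == "__main__":
    legacy = Namespace(languages=["eng", "deu"], output_type="pdfa")
    modern = StandInOptions(languages=["eng", "deu"], output_type="pdfa")
    print(describe(legacy))
    print(describe(modern))
```

Plugin code that relies on `Namespace`-specific behavior, such as `vars(options)`, is the kind that may need attention beyond updating type hints.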