% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

v17

v17.4.2

Fixed Python API unconditionally overriding PIL.Image.MAX_IMAGE_PIXELS when the caller did not explicitly set max_image_mpixels. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking ocrmypdf.ocr() now have their setting respected. The CLI default of 250 megapixels is unchanged. {issue}1665
Updated uv.lock to avoid pinning a vulnerable version of Pillow. {issue}1666
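
The {issue}1665 fix amounts to treating an unset max_image_mpixels as "leave the host's limit alone". The following is a minimal sketch of that decision, not OCRmyPDF's actual code. The helper name resolve_pixel_limit is hypothetical, the limit is passed as a plain integer so the example runs without Pillow, and one megapixel is assumed to mean 1,000,000 pixels.

```python
from typing import Optional


def resolve_pixel_limit(
    max_image_mpixels: Optional[float], current_limit: Optional[int]
) -> Optional[int]:
    """Return the pixel limit to apply (hypothetical helper).

    current_limit stands in for the value the host application already
    stored in PIL.Image.MAX_IMAGE_PIXELS (None means "no limit").
    """
    if max_image_mpixels is None:
        # Caller did not ask for a limit: keep whatever the host configured.
        return current_limit
    # Caller asked explicitly: convert megapixels to pixels and override.
    return int(max_image_mpixels * 1_000_000)
```

With this shape, a host that set its own limit before calling the API keeps it unless it passes max_image_mpixels itself.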

v17.4.1

Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef ligatures and other multi-character CMap entries were garbled by the bidi algorithm during text extraction. {issue}1655
Fixed work_folder not being set in PdfContext options when using the Python API. Thanks @bluebox-steven. {issue}1613
Updated Ghostscript JPEG corruption warning to include the detected version number, confirming the bug persists in Ghostscript 10.7.0.
Internal refactoring.
CI dependency updates.

v17.4.0

Added --no-overwrite / -n option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (OutputFileAccessError). {issue}1642
Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps instead of relying on Td positioning, causing text selection to highlight far beyond the actual word boundaries. {issue}1635
Fixed optimize=2 or optimize=3 crash when using the Python API without explicitly setting jpg_quality or png_quality. {issue}1641
Fixed verapdf availability check crashing with NotADirectoryError on some platforms. {issue}1638
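
The --no-overwrite check can be pictured as a guard that runs before any output is written. This is only a sketch under assumptions: check_output and the constant name are hypothetical, and the only facts taken from the note above are the exit code 5 and the "destination already exists" condition.

```python
import tempfile
from pathlib import Path

# Exit code named in the release note; the constant name is illustrative.
EXIT_OUTPUT_FILE_ACCESS_ERROR = 5


def check_output(path: Path, no_overwrite: bool) -> int:
    """Return 0 if writing to path may proceed, otherwise exit code 5."""
    if no_overwrite and path.exists():
        return EXIT_OUTPUT_FILE_ACCESS_ERROR
    return 0


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "out.pdf"
    existing.write_bytes(b"%PDF-1.7\n")
    missing = Path(tmp) / "new.pdf"
    results = (
        check_output(existing, no_overwrite=True),   # file exists, flag set
        check_output(existing, no_overwrite=False),  # file exists, flag unset
        check_output(missing, no_overwrite=True),    # nothing to overwrite
    )
print(results)
```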

v17.3.0

Fixed Python API ignoring the language parameter, always defaulting to eng. The API now correctly maps language to OcrOptions languages and splits +-separated codes (e.g. eng+deu) to match CLI behavior. {issue}1640
Fixed Python API producing empty OCR output because tesseract_timeout defaulted to 0, causing Tesseract to time out immediately. The default is now None, falling back to the plugin's 180-second timeout. {issue}1636
Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now always computed, not skipped when rotation is zero. {issue}1630
Restored image overlay support (--image) for the hocrtransform tool, enabling sandwich PDF output with the fpdf2 renderer. {issue}1634
Docker: updated Alpine base image to 3.23.
Documentation restructured into per-major-version release notes files.
Release process improvements.
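
The language and tesseract_timeout fixes above both concern how API arguments are normalized before OCR runs. Here is a rough sketch of the intended behavior, assuming hypothetical helper names; the 180-second value comes from the note above, and the fallback to eng when no language is given is an assumption of this example.

```python
from typing import List, Optional, Union

# Fallback named in the note above; in OCRmyPDF it belongs to the plugin.
PLUGIN_DEFAULT_TIMEOUT = 180.0


def normalize_languages(language: Union[str, List[str], None]) -> List[str]:
    """Split '+'-separated codes into a flat list (assumed default: eng)."""
    if not language:
        return ["eng"]
    items = [language] if isinstance(language, str) else list(language)
    codes: List[str] = []
    for item in items:
        # "eng+deu" becomes ["eng", "deu"]; empty fragments are dropped.
        codes.extend(code for code in item.split("+") if code)
    return codes


def effective_timeout(tesseract_timeout: Optional[float]) -> float:
    """None means "not set", so use the plugin default; 0 stays 0."""
    if tesseract_timeout is None:
        return PLUGIN_DEFAULT_TIMEOUT
    return tesseract_timeout
```

Distinguishing None from 0 is the point of the second helper: a default of 0 is a real timeout value, which is why the old default produced empty OCR output.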

v17.2.0

Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince, pdftotext, and others) where words on the same line appeared separated by double newlines. This works around a poppler bug where Tz (horizontal scaling) is not carried across BT/ET boundaries. {issue}1632
Fixed OCR text layer being visible instead of invisible due to incorrect fpdf2 text rendering mode attribute. This caused OCR text to appear when images were removed from the PDF. {issue}1631
Fixed OCR text layer misalignment with non-zero mediabox origins, which affected cropped PDFs and JSTOR PDFs generated by iText. The --redo-ocr mode would shift text vertically on these files. {issue}1630
Fixed Ghostscript rasterization failure with very low DPI values (below 10). OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match the originally requested dimensions. {issue}1612
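
The low-DPI workaround in {issue}1612 can be illustrated as "render at no less than 10 DPI, then resize to the size the caller asked for". The sketch below is an illustration under assumptions rather than the actual implementation: plan_rasterization is a hypothetical helper, and page dimensions are assumed to be in PDF points (1/72 inch).

```python
from typing import Tuple

MIN_RENDER_DPI = 10.0  # floor named in the note above


def plan_rasterization(
    width_pt: float, height_pt: float, requested_dpi: float
) -> Tuple[float, Tuple[int, int]]:
    """Return (DPI to render at, final pixel size to resize to)."""
    render_dpi = max(requested_dpi, MIN_RENDER_DPI)
    # The final image keeps the size implied by the *requested* DPI.
    target_px = (
        max(1, round(width_pt / 72 * requested_dpi)),
        max(1, round(height_pt / 72 * requested_dpi)),
    )
    return render_dpi, target_px


# US Letter (612 x 792 pt) requested at 4 DPI: render at 10, resize to 34 x 44.
print(plan_rasterization(612, 792, 4))
```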

v17.1.0

Added --tagged-pdf-mode, which allows the Tagged PDF error message to be skipped.
Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as lossless images for the purpose of determining whether to compress to JPEG, causing file size inflation with some workflows (--mode force in particular).
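
A deflated JPEG is a stream whose filter chain wraps DCTDecode in FlateDecode, so the image data underneath is still lossy. The classification might look like the sketch below. is_jpeg_underneath is a hypothetical helper and filters are represented as plain PDF name strings; how OCRmyPDF inspects images internally is not shown here.

```python
from typing import Sequence

# Lossless wrappers that may sit on top of the real image encoding.
WRAPPER_FILTERS = {"/FlateDecode"}


def is_jpeg_underneath(filters: Sequence[str]) -> bool:
    """True if the chain is DCTDecode, optionally wrapped in FlateDecode."""
    core = [f for f in filters if f not in WRAPPER_FILTERS]
    return core == ["/DCTDecode"]
```

Under this rule, a FlateDecode + DCTDecode image is grouped with ordinary JPEGs rather than with lossless images when deciding whether to compress to JPEG.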

v17.0.1

Fixed output file size inflation when using pypdfium as rasterizer and force-ocr mode.

v17.0.0

Breaking changes

Plugin interface migration: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from Namespace to OcrOptions.
Built-in plugins no longer modify options in-place, improving immutability and code clarity.
Lossy JBIG2 removed: Lossy JBIG2 encoding has been removed due to well-documented risks of character substitution errors. The --jbig2-lossy and --jbig2-page-group-size options are deprecated and emit a warning if used. Only lossless JBIG2 compression is supported.
PDF/A output behavior change: If neither Ghostscript nor verapdf is installed, --output-type auto (the new default) will produce a standard PDF instead of PDF/A. This is a change from previous versions where Ghostscript was required and PDF/A was always produced. This configuration is rare but users should be aware of the change.

New features

pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable (requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when available and falls back to Ghostscript.
Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:
- auto (default): Uses Tesseract
- tesseract: Explicit Tesseract selection
- none: Skip OCR entirely for PDF processing-only workflows
This prepares the foundation for future third-party OCR engine plugins.
Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back to traditional Ghostscript conversion when needed.
verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.
Optional Ghostscript: As a consequence of the changes above, Ghostscript is now an optional dependency rather than a required one.
fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.
Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.
Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.
Simplified mode selection: New --mode (-m) argument consolidates processing options:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)
Legacy flags remain as silent aliases for backward compatibility.
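
The relationship between --mode and the legacy flags can be summarized as a small lookup. This is only a sketch: resolve_mode is a hypothetical helper, the flag-to-mode pairs come from the list above, and raising an error on conflicting inputs is an assumption of this example rather than documented behavior.

```python
from typing import Optional

# Legacy flag -> --mode value, as listed above.
LEGACY_FLAG_TO_MODE = {
    "force_ocr": "force",
    "skip_text": "skip",
    "redo_ocr": "redo",
}


def resolve_mode(mode: Optional[str] = None, **legacy_flags: bool) -> str:
    """Pick the processing mode from --mode or a legacy flag."""
    from_flags = {
        LEGACY_FLAG_TO_MODE[name]
        for name, enabled in legacy_flags.items()
        if enabled and name in LEGACY_FLAG_TO_MODE
    }
    if len(from_flags) > 1:
        # Assumed policy: two different legacy flags cannot both apply.
        raise ValueError(f"conflicting legacy flags: {sorted(from_flags)}")
    if mode is not None:
        if from_flags and from_flags != {mode}:
            raise ValueError("--mode conflicts with a legacy flag")
        return mode
    return from_flags.pop() if from_flags else "default"
```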

API improvements

Centralized validation logic in the OcrOptions Pydantic model
Removed scattered option mutation throughout the codebase
Better type safety for plugin development
Simplified plugin option handling
New OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers
Extended OcrEngine ABC with generate_ocr() method for direct OCR tree output, removing the need to translate a modern engine's output to hOCR or to write directly to PDF.

Bug fixes

Fixed double-compression of already-deflated JPEGs.
Fixed tesseract_cache plugin to properly handle cache misses.
Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.
Added thread safety lock to pypdfium plugin for concurrent operations.
Improved pdfminer.six compatibility with explicit word spacing.

Documentation

Updated cookbook to replace deprecated --tesseract-timeout 0 with --ocr-engine none.
Added comprehensive plugin documentation for new OCR engine framework.

Dependency changes

Requires: one of pypdfium2 or ghostscript for PDF rasterization (PDF to image)
- Preferred: both
Requires: one of verapdf or ghostscript for PDF/A generation
- Preferred: both
Recommended: pypdfium2 for PDF rasterization (new dependency)
Recommended: ghostscript (used to be Required)
Recommended: Noto fonts for improved OCR text positioning
Optional: verapdf for fast PDF/A validation (new dependency)
Requires: fpdf2 for text layer rendering (new dependency)
Recommended: cyclopts, which replaces typer in the misc scripts (new dependency)
See docs/maintainers.md for details.
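
The "one of" requirements above, combined with the documented auto defaults, can be read as the following decision table. This is an illustrative sketch only: plan_backends is a hypothetical helper, availability is passed in as booleans, and raising RuntimeError when no rasterizer is present is an assumption of this example.

```python
from typing import Tuple


def plan_backends(
    has_pypdfium: bool, has_ghostscript: bool, has_verapdf: bool
) -> Tuple[str, str]:
    """Return (rasterizer, output kind) for the two auto settings."""
    # --rasterizer auto: prefer pypdfium, fall back to Ghostscript.
    if has_pypdfium:
        rasterizer = "pypdfium"
    elif has_ghostscript:
        rasterizer = "ghostscript"
    else:
        raise RuntimeError("need pypdfium2 or Ghostscript to rasterize pages")

    # --output-type auto: PDF/A needs verapdf or Ghostscript, else plain PDF.
    output = "pdfa" if (has_verapdf or has_ghostscript) else "pdf"
    return rasterizer, output
```

In a real environment the three booleans could be probed with standard-library calls such as importlib.util.find_spec("pypdfium2") and shutil.which("verapdf"); how OCRmyPDF itself detects these tools is not covered here.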

Migration guide for plugin developers

Update imports: from ocrmypdf._options import OcrOptions
Update type hints: def check_options(options: OcrOptions) instead of options: Namespace
Attribute access remains unchanged: options.languages, options.output_type, etc.
Remove any in-place option modifications - compute values at point of use instead
Most existing plugins will continue working without changes due to duck-typing
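
The "compute values at point of use" advice might look like the sketch below. So that the example runs on its own, FakeOcrOptions is a hypothetical stand-in for ocrmypdf._options.OcrOptions that models only the two attributes named above, and tesseract_lang_arg is an illustrative helper, not part of OCRmyPDF.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FakeOcrOptions:
    """Stand-in for OcrOptions; frozen so in-place edits fail loudly."""

    languages: Tuple[str, ...] = ("eng",)
    output_type: str = "auto"


# Before (discouraged): a hook that rewrote options in place, e.g.
#     options.languages = sorted(options.languages)


def tesseract_lang_arg(options: FakeOcrOptions) -> str:
    """After: derive the value where it is needed; options stays untouched."""
    return "+".join(sorted(options.languages))


def check_options(options: FakeOcrOptions) -> None:
    """Hook body: validate only, never assign to options."""
    if not options.languages:
        raise ValueError("at least one language is required")
```

In an actual plugin, check_options would be decorated with hookimpl from ocrmypdf and annotated with the real OcrOptions type, as described in the steps above.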