% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

v17

v17.4.2

  • Fixed Python API unconditionally overriding PIL.Image.MAX_IMAGE_PIXELS when the caller did not explicitly set max_image_mpixels. Host applications (e.g. Paperless-NGX) that configure the PIL limit before invoking ocrmypdf.ocr() now have their setting respected. The CLI default of 250 megapixels is unchanged. {issue}1665
  • Updated uv.lock to avoid pinning a vulnerable version of Pillow. {issue}1666
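
The first fix amounts to a precedence rule: an explicit max_image_mpixels wins, otherwise the limit already configured by the host application is left alone. The following is a minimal sketch of that rule, not OCRmyPDF's actual code. `resolve_pixel_limit` is a hypothetical helper, and the host limit is passed in as a plain integer rather than read from Pillow.

```python
from __future__ import annotations


def resolve_pixel_limit(
    max_image_mpixels: float | None, host_limit: int | None
) -> int | None:
    """Pick the pixel limit to apply (hypothetical helper, for illustration).

    max_image_mpixels: value the caller passed explicitly, or None if omitted.
    host_limit: limit the host application already configured (None = no limit).
    """
    if max_image_mpixels is None:
        # Caller did not ask for a limit: keep the host's setting untouched.
        return host_limit
    # Caller was explicit: convert megapixels to pixels and use that value.
    return int(max_image_mpixels * 1_000_000)
```

Under this sketch, a host that configured its own limit keeps it unless the caller passes max_image_mpixels.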

v17.4.1

  • Fixed RTL text extraction order in the fpdf2 renderer. Arabic lam-alef ligatures and other multi-character CMap entries were garbled by the bidi algorithm during text extraction. {issue}1655
  • Fixed work_folder not being set in PdfContext options when using the Python API. Thanks @bluebox-steven. {issue}1613
  • Updated the Ghostscript JPEG corruption warning to include the detected version number. The bug is confirmed to persist in Ghostscript 10.7.0.
  • Internal refactoring.
  • CI dependency updates.

v17.4.0

  • Added --no-overwrite / -n option to prevent overwriting output files. If the destination file already exists, OCRmyPDF exits with code 5 (OutputFileAccessError). {issue}1642
  • Fixed text layer stretching in the fpdf2 renderer for widely-spaced words. The horizontal scaling (Tz) was incorrectly stretched to fill inter-word gaps instead of relying on Td positioning, causing text selection to highlight far beyond the actual word boundaries. {issue}1635
  • Fixed a crash with optimize=2 or optimize=3 when using the Python API without explicitly setting jpg_quality or png_quality. {issue}1641
  • Fixed verapdf availability check crashing with NotADirectoryError on some platforms. {issue}1638
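
The --no-overwrite behavior can be pictured as a pre-check on the destination path. This is a rough sketch only: `precheck_output` and the constant names are hypothetical, and the only detail taken from the note above is exit code 5 for an existing destination.

```python
from pathlib import Path

# Exit code stated in the release note for OutputFileAccessError.
EXIT_OUTPUT_FILE_ACCESS_ERROR = 5
EXIT_OK = 0


def precheck_output(output_file: Path, no_overwrite: bool) -> int:
    """Return the exit code a no-overwrite pre-check would yield.

    Hypothetical helper: it only inspects the path and never writes to it.
    """
    if no_overwrite and output_file.exists():
        # Destination already exists and the caller asked us not to replace it.
        return EXIT_OUTPUT_FILE_ACCESS_ERROR
    return EXIT_OK
```

A wrapper script could test for exit code 5 to tell "output already exists" apart from other failures.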

v17.3.0

  • Fixed Python API ignoring the language parameter, always defaulting to eng. The API now correctly maps language to OcrOptions languages and splits +-separated codes (e.g. eng+deu) to match CLI behavior. {issue}1640
  • Fixed Python API producing empty OCR output because tesseract_timeout defaulted to 0, causing Tesseract to time out immediately. The default is now None, falling back to the plugin's 180-second timeout. {issue}1636
  • Fixed OCR text layer displacement on PDFs with non-zero MediaBox origins (e.g. JSTOR or cropped PDFs). The coordinate transformation matrix is now always computed, not skipped when rotation is zero. {issue}1630
  • Restored image overlay support (--image) for the hocrtransform tool, enabling sandwich PDF output with the fpdf2 renderer. {issue}1634
  • Docker: updated Alpine base image to 3.23.
  • Documentation restructured into per-major-version release notes files.
  • Release process improvements.
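
The language fix above describes splitting +-separated codes so the Python API matches the CLI. Below is a minimal sketch of that normalization, assuming the argument may be a single string or a list of strings. `split_languages` is a hypothetical name, not a function exported by OCRmyPDF.

```python
from __future__ import annotations

from collections.abc import Iterable


def split_languages(language: str | Iterable[str] | None) -> list[str]:
    """Normalize a language argument to a flat list of codes (hypothetical helper)."""
    if language is None:
        return []
    # Treat a bare string as a one-element list so both forms share one path.
    items = [language] if isinstance(language, str) else list(language)
    codes: list[str] = []
    for item in items:
        # "eng+deu" names two languages, as on the command line.
        codes.extend(part for part in item.split("+") if part)
    return codes


print(split_languages("eng+deu"))  # prints ['eng', 'deu']
```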

v17.2.0

  • Fixed incorrect word spacing in poppler-based PDF viewers and tools (Evince, pdftotext, and others) where words on the same line appeared separated by double newlines. This works around a poppler bug where Tz (horizontal scaling) is not carried across BT/ET boundaries. {issue}1632
  • Fixed OCR text layer being visible instead of invisible due to an incorrect fpdf2 text rendering mode attribute. This caused OCR text to appear when images were removed from the PDF. {issue}1631
  • Fixed OCR text layer misalignment with non-zero mediabox origins, which affected cropped PDFs and JSTOR PDFs generated by iText. The --redo-ocr mode would shift text vertically on these files. {issue}1630
  • Fixed Ghostscript rasterization failure with very low DPI values (below 10). OCRmyPDF now renders at a minimum of 10 DPI and resizes the output to match the originally requested dimensions. {issue}1612
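
The low-DPI fix can be read as "render at no less than 10 DPI, then resize to the size the caller asked for". The sketch below only does the arithmetic for that plan. `RasterPlan` and `plan_rasterization` are hypothetical names, and the actual rendering and resizing steps are not shown.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_RENDER_DPI = 10   # minimum rendering resolution named in the release note
POINTS_PER_INCH = 72  # PDF user-space units per inch


@dataclass(frozen=True)
class RasterPlan:
    render_dpi: float             # resolution to hand to the rasterizer
    target_size: tuple[int, int]  # (width, height) in pixels at the requested DPI


def plan_rasterization(
    width_pt: float, height_pt: float, requested_dpi: float
) -> RasterPlan:
    """Clamp the render DPI and compute the size to resize back to."""
    render_dpi = max(requested_dpi, MIN_RENDER_DPI)
    target_size = (
        max(1, round(width_pt / POINTS_PER_INCH * requested_dpi)),
        max(1, round(height_pt / POINTS_PER_INCH * requested_dpi)),
    )
    return RasterPlan(render_dpi, target_size)


# A 720 x 1440 pt page requested at 5 DPI is rendered at 10 DPI,
# then resized down to 50 x 100 pixels.
print(plan_rasterization(720, 1440, 5))
```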

v17.1.0

  • Added --tagged-pdf-mode to allow skipping the TaggedPDF error message, if desired.
  • Fixed an issue where deflated JPEGs (FlateDecode + DCTDecode) were counted as lossless images for the purpose of determining whether to compress to JPEG, causing file size inflation with some workflows (--mode force in particular).

v17.0.1

  • Fixed output file size inflation when using pypdfium as the rasterizer in force-OCR mode (--mode force).

v17.0.0

Breaking changes

  • Plugin interface migration: Plugin hooks now receive OcrOptions objects instead of argparse.Namespace objects. Most plugins will continue working due to duck-typing compatibility, but plugin developers should update their type hints from Namespace to OcrOptions.
  • Built-in plugins no longer modify options in-place, improving immutability and code clarity.
  • Lossy JBIG2 removed: Lossy JBIG2 compression has been removed due to well-documented risks of character substitution errors. The --jbig2-lossy and --jbig2-page-group-size options no longer enable lossy compression; they are deprecated and emit warnings if used. Only lossless JBIG2 compression is supported.
  • PDF/A output behavior change: If neither Ghostscript nor verapdf is installed, --output-type auto (the new default) will produce a standard PDF instead of PDF/A. This differs from previous versions, where Ghostscript was required and PDF/A was always produced. This configuration is rare, but users should be aware of the change.

New features

  • pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable (requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when available and falls back to Ghostscript.

  • Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:

    • auto (default): Uses Tesseract
    • tesseract: Explicit Tesseract selection
    • none: Skip OCR entirely for PDF processing-only workflows

    This prepares the foundation for future third-party OCR engine plugins.

  • Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back to traditional Ghostscript conversion when needed.

  • verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.

  • Optional Ghostscript: As a consequence of the changes above, Ghostscript is no longer a required dependency; it is now optional.

  • fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.

  • Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.

  • Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.

  • Simplified mode selection: New --mode (-m) argument consolidates processing options:

    • default: Error if text is found (standard behavior)
    • force: Rasterize all content and run OCR (replaces --force-ocr)
    • skip: Skip pages with existing text (replaces --skip-text)
    • redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

    Legacy flags remain as silent aliases for backward compatibility.
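
The relationship between the legacy flags and the new --mode values can be sketched as a lookup table. The mapping itself comes from the list above. The conflict handling (raising when two different modes are requested) is an assumption for illustration, and `resolve_mode` is a hypothetical helper rather than part of OCRmyPDF.

```python
from __future__ import annotations

# Legacy flag (as a keyword name) -> value of the new --mode argument.
LEGACY_FLAG_TO_MODE = {
    "force_ocr": "force",
    "skip_text": "skip",
    "redo_ocr": "redo",
}


def resolve_mode(mode: str | None = None, **legacy_flags: bool) -> str:
    """Fold an explicit mode and any legacy flags into one mode name."""
    unknown = set(legacy_flags) - set(LEGACY_FLAG_TO_MODE)
    if unknown:
        raise TypeError(f"unknown legacy flag(s): {sorted(unknown)}")
    selected = {LEGACY_FLAG_TO_MODE[name] for name, on in legacy_flags.items() if on}
    if mode is not None:
        selected.add(mode)
    if len(selected) > 1:
        # Assumption: conflicting requests are rejected rather than prioritized.
        raise ValueError(f"conflicting modes requested: {sorted(selected)}")
    return selected.pop() if selected else "default"


print(resolve_mode(skip_text=True))  # prints skip
```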

API improvements

  • Centralized validation logic in the OcrOptions Pydantic model
  • Removed scattered option mutation throughout the codebase
  • Better type safety for plugin development
  • Simplified plugin option handling
  • New OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers
  • Extended OcrEngine ABC with generate_ocr() method for direct OCR tree output, eliminating the need to translate a modern engine's output to hOCR or write directly to PDF.

Bug fixes

  • Fixed double-compression of already-deflated JPEGs.
  • Fixed tesseract_cache plugin to properly handle cache misses.
  • Fixed handling of PDF page boxes (ArtBox, BleedBox) which were not being processed correctly.
  • Added thread safety lock to pypdfium plugin for concurrent operations.
  • Improved pdfminer.six compatibility with explicit word spacing.

Documentation

  • Updated cookbook to replace deprecated --tesseract-timeout 0 with --ocr-engine none.
  • Added comprehensive plugin documentation for new OCR engine framework.

Dependency changes

  • Requires: one of pypdfium2 or ghostscript for PDF rasterization (PDF to image)
    • Preferred: both
  • Requires: one of verapdf or ghostscript for PDF/A generation
    • Preferred: both
  • Recommended: pypdfium2 for PDF rasterization (new dependency)
  • Recommended: ghostscript (used to be Required)
  • Recommended: Noto fonts for improved OCR text positioning
  • Optional: verapdf for fast PDF/A validation (new dependency)
  • Requires: fpdf2 for text layer rendering (new dependency)
  • Recommended: cyclopts, which replaces typer in the misc scripts (new dependency)
  • See docs/maintainers.md for details.
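
The "one of" requirements above can be checked before running a job. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the executable names `gs` and `verapdf` are assumed (Ghostscript is named differently on Windows), and the returned strings are labels for this example, not OCRmyPDF option values.

```python
from __future__ import annotations

import importlib.util
import shutil


def choose_rasterizer(have_pypdfium: bool, have_ghostscript: bool) -> str:
    """Mirror the described auto preference: pypdfium first, then Ghostscript."""
    if have_pypdfium:
        return "pypdfium"
    if have_ghostscript:
        return "ghostscript"
    raise RuntimeError("need pypdfium2 or Ghostscript to rasterize pages")


def pdfa_strategies(have_verapdf: bool, have_ghostscript: bool) -> list[str]:
    """List PDF/A approaches in the order described; empty means standard PDF."""
    strategies = []
    if have_verapdf:
        strategies.append("speculative conversion validated by verapdf")
    if have_ghostscript:
        strategies.append("ghostscript conversion")
    return strategies


def detect() -> dict[str, bool]:
    """Probe the local environment (executable names are assumptions)."""
    return {
        "pypdfium2": importlib.util.find_spec("pypdfium2") is not None,
        "ghostscript": shutil.which("gs") is not None,
        "verapdf": shutil.which("verapdf") is not None,
    }
```

Keeping the decision functions separate from `detect()` makes the policy easy to test without either tool installed.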

Migration guide for plugin developers

  • Update imports: from ocrmypdf._options import OcrOptions
  • Update type hints: def check_options(options: OcrOptions) instead of options: Namespace
  • Attribute access remains unchanged: options.languages, options.output_type, etc.
  • Remove any in-place option modifications; compute values at the point of use instead
  • Most existing plugins will continue working without changes due to duck-typing
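
The steps above might look roughly like this in a plugin module. This is a sketch under assumptions: the validation rule and `primary_language` are invented for illustration, the OcrOptions import is used for type checking only so the snippet runs without OCRmyPDF installed, and a real plugin would also mark `check_options` with OCRmyPDF's hook decorator.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import path given in the migration guide; only needed by type checkers here.
    from ocrmypdf._options import OcrOptions


def check_options(options: OcrOptions) -> None:
    """Validate options without mutating them (illustrative rule only)."""
    if not options.languages:
        raise ValueError("at least one OCR language is required")


def primary_language(options: OcrOptions) -> str:
    """Compute a derived value at the point of use instead of storing it on options."""
    return options.languages[0]


# Duck typing: any object exposing the same attributes works for a quick check.
demo = SimpleNamespace(languages=["eng", "deu"], output_type="auto")
check_options(demo)
print(primary_language(demo))  # prints eng
```

Because neither function writes to `options`, the same code works whether the object is immutable or not.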