% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
app, uid/gid 1000) by default
rather than as root, as a defense-in-depth measure. If you bind-mount a
directory for input and output, you may now need to add a --user argument so
the container can write to it; the correct value differs for rootless Docker,
Podman, and rootful Docker, and is described in the Docker documentation.
Piping the input and output through stdin/stdout still works with no
permission setup.

/data, so files in
a directory mounted there can be given as relative paths without an explicit
--workdir.

alex-p/tesseract-ocr5 PPA, and the base images
were updated to Ubuntu 26.04 and Alpine 3.24.

-v 1) ({issue}846).

--pdfa-image-compression=auto (the default) now selects lossless image
compression at -O0 so Ghostscript no longer transcodes lossless images to
JPEG during PDF/A generation. At -O1 and above, auto continues to defer
to Ghostscript's heuristic, which may recompress images lossily. -O1 (the
default level) is kept as a historical exception because coercing it to
lossless can substantially bloat output; users who want guaranteed lossless
image handling should pass --pdfa-image-compression=lossless or use -O0
({issue}1124).

--pdfa-image-compression=lossless now passes existing JPEG images through
unchanged rather than re-encoding them with a lossless codec. Re-encoding an
already-lossy JPEG losslessly cannot recover quality and only inflates the
file, so JPEGs are preserved while non-JPEG images are encoded losslessly.

/MediaBox, /CropBox, /TrimBox, /ArtBox, /BleedBox) in its
input, following the PDF 2.0 specification. Coordinates written in invalid
exponential notation are reinterpreted ({issue}1398); rectangles whose
corners are given in reversed order are normalized, which previously crashed
with NegativeDimensionError ({issue}1526); and a crop/trim/art/bleed box
that falls outside the MediaBox is clamped to their intersection, or discarded
when that intersection is empty, which previously produced an output with a
zero-height effective page that some viewers refused to open ({issue}1400).
When a box is discarded, clamped, or reinterpreted, OCRmyPDF logs a warning
recommending visual inspection of the output. Thanks @ajdlinux for the initial
fix in PR #1691.

/Root/PieceInfo/SearchIndex) from its output. This proprietary index,
produced by Acrobat's "Embed Index" feature, is read only by Adobe Acrobat;
other viewers ignore it and search the text on the fly. Because any change to
a PDF invalidates the index, retaining it after OCRmyPDF rewrites the document
would leave a stale index that returns incorrect search results in Acrobat.
Modern viewers rebuild a search index on demand, so there is no loss of
search capability.

/Thumb image XObject on a page) from its output. OCRmyPDF alters page
appearance (deskew, clean, rasterize, re-render) and plugins may edit pages
arbitrarily, so a retained thumbnail would be stale and no longer match its
page. Embedded thumbnails are a navigation aid that modern viewers generate
on demand, so there is no loss of functionality.

1688

pngmonod
(error-diffusion) instead of pngmono (ordered dithering). It produces
better input for OCR on faint or anti-aliased scans at negligible cost and
no change to output file size, since the rasterized image is an
intermediate that is discarded after OCR.

-dTextAlphaBits=4 -dGraphicsAlphaBits=4) for the
grayscale and color raster devices. Ghostscript 10.x renders aliased glyphs
that OCR frequently misreads as extra word breaks or substituted characters;
anti-aliasing materially improves OCR accuracy on the Ghostscript
rasterization path, especially for small fonts at moderate resolution. The
1-bit monochrome devices are unaffected, since they perform their own
anti-aliased downscaling and older Ghostscript versions reject alpha-bit
options on them. Note that the default rasterizer (--rasterizer auto)
prefers pypdfium2, which already anti-aliases; this change benefits users who
select --rasterizer ghostscript or do not have pypdfium2 installed.
OCRmyPDF now also logs which rasterizer rendered each page at debug verbosity
(-v 1), and the --rasterizer help text explains the OCR-quality
trade-off, to make such reports easier to diagnose. {issue}1439

-v 1) so the original wording
is available for diagnosis. {issue}1566

--mode strip, which removes the invisible OCR text layer from a PDF
in place. Unlike --ocr-engine none --force-ocr, it does not rasterize the
page, so images and visible content are preserved unchanged and the output is
smaller rather than larger. Only text drawn as invisible (PDF text render mode
3) is removed; some OCR engines -- and OCRmyPDF v2.2 and earlier -- express
text as visible glyphs covered by an opaque image, and that text cannot be
removed this way. {issue}1435

end alias in --pages, denoting the last page
of the document. For example, --pages 3-end OCRs from page 3 through
the final page. {issue}1615

--ghostscript-jpeg-quality and --ghostscript-jpeg-maxdpi
advanced options for tuning Ghostscript's PDF/A output. The optimizer's
--jpeg-quality remains the recommended file-size control.

1685

1321

TesseractConfigError with
actionable guidance, instead of crashing later with a confusing
FileNotFoundError on the missing hOCR output. {issue}1687

_exec and subprocess modules to
separate probing from execution.

PIL.Image.MAX_IMAGE_PIXELS
when the caller did not explicitly set max_image_mpixels. Host
applications (e.g. Paperless-NGX) that configure the PIL limit before
invoking ocrmypdf.ocr() now have their setting respected. The CLI
default of 250 megapixels is unchanged. {issue}1665

1666

1655

work_folder not being set in PdfContext options when using
the Python API. Thanks @bluebox-steven. {issue}1613

--no-overwrite / -n option to prevent overwriting output files.
If the destination file already exists, OCRmyPDF exits with code 5
(OutputFileAccessError). {issue}1642

1635

optimize=2 or optimize=3 crash when using the Python API without
explicitly setting jpg_quality or png_quality. {issue}1641

verapdf availability check crashing with NotADirectoryError on
some platforms. {issue}1638

language parameter, always defaulting to
eng. The API now correctly maps language to OcrOptions languages
and splits +-separated codes (e.g. eng+deu) to match CLI behavior.
{issue}1640

tesseract_timeout
defaulted to 0, causing Tesseract to time out immediately. The default is
now None, falling back to the plugin's 180-second timeout. {issue}1636

1630

--image) for the hocrtransform tool,
enabling sandwich PDF output with the fpdf2 renderer. {issue}1634

1632

1631

--redo-ocr
mode would shift text vertically on these files. {issue}1630

1612

--tagged-pdf-mode to allow skipping the TaggedPDF error message, if desired.

--mode force in particular).

Breaking changes
OcrOptions objects instead of
argparse.Namespace objects. Most plugins will continue working due to duck-typing
compatibility, but plugin developers should update their type hints from Namespace
to OcrOptions.

--jbig2-lossy and --jbig2-page-group-size options have been
removed due to well-documented risks of character substitution errors. The flags
themselves are still accepted but are deprecated and emit a warning if used. Only lossless JBIG2 compression is supported.

--output-type auto (the new default) will produce a standard PDF instead of PDF/A. This is
a change from previous versions where Ghostscript was required and PDF/A was always produced.
This configuration is rare, but users should be aware of the change.

New features
pypdfium2 rasterizer: Added optional pypdfium2-based PDF rasterization plugin as an
alternative to Ghostscript for page rendering. Use --rasterizer pypdfium to enable
(requires pip install pypdfium2). The default --rasterizer auto prefers pypdfium when
available and falls back to Ghostscript.
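
The fallback order described for `--rasterizer auto` can be pictured with a small sketch. This is not OCRmyPDF's implementation: the function names, the returned strings, and the assumption that Ghostscript is installed as an executable named `gs` are all illustrative.

```python
import importlib.util
import shutil


def choose_rasterizer(requested: str, have_pdfium: bool, have_gs: bool) -> str:
    """Pick a rasterizer name given what is installed (hypothetical helper)."""
    if requested == "auto":
        # "auto" prefers pypdfium2 and falls back to Ghostscript.
        if have_pdfium:
            return "pypdfium"
        if have_gs:
            return "ghostscript"
        raise RuntimeError("no rasterizer found: install pypdfium2 or Ghostscript")
    if requested == "pypdfium" and not have_pdfium:
        raise RuntimeError("pypdfium2 is not installed (pip install pypdfium2)")
    if requested == "ghostscript" and not have_gs:
        raise RuntimeError("Ghostscript executable not found on PATH")
    return requested


def probe_and_choose(requested: str = "auto") -> str:
    """Probe the local environment, then apply the rule above."""
    # Assumption: pypdfium2 is importable under that name, Ghostscript is "gs".
    have_pdfium = importlib.util.find_spec("pypdfium2") is not None
    have_gs = shutil.which("gs") is not None
    return choose_rasterizer(requested, have_pdfium, have_gs)
```

Keeping the decision separate from the environment probe makes the rule easy to check without either tool installed.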
Pluggable OCR engines: New --ocr-engine option allows selecting OCR engines:
- auto (default): Uses Tesseract
- tesseract: Explicit Tesseract selection
- none: Skip OCR entirely for PDF processing-only workflows

This prepares the foundation for future third-party OCR engine plugins.
Smart PDF/A conversion: New --output-type auto (now the default) produces best-effort
PDF/A output without requiring Ghostscript when the verapdf validator is available. Falls back
to traditional Ghostscript conversion when needed.
verapdf integration: Added optional verapdf validation for fast PDF/A conversion. When available, OCRmyPDF attempts speculative PDF/A conversion using pikepdf, validates with verapdf, and skips Ghostscript if validation passes.
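
The "speculate, validate, fall back" flow can be sketched as plain control flow. Everything here is hypothetical: the callables stand in for the pikepdf-based conversion, the verapdf check, and the Ghostscript conversion, and the function name and route labels are invented for illustration. Returning the input unchanged when neither tool is available is also an assumption of this sketch.

```python
from typing import Callable, Optional, Tuple


def convert_to_pdfa(
    pdf: bytes,
    speculative_convert: Callable[[bytes], bytes],
    validate: Optional[Callable[[bytes], bool]],
    ghostscript_convert: Optional[Callable[[bytes], bytes]],
) -> Tuple[bytes, str]:
    """Return (output, route), where route names the path that was taken."""
    if validate is not None:
        # Fast path: convert speculatively and keep the result only if the
        # validator accepts it.
        candidate = speculative_convert(pdf)
        if validate(candidate):
            return candidate, "speculative"
    if ghostscript_convert is not None:
        # Slow path: traditional conversion.
        return ghostscript_convert(pdf), "ghostscript"
    # Assumption: with no validator result and no Ghostscript, emit a plain PDF.
    return pdf, "plain-pdf"
```

Passing `None` for a callable models the corresponding tool not being installed.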
Optional Ghostscript: As a consequence of the changes above, Ghostscript is no longer a required dependency. It is optional.
fpdf2 text renderer: Replaced legacy hOCR text renderer with new fpdf2-based implementation, providing better multilingual support and more accurate text positioning.
Improved Occulta glyphless font: The new Occulta font provides better handling of zero-width markers and double-width CJK characters for accurate text layer positioning.
Expanded multilingual font support: Added FontProvider infrastructure with language-aware font selection for Devanagari (Hindi, Sanskrit, Marathi, Nepali), CJK (Chinese, Japanese, Korean), Arabic script, and many other scripts. System font discovery reduces package size.
Simplified mode selection: New --mode (-m) argument consolidates processing options:
- default: Error if text is found (standard behavior)
- force: Rasterize all content and run OCR (replaces --force-ocr)
- skip: Skip pages with existing text (replaces --skip-text)
- redo: Re-OCR pages, stripping old text layer (replaces --redo-ocr)

Legacy flags remain as silent aliases for backward compatibility.
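
For scripts that build OCRmyPDF command lines, the correspondence between the legacy flags and `--mode` can be written down as a lookup table. The helper below is a hypothetical convenience for updating such scripts; it does not reflect how OCRmyPDF itself resolves the aliases.

```python
from __future__ import annotations

# Legacy flag -> value for the new --mode argument, as listed above.
LEGACY_FLAG_TO_MODE = {
    "--force-ocr": "force",
    "--skip-text": "skip",
    "--redo-ocr": "redo",
}


def translate_legacy_args(argv: list[str]) -> list[str]:
    """Rewrite legacy flags into the equivalent --mode arguments."""
    out: list[str] = []
    for arg in argv:
        mode = LEGACY_FLAG_TO_MODE.get(arg)
        if mode is None:
            out.append(arg)  # not a legacy flag, keep as-is
        else:
            out.extend(["--mode", mode])
    return out
```

For example, `translate_legacy_args(["--skip-text", "in.pdf", "out.pdf"])` yields `["--mode", "skip", "in.pdf", "out.pdf"]`.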
API improvements
OcrOptions Pydantic model
OcrElement, OcrClass, and BoundingBox exports for OCR engine plugin developers
OcrEngine ABC with generate_ocr() method for direct OCR tree output, removing the need to translate a modern engine's output to hOCR or write directly to PDF.

Bug fixes
Documentation
--tesseract-timeout 0 with --ocr-engine none.

Dependency changes
pypdfium2 or ghostscript for PDF rasterization (PDF to image)
verapdf or ghostscript for PDF/A generation
pypdfium2 for PDF rasterization (new dependency)
ghostscript (previously required)
verapdf for fast PDF/A validation (new dependency)
fpdf2 for text layer rendering (new dependency)
typer with cyclopts in misc scripts (new dependency)

Migration guide for plugin developers
from ocrmypdf._options import OcrOptions
def check_options(options: OcrOptions) instead of options: Namespace
options.languages, options.output_type, etc.
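
A minimal sketch of an updated hook, assuming only the names given above (`OcrOptions`, `check_options`, `options.languages`, `options.output_type`). The import is guarded so the snippet runs without OCRmyPDF installed. A real plugin would also apply OCRmyPDF's `hookimpl` decorator, which is omitted here, and the validation rule itself is invented for illustration.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Import path as given in the migration notes; used for type hints only.
    from ocrmypdf._options import OcrOptions


def check_options(options: OcrOptions) -> None:
    """Validate options through attribute access only.

    Attribute access is what keeps older plugins working through duck
    typing, whether they receive a Namespace or an OcrOptions object.
    """
    # Hypothetical rule: this plugin insists on at least one language.
    if not options.languages:
        raise ValueError("at least one OCR language is required")
    # Hypothetical rule: this plugin needs output_type to be a string.
    if not isinstance(options.output_type, str):
        raise ValueError("output_type must be a string")


# Stand-in object for demonstration; a real run passes an OcrOptions instance.
demo_options = SimpleNamespace(languages=["eng", "deu"], output_type="auto")
check_options(demo_options)
```

Because the hook body touches only attributes, changing the annotation from `Namespace` to `OcrOptions` is the only edit most plugins of this shape should need.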