docs/releasenotes/version06.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

- `--user-words` regression
- `sandwich` renderer ({issue}`271`).
- `ocrmypdf-tess4` has been removed. The main Docker images, `ocrmypdf` and `ocrmypdf-polyglot`, now use Ubuntu 18.04 as a base image, so Tesseract 4.0.0-beta1 is now the Tesseract version they use. There is no longer a Docker image based on Tesseract 3.05.
- {issue}`262`, `--remove-background` error on PDFs that contained colormapped (paletted) images.
- {issue}`253`, a possible division by zero when using the `hocr` renderer.
- `<xmp:ModifyDate>` field inside XMP metadata for PDF/As. veraPDF flags this as a PDF/A validation failure. The error caused the timezone and the final digit of the seconds of the modification time to be omitted, so at worst the modification time stamp is rounded to the nearest 10 seconds.
- {issue}`248`, `--clean` argument may remove OCR from left column of text on certain documents. We now set `--layout none` to suppress this.
- `defusedxml` for safety.
- `--force-ocr`.
- {issue}`247`, `/CreationDate` metadata not copied from input to output.
- {issue}`239`.
- `defusedxml` dependency.
- `pip install ocrmypdf[fitz]` to use it to its full potential.
- `FileExistsError` that could occur if OCR timed out while it was generating the output file.
- ({issue}`218`)
- ({issue}`239`)
- `--skip-repair` to skip the initial PDF repair step if the PDF is already well-formed (because another program repaired it).

The software license has been changed to GPLv3 [it has since changed again]. Test resource files and some individual sources may have other licenses.
OCRmyPDF now depends on PyMuPDF. Including PyMuPDF is the primary reason for the change to GPLv3.

Other backward incompatible changes

- The `OCRMYPDF_TESSERACT`, `OCRMYPDF_QPDF`, `OCRMYPDF_GS` and `OCRMYPDF_UNPAPER` environment variables are no longer used. Change `PATH` if you need to override the external programs OCRmyPDF uses.
- The `ocrmypdf` package has been moved to `src/ocrmypdf` to avoid issues with accidental import.
- `ocrmypdf.exec.get_program` was removed.
- `ocrmypdf.pageinfo` was removed.
- The `--pdf-renderer tess4` alias for `sandwich` was removed.

Fixed an issue where OCRmyPDF failed to detect existing text on pages, depending on how the text and fonts were encoded within the PDF. ({issue}`233`, {issue}`232`)
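
For the `PATH`-based override of external programs mentioned in the backward incompatible changes, the following is a minimal sketch of how a wrapper script might put a custom tool directory ahead of the system one. The helper names `prepend_to_path` and `resolve_program` and the directory `/opt/tesseract-custom/bin` are hypothetical; only the Python standard library is used, and nothing here is OCRmyPDF's own API.

```python
import os
import shutil


def prepend_to_path(directory, env=None):
    """Return a copy of the environment with `directory` first on PATH."""
    env = dict(os.environ if env is None else env)
    current = env.get("PATH", "")
    env["PATH"] = directory + os.pathsep + current if current else directory
    return env


def resolve_program(name, env):
    """Look up `name` on the PATH stored in `env`; None if it is not found."""
    return shutil.which(name, path=env.get("PATH", ""))


# Hypothetical directory holding a custom Tesseract build.
custom_env = prepend_to_path("/opt/tesseract-custom/bin")
print(resolve_program("tesseract", custom_env))
```

Passing `custom_env` as the `env` argument of `subprocess.run` would make a child process resolve external programs through the modified `PATH`.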

Fixed an issue that caused dramatic inflation of file sizes when `--skip-text --output-type pdf` was used. OCRmyPDF now removes duplicate resources such as fonts, images and other objects that it generates. ({issue}`237`)
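
The general idea behind removing duplicate resources can be sketched as content-hash deduplication. This illustrates the technique only and is not OCRmyPDF's actual implementation, which the notes do not describe; the function `deduplicate` and the sample data are hypothetical, and real PDF objects would also need their references rewritten using the returned mapping.

```python
import hashlib


def deduplicate(objects):
    """Keep one copy of each byte-identical object.

    `objects` is a dict of {object_id: bytes}. Returns (kept, remap), where
    `kept` holds the first object seen for each distinct content and `remap`
    sends every original id to the id that was kept.
    """
    seen = {}   # content digest -> first object id with that content
    kept = {}
    remap = {}
    for obj_id, data in objects.items():
        digest = hashlib.sha256(data).digest()
        canonical = seen.setdefault(digest, obj_id)
        if canonical == obj_id:
            kept[obj_id] = data
        remap[obj_id] = canonical
    return kept, remap


# Objects 1 and 3 carry identical bytes, so only object 1 is kept.
resources = {1: b"font-A", 2: b"image-X", 3: b"font-A"}
kept, remap = deduplicate(resources)
print(sorted(kept))
print(remap)
```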

Improved performance of the initial page splitting step. Originally this step was not believed to be expensive and ran in a single process. Testing with large files revealed it to be a bottleneck, so it is now parallelized. On a 700-page file on a quad-core machine, this change saves about 2 minutes. ({issue}`234`)
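
As a rough illustration of fanning per-page work out to a pool of workers, here is a sketch using the standard library. It assumes a thread pool for portability, whereas CPU-bound work would more typically use processes; `process_page` is a hypothetical placeholder, and the notes do not describe the mechanism OCRmyPDF itself uses.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number):
    """Placeholder for per-page work; here it only builds a label."""
    return f"page-{page_number:04d}"


def split_pages(page_count, max_workers=4):
    """Run process_page for every page using a pool of workers.

    executor.map returns results in input order, regardless of which
    worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_page, range(1, page_count + 1)))


labels = split_pages(700)
print(len(labels), labels[0], labels[-1])
```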

The test suite now includes a cache that can be used to speed up test runs across platforms. This also does not require computing checksums, so it's faster. ({issue}`217`)