% SPDX-FileCopyrightText: 2022 James R. Barlow % SPDX-License-Identifier: CC-BY-SA-4.0

v5

v5.7.0

Fixed an issue that caused poor CPU utilization on machines with more than 4 cores when running Tesseract 4. (Related to {issue}217.)
The 'hocr' renderer has been improved. The 'sandwich' and 'tesseract' renderers are still better for most use cases, but 'hocr' may be useful for people who work with the PDF.js renderer in English/ASCII languages. ({issue}225)
- It now formats text in a matter that is easier for certain PDF viewers to select and extract copy and paste text. This should help macOS Preview and PDF.js in particular.
- The appearance of selected text and behavior of selecting text is improved.
- The PDF content stream now uses relative moves, making it more compact and easier for viewers to determine when two words on the same line.
- It can now deal with text on a skewed baseline.
- Thanks to @cforcey for the pull request, @jbreiden for many helpful suggestions, @ctbarbour for another round of improvements, and @acaloiaro for an independent review.

Fixed {issue}219: change how the final output file is created to avoid triggering permission errors when the output is a special file such as /dev/null
Fixed test suite failures due to a qpdf 8.0.0 regression and Python 3.5's handling of symlink
The "encrypted PDF" error message was different depending on the type of PDF encryption. Now a single clear message appears for all types of PDF encryption.
ocrmypdf is now in Homebrew. Homebrew users are advised to the version of ocrmypdf in the official homebrew-core formulas rather than the private tap.
Some linting

Fixed {issue}216: preserve "text as curves" PDFs without rasterizing file
Related to the above, messages about rasterizing are more consistent
For consistency versions minor releases will now get the trailing .0 they always should have had.

Add new argument --max-image-mpixels. Pillow 5.0 now raises an exception when images may be decompression bombs. This argument can be used to override the limit Pillow sets.
Fixed output page cropped when using the sandwich renderer and OCR is skipped on a rotated and image-processed page
A warning is now issued when old versions of Ghostscript are used in cases known to cause issues with non-Latin characters
Fixed a few parameter validation checks for -output-type pdfa-1 and pdfa-2

Fixed {issue}181: fix final merge failure for PDFs with more pages than the system file handle limit (ulimit -n)
Fixed {issue}200: an uncommon syntax for formatting decimal numbers in a PDF would cause qpdf to issue a warning, which ocrmypdf treated as an error. Now this the warning is relayed.
Fixed an issue where intermediate PDFs would be created at version 1.3 instead of the version of the original file. It's possible but unlikely this had side effects.
A warning is now issued when older versions of qpdf are used since issues like {issue}200 cause qpdf to infinite-loop
Address issue {issue}140: if Tesseract outputs invalid UTF-8, escape it and print its message instead of aborting with a Unicode error
Adding previously unlisted setup requirement, pytest-runner
Update documentation: fix an error in the example script for Synology with Docker images, improved security guidance, advised pip install --user

If a subprocess fails to report its version when queried, exit cleanly with an error instead of throwing an exception
Added test to confirm that the system locale is Unicode-aware and fail early if it's not
Clarified some copyright information
Updated pinned requirements.txt so the homebrew formula captures more recent versions

Fixed a regression from v5.4.1 that caused sidecar files to be created as empty files

Add workaround for Tesseract v4.00alpha crash when trying to obtain orientation and the latest language packs are installed

Change wording of a deprecation warning to improve clarity
Added option to generate PDF/A-1b output if desired (--output-type pdfa-1); default remains PDF/A-2b generation
Update documentation

Fixed missing error message that should occur when trying to force --pdf-renderer sandwich on old versions of Tesseract
Update copyright information in test files
Set system LANG to UTF-8 in Dockerfiles to avoid UTF-8 encoding errors

Added --user-words and --user-patterns arguments which are forwarded to Tesseract OCR as words and regular expressions respective to use to guide OCR. Supplying a list of subject-domain words should assist Tesseract with resolving words. ({issue}165)
Using a non Latin-1 language with the "hocr" renderer now warns about possible OCR quality and recommends workarounds ({issue}176)
Output file path added to error message when that location is not writable ({issue}175)
Otherwise valid PDFs with leading whitespace at the beginning of the file are now accepted

When using Tesseract 3.05.01 or newer, OCRmyPDF will select the "sandwich" PDF renderer by default, unless another PDF renderer is specified with the --pdf-renderer argument. The previous behavior was to select --pdf-renderer=hocr.
The "tesseract" PDF renderer is now deprecated, since it can cause problems with Ghostscript on Tesseract 3.05.00
The "tess4" PDF renderer has been renamed to "sandwich". "tess4" is now a deprecated alias for "sandwich".

Files with pages larger than 200" (5080 mm) in either dimension are now supported with --output-type=pdf with the page size preserved (in the PDF specification this feature is called UserUnit scaling). Due to Ghostscript limitations this is not available in conjunction with PDF/A output.

Fixed {issue}169, exception due to failure to create sidecar text files on some versions of Tesseract 3.04, including the jbarlow83/ocrmypdf Docker image

Backward incompatible changes
- Support for Python 3.4 dropped. Python 3.5 is now required.
- Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or newer is required. Tesseract 4.00 (alpha) is supported.
- The OCRmyPDF.sh script was removed.
Add a new feature, --sidecar, which allows creating "sidecar" text files which contain the OCR results in plain text. These OCR text is more reliable than extracting text from PDFs. Closes {issue}126.
New feature: --pdfa-image-compression, which allows overriding Ghostscript's lossy-or-lossless image encoding heuristic and making all images JPEG encoded or lossless encoded as desired. Fixes {issue}163.
Fixed {issue}143, added --quiet to suppress "INFO" messages
Fixed {issue}164, a typo
Removed the command line parameters -n and --just-print since they have not worked for some time (reported as Ubuntu bug #1687308)