docs/releasenotes/version05.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

- Fixed an issue that caused poor CPU utilization on machines with more
  than 4 cores when running Tesseract 4. (Related to {issue}217.)
- The 'hocr' renderer has been improved. The 'sandwich' and 'tesseract'
  renderers are still better for most use cases, but 'hocr' may be
  useful for people who work with the PDF.js renderer in English/ASCII
  languages. ({issue}225)
- {issue}219: change how the final output file is created to avoid
  triggering permission errors when the output is a special file such as
  /dev/null
- {issue}216: preserve "text as curves" PDFs without rasterizing file
- New argument --max-image-mpixels. Pillow 5.0 now raises an exception
  when images may be decompression bombs. This argument can be used to
  override the limit Pillow sets.
- --output-type pdfa-1 and pdfa-2
- {issue}181: fix final merge failure for PDFs with more pages than the
  system file handle limit (ulimit -n)
- {issue}200: an uncommon syntax for formatting decimal numbers in a PDF
  would cause qpdf to issue a warning, which ocrmypdf treated as an
  error. Now the warning is relayed.
- {issue}200 cause qpdf to infinite-loop
- {issue}140: if Tesseract outputs invalid UTF-8, escape it and print its
  message instead of aborting with a Unicode error
- pip install --user
- --output-type pdfa-1); default remains PDF/A-2b generation
- --pdf-renderer sandwich on old versions of Tesseract
- LANG to UTF-8 in Dockerfiles to avoid UTF-8 encoding errors
- New --user-words and --user-patterns arguments, which are forwarded to
  Tesseract OCR as words and regular expressions, respectively, to guide
  OCR. Supplying a list of subject-domain words should assist Tesseract
  with resolving words. ({issue}165)
- ({issue}176)
- ({issue}175)
- --pdf-renderer argument. The previous behavior was to select
  --pdf-renderer=hocr.
- --output-type=pdf with the page size preserved (in the PDF
  specification this feature is called UserUnit scaling). Due to
  Ghostscript limitations this is not available in conjunction with
  PDF/A output.
- {issue}169, exception due to failure to create sidecar text files on
  some versions of Tesseract 3.04, including the jbarlow83/ocrmypdf
  Docker image

Backward incompatible changes
- Support for Python 3.4 dropped. Python 3.5 is now required.
- Support for Tesseract 3.02 and 3.03 dropped. Tesseract 3.04 or newer is required. Tesseract 4.00 (alpha) is supported.
- The OCRmyPDF.sh script was removed.
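
The fix for issue 140 listed above escapes invalid UTF-8 from Tesseract instead of aborting with a Unicode error. The general technique can be sketched in Python as follows. This is a minimal illustration under assumptions, not OCRmyPDF's actual code: the function name `decode_tool_output` and the sample message are hypothetical, and the `backslashreplace` error handler is simply one standard way to make undecodable bytes visible.

```python
def decode_tool_output(raw: bytes) -> str:
    """Decode subprocess output as UTF-8, escaping bytes that are not valid.

    Invalid bytes are rendered as visible escapes (for example \\xff)
    instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    # Hypothetical message containing one byte that is not valid UTF-8.
    print(decode_tool_output(b"Tesseract says: caf\xe9 ok"))
```

With this approach the message is still printed and the offending byte stays identifiable, which is usually more helpful in a log than a replacement character.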

Add a new feature, --sidecar, which allows creating "sidecar"
text files which contain the OCR results in plain text. This OCR
text is more reliable than extracting text from PDFs. Closes
{issue}126.

New feature: --pdfa-image-compression, which allows overriding
Ghostscript's lossy-or-lossless image encoding heuristic and making
all images JPEG encoded or lossless encoded as desired. Fixes
{issue}163.

Fixed {issue}143, added --quiet to suppress "INFO" messages

Fixed {issue}164, a typo

Removed the command line parameters -n and --just-print since
they have not worked for some time (reported as Ubuntu bug
#1687308)
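
As a rough illustration of how the --sidecar, --pdfa-image-compression and --quiet options described above might be combined, the sketch below assembles a command line in Python without running it. The helper name `build_ocrmypdf_args`, the file names, and the value `lossless` are assumptions made for this example; consult `ocrmypdf --help` for the values your installed version accepts.

```python
from typing import List, Optional


def build_ocrmypdf_args(
    input_pdf: str,
    output_pdf: str,
    sidecar: Optional[str] = None,
    pdfa_image_compression: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Assemble an ocrmypdf argument list; nothing is executed here."""
    args = ["ocrmypdf"]
    if sidecar is not None:
        # Plain-text OCR results are written to this file.
        args += ["--sidecar", sidecar]
    if pdfa_image_compression is not None:
        # Overrides Ghostscript's image encoding heuristic (value assumed).
        args += ["--pdfa-image-compression", pdfa_image_compression]
    if quiet:
        # Suppresses "INFO" messages.
        args.append("--quiet")
    args += [input_pdf, output_pdf]
    return args


if __name__ == "__main__":
    command = build_ocrmypdf_args(
        "input.pdf",
        "output.pdf",
        sidecar="output.txt",
        pdfa_image_compression="lossless",
        quiet=True,
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where ocrmypdf is installed.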