% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
--force-ocr may now be used with the new --threshold and --mask-barcodes features.
({issue}326)
({issue}325)
--mask-barcodes feature and improved argument checking
Added a new feature --redo-ocr to detect existing OCR in a file,
remove it, and redo the OCR. This may be particularly helpful for
anyone who wants to take advantage of OCR quality improvements in
Tesseract 4.0. Note that OCR added by OCRmyPDF before version 3.0
cannot be detected since it was not properly marked as invisible text
in the earliest versions. OCR that constructs a font from visible
text, such as Adobe Acrobat's ClearScan, also cannot be detected.
OCRmyPDF's content detection is generally more sophisticated. It learns more about the contents of each PDF and makes better recommendations:
--force-ocr to make the text searchable.
Added three new experimental features to improve OCR quality in certain conditions. The name, syntax and behavior of these arguments are subject to change. They may also be incompatible with some other features.
--remove-vectors, which strips out vector graphics. This can improve OCR quality since OCR will not search artwork for readable text; however, it currently removes "text as curves" as well.
--mask-barcodes, to detect and suppress barcodes in files. We have observed that barcodes can interfere with OCR because they are "text-like" but not actually textual.
--threshold, which uses a more sophisticated thresholding algorithm than is currently in use in Tesseract OCR. This works around a known issue in Tesseract 4.0 with dark text on bright backgrounds.
Fixed an issue where an error message was not reported when the installed Ghostscript was very old.
The PDF optimizer now saves files with object streams enabled when
the optimization level is --optimize 1 or higher (the default).
This makes files a little bit smaller, but requires PDF 1.5. PDF 1.5
was first released in 2003 and is broadly supported by PDF viewers,
but some rudimentary PDF parsers such as PyPDF2 do not understand
object streams. You can use the command line tool
qpdf --object-streams=disable or
the pikepdf library to remove
them.
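As a rough sketch of the qpdf route, the following Python helper assembles the command line and runs it with subprocess. The function names build_qpdf_command and disable_object_streams are hypothetical, as are any file names passed to them; only the --object-streams=disable option comes from the text above, and the sketch assumes the qpdf executable is on the PATH.

```python
import subprocess


def build_qpdf_command(src, dst):
    """Return the argv list that rewrites src to dst without object streams."""
    # Hypothetical helper: only the --object-streams=disable option is taken
    # from the release notes; the input/output order follows qpdf's usual CLI.
    return ["qpdf", "--object-streams=disable", str(src), str(dst)]


def disable_object_streams(src, dst):
    """Run qpdf on src, writing dst; raises CalledProcessError on a nonzero exit."""
    subprocess.run(build_qpdf_command(src, dst), check=True)
```

Keeping the command construction separate from the call makes it easy to inspect or log the exact command before anything is executed.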
New dependency: pdfminer.six 20181108. Note this is a fork of the Python 2-only pdfminer.
Deprecation notice: At the end of 2018, we will be ending support for Python 3.5 and Tesseract 3.x. OCRmyPDF v7 will continue to work with older versions.
Lossy JBIG2 behavior change
A user reported that ocrmypdf was in fact using JBIG2 in lossy compression mode. This was not the intended behavior. Users should review the technical concerns with JBIG2 in lossy mode and decide if this is a concern for their use case.
JBIG2 lossy mode does achieve higher compression ratios than any other monochrome compression technology; for large text documents the savings are considerable. JBIG2 lossless still gives great compression ratios and is a major improvement over the older CCITT G4 standard.
Only users who have reviewed the concerns with JBIG2 in lossy mode
should opt in. As such, lossy mode JBIG2 is only turned on when the new
argument --jbig2-lossy is issued. This is independent of the setting
for --optimize.
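To illustrate that the two settings are independent, here is a minimal sketch of a wrapper that assembles an ocrmypdf command line. The helper name build_ocrmypdf_command and its parameters are hypothetical; the --optimize and --jbig2-lossy arguments and the 0 to 3 level range are the ones described in these notes.

```python
def build_ocrmypdf_command(src, dst, optimize=1, jbig2_lossy=False):
    """Assemble an ocrmypdf argv list; lossy JBIG2 is strictly opt-in."""
    # Hypothetical wrapper: levels 0 through 3 mirror -O0 through -O3.
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be between 0 and 3")
    args = ["ocrmypdf", "--optimize", str(optimize)]
    if jbig2_lossy:
        # Added only on explicit request, whatever the optimization level.
        args.append("--jbig2-lossy")
    args += [str(src), str(dst)]
    return args
```

With the defaults, the returned list never contains --jbig2-lossy, even at --optimize 3; the caller has to pass jbig2_lossy=True deliberately.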
Users who did not install an optional JBIG2 encoder are unaffected.
(Thanks to user 'bsdice' for reporting this issue.)
Other issues
({issue}297)
({issue}299)
{issue}231, a problem with JPEG2000 images where image metadata was only available inside the JPEG2000 file.
({issue}301)
-O2
{issue}285.
{issue}284, an error when parsing inline images that are also image masks, by upgrading pikepdf to 0.3.1
--rotate-pages on pages that already had rotations applied. ({issue}279)
({issue}281)
The core algorithm for combining OCR layers with existing PDF pages has been rewritten and improved considerably. PDFs are no longer split into single page PDFs for processing; instead, images are rendered and the OCR results are grafted onto the input PDF. The new algorithm uses less temporary disk space and is much more performant, especially for large files.
New dependency: pikepdf. pikepdf is a powerful new Python PDF library driving the latest OCRmyPDF features, built on the QPDF C++ library (libqpdf).
New feature: PDF optimization with -O or --optimize. After
OCR, OCRmyPDF will perform image optimizations relevant to OCR PDFs.
If pngquant is installed, OCRmyPDF will optionally use it to perform lossy quantization and compression of PNG images.
Optimization levels range from -O0 through -O3, where 0 disables optimization and 3 implements all options. 1, the default, performs only safe and lossless optimizations. (This is similar to GCC's optimization parameter.) The exact type of optimizations performed will vary over time.
Small amounts of text in the margins of a page, such as watermarks,
page numbers, or digital stamps, will no longer prevent the rest of a
page from being OCRed when --skip-text is issued. This behavior
is based on a heuristic.
Removed features
The --pdf-renderer tesseract PDF renderer was removed.
-g, the option to generate debug text pages, was removed because it was a maintenance burden and only worked in isolated cases. HOCR pages can still be previewed by running hocrtransform.py with appropriate settings.
Removed dependencies
PyPDF2
defusedxml
PyMuPDF
The sandwich PDF renderer can be used with all supported versions of Tesseract, including those prior to v3.05, which don't support -c textonly. (Tesseract v4.0.0 is recommended and more efficient.)
The --pdf-renderer auto option and the diagnostics used to select a
PDF renderer now work better with old versions, but may make
different decisions than past versions.
If everything succeeds but PDF/A conversion fails, a distinct return
code is now returned (ExitCode.pdfa_conversion_failed (10)) where
this situation previously returned
ExitCode.invalid_output_pdf (4). The latter is now returned only
if there is some indication that the output file is invalid.
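A calling script can use the new return code to tell a PDF/A conversion failure apart from a possibly invalid output file. The sketch below is an illustration only: the local ExitCode enum copies the two values quoted above, ok = 0 is an assumption based on the usual process convention, and describe_exit and run_ocrmypdf are hypothetical helper names.

```python
import subprocess
from enum import IntEnum


class ExitCode(IntEnum):
    # 4 and 10 are quoted from the release notes; ok = 0 is assumed here.
    ok = 0
    invalid_output_pdf = 4
    pdfa_conversion_failed = 10


def describe_exit(returncode):
    """Map a process return code to a short human-readable verdict."""
    if returncode == ExitCode.ok:
        return "success"
    if returncode == ExitCode.pdfa_conversion_failed:
        return "OCR succeeded, but PDF/A conversion failed"
    if returncode == ExitCode.invalid_output_pdf:
        return "output file may be invalid"
    return f"other failure (exit code {returncode})"


def run_ocrmypdf(src, dst):
    """Run ocrmypdf (assumed to be on the PATH) and describe the outcome."""
    result = subprocess.run(["ocrmypdf", str(src), str(dst)])
    return describe_exit(result.returncode)
```

A batch job might, for example, keep the output when the verdict is a PDF/A conversion failure but discard it when the output may be invalid.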
Notes for downstream packagers
python-xmp-toolkit which in turn depends on libexempi3.
pip install pycparser to avoid another Python 3.7 issue.