docs/releasenotes/version04.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
156, 'NoneType' object has no attribute 'getObject' on pages with no
optional /Contents record. This should resolve all issues related to
pages with no /Contents record.
158, ocrmypdf
now stops and terminates if Ghostscript fails on an intermediate
step, as it is not possible to proceed.
160,
exception thrown on certain invalid arguments instead of error
message
154, KeyError
'/Contents' when searching for text on blank pages that have no
/Contents record. Note: incomplete fix for this issue.
--skip-big raising an exception if a page contains no images
({issue}152) (thanks
to @TomRaz)
151)
tess4 renderer would duplicate content
onto output pages if tesseract failed or timed out
tess4 renderer not recognized when lossless reconstruction
is possible
147,
--pdf-renderer tess4 --clean will produce an oversized page
containing the original image in the bottom left corner, due to loss
of DPI information.
137,
proportions of images with a non-square pixel aspect ratio would be
distorted in output for --force-ocr and some other combinations
of flags
134; PDF
reference manual 8.10), and images they contain are taken into
account when determining the resolution for rasterizing
--pdf-renderer tesseract with
Tesseract 3.04 or lower due to issues with Ghostscript corrupting the
OCR text in these cases
The Docker images (ocrmypdf, ocrmypdf-polyglot, ocrmypdf-tess4) are now based on Ubuntu 16.10 instead of Debian stretch
OCRmyPDF now prevents running the Tesseract 4 renderer with Tesseract 3.04, which was permitted in v4.4 and v4.4.1 but will not work
--tesseract-config
feature
--tesseract-config
Tesseract 4.00 is now supported on an experimental basis.
--pdf-renderer tess4 exploits Tesseract
4's new text-only output PDF mode. See the documentation on PDF
Renderers for details.
--tesseract-oem argument allows control over the Tesseract
4 OCR engine mode (tesseract's --oem). Use
--tesseract-oem 2 to enforce the new LSTM mode.
Fixed an issue that caused corruption of output to stdout in some cases
Removed test for Pillow JPEG and PNG support, as the minimum supported version of Pillow now enforces this
OCRmyPDF now tests that the intended destination file is writable before proceeding
The test suite now requires pytest-helpers-namespace to run (but
not install)
Significant code reorganization to make OCRmyPDF re-entrant and improve performance. All changes should be backward compatible for the v4.x series.
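
The destination check mentioned above (testing that the output file is writable before doing any work) can be sketched roughly as follows. This is only an illustration, not OCRmyPDF's actual code: the function name `destination_is_writable` is hypothetical, and the permission test relies on `os.access`, which reports what the current user may do at the moment of the call.

```python
import os
from pathlib import Path


def destination_is_writable(output_file: str) -> bool:
    """Return True if output_file looks writable (hypothetical helper)."""
    path = Path(output_file)
    if path.exists():
        # An existing target must be a regular file we may overwrite.
        return path.is_file() and os.access(path, os.W_OK)
    # A new target needs an existing, writable parent directory.
    parent = path.parent
    return parent.is_dir() and os.access(parent, os.W_OK)
```

Checking up front lets a long OCR job fail immediately instead of after all pages have been processed. The check is advisory, since permissions can still change before the file is finally written.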
--deskew or --remove-background or other Leptonica
based image processing features were in use, depending on the system
value of ulimit -n
New feature --remove-background to detect and erase the
background of color and grayscale images
Better documentation
Fixed an issue with PDFs that draw images when the raster stack depth is zero
ocrmypdf can now redirect its output to stdout for use in a shell pipeline
100) with
PDFs that omit the optional /BitsPerComponent parameter on images
90) caused by
PDFs that use stencil masks properly
Fixed an issue with PDFs that store page rotation (/Rotate) in an indirect object
Integrated a few fixes to simplify downstream packaging (Debian)
Added a test case to check explicit masks and stencil masks
Added a test case for indirect objects and linearized PDFs
Deprecated the OCRmyPDF.sh shell script
ocrmypdf will now try to convert single image files to PDFs if they
are provided as input
({issue}15)
img2pdf
(one of ocrmypdf's dependencies)
New argument --output-type {pdf|pdfa} allows disabling
Ghostscript PDF/A generation
pdfa is the default, consistent with past behavior
pdf provides a workaround for users concerned about the
increase in file size from Ghostscript forcing JBIG2 images to
CCITT and transcoding JPEGs
pdf preserves as much as it can about the original file,
including problems that PDF/A conversion fixes
PDFs containing images with "non-square" pixel aspect ratios, such as 200x100 DPI, are now handled and converted properly (fixing a bug that caused them to be cropped)
--force-ocr rasterizes pages even if they contain no images
82
Fixes an issue where, with certain settings, monochrome images in
PDFs would be converted to 8-bit grayscale, increasing file size
({issue}79)
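
To illustrate why images with non-square pixels (such as the 200x100 DPI case noted above) need separate handling, here is a small arithmetic sketch. It is not taken from OCRmyPDF; the function name `page_size_points` and the sample pixel dimensions are assumptions chosen for illustration. The only fact relied on is that a PDF point is 1/72 inch.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF user space unit: 1 point = 1/72 inch


def page_size_points(
    width_px: int, height_px: int, xres_dpi: float, yres_dpi: float
) -> Tuple[float, float]:
    """Convert pixel dimensions to a page size in points.

    Each axis uses its own resolution, so non-square pixels keep
    their physical proportions.
    """
    width_pt = width_px / xres_dpi * POINTS_PER_INCH
    height_pt = height_px / yres_dpi * POINTS_PER_INCH
    return width_pt, height_pt


# Hypothetical scan: 1700 x 1100 pixels at 200 DPI horizontal, 100 DPI vertical.
correct = page_size_points(1700, 1100, 200, 100)
# Wrongly assuming square pixels (200 DPI on both axes) halves the height.
distorted = page_size_points(1700, 1100, 200, 200)
```

With per-axis resolution the hypothetical scan maps to a US Letter page (8.5 x 11 inches); applying one resolution to both axes produces a page with the wrong height.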
Support for Ubuntu 12.04 LTS "precise" has been dropped in favor of (roughly) Ubuntu 14.04 LTS "trusty"
Support for some older dependencies dropped
Ghostscript now runs in "safer" mode where possible
--rotate-pages now only rotates pages when reasonably confident
in the orientation. This behavior can be adjusted with the new
argument --rotate-pages-threshold
unpaper is uninstalled or
missing at run-time
Released with verbose debug message turned on. Do not use. Skip to v4.0.5.
New features
Fixes
Fixes
Fixes
New features
-r) is now available. It ignores
any prior rotation information on PDFs and sets rotation based on the
dominant orientation of detectable text. This feature is fairly
reliable, but some false positives occur, especially if there is not
much text to work with.
({issue}4)
Fixes
49)
Changes
--deskew is now performed by Leptonica instead of unpaper
({issue}25)
--pdf-renderer=tesseract now displays a warning if the Tesseract
version is less than 3.04.01, the planned release that will include
fixes to an important OCR text rendering bug in Tesseract 3.04.00.
You can also manually install ./share/sharp2.ttf on top of pdf.ttf in
your Tesseract tessdata folder to correct the problem.
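
The version test behind the warning above (Tesseract older than 3.04.01) might look something like the sketch below. This is an assumption-laden illustration rather than OCRmyPDF's implementation: `parse_version` and `needs_renderer_warning` are hypothetical names, and the parsing simply compares the numeric components of a version string.

```python
import re
from typing import Tuple

# First release expected to contain the text rendering fix, per the note above.
MINIMUM_FIXED_VERSION = (3, 4, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a string such as '3.04.00' into (3, 4, 0).

    Anything after a '-' (for example a '-dev' suffix) is ignored.
    """
    numbers = re.findall(r"\d+", text.split("-")[0])
    return tuple(int(n) for n in numbers)


def needs_renderer_warning(tesseract_version: str) -> bool:
    """Return True when the version predates the fixed release."""
    return parse_version(tesseract_version) < MINIMUM_FIXED_VERSION
```

Comparing integer tuples avoids the pitfalls of comparing version strings directly, where leading zeros and differing component counts give misleading results.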