% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0

v13

v13.7.0

  • Fixed an exception raised when attempting to run while Tesseract is not installed.
  • Changed to SPDX license tracking and information files.

v13.6.2

  • Added a shim to prevent an "error during error handling" for Python 3.7 and 3.8.
  • Modernized some type annotations.
  • Improved annotations on our _windows module to help IDEs and mypy figure out what we're doing.

v13.6.1

  • Require setuptools-scm 7.0.5 to avoid possible issues with source distributions in earlier versions of setuptools-scm.
  • Suppress a spurious warning, improve tests, improve typing and other miscellany.

v13.6.0

  • Added a new initialize plugin hook, making it possible to suppress built-in plugins more easily, among other possibilities.
  • Fixed an issue where unpaper would exit with a "wrong stream" error, probably related to images with an odd integer width. {issue}887, 665

v13.5.0

  • Added a new optimize_pdf plugin hook, making it possible to create plugins that replace or enhance OCRmyPDF's PDF optimizer.
  • Removed all maximum version restrictions. Our new policy is to block only known-bad releases of dependencies.
  • The naming scheme for the object that holds all OCR text that OCRmyPDF inserts has changed. This has always been an implementation detail (and remains so), but someone may have been relying on it and would appreciate the heads-up.
  • Cleanup.

v13.4.7

  • Fixed PermissionError when cleaning up temporary files in rare cases. {issue}974
  • Fixed PermissionError when calling os.nice on platforms that lack it. {issue}973
  • Suppressed some warnings from libxmp during tests.

v13.4.6

  • Convert error on corrupt ICC profiles into a warning. Thanks to @oscherler.

v13.4.5

  • Remove upper bound on pdfminer.six version.
  • Documentation.

v13.4.4

  • Updated pdfminer.six version.
  • Docker image changed to Ubuntu 22.04 now that it is released and provides the dependencies we need. This seems more consistent than our recent change to Debian.

v13.4.3

  • Fix error on pytest.skip() with older versions of pytest.
  • Documentation updates.

v13.4.2

  • Worked around a major regression in Ghostscript 9.56.0 where all OCR text is stripped out of the PDF. It simply removes all text, even text generated by software other than OCRmyPDF. Fortunately, we can ask Ghostscript 9.56.0 to use its old behavior, which worked correctly for our purposes. Users must avoid the combination (Ghostscript 9.56.0, ocrmypdf <13.4.2), since older versions of OCRmyPDF have no way of detecting that this particular version of Ghostscript removes all OCR text.
  • Marked pdfminer 20220319 as supported.
  • Fixed some deprecation warnings from recent versions of Pillow and pytest.
  • Test suite now covers Python 3.10 (Python 3.10 worked fine before, but was not being tested).
  • Docker image now uses debian:bookworm-slim as the base image to fix the Docker image build.
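
The Ghostscript 9.56.0 workaround above amounts to a version check that adds one extra argument. A minimal sketch of that idea follows; it is not OCRmyPDF's actual code. The helper names `parse_version` and `extra_ghostscript_args` are hypothetical, and the switch `-dNEWPDF=false` is an assumption about how the old behavior is requested, since these notes do not name the option.

```python
def parse_version(text):
    """Turn a dotted version string such as '9.56.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def extra_ghostscript_args(gs_version):
    """Return extra arguments needed for a specific Ghostscript version.

    Assumption: '-dNEWPDF=false' is the switch that asks Ghostscript to use
    its older PDF interpreter. Verify against the Ghostscript documentation.
    """
    if parse_version(gs_version) == (9, 56, 0):
        return ["-dNEWPDF=false"]
    return []
```

Comparing parsed tuples rather than raw strings keeps the check exact, so neighboring releases such as 9.55.0 or 9.56.1 are left alone.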

v13.4.1

  • Temporarily make threads rather than processes the default executor worker, due to a persistent deadlock issue when processes are used. Add a new command line argument --no-use-threads to disable this.
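
As a rough illustration of switching the default worker type, the sketch below picks an executor class from a boolean flag. The names `pick_executor` and `run_pages` are hypothetical and do not come from OCRmyPDF; only the standard library's `concurrent.futures` classes are real.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def pick_executor(use_threads=True):
    """Return the executor class: threads by default, processes on request."""
    return ThreadPoolExecutor if use_threads else ProcessPoolExecutor


def run_pages(pages, work, use_threads=True, max_workers=2):
    """Apply `work` to every page and return the results in input order."""
    executor_class = pick_executor(use_threads)
    with executor_class(max_workers=max_workers) as executor:
        return list(executor.map(work, pages))
```

In this sketch, passing `use_threads=False` plays the role of `--no-use-threads`. With processes, `work` must be picklable, which rules out lambdas and locally defined functions.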

v13.4.0

  • Fixed test failures when using pikepdf 5.0.0.
  • Various improvements to the optimizer. In particular, we now recognize PDF images that are encoded with both deflate (PNG) and DCT (JPEG), and also produce PDFs with images compressed with both deflate and DCT, since this often yields file size improvements compared to plain DCT.

v13.3.0

  • Made the harmless but "scary" exception shown after failing to optimize an image less scary.
  • Added a warning if a page image is too large for unpaper to clean. The image is passed through without cleaning. This is due to a hard-coded limitation in a C library used by unpaper so it cannot be rectified easily.
  • We now use better default settings when calling img2pdf.
  • We no longer try to optimize images that we failed to save in certain situations.
  • We now account for some differences in text output from Tesseract 5 compared to Tesseract 4.
  • Better handling of Ghostscript producing empty images when attempting to rasterize page images.

v13.2.0

  • Removed all runtime uses of distutils since it is deprecated in the standard library. We previously used distutils.version to examine version numbers of dependencies at run time, and now use packaging.version for this. This is a new dependency.
  • Fixed an issue where the error message advising the user that Ghostscript is not installed was suppressed when that condition actually occurred.
  • Fixed an issue with incorrect page number and totals being displayed in the progress bar. This was purely a display/presentation issue. {issue}876.
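
For readers unfamiliar with packaging.version, here is a small sketch of a run-time version check. The helper name `meets_minimum` is hypothetical and is not part of OCRmyPDF; `Version` is the real class from the third-party `packaging` library, which must be installed.

```python
from packaging.version import Version


def meets_minimum(found, required):
    """Return True if the version string `found` is at least `required`."""
    return Version(found) >= Version(required)
```

A proper version parser is needed because plain string comparison gets multi-digit components wrong: as strings, "9.56.0" sorts after "10.0.0".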

v13.1.1

  • Fixed issue with attempting to deskew a blank page on Tesseract 5. {issue}868.

v13.1.0

  • Changed to using Python concurrent.futures-based parallel execution instead of pools, since futures have now exceeded pools in features.
  • If a child worker is terminated (perhaps by the operating system or the user killing it in a task manager), the parallel task will fail with an error message. Previously, the main ocrmypdf process would "hang" indefinitely, waiting for the child to report.
  • Added new argument --tesseract-thresholding to provide control over Tesseract 5's threshold parameter.
  • Documentation updates and changes. Better documentation for --output-type none, added a few releases ago. Removed some obsolete documentation.
  • Improved bash completions - thanks to @FPille.
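
The terminated-worker behavior described above relies on a feature of concurrent.futures: when a worker process dies, pending futures raise `BrokenProcessPool` instead of waiting forever. The sketch below shows one way to turn that into a clear error. The function name `run_all` and the error message are hypothetical, and this is not OCRmyPDF's implementation.

```python
from concurrent.futures.process import BrokenProcessPool


def run_all(executor, func, items):
    """Submit func(item) for each item and collect results in order.

    If a worker process was killed, the affected future raises
    BrokenProcessPool; report that as an error instead of hanging.
    """
    futures = [executor.submit(func, item) for item in items]
    results = []
    try:
        for future in futures:
            results.append(future.result())
    except BrokenProcessPool as exc:
        raise RuntimeError("A worker process was terminated unexpectedly") from exc
    return results
```

The function accepts any executor, so the same code path works with `ThreadPoolExecutor`, where this failure mode does not occur.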

v13.0.0

Breaking changes

  • The deprecated module ocrmypdf.leptonica has been removed.
  • We no longer depend on Leptonica (liblept) or CFFI (libffi, python3-cffi). (Note that Tesseract still requires Leptonica; OCRmyPDF no longer directly uses this library.)
  • The argument --remove-background is temporarily disabled while we search for an alternative to the Leptonica implementation of this feature.
  • The --threshold argument has been removed, since this also depended on Leptonica. Tesseract 5.x has implemented improvements to thresholding, so this feature will be redundant anyway.
  • The --deskew angle was previously calculated by a Leptonica algorithm. We now use a feature of Tesseract to find the appropriate angle to deskew a page. The deskew angle according to Tesseract may differ from Leptonica's algorithm. At least in theory, Tesseract's deskew angle is informed by a more complex analysis than Leptonica's, so this should improve results in general. We also use Pillow to perform the deskewing, which may affect the appearance of the image compared to Leptonica.
  • Support for Python 3.6 was dropped, since that Python release is approaching end of life.
  • We now require pikepdf 4.0 or newer. This, in turn, means that OCRmyPDF requires a system compatible with the manylinux2014 specification. This change was "forced" by Pillow not releasing manylinux2010 wheels anymore.
  • We no longer provide requirements.txt-style files. Use pip install ocrmypdf[...] instead.
  • Bumped required versions of several libraries.

Fixes

  • Fixed an issue where OCRmyPDF failed to find Ghostscript on Windows even when installed, and would exit with an error.
  • By removing Leptonica, we fixed all issues related to Leptonica on Apple Silicon or Leptonica failing to import on Windows.