Back to Ocrmypdf

v3

docs/releasenotes/version03.md

17.4.26.6 KB
Original Source

% SPDX-FileCopyrightText: 2022 James R. Barlow % SPDX-License-Identifier: CC-BY-SA-4.0

v3

v3.2.1

Changes

  • Fixed {issue}47 "convert() got and unexpected keyword argument 'dpi'" by upgrading to img2pdf 0.2
  • Tweaked the Dockerfiles

v3.2

New features

  • Lossless reconstruction: when possible, OCRmyPDF will inject text layers without otherwise manipulating the content and layout of a PDF page. For example, a PDF containing a mix of vector and raster content would see the vector content preserved. Images may still be transcoded during PDF/A conversion. (--deskew and --clean-final disable this mode, necessarily.)
  • New argument --tesseract-pagesegmode allows you to pass page segmentation arguments to Tesseract OCR. This helps for two column text and other situations that confuse Tesseract.
  • Added a new "polyglot" version of the Docker image, that generates Tesseract with all languages packs installed, for the polyglots among us. It is much larger.

Changes

  • JPEG transcoding quality is now 95 instead of the default 75. Bigger file sizes for less degradation.

v3.1.1

Changes

  • Fixed bug that caused incorrect page size and DPI calculations on documents with mixed page sizes

v3.1

Changes

  • Default output format is now PDF/A-2b instead of PDF/A-1b
  • Python 3.5 and macOS El Capitan are now supported platforms - no changes were needed to implement support
  • Improved some error messages related to missing input files
  • Fixed {issue}20: uppercase .PDF extension not accepted
  • Fixed an issue where OCRmyPDF failed to text that certain pages contained previously OCR'ed text, such as OCR text produced by Tesseract 3.04
  • Inserts /Creator tag into PDFs so that errors can be traced back to this project
  • Added new option --pdf-renderer=auto, to let OCRmyPDF pick the best PDF renderer. Currently it always chooses the 'hocrtransform' renderer but that behavior may change.
  • Set up Travis CI automatic integration testing

v3.0

New features

  • Easier installation with a Docker container or Python's pip package manager
  • Eliminated many external dependencies, so it's easier to setup
  • Now installs ocrmypdf to /usr/local/bin or equivalent for system-wide access and easier typing
  • Improved command line syntax and usage help (--help)
  • Tesseract 3.03+ PDF page rendering can be used instead for better positioning of recognized text (--pdf-renderer tesseract)
  • PDF metadata (title, author, keywords) are now transferred to the output PDF
  • PDF metadata can also be set from the command line (--title, etc.)
  • Automatic repairs malformed input PDFs if possible
  • Added test cases to confirm everything is working
  • Added option to skip extremely large pages that take too long to OCR and are often not OCRable (e.g. large scanned maps or diagrams); other pages are still processed (--skip-big)
  • Added option to kill Tesseract OCR process if it seems to be taking too long on a page, while still processing other pages (--tesseract-timeout)
  • Less common colorspaces (CMYK, palette) are now supported by conversion to RGB
  • Multiple images on the same PDF page are now supported

Changes

  • New, robust rewrite in Python 3.4+ with ruffus pipelines

  • Now uses Ghostscript 9.14's improved color conversion model to preserve PDF colors

  • OCR text is now rendered in the PDF as invisible text. Previous versions of OCRmyPDF incorrectly rendered visible text with an image on top.

  • All "tasks" in the pipeline can be executed in parallel on any available CPUs, increasing performance

  • The -o DPI argument has been phased out, in favor of --oversample DPI, in case we need -o OUTPUTFILE in the future

  • Removed several dependencies, so it's easier to install. We no longer use:

  • Some new external dependencies are required or optional, compared to v2.x:

    • Ghostscript 9.14+
    • qpdf 5.0.0+
    • Unpaper 6.1 (optional)
    • some automatically managed Python packages

Release candidates^

  • rc9:

    • Fix {issue}118: report error if ghostscript iccprofiles are missing
    • fixed another issue related to {issue}111: PDF rasterized to palette file
    • add support image files with a palette
    • don't try to validate PDF file after an exception occurs
  • rc8:

    • Fix {issue}111: exception thrown if PDF is missing DocumentInfo dictionary
  • rc7:

    • fix error when installing direct from pip, "no such file 'requirements.txt'"
  • rc6:

    • dropped libxml2 (Python lxml) since Python 3's internal XML parser is sufficient
    • set up Docker container
    • fix Unicode errors if recognized text contains Unicode characters and system locale is not UTF-8
  • rc5:

    • dropped Java and JHOVE in favour of qpdf
    • improved command line error output
    • additional tests and bug fixes
    • tested on Ubuntu 14.04 LTS
  • rc4:

    • dropped MuPDF in favour of qpdf
    • fixed some installer issues and errors in installation instructions
    • improve performance: run Ghostscript with multithreaded rendering
    • improve performance: use multiple cores by default
    • bug fix: checking for wrong exception on process timeout
  • rc3: skipping version number intentionally to avoid confusion with Tesseract

  • rc2: first release for public testing to test-PyPI, Github

  • rc1: testing release process

Compatibility notes

  • ./OCRmyPDF.sh script is still available for now
  • Stacking the verbosity option like -vvv is no longer supported
  • The configuration file config.sh has been removed. Instead, you can feed a file to the arguments for common settings:
ocrmypdf input.pdf output.pdf @settings.txt

where settings.txt contains one argument per line, for example:

-l
deu
--author
A. Merkel
--pdf-renderer
tesseract

Fixes

  • Handling of filenames containing spaces: fixed

Notes and known issues

  • Some dependencies may work with lower versions than tested, so try overriding dependencies if they are "in the way" to see if they work.
  • --pdf-renderer tesseract will output files with an incorrect page size in Tesseract 3.03, due to a bug in Tesseract.
  • PDF files containing "inline images" are not supported and won't be for the 3.0 release. Scanned images almost never contain inline images.