docs/introduction.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
OCRmyPDF is a Python application and library that adds text "layers" to images in PDFs, making scanned image PDFs searchable. It uses OCR to guess the text contained in images. OCRmyPDF also supports plugins that enable customization of its processing steps, and it is highly tolerant of PDFs containing scanned images and "born digital" content that doesn't require text recognition.
Optical character recognition is a technology that converts images of typed or handwritten text, such as in a scanned document, into computer text that can be selected, searched and copied.
OCRmyPDF uses Tesseract, a widely available open source OCR engine, to perform OCR.
(raster-vector)=
PDFs are page description files that attempt to preserve a layout exactly. They contain vector graphics that can contain raster objects, such as scanned images. Because PDFs can contain multiple pages (unlike many image formats) and can contain fonts and text, they are a suitable format for exchanging scanned documents.
:::{image} images/bitmap_vs_svg.svg
:::
A PDF page may contain multiple images, even if it appears to have only one image. Some scanners or scanning software may segment pages into monochromatic text and color regions, for example, to enhance the compression ratio and appearance of the page.
Rasterizing a PDF is the process of generating corresponding raster images. OCR engines like Tesseract work with images, not scalable vector graphics or mixed raster-vector-text graphics such as PDF.
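As a rough illustration of what rasterizing involves, the pixel dimensions of a rendered page follow from the page size in PDF units and the chosen resolution. The sketch below assumes the PDF default of 72 units per inch. The function `raster_size` and its `user_unit` parameter are names made up for this example and are not part of OCRmyPDF.

```python
def raster_size(width_pt: float, height_pt: float, dpi: float,
                user_unit: float = 1.0) -> tuple[int, int]:
    """Return (width_px, height_px) for a page rendered at ``dpi``.

    PDF page dimensions are expressed in units of 1/72 inch by default;
    ``user_unit`` scales that unit (1.0 means no scaling).
    """
    if dpi <= 0 or user_unit <= 0:
        raise ValueError("dpi and user_unit must be positive")
    # Pixels per PDF unit at the requested resolution.
    scale = user_unit * dpi / 72.0
    return round(width_pt * scale), round(height_pt * scale)


# A US Letter page is 612 x 792 units (8.5 x 11 inches).
letter_at_300 = raster_size(612, 792, dpi=300)
```

At 300 DPI the Letter page above becomes a 2550 x 3300 pixel image, which is the kind of input an OCR engine expects.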
PDF/A is an ISO-standardized subset of the full PDF specification that is designed for archiving (the 'A' stands for Archive). PDF/A differs from PDF primarily by omitting features that could complicate future file readability, such as embedded JavaScript, video, audio and references to external fonts. All fonts and resources needed to interpret the PDF must be contained within it. Because PDF/A disables JavaScript and other types of embedded content, it is likely more secure.
There are various conformance levels and versions, such as "PDF/A-2b".
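A label such as "PDF/A-2b" combines a version (the part number, here 2) with a conformance level (here "b"). The following sketch splits such a label into its two components. The names `PdfALevel` and `parse_pdfa_label` are hypothetical helpers for illustration only, and the sketch assumes labels follow the `PDF/A-<digit><letter>` pattern with an optional letter.

```python
import re
from typing import NamedTuple, Optional


class PdfALevel(NamedTuple):
    part: int                   # the PDF/A version, e.g. 2
    conformance: Optional[str]  # the conformance level, e.g. "b"


# Matches labels such as "PDF/A-2b"; the conformance letter is optional.
_LABEL = re.compile(r"^PDF/A-(\d+)([a-z])?$", re.IGNORECASE)


def parse_pdfa_label(label: str) -> PdfALevel:
    """Split a label like "PDF/A-2b" into part number and conformance level."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a PDF/A label: {label!r}")
    part, conformance = match.groups()
    return PdfALevel(int(part), conformance.lower() if conformance else None)
```

For example, `parse_pdfa_label("PDF/A-2b")` yields part 2 with conformance level "b".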
In general, the preferred format for scanned documents is PDF/A. Some governments and jurisdictions, US Courts in particular, mandate the use of PDF/A for scanned documents.
Since most individuals scanning documents aim for long-term readability, OCRmyPDF defaults to generating PDF/A-2b.
PDF/A does have a few drawbacks. Some PDF viewers display an alert indicating that the file is in PDF/A format, which may confuse some users. Additionally, it tends to result in larger files than standard PDFs because it embeds certain resources, even if they are widely available. PDF/A files can be digitally signed, but they may not be encrypted, so that they remain readable in the future. Fortunately, converting from PDF/A to a regular PDF is straightforward, and any PDF viewer can handle PDF/A files.
OCRmyPDF analyzes each page of a PDF to determine the required colorspace and resolution (DPI) for capturing all the information on that page without losing content. It uses a PDF rasterizer (pypdfium2 or Ghostscript) to convert each page to an image and subsequently performs OCR on the rasterized image to generate an OCR "layer." This layer is then integrated back into the original PDF.
:::{versionchanged} 17.0.0
OCRmyPDF now supports pypdfium2 as an alternative rasterizer to Ghostscript.
pypdfium2 is a Python binding for pdfium, the PDF rendering library used by
Google Chrome. The --rasterizer auto setting (default) prefers pypdfium2
when available.
:::
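One way to drive this pipeline from a script is to call the command line program through `subprocess`. The sketch below is a minimal example under stated assumptions: the helper names `build_ocrmypdf_command` and `run_ocrmypdf` are invented for illustration, the `ocrmypdf` executable is assumed to be on `PATH`, and `auto` is the only `--rasterizer` value taken from the text above.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_ocrmypdf_command(input_pdf: PathLike, output_pdf: PathLike,
                           rasterizer: str = "auto") -> List[str]:
    """Build the argument list for an OCRmyPDF run with a given rasterizer."""
    return ["ocrmypdf", "--rasterizer", rasterizer,
            str(input_pdf), str(output_pdf)]


def run_ocrmypdf(input_pdf: PathLike, output_pdf: PathLike) -> None:
    """Run OCRmyPDF and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf), check=True)
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before anything is executed.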
While it is possible to use a program like Ghostscript or ImageMagick to obtain an image and then run that image through Tesseract OCR, this process actually generates a new PDF, potentially resulting in the loss of various details (such as the document's metadata). In contrast, OCRmyPDF can produce a minimally altered PDF as the output.
OCRmyPDF also offers several image processing options, such as deskew, which improve the visual quality of files and the accuracy of OCR. When these options are used, the OCR layer is integrated into the processed image.
By default, OCRmyPDF generates archival PDFs in the PDF/A format, which is
a more rigid subset of PDF features designed for long-term archives. If you
prefer regular PDFs, you can disable this feature using the
--output-type pdf option.
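The same choices can be made from Python. The sketch below assumes that `ocrmypdf.ocr` accepts `deskew` and `output_type` keyword arguments mirroring the `--deskew` and `--output-type` command line options; check the API reference for your installed version. The wrapper `ocr_to_regular_pdf` and its `ocr_func` parameter are hypothetical, added only so the call can be exercised without OCRmyPDF installed.

```python
from typing import Any, Callable, Optional


def ocr_to_regular_pdf(input_pdf: str, output_pdf: str,
                       ocr_func: Optional[Callable[..., Any]] = None) -> Any:
    """OCR a file with deskewing and write a regular (non-PDF/A) PDF."""
    if ocr_func is None:
        # Imported lazily; requires OCRmyPDF to be installed.
        import ocrmypdf
        ocr_func = ocrmypdf.ocr
    # Assumption: keyword names mirror the --deskew and --output-type options.
    return ocr_func(input_pdf, output_pdf, deskew=True, output_type="pdf")
```

Passing a stand-in for `ocr_func` lets you verify the arguments in a unit test before running a real, and potentially slow, OCR job.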
A PDF is similar to an HTML file, in that it contains document structure along with images. While some PDFs may solely display a full-page image, they often contain additional content that would be forfeited if not preserved.
A manual process could take one of these approaches:
In cases where a PDF solely serves as a container for images without any rotation, scaling, or cropping, the second approach can be lossless.
OCRmyPDF uses various strategies depending on input options and the input PDF itself. Generally, it rasterizes a page for OCR and then integrates the OCR data back into the original PDF. This approach allows it to handle complex PDFs and preserve their content as much as possible.
Furthermore, OCRmyPDF supports a wide range of edge cases that have emerged during several years of development. It accommodates PDF features like images within Form XObjects and pages with UserUnit scaling. It also supports less common image formats like non-monochrome 1-bit images and provides warnings about files you may not want to OCR. Thanks to tools like pikepdf and QPDF, it can auto-repair damaged PDFs. You don't need to understand the intricacies of these issues; you should be able to use OCRmyPDF with any PDF file, and expect reasonable results.
OCRmyPDF is subject to limitations imposed by the Tesseract OCR engine. These limitations are inherent to any software relying on Tesseract:
-l LANG argument.

:::{versionchanged} 17.0.0
Ghostscript is no longer strictly required. OCRmyPDF can use pypdfium2 for
rasterization and verapdf for PDF/A validation.
:::
While Ghostscript remains a capable and feature-rich tool with a long history, recent releases have introduced some compatibility challenges that OCRmyPDF v17 addresses through alternative codepaths. When Ghostscript is used:
--pdfa-image-compression to
jpeg or lossless to set all images to one type or the other.
Ghostscript lacks an option to maintain the input image's format.
(Modern Ghostscript can copy JPEG images without transcoding them.)

When pypdfium2 and verapdf are available, many of these limitations can be
avoided by using the speculative PDF/A conversion path (enabled by default
with --output-type auto).
You can use --output-type pdf to disable PDF/A conversion and produce
a standard, non-archival PDF.
Regarding OCRmyPDF itself:
To the author's knowledge, OCRmyPDF is the most feature-rich and thoroughly tested command line OCR PDF conversion tool. If it does not meet your needs, contributions and suggestions are welcome.
Ghostscript recently added three "pdfocr" output devices. They work by rasterizing all content and converting all pages to a single colour space.
The Docker image of OCRmyPDF provides a web service front-end that allows files to be submitted over HTTP and the results to be downloaded. This HTTP server is intended to demonstrate how OCRmyPDF can be integrated into a web service. It is not intended to be deployed on the public internet and does not provide any security measures.
In addition, the following third-party integrations are available:
OCRmyPDF is not designed to be secure against malware-bearing PDFs (see Using OCRmyPDF online). Users should ensure they comply with OCRmyPDF's licenses and the licenses of all dependencies. In particular, if you use OCRmyPDF with Ghostscript, note that Ghostscript is licensed under AGPLv3.