docs/batch.md
% SPDX-FileCopyrightText: 2022 James R. Barlow
% SPDX-License-Identifier: CC-BY-SA-4.0
This article provides information about running OCRmyPDF on multiple files or configuring it as a service triggered by file system events.
Consider using the excellent GNU Parallel to apply OCRmyPDF to multiple files at once.
Both parallel and ocrmypdf will try to use all available processors.
To maximize parallelism without overloading your system with processes,
consider using parallel -j 2 to limit parallel to running two jobs at
once.
This command will run ocrmypdf on all files named *.pdf in the
current directory and write them to the previously created output/
folder. It will not search subdirectories.
The --tag argument tells parallel to print the filename as a prefix
whenever a message is printed, so that one can trace any errors to the
file that produced them.
:::{code} bash
parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf
:::
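One way to keep the two levels of parallelism from competing is to divide the available cores between the concurrent processes. The following is a minimal sketch, assuming GNU Parallel is installed and using ocrmypdf's `--jobs` option to cap the workers of each process. The function names `jobs_per_process` and `ocr_batch` are illustrative and not part of OCRmyPDF.

```bash
#!/usr/bin/env bash

# Split CPU cores evenly between concurrent ocrmypdf processes (never below 1).
jobs_per_process() {
    local cores="$1" procs="$2"
    local n=$(( cores / procs ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Hypothetical wrapper: run two ocrmypdf processes, each limited to its share of cores.
ocr_batch() {
    local procs=2 cores
    cores=$(getconf _NPROCESSORS_ONLN)   # number of online processors
    mkdir -p output                      # the output folder must exist beforehand
    parallel --tag -j "$procs" \
        ocrmypdf --jobs "$(jobs_per_process "$cores" "$procs")" '{}' 'output/{}' ::: *.pdf
}
```

Calling `ocr_batch` from a directory of PDFs would then behave like the command above, except that each `ocrmypdf` process is limited to its share of the cores.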
OCRmyPDF automatically repairs PDFs before parsing and gathering information from them.
This will walk through a directory tree and run OCR on all files in place, printing each filename between runs:

:::{code} bash
find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf '{}' '{}' \;
:::
This only runs one ocrmypdf process at a time. This variation uses
find to create a directory list and parallel to parallelize runs of
ocrmypdf, again updating files in place.
:::{code} bash
find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf '{}' '{}'
:::
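When files are updated in place, it can be useful to know which ones failed. The following is a minimal sketch of a sequential loop that reads NUL-delimited paths (so filenames with spaces survive) and reports each failure with its exit status. The function name `ocr_tree` and the `OCRMYPDF` override variable are hypothetical, introduced only for this example.

```bash
#!/usr/bin/env bash

# Hypothetical override: the command to run for each file (defaults to ocrmypdf).
OCRMYPDF="${OCRMYPDF:-ocrmypdf}"

# OCR every PDF under a directory in place, one at a time, and report failures.
ocr_tree() {
    local root="${1:-.}" pdf status failed=0
    # -print0 with read -d '' keeps filenames containing spaces or newlines intact
    while IFS= read -r -d '' pdf; do
        printf '%s\n' "$pdf"
        status=0
        # </dev/null stops the command from consuming the list of filenames
        "$OCRMYPDF" "$pdf" "$pdf" </dev/null || status=$?
        if [ "$status" -ne 0 ]; then
            printf 'FAILED (exit %d): %s\n' "$status" "$pdf" >&2
            failed=$((failed + 1))
        fi
    done < <(find "$root" -name '*.pdf' -print0)
    # Succeed only if every file was processed without error
    [ "$failed" -eq 0 ]
}
```

For example, `ocr_tree ~/scans 2>failures.log` would collect the failure messages in a separate file while filenames are printed to standard output.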
In a Windows batch file, use:

:::{code} bat
for /r %%f in (*.pdf) do ocrmypdf "%%f" "%%f"
:::
With a Docker container, you will need to stream through standard input and output:
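:::{code} bash
# Read NUL-delimited paths so filenames containing spaces are handled correctly
find . -name '*.pdf' -print0 | while IFS= read -r -d '' pdf; do
    pdfout=$(mktemp)
    # Stream the PDF through the container; replace the original only on success
    docker run --rm -i jbarlow83/ocrmypdf - - <"$pdf" >"$pdfout" && cp "$pdfout" "$pdf"
    rm -f "$pdfout"
done
:::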
This user-contributed script also provides an example of batch processing.
:::
Synology DiskStations (Network Attached Storage devices) can run the Docker image of OCRmyPDF if the Synology Docker package is installed. Attached is a script to address particular quirks of using OCRmyPDF on one of these devices.
At the time this script was written, it only worked for x86-based Synology products. It is not known if it will work on ARM-based Synology products. Further adjustments might be needed to deal with the Synology's relatively limited CPU and RAM.
:::
If you have thousands of files to work with, contact the author. Consulting work related to OCRmyPDF helps fund this open source project and all inquiries are appreciated.
OCRmyPDF has a folder watcher called watcher.py, which is currently included in source distributions but not part of the main program. It may be used natively or may run in a Docker container. Native instances tend to give better performance. watcher.py works on all platforms.
Users may need to customize the script to meet their requirements.
:::{code} bash
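# Install the watcher dependencies with uv...
uv sync --extra watcher
# ...or with pip
pip3 install 'ocrmypdf[watcher]'

# Run the watcher, configured through environment variables
env OCR_INPUT_DIRECTORY=/mnt/input-pdfs \
    OCR_OUTPUT_DIRECTORY=/mnt/output-pdfs \
    OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
    python3 watcher.py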
:::
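:::{list-table}
:header-rows: 1

* - Environment variable
  - Description
* - `OCR_ARCHIVE_DIRECTORY`
  - Archive directory for processed originals (requires `OCR_ON_SUCCESS_ARCHIVE` to be set)
* - `OCR_ON_SUCCESS_ARCHIVE`
  - Moves the processed original to `OCR_ARCHIVE_DIRECTORY` if the exit code is 0 (OK). Note that `OCR_ON_SUCCESS_DELETE` takes precedence over this option, i.e. if both options are set, the input file will be deleted.
* - `OCR_OUTPUT_DIRECTORY_YEAR_MONTH`
  - Places output files in `{output}/{year}/{month}/{filename}`
* - `OCR_JSON_SETTINGS`
  - JSON string of arguments for `ocrmypdf.ocr`, e.g. `'OCR_JSON_SETTINGS={"rotate_pages": true, "optimize": "3"}'`
:::

One could configure a networked scanner or scanning computer to drop files in the watched folder.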
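Because `OCR_JSON_SETTINGS` is passed as a string, a quoting mistake is easy to make. The following is a minimal sketch that checks the string before starting the watcher, assuming the settings should parse as a JSON object. The helper names `valid_json_object` and `start_watcher` and the `/mnt/processed-pdfs` path are hypothetical; the other paths and variables are the ones used above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if its argument parses as a JSON object.
valid_json_object() {
    python3 -c '
import json, sys
try:
    ok = isinstance(json.loads(sys.argv[1]), dict)
except ValueError:
    ok = False
sys.exit(0 if ok else 1)
' "$1"
}

# Hypothetical wrapper: validate the settings, then start the watcher
# with archiving of processed originals enabled.
start_watcher() {
    local settings='{"rotate_pages": true, "optimize": "3"}'
    if ! valid_json_object "$settings"; then
        echo "OCR_JSON_SETTINGS is not a valid JSON object" >&2
        return 1
    fi
    env OCR_INPUT_DIRECTORY=/mnt/input-pdfs \
        OCR_OUTPUT_DIRECTORY=/mnt/output-pdfs \
        OCR_ARCHIVE_DIRECTORY=/mnt/processed-pdfs \
        OCR_ON_SUCCESS_ARCHIVE=1 \
        OCR_JSON_SETTINGS="$settings" \
        python3 watcher.py
}
```

Single quotes around the JSON keep the shell from interpreting the double quotes and braces inside it.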
The watcher service is included in the OCRmyPDF Docker image. To run it:
:::{code} bash
docker run \
    --volume <path to files to convert>:/input \
    --volume <path to store results>:/output \
    --volume <path to store processed originals>:/processed \
    --env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1 \
    --env OCR_ON_SUCCESS_ARCHIVE=1 \
    --env OCR_DESKEW=1 \
    --env PYTHONUNBUFFERED=1 \
    --interactive --tty --entrypoint python3 \
    jbarlow83/ocrmypdf \
    watcher.py
:::
This service will watch for a file that matches `/input/*.pdf`, convert
it to an OCRed PDF in `/output/`, and move the processed original to
`/processed`. The parameters to this image are:
:::{list-table} Watcher Docker Parameters
:header-rows: 1

* - Parameter
  - Description
* - `--volume <path to files to convert>:/input`
  - Folder watched for input PDFs
* - `--volume <path to store results>:/output`
  - Folder where OCRed PDFs are written
* - `--volume <path to store processed originals>:/processed`
  - Folder where processed originals are moved
* - `--env OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1`
  - Set `OCR_OUTPUT_DIRECTORY_YEAR_MONTH=1` to place files in the output in `{output}/{year}/{month}/{filename}`
* - `--env OCR_ON_SUCCESS_ARCHIVE=1`
  - Set `OCR_ON_SUCCESS_ARCHIVE` to move processed originals
* - `--env OCR_DESKEW=1`
  - Set `OCR_DESKEW` to apply deskew to crooked input PDFs
* - `--env PYTHONUNBUFFERED=1`
  - Forces `STDOUT` to be unbuffered and allows you to see messages in docker logs
* - `--env OCR_LOGLEVEL='DEBUG'`
  - Sets the logging level
* - `--env OCR_JSON_SETTINGS={"language":"deu+eng", "rotate_pages": true}`
  - JSON string of arguments for `ocrmypdf.ocr`
:::

This service relies on polling to check for changes to the filesystem. It may not be suitable for some environments, such as filesystems shared on a slow network.
A configuration manager such as Docker Compose could be used to ensure that the service is always available.
:::
- `watchmedo` may not work properly on a networked file system,
  depending on the capabilities of the file system client and server.
- You may need to raise the limit on open file descriptors, for example with `ulimit -n 1024` to watch a folder of up to 1024 files.

You can use the Automator app on macOS to create a Workflow or Quick
Action. Use a Run Shell Script action in your workflow. In the context
of Automator, the `PATH` may be set differently from your Terminal's `PATH`;
you may need to explicitly set the `PATH` to include `ocrmypdf`. The
following example may serve as a starting point:
You may customize the command sent to ocrmypdf.
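A Run Shell Script action along these lines could look like the following sketch. It assumes the action's "Pass input" option is set to "as arguments" and that `ocrmypdf` was installed through Homebrew. The `PATH` entries and the function name `ocr_selected` are assumptions for illustration; adjust them to match your installation.

```bash
# Automator starts with a minimal PATH. These are assumed Homebrew locations
# (Apple silicon and Intel); adjust to wherever ocrmypdf is installed.
export PATH="/opt/homebrew/bin:/usr/local/bin:$PATH"

# OCR each file in place, continuing past failures.
ocr_selected() {
    local f failed=0
    for f in "$@"; do
        ocrmypdf "$f" "$f" || failed=$((failed + 1))
    done
    # Succeed only if every file was processed without error
    [ "$failed" -eq 0 ]
}

# With "Pass input: as arguments", Automator supplies the files as "$@".
ocr_selected "$@"
```

Additional `ocrmypdf` options can be added to the command inside the loop.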