readme/apps/ocr.md
Optical Character Recognition (OCR) involves transforming an image containing text into a format that a machine can interpret. When you scan a text document, the computer stores it as an image file, preventing direct text editing or searching. OCR allows for the conversion of the image into text.
OCR should be enabled by default. If it is not, you can enable it from the Configuration screen, under the "General" section. Once enabled, Joplin scans your images (PNG and JPEG) and PDF files to extract text data from them.
Scanning documents is only available on the desktop app since this is a relatively resource-intensive process. The mobile app will have access to that OCR data via sync.
For now OCR is reliable when scanning printed text, PDFs in particular, or images where the text is clear such as screenshots. We do not currently support handwritten text, and text on photos may or may not be recognized depending on how clear it is.
When you search, the application will tell you which notes, and also which attachments, match the query. When an attachment matches, a banner is displayed at the top of the note that contains the attachment(s):
Searching in OCR text is enabled on both the desktop and mobile apps.
The application allows you to view the OCR text associated with an image. To do so, right-click on a PDF link or image and select "View OCR text". This will create a new text file with that OCR text, and open it in your text editor.
For scanned PDFs that contain only images, you can create an accessible version that allows text selection, copying, and screen reader support. To do this:
The generated PDF contains the original page images with an invisible text layer overlaid on top. This allows you to:
If the PDF was processed before this feature was available, you will be prompted to re-run OCR to generate the required word coordinate data.
To have all future PDFs automatically processed with word coordinates (so you don't need to re-run OCR), enable the "OCR: PDF processing mode" setting (under General > Advanced) and set it to "Accessible". Note that this increases database size by approximately 10 to 50 KB per page.
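As a rough illustration of what this means for storage, the following Python sketch multiplies a page count by the 10 to 50 KB range quoted above. The function name and the 2000-page example are assumptions made for illustration; actual growth depends on how much text each page contains.

```python
from typing import Tuple

# Per-page range quoted in the text above (approximate).
KB_PER_PAGE_LOW = 10
KB_PER_PAGE_HIGH = 50


def estimate_extra_db_size_kb(page_count: int) -> Tuple[int, int]:
    """Return a (low, high) estimate, in KB, of the extra database size."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return page_count * KB_PER_PAGE_LOW, page_count * KB_PER_PAGE_HIGH


# Hypothetical library of 2000 scanned PDF pages.
low_kb, high_kb = estimate_extra_db_size_kb(2000)
print(f"2000 pages: roughly {low_kb / 1000:.0f} to {high_kb / 1000:.0f} MB")
```

For 2000 pages this gives an estimate of roughly 20 to 100 MB of additional database size.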
Watch this short video to learn how to use Optical Character Recognition (OCR) in Joplin:
Processing images and PDFs may be resource intensive, especially if you have a lot of attachments. The first time the feature is enabled, don't be surprised if Joplin's CPU usage is higher than usual. Once the initial scan of all your attachments is done, it will go back to normal. Later, whenever you attach a file, it will be scanned quickly enough that you should not notice it.
As always, Joplin is offline first, which means OCR also happens offline, without the need for an internet connection and, more importantly, without the need to upload your private data to a third-party cloud. The drawback is the aforementioned initial use of your computer's resources, but we believe this is worth it to enable full offline support.
OCR is a technology that evolves rapidly, especially with the recent advances in AI, and large language models (LLMs) in particular. As such, Joplin OCR is designed to be pluggable. We will monitor the existing open source OCR technologies and may switch to a different one if it makes sense, or provide support for several of them.
Additionally, in some cases it may make sense to use a cloud-based solution, or simply connect to your self-hosted or intranet-based server for OCR. The current system will allow this through drivers written specifically for these services.
This pluggable interface is present in the software but not currently exposed. We will expose it depending on the feedback we receive and potential use cases. If you have a specific use case in mind, or notice any issue with the current OCR system, feel free to let us know on the forum.
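To make the idea of a pluggable design more concrete, here is a minimal Python sketch of a driver pattern. It is purely illustrative: Joplin's actual interface is not exposed and is not written in Python, and every name below (`OcrDriver`, `LocalDriver`, `RemoteDriver`, `OcrService`, the example server URL) is hypothetical. The drivers return placeholder strings rather than performing real OCR.

```python
from abc import ABC, abstractmethod


class OcrDriver(ABC):
    """Hypothetical driver interface; not Joplin's actual API."""

    @abstractmethod
    def recognize(self, file_path: str) -> str:
        """Return the text extracted from the file at file_path."""


class LocalDriver(OcrDriver):
    """Stands in for an engine that runs offline on the user's computer."""

    def recognize(self, file_path: str) -> str:
        # Placeholder result; a real driver would run an OCR engine here.
        return f"[text recognized locally from {file_path}]"


class RemoteDriver(OcrDriver):
    """Stands in for a self-hosted or intranet OCR server."""

    def __init__(self, server_url: str) -> None:
        self.server_url = server_url

    def recognize(self, file_path: str) -> str:
        # Placeholder result; a real driver would send the file to the server.
        return f"[text recognized by {self.server_url} from {file_path}]"


class OcrService:
    """Depends only on the driver interface, so drivers can be swapped."""

    def __init__(self, driver: OcrDriver) -> None:
        self.driver = driver

    def extract_text(self, file_path: str) -> str:
        return self.driver.recognize(file_path)


service = OcrService(LocalDriver())
print(service.extract_text("scan.pdf"))
```

Because `OcrService` only calls `recognize`, switching to another engine in this sketch means passing a different driver object, without changing the code that requests the text.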
After enabling OCR, Joplin downloads language files from https://cdn.jsdelivr.net/npm/@tesseract.js-data/. This URL can be customized in settings > advanced > "OCR: Language data URL or path". This URL or path should point to a directory with a .traineddata.gz file for each language to be used for OCR. After the first download, language data files are cached.
For example, to use OCR on a computer without internet access:
1. On a computer with internet access, download the .traineddata.gz files for the languages that will be OCRed. The files are available at https://cdn.jsdelivr.net/npm/@tesseract.js-data/[language]/4.0.0_best_int/[language].traineddata.gz where [language] should be replaced with eng, fra, chi_sim, deu, spa, or one of the other supported language codes.
2. Copy the .traineddata.gz files to the offline computer.
3. In Joplin's settings, set "OCR: Language data URL or path" to the folder that contains the files (e.g. C:\Users\User\Documents\joplin-ocr-data\).

To replace existing cached language data, click "Clear cache and re-download language data files".
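The download part of the offline setup above could be scripted along the following lines. This is a sketch using only the Python standard library, not an official Joplin tool: the function names are made up for illustration, and the URL pattern is the one given above. `download_language_data` must be run on the computer that has internet access, while `missing_languages` can be used on the offline computer to confirm that the folder is complete.

```python
from pathlib import Path
from typing import Iterable, List
from urllib.request import urlopen

# Default location and version directory taken from the URL pattern above.
BASE_URL = "https://cdn.jsdelivr.net/npm/@tesseract.js-data"
DATA_VERSION = "4.0.0_best_int"


def language_data_url(language: str) -> str:
    """Build the download URL for one language code, e.g. "eng"."""
    return f"{BASE_URL}/{language}/{DATA_VERSION}/{language}.traineddata.gz"


def download_language_data(languages: Iterable[str], target_dir: Path) -> List[Path]:
    """Download one .traineddata.gz file per language into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for language in languages:
        destination = target_dir / f"{language}.traineddata.gz"
        with urlopen(language_data_url(language)) as response:
            destination.write_bytes(response.read())
        saved.append(destination)
    return saved


def missing_languages(languages: Iterable[str], data_dir: Path) -> List[str]:
    """List the language codes that have no .traineddata.gz file in data_dir."""
    return [
        language
        for language in languages
        if not (data_dir / f"{language}.traineddata.gz").is_file()
    ]
```

For example, calling `download_language_data(["eng", "fra"], Path("joplin-ocr-data"))` would save the English and French files into a `joplin-ocr-data` folder, which can then be copied to the offline computer and selected in the "OCR: Language data URL or path" setting.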