docs/User Guide/User Guide/Advanced Usage/Text Extraction (OCR).md
Optical Character Recognition (OCR) is the process of extracting text from images or PDFs.
Since v0.103.0, Trilium has built-in support for OCR. The extracted text can be:
OCR in Trilium supports the following formats:
Currently only text extraction is supported and not OCR.
The text will be extracted from the following file formats:
The OCR can be configured by going to <a class="reference-link" href="../Basic%20Concepts%20and%20Features/UI%20Elements/Options.md">Options</a> → <a class="reference-link" href="#root/_hidden/_options/_optionsMedia">Media</a> and looking for the Text Extraction (OCR) section.
There are three ways to trigger the OCR:
When text is extracted from an image, the result comes with a confidence level which indicates how reliable the extracted text appears to be.
When the minimum confidence is set to a low percentage, the text extraction can interpret symbols and drawings incorrectly, resulting in garbled text.
If the confidence of the text extracted from a note or an attachment is lower than the minimum confidence, the OCR result is disregarded.
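The effect of the minimum confidence can be illustrated with a minimal sketch. The `OcrResult` type and the `filterByConfidence` function are hypothetical names made up for this example and are not part of Trilium's API. The sketch also assumes that confidence is expressed as a percentage from 0 to 100.

```typescript
// Hypothetical shape of one OCR result; not Trilium's actual type.
interface OcrResult {
    text: string;
    /** Confidence as a percentage from 0 to 100 (assumed scale). */
    confidence: number;
}

/**
 * Returns the extracted text if it meets the minimum confidence,
 * or null if the result should be disregarded.
 */
function filterByConfidence(result: OcrResult, minConfidence: number): string | null {
    if (result.confidence < minConfidence) {
        // Below the threshold: treat the extraction as unreliable.
        return null;
    }
    return result.text;
}

// A clean scan passes the threshold, a misread drawing does not.
const clean = filterByConfidence({ text: "Invoice 2024", confidence: 92 }, 75);
const garbled = filterByConfidence({ text: "~|/ ;.'", confidence: 31 }, 75);
console.log(clean);
console.log(garbled);
```

Lowering the threshold in this sketch would let the second result through, which corresponds to the garbled text described above.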
OCR needs to know the language of the content in order to work correctly. Each language has its own trained data which needs to be downloaded, and accents or other language-specific symbols will not be recognized when only the default language is used.
To configure the languages that are supported by the OCR, simply go to <a class="reference-link" href="../Basic%20Concepts%20and%20Features/UI%20Elements/Options.md">Options</a> → <a class="reference-link" href="#root/_hidden/_options/_optionsLocalization">Language & Region</a> and adjust the Content languages.
When there are no content languages defined, the user interface Language is used instead.
After making this change, both automatic processing and manual reprocessing will take the new languages into consideration.
To enforce the detection in a particular language for a given note, use the language attribute, similar to text content language. For <a class="reference-link" href="../Basic%20Concepts%20and%20Features/Notes/Attachments.md">Attachments</a>, it's not possible to manually adjust the language.
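The order in which the language is chosen can be sketched as follows. This is only an illustration of the rules described above, not Trilium's actual code: `LanguageSettings` and `resolveOcrLanguages` are hypothetical names, and the sketch assumes that a note's language attribute replaces the content languages entirely rather than being added to them.

```typescript
// Hypothetical input; field names are assumptions made for this example.
interface LanguageSettings {
    /** Value of the note's language attribute, if set. Not available for attachments. */
    noteLanguage?: string;
    /** Content languages from Options → Language & Region. */
    contentLanguages: string[];
    /** The user interface language, used as the last fallback. */
    uiLanguage: string;
}

/** Picks the languages OCR should use, from most to least specific. */
function resolveOcrLanguages(settings: LanguageSettings): string[] {
    if (settings.noteLanguage) {
        // A language set on the note itself takes precedence (assumed behavior).
        return [settings.noteLanguage];
    }
    if (settings.contentLanguages.length > 0) {
        return [...settings.contentLanguages];
    }
    // No content languages defined: fall back to the UI language.
    return [settings.uiLanguage];
}

const forNote = resolveOcrLanguages({ noteLanguage: "de", contentLanguages: ["en", "fr"], uiLanguage: "en" });
const forAttachment = resolveOcrLanguages({ contentLanguages: ["en", "fr"], uiLanguage: "en" });
const fallback = resolveOcrLanguages({ contentLanguages: [], uiLanguage: "ro" });
console.log(forNote, forAttachment, fallback);
```

In this sketch an attachment is modeled by simply omitting `noteLanguage`, since its language cannot be adjusted manually.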
[!NOTE] The trained data for each language is not packaged with Trilium, as that would require a significant amount of space that might not otherwise be needed. As such, the trained data is downloaded automatically via Tesseract.js.
The downloaded trained data is located in the <a class="reference-link" href="../Installation%20%26%20Setup/Data%20directory.md">Data directory</a>, in the `ocr-cache` directory.
To access the extracted content of a note:
This section allows: