OCR PDF

Turn scanned documents into searchable, copyable PDFs. The tool uses Tesseract OCR to recognize text in images and overlays an invisible text layer on top of the original pages, preserving their visual appearance while making the content fully searchable.
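How the tool positions its text layer is not documented here. As a rough sketch of the general technique: OCR engines report word boxes in image pixels with the origin at the top-left, while PDF pages are measured in points (72 per inch) from the bottom-left, so each box has to be scaled and flipped before invisible text can be placed over it. The names `PixelBox`, `PdfRect`, and `pixel_box_to_pdf_rect` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # default PDF user space: 72 points per inch


@dataclass(frozen=True)
class PixelBox:
    """Word bounding box in image pixels, origin at the top-left corner."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class PdfRect:
    """Rectangle in PDF points, origin at the bottom-left corner."""
    x: float
    y: float
    width: float
    height: float


def pixel_box_to_pdf_rect(box: PixelBox, dpi: float, page_height_pt: float) -> PdfRect:
    """Convert an OCR word box to the PDF rectangle that covers the same area.

    Assumes the page was rendered at `dpi` with no rotation or cropping.
    """
    scale = POINTS_PER_INCH / dpi  # points per pixel
    width = (box.right - box.left) * scale
    height = (box.bottom - box.top) * scale
    x = box.left * scale
    # Flip the vertical axis: the bottom edge of the pixel box becomes the
    # lower edge of the PDF rectangle, measured up from the page bottom.
    y = page_height_pt - box.bottom * scale
    return PdfRect(x, y, width, height)


# A word found one inch from the left and top of a US Letter page at 288 DPI.
word = PixelBox(left=288, top=288, right=576, bottom=360)
rect = pixel_box_to_pdf_rect(word, dpi=288, page_height_pt=792)
```

Text drawn inside such a rectangle can be made invisible with PDF text rendering mode 3 (neither filled nor stroked), which keeps it selectable and searchable without changing how the page looks.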

How It Works

  1. Upload a scanned PDF or image-based PDF file.
  2. Select one or more languages present in the document from the searchable language list.
  3. Optionally adjust advanced settings (resolution, binarization, whitelist).
  4. Click Start OCR and monitor progress in the real-time progress bar.
  5. When complete, review the extracted text, then download the searchable PDF or a plain .txt file.
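The choices made in steps 2 and 3 can be pictured as one settings object. The sketch below is illustrative only: `OcrSettings` and `tesseract_language_string` are hypothetical names, and the defaults are taken from the Options table. The one Tesseract convention it relies on is that multiple languages are passed as codes joined with `+` (for example `eng+jpn`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrSettings:
    """Hypothetical container for the choices made before clicking Start OCR."""
    languages: List[str]
    resolution: str = "High"  # default per the Options table
    binarize: bool = False    # default per the Options table
    whitelist: str = ""       # empty means no character restriction


def tesseract_language_string(languages: List[str]) -> str:
    """Join language codes the way Tesseract expects, e.g. 'eng+jpn'.

    Duplicates are dropped while keeping the original selection order.
    """
    unique: List[str] = []
    for code in languages:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("select at least one language")
    return "+".join(unique)


settings = OcrSettings(languages=["eng", "jpn", "eng"])
lang = tesseract_language_string(settings.languages)
```

Here `lang` is the string an OCR call would receive; selecting no language is treated as an error rather than silently falling back to a default.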

Features

  • Multi-language OCR with a searchable language selector
  • Invisible text layer preserves the original document appearance
  • Real-time progress bar and log output during processing
  • Download results as a searchable PDF or plain text file
  • Copy extracted text directly to clipboard
  • Three resolution tiers for balancing speed versus accuracy
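To get a feel for the speed versus accuracy trade-off, it helps to work out how many pixels each tier produces. This sketch uses the DPI values listed under Options and the standard 72 points per inch of PDF page units. It assumes each page is rasterized at exactly the tier's DPI, which the tool may or may not do internally; the function names are hypothetical.

```python
from typing import Tuple

# DPI values as listed in the Options table.
RESOLUTION_DPI = {"Standard": 192, "High": 288, "Ultra": 384}

POINTS_PER_INCH = 72


def rendered_size_px(width_pt: float, height_pt: float, tier: str) -> Tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given tier's DPI."""
    dpi = RESOLUTION_DPI[tier]
    width_px = round(width_pt / POINTS_PER_INCH * dpi)
    height_px = round(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


def relative_pixel_count(tier: str, baseline: str = "Standard") -> float:
    """How many times more pixels a tier produces than the baseline tier."""
    ratio = RESOLUTION_DPI[tier] / RESOLUTION_DPI[baseline]
    return ratio * ratio  # pixel count grows with the square of the DPI


# US Letter is 612 x 792 points (8.5 x 11 inches).
for tier in RESOLUTION_DPI:
    print(tier, rendered_size_px(612, 792, tier), relative_pixel_count(tier))
```

Because pixel count grows with the square of the DPI, Ultra gives the recognizer four times as many pixels per page as Standard, which is where the extra processing time comes from.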

Options

| Setting | Values | Default | Purpose |
| --- | --- | --- | --- |
| Resolution | Standard (192 DPI), High (288 DPI), Ultra (384 DPI) | High | Higher resolution improves accuracy on small text but takes longer |
| Binarize Image | On / Off | Off | Enhances contrast for clean scans by converting to black and white |
| Character Whitelist Preset | None, Alphanumeric, Numbers + Currency, Letters Only, Numbers Only, Invoice, Forms, Custom | None | Restricts recognized characters to improve accuracy for specific document types |
| Character Whitelist | Free text | Empty | Manual character set when preset is set to Custom |
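The two whitelist settings interact: a preset supplies a character set, and the free-text field is only used when the preset is Custom. The sketch below shows one way that resolution could work. The preset names come from the table, but every character set here is a guess made up for illustration, since the actual contents of each preset are not documented. Tesseract itself exposes a `tessedit_char_whitelist` parameter for this kind of restriction; whether the tool passes the result to it unchanged is an assumption.

```python
import string

# Hypothetical character sets: the preset names match the Options table,
# but the characters in each set are illustrative guesses only.
PRESET_CHARACTERS = {
    "None": "",  # empty string means no restriction
    "Alphanumeric": string.ascii_letters + string.digits,
    "Numbers + Currency": string.digits + "$€£¥.,-",
    "Letters Only": string.ascii_letters,
    "Numbers Only": string.digits,
    "Invoice": string.ascii_letters + string.digits + "$€£.,:/-#% ",
    "Forms": string.ascii_letters + string.digits + ".,:/-()@_ ",
}


def resolve_whitelist(preset: str, custom: str = "") -> str:
    """Return the characters the recognizer should be limited to.

    For "Custom" the free-text field is used, with duplicates removed
    while keeping the order in which characters were typed.
    """
    if preset == "Custom":
        return "".join(dict.fromkeys(custom))
    if preset not in PRESET_CHARACTERS:
        raise ValueError(f"unknown whitelist preset: {preset!r}")
    return PRESET_CHARACTERS[preset]


# The value that would be handed to the OCR engine, for example as
# Tesseract's tessedit_char_whitelist parameter.
digits_only = resolve_whitelist("Numbers Only")
part_numbers = resolve_whitelist("Custom", custom="ABCA-0123")
```

An empty result means the recognizer is left unrestricted, which matches the None default in the table.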

Use Cases

  • Making scanned contracts and invoices searchable so you can Ctrl+F through them
  • Extracting text from photographed documents or old paper records
  • Processing receipts with the Invoice preset to accurately capture dollar amounts and dates
  • Creating accessible PDFs from image-only scans for compliance requirements
  • Batch-extracting text content from scanned books or manuals

Tips

  • Select multiple languages if your document contains mixed-language content (e.g., English headers with Japanese body text).
  • Use the binarize option for documents with faded or low-contrast text; it can significantly improve recognition accuracy.
  • The Invoice and Forms whitelist presets reduce misrecognized characters on structured documents by limiting recognition to the characters those documents typically contain.
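To see why binarization helps with faded text, here is a minimal sketch of one well-known approach, Otsu's method, which picks the grey level that best separates dark pixels from light ones and then maps every pixel to pure black or white. The algorithm the tool actually applies is not documented, so treat this as an illustration of the idea. It works on a flat list of 8-bit greyscale values, and the function names are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level (0-255) that best splits pixels into two classes."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0  # number of pixels at or below the candidate level
    sum_dark = 0     # sum of grey values of those pixels
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance, up to a constant factor.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels: List[int]) -> List[int]:
    """Map each pixel to 0 (black) or 255 (white) using Otsu's threshold."""
    threshold = otsu_threshold(pixels)
    return [255 if value > threshold else 0 for value in pixels]


# Faded grey text (150) on an off-white background (230).
faded = [230] * 12 + [150] * 4
clean = binarize(faded)
```

In the faded sample the text and background differ by only 80 grey levels; after binarization they are 255 levels apart, which gives the recognizer much sharper character edges to work with.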