# Transcribe server architecture
Currently, Joplin supports OCR functionality but only for printed text. Handwritten text recognition is not yet handled. The Transcribe server addresses this gap by enabling the extraction of handwritten text from images.
Recognising handwritten text is computationally demanding, which is why it runs on a dedicated server separate from the main Joplin Server. Keeping Transcribe as a distinct component means Joplin Server remains the clients' single point of contact while the resource-intensive handwriting recognition is isolated. This separation also increases stability: Transcribe uses AI models that can unpredictably consume resources and potentially fail, so running it independently means any issues won't directly impact Joplin Server's operation.
The main functional goals of the Transcribe server are to extract handwritten text to enable search and processing within Joplin, and to support smart notebook features by converting scanned handwritten pages into searchable text.
The Transcribe server is part of a workflow that begins with the client making a request to the Joplin Server. The Joplin Server forwards the request to the Transcribe Server, which processes the image containing handwritten text and returns the extracted text to the Joplin Server. Finally, the processed text is sent back to the client.
```mermaid
flowchart LR
    Client((Client))
    JoplinServer[[Joplin Server]]
    TranscribeServer[[Transcribe Server]]
    Client -- "Send image request" --> JoplinServer
    JoplinServer -- "Forward image" --> TranscribeServer
    TranscribeServer -- "Return extracted text" --> JoplinServer
    JoplinServer -- "Send text response" --> Client
```
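From the client's perspective this is a submit-then-poll exchange. The sketch below is illustrative only: the `submit` and `check` callables are stand-ins for the two HTTP round trips (client → Joplin Server → Transcribe), and the payload shapes and function names are assumptions, not the actual Joplin client API.

```python
import time

def transcribe_via_server(submit, check, image_bytes, poll_interval=0.0, max_polls=10):
    """Submit an image, then poll until the job reports completion.

    `submit` stands in for POST /transcribe, `check` for POST /transcribe/:job_id.
    """
    job_id = submit(image_bytes)
    for _ in range(max_polls):
        status, text = check(job_id)
        if status == "completed":
            return text
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id} did not complete")

# Stub transport simulating a job that completes on the second poll.
_state = {"polls": 0}

def fake_submit(image_bytes):
    return "job-1"

def fake_check(job_id):
    _state["polls"] += 1
    if _state["polls"] < 2:
        return ("processing", None)
    return ("completed", "Dear diary ...")

print(transcribe_via_server(fake_submit, fake_check, b"image-bytes"))
```

The polling loop is the important part: the client never talks to Transcribe directly, and it keeps re-checking the job until the server reports a result.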
The Transcribe server is composed of the following parts:

- **Transcribe API** (REST, port 4567; customisable):
  - `POST /transcribe` (multipart/form-data): upload an image file to create a job; returns a job ID.
  - `POST /transcribe/:job_id`: check the status of a job; when completed, returns the extracted text.
  - Requests to Transcribe are authenticated with a shared secret (`?secret=...`).
- **Job store (PostgreSQL):** persists each job with its status and result.
- **Internal queue:** holds newly created jobs until a worker picks them up.
- **Job processor (worker):** dequeues a job, loads its image, runs the transcription engine, updates the job's status and result, and deletes the image.
- **Transcription engine:** LlamaCPP running an LLM that extracts the handwritten text from the image.
- **Image storage:** a folder where uploaded images are kept until their job has been processed.
- **Joplin Server (proxy):** forwards client requests to the Transcribe API, adding the shared secret.
- **Client:** the Joplin application, which uploads images and polls for results via Joplin Server.
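The job lifecycle these components implement can be simulated in-process. This is a sketch under assumptions: the dict-based stores, the `fake_engine` function, and the exact status strings beyond `created`/`completed` are stand-ins for illustration, not Transcribe's actual code.

```python
import queue

jobs = {}                # stand-in for the PostgreSQL job store
images = {}              # stand-in for the images folder
pending = queue.Queue()  # stand-in for the internal queue

def create_job(job_id, image_bytes):
    """API side: persist the job as `created`, save the image, enqueue it."""
    jobs[job_id] = {"status": "created", "result": None}
    images[job_id] = image_bytes
    pending.put(job_id)
    return job_id

def fake_engine(image_bytes):
    # Stand-in for the LlamaCPP + LLM transcription engine.
    return "handwritten text"

def process_one():
    """Worker side: dequeue, load the image, transcribe, update, clean up."""
    job_id = pending.get()
    text = fake_engine(images[job_id])
    jobs[job_id] = {"status": "completed", "result": text}
    del images[job_id]  # the image is no longer needed once the job is done

create_job("job-1", b"...image bytes...")
process_one()
print(jobs["job-1"])
```

Note that the image is deleted as soon as the worker finishes, so image storage only ever holds in-flight jobs.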
```mermaid
flowchart LR
    subgraph Client
        ClientNode((Joplin))
    end
    subgraph JoplinServer
        JS[[REST API]]
    end
    subgraph Transcribe
        API[[Transcribe API :4567]]
        Q[(Internal Queue)]
        Worker[[Job Processor]]
        DB[(PostgreSQL - Jobs)]
        Store[(Images Folder)]
        Engine[[LlamaCPP + LLM]]
    end
    ClientNode -- "POST /transcribe" --> JS
    JS -- "POST /transcribe?secret=***" --> API
    API -- "Persist job (created)" --> DB
    API -- "Save image" --> Store
    API -- "Enqueue job" --> Q
    Worker -- "Dequeue" --> Q
    Worker -- "Load image" --> Store
    Worker -- "Transcribe" --> Engine
    Worker -- "Update status/result" --> DB
    Worker -- "Delete image" --> Store
    ClientNode -- "POST /transcribe/:job_id" --> JS
    JS -- "POST /transcribe/:job_id?secret=***" --> API
    API -- "Read status/result" --> DB
    API -- "Status/result" --> JS
    JS -- "Status/result (text)" --> ClientNode
```
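The `?secret=***` edges in the diagram imply a shared-secret check on the Transcribe side. A minimal sketch, assuming the secret arrives as a query parameter; the function name and configuration variable are hypothetical:

```python
import hmac

SHARED_SECRET = "change-me"  # assumed to come from server configuration

def is_authorised(query_params: dict) -> bool:
    """Check the shared secret supplied on the request.

    hmac.compare_digest is a constant-time comparison, which avoids
    leaking the secret through timing differences.
    """
    supplied = query_params.get("secret", "")
    return hmac.compare_digest(supplied, SHARED_SECRET)

print(is_authorised({"secret": "change-me"}))  # expected: True
print(is_authorised({"secret": "wrong"}))      # expected: False
print(is_authorised({}))                       # expected: False
```

Because only Joplin Server knows the secret, clients cannot bypass the proxy and call the Transcribe API directly.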
For baseline requirements, refer to the Joplin Server hardware requirements. Two configurations are suggested:

- Cost-effective configuration
- Fast / scalable configuration

A GPU is recommended for efficient execution of the LlamaCPP-based transcription model.
Known limitations and considerations:

- **Model accuracy:** the LLM used for transcription is a relatively new technology and can occasionally produce unexpected or inaccurate results. However, this technology evolves quickly, and the model can be updated to a newer, more accurate version with minimal changes.
- **Security of LLM execution:** risks such as prompt injection exist when running LLMs and should be mitigated.
- **Operational costs:** GPU hardware is required for efficient processing. Depending on pricing and usage, GPU costs can be significant and should be monitored closely.
To back up the job store, `pg_dump` can be used.
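A typical invocation might look like the following; the database name, host, user, and output path are placeholder assumptions, and actually running the command requires PostgreSQL's client tools. The helper below only builds the argument list.

```python
def build_pg_dump_command(dbname, outfile, host="localhost", port=5432, user="transcribe"):
    """Build a pg_dump command line for the job store.

    -F c produces a custom-format archive that can be restored
    selectively with pg_restore.
    """
    return [
        "pg_dump",
        "-h", host,
        "-p", str(port),
        "-U", user,
        "-F", "c",
        "-f", outfile,
        dbname,
    ]

cmd = build_pg_dump_command("transcribe", "/backups/transcribe.dump")
print(" ".join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Only the job metadata lives in PostgreSQL; since images are deleted after processing, a database dump captures the server's durable state.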