docs/usage/api_server/deployment.md
!!! info "Synced from docling-serve v1.21.0"
    This page summarizes the docling-serve documentation at v1.21.0. For the exhaustive reference, follow the links to the source repository.
Get docling-serve running on one machine fast. For cluster/production hardening, follow the links to the docling-serve repo.
Two independent choices shape how you run it:
- Launch method: the docling-serve command, or Docker Compose.
- Compute engine (DOCLING_SERVE_ENG_KIND): the in-process Local engine (default), or the Redis-backed RQ engine.

| | Local engine (default) | RQ engine (Redis + workers) |
|---|---|---|
| docling-serve command | Quickstart / dev | Distributed |
| Docker Compose | Containerized single node (+GPU) | → serve repo |
docling-serve is configured by CLI flags or environment variables. Precedence is environment variable > config file > defaults.
!!! warning "Subprocess gotcha"
    When uvicorn runs with --reload or --workers > 1 it spawns subprocesses, and CLI flags (e.g. --enable-ui, --artifacts-path) are ignored. Use the DOCLING_SERVE_* environment variables in those deployments.
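
As a minimal sketch of that workaround, the settings can be exported before launch so that every uvicorn subprocess inherits them. The variable names are the ones documented on this page; the worker count and the model directory are illustrative assumptions.

```bash
# Environment variables are inherited by uvicorn's subprocesses, whereas
# CLI flags such as --enable-ui are ignored in multi-worker or reload mode.
export UVICORN_WORKERS=2                                    # illustrative value
export DOCLING_SERVE_ENABLE_UI=true
export DOCLING_SERVE_ARTIFACTS_PATH="$HOME/docling-models"  # hypothetical path; must already hold the models

# Start the server without the equivalent CLI flags (left commented so the
# snippet only prepares the environment):
# docling-serve run
```

Drop the DOCLING_SERVE_ARTIFACTS_PATH line to keep the default auto-download behaviour.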
| Setting (env var) | What it does | Default |
|---|---|---|
| UVICORN_HOST / UVICORN_PORT | bind address / port | 0.0.0.0 / 5001 |
| UVICORN_WORKERS | uvicorn worker processes | 1 |
| DOCLING_SERVE_API_KEY | require an X-Api-Key header | unset |
| DOCLING_SERVE_ENABLE_UI | serve the Gradio demo UI at /ui | false |
| DOCLING_SERVE_ARTIFACTS_PATH | local path to pre-downloaded models | unset (auto-download) |
| DOCLING_SERVE_MAX_NUM_PAGES / DOCLING_SERVE_MAX_FILE_SIZE | per-request limits | unset |
| DOCLING_SERVE_ENG_KIND | async engine: local or rq (also kfp/ray; see serve repo) | local |
See the full reference in the source repo: configuration.md and .env.example.
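
When DOCLING_SERVE_API_KEY is set on the server, clients have to send the key in the X-Api-Key header. The wrapper below is a rough sketch of one way to do that with curl. It assumes the client keeps the key in an environment variable of the same name and that the server listens on the default port; the function name docling_curl and the DOCLING_URL variable are hypothetical.

```bash
# docling_curl ENDPOINT [extra curl args...]
# Sends a request to the server with the X-Api-Key header attached.
# Fails with an error message if DOCLING_SERVE_API_KEY is not set.
docling_curl() {
  endpoint="$1"
  shift
  curl -sS \
    -H "X-Api-Key: ${DOCLING_SERVE_API_KEY:?DOCLING_SERVE_API_KEY is not set}" \
    "$@" \
    "${DOCLING_URL:-http://localhost:5001}${endpoint}"
}
```

For example, `docling_curl /docs` fetches the interactive docs page with the key attached; any additional curl options can follow the endpoint.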
These tune Docling itself and are read by the server too:
| Env var | What it does | Default |
|---|---|---|
| DOCLING_DEVICE | inference device: cpu / cuda / mps | auto |
| DOCLING_NUM_THREADS | CPU threads | runtime default |
| DOCLING_PERF_PAGE_BATCH_SIZE | pages per batch | runtime default |
| DOCLING_PERF_ELEMENTS_BATCH_SIZE | elements per batch | runtime default |
| DOCLING_DEBUG_PROFILE_PIPELINE_TIMINGS | log per-stage timings | false |
For how to choose device/perf values see GPU support. For offline / air-gapped model setup see the FAQ and Advanced options; set DOCLING_SERVE_ARTIFACTS_PATH to a pre-populated model directory.
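
A small sketch of setting these variables from a launch script. The detection heuristic (treating the presence of nvidia-smi on PATH as a sign that CUDA is usable) and the thread count are assumptions for illustration, not something docling-serve does itself; the helper name pick_docling_device is hypothetical.

```bash
# Print "cuda" when nvidia-smi is found on PATH, otherwise "cpu".
# This is only a heuristic: a present nvidia-smi does not guarantee a
# working CUDA runtime.
pick_docling_device() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  else
    echo cpu
  fi
}

export DOCLING_DEVICE="$(pick_docling_device)"
export DOCLING_NUM_THREADS=4   # illustrative value; tune for the host
```

Leaving DOCLING_DEVICE unset keeps the default automatic selection, so an explicit value is mainly useful to force CPU or to make the choice visible in logs and manifests.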
docling-serve runs each conversion as an asynchronous job dispatched to a compute engine, chosen with DOCLING_SERVE_ENG_KIND:
- Local engine (local, the default): jobs run in an in-process thread pool inside the server. No external services; everything stays on one host. Tunable with DOCLING_SERVE_ENG_LOC_NUM_WORKERS (default 2) and DOCLING_SERVE_ENG_LOC_SHARE_MODELS (default false). Best for a single machine.
- RQ engine (rq): jobs are queued in Redis and executed by separate docling-serve rq-worker processes, so the API tier and the conversion workers scale independently. Best for horizontal scaling and higher throughput.

Quickstart with the docling-serve command and the Local engine:

```bash
pip install "docling-serve[ui]"
docling-serve run --enable-ui   # production-style: reload off, binds 0.0.0.0, UI off by default
# docling-serve dev             # dev: auto-reload, binds 127.0.0.1, UI on (localhost only)
```
API at http://localhost:5001, interactive docs at /docs, demo UI at /ui. Smoke test:
```bash
curl -X POST "http://localhost:5001/v1/convert/source/async" \
  -H "Content-Type: application/json" \
  -d '{"http_sources": [{"url": "https://arxiv.org/pdf/2501.17887"}]}'
```
!!! note
    The demo UI (--enable-ui / DOCLING_SERVE_ENABLE_UI) is a Gradio app; files it produces are cleared from its cache after ~10 hours. It is a demonstrator, not durable storage.
Same server, containerized. The shipped compose examples are all-in-one containers that don't set DOCLING_SERVE_ENG_KIND, so they run the default Local engine.
```bash
# Pure CPU (no compose)
podman run -p 5001:5001 -e DOCLING_SERVE_ENABLE_UI=1 quay.io/docling-project/docling-serve

# NVIDIA GPU
docker compose -f compose-nvidia.yaml up -d

# AMD GPU
docker compose -f compose-amd.yaml up -d
```
Compose manifests: compose-nvidia.yaml, compose-amd.yaml.
GPU prerequisites (host side; for the Python AcceleratorOptions view see GPU support and RTX GPU):
- NVIDIA: nvidia-container-toolkit + the nvidia container runtime.
- AMD: run make docling-serve-rocm-image. Detailed GID wiring: serve repo.

!!! note
    The compose files pin older image tags (-cu126:main, -rocm72:main) than the README image table; treat the README image table as the source of truth and adjust the image: line if needed. There is no shipped single-CPU compose file; use the podman one-liner for pure CPU.
The API enqueues jobs to Redis; conversion runs in separate docling-serve rq-worker processes.
```bash
# 1) Redis
docker run -p 6379:6379 redis:7-alpine

# 2) API server (enqueues jobs)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
  docling-serve run

# 3) one or more workers (do the conversion)
DOCLING_SERVE_ENG_KIND=rq \
DOCLING_SERVE_ENG_RQ_REDIS_URL=redis://localhost:6379/0 \
  docling-serve rq-worker
```
!!! warning
    The API alone accepts jobs but nothing runs them without at least one rq-worker. DOCLING_SERVE_ENG_RQ_REDIS_URL is required (no default) and must be identical across every API and worker process.
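
One way to keep the Redis URL identical everywhere on a single host is to export it once and start every process from the same shell. The sketch below assumes Redis is already running at the address used in the steps above; the helper name start_rq_stack and the default of two workers are hypothetical.

```bash
# Define the engine settings once; the server and the workers started from
# this shell all inherit the same values.
export DOCLING_SERVE_ENG_KIND=rq
export DOCLING_SERVE_ENG_RQ_REDIS_URL="redis://localhost:6379/0"

# start_rq_stack [N]: launch N workers in the background (default 2),
# then run the API server in the foreground.
start_rq_stack() {
  workers="${1:-2}"
  i=0
  while [ "$i" -lt "$workers" ]; do
    docling-serve rq-worker &
    i=$((i + 1))
  done
  docling-serve run
}
```

Calling `start_rq_stack 4` would start four workers and then the API server. The background workers keep running after the server exits and have to be stopped separately.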
These live in the docling-serve repo (run-time manifests aren't vendored here):
Prefer not to run any of this yourself? See the managed service.