deploy/paddleocr_vl_docker/hps/README_en.md
This directory provides a high-performance service deployment solution for PaddleOCR-VL-1.5 with concurrent request processing support.
This solution currently only supports NVIDIA GPUs. Support for other inference devices is still being developed.
Client → FastAPI Gateway → Triton Server → vLLM Server
| Component | Description |
|---|---|
| FastAPI Gateway | Unified access point, simplified client calls, concurrency control |
| Triton Server | Layout detection model (PP-DocLayoutV3) and pipeline orchestration; model management, dynamic batching, inference scheduling |
| vLLM Server | VLM (PaddleOCR-VL-1.5), continuous batching inference |
Triton Models:
| Model | Device | Description |
|---|---|---|
| layout-parsing | Inference device (e.g., GPU) | Layout parsing inference |
| restructure-pages | CPU | Multi-page result post-processing (cross-page table merging, title level reassignment) |
```bash
git clone https://github.com/PaddlePaddle/PaddleOCR.git
cd PaddleOCR/deploy/paddleocr_vl_docker/hps
bash prepare.sh
docker compose up
```
The above command will start 3 containers in sequence:
| Service | Description | Port |
|---|---|---|
| paddleocr-vl-api | FastAPI gateway (external entry point) | 8080 |
| paddleocr-vl-tritonserver | Triton inference server | 8000 (internal) |
| paddleocr-vlm-server | vLLM-based VLM inference service | 8080 (internal) |
The first startup will automatically download and build images, which takes longer. Subsequent startups will use local images and start faster.
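To confirm that the three containers are up and to see their port mappings, you can list them with Docker Compose:

```bash
# Show the status of the services defined in compose.yaml
docker compose ps
```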
Copy .env.example to .env and modify as needed.
```bash
cp .env.example .env
```
You can also set these as environment variables directly instead of using the .env file, e.g.:
```bash
export HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=8
```
| Variable | Default | Description |
|---|---|---|
| HPS_MAX_CONCURRENT_INFERENCE_REQUESTS | 16 | Max concurrent inference requests (layout parsing) |
| HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS | 64 | Max concurrent non-inference requests (page restructuring) |
| HPS_INFERENCE_TIMEOUT | 600 | Request timeout in seconds |
| HPS_HEALTH_CHECK_TIMEOUT | 5 | Health check timeout in seconds |
| HPS_VLM_URL | http://paddleocr-vlm-server:8080 | VLM server URL (for health checks) |
| HPS_LOG_LEVEL | INFO | Log level (DEBUG, INFO, WARNING, ERROR) |
| HPS_FILTER_HEALTH_ACCESS_LOG | true | Whether to filter health check access logs |
| UVICORN_WORKERS | 4 | Number of gateway worker processes |
| DEVICE_ID | 0 | Inference device ID to use |
To adjust pipeline configurations (such as model path, batch size, deployment device, etc.), please refer to the Pipeline Configuration section in the PaddleOCR-VL Usage Tutorial.
Please refer to the Client-Side Invocation section in the PaddleOCR-VL Usage Tutorial.
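As a quick smoke test against the gateway, a request might look like the sketch below. The endpoint path (/layout-parsing), the JSON fields (file, fileType), and the sample file name are assumptions made for illustration; use the request format documented in the tutorial.

```bash
# Hypothetical request: the endpoint path and JSON fields are assumptions, not
# the documented API; see the Client-Side Invocation section for the real schema.
curl -s -X POST "http://localhost:8080/layout-parsing" \
  -H "Content-Type: application/json" \
  -d "{\"file\": \"$(base64 -w0 sample.pdf)\", \"fileType\": 0}"
```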
```bash
# Liveness check
curl http://localhost:8080/health

# Readiness check (verifies Triton and VLM services are ready to process requests)
curl http://localhost:8080/health/ready
```
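When scripting a deployment, it can be convenient to block until the readiness endpoint reports success before sending traffic, for example:

```bash
# Poll the readiness endpoint until it returns a successful response
until curl -sf http://localhost:8080/health/ready > /dev/null; do
  echo "Waiting for the service to become ready..."
  sleep 5
done
echo "Service is ready."
```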
The gateway controls concurrency for inference and non-inference operations independently:
- HPS_MAX_CONCURRENT_INFERENCE_REQUESTS (default 16): Controls concurrency for inference operations such as layout-parsing
- HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS (default 64): Controls concurrency for non-inference operations such as restructure-pages (page restructuring)
High-throughput configuration example:
```bash
# .env
HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=32
HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS=128
UVICORN_WORKERS=8
```
Low-latency configuration example:
```bash
# .env
HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=8
HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS=32
HPS_INFERENCE_TIMEOUT=300
UVICORN_WORKERS=2
```
Each Uvicorn worker is an independent process with its own event loop; UVICORN_WORKERS controls how many such gateway processes handle incoming requests.
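If the concurrency limits are enforced per worker process (an assumption about the gateway's internals rather than documented behavior), the effective gateway-wide ceiling scales with the worker count:

```bash
# .env
# Assuming per-worker limits, the gateway-wide inference ceiling would be
# UVICORN_WORKERS x HPS_MAX_CONCURRENT_INFERENCE_REQUESTS = 4 x 16 = 64.
UVICORN_WORKERS=4
HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=16
```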
Triton automatically batches requests to improve inference device utilization. The maximum batch size is controlled by the max_batch_size parameter in the model configuration file (default: 8), located at config.pbtxt under each model directory in the model repository (e.g., model_repo/layout-parsing/config.pbtxt).
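As a rough sketch, raising the batch limit and tuning the batching delay might look like this (the max_queue_delay_microseconds value is illustrative, and the shipped config.pbtxt may already contain a dynamic_batching section with different settings):

```
# model_repo/layout-parsing/config.pbtxt
max_batch_size: 16
dynamic_batching {
  max_queue_delay_microseconds: 2000  # how long Triton may wait to fill a batch
}
```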
The number of parallel inference instances for each Triton model is configured via the instance_group section in config.pbtxt (default: 1). Increasing the instance count improves parallelism but consumes more device resources.
```
# model_repo/layout-parsing/config.pbtxt
instance_group [
  {
    count: 1      # Number of instances; increase for higher parallelism
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```
There is a trade-off between instance count and dynamic batching:
- Fewer instances (count: 1): Dynamic batching combines multiple requests into one batch for parallel execution, but all requests in the same batch must wait for the slowest one to finish before results are returned, which may increase latency for faster requests. Additionally, a single instance can only process one batch at a time; subsequent requests must queue until the current batch completes. Best suited for scenarios with limited GPU memory or uniform request processing times.
- More instances (count: 2+): Multiple instances can process different batches simultaneously, allowing more requests to be handled concurrently. This reduces queuing time and improves latency for individual requests. Note that within each instance, dynamic batching behavior still applies (requests in the same batch start and finish together). Each additional instance consumes an extra copy of the layout detection model's GPU memory, increases the load on the VLM inference service, and uses more CPU and system memory. Adjust based on the available resources of your inference device.
- Non-inference models (e.g., restructure-pages) run on CPU and can have their instance count increased based on available CPU cores (see the sketch after this list).
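For the CPU-side restructure-pages model, raising the instance count could look like the sketch below (assuming its config.pbtxt follows the same layout as the one above; KIND_CPU is Triton's CPU instance kind, and the count shown is illustrative):

```
# model_repo/restructure-pages/config.pbtxt
instance_group [
  {
    count: 4       # scale with available CPU cores
    kind: KIND_CPU
  }
]
```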
Check the logs for each service to identify the issue:
```bash
docker compose logs paddleocr-vl-api
docker compose logs paddleocr-vl-tritonserver
docker compose logs paddleocr-vlm-server
```
Common causes include port conflicts, unavailable inference devices, or image pull failures.
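A couple of quick checks can help narrow these down:

```bash
# Confirm container status and published ports
docker compose ps
# Confirm the NVIDIA GPU is visible on the host
nvidia-smi
```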
If requests time out or memory runs short, consider adjusting:
- HPS_INFERENCE_TIMEOUT (increase for complex documents)
- HPS_MAX_CONCURRENT_INFERENCE_REQUESTS (decrease to reduce load and memory pressure)
- shm_size in compose.yaml (default: 4GB)