PaddleOCR-VL-1.5 High-Performance Service Deployment

This directory provides a high-performance service deployment solution for PaddleOCR-VL-1.5 with concurrent request processing support.

This solution currently only supports NVIDIA GPUs. Support for other inference devices is still being developed.

Architecture

Client → FastAPI Gateway → Triton Server → vLLM Server
| Component | Description |
| --- | --- |
| FastAPI Gateway | Unified access point, simplified client calls, concurrency control |
| Triton Server | Layout detection model (PP-DocLayoutV3) and pipeline orchestration; model management, dynamic batching, inference scheduling |
| vLLM Server | VLM (PaddleOCR-VL-1.5), continuous batching inference |

Triton Models:

| Model | Device | Description |
| --- | --- | --- |
| layout-parsing | Inference device (e.g., GPU) | Layout parsing inference |
| restructure-pages | CPU | Multi-page result post-processing (cross-page table merging, title level reassignment) |

Requirements

  • x64 CPU
  • NVIDIA GPU, Compute Capability >= 8.0 and < 12.0
  • NVIDIA driver supporting CUDA 12.6
  • Docker >= 19.03
  • Docker Compose >= 2.0
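
To quickly check these prerequisites on the host, something like the following can help (a sketch; the compute_cap query requires a reasonably recent NVIDIA driver):

```bash
# GPU name, driver version, and compute capability
# (the compute_cap field needs a recent NVIDIA driver)
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv

# Docker and Docker Compose versions
docker --version
docker compose version
```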

Quick Start

  1. Clone the PaddleOCR repository and navigate to this directory:

     ```bash
     git clone https://github.com/PaddlePaddle/PaddleOCR.git
     cd PaddleOCR/deploy/paddleocr_vl_docker/hps
     ```

  2. Prepare the necessary files:

     ```bash
     bash prepare.sh
     ```

  3. Start the services:

     ```bash
     docker compose up
     ```

The above command will start 3 containers in sequence:

| Service | Description | Port |
| --- | --- | --- |
| paddleocr-vl-api | FastAPI gateway (external entry point) | 8080 |
| paddleocr-vl-tritonserver | Triton inference server | 8000 (internal) |
| paddleocr-vlm-server | vLLM-based VLM inference service | 8080 (internal) |

The first startup will automatically download and build images, which takes longer. Subsequent startups will use local images and start faster.

Configuration

Environment Variables

Copy .env.example to .env and modify as needed.

```bash
cp .env.example .env
```

You can also set these as environment variables directly instead of using the .env file, e.g.:

```bash
export HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=8
```

| Variable | Default | Description |
| --- | --- | --- |
| HPS_MAX_CONCURRENT_INFERENCE_REQUESTS | 16 | Max concurrent inference requests (layout parsing) |
| HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS | 64 | Max concurrent non-inference requests (page restructuring) |
| HPS_INFERENCE_TIMEOUT | 600 | Request timeout in seconds |
| HPS_HEALTH_CHECK_TIMEOUT | 5 | Health check timeout in seconds |
| HPS_VLM_URL | http://paddleocr-vlm-server:8080 | VLM server URL (for health checks) |
| HPS_LOG_LEVEL | INFO | Log level (DEBUG, INFO, WARNING, ERROR) |
| HPS_FILTER_HEALTH_ACCESS_LOG | true | Whether to filter health check access logs |
| UVICORN_WORKERS | 4 | Number of gateway worker processes |
| DEVICE_ID | 0 | Inference device ID to use |

Pipeline Configuration

To adjust pipeline configurations (such as model path, batch size, deployment device, etc.), please refer to the Pipeline Configuration section in the PaddleOCR-VL Usage Tutorial.

API Usage

Document Parsing

Please refer to the Client-Side Invocation section in the PaddleOCR-VL Usage Tutorial.
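
As a rough illustration, a request to the gateway might look like the following. The endpoint path and JSON fields used here (/layout-parsing, file, fileType) follow the common PaddleOCR serving convention and are assumptions rather than values taken from this repository; confirm the exact contract in the tutorial linked above.

```bash
# Illustrative sketch only -- the endpoint path and field names are
# assumptions; see the tutorial above for authoritative client examples.
BASE64_DATA=$(base64 -w 0 demo.pdf)
curl -s http://localhost:8080/layout-parsing \
  -H "Content-Type: application/json" \
  -d "{\"file\": \"${BASE64_DATA}\", \"fileType\": 0}"
```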

Health Checks

```bash
# Liveness check
curl http://localhost:8080/health

# Readiness check (verifies Triton and VLM services are ready to process requests)
curl http://localhost:8080/health/ready
```
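
For example, a startup script can block until the gateway reports ready, using the readiness endpoint above:

```bash
# Poll the readiness endpoint until all backing services are up
until curl -sf http://localhost:8080/health/ready > /dev/null; do
  echo "Waiting for services to become ready..."
  sleep 5
done
echo "Services are ready."
```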

Performance Tuning

Concurrency Settings

The gateway controls concurrency for inference and non-inference operations independently:

  • HPS_MAX_CONCURRENT_INFERENCE_REQUESTS (default 16): Controls concurrency for inference operations such as layout-parsing
    • Too low (e.g., 4): the inference device is underutilized and requests queue unnecessarily
    • Too high (e.g., 64): may overload Triton, causing OOM or timeouts
    • The default of 16 lets enough requests queue for the next batch while the current batch is being processed
    • If inference device resources are limited, consider lowering this value
  • HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS (default 64): Controls concurrency for non-inference operations such as restructure-pages (page restructuring)
    • Non-inference operations do not consume inference device resources and can be set to a higher concurrency level
    • Adjust based on available CPU cores and memory

High-throughput configuration example:

```bash
# .env
HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=32
HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS=128
UVICORN_WORKERS=8
```

Low-latency configuration example:

```bash
# .env
HPS_MAX_CONCURRENT_INFERENCE_REQUESTS=8
HPS_MAX_CONCURRENT_NON_INFERENCE_REQUESTS=32
HPS_INFERENCE_TIMEOUT=300
UVICORN_WORKERS=2
```

Worker Processes

Each Uvicorn worker is an independent process with its own event loop:

  • 1 worker: Simple, but limited to a single process
  • 4 workers: Suitable for most scenarios
  • 8+ workers: Suitable for high-concurrency scenarios with many small requests

Triton Dynamic Batching

Triton automatically batches requests to improve inference device utilization. The maximum batch size is controlled by the max_batch_size parameter in the model configuration file (default: 8), located at config.pbtxt under each model directory in the model repository (e.g., model_repo/layout-parsing/config.pbtxt).
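
For reference, the batching-related part of a model's config.pbtxt looks roughly like this (an illustrative fragment; the real file contains more fields, and the dynamic_batching values shown are generic Triton options rather than settings taken from this repository):

```
# model_repo/layout-parsing/config.pbtxt (fragment, illustrative)
max_batch_size: 8

dynamic_batching {
  # How long Triton may hold a request while waiting to form a larger batch
  max_queue_delay_microseconds: 100
}
```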

Triton Instance Count

The number of parallel inference instances for each Triton model is configured via the instance_group section in config.pbtxt (default: 1). Increasing the instance count improves parallelism but consumes more device resources.

```
# model_repo/layout-parsing/config.pbtxt
instance_group [
  {
      count: 1       # Number of instances; increase for higher parallelism
      kind: KIND_GPU
      gpus: [ 0 ]
  }
]
```

There is a trade-off between instance count and dynamic batching:

  • Single instance (count: 1): Dynamic batching combines multiple requests into one batch for parallel execution, but all requests in the same batch must wait for the slowest one to finish before results are returned, which may increase latency for faster requests. Additionally, a single instance can only process one batch at a time — subsequent requests must queue until the current batch completes. Best suited for scenarios with limited GPU memory or uniform request processing times
  • Multiple instances (count: 2+): Multiple instances can process different batches simultaneously, allowing more requests to be handled concurrently. This reduces queuing time and improves latency for individual requests. Note that within each instance, dynamic batching behavior still applies (requests in the same batch start and finish together). Each additional instance consumes an extra copy of the layout detection model's GPU memory, increases the load on the VLM inference service, and uses more CPU and system memory. Adjust based on the available resources of your inference device

Non-inference models (e.g., restructure-pages) run on CPU and can have their instance count increased based on available CPU cores.
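
For example (a sketch; the count of 4 is illustrative and should track your CPU core budget):

```
# model_repo/restructure-pages/config.pbtxt (fragment, illustrative)
instance_group [
  {
      count: 4       # e.g., scale with available CPU cores
      kind: KIND_CPU
  }
]
```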

Troubleshooting

Service Fails to Start

Check the logs for each service to identify the issue:

```bash
docker compose logs paddleocr-vl-api
docker compose logs paddleocr-vl-tritonserver
docker compose logs paddleocr-vlm-server
```

Common causes include port conflicts, unavailable inference devices, or image pull failures.
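
For instance, if port 8080 is already taken on the host, change the host side of the gateway's port mapping in compose.yaml (a sketch; 9090 is an arbitrary free port):

```yaml
# compose.yaml -- gateway service fragment (illustrative)
services:
  paddleocr-vl-api:
    ports:
      - "9090:8080"   # host:container
```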

Timeout Errors

  • Increase HPS_INFERENCE_TIMEOUT (for complex documents)
  • If the inference device is overloaded, reduce HPS_MAX_CONCURRENT_INFERENCE_REQUESTS

Out of Memory

  • Reduce HPS_MAX_CONCURRENT_INFERENCE_REQUESTS
  • Ensure only one service runs per inference device
  • Check shm_size in compose.yaml (default: 4GB); see the fragment below
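
The relevant fragment looks roughly like this (a sketch; which service carries shm_size in your compose.yaml may differ):

```yaml
# compose.yaml -- fragment (illustrative)
services:
  paddleocr-vl-tritonserver:
    shm_size: "4gb"   # increase if shared-memory errors appear
```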