PaddleOCR-VL NVIDIA Blackwell-Architecture GPUs Usage Tutorial


INFO: Unless otherwise specified, the term "PaddleOCR-VL" in this tutorial refers to the PaddleOCR-VL model series (e.g., PaddleOCR-VL-1.6). References specific to the PaddleOCR-VL v1 version will be explicitly noted.

This tutorial provides guidance on using PaddleOCR-VL on NVIDIA Blackwell-architecture GPUs, covering the complete workflow from environment preparation to service deployment.

NVIDIA Blackwell-architecture GPUs include, but are not limited to:

  • RTX 5090
  • RTX 5080
  • RTX 5070, RTX 5070 Ti
  • RTX 5060, RTX 5060 Ti
  • RTX 5050

PaddleOCR-VL has been verified for accuracy and speed on the RTX 5070. Because hardware varies, compatibility with other NVIDIA Blackwell-architecture GPUs has not yet been confirmed. Community testing on other hardware configurations, and reports of the results, are welcome.

Before starting the tutorial, please ensure that your NVIDIA driver supports CUDA 12.9 or higher.
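One way to check this requirement is to read the CUDA version that `nvidia-smi` prints in its banner and compare it with 12.9. The following is a minimal sketch, not an official check: the helper names `version_ge` and `parse_cuda_version` are hypothetical, and the comparison relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 >= version $2 (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: extracts the "CUDA Version: X.Y" field from the
# nvidia-smi banner read on stdin.
parse_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v nvidia-smi >/dev/null 2>&1; then
    driver_cuda="$(nvidia-smi 2>/dev/null | parse_cuda_version)"
    if version_ge "${driver_cuda:-0}" "12.9"; then
        echo "OK: the driver supports CUDA ${driver_cuda}"
    else
        echo "The driver reports CUDA ${driver_cuda:-unknown}; 12.9 or higher is needed" >&2
    fi
else
    echo "nvidia-smi not found; cannot check the driver" >&2
fi
```

The version shown by `nvidia-smi` is the highest CUDA version the installed driver supports, which is the value this requirement refers to.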

Workflow Guide for This Hardware

Use this guide for the workflows below.

| Goal | Support on this hardware | Read this section |
| --- | --- | --- |
| Local direct inference | Supported | Read Section 1. Local Runtime Environment Preparation and Section 2. Quick Start. |
| Client + VLM inference service | Supported | Complete local direct inference first, then read Section 3. Using VLM Inference Services. |
| Full API service | Supported with Docker Compose or manual deployment | Use Section 4.1 for Docker Compose, or Section 4.2 for manual deployment (complete Section 1. Local Runtime Environment Preparation first), then continue with the Section 4.3 client invocation section and the Section 4.4 pipeline configuration section. |
| Model fine-tuning | Supported | Read Section 5. Model Fine-Tuning. |

If you only need to confirm which inference methods are available on this hardware, refer to the PaddleOCR-VL Inference Method and Hardware Support Matrix in the main guide.

1. Local Runtime Environment Preparation

Local Runtime Environment Setup Methods Supported on This Hardware

| Local runtime environment setup method | Status | Notes |
| --- | --- | --- |
| Official Docker image | Supported with steps in this guide | Continue with Section 1.1. |
| Manually install the inference engine and PaddleOCR | Supported with steps in this guide | Continue with Section 1.2. |

This section introduces how to set up the PaddleOCR-VL local runtime environment using one of the following two methods:

  • Method 1: Use the official Docker image.

  • Method 2: Manually install the inference engine and PaddleOCR.

We strongly recommend using the Docker image to minimize potential environment-related issues.

1.1 Method 1: Using Docker Image

We recommend using the official Docker image (requires Docker version >= 19.03, GPU-equipped machine with NVIDIA driver supporting CUDA 12.9 or higher):

shell
docker run \
    -it \
    --gpus all \
    --network host \
    --user root \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-sm120 \
    /bin/bash
# Call PaddleOCR CLI or Python API in the container

If you wish to use PaddleOCR-VL in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-sm120 (image size approximately 10 GB) in the above command with the offline version image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-sm120-offline (image size approximately 12 GB).

TIP: Images with the latest-xxx tag correspond to the latest version. If the corresponding latest image already exists locally and you want the newest features or fixes, we recommend running docker pull again before using it. If you want to use an image corresponding to a specific PaddleOCR version, you can replace latest in the tag with the desired version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:paddleocr3.3-nvidia-gpu-sm120-offline
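The tag pattern described in the tip can be sketched as a small helper that assembles an image reference from its parts. `image_ref` is a hypothetical function written for illustration; it only builds the string and does not check that a given version was actually published for this hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper that composes an image reference.
#   $1: image name (e.g. paddleocr-vl)
#   $2: version prefix, "latest" or "paddleocr<major>.<minor>"
#   $3: "offline" to select the offline variant (optional)
image_ref() {
    local registry="ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle"
    local ref="${registry}/$1:$2-nvidia-gpu-sm120"
    if [ "${3:-}" = "offline" ]; then
        ref="${ref}-offline"
    fi
    printf '%s\n' "$ref"
}

# Prints the pinned offline reference used as the example in the tip.
image_ref paddleocr-vl paddleocr3.3 offline
```

The printed reference can then be passed to `docker pull` or substituted into the `docker run` command above.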

1.2 Method 2: Manually Install the Inference Engine and PaddleOCR

If Docker is not an option, you can manually install the inference engine and PaddleOCR. This guide documents Python 3.9–3.13 as the verified range.

This guide provides PaddlePaddle installation steps. To use Transformers or other inference engines, see Section 1.2 of the main tutorial.

We strongly recommend installing PaddleOCR-VL in a virtual environment to avoid dependency conflicts. For example, create a virtual environment using Python's standard venv library:

shell
# Create a virtual environment
python -m venv .venv_paddleocr
# Activate the environment
source .venv_paddleocr/bin/activate
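Because the verified interpreter range is Python 3.9 to 3.13, it can help to confirm the version of the active interpreter before installing anything. This is a minimal sketch; `version_in_range` is a hypothetical helper and depends on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 lies in the closed
# range [$2, $3] (uses GNU sort -V for version ordering).
version_in_range() {
    local lowest highest
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    highest="$(printf '%s\n%s\n' "$1" "$3" | sort -V | tail -n 1)"
    [ "$lowest" = "$2" ] && [ "$highest" = "$3" ]
}

if command -v python >/dev/null 2>&1; then
    # Ask the interpreter itself for its major.minor version.
    py_version="$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    if version_in_range "$py_version" "3.9" "3.13"; then
        echo "Python ${py_version} is within the documented range"
    else
        echo "Python ${py_version} is outside the documented 3.9-3.13 range" >&2
    fi
else
    echo "No 'python' executable found on PATH" >&2
fi
```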

Run the following commands to complete the installation:

shell
# Note that PaddlePaddle for cu129 is being installed here
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
python -m pip install -U "paddleocr[doc-parser]"

Please ensure that PaddlePaddle framework version 3.2.1 or higher is installed.
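To confirm the installed version and that PaddlePaddle can use the GPU, one option is to print `paddle.__version__` and run PaddlePaddle's built-in `paddle.utils.run_check()`. The wrapper below is a sketch; `paddle_check` is a hypothetical name, and the return code 2 is an arbitrary choice made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs PaddlePaddle's installation self-check with the
# given interpreter (default: python). Returns 2 if paddle cannot be imported.
paddle_check() {
    local py="${1:-python}"
    if ! "$py" -c 'import paddle' >/dev/null 2>&1; then
        echo "paddle is not importable with: $py" >&2
        return 2
    fi
    # Print the installed version, then let PaddlePaddle verify the install.
    "$py" -c 'import paddle; print("paddle", paddle.__version__); paddle.utils.run_check()'
}

# Usage (inside the activated virtual environment):
#   paddle_check
```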

2. Quick Start

Please refer to PaddleOCR-VL Usage Tutorial - 2. Quick Start.

3. Using VLM Inference Services

This section explains how to connect PaddleOCR-VL to a dedicated VLM inference service backend. On this hardware, a dedicated service is typically used in production to get better inference performance than the default configuration provides. The examples in this hardware-specific guide use vLLM and SGLang as the backends for the VLM inference service.

3.1 Starting the VLM Inference Service

IMPORTANT: The service started according to this section is responsible only for the VLM inference stage in the PaddleOCR-VL workflow. It does not provide a complete end-to-end document parsing API. We strongly recommend that you do not call this service directly via HTTP requests or OpenAI clients to process document images. If you need to deploy a service with the full PaddleOCR-VL capabilities, refer to the service deployment section later in this document.

Launch Methods Supported on This Hardware

| Launch method | Status | Notes |
| --- | --- | --- |
| Official Docker image | Supported with steps in this guide | Continue with Section 3.1.1. |
| Install dependencies with the PaddleOCR CLI and launch the service | Supported with steps in this guide | Continue with Section 3.1.2. |
| Launch the service directly with the acceleration framework | Not verified | This hardware can start the VLM inference service through the vLLM or SGLang backend, but launching directly with the native framework has not been verified. |

There are two methods to start the VLM inference service; choose one:

  • Method 1: Start the service using the official Docker image.

  • Method 2: Manually install dependencies and start the service via PaddleOCR CLI.

We strongly recommend using the Docker image to minimize potential environment-related issues.

3.1.1 Method 1: Using Docker Image

PaddleOCR provides a Docker image for quickly starting the vLLM inference service. Use the following command to start the service (requires Docker version >= 19.03, GPU-equipped machine with NVIDIA driver supporting CUDA 12.9 or higher):

shell
docker run \
    -it \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-sm120 \
    paddleocr genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm

If you wish to start the service in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-sm120 (image size approximately 13 GB) in the above command with the offline version image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-sm120-offline (image size approximately 15 GB).

The vLLM inference service starts with a set of default parameters. To adjust parameters such as GPU memory usage, refer to 3.3.1 Server-side Parameter Adjustment to create a configuration file, mount the file into the container, and pass it with --backend_config in the startup command, for example:

shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    -v ./vllm_config.yml:/tmp/vllm_config.yml \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-sm120 \
    paddleocr genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /tmp/vllm_config.yml
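For illustration, the mounted `vllm_config.yml` could be created as follows. This is a sketch under assumptions: the key names are assumed to mirror vLLM's command-line options (`--gpu-memory-utilization`, `--max-num-seqs`), the values are placeholders rather than recommendations, and `write_vllm_config` is a hypothetical helper. Section 3.3.1 of the main tutorial remains the authoritative description of the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes a small backend configuration file to the given
# path (default: ./vllm_config.yml). Values are placeholders to tune.
write_vllm_config() {
    local path="${1:-./vllm_config.yml}"
    cat > "$path" <<'EOF'
# Fraction of GPU memory that vLLM may reserve (assumed key name).
gpu-memory-utilization: 0.5
# Upper bound on sequences processed concurrently (assumed key name).
max-num-seqs: 128
EOF
}

# Create the file next to where `docker run` will be invoked.
write_vllm_config ./vllm_config.yml
```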

TIP: Images with the latest-xxx tag correspond to the latest version. If the corresponding latest image already exists locally and you want the newest features or fixes, we recommend running docker pull again before using it. If you want to use an image corresponding to a specific PaddleOCR version, you can replace latest in the tag with the desired version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:paddleocr3.3-nvidia-gpu-sm120-offline

3.1.2 Method 2: Installation and Usage via PaddleOCR CLI

Since inference acceleration frameworks may conflict with packages already installed in the current environment, it is recommended to install them in a virtual environment:

shell
# If a virtual environment is currently activated, deactivate it first using `deactivate`
# Create a virtual environment
python -m venv .venv_vlm
# Activate the environment
source .venv_vlm/bin/activate

vLLM and SGLang depend on FlashAttention, and installing FlashAttention may require CUDA compilation tools such as nvcc. If these tools are not available in your environment (for example, when using the paddleocr-vl image), you can obtain a prebuilt FlashAttention package (version 2.8.3 required) from this repository, install it first, and then proceed with subsequent commands. For example, in the paddleocr-vl image, run python -m pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.3+cu128torch2.8-cp310-cp310-linux_x86_64.whl. This step is not required for FastDeploy.

Install PaddleOCR and the dependencies of inference acceleration services, using vLLM as an example:

shell
# Install PaddleOCR
python -m pip install "paddleocr[doc-parser]"
# Install inference acceleration service dependencies
paddleocr install_genai_server_deps vllm

Usage of the paddleocr install_genai_server_deps command:

shell
paddleocr install_genai_server_deps <inference acceleration framework name>

Currently supported framework names are vllm and sglang, corresponding to vLLM and SGLang, respectively.

WARNING: The transformers library versions required by vLLM, SGLang, and the Transformers engine are currently incompatible, so the Transformers engine cannot be installed in the same environment as vLLM or SGLang. If you use Transformers + vLLM or Transformers + SGLang inference, deploy the layout analysis model and the VLM service in different environments.

After installation, you can start the service using the paddleocr genai_server command:

shell
paddleocr genai_server --model_name PaddleOCR-VL-1.6-0.9B --backend vllm --port 8118

The parameters supported by this command are as follows:

| Parameter | Description |
| --- | --- |
| `--model_name` | Name of the model |
| `--model_dir` | Directory containing the model |
| `--host` | Server hostname |
| `--port` | Server port number |
| `--backend` | Backend name, i.e., the name of the inference acceleration framework being used; options are vllm or sglang |
| `--backend_config` | YAML file specifying backend configuration |
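After launching the service, a script may need to wait until the chosen port accepts connections before starting a client. The sketch below polls the port with bash's `/dev/tcp` redirection; `wait_for_port` is a hypothetical helper, and the default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: polls until a TCP connection to host:port succeeds or
# the timeout (in seconds) expires. Needs bash, not plain sh, for /dev/tcp.
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-120}" waited=0
    until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage, matching the port used in the commands above:
#   wait_for_port 127.0.0.1 8118 300 && echo "port 8118 is accepting connections"
```

An open port only shows that the server process is listening; it is not, by itself, proof that every request will succeed.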

3.2 Client Usage

For client-side invocation methods, please refer to PaddleOCR-VL Usage Tutorial - 3.2 Client Usage Methods. If you run the client on this hardware, make sure to specify device="gpu".
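As a rough sketch of a CLI client call against the service started in Section 3.1, the wrapper below assembles a `paddleocr doc_parser` command. The flag names (`--vl_rec_backend`, `--vl_rec_server_url`, `--device`) and the `/v1` URL suffix are assumptions based on the main tutorial's client examples; confirm them with `paddleocr doc_parser --help`. `parse_with_vlm_server` and the `DRY_RUN` variable are hypothetical names used only here.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: builds a doc_parser call that sends the VLM stage to a
# running inference service. With DRY_RUN=1 it prints the command instead of
# executing it.
parse_with_vlm_server() {
    local input="$1" server_url="${2:-http://127.0.0.1:8118/v1}"
    local cmd=(paddleocr doc_parser
        --input "$input"
        --vl_rec_backend vllm-server
        --vl_rec_server_url "$server_url"
        --device gpu)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Usage:
#   DRY_RUN=1 parse_with_vlm_server ./document.png
```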

3.3 Performance Tuning

Please refer to PaddleOCR-VL Usage Tutorial - 3.3 Performance Tuning.

4. Service Deployment

Deployment Methods Supported on This Hardware

| Deployment method | Status | Notes |
| --- | --- | --- |
| Docker Compose deployment | Supported with steps in this guide | Continue with Section 4.1. |
| Manual deployment | Supported | Complete Section 1. Local Runtime Environment Preparation first, then continue with Section 4.2. |

This section explains how to deploy PaddleOCR-VL as a service and invoke it. There are two methods available; choose one:

  • Method 1: Deploy using Docker Compose.

  • Method 2: Manually install dependencies for deployment.

IMPORTANT: The PaddleOCR-VL service introduced in this section differs from the VLM inference service in the previous section. The VLM inference service handles only one stage of the complete process (VLM inference) and is called by the PaddleOCR-VL service as an underlying service.

4.1 Method 1: Deploy Using Docker Compose

  1. Download the Compose file and the environment variable configuration file separately from here and here to your local machine.

  2. Execute the following command in the directory containing the compose.yaml and .env files to start the server, which will listen on port 8080 by default:

    shell
    # Must be executed in the directory containing compose.yaml and .env files
    docker compose up
    

    TIP: The image tags used by compose.yaml are usually controlled by API_IMAGE_TAG_SUFFIX and VLM_IMAGE_TAG_SUFFIX in .env, and default to tags such as latest-nvidia-gpu-offline. To make sure you pull the newest latest images, run docker compose pull in the current directory before docker compose up. To use an image corresponding to a specific PaddleOCR version, replace latest in these variables with paddleocr<major>.<minor>, for example paddleocr3.3-nvidia-gpu-offline.

    After startup, you will see output similar to the following:

    text
    paddleocr-vl-api             | INFO:     Started server process [1]
    paddleocr-vl-api             | INFO:     Waiting for application startup.
    paddleocr-vl-api             | INFO:     Application startup complete.
    paddleocr-vl-api             | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
    

This method accelerates VLM inference using the vLLM framework and is more suitable for production environment deployment.

Additionally, after starting the server in this manner, no internet connection is required except for image pulling. For deployment in an offline environment, you can first pull the images involved in the Compose file on a connected machine, export them, and transfer them to the offline machine for import to start the service in an offline environment.
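The offline transfer described above can be sketched with `docker compose config --images`, `docker save`, and `docker load`. The function names and the archive naming scheme below are hypothetical choices for this example; run the export in the directory that contains compose.yaml and .env.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns an image reference into a filesystem-safe
# archive name by replacing "/" and ":" with "_".
archive_name() {
    printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# On the connected machine: pull and export every image referenced by the
# Compose file in the current directory.
export_compose_images() {
    local image
    docker compose pull
    docker compose config --images | while read -r image; do
        docker save -o "$(archive_name "$image")" "$image"
    done
}

# On the offline machine: import every archive found in the current directory.
import_images() {
    local archive
    for archive in ./*.tar; do
        docker load -i "$archive"
    done
}
```

After copying the `.tar` files together with compose.yaml and .env to the offline machine, `import_images` followed by `docker compose up` should start the service without pulling.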

Docker Compose starts two containers sequentially by reading configurations from the .env and compose.yaml files, running the underlying VLM inference service and the PaddleOCR-VL service (pipeline service) respectively.

The meanings of each environment variable contained in the .env file are as follows:

- `API_IMAGE_TAG_SUFFIX`: The tag suffix of the image used to launch the pipeline service.
- `VLM_BACKEND`: The VLM inference backend.
- `VLM_IMAGE_TAG_SUFFIX`: The tag suffix of the image used to launch the VLM inference service.
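To pin both services to a specific PaddleOCR version, the two tag-suffix variables can be rewritten in place. This is a minimal sketch that assumes the .env file uses plain `KEY=value` lines whose values start with `latest`; `pin_image_version` is a hypothetical helper, and it leaves a `.bak` copy of the original file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replaces the leading "latest" in API_IMAGE_TAG_SUFFIX
# and VLM_IMAGE_TAG_SUFFIX with the given version (e.g. paddleocr3.3).
# Other variables, such as VLM_BACKEND, are left untouched.
pin_image_version() {
    local env_file="$1" version="$2"
    sed -E -i.bak "s/^((API|VLM)_IMAGE_TAG_SUFFIX=)latest/\1${version}/" "$env_file"
}

# Usage, in the directory containing compose.yaml and .env:
#   pin_image_version .env paddleocr3.3
#   docker compose pull
```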

You can modify compose.yaml to meet custom requirements, for example:

<details> <summary>1. Change the port of the PaddleOCR-VL service</summary>

Edit <code>paddleocr-vl-api.ports</code> in the <code>compose.yaml</code> file to change the port. For example, if you need to change the service port to 8111, make the following modifications:

diff
  paddleocr-vl-api:
    ...
    ports:
-     - 8080:8080
+     - 8111:8080
    ...
</details> <details> <summary>2. Specify the GPU used by the PaddleOCR-VL service</summary>

Edit <code>device_ids</code> under <code>deploy.resources.reservations.devices</code> in the <code>compose.yaml</code> file to change the GPU used. For example, to deploy on GPU 1, make the following modifications to both services:

diff
  paddleocr-vl-api:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
  paddleocr-vlm-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
</details> <details> <summary>3. Adjust VLM server-side configuration</summary>

If you want to adjust the VLM server configuration, refer to <a href="./PaddleOCR-VL.en.md#331-server-side-parameter-adjustment">3.3.1 Server Parameter Adjustment</a> to generate a configuration file.

After generating the configuration file, add the following <code>paddleocr-vlm-server.volumes</code> and <code>paddleocr-vlm-server.command</code> fields to your <code>compose.yaml</code>. Replace <code>/path/to/your_config.yaml</code> with your actual configuration file path.

yaml
  paddleocr-vlm-server:
    ...
    volumes:
      - /path/to/your_config.yaml:/home/paddleocr/vlm_server_config.yaml
    command: paddleocr genai_server --model_name PaddleOCR-VL-1.6-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /home/paddleocr/vlm_server_config.yaml
    ...
</details> <details> <summary>4. Adjust pipeline-related configurations (such as model path, batch size, deployment device, etc.)</summary>

Refer to the <a href="./PaddleOCR-VL.en.md#44-pipeline-configuration-adjustment-instructions">4.4 Pipeline Configuration Adjustment Instructions</a> section.

</details>

4.2 Method 2: Manual Deployment

Please complete Section 1. Local Runtime Environment Preparation first, then refer to PaddleOCR-VL Usage Tutorial - 4.2 Method 2: Manual Deployment.

4.3 Client Invocation Methods

Please refer to PaddleOCR-VL Usage Tutorial - 4.3 Client Invocation Methods.

4.4 Pipeline Configuration Adjustment Instructions

Please refer to PaddleOCR-VL Usage Tutorial - 4.4 Pipeline Configuration Adjustment Instructions.

5. Model Fine-Tuning

Please refer to PaddleOCR-VL Usage Tutorial - 5. Model Fine-Tuning.