
PaddleOCR-VL Usage Tutorial

INFO: PaddleOCR provides a unified interface for the PaddleOCR-VL model series to facilitate quick setup and usage. Unless otherwise specified, the term "PaddleOCR-VL" in this tutorial and related hardware usage tutorials refers to the PaddleOCR-VL model series (e.g., PaddleOCR-VL-1.5). References specific to the PaddleOCR-VL v1 version will be explicitly noted.

PaddleOCR-VL is an advanced and efficient document parsing model designed specifically for element recognition in documents. Taking its initial version (PaddleOCR-VL v1) as an example, its core component is PaddleOCR-VL-0.9B, a compact yet powerful Vision-Language Model (VLM) that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable precise element recognition. The model series supports 109 languages and excels at recognizing complex elements (such as text, tables, formulas, and charts) while maintaining extremely low resource consumption. Comprehensive evaluations on widely used public benchmarks and internal benchmarks demonstrate that PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions, multimodal document-parsing models, and advanced general-purpose multimodal large models, while offering faster inference. These advantages make it highly suitable for deployment in real-world scenarios.

On January 29, 2026, we released PaddleOCR-VL-1.5. PaddleOCR-VL-1.5 not only significantly improves accuracy on the OmniDocBench v1.5 evaluation set, reaching 94.5%, but also adds support for localizing irregularly shaped bounding boxes. As a result, it performs strongly in real-world scenarios such as skew, warping, screen photography, varied illumination, and scanning. In addition, the model adds seal (stamp) recognition and text detection and recognition capabilities, with key metrics continuing to lead the industry.

Process Guide

You can first choose a reading path based on your goal, and then confirm whether you should continue with this tutorial or switch to the corresponding hardware-specific tutorial for the same chapter.

Before getting started, we recommend first identifying your device type:

  • x64 CPU: You can read this tutorial directly.
  • NVIDIA GPU:
    • If you are using a Blackwell-architecture GPU such as the RTX 50 series, we recommend first continuing with this process guide to determine your goal, and then referring to the corresponding chapters in the PaddleOCR-VL NVIDIA Blackwell Architecture GPU Usage Tutorial.
    • For other NVIDIA GPUs, you can read this tutorial directly.
  • Apple Silicon, Kunlunxin XPU, Hygon DCU, MetaX GPU, Iluvatar GPU, and Huawei Ascend NPU: We recommend first continuing with this process guide to determine your goal, and then referring to the corresponding chapters in the dedicated tutorial for your hardware.

If you need to confirm which inference methods PaddleOCR-VL supports on your current hardware (for example, using the PaddlePaddle framework as the inference engine) before proceeding along the path described above, first read the next section, “Inference Device Support for PaddleOCR-VL”.

After confirming the above, choose your reading path based on your goal:

  1. Local Direct Inference (Quick Experience / Script Integration):

    Suitable for directly calling PaddleOCR-VL on the local machine through the PaddleOCR CLI or Python API. This category usually corresponds to local inference engine methods such as PaddlePaddle or Transformers.

    Please read 1. Environment Preparation and 2. Quick Start, or the corresponding chapters in the hardware-specific tutorial.

  2. Client with a VLM Inference Service (Performance-Focused):

    Suitable for offloading only the VLM stage to a dedicated inference service for better performance. You can either deploy your own VLM inference service based on backends such as vLLM, SGLang, FastDeploy, MLX-VLM, and llama.cpp, or directly use a compatible managed service. This category usually corresponds to combinations of "Layout Detection Inference Method + VLM Inference Service".

    It is recommended to first complete the basic local direct inference flow described in the previous item, and then continue with 3. Improving Inference Performance with VLM Inference Services or the corresponding chapters in the hardware-specific tutorial.

    Note that Section 3 launches a VLM inference service, not the full PaddleOCR-VL API service. Other stages such as layout detection are still executed on the client side.

  3. Deploy the Full API Service:

    Suitable for packaging the full PaddleOCR-VL capability as a web service so that the client only needs to call it through an HTTP interface. Unlike the previous option, what is deployed here is an API service that directly exposes the complete PaddleOCR-VL capability, rather than a backend service that is only responsible for VLM inference. If you do not have special requirements for concurrent request processing, choose either of the following:

    • Deployment using Docker Compose (one-click startup, recommended): this uses the "PaddlePaddle + VLM Inference Service" inference method, where the underlying VLM service uses an inference acceleration framework. Please read 4.1 Method 1: Deploy Using Docker Compose and 4.3 Client-Side Invocation, or the corresponding chapters in the hardware-specific tutorial.
    • Manual deployment: by default, this uses PaddlePaddle inference. You can also switch to Transformers, or configure a VLM inference service to form a "Layout Detection Inference Method + VLM Inference Service" combination. Please read 1. Environment Preparation, 4.2 Method 2: Manual Deployment, and 4.3 Client-Side Invocation, or the corresponding chapters in the hardware-specific tutorial.

    For concurrent request processing, please refer to the High-Performance Service Deployment solution.

  4. Model Fine-tuning:

    If you find that the accuracy of PaddleOCR-VL in specific business scenarios does not meet expectations, please read 5. Model Fine-tuning or the corresponding chapters in the hardware-specific tutorial.

Hardware-specific usage tutorials:

<table> <thead> <tr> <th>Hardware Type</th> <th>Usage Tutorial</th> </tr> </thead> <tbody> <tr> <td>x64 CPU</td> <td>This tutorial (currently supports manual dependency installation only)</td> </tr> <tr> <td>NVIDIA GPU</td> <td>NVIDIA Blackwell architecture GPUs (such as the RTX 50 series): PaddleOCR-VL NVIDIA Blackwell Architecture GPU Usage Tutorial</td> </tr> </tbody> </table>

Inference Device Support for PaddleOCR-VL

PaddleOCR-VL currently provides multiple inference methods, and the supported inference devices are not exactly the same. Please confirm that your inference device meets the requirements in the table below before deploying PaddleOCR-VL:

<table border="1"> <thead> <tr> <th>Inference Method</th> <th>NVIDIA GPU</th> <th>Kunlunxin XPU</th> <th>Hygon DCU</th> <th>MetaX GPU</th> <th>Iluvatar GPU</th> <th>Huawei Ascend NPU</th> <th>x64 CPU</th> <th>Apple Silicon</th> <th>AMD GPU</th> <th>Intel Arc GPU</th> </tr> </thead> <tbody> <tr style="text-align: center;"> <td>PaddlePaddle</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> </tr> <tr style="text-align: center;"> <td>Transformers</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + vLLM</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>-</td> <td>-</td> <td>✅</td> <td>✅</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + SGLang</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + FastDeploy</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + MLX-VLM</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>✅</td> <td>-</td> <td>-</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + llama.cpp</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + vLLM</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + SGLang</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + FastDeploy</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + MLX-VLM</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>✅</td> <td>-</td> <td>-</td> </tr> <tr style="text-align: center;"> <td>Transformers + llama.cpp</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> </tbody> </table> <details><summary>Explanation of Inference Method</summary> "PaddlePaddle" indicates that both the layout detection model and the VLM use the PaddlePaddle framework for inference. This is the default mode for the PaddleOCR CLI and Python API. "Transformers" indicates that both the layout detection model and the VLM use the Transformers engine for inference. Other inference methods follow the format "Layout Detection Model Inference Method + VLM Inference Method". For example, "PaddlePaddle + vLLM" means that the layout detection model uses PaddlePaddle for inference, while the VLM uses vLLM. </details>

TIP:

  • When using an NVIDIA GPU for inference, ensure that the Compute Capability (CC) and CUDA version meet the following requirements (a quick check is sketched after this list):
    • PaddlePaddle: CC ≥ 7.0, CUDA ≥ 11.8
    • Transformers: CC ≥ 7.0, CUDA ≥ 11.8
    • vLLM: CC ≥ 8.0, CUDA ≥ 12.6
    • SGLang: 8.0 ≤ CC < 12.0, CUDA ≥ 12.6
    • FastDeploy: 8.0 ≤ CC < 12.0, CUDA ≥ 12.6
    • Common GPUs with CC ≥ 8.0 include the RTX 30/40/50 series and A10/A100; for more models, refer to CUDA GPU Compute Capability.
  • vLLM compatibility note: Although vLLM can be launched on NVIDIA GPUs with CC 7.x such as T4/V100, timeout or OOM issues may occur, and its use is not recommended.
  • vLLM, SGLang, and FastDeploy cannot run natively on Windows. Please use the Docker images we provide.
  • Due to dependency conflicts between different libraries, when using mixed inference methods like Transformers + vLLM, it is recommended to deploy the layout detection model and VLM service in different environments.
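
If the PaddlePaddle framework is already installed in your environment (see 1. Environment Preparation below), a quick way to check the requirements above is the minimal sketch below; it only assumes a working paddlepaddle-gpu installation and prints the detected compute capability and the CUDA version PaddlePaddle was built against:

python
# Minimal check of GPU compute capability and the CUDA version that
# PaddlePaddle was built against (assumes paddlepaddle-gpu is installed).
import paddle

print("Compute capability:", paddle.device.cuda.get_device_capability())  # e.g. (8, 6)
print("CUDA version:", paddle.version.cuda())                             # e.g. '12.6'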

1. Environment Preparation

This section explains how to set up the runtime environment for PaddleOCR-VL. This tutorial mainly applies to x64 CPU users and NVIDIA GPU users other than Blackwell. For other hardware, please refer first to the dedicated tutorials listed above.

This tutorial provides the following two methods for environment preparation:

  • Method 1: Use the official Docker image (NVIDIA GPU only).

  • Method 2: Manually install the inference engine and PaddleOCR (available for both x64 CPU and NVIDIA GPU).

We strongly recommend using the Docker image to minimize potential environment-related issues.

1.1 Method 1: Using Docker Image

We recommend using the official Docker image (requires Docker >= 19.03 and a GPU-equipped machine whose NVIDIA driver supports CUDA 12.6 or later):

shell
docker run \
    -it \
    --gpus all \
    --network host \
    --user root \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \
    /bin/bash
# Invoke PaddleOCR CLI or Python API within the container

If you need to use PaddleOCR-VL in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu (approximately 8 GB) in the above command with the offline image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline (approximately 10 GB). Pull the image on an internet-connected machine, transfer it to the offline machine, and then start the container from this image there. For example:

shell
# Execute on an internet-connected machine
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline
# Save the image to a file
docker save ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline -o paddleocr-vl-latest-nvidia-gpu-offline.tar

# Transfer the image file to the offline machine

# Execute on the offline machine
docker load -i paddleocr-vl-latest-nvidia-gpu-offline.tar
# After that, you can use `docker run` to start the container on the offline machine

The image comes preinstalled with the PaddlePaddle framework and does not include any other inference engines. If you want to use other inference engines, it is recommended to install them manually using Method 2 (it is not recommended to install them in an environment where the PaddlePaddle framework is preinstalled).

TIP: Images with the latest-xxx tag correspond to the latest version. If you want to use a specific version of the image, you can replace latest in the tag with the desired PaddleOCR version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:paddleocr3.3-nvidia-gpu-offline

1.2 Method 2: Manually Install the Inference Engine and PaddleOCR

If you cannot use Docker, you can manually install PaddlePaddle and PaddleOCR. The required Python version is 3.8–3.13.

We strongly recommend installing PaddleOCR-VL in a virtual environment to avoid dependency conflicts. For example, use the Python venv standard library to create a virtual environment:

shell
# Create a virtual environment
python -m venv .venv_paddleocr
# Activate the environment
source .venv_paddleocr/bin/activate

Please first install the dependencies corresponding to your chosen inference engine:

  • If you use PaddlePaddle for inference, install PaddlePaddle 3.2.1 or later (do not install both the CPU and GPU versions of PaddlePaddle at the same time). Common installation commands are as follows:
shell
# NVIDIA GPU (CUDA 12.6 as an example)
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# x64 CPU
python -m pip install paddlepaddle==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/

For other CUDA versions, please refer to the PaddlePaddle installation guide: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html

After installing the inference engine, run the following command to install the base package required by PaddleOCR-VL:

shell
python -m pip install -U "paddleocr[doc-parser]"
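
To verify that the environment is ready, a minimal check (assuming the commands above completed without errors) is to run PaddlePaddle's built-in installation check and import the PaddleOCR-VL pipeline class; no model files are downloaded at this point:

python
# Verify the PaddlePaddle installation and confirm that the PaddleOCR-VL
# pipeline class can be imported. No models are downloaded by this check.
import paddle
from paddleocr import PaddleOCRVL

paddle.utils.run_check()  # reports whether PaddlePaddle is installed correctly
print(PaddleOCRVL)        # import succeeds if paddleocr[doc-parser] is installed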

2. Quick Start

This section introduces how to use PaddleOCR-VL through the CLI and Python API.

PaddleOCR-VL supports both CLI and Python API usage. The CLI method is simpler and suitable for quick verification, while the Python API is more flexible and suitable for integration into existing projects. The examples below use PaddlePaddle inference by default. To switch to the transformers engine, append --engine transformers in the CLI, or pass engine="transformers" when initializing the Python API.

IMPORTANT: The methods introduced in this section are primarily for rapid validation. Their inference speed, memory usage, and stability may not meet the requirements of a production environment. If deployment to a production environment is needed, we strongly recommend using a dedicated VLM inference service. For specific methods, please refer to the next section.

2.1 Command Line Usage

When you run PaddleOCR-VL for the first time, it will automatically download the official model files. Please make sure the current environment has internet access and allow some extra time for downloading and initialization.

To run the examples below on the demo image used in this document, download it first:

shell
curl -L -o paddleocr_vl_demo.png https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

The following are ready-to-copy example commands. For the first try, we recommend adding --save_path ./output so that you can inspect the saved results in the current directory:

shell
# NVIDIA GPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --save_path ./output

# Kunlunxin XPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device xpu --save_path ./output

# Hygon DCU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device dcu --save_path ./output

# MetaX GPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device metax_gpu --save_path ./output

# Apple Silicon
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device cpu --save_path ./output

# Huawei Ascend NPU
# For Huawei Ascend NPU, please refer to Chapter 3 and use PaddlePaddle + vLLM for inference

# Use --use_doc_orientation_classify to enable document orientation classification
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_doc_orientation_classify True --save_path ./output

# Use --use_doc_unwarping to enable the document unwarping module
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_doc_unwarping True --save_path ./output

# Set --use_layout_detection to False to disable the layout detection and ordering module
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_layout_detection False --save_path ./output

After successful execution, the terminal will print the structured result. If you set --save_path ./output, the result files will also be saved under the output directory in the current working directory for further inspection and debugging.
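
The saved files can also be inspected programmatically. The sketch below simply loads every JSON file found under ./output and prints its top-level keys; the exact file names depend on the input file name and the PaddleOCR version, so this is only an illustrative helper:

python
# Illustrative sketch: load the JSON results saved via --save_path ./output
# and print the top-level keys of each file.
import json
from pathlib import Path

for json_path in sorted(Path("./output").glob("*.json")):
    with open(json_path, "r", encoding="utf-8") as f:
        result = json.load(f)
    print(json_path.name, "->", list(result.keys()))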

To switch to the transformers engine, use:

shell
paddleocr doc_parser -i ./paddleocr_vl_demo.png --engine transformers --save_path ./output
<details><summary><b>Command line supports more parameters. Click to expand for detailed parameter descriptions</b></summary> <table> <thead> <tr> <th>Parameter</th> <th>Description</th> <th>Type</th> <th>Default</th> </tr> </thead> <tbody> <tr> <td><code>input</code></td> <td><b>Meaning:</b>Data to be predicted, required.

<b>Description:</b> A local path to an image or PDF file, for example <code>/root/data/img.jpg</code>; <b>a URL link</b>, such as the network URL of an image or PDF file: <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/demo_paper.png">Example</a>; <b>or a local directory</b> containing the images to be predicted, for example <code>/root/data/</code> (prediction for directories containing PDF files is currently not supported; PDF files must be specified by their full file path).</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>save_path</code></td> <td><b>Meaning:</b>Specify the path where the inference result file will be saved.

<b>Description:</b> If not set, the inference results will not be saved locally.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>pipeline_version</code></td> <td> <b>Meaning:</b> Specifies the pipeline version.
<b>Description:</b> The currently available values are <code>"v1"</code> and <code>"v1.5"</code>.
</td> <td><code>str</code></td> <td>"v1.5"</td> </tr> <tr> <td><code>layout_detection_model_name</code></td> <td><b>Meaning:</b>Name of the layout area detection and ranking model.

<b>Description:</b> If not set, the default model of the pipeline will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b>Directory path of the layout area detection and ranking model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>Score threshold for the layout model.

<b>Description:</b> Any value between <code>0-1</code>. If not set, the default value is used, which is <code>0.5</code>.

<td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>Whether to use post-processing NMS for layout detection.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b>Expansion coefficient for the detection boxes of the layout area detection model.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the initialized default value will be used.</td>

<td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>Merging mode for the detection boxes output by the model in layout detection.

<b>Description:</b>

<ul> <li><b>large</b> when set to large, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the outermost largest box is retained, and the overlapping inner boxes are deleted;</li> <li><b>small</b>, when set to small, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the innermost contained small box is retained, and the overlapping outer boxes are deleted;</li> <li><b>union</b>,no filtering is performed on the boxes, and both inner and outer boxes are retained;</li></ul> If not set, the initialized parameter value will be used. </td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_model_name</code></td> <td><b>Meaning:</b>Name of the multimodal recognition model.

<b>Description:</b> If not set, the default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_model_dir</code></td> <td><b>Meaning:</b>Directory path of the multimodal recognition model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_backend</code></td> <td><b>Meaning:</b>Inference backend used by the multimodal recognition model.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_server_url</code></td> <td><b>Description:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the server URL.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_max_concurrency</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the maximum number of concurrent requests.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>vl_rec_api_model_name</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the model name of the service.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_api_key</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the API key of the service.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td><b>Meaning:</b>Name of the document orientation classification model.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b>Directory path of the document orientation classification model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b>Name of the text image rectification model.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text image rectification model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b>Whether to load and use the document orientation classification module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to<code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to load and use the text image rectification module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to load and use the layout area detection and ranking module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>Controls whether to format the content within <code>block_content</code> as Markdown.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Controls whether to merge layout detection boxes for cross-column or staggered top-and-bottom column layouts.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>['number','footnote','header','header_image','footer','footer_image','aside_text']</code>.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_shape_mode</code></td> <td> <b>Meaning:</b>Specifies the geometric representation mode for layout detection results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed.
<b>Description:</b> Value descriptions:
<ul>
  <li>
    <b>rect (rectangle)</b>:
    Outputs an axis-aligned bounding box (including x1, y1, x2, y2).
    Suitable for standard horizontally aligned layouts.
  </li>
  <li>
    <b>quad (quadrilateral)</b>:
    Outputs an arbitrary quadrilateral composed of four vertices.
    Suitable for regions with skew or perspective distortion.
  </li>
  <li>
    <b>poly (polygon)</b>:
    Outputs a closed contour composed of multiple coordinate points.
    Suitable for irregularly shaped or curved layout elements,
    offering the highest precision.
  </li>
  <li>
    <b>auto (automatic)</b>:
    The system automatically selects the most appropriate shape
    representation based on the complexity and confidence of the
    detected targets.
  </li>
</ul>
</td> <td><code>str</code></td> <td>"auto"</td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>Used to control whether to enable internal queues.

<b>Description:</b> When set to <code>True</code>, data loading (such as rendering PDF pages as images), layout detection model processing, and VLM inference will be executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with a large number of pages or directories containing a large number of images or PDF files. If not set, the initialized default value will be used, which defaults to initialization as <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>prompt_label</code></td> <td><b>Meaning:</b>The prompt type setting for the VL model, which takes effect if and only if <code>use_layout_detection=False</code>.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>repetition_penalty</code></td> <td><b>Meaning:</b>The repetition penalty parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>temperature</code></td> <td><b>Meaning:</b>The temperature parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>top_p</code></td> <td><b>Meaning:</b>The top-p parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>min_pixels</code></td> <td><b>Meaning:</b>The minimum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>max_pixels</code></td> <td><b>Meaning:</b>The maximum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b>The device used for inference.

<b>Description:</b> Supports specifying specific card numbers:<ul>

<li><b>CPU</b>: For example,<code>cpu</code> indicates using the CPU for inference;</li> <li><b>GPU</b>: For example,<code>gpu:0</code> indicates using the first GPU for inference;</li> <li><b>NPU</b>: For example,<code>npu:0</code> indicates using the first NPU for inference;</li> <li><b>XPU</b>: For example,<code>xpu:0</code> indicates using the first XPU for inference;</li> <li><b>MLU</b>: For example,<code>mlu:0</code> indicates using the first MLU for inference;</li> <li><b>DCU</b>: For example,<code>dcu:0</code> indicates using the first DCU for inference;</li> <li><b>MetaX GPU</b>: For example,<code>metax_gpu:0</code> indicates using the first MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: For example,<code>iluvatar_gpu:0</code> indicates using the first Iluvatar GPU for inference;</li> </ul>If not set, the initialized default value will be used. During initialization, the local GPU device 0 will be used preferentially. If it is not available, the CPU device will be used.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For CUDA 11.8 versions of PaddlePaddle, the compatible TensorRT version is 8.x (x>=6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>fp32</code> or <code>fp16</code>.</td> <td><code>str</code></td> <td><code>fp32</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>mkldnn_cache_capacity</code></td> <td> <b>Meaning:</b> MKL-DNN cache capacity. </td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>cpu_threads</code></td> <td><b>Meaning:</b> Number of threads used for inference on CPU.</td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>paddlex_config</code></td> <td><b>Meaning:</b> Path to the PaddleX pipeline configuration file.</td> <td><code>str</code></td> <td></td> </tr> </tbody> </table> </details>

The inference result will be printed in the terminal. The default output of PaddleOCR-VL is as follows:

<details><summary> 👉Click to expand</summary> <pre> <code> {'res': {'input_path': 'paddleocr_vl_demo.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': True, 'use_chart_recognition': False, 'format_block_content': False}, 'layout_det_res': {'input_path': None, 'page_index': None, 'boxes': [{'cls_id': 6, 'label': 'doc_title', 'score': 0.9636914134025574, 'coordinate': [np.float32(131.31366), np.float32(36.450516), np.float32(1384.522), np.float32(127.984665)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9281806349754333, 'coordinate': [np.float32(585.39465), np.float32(158.438), np.float32(930.2184), np.float32(182.57469)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840355515480042, 'coordinate': [np.float32(9.023666), np.float32(200.86115), np.float32(361.41583), np.float32(343.8828)]}, {'cls_id': 14, 'label': 'image', 'score': 0.9871416091918945, 'coordinate': [np.float32(775.50574), np.float32(200.66502), np.float32(1503.3807), np.float32(684.9304)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9801855087280273, 'coordinate': [np.float32(9.532196), np.float32(344.90594), np.float32(361.4413), np.float32(440.8244)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9708921313285828, 'coordinate': [np.float32(28.040405), np.float32(455.87976), np.float32(341.7215), np.float32(520.7117)]}, {'cls_id': 24, 'label': 'vision_footnote', 'score': 0.9002962708473206, 'coordinate': [np.float32(809.0692), np.float32(703.70044), np.float32(1488.3016), np.float32(750.5238)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9825374484062195, 'coordinate': [np.float32(8.896561), np.float32(536.54895), np.float32(361.05237), np.float32(655.8058)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822263717651367, 'coordinate': [np.float32(8.971573), np.float32(657.4949), np.float32(362.01715), np.float32(774.625)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9767460823059082, 'coordinate': [np.float32(9.407074), np.float32(776.5216), np.float32(361.31067), np.float32(846.82874)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9868153929710388, 'coordinate': [np.float32(8.669495), np.float32(848.2543), np.float32(361.64703), np.float32(1062.8568)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9826608300209045, 'coordinate': [np.float32(8.8025055), np.float32(1063.8615), np.float32(361.46588), np.float32(1182.8524)]}, {'cls_id': 22, 'label': 'text', 'score': 0.982555627822876, 'coordinate': [np.float32(8.820602), np.float32(1184.4663), np.float32(361.66394), np.float32(1302.4507)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9584776759147644, 'coordinate': [np.float32(9.170288), np.float32(1304.2161), np.float32(361.48898), np.float32(1351.7483)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9782056212425232, 'coordinate': [np.float32(389.1618), np.float32(200.38202), np.float32(742.7591), np.float32(295.65146)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9844875931739807, 'coordinate': [np.float32(388.73303), np.float32(297.18463), np.float32(744.00024), np.float32(441.3034)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9680547714233398, 'coordinate': [np.float32(409.39468), np.float32(455.89386), np.float32(721.7174), np.float32(520.9387)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9741666913032532, 'coordinate': [np.float32(389.71606), np.float32(536.8138), np.float32(742.7112), np.float32(608.00165)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840384721755981, 'coordinate': [np.float32(389.30988), np.float32(609.39636), np.float32(743.09247), 
np.float32(750.3231)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9845995306968689, 'coordinate': [np.float32(389.13272), np.float32(751.7772), np.float32(743.058), np.float32(894.8815)]}, {'cls_id': 22, 'label': 'text', 'score': 0.984852135181427, 'coordinate': [np.float32(388.83267), np.float32(896.0371), np.float32(743.58215), np.float32(1038.7345)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9804865717887878, 'coordinate': [np.float32(389.08478), np.float32(1039.9119), np.float32(742.7585), np.float32(1134.4897)]}, {'cls_id': 22, 'label': 'text', 'score': 0.986461341381073, 'coordinate': [np.float32(388.52643), np.float32(1135.8137), np.float32(743.451), np.float32(1352.0085)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9869391918182373, 'coordinate': [np.float32(769.8341), np.float32(775.66235), np.float32(1124.9813), np.float32(1063.207)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822869896888733, 'coordinate': [np.float32(770.30383), np.float32(1063.938), np.float32(1124.8295), np.float32(1184.2192)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9689218997955322, 'coordinate': [np.float32(791.3042), np.float32(1199.3169), np.float32(1104.4521), np.float32(1264.6985)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9713128209114075, 'coordinate': [np.float32(770.4253), np.float32(1279.6072), np.float32(1124.6917), np.float32(1351.8672)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9236552119255066, 'coordinate': [np.float32(1153.9058), np.float32(775.5814), np.float32(1334.0654), np.float32(798.1581)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9857938885688782, 'coordinate': [np.float32(1151.5197), np.float32(799.28015), np.float32(1506.3619), np.float32(991.1156)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9820687174797058, 'coordinate': [np.float32(1151.5686), np.float32(991.91095), np.float32(1506.6023), np.float32(1110.8875)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9866049885749817, 'coordinate': [np.float32(1151.6919), np.float32(1112.1301), np.float32(1507.1611), np.float32(1351.9504)]}]}}} </code></pre></details>

For detailed descriptions of the running results and saving interfaces, refer to the result explanation in 2.2 Python Script Integration.

<b>Note: </b> Since the default model of PaddleOCR-VL is relatively large, inference may be slow. For actual use, it is recommended to use 3. Improving Inference Performance with VLM Inference Services for faster inference.

2.2 Python Script Integration

The command line method is intended for quick testing and visualization. In real projects, you usually integrate the model through code. You can quickly run PaddleOCR-VL inference with just a few lines of code:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

# NVIDIA GPU
pipeline = PaddleOCRVL()
# Kunlunxin XPU
# pipeline = PaddleOCRVL(device="xpu")
# Hygon DCU
# pipeline = PaddleOCRVL(device="dcu")
# MetaX GPU
# pipeline = PaddleOCRVL(device="metax_gpu")
# Apple Silicon
# pipeline = PaddleOCRVL(device="cpu")
# Huawei Ascend NPU: please refer to Chapter 3 and use PaddlePaddle + vLLM for inference

# pipeline = PaddleOCRVL(use_doc_orientation_classify=True) # Use use_doc_orientation_classify to enable/disable document orientation classification model
# pipeline = PaddleOCRVL(use_doc_unwarping=True) # Use use_doc_unwarping to enable/disable document unwarping module
# pipeline = PaddleOCRVL(use_layout_detection=False) # Use use_layout_detection to enable/disable layout detection module

output = pipeline.predict("./paddleocr_vl_demo.png")
for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
    res.save_to_word(save_path="output") ## Save the current image's result in Word format

To switch to the transformers engine, use:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

pipeline = PaddleOCRVL(engine="transformers")
output = pipeline.predict("./paddleocr_vl_demo.png")
for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
    res.save_to_word(save_path="output") ## Save the current image's result in Word format

For PDF files, each page will be processed individually, and a separate Markdown file will be generated for each page. If you wish to perform cross-page table merging, reconstruct multi-level headings, or merge multi-page results, you can achieve this using the following method:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

input_file = "./your_pdf_file.pdf"
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

pipeline = PaddleOCRVL()

output = pipeline.predict(input=input_file)

pages_res = list(output)

output = pipeline.restructure_pages(pages_res)
# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages

for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
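
If you do not need the cross-page restructuring shown above and want to limit memory usage for long PDFs, the predict_iter() method (described in the parameter reference later in this section) returns a generator, so each page result can be saved and discarded as soon as it is produced. A minimal sketch:

python
# Minimal sketch: process a long PDF page by page with predict_iter(), which
# returns a generator instead of materializing all page results at once.
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
for res in pipeline.predict_iter("./your_pdf_file.pdf"):
    res.save_to_markdown(save_path="./output")  # one Markdown file per page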

If you need to process multiple files, it is recommended to pass the directory path containing the files or a list of file paths to the predict method to maximize processing efficiency. For example:

python
# The `imgs` directory contains multiple images to be processed: file1.png, file2.png, file3.png
# Pass the directory path
output = pipeline.predict("imgs")
# Or pass a list of file paths
output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"])
# Both of the above methods are more efficient than the following approach:
# for file in ["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]:
#     output = pipeline.predict(file)

Note:

  • In the example code above, the use_doc_orientation_classify and use_doc_unwarping parameters are both set to False by default, meaning document orientation classification and document unwarping are disabled. If you need these features, set them to True manually.

The above Python script performs the following steps:

<details><summary>(1) Instantiate the pipeline object. Specific parameter descriptions are as follows:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tbody> <tr> <td><code>pipeline_version</code></td> <td> <b>Meaning:</b> Specifies the pipeline version.
<b>Description:</b> The currently available values are <code>"v1"</code> and <code>"v1.5"</code>.
</td> <td><code>str</code></td> <td>"v1.5"</td> </tr> <tr> <td><code>layout_detection_model_name</code></td> <td><b>Meaning:</b>Name of the layout area detection and ranking model.

<b>Description:</b> If set to <code>None</code>, the default model of the pipeline will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b>Directory path of the layout area detection and ranking model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>Score threshold for the layout model.

<b>Description:</b>

<ul> <li><b>float</b>: Any floating-point number between <code>0-1</code>;</li> <li><b>dict</b>: <code>{0:0.1}</code> The key is the class ID, and the value is the threshold for that class;</li> <li><b>None</b>: If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</li> </ul> <td><code>float|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>Whether to use post-processing NMS for layout detection.

<b>Description:</b> If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td> <b>Meaning:</b>Expansion coefficient for the detection box of the layout area detection model.

<b>Description:</b>

<ul> <li><b>float</b>: Any floating-point number greater than <code>0</code>;</li> <li><b>Tuple[float,float]</b>: The respective expansion coefficients in the horizontal and vertical directions;</li> <li><b>dict</b>: The key is of <b>int</b> type, representing the <code>cls_id</code>, and the value is of <code>tuple</code> type, such as <code>{0: (1.1, 2.0)}</code>, indicating that the center of the detection box for class 0 output by the model remains unchanged, with the width expanded by 1.1 times and the height expanded by 2.0 times;</li> <li><b>None</b>: If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</li> </ul> <td><code>float|Tuple[float,float]|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>Merging mode for the detection boxes output by the model in layout detection.

<b>Description:</b>

<ul> <li><b>large</b> when set to large, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the outermost largest box is retained, and the overlapping inner boxes are deleted;</li> <li><b>small</b>, when set to small, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the innermost contained small box is retained, and the overlapping outer boxes are deleted;</li> <li><b>union</b>,no filtering is performed on the boxes, and both inner and outer boxes are retained;</li></ul> If set to <code>None</code>, the initialized parameter value will be used. </td> <td><code>str|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_model_name</code></td> <td><b>Meaning:</b>Name of the multimodal recognition model.

<b>Description:</b> If set to <code>None</code>, the default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_model_dir</code></td> <td><b>Meaning:</b>Directory path of the multimodal recognition model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_backend</code></td> <td><b>Meaning:</b>Inference backend used by the multimodal recognition model.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_server_url</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the server URL.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_max_concurrency</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the maximum number of concurrent requests.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_api_model_name</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the model name of the service.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_api_key</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the API key of the service.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td><b>Meaning:</b>Name of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b>Directory path of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b>Name of the text image rectification model.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text image rectification model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b>Whether to load and use the document orientation classification module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to<code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to load and use the text image rectification module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to load and use the layout area detection and ranking module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>True</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>Controls whether to format the content within <code>block_content</code> as Markdown.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as<code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Controls whether to merge layout detection boxes for cross-column or staggered top-and-bottom column layouts.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as<code>True</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as <code>['number','footnote','header','header_image','footer','footer_image','aside_text']</code>.</td>

<td><code>list|None</code></td> <td></td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>Used to control whether to enable internal queues.

<b>Description:</b> When set to <code>True</code>, data loading (such as rendering PDF pages as images), layout detection model processing, and VLM inference will be executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with many pages or directories containing a large number of images or PDF files. If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as <code>True</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b>The device used for inference.

<b>Description:</b> Supports specifying specific card numbers:<ul>

<li><b>CPU</b>: For example,<code>cpu</code> indicates using the CPU for inference;</li> <li><b>GPU</b>: For example,<code>gpu:0</code> indicates using the first GPU for inference;</li> <li><b>NPU</b>: For example,<code>npu:0</code> indicates using the first NPU for inference;</li> <li><b>XPU</b>: For example,<code>xpu:0</code> indicates using the first XPU for inference;</li> <li><b>MLU</b>: For example,<code>mlu:0</code> indicates using the first MLU for inference;</li> <li><b>DCU</b>: For example,<code>dcu:0</code> indicates using the first DCU for inference;</li> <li><b>MetaX GPU</b>: For example,<code>metax_gpu:0</code> indicates using the first MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: For example,<code>iluvatar_gpu:0</code> indicates using the first Iluvatar GPU for inference;</li> </ul>If not set, the initialized default value will be used. During initialization, the local GPU device 0 will be used preferentially. If it is not available, the CPU device will be used.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine_config</code></td> <td><b>Meaning:</b> Inference-engine configuration. <b>Description:</b> Recommended together with <code>engine</code>. For supported fields, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For CUDA 11.8 versions of PaddlePaddle, the compatible TensorRT version is 8.x (x>=6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>"fp32"</code> or <code>"fp16"</code>.</td> <td><code>str</code></td> <td><code>"fp32"</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>mkldnn_cache_capacity</code></td> <td> <b>Meaning:</b> MKL-DNN cache capacity. </td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>cpu_threads</code></td> <td><b>Meaning:</b> Number of threads used for inference on CPU.</td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>paddlex_config</code></td> <td><b>Meaning:</b> Path to the PaddleX pipeline configuration file.</td> <td><code>str</code></td> <td><code>None</code></td> </tr> </tbody> </table> </details> <details><summary>(2) Call the <code>predict()</code> method of the PaddleOCR-VL pipeline object for inference prediction. This method returns a list of results. The pipeline also provides the <code>predict_iter()</code> method. The two methods accept the same parameters and return results in the same form; the difference is that <code>predict_iter()</code> returns a <code>generator</code>, which processes and yields prediction results incrementally, making it suitable for large datasets or memory-constrained scenarios. Choose either method according to your actual needs. Below are the parameters of the <code>predict()</code> method and their descriptions:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tr> <td><code>input</code></td> <td><b>Meaning:</b> Data to be predicted, supporting multiple input types. Required.

<b>Description:</b>

<ul> <li><b>Python Var</b>: such as <code>numpy.ndarray</code> representing image data</li> <li><b>str</b>: <b>a local file path</b>, such as the path of an image or PDF file: <code>/root/data/img.jpg</code>; <b>a URL</b>, such as the network URL of an image or PDF file: <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/demo_paper.png">example</a>; or <b>a local directory</b> containing the images to be predicted, such as <code>/root/data/</code> (prediction for directories containing PDF files is currently not supported; PDF files must be specified by their full file path)</li> <li><b>list</b>: List elements should be of the aforementioned data types, such as <code>[numpy.ndarray, numpy.ndarray]</code>, <code>["/root/data/img1.jpg", "/root/data/img2.jpg"]</code>, <code>["/root/data1", "/root/data2"]</code>.</li> </ul> </td> <td><code>Python Var|str|list</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b> Whether to use the document orientation classification module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to use the text image rectification module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to use the layout region detection and sorting module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>float|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>float|Tuple[float,float]|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>str|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_shape_mode</code></td> <td> <b>Meaning:</b>Specifies the geometric representation mode for layout detection results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed.
<b>Description:</b> Value descriptions:
<ul>
  <li>
    <b>rect (rectangle)</b>:
    Outputs an axis-aligned bounding box (including x1, y1, x2, y2).
    Suitable for standard horizontally aligned layouts.
  </li>
  <li>
    <b>quad (quadrilateral)</b>:
    Outputs an arbitrary quadrilateral composed of four vertices.
    Suitable for regions with skew or perspective distortion.
  </li>
  <li>
    <b>poly (polygon)</b>:
    Outputs a closed contour composed of multiple coordinate points.
    Suitable for irregularly shaped or curved layout elements,
    offering the highest precision.
  </li>
  <li>
    <b>auto (automatic)</b>:
    The system automatically selects the most appropriate shape
    representation based on the complexity and confidence of the
    detected targets.
  </li>
</ul>
</td> <td><code>str</code></td> <td>"auto"</td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>prompt_label</code></td> <td><b>Meaning:</b>The prompt type setting for the VL model, which takes effect only when <code>use_layout_detection=False</code>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>repetition_penalty</code></td> <td><b>Meaning:</b>The repetition penalty parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>temperature</code></td> <td><b>Meaning:</b>Temperature parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>top_p</code></td> <td><b>Meaning:</b>Top-p parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>min_pixels</code></td> <td><b>Meaning:</b>The minimum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>max_pixels</code></td> <td><b>Meaning:</b>The maximum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>max_new_tokens</code></td> <td><b>Meaning:</b>The maximum number of tokens generated by the VL model.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Control whether to merge the layout detection boxes for cross-column or staggered top and bottom columns.</td> <td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.</td> <td><code>list|None</code></td> <td></td> </tr> <tr> <td><code>vlm_extra_args</code></td> <td><b>Meaning:</b>Additional configuration parameters for the VLM. The currently supported custom parameters are as follows: <ul> <li><code>ocr_min_pixels</code>: Minimum resolution for OCR</li> <li><code>ocr_max_pixels</code>: Maximum resolution for OCR</li> <li><code>table_min_pixels</code>: Minimum resolution for tables</li> <li><code>table_max_pixels</code>: Maximum resolution for tables</li> <li><code>chart_min_pixels</code>: Minimum resolution for charts</li> <li><code>chart_max_pixels</code>: Maximum resolution for charts</li> <li><code>formula_min_pixels</code>: Minimum resolution for formulas</li> <li><code>formula_max_pixels</code>: Maximum resolution for formulas</li> <li><code>seal_min_pixels</code>: Minimum resolution for seals</li> <li><code>seal_max_pixels</code>: Maximum resolution for seals</li> </ul></td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> </table> </details> <details><summary>(3) Invoke the <code>restructure_pages()</code> method of the PaddleOCR-VL object to reconstruct pages from the multi-page results list of inference predictions. This method will return a reconstructed multi-page result or a merged single-page result. 
Below are the parameters of the <code>restructure_pages()</code> method and their descriptions:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Description</th> <th>Type</th> <th>Default Value</th> </tr> </thead> <tbody> <tr> <td><code>res_list</code></td> <td><b>Meaning:</b> The list of results predicted from a multi-page PDF inference.</td> <td><code>list|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_tables</code></td> <td><b>Meaning:</b> Controls whether to merge tables across pages.</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>relevel_titles</code></td> <td><b>Meaning:</b> Controls whether to re-level multi-level titles across pages.</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>concatenate_pages</code></td> <td><b>Meaning:</b> Controls whether to concatenate multi-page results into one page.</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> </tbody> </table> </details> <details><summary>(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary> <table> <thead> <tr> <th>Method</th> <th>Method Description</th> <th>Parameter</th> <th>Parameter Type</th> <th>Parameter Description</th> <th>Default Value</th> </tr> </thead> <tr> <td rowspan="3"> <code>print()</code></td> <td rowspan="3">Print results to the terminal</td> <td><code>format_json</code></td> <td><code>bool</code></td> <td>Whether to format the output content using <code>JSON</code> indentation.</td> <td><code>True</code></td> </tr> <tr> <td><code>indent</code></td> <td><code>int</code></td> <td>Specify the indentation level to beautify the output <code>JSON</code> data, making it more readable. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>4</code></td> </tr> <tr> <td><code>ensure_ascii</code></td> <td><code>bool</code></td> <td>Controls whether non-<code>ASCII</code> characters are escaped as <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> retains the original characters. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>False</code></td> </tr> <tr> <td rowspan="3"> <code>save_to_json()</code></td> <td rowspan="3">Save the results as a json format file</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming.</td> <td><code>None</code></td> </tr> <tr> <td><code>indent</code></td> <td><code>int</code></td> <td>Specify the indentation level to beautify the output <code>JSON</code> data, making it more readable. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>4</code></td> </tr> <tr> <td><code>ensure_ascii</code></td> <td><code>bool</code></td> <td>Controls whether non-<code>ASCII</code> characters are escaped as <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> retains the original characters.
Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>False</code></td> </tr> <tr> <td><code>save_to_img()</code></td> <td>Save the visualized images of each intermediate module in png format</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td rowspan="3"> <code>save_to_markdown()</code></td> <td rowspan="3">Save each page in an image or PDF file as a markdown format file separately</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming</td> <td><code>None</code></td> </tr> <tr> <td><code>pretty</code></td> <td><code>bool</code></td> <td>Whether to beautify the <code>markdown</code> output results, centering charts, etc., to make the <code>markdown</code> rendering more aesthetically pleasing.</td> <td><code>True</code></td> </tr> <tr> <td><code>show_formula_number</code></td> <td><code>bool</code></td> <td>Control whether to retain formula numbers in <code>markdown</code>. When set to <code>True</code>, all formula numbers are retained; <code>False</code> retains only the formulas</td> <td><code>False</code></td> </tr> <tr> <td><code>save_to_html()</code></td> <td>Save the tables in the file as html format files</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td><code>save_to_xlsx()</code></td> <td>Save the tables in the file as xlsx format files</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td><code>save_to_word()</code></td> <td>Save the layout parsing results as a Word (.docx) format file</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> </tr> </table>
  • Calling the print() method will print the results to the terminal. The content printed to the terminal is explained as follows:

    • input_path: (str) The input path of the image or PDF to be predicted.

    • page_index: (Union[int, None]) If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is None.

    • page_count: (Union[int, None]) If the input is a PDF file, it indicates the total number of pages in the PDF; otherwise, it is None.

    • width: (int) The width of the original input image.

    • height: (int) The height of the original input image.

    • model_settings: (Dict[str, bool]) Model parameters required for configuring PaddleOCR-VL.

      • use_doc_preprocessor: (bool) Controls whether to enable the document preprocessing sub-pipeline.
      • use_layout_detection: (bool) Controls whether to enable the layout detection module.
      • use_chart_recognition: (bool) Controls whether to enable the chart recognition function.
      • format_block_content: (bool) Controls whether to save the formatted markdown content in JSON.
      • markdown_ignore_labels: (List[str]) Labels of layout regions that need to be ignored in Markdown
    • doc_preprocessor_res: (Dict[str, Union[List[float], str]]) A dictionary of document preprocessing results, which exists only when use_doc_preprocessor=True.

      • input_path: (str) The image path accepted by the document preprocessing sub-pipeline. When the input is a numpy.ndarray, it is saved as None; here, it is None.
      • page_index: None. Since the input here is a numpy.ndarray, the value is None.
      • model_settings: (Dict[str, bool]) Model configuration parameters for the document preprocessing sub-pipeline.
        • use_doc_orientation_classify: (bool) Controls whether to enable the document image orientation classification sub-module.
        • use_doc_unwarping: (bool) Controls whether to enable the text image distortion correction sub-module.
      • angle: (int) The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    • parsing_res_list: (List[Dict]) A list of parsing results, where each element is a dictionary. The list order is the reading order after parsing.

      • block_bbox: (np.ndarray) The bounding box of the layout area.
      • block_label: (str) The label of the layout area, such as text, table, etc.
      • block_content: (str) The content within the layout area.
      • block_id: (int) The index of the layout area, used to display the layout sorting results.
      • block_order (int) The order of the layout area, used to display the layout reading order. For non-sorted parts, the default value is None.
  • Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the saved path will be save_path/{your_img_basename}_res.json. If a file is specified, it will be saved directly to that file. Since json files do not support saving numpy arrays, the numpy.array types within will be converted to list form.

    • input_path: (str) The input path of the image or PDF to be predicted.

    • page_index: (Union[int, None]) If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is None.

    • model_settings: (Dict[str, bool]) Model parameters required for configuring PaddleOCR-VL.

      • use_doc_preprocessor: (bool) Controls whether to enable the document preprocessing sub-pipeline.
      • use_layout_detection: (bool) Controls whether to enable the layout detection module.
      • use_chart_recognition: (bool) Controls whether to enable the chart recognition function.
      • format_block_content: (bool) Controls whether to save the formatted markdown content in JSON.
    • doc_preprocessor_res: (Dict[str, Union[List[float], str]]) A dictionary of document preprocessing results, which exists only when use_doc_preprocessor=True.

      • input_path: (str) The image path accepted by the document preprocessing sub-pipeline. When the input is a numpy.ndarray, it is saved as None; here, it is None.
      • page_index: None. Since the input here is a numpy.ndarray, the value is None.
      • model_settings: (Dict[str, bool]) Model configuration parameters for the document preprocessing sub-pipeline.
        • use_doc_orientation_classify: (bool) Controls whether to enable the document image orientation classification sub-module.
        • use_doc_unwarping: (bool) Controls whether to enable the text image distortion correction sub-module.
      • angle: (int) The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    • parsing_res_list: (List[Dict]) A list of parsing results, where each element is a dictionary. The list order represents the reading order after parsing.

      • block_bbox: (np.ndarray) The bounding box of the layout region.
      • block_label: (str) The label of the layout region, such as text, table, etc.
      • block_content: (str) The content within the layout region.
      • block_id: (int) The index of the layout region, used to display the layout sorting results.
      • block_order (int) The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is None.
  • Calling the save_to_img() method will save the visualization results to the specified save_path. If a directory is specified, visualized images for layout region detection, global OCR, layout reading order, etc., will be saved. If a file is specified, it will be saved directly to that file. (Pipelines typically contain many result images, so it is not recommended to directly specify a specific file path, as multiple images will be overwritten, retaining only the last one.)

  • Calling the save_to_markdown() method will save the converted Markdown file to the specified save_path. The saved file path will be save_path/{your_img_basename}.md. If the input is a PDF file, it is recommended to directly specify a directory; otherwise, multiple markdown files will be overwritten.

<li>Additionally, it also supports obtaining visualized images and prediction results through attributes, as follows: <table> <thead> <tr> <th>Attribute</th> <th>Attribute Description</th> </tr> </thead> <tbody> <tr> <td><code>json</code></td> <td>Obtain the prediction result in <code>json</code> format</td> </tr> <tr> <td><code>img</code></td> <td>Obtain visualized images in <code>dict</code> format</td> </tr> <tr> <td><code>markdown</code></td> <td>Obtain markdown results in <code>dict</code> format</td> </tr> </tbody> </table> <ul> <li>The prediction result obtained through the <code>json</code> attribute is data of dict type, with relevant content consistent with that saved by calling the <code>save_to_json()</code> method.</li> <li>The prediction result returned by the <code>img</code> attribute is data of dict type. The keys are <code>layout_det_res</code>, <code>overall_ocr_res</code>, <code>text_paragraphs_ocr_res</code>, <code>formula_res_region1</code>, <code>table_cell_img</code>, and <code>seal_res_region1</code>, with corresponding values being <code>Image.Image</code> objects: used to display visualized images of layout region detection, OCR, OCR text paragraphs, formulas, tables, and seal results, respectively. If optional modules are not used, the dict only contains <code>layout_det_res</code>.</li> <li>The prediction result returned by the <code>markdown</code> attribute is data of dict type. The keys are <code>markdown_texts</code>, <code>markdown_images</code>, and <code>page_continuation_flags</code>, with corresponding values being markdown text, images displayed in Markdown (<code>Image.Image</code> objects), and a bool tuple used to identify whether the first element on the current page is the start of a paragraph and whether the last element is the end of a paragraph, respectively.</li> </ul> </li> </details>
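
To tie the steps above together, here is a minimal, illustrative sketch of the workflow described in this section. It assumes a local multi-page PDF named `demo.pdf` (a placeholder file name), default model settings, and that `PaddleOCRVL` is importable from the top-level `paddleocr` package; the method and parameter names follow the tables above, but treat this as a sketch rather than a canonical example.

```python
from paddleocr import PaddleOCRVL

# Instantiate the pipeline with default settings (see the instantiation parameters above).
pipeline = PaddleOCRVL()

# predict_iter() returns a generator, which keeps memory usage low for long PDFs;
# predict() would return the same results as a list.
results = list(pipeline.predict_iter("demo.pdf"))

for res in results:
    res.print()                                  # print the structured result to the terminal
    res.save_to_json(save_path="output/")        # one JSON file per page
    res.save_to_markdown(save_path="output/")    # one Markdown file per page

# Optionally restructure the per-page results, e.g. merging tables that span pages.
restructured = pipeline.restructure_pages(res_list=results, merge_tables=True)
```

Because `predict_iter()` is a generator, each page can also be processed inside the loop without materializing the full list first; the list is only built here so that `restructure_pages()` can see all pages at once.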

<a id="3-vlm"></a>

3. Improving Inference Performance with VLM Inference Services

Using only PaddlePaddle or Transformers usually does not provide optimal inference performance. This section mainly introduces how to improve PaddleOCR-VL inference performance through VLM inference services. You can either deploy your own VLM inference service based on backends such as vLLM, SGLang, FastDeploy, MLX-VLM, and llama.cpp, or directly use compatible managed services. This section corresponds to combinations of "Layout Detection Inference Method + VLM Inference Service". Its core idea is that the client continues to handle the other stages in the full workflow, such as layout detection, while only the VLM stage is delegated to a dedicated service.

3.1 Launching the VLM Inference Service

IMPORTANT: The services launched according to this section are responsible only for the VLM inference stage in the PaddleOCR-VL workflow and do not provide a complete end-to-end document parsing API. It is strongly discouraged to directly call such services through plain HTTP requests or OpenAI clients to process document images. If you need to deploy a service with the full PaddleOCR-VL capability, please refer to the service deployment section later in this document.

There are three methods to launch the VLM inference service; choose any one of them:

  • Method 1: Launch the service using the official Docker image. Currently supported:

    • FastDeploy
    • vLLM
  • Method 2: Launch the service by manually installing dependencies via the PaddleOCR CLI. Currently supported:

    • FastDeploy
    • vLLM
    • SGLang
  • Method 3: Launch the service directly using an inference acceleration framework (the pre-configured performance tuning parameters provided by PaddleOCR will not be applied). Currently supported:

    • FastDeploy
    • vLLM
    • MLX-VLM
    • llama.cpp

We strongly recommend using the Docker image to minimize potential environment-related issues.

In addition, cloud platforms such as SiliconFlow and Novita AI also provide managed services. If you choose to use such services, you can skip this subsection and directly read 3.2 Client Usage Methods.

3.1.1 Method 1: Using Docker Image

PaddleOCR provides Docker images for quickly launching vLLM or FastDeploy inference services. You can use the following commands to start the services (requires Docker version >= 19.03, a machine equipped with a GPU, and NVIDIA drivers supporting CUDA 12.6 or later):

=== "Launch vLLM Service"

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm
```

If you wish to start the service in an environment without internet access, replace `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu` (image size approximately 13 GB) in the above command with the offline version image `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-offline` (image size approximately 15 GB).

=== "Launch FastDeploy Service"

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend fastdeploy
```

If you wish to start the service in an environment without internet access, replace `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu` (image size approximately 43 GB) in the above command with the offline version image `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu-offline` (image size approximately 45 GB).

When starting the vLLM or FastDeploy inference service, we provide a set of default parameter settings. If you need to adjust parameters such as GPU memory usage, you can configure additional parameters yourself. Please refer to 3.3.1 Server-Side Parameter Adjustment to create a configuration file, mount this file into the container, and specify it via the `--backend_config` option in the startup command. Taking vLLM as an example:

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    -v "$(pwd)/vllm_config.yml":/tmp/vllm_config.yml \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /tmp/vllm_config.yml
```

Here, `vllm_config.yml` refers to a configuration file on the host machine. The example assumes that you created this file in the current working directory (hence `$(pwd)`); if it is located elsewhere, replace `$(pwd)/vllm_config.yml` with the actual absolute path, since Docker's `-v` option treats a bare relative name as a named volume rather than a file to mount.

TIP: Images with the latest-xxx tag correspond to the latest version of PaddleOCR. If you want to use a specific version of the PaddleOCR image, you can replace latest in the tag with the desired version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:paddleocr3.3-nvidia-gpu-offline

3.1.2 Method 2: Installation and Usage via PaddleOCR CLI

The PaddleOCR CLI has already resolved complex version compatibility issues. Instead of spending time studying framework documentation, you can install the necessary environment with a single command.

Since inference acceleration frameworks may conflict with packages already installed in the current environment, it is recommended to install them in a virtual environment:

```shell
# If a virtual environment is currently activated, deactivate it first using `deactivate`
# Create a virtual environment
python -m venv .venv_vlm
# Activate the environment
source .venv_vlm/bin/activate
```

vLLM and SGLang depend on FlashAttention, and installing FlashAttention may require CUDA compilation tools such as `nvcc`. If these tools are not available in your environment (for example, when using the `paddleocr-vl` image), you can obtain a prebuilt FlashAttention package (version 2.8.2 required) from this repository, install it first, and then proceed with subsequent commands. For example, in the `paddleocr-vl` image, run `python -m pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu128torch2.8-cp310-cp310-linux_x86_64.whl`. This step is not required for FastDeploy.

Install PaddleOCR and the dependencies of inference acceleration services, using vLLM as an example:

```shell
# Install PaddleOCR
python -m pip install "paddleocr[doc-parser]"
# Install inference acceleration service dependencies
paddleocr install_genai_server_deps vllm
```

The usage of paddleocr install_genai_server_deps is:

```shell
paddleocr install_genai_server_deps <inference-acceleration-framework-name>
```

Currently supported framework names are vllm, sglang, and fastdeploy, corresponding to vLLM, SGLang, and FastDeploy, respectively.

Both vLLM and SGLang installed through paddleocr install_genai_server_deps are CUDA 12.6 versions. Please ensure that your local NVIDIA driver supports this version or a later one.

WARNING: The transformers library versions required by vLLM, SGLang, and the Transformers engine are currently incompatible, so the Transformers engine cannot be installed in the same environment as vLLM or SGLang. If using Transformers + vLLM or Transformers + SGLang inference, please deploy the layout detection model and the VLM service in different environments.

After installation, you can launch the service using the paddleocr genai_server command:

```shell
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --port 8118
```

The parameters supported by this command are as follows:

<table>
<thead>
<tr> <th>Parameter</th> <th>Description</th> </tr>
</thead>
<tbody>
<tr> <td><code>--model_name</code></td> <td>Model name</td> </tr>
<tr> <td><code>--model_dir</code></td> <td>Model directory</td> </tr>
<tr> <td><code>--host</code></td> <td>Server hostname</td> </tr>
<tr> <td><code>--port</code></td> <td>Server port number</td> </tr>
<tr> <td><code>--backend</code></td> <td>Backend name, i.e., the name of the inference acceleration framework used; options are <code>vllm</code>, <code>sglang</code>, or <code>fastdeploy</code></td> </tr>
<tr> <td><code>--backend_config</code></td> <td>Can specify a YAML file containing backend configurations</td> </tr>
</tbody>
</table>

3.1.3 Method 3: Launching the Service Directly Using Inference Acceleration Frameworks

If you need to install a custom version of an inference framework and launch the service natively, please refer to the following guidelines. Please note that when launching natively, the pre-configured performance tuning parameters provided by PaddleOCR will not be applied.

3.2 Client Usage Methods

After launching the VLM inference service, the client can call the service through PaddleOCR. This section applies both to self-hosted VLM inference services launched in 3.1 and to compatible managed services provided by third parties. Please note that because the client still needs to call the layout detection model and complete the other stages in the workflow, it is still recommended to run the client on GPU or other acceleration devices to achieve more stable and efficient performance. Please refer to Section 1 for the client-side environment configuration. The configuration described in Section 3.1 applies only to starting the service and is not applicable to the client. If you want the client to invoke the full PaddleOCR-VL capability only through an HTTP interface, please directly refer to Section 4, "Service Deployment".

3.2.1 CLI Invocation

Specify the backend type (vllm-server, sglang-server, fastdeploy-server, mlx-vlm-server or llama-cpp-server) using --vl_rec_backend and the service address using --vl_rec_server_url, for example:

```shell
paddleocr doc_parser --input paddleocr_vl_demo.png --vl_rec_backend vllm-server --vl_rec_server_url http://localhost:8118/v1
```

In addition, you can specify the model name used by the service via --vl_rec_api_model_name, and specify the API key used for authentication via --vl_rec_api_key. Examples are as follows:

Using a service started with the default parameters of vllm serve:

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url http://localhost:8000/v1 \
    --vl_rec_api_model_name 'PaddlePaddle/PaddleOCR-VL-1.5'
```

SiliconFlow platform:

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url https://api.siliconflow.cn/v1 \
    --vl_rec_api_model_name 'PaddlePaddle/PaddleOCR-VL-1.5' \
    --vl_rec_api_key xxxxxx
```

Novita AI platform (currently only PaddleOCR-VL-0.9B is supported, i.e., the v1 model):

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --pipeline_version v1 \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url https://api.novita.ai/openai \
    --vl_rec_api_model_name 'paddlepaddle/paddleocr-vl' \
    --vl_rec_api_key xxxxxx
```

3.2.2 Python API Invocation

When creating a PaddleOCRVL object, specify the backend type (vllm-server, sglang-server, fastdeploy-server, mlx-vlm-server or llama-cpp-server) using vl_rec_backend and the service address using vl_rec_server_url, for example:

```python
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://localhost:8118/v1")
```

In addition, you can specify the model name used by the service via vl_rec_api_model_name, and specify the API key used for authentication via vl_rec_api_key.

Using a service started with the default parameters of vllm serve:

```python
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://localhost:8000/v1",
    vl_rec_api_model_name="PaddlePaddle/PaddleOCR-VL-1.5",
)
```

SiliconFlow platform:

```python
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="https://api.siliconflow.cn/v1",
    vl_rec_api_model_name="PaddlePaddle/PaddleOCR-VL-1.5",
    vl_rec_api_key="xxxxxx",
)
```

Novita AI platform (currently only PaddleOCR-VL-0.9B is supported, i.e., the v1 model):

```python
pipeline = PaddleOCRVL(
    pipeline_version="v1",
    vl_rec_backend="vllm-server",
    vl_rec_server_url="https://api.novita.ai/openai",
    vl_rec_api_model_name="paddlepaddle/paddleocr-vl",
    vl_rec_api_key="xxxxxx",
)
```

3.3 Performance Tuning

The default configurations cannot guarantee optimal performance in all environments. If you encounter performance issues in actual use, you can try the following optimization methods.

3.3.1 Server-Side Parameter Adjustment

Different inference acceleration frameworks support different parameters. Refer to their official documentation for the available parameters and for guidance on when to adjust them.

The PaddleOCR VLM inference service supports parameter tuning through configuration files. The following example shows how to adjust the gpu-memory-utilization and max-num-seqs parameters for the vLLM server:

  1. Create a YAML file `vllm_config.yaml` with the following content:

     ```yaml
     gpu-memory-utilization: 0.3
     max-num-seqs: 128
     ```

  2. Specify the configuration file path when starting the service, for example, using the `paddleocr genai_server` command:

     ```shell
     paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --backend_config vllm_config.yaml
     ```

If using a shell that supports process substitution (like Bash), you can also pass configuration items directly without creating a configuration file:

```bash
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --backend_config <(echo -e 'gpu-memory-utilization: 0.3\nmax-num-seqs: 128')
```

3.3.2 Client-Side Parameter Adjustment

PaddleOCR groups sub-images from single or multiple input images and sends concurrent requests to the server, so the number of concurrent requests significantly impacts performance.

  • For CLI and Python API, adjust the maximum number of concurrent requests using the vl_rec_max_concurrency parameter;
  • For service deployment, modify the VLRecognition.genai_config.max_concurrency field in the configuration file.

When there is a 1:1 client-to-VLM inference service ratio and sufficient server resources, increasing concurrency can improve performance. If the server needs to support multiple clients or has limited computing resources, reduce concurrency to avoid resource overload and service abnormalities.
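
As a concrete illustration of the client-side tuning above, the following sketch raises the maximum number of concurrent VLM requests via `vl_rec_max_concurrency`. It assumes a self-hosted vLLM service reachable at `http://localhost:8118/v1` (as started in 3.1) and that `PaddleOCRVL` is importable from the top-level `paddleocr` package; the value of 8 is only an example, not a recommendation.

```python
from paddleocr import PaddleOCRVL

# The client still runs layout detection locally; only VLM inference goes to the service.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://localhost:8118/v1",
    vl_rec_max_concurrency=8,  # raise only if the VLM server has spare capacity
)
results = pipeline.predict("paddleocr_vl_demo.png")
```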

3.3.3 Common Hardware Performance Tuning Recommendations

The following configurations are for scenarios with a 1:1 client-to-VLM inference service ratio.

NVIDIA RTX 3060

  • Server-Side
    • vLLM: gpu-memory-utilization: 0.7
    • FastDeploy:
      • gpu-memory-utilization: 0.7
      • max-concurrency: 2048

4. Service Deployment

This step mainly introduces how to deploy PaddleOCR-VL as a service and invoke it. If concurrent request processing is not required, choose either of the following two methods:

  • Method 1: Deploy using Docker Compose (recommended).

  • Method 2: Manual Deployment.

Both methods can handle only one request at a time. If you need concurrent request processing, please refer to the High-Performance Service Deployment solution.

Note that the PaddleOCR-VL service described in this section differs from the VLM inference service in the previous section: the latter is responsible for only one part of the complete process (i.e., VLM inference) and is called as an underlying service by the former.

4.1 Method 1: Deployment Using Docker Compose (Recommended)

You can obtain the Compose file and the environment variables configuration file from here and here, respectively, and download them to your local machine. Then, in the directory containing the downloaded files, execute the following command to start the server, which listens on port 8080 by default:

```shell
# Must be executed in the directory containing the compose.yaml and .env files
docker compose up
```

After startup, you will see output similar to the following:

```text
paddleocr-vl-api             | INFO:     Started server process [1]
paddleocr-vl-api             | INFO:     Waiting for application startup.
paddleocr-vl-api             | INFO:     Application startup complete.
paddleocr-vl-api             | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

This solution accelerates VLM inference based on frameworks like vLLM, making it more suitable for production environment deployment. However, it requires the machine to be equipped with a GPU and the NVIDIA driver to support CUDA 12.6 or higher.

Additionally, after starting the server using this method, no internet connection is required except for pulling the image. For offline environment deployment, you can first pull the images involved in the Compose file on an online machine, export and transfer them to the offline machine for import, and then start the service in the offline environment.

Docker Compose starts two containers in sequence by reading the configurations in the .env and compose.yaml files, running the underlying VLM inference service and the PaddleOCR-VL service (Pipeline) respectively.

The meanings of each environment variable contained in the .env file are as follows:

  • API_IMAGE_TAG_SUFFIX: The tag suffix of the image used to start the pipeline service. The default is latest-nvidia-gpu-offline, indicating the use of the latest offline GPU image. To use an image corresponding to a specific version of PaddleOCR, replace latest with the desired version paddleocr<major>.<minor>, for example paddleocr3.3-nvidia-gpu-offline.
  • VLM_BACKEND: The VLM inference backend, currently supporting vllm and fastdeploy. The default is vllm.
  • VLM_IMAGE_TAG_SUFFIX: The tag suffix of the image used to start the VLM inference service. The default is latest-nvidia-gpu-offline, indicating the use of the latest offline GPU image. If you want to use a non-offline version of the image, you can remove the -offline suffix. To use an image corresponding to a specific version of PaddleOCR, replace latest with the desired version paddleocr<major>.<minor>, for example paddleocr3.3-nvidia-gpu-offline.

You can meet custom requirements by modifying .env and compose.yaml, for example:

<details> <summary>1. Change the port of the PaddleOCR-VL service</summary>

Edit <code>paddleocr-vl-api.ports</code> in the <code>compose.yaml</code> file to change the port. For example, if you need to change the service port to 8111, make the following modifications:

```diff
  paddleocr-vl-api:
    ...
    ports:
-     - 8080:8080
+     - 8111:8080
    ...
```
</details> <details> <summary>2. Specify the GPU used by the PaddleOCR-VL service</summary>

Edit <code>device_ids</code> in the <code>compose.yaml</code> file to change the GPU used. For example, if you need to use GPU card 1 for deployment, make the following modifications:

```diff
  paddleocr-vl-api:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
  paddleocr-vlm-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
```
</details> <details> <summary>3. Adjust VLM server-side configuration</summary>

If you want to adjust the VLM server-side configuration, please refer to <a href="#331-server-side-parameter-adjustment">3.3.1 Server-side Parameter Adjustment</a> to generate a configuration file.

After generating the configuration file, add the following <code>paddleocr-vlm-server.volumes</code> and <code>paddleocr-vlm-server.command</code> fields to your <code>compose.yaml</code>. Please replace <code>/path/to/your_config.yaml</code> with your actual configuration file path.

```yaml
  paddleocr-vlm-server:
    ...
    volumes:
      - /path/to/your_config.yaml:/home/paddleocr/vlm_server_config.yaml
    command: paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /home/paddleocr/vlm_server_config.yaml
    ...
```
</details> <details> <summary>4. Change the VLM inference backend</summary>

Modify <code>VLM_BACKEND</code> in the <code>.env</code> file, for example, to change the VLM inference backend to <code>fastdeploy</code>:

```diff
  API_IMAGE_TAG_SUFFIX=latest-nvidia-gpu-offline
- VLM_BACKEND=vllm
+ VLM_BACKEND=fastdeploy
  VLM_IMAGE_TAG_SUFFIX=latest-nvidia-gpu-offline
```
</details> <details> <summary>5. Adjust pipeline configurations (such as model path, batch size, deployment device, etc.)</summary>

Refer to section <a href="#44-pipeline-configuration-adjustment-instructions">4.4 Pipeline Configuration Adjustment Instructions</a> in this document.

</details>

4.2 Method 2: Manual Deployment

Execute the following command to install the service deployment plugin via the PaddleX CLI:

The paddlex command is installed together with paddleocr. Therefore, if you have already installed PaddleOCR in the previous steps, you usually do not need to install PaddleX separately.

```shell
paddlex --install serving
```

Then, start the server using the PaddleX CLI:

```shell
paddlex --serve --pipeline PaddleOCR-VL
```

To switch to the transformers engine for service deployment, use:

```shell
paddlex --serve --pipeline PaddleOCR-VL --engine transformers
```

After startup, you will see output similar to the following, with the server listening on port 8080 by default:

```text
INFO:     Started server process [63108]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

The command-line options related to serving are as follows:

<table> <thead> <tr> <th>Name</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>--pipeline</code></td> <td>PaddleX pipeline registration name or pipeline configuration file path.</td> </tr> <tr> <td><code>--device</code></td> <td>Deployment device for the pipeline. By default, a GPU will be used if available; otherwise, a CPU will be used.</td> </tr> <tr> <td><code>--host</code></td> <td>Hostname or IP address to which the server is bound. Defaults to <code>0.0.0.0</code>.</td> </tr> <tr> <td><code>--port</code></td> <td>Port number on which the server listens. Defaults to <code>8080</code>.</td> </tr> <tr> <td><code>--use_hpip</code></td> <td>If specified, uses high-performance inference. Refer to the High-Performance Inference documentation for more information.</td> </tr> <tr> <td><code>--hpi_config</code></td> <td>High-performance inference configuration. Refer to the High-Performance Inference documentation for more information.</td> </tr> </tbody> </table>

If you need to adjust pipeline configurations (such as model path, batch size, deployment device, etc.), you can specify the `--pipeline` parameter as a custom configuration file path. For the correspondence between PaddleOCR pipelines and PaddleX pipeline registration names, as well as how to obtain and modify PaddleX pipeline configuration files, please refer to PaddleOCR and PaddleX. Furthermore, section 4.4 will introduce how to adjust the pipeline configuration based on common requirements.

4.3 Client-Side Invocation

Below are the API reference and examples of multi-language service invocation:

<details><summary>API Reference</summary> <p>Main operations provided by the service:</p> <ul> <li>The HTTP request method is POST.</li> <li>Both the request body and response body are JSON data (JSON objects).</li> <li>When the request is processed successfully, the response status code is<code>200</code>, and the properties of the response body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>logId</code></td> <td><code>string</code></td> <td>The UUID of the request.</td> </tr> <tr> <td><code>errorCode</code></td> <td><code>integer</code></td> <td>Error code. Fixed as <code>0</code>.</td> </tr> <tr> <td><code>errorMsg</code></td> <td><code>string</code></td> <td>Error description. Fixed as <code>"Success"</code>.</td> </tr> <tr> <td><code>result</code></td> <td><code>object</code></td> <td>Operation result.</td> </tr> </tbody> </table> <ul> <li>When the request is not processed successfully, the properties of the response body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>logId</code></td> <td><code>string</code></td> <td>The UUID of the request.</td> </tr> <tr> <td><code>errorCode</code></td> <td><code>integer</code></td> <td>Error code. Same as the response status code.</td> </tr> <tr> <td><code>errorMsg</code></td> <td><code>string</code></td> <td>Error description.</td> </tr> </tbody> </table> <p>The main operations provided by the service are as follows:</p> <ul> <li><b><code>infer</code></b></li> </ul> <p>Perform layout parsing.</p> <p><code>POST /layout-parsing</code></p> <ul> <li>The properties of the request body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> <th>Required</th> </tr> </thead> <tbody> <tr> <td><code>file</code></td> <td><code>string</code></td> <td>The URL of an image file or PDF file accessible to the server, or the Base64-encoded result of the content of the aforementioned file types. </td> <td>Yes</td> </tr> <tr> <td><code>fileType</code></td> <td><code>integer</code>|<code>null</code></td> <td>File type.<code>0</code> represents a PDF file,<code>1</code> represents an image file. 
If this property is not present in the request body, the file type will be inferred from the URL.</td> <td>No</td> </tr> <tr> <td><code>useDocOrientationClassify</code></td> <td><code>boolean</code> | <code>null</code></td> <td>Please refer to the description of the <code>use_doc_orientation_classify</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useDocUnwarping</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_doc_unwarping</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useLayoutDetection</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_layout_detection</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useChartRecognition</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_chart_recognition</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useSealRecognition</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_seal_recognition</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useOcrForImageBlock</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_ocr_for_image_block</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutThreshold</code></td> <td><code>number</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_threshold</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutNms</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_nms</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutUnclipRatio</code></td> <td><code>number</code>|<code>array</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_unclip_ratio</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutMergeBboxesMode</code></td> <td><code>string</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_merge_bboxes_mode</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutShapeMode</code></td> <td><code>string</code></td> <td>Please refer to the description of the <code>layout_shape_mode</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>promptLabel</code></td> <td><code>string</code>|<code>null</code></td> <td>Please refer to the description of the <code>prompt_label</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>formatBlockContent</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>format_block_content</code> parameter in the 
<code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>repetitionPenalty</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>repetition_penalty</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>temperature</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>temperature</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>topP</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>top_p</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>minPixels</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>min_pixels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>maxPixels</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>max_pixels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>maxNewTokens</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>max_new_tokens</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>mergeLayoutBlocks</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>merge_layout_blocks</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>markdownIgnoreLabels</code></td> <td><code>array</code>|<code>null</code></td> <td>Please refer to the description of the <code>markdown_ignore_labels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>vlmExtraArgs</code></td> <td><code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>vlm_extra_args</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>prettifyMarkdown</code></td> <td><code>boolean</code></td> <td>Whether to output beautified Markdown text. The default is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>showFormulaNumber</code></td> <td><code>boolean</code></td> <td>Whether to include formula numbers in the output Markdown text. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>restructurePages</code></td> <td><code>boolean</code></td> <td>Whether to restructure results across multiple pages. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>mergeTables</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>merge_tables</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object. Valid only when <code>restructurePages</code> is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>relevelTitles</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>relevel_titles</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object. 
Valid only when <code>restructurePages</code> is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>outputFormats</code></td> <td><code>array</code> | <code>null</code></td> <td>Optional. List of extra document formats to return. By default, no extra formats are returned. Currently only <code>"docx"</code> is supported.</td> <td>No</td> </tr> <tr> <td><code>visualize</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Whether to return visualization result images and intermediate images during the processing.<ul style="margin: 0 0 0 1em; padding-left: 0em;"> <li>Pass <code>true</code>: Return images.</li> <li>Pass <code>false</code>: Do not return images.</li> <li>If this parameter is not provided in the request body or <code>null</code> is passed: Follow the setting in the configuration file <code>Serving.visualize</code>.</li> </ul>

For example, add the following field in the configuration file:

<pre><code>Serving: visualize: False</code></pre>Images will not be returned by default, and the default behavior can be overridden by the <code>visualize</code> parameter in the request body. If this parameter is not set in either the request body or the configuration file (or <code>null</code> is passed in the request body and the configuration file is not set), images will be returned by default.</td> <td>No</td> </tr> </tbody> </table> <ul> <li>When the request is processed successfully, the <code>result</code> in the response body has the following attributes:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>layoutParsingResults</code></td> <td><code>array</code></td> <td>Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each actual page processed in the PDF file.</td> </tr> <tr> <td><code>dataInfo</code></td> <td><code>object</code></td> <td>Input data information.</td> </tr> </tbody> </table> <p>Each element in <code>layoutParsingResults</code> is an <code>object</code> with the following attributes:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>prunedResult</code></td> <td><code>object</code></td> <td>A simplified version of the <code>res</code> field in the JSON representation of the results generated by the <code>predict</code> method of the object, with the <code>input_path</code> and <code>page_index</code> fields removed.</td> </tr> <tr> <td><code>markdown</code></td> <td><code>object</code></td> <td>Markdown results.</td> </tr> <tr> <td><code>outputImages</code></td> <td><code>object</code>|<code>null</code></td> <td>Refer to the <code>img</code> property description of the prediction results. The image is in JPEG format and encoded using Base64.</td> </tr> <tr> <td><code>inputImage</code></td> <td><code>string</code>|<code>null</code></td> <td>Input image. The image is in JPEG format and encoded using Base64.</td> </tr> <tr> <td><code>exports</code></td> <td><code>object</code> | <code>null</code></td> <td>Optional additional exports. Present only when <code>outputFormats</code> is set. 
Example: <code>{"docx": {"content": "..."}}</code>, where <code>content</code> is the Base64-encoded file content.</td> </tr> </tbody> </table> <p><code>markdown</code> is an <code>object</code> with the following properties:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>text</code></td> <td><code>string</code></td> <td>Markdown text.</td> </tr> <tr> <td><code>images</code></td> <td><code>object</code></td> <td>Key-value pairs of relative paths to Markdown images and Base64-encoded images.</td> </tr> </tbody> </table> <ul> <li><b><code>restructurePages</code></b></li> </ul> <p>Restructure results across multiple pages.</p> <p><code>POST /restructure-pages</code></p> <ul> <li>The request body has the following properties:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> <th>Required</th> </tr> </thead> <tbody> <tr> <td><code>pages</code></td> <td><code>array</code></td> <td>An array of pages.</td> <td>Yes</td> </tr> <tr> <td><code>mergeTables</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>merge_tables</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>relevelTitles</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>relevel_titles</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>concatenatePages</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>concatenate_pages</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>prettifyMarkdown</code></td> <td><code>boolean</code></td> <td>Whether to output beautified Markdown text. The default is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>showFormulaNumber</code></td> <td><code>boolean</code></td> <td>Whether to include formula numbers in the output Markdown text. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>outputFormats</code></td> <td><code>array</code> | <code>null</code></td> <td>Optional extra export formats; same meaning as <code>outputFormats</code> on <code>infer</code>. Only <code>"docx"</code> is supported.</td> <td>No</td> </tr> </tbody> </table> <p>Each element in <code>pages</code> is an <code>object</code> with the following properties:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>prunedResult</code></td> <td><code>object</code></td> <td>The <code>prunedResult</code> object returned by the <code>infer</code> operation.</td> </tr> <tr> <td><code>markdownImages</code></td> <td><code>object</code>|<code>null</code></td> <td>The <code>images</code> property of the <code>markdown</code> object returned by the <code>infer</code> operation.</td> </tr> </tbody> </table> <ul> <li>When the request is processed successfully, the <code>result</code> field in the response body has the following properties:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>layoutParsingResults</code></td> <td><code>array</code></td> <td>The restructured layout parsing results. 
For the fields that every element contains, please refer to the description of the result returned by the <code>infer</code> operation (excluding visualization result images and intermediate images).</td> </tr> </tbody> </table> </details> <details><summary>Multi-Language Service Invocation Examples</summary> <details> <summary>Python</summary> <pre><code class="language-python"> import base64 import requests import pathlib BASE_URL = "http://localhost:8080" image_path = "./demo.jpg" # Encode the local image in Base64 with open(image_path, "rb") as file: image_bytes = file.read() image_data = base64.b64encode(image_bytes).decode("ascii") payload = { "file": image_data, # Base64-encoded file content or file URL "fileType": 1, # File type, 1 indicates an image file } response = requests.post(BASE_URL + "/layout-parsing", json=payload) assert response.status_code == 200, (response.status_code, response.text) result = response.json()["result"] pages = [] for i, res in enumerate(result["layoutParsingResults"]): pages.append({"prunedResult": res["prunedResult"], "markdownImages": res["markdown"].get("images")}) for img_name, img in res["outputImages"].items(): img_path = f"{img_name}_{i}.jpg" pathlib.Path(img_path).parent.mkdir(exist_ok=True) with open(img_path, "wb") as f: f.write(base64.b64decode(img)) print(f"Output image saved at {img_path}") payload = { "pages": pages, "concatenatePages": True, } response = requests.post(BASE_URL + "/restructure-pages", json=payload) assert response.status_code == 200, (response.status_code, response.text) result = response.json()["result"] res = result["layoutParsingResults"][0] print(res["prunedResult"]) md_dir = pathlib.Path("markdown") md_dir.mkdir(exist_ok=True) (md_dir / "doc.md").write_text(res["markdown"]["text"]) for img_path, img in res["markdown"]["images"].items(): img_path = md_dir / img_path img_path.parent.mkdir(parents=True, exist_ok=True) img_path.write_bytes(base64.b64decode(img)) print(f"Markdown document saved at {md_dir / 'doc.md'}") </code></pre></details> <details><summary>C++</summary> <pre><code class="language-cpp">#include &lt;iostream&gt; #include &lt;filesystem&gt; #include &lt;fstream&gt; #include &lt;vector&gt; #include &lt;string&gt; #include "cpp-httplib/httplib.h" // https://github.com/Huiyicc/cpp-httplib #include "nlohmann/json.hpp" // https://github.com/nlohmann/json #include "base64.hpp" // https://github.com/tobiaslocker/base64 namespace fs = std::filesystem; int main() { httplib::Client client("localhost", 8080); const std::string filePath = "./demo.jpg"; std::ifstream file(filePath, std::ios::binary | std::ios::ate); if (!file) { std::cerr << "Error opening file: " << filePath << std::endl; return 1; } std::streamsize size = file.tellg(); file.seekg(0, std::ios::beg); std::vector<char> buffer(size); if (!file.read(buffer.data(), size)) { std::cerr << "Error reading file." << std::endl; return 1; } std::string bufferStr(buffer.data(), static_cast<size_t>(size)); std::string encodedFile = base64::to_base64(bufferStr); nlohmann::json jsonObj; jsonObj["file"] = encodedFile; jsonObj["fileType"] = 1; auto response = client.Post("/layout-parsing", jsonObj.dump(), "application/json"); if (response && response->status == 200) { nlohmann::json jsonResponse = nlohmann::json::parse(response->body); auto result = jsonResponse["result"]; if (!result.is_object() || !result.contains("layoutParsingResults")) { std::cerr << "Unexpected response format." 
<< std::endl; return 1; } const auto& results = result["layoutParsingResults"]; for (size_t i = 0; i < results.size(); ++i) { const auto& res = results[i]; if (res.contains("prunedResult")) { std::cout << "Layout result [" << i << "]: " << res["prunedResult"].dump() << std::endl; } if (res.contains("outputImages") && res["outputImages"].is_object()) { for (auto& [imgName, imgBase64] : res["outputImages"].items()) { std::string outputPath = imgName + "_" + std::to_string(i) + ".jpg"; fs::path pathObj(outputPath); fs::path parentDir = pathObj.parent_path(); if (!parentDir.empty() && !fs::exists(parentDir)) { fs::create_directories(parentDir); } std::string decodedImage = base64::from_base64(imgBase64.get<std::string>()); std::ofstream outFile(outputPath, std::ios::binary); if (outFile.is_open()) { outFile.write(decodedImage.c_str(), decodedImage.size()); outFile.close(); std::cout << "Saved image: " << outputPath << std::endl; } else { std::cerr << "Failed to save image: " << outputPath << std::endl; } } } } } else { std::cerr << "Request failed." << std::endl; if (response) { std::cerr << "HTTP status: " << response->status << std::endl; std::cerr << "Response body: " << response->body << std::endl; } return 1; } return 0; } </code></pre></details> <details><summary>Java</summary> <pre><code class="language-java">import okhttp3.*; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.JsonNode; import com.fasterxml.jackson.databind.node.ObjectNode; import java.io.File; import java.io.FileOutputStream; import java.io.IOException; import java.util.Base64; import java.nio.file.Paths; import java.nio.file.Files; public class Main { public static void main(String[] args) throws IOException { String API_URL = "http://localhost:8080/layout-parsing"; String imagePath = "./demo.jpg"; File file = new File(imagePath); byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath()); String base64Image = Base64.getEncoder().encodeToString(fileContent); ObjectMapper objectMapper = new ObjectMapper(); ObjectNode payload = objectMapper.createObjectNode(); payload.put("file", base64Image); payload.put("fileType", 1); OkHttpClient client = new OkHttpClient(); MediaType JSON = MediaType.get("application/json; charset=utf-8"); RequestBody body = RequestBody.create(JSON, payload.toString()); Request request = new Request.Builder() .url(API_URL) .post(body) .build(); try (Response response = client.newCall(request).execute()) { if (response.isSuccessful()) { String responseBody = response.body().string(); JsonNode root = objectMapper.readTree(responseBody); JsonNode result = root.get("result"); JsonNode layoutParsingResults = result.get("layoutParsingResults"); for (int i = 0; i < layoutParsingResults.size(); i++) { JsonNode item = layoutParsingResults.get(i); int finalI = i; JsonNode prunedResult = item.get("prunedResult"); System.out.println("Pruned Result [" + i + "]: " + prunedResult.toString()); JsonNode outputImages = item.get("outputImages"); outputImages.fieldNames().forEachRemaining(imgName -> { try { String imgBase64 = outputImages.get(imgName).asText(); byte[] imgBytes = Base64.getDecoder().decode(imgBase64); String imgPath = imgName + "_" + finalI + ".jpg"; File outputFile = new File(imgPath); File parentDir = outputFile.getParentFile(); if (parentDir != null && !parentDir.exists()) { parentDir.mkdirs(); System.out.println("Created directory: " + parentDir.getAbsolutePath()); } try (FileOutputStream fos = new FileOutputStream(outputFile)) { fos.write(imgBytes); 
System.out.println("Saved image: " + imgPath); } } catch (IOException e) { System.err.println("Failed to save image: " + e.getMessage()); } }); } } else { System.err.println("Request failed with HTTP code: " + response.code()); } } } } </code></pre></details> <details><summary>Go</summary> <pre><code class="language-go">package main import ( "bytes" "encoding/base64" "encoding/json" "fmt" "io/ioutil" "net/http" "os" "path/filepath" ) func main() { API_URL := "http://localhost:8080/layout-parsing" filePath := "./demo.jpg" fileBytes, err := ioutil.ReadFile(filePath) if err != nil { fmt.Printf("Error reading file: %v\n", err) return } fileData := base64.StdEncoding.EncodeToString(fileBytes) payload := map[string]interface{}{ "file": fileData, "fileType": 1, } payloadBytes, err := json.Marshal(payload) if err != nil { fmt.Printf("Error marshaling payload: %v\n", err) return } client := &http.Client{} req, err := http.NewRequest("POST", API_URL, bytes.NewBuffer(payloadBytes)) if err != nil { fmt.Printf("Error creating request: %v\n", err) return } req.Header.Set("Content-Type", "application/json") res, err := client.Do(req) if err != nil { fmt.Printf("Error sending request: %v\n", err) return } defer res.Body.Close() if res.StatusCode != http.StatusOK { fmt.Printf("Unexpected status code: %d\n", res.StatusCode) return } body, err := ioutil.ReadAll(res.Body) if err != nil { fmt.Printf("Error reading response: %v\n", err) return } type Markdown struct { Text string `json:"text"` Images map[string]string `json:"images"` } type LayoutResult struct { PrunedResult map[string]interface{} `json:"prunedResult"` Markdown Markdown `json:"markdown"` OutputImages map[string]string `json:"outputImages"` InputImage *string `json:"inputImage"` } type Response struct { Result struct { LayoutParsingResults []LayoutResult `json:"layoutParsingResults"` DataInfo interface{} `json:"dataInfo"` } `json:"result"` } var respData Response if err := json.Unmarshal(body, &respData); err != nil { fmt.Printf("Error parsing response: %v\n", err) return } for i, res := range respData.Result.LayoutParsingResults { fmt.Printf("Result %d - prunedResult: %+v\n", i, res.PrunedResult) mdDir := fmt.Sprintf("markdown_%d", i) os.MkdirAll(mdDir, 0755) mdFile := filepath.Join(mdDir, "doc.md") if err := os.WriteFile(mdFile, []byte(res.Markdown.Text), 0644); err != nil { fmt.Printf("Error writing markdown file: %v\n", err) } else { fmt.Printf("Markdown document saved at %s\n", mdFile) } for path, imgBase64 := range res.Markdown.Images { fullPath := filepath.Join(mdDir, path) if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil { fmt.Printf("Error creating directory for markdown image: %v\n", err) continue } imgBytes, err := base64.StdEncoding.DecodeString(imgBase64) if err != nil { fmt.Printf("Error decoding markdown image: %v\n", err) continue } if err := os.WriteFile(fullPath, imgBytes, 0644); err != nil { fmt.Printf("Error saving markdown image: %v\n", err) } } for name, imgBase64 := range res.OutputImages { imgBytes, err := base64.StdEncoding.DecodeString(imgBase64) if err != nil { fmt.Printf("Error decoding output image %s: %v\n", name, err) continue } filename := fmt.Sprintf("%s_%d.jpg", name, i) if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil { fmt.Printf("Error creating directory for output image: %v\n", err) continue } if err := os.WriteFile(filename, imgBytes, 0644); err != nil { fmt.Printf("Error saving output image %s: %v\n", filename, err) } else { fmt.Printf("Output image saved at %s\n", filename) } 
} } } </code></pre></details> <details><summary>C#</summary> <pre><code class="language-csharp">using System; using System.IO; using System.Net.Http; using System.Text; using System.Threading.Tasks; using Newtonsoft.Json.Linq; class Program { static readonly string API_URL = "http://localhost:8080/layout-parsing"; static readonly string inputFilePath = "./demo.jpg"; static async Task Main(string[] args) { var httpClient = new HttpClient(); byte[] fileBytes = File.ReadAllBytes(inputFilePath); string fileData = Convert.ToBase64String(fileBytes); var payload = new JObject { { "file", fileData }, { "fileType", 1 } }; var content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json"); HttpResponseMessage response = await httpClient.PostAsync(API_URL, content); response.EnsureSuccessStatusCode(); string responseBody = await response.Content.ReadAsStringAsync(); JObject jsonResponse = JObject.Parse(responseBody); JArray layoutParsingResults = (JArray)jsonResponse["result"]["layoutParsingResults"]; for (int i = 0; i < layoutParsingResults.Count; i++) { var res = layoutParsingResults[i]; Console.WriteLine($"[{i}] prunedResult:\n{res["prunedResult"]}"); JObject outputImages = res["outputImages"] as JObject; if (outputImages != null) { foreach (var img in outputImages) { string imgName = img.Key; string base64Img = img.Value?.ToString(); if (!string.IsNullOrEmpty(base64Img)) { string imgPath = $"{imgName}_{i}.jpg"; byte[] imageBytes = Convert.FromBase64String(base64Img); string directory = Path.GetDirectoryName(imgPath); if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory)) { Directory.CreateDirectory(directory); Console.WriteLine($"Created directory: {directory}"); } File.WriteAllBytes(imgPath, imageBytes); Console.WriteLine($"Output image saved at {imgPath}"); } } } } } } </code></pre></details> <details><summary>Node.js</summary> <pre><code class="language-js">const axios = require('axios'); const fs = require('fs'); const path = require('path'); const API_URL = 'http://localhost:8080/layout-parsing'; const imagePath = './demo.jpg'; const fileType = 1; function encodeImageToBase64(filePath) { const bitmap = fs.readFileSync(filePath); return Buffer.from(bitmap).toString('base64'); } const payload = { file: encodeImageToBase64(imagePath), fileType: fileType }; axios.post(API_URL, payload) .then(response => { const results = response.data.result.layoutParsingResults; results.forEach((res, index) => { console.log(`\n[${index}] prunedResult:`); console.log(res.prunedResult); const outputImages = res.outputImages; if (outputImages) { Object.entries(outputImages).forEach(([imgName, base64Img]) => { const imgPath = `${imgName}_${index}.jpg`; const directory = path.dirname(imgPath); if (!fs.existsSync(directory)) { fs.mkdirSync(directory, { recursive: true }); console.log(`Created directory: ${directory}`); } fs.writeFileSync(imgPath, Buffer.from(base64Img, 'base64')); console.log(`Output image saved at ${imgPath}`); }); } else { console.log(`[${index}] No outputImages.`); } }); }) .catch(error => { console.error('Error during API request:', error.message || error); }); </code></pre></details> <details><summary>PHP</summary> <pre><code class="language-php">&lt;?php $API_URL = "http://localhost:8080/layout-parsing"; $image_path = "./demo.jpg"; $image_data = base64_encode(file_get_contents($image_path)); $payload = array("file" => $image_data, "fileType" => 1); $ch = curl_init($API_URL); curl_setopt($ch, CURLOPT_POST, true); curl_setopt($ch, CURLOPT_POSTFIELDS, 
json_encode($payload)); curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json')); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); $response = curl_exec($ch); curl_close($ch); $result = json_decode($response, true)["result"]["layoutParsingResults"]; foreach ($result as $i => $item) { echo "[$i] prunedResult:\n"; print_r($item["prunedResult"]); if (!empty($item["outputImages"])) { foreach ($item["outputImages"] as $img_name => $img_base64) { $output_image_path = "{$img_name}_{$i}.jpg"; $directory = dirname($output_image_path); if (!is_dir($directory)) { mkdir($directory, 0777, true); echo "Created directory: $directory\n"; } file_put_contents($output_image_path, base64_decode($img_base64)); echo "Output image saved at $output_image_path\n"; } } else { echo "No outputImages found for item $i\n"; } } ?&gt; </code></pre></details> </details>

4.4 Pipeline Configuration Adjustment Instructions

NOTE: If you do not need to adjust pipeline configurations, you can ignore this section.

Adjusting the PaddleOCR-VL configuration for service deployment involves only three steps:

  1. Obtain the configuration file
  2. Modify the configuration file
  3. Apply the configuration file

4.4.1 Obtain the Configuration File

If you are deploying using Docker Compose:

Download the corresponding pipeline configuration file based on the backend you are using:

If you are deploying by manually installing dependencies:

Execute the following command to generate the pipeline configuration file:

```shell
paddlex --get_pipeline_config PaddleOCR-VL
```

4.4.2 Modify the Configuration File

Enhance VLM Inference Performance Using Acceleration Frameworks

To improve VLM inference performance using acceleration frameworks such as vLLM (refer to Section 2 for detailed instructions on starting the VLM inference service), modify the VLRecognition.genai_config.backend and VLRecognition.genai_config.server_url fields in the pipeline configuration file, as shown below:

```yaml
VLRecognition:
  ...
  genai_config:
    backend: vllm-server
    server_url: http://localhost:8118/v1
```

The Docker Compose solution already uses an acceleration framework by default.
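Before switching the pipeline over, it can be useful to confirm that the acceleration service is actually reachable at the configured server_url. The following is a minimal sketch, assuming a vLLM server that exposes the standard OpenAI-compatible interface on port 8118 as in the example above:

```python
# Reachability check for the VLM inference service (assumption: the server
# exposes the OpenAI-compatible /v1/models endpoint, as vLLM's server does,
# and listens on the port configured in server_url).
import requests

resp = requests.get("http://localhost:8118/v1/models", timeout=5)
resp.raise_for_status()
print(resp.json())  # lists the model(s) served by the backend
```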

Enable Document Image Preprocessing Functionality

The service started with default configurations does not support document preprocessing. If a client attempts to invoke this functionality, an error message will be returned. To enable document preprocessing, set use_doc_preprocessor to True in the pipeline configuration file and start the service using the modified configuration file.
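Once the service has been restarted with the modified configuration, a client can turn the preprocessing sub-features on per request through the documented useDocOrientationClassify and useDocUnwarping request fields. The following is a minimal sketch, assuming the service is listening on localhost:8080 as in the earlier invocation examples:

```python
import base64
import requests

# Encode the input image, then explicitly request document orientation
# classification and unwarping (both fields appear in the request table above).
with open("./demo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("ascii")

payload = {
    "file": image_data,
    "fileType": 1,
    "useDocOrientationClassify": True,
    "useDocUnwarping": True,
}
response = requests.post("http://localhost:8080/layout-parsing", json=payload)
response.raise_for_status()
print(response.json()["result"]["layoutParsingResults"][0]["prunedResult"])
```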

Disable Result Visualization Functionality

The service returns visualized results by default, which introduces additional overhead. To disable this functionality, add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  visualize: False
```

Additionally, you can set the visualize field to false in the request body to disable visualization for a single request.
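For example, a single request can opt out of visualization while the service-wide default stays unchanged. A minimal sketch, again assuming the service from the earlier examples:

```python
import base64
import requests

with open("./demo.jpg", "rb") as f:
    payload = {
        "file": base64.b64encode(f.read()).decode("ascii"),
        "fileType": 1,
        "visualize": False,  # this request returns no visualization images
    }

response = requests.post("http://localhost:8080/layout-parsing", json=payload)
response.raise_for_status()
# With visualization disabled for this request, outputImages should be null.
print(response.json()["result"]["layoutParsingResults"][0]["outputImages"])
```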

Configure Return of Image URLs

For visualized result images and images included in Markdown, the service returns them in Base64 encoding by default. To return images as URLs instead, add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  extra:
    file_storage:
      type: bos
      endpoint: https://bj.bcebos.com
      bucket_name: some-bucket
      ak: xxx
      sk: xxx
      key_prefix: deploy
    return_img_urls: True
    url_expires_in: 3600
```

Currently, storing generated images in Baidu Intelligent Cloud Object Storage (BOS) and returning URLs is supported. The parameters are described as follows:

  • endpoint: Access domain name (required).
  • ak: Baidu Intelligent Cloud Access Key (required).
  • sk: Baidu Intelligent Cloud Secret Key (required).
  • bucket_name: Storage bucket name (required).
  • key_prefix: Unified prefix for object keys.
  • connection_timeout_in_mills: Request timeout in milliseconds.
  • url_expires_in: URL validity period (in seconds). -1 indicates it never expires.

For more information on obtaining AK/SK and other details, refer to the Baidu Intelligent Cloud Official Documentation.

Limit the Number of PDF Pages Parsed

By default, the service processes the entire PDF file. In production environments, a PDF with too many pages may affect system stability, leading to processing timeouts or excessive resource usage. To ensure stable service operation, it is recommended to set a reasonable page limit based on actual needs. You can add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  extra:
    max_num_input_imgs: <page limit, e.g., 100>
```

When max_num_input_imgs is set to null, there will be no limit on the number of PDF pages.

4.4.3 Apply the Configuration File

If you are deploying using Docker Compose:

Set the services.paddleocr-vl-api.volumes field in the Compose file to mount the pipeline configuration file to the /home/paddleocr directory. For example:

```yaml
services:
  paddleocr-vl-api:
    ...
    volumes:
      # Use a relative path starting with ./ so Compose treats this as a bind mount.
      - ./pipeline_config_vllm.yaml:/home/paddleocr/pipeline_config_vllm.yaml
...
```

In a production environment, you can also build the image yourself and package the configuration file into the image.
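For instance, the configuration file can be copied into a derived image instead of being mounted at run time. The following is a minimal sketch; the base image name is a placeholder and should be replaced with the serving image referenced by your Compose file:

```dockerfile
# Hypothetical base image name; replace it with the PaddleOCR-VL serving image you use.
FROM <paddleocr-vl-serving-image>

# Package the custom pipeline configuration at the same path used by the volume mount above.
COPY pipeline_config_vllm.yaml /home/paddleocr/pipeline_config_vllm.yaml
```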

If you are deploying by manually installing dependencies:

When starting the service, specify the --pipeline parameter as the path to your custom configuration file.
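For example (a minimal sketch, assuming the service is launched with the paddlex --serve command covered earlier in this tutorial, and that the modified configuration was saved as ./PaddleOCR-VL.yaml):

```shell
# Hypothetical file name; point --pipeline at wherever you saved the modified configuration.
paddlex --serve --pipeline ./PaddleOCR-VL.yaml
```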

5. Model Fine-Tuning

If you find that PaddleOCR-VL does not meet accuracy expectations in specific business scenarios, we recommend using the ERNIEKit suite to perform supervised fine-tuning (SFT) on the VLM (e.g. PaddleOCR-VL-0.9B). For detailed instructions, refer to the ERNIEKit Official Documentation.

Currently, fine-tuning of layout detection and ranking models is not supported.