
PaddleOCR-VL Usage Tutorial

INFO: PaddleOCR provides a unified interface for the PaddleOCR-VL model series to facilitate quick setup and usage. Unless otherwise specified, the term "PaddleOCR-VL" in this tutorial and related hardware usage tutorials refers to the PaddleOCR-VL model series (e.g., PaddleOCR-VL-1.5). References specific to the PaddleOCR-VL v1 version will be explicitly noted.

PaddleOCR-VL is an advanced and efficient document parsing model designed specifically for element recognition in documents. Taking its initial version (PaddleOCR-VL v1) as an example, its core component is PaddleOCR-VL-0.9B, a compact yet powerful Vision-Language Model (VLM) that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable precise element recognition. The model series supports 109 languages and excels at recognizing complex elements (such as text, tables, formulas, and charts) while maintaining extremely low resource consumption. Comprehensive evaluations on widely used public benchmarks and internal benchmarks demonstrate that PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions, multimodal document-parsing models, and advanced general-purpose multimodal large models, while offering faster inference. These advantages make it highly suitable for deployment in real-world scenarios.

On January 29, 2026, we released PaddleOCR-VL-1.5. PaddleOCR-VL-1.5 not only significantly improves accuracy on the OmniDocBench v1.5 evaluation set, reaching 94.5%, but also adds support for localizing irregularly shaped bounding boxes. As a result, it performs strongly in real-world scenarios such as skew, warping, screen photography, varied illumination, and scanning. In addition, the model adds seal (stamp) recognition and text detection and recognition capabilities, with key metrics continuing to lead the industry.

Process Guide

You can first choose a reading path based on your goal, and then confirm whether you should continue with this tutorial or switch to the corresponding hardware-specific tutorial for the same chapter.

Before getting started, we recommend first identifying your device type:

  • x64 CPU: You can read this tutorial directly.
  • NVIDIA GPU:
    • If you are using a Blackwell-architecture GPU such as the RTX 50 series, we recommend first continuing with this process guide to determine your goal, and then referring to the corresponding chapters in the PaddleOCR-VL NVIDIA Blackwell Architecture GPU Usage Tutorial.
    • For other NVIDIA GPUs, you can read this tutorial directly.
  • Apple Silicon, Kunlunxin XPU, Hygon DCU, MetaX GPU, Iluvatar GPU, and Huawei Ascend NPU: We recommend first continuing with this process guide to determine your goal, and then referring to the corresponding chapters in the dedicated tutorial for your hardware.

If you need to confirm which inference methods PaddleOCR-VL supports on your current hardware (for example, using the PaddlePaddle framework as the inference engine) before proceeding along the path described above, first read the next section, “Inference Device Support for PaddleOCR-VL”.

After confirming the above, choose your reading path based on your goal:

  1. Local Direct Inference (Quick Experience / Script Integration):

    Suitable for directly calling PaddleOCR-VL on the local machine through the PaddleOCR CLI or Python API. This category usually corresponds to local inference engine methods such as PaddlePaddle or Transformers.

    Please read 1. Environment Preparation and 2. Quick Start, or the corresponding chapters in the hardware-specific tutorial.

  2. Client with a VLM Inference Service (Performance-Focused):

    Suitable for offloading only the VLM stage to a dedicated inference service for better performance. You can either deploy your own VLM inference service based on backends such as vLLM, SGLang, FastDeploy, MLX-VLM, and llama.cpp, or directly use a compatible managed service. This category usually corresponds to combinations of "Layout Detection Inference Method + VLM Inference Service".

    It is recommended to first complete the basic local direct inference flow described in the previous item, and then continue with 3. Improving Inference Performance with VLM Inference Services or the corresponding chapters in the hardware-specific tutorial.

    Note that Section 3 launches a VLM inference service, not the full PaddleOCR-VL API service. Other stages such as layout detection are still executed on the client side.

  3. Deploy the Full API Service:

    Suitable for packaging the full PaddleOCR-VL capability as a web service so that the client only needs to call it through an HTTP interface. Unlike the previous option, what is deployed here is an API service that directly exposes the complete PaddleOCR-VL capability, rather than a backend service that is only responsible for VLM inference. If you do not have special requirements for concurrent request processing, choose either of the following:

    • Deployment using Docker Compose (one-click startup, recommended): this uses the "PaddlePaddle + VLM Inference Service" inference method, where the underlying VLM service uses an inference acceleration framework. Please read 4.1 Method 1: Deploy Using Docker Compose and 4.3 Client-Side Invocation, or the corresponding chapters in the hardware-specific tutorial.
    • Manual deployment: by default, this uses PaddlePaddle inference. You can also switch to Transformers, or configure a VLM inference service to form a "Layout Detection Inference Method + VLM Inference Service" combination. Please read 1. Environment Preparation, 4.2 Method 2: Manual Deployment, and 4.3 Client-Side Invocation, or the corresponding chapters in the hardware-specific tutorial.

    For concurrent request processing, please refer to the High-Performance Service Deployment solution.

  4. Model Fine-tuning:

    If you find that the accuracy of PaddleOCR-VL in specific business scenarios does not meet expectations, please read 5. Model Fine-tuning or the corresponding chapters in the hardware-specific tutorial.

Hardware-specific usage tutorials:

<table> <thead> <tr> <th>Hardware Type</th> <th>Usage Tutorial</th> </tr> </thead> <tbody> <tr> <td>x64 CPU</td> <td>This tutorial (currently supports manual dependency installation only)</td> </tr> <tr> <td>NVIDIA GPU</td> <td>NVIDIA Blackwell architecture GPUs (such as the RTX 50 series): PaddleOCR-VL NVIDIA Blackwell Architecture GPU Usage Tutorial</td> </tr> </tbody> </table>

Inference Device Support for PaddleOCR-VL

PaddleOCR-VL currently provides multiple inference methods, and the supported inference devices are not exactly the same. Please confirm that your inference device meets the requirements in the table below before deploying PaddleOCR-VL:

<table border="1"> <thead> <tr> <th>Inference Method</th> <th>NVIDIA GPU</th> <th>Kunlunxin XPU</th> <th>Hygon DCU</th> <th>MetaX GPU</th> <th>Iluvatar GPU</th> <th>Huawei Ascend NPU</th> <th>x64 CPU</th> <th>Apple Silicon</th> <th>AMD GPU</th> <th>Intel Arc GPU</th> </tr> </thead> <tbody> <tr style="text-align: center;"> <td>PaddlePaddle</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>✅</td> <td>✅</td> <td>✅</td> </tr> <tr style="text-align: center;"> <td>Transformers</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + vLLM</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>-</td> <td>-</td> <td>✅</td> <td>✅</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + SGLang</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + FastDeploy</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>✅</td> <td>✅</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + MLX-VLM</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>✅</td> <td>-</td> <td>-</td> </tr> <tr style="text-align: center;"> <td>PaddlePaddle + llama.cpp</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + vLLM</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + SGLang</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + FastDeploy</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>-</td> <td>-</td> <td>🚧</td> <td>🚧</td> </tr> <tr style="text-align: center;"> <td>Transformers + MLX-VLM</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>✅</td> <td>-</td> <td>-</td> </tr> <tr style="text-align: center;"> <td>Transformers + llama.cpp</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> <td>✅</td> <td>🚧</td> <td>🚧</td> <td>🚧</td> </tr> </tbody> </table> <details><summary>Explanation of Inference Method</summary> "PaddlePaddle" indicates that both the layout detection model and the VLM use the PaddlePaddle framework for inference. This is the default mode for the PaddleOCR CLI and Python API. "Transformers" indicates that both the layout detection model and the VLM use the Transformers engine for inference. Other inference methods follow the format "Layout Detection Model Inference Method + VLM Inference Method". For example, "PaddlePaddle + vLLM" means that the layout detection model uses PaddlePaddle for inference, while the VLM uses vLLM. </details>

TIP:

  • When using an NVIDIA GPU for inference, ensure that the Compute Capability (CC) and CUDA version meet the following requirements (a quick check is sketched after this list):
    • PaddlePaddle: CC ≥ 7.0, CUDA ≥ 11.8
    • Transformers: CC ≥ 7.0, CUDA ≥ 11.8
    • vLLM: CC ≥ 8.0, CUDA ≥ 12.6
    • SGLang: 8.0 ≤ CC < 12.0, CUDA ≥ 12.6
    • FastDeploy: 8.0 ≤ CC < 12.0, CUDA ≥ 12.6
    • Common GPUs with CC ≥ 8.0 include the RTX 30/40/50 series and A10/A100; for more models, refer to CUDA GPU Compute Capability.
  • vLLM compatibility note: Although vLLM can be launched on NVIDIA GPUs with CC 7.x such as T4/V100, timeout or OOM issues may occur, and its use is not recommended.
  • vLLM, SGLang, and FastDeploy cannot run natively on Windows. Please use the Docker images we provide.
  • Due to dependency conflicts between different libraries, when using mixed inference methods like Transformers + vLLM, it is recommended to deploy the layout detection model and VLM service in different environments.
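
If the PaddlePaddle framework is already installed in your environment (see 1. Environment Preparation below), a quick way to check the requirements above is the minimal sketch below; it only assumes a working paddlepaddle-gpu installation and prints the detected compute capability and the CUDA version PaddlePaddle was built against:

python
# Minimal check of GPU compute capability and the CUDA version that
# PaddlePaddle was built against (assumes paddlepaddle-gpu is installed).
import paddle

print("Compute capability:", paddle.device.cuda.get_device_capability())  # e.g. (8, 6)
print("CUDA version:", paddle.version.cuda())                             # e.g. '12.6'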

1. Environment Preparation

This section explains how to set up the runtime environment for PaddleOCR-VL. This tutorial mainly applies to x64 CPU users and NVIDIA GPU users other than Blackwell. For other hardware, please refer first to the dedicated tutorials listed above.

This tutorial provides the following two methods for environment preparation:

  • Method 1: Use the official Docker image (NVIDIA GPU only).

  • Method 2: Manually install the inference engine and PaddleOCR (available for both x64 CPU and NVIDIA GPU).

We strongly recommend using the Docker image to minimize potential environment-related issues.

1.1 Method 1: Using Docker Image

We recommend using the official Docker image (requires Docker >= 19.03 and a GPU-equipped machine whose NVIDIA driver supports CUDA 12.6 or later):

shell
docker run \
    -it \
    --gpus all \
    --network host \
    --user root \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu \
    /bin/bash
# Invoke PaddleOCR CLI or Python API within the container

If you need to use PaddleOCR-VL in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu (approximately 8 GB) in the above command with the offline image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline (approximately 10 GB). Pull the image on an internet-connected machine, transfer it to the offline machine, and then start the container from this image there. For example:

shell
# Execute on an internet-connected machine
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline
# Save the image to a file
docker save ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-nvidia-gpu-offline -o paddleocr-vl-latest-nvidia-gpu-offline.tar

# Transfer the image file to the offline machine

# Execute on the offline machine
docker load -i paddleocr-vl-latest-nvidia-gpu-offline.tar
# After that, you can use `docker run` to start the container on the offline machine

The image comes preinstalled with the PaddlePaddle framework and does not include any other inference engines. If you want to use other inference engines, it is recommended to install them manually using Method 2 (it is not recommended to install them in an environment where the PaddlePaddle framework is preinstalled).

TIP: Images with the latest-xxx tag correspond to the latest version. If you want to use a specific version of the image, you can replace latest in the tag with the desired PaddleOCR version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:paddleocr3.3-nvidia-gpu-offline

1.2 Method 2: Manually Install the Inference Engine and PaddleOCR

If you cannot use Docker, you can manually install PaddlePaddle and PaddleOCR. The required Python version is 3.8–3.13.

We strongly recommend installing PaddleOCR-VL in a virtual environment to avoid dependency conflicts. For example, use the Python venv standard library to create a virtual environment:

shell
# Create a virtual environment
python -m venv .venv_paddleocr
# Activate the environment
source .venv_paddleocr/bin/activate

Please first install the dependencies corresponding to your chosen inference engine:

  • If you use PaddlePaddle for inference, install PaddlePaddle 3.2.1 or later (do not install both the CPU and GPU versions of PaddlePaddle at the same time). Common installation commands are as follows:
shell
# NVIDIA GPU (CUDA 12.6 as an example)
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# x64 CPU
python -m pip install paddlepaddle==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/

For other CUDA versions, please refer to the PaddlePaddle installation guide: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html

After installing the inference engine, run the following command to install the base package required by PaddleOCR-VL:

shell
python -m pip install -U "paddleocr[doc-parser]"
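
To verify that the environment is ready, a minimal check (assuming the commands above completed without errors) is to run PaddlePaddle's built-in installation check and import the PaddleOCR-VL pipeline class; no model files are downloaded at this point:

python
# Verify the PaddlePaddle installation and confirm that the PaddleOCR-VL
# pipeline class can be imported. No models are downloaded by this check.
import paddle
from paddleocr import PaddleOCRVL

paddle.utils.run_check()  # reports whether PaddlePaddle is installed correctly
print(PaddleOCRVL)        # import succeeds if paddleocr[doc-parser] is installed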

2. Quick Start

This section introduces how to use PaddleOCR-VL through the CLI and Python API.

PaddleOCR-VL supports both CLI and Python API usage. The CLI method is simpler and suitable for quick verification, while the Python API is more flexible and suitable for integration into existing projects. The examples below use PaddlePaddle inference by default. To switch to the transformers engine, append --engine transformers in the CLI, or pass engine="transformers" when initializing the Python API.

IMPORTANT: The methods introduced in this section are primarily for rapid validation. Their inference speed, memory usage, and stability may not meet the requirements of a production environment. If deployment to a production environment is needed, we strongly recommend using a dedicated VLM inference service. For specific methods, please refer to the next section.

2.1 Command Line Usage

When you run PaddleOCR-VL for the first time, it will automatically download the official model files. Please make sure the current environment has internet access and allow some extra time for downloading and initialization.

To run the examples below on the demo image used in this document, download it first:

shell
curl -L -o paddleocr_vl_demo.png https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png

The following are ready-to-copy example commands. For the first try, we recommend adding --save_path ./output so that you can inspect the saved results in the current directory:

shell
# NVIDIA GPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --save_path ./output

# Kunlunxin XPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device xpu --save_path ./output

# Hygon DCU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device dcu --save_path ./output

# MetaX GPU
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device metax_gpu --save_path ./output

# Apple Silicon
paddleocr doc_parser -i ./paddleocr_vl_demo.png --device cpu --save_path ./output

# Huawei Ascend NPU
# For Huawei Ascend NPU, please refer to Chapter 3 and use PaddlePaddle + vLLM for inference

# Use --use_doc_orientation_classify to enable document orientation classification
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_doc_orientation_classify True --save_path ./output

# Use --use_doc_unwarping to enable the document unwarping module
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_doc_unwarping True --save_path ./output

# Set --use_layout_detection to False to disable the layout detection and ordering module
paddleocr doc_parser -i ./paddleocr_vl_demo.png --use_layout_detection False --save_path ./output

After successful execution, the terminal will print the structured result. If you set --save_path ./output, the result files will also be saved under the output directory in the current working directory for further inspection and debugging.
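
The saved files can also be inspected programmatically. The sketch below simply loads every JSON file found under ./output and prints its top-level keys; the exact file names depend on the input file name and the PaddleOCR version, so this is only an illustrative helper:

python
# Illustrative sketch: load the JSON results saved via --save_path ./output
# and print the top-level keys of each file.
import json
from pathlib import Path

for json_path in sorted(Path("./output").glob("*.json")):
    with open(json_path, "r", encoding="utf-8") as f:
        result = json.load(f)
    print(json_path.name, "->", list(result.keys()))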

To switch to the transformers engine, use:

shell
paddleocr doc_parser -i ./paddleocr_vl_demo.png --engine transformers --save_path ./output
<details><summary><b>Command line supports more parameters. Click to expand for detailed parameter descriptions</b></summary> <table> <thead> <tr> <th>Parameter</th> <th>Description</th> <th>Type</th> <th>Default</th> </tr> </thead> <tbody> <tr> <td><code>input</code></td> <td><b>Meaning:</b>Data to be predicted, required.

<b>Description:</b> A local path to an image or PDF file, for example <code>/root/data/img.jpg</code>; <b>a URL link</b>, such as the network URL of an image or PDF file: <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/demo_paper.png">Example</a>; <b>or a local directory</b> containing the images to be predicted, for example <code>/root/data/</code> (prediction for directories containing PDF files is currently not supported; PDF files must be specified by their full file path).</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>save_path</code></td> <td><b>Meaning:</b>Specify the path where the inference result file will be saved.

<b>Description:</b> If not set, the inference results will not be saved locally.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>pipeline_version</code></td> <td> <b>Meaning:</b> Specifies the pipeline version.
<b>Description:</b> The currently available values are <code>"v1"</code> and <code>"v1.5"</code>.
</td> <td><code>str</code></td> <td>"v1.5"</td> </tr> <tr> <td><code>layout_detection_model_name</code></td> <td><b>Meaning:</b>Name of the layout area detection and ranking model.

<b>Description:</b> If not set, the default model of the pipeline will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b>Directory path of the layout area detection and ranking model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>Score threshold for the layout model.

<b>Description:</b> Any value between <code>0-1</code>. If not set, the default value is used, which is <code>0.5</code>.

<td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>Whether to use post-processing NMS for layout detection.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b>Expansion coefficient for the detection boxes of the layout area detection model.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the initialized default value will be used.</td>

<td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>Merging mode for the detection boxes output by the model in layout detection.

<b>Description:</b>

<ul> <li><b>large</b> when set to large, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the outermost largest box is retained, and the overlapping inner boxes are deleted;</li> <li><b>small</b>, when set to small, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the innermost contained small box is retained, and the overlapping outer boxes are deleted;</li> <li><b>union</b>,no filtering is performed on the boxes, and both inner and outer boxes are retained;</li></ul> If not set, the initialized parameter value will be used. </td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_model_name</code></td> <td><b>Meaning:</b>Name of the multimodal recognition model.

<b>Description:</b> If not set, the default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_model_dir</code></td> <td><b>Meaning:</b>Directory path of the multimodal recognition model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_backend</code></td> <td><b>Meaning:</b>Inference backend used by the multimodal recognition model.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_server_url</code></td> <td><b>Description:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the server URL.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_max_concurrency</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the maximum number of concurrent requests.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>vl_rec_api_model_name</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the model name of the service.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>vl_rec_api_key</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the API key of the service.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td><b>Meaning:</b>Name of the document orientation classification model.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b>Directory path of the document orientation classification model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b>Name of the text image rectification model.

<b>Description:</b> If not set, the initialized default value will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text image rectification model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b>Whether to load and use the document orientation classification module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to<code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to load and use the text image rectification module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to load and use the layout area detection and ranking module.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function.

<b>Description:</b> If not set, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>Controls whether to format the content within <code>block_content</code> as Markdown.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Controls whether to merge layout detection boxes for cross-column or staggered top-and-bottom column layouts.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.

<b>Description:</b> If not set, the initialized default value will be used, which defaults to initialization as<code>['number','footnote','header','header_image','footer','footer_image','aside_text']</code>.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_shape_mode</code></td> <td> <b>Meaning:</b>Specifies the geometric representation mode for layout detection results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed.
<b>Description:</b> Value descriptions:
<ul>
  <li>
    <b>rect (rectangle)</b>:
    Outputs an axis-aligned bounding box (including x1, y1, x2, y2).
    Suitable for standard horizontally aligned layouts.
  </li>
  <li>
    <b>quad (quadrilateral)</b>:
    Outputs an arbitrary quadrilateral composed of four vertices.
    Suitable for regions with skew or perspective distortion.
  </li>
  <li>
    <b>poly (polygon)</b>:
    Outputs a closed contour composed of multiple coordinate points.
    Suitable for irregularly shaped or curved layout elements,
    offering the highest precision.
  </li>
  <li>
    <b>auto (automatic)</b>:
    The system automatically selects the most appropriate shape
    representation based on the complexity and confidence of the
    detected targets.
  </li>
</ul>
</td> <td><code>str</code></td> <td>"auto"</td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>Used to control whether to enable internal queues.

<b>Description:</b> When set to <code>True</code>, data loading (such as rendering PDF pages as images), layout detection model processing, and VLM inference will be executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with a large number of pages or directories containing a large number of images or PDF files. If not set, the initialized default value will be used, which defaults to initialization as <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>prompt_label</code></td> <td><b>Meaning:</b>The prompt type setting for the VL model, which takes effect if and only if <code>use_layout_detection=False</code>.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>repetition_penalty</code></td> <td><b>Meaning:</b>The repetition penalty parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>temperature</code></td> <td><b>Meaning:</b>The temperature parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>top_p</code></td> <td><b>Meaning:</b>The top-p parameter used in VL model sampling.</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>min_pixels</code></td> <td><b>Meaning:</b>The minimum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>max_pixels</code></td> <td><b>Meaning:</b>The maximum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b>The device used for inference.

<b>Description:</b> Supports specifying specific card numbers:<ul>

<li><b>CPU</b>: For example,<code>cpu</code> indicates using the CPU for inference;</li> <li><b>GPU</b>: For example,<code>gpu:0</code> indicates using the first GPU for inference;</li> <li><b>NPU</b>: For example,<code>npu:0</code> indicates using the first NPU for inference;</li> <li><b>XPU</b>: For example,<code>xpu:0</code> indicates using the first XPU for inference;</li> <li><b>MLU</b>: For example,<code>mlu:0</code> indicates using the first MLU for inference;</li> <li><b>DCU</b>: For example,<code>dcu:0</code> indicates using the first DCU for inference;</li> <li><b>MetaX GPU</b>: For example,<code>metax_gpu:0</code> indicates using the first MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: For example,<code>iluvatar_gpu:0</code> indicates using the first Iluvatar GPU for inference;</li> </ul>If not set, the initialized default value will be used. During initialization, the local GPU device 0 will be used preferentially. If it is not available, the CPU device will be used.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For CUDA 11.8 versions of PaddlePaddle, the compatible TensorRT version is 8.x (x>=6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>fp32</code> or <code>fp16</code>.</td> <td><code>str</code></td> <td><code>fp32</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>mkldnn_cache_capacity</code></td> <td> <b>Meaning:</b> MKL-DNN cache capacity. </td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>cpu_threads</code></td> <td><b>Meaning:</b> Number of threads used for inference on CPU.</td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>paddlex_config</code></td> <td><b>Meaning:</b> Path to the PaddleX pipeline configuration file.</td> <td><code>str</code></td> <td></td> </tr> </tbody> </table> </details>

The inference result will be printed in the terminal. The default output of PaddleOCR-VL is as follows:

<details><summary> 👉Click to expand</summary> <pre> <code> {'res': {'input_path': 'paddleocr_vl_demo.png', 'page_index': None, 'model_settings': {'use_doc_preprocessor': False, 'use_layout_detection': True, 'use_chart_recognition': False, 'format_block_content': False}, 'layout_det_res': {'input_path': None, 'page_index': None, 'boxes': [{'cls_id': 6, 'label': 'doc_title', 'score': 0.9636914134025574, 'coordinate': [np.float32(131.31366), np.float32(36.450516), np.float32(1384.522), np.float32(127.984665)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9281806349754333, 'coordinate': [np.float32(585.39465), np.float32(158.438), np.float32(930.2184), np.float32(182.57469)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840355515480042, 'coordinate': [np.float32(9.023666), np.float32(200.86115), np.float32(361.41583), np.float32(343.8828)]}, {'cls_id': 14, 'label': 'image', 'score': 0.9871416091918945, 'coordinate': [np.float32(775.50574), np.float32(200.66502), np.float32(1503.3807), np.float32(684.9304)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9801855087280273, 'coordinate': [np.float32(9.532196), np.float32(344.90594), np.float32(361.4413), np.float32(440.8244)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9708921313285828, 'coordinate': [np.float32(28.040405), np.float32(455.87976), np.float32(341.7215), np.float32(520.7117)]}, {'cls_id': 24, 'label': 'vision_footnote', 'score': 0.9002962708473206, 'coordinate': [np.float32(809.0692), np.float32(703.70044), np.float32(1488.3016), np.float32(750.5238)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9825374484062195, 'coordinate': [np.float32(8.896561), np.float32(536.54895), np.float32(361.05237), np.float32(655.8058)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822263717651367, 'coordinate': [np.float32(8.971573), np.float32(657.4949), np.float32(362.01715), np.float32(774.625)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9767460823059082, 'coordinate': [np.float32(9.407074), np.float32(776.5216), np.float32(361.31067), np.float32(846.82874)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9868153929710388, 'coordinate': [np.float32(8.669495), np.float32(848.2543), np.float32(361.64703), np.float32(1062.8568)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9826608300209045, 'coordinate': [np.float32(8.8025055), np.float32(1063.8615), np.float32(361.46588), np.float32(1182.8524)]}, {'cls_id': 22, 'label': 'text', 'score': 0.982555627822876, 'coordinate': [np.float32(8.820602), np.float32(1184.4663), np.float32(361.66394), np.float32(1302.4507)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9584776759147644, 'coordinate': [np.float32(9.170288), np.float32(1304.2161), np.float32(361.48898), np.float32(1351.7483)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9782056212425232, 'coordinate': [np.float32(389.1618), np.float32(200.38202), np.float32(742.7591), np.float32(295.65146)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9844875931739807, 'coordinate': [np.float32(388.73303), np.float32(297.18463), np.float32(744.00024), np.float32(441.3034)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9680547714233398, 'coordinate': [np.float32(409.39468), np.float32(455.89386), np.float32(721.7174), np.float32(520.9387)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9741666913032532, 'coordinate': [np.float32(389.71606), np.float32(536.8138), np.float32(742.7112), np.float32(608.00165)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9840384721755981, 'coordinate': [np.float32(389.30988), np.float32(609.39636), np.float32(743.09247), 
np.float32(750.3231)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9845995306968689, 'coordinate': [np.float32(389.13272), np.float32(751.7772), np.float32(743.058), np.float32(894.8815)]}, {'cls_id': 22, 'label': 'text', 'score': 0.984852135181427, 'coordinate': [np.float32(388.83267), np.float32(896.0371), np.float32(743.58215), np.float32(1038.7345)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9804865717887878, 'coordinate': [np.float32(389.08478), np.float32(1039.9119), np.float32(742.7585), np.float32(1134.4897)]}, {'cls_id': 22, 'label': 'text', 'score': 0.986461341381073, 'coordinate': [np.float32(388.52643), np.float32(1135.8137), np.float32(743.451), np.float32(1352.0085)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9869391918182373, 'coordinate': [np.float32(769.8341), np.float32(775.66235), np.float32(1124.9813), np.float32(1063.207)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9822869896888733, 'coordinate': [np.float32(770.30383), np.float32(1063.938), np.float32(1124.8295), np.float32(1184.2192)]}, {'cls_id': 17, 'label': 'paragraph_title', 'score': 0.9689218997955322, 'coordinate': [np.float32(791.3042), np.float32(1199.3169), np.float32(1104.4521), np.float32(1264.6985)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9713128209114075, 'coordinate': [np.float32(770.4253), np.float32(1279.6072), np.float32(1124.6917), np.float32(1351.8672)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9236552119255066, 'coordinate': [np.float32(1153.9058), np.float32(775.5814), np.float32(1334.0654), np.float32(798.1581)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9857938885688782, 'coordinate': [np.float32(1151.5197), np.float32(799.28015), np.float32(1506.3619), np.float32(991.1156)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9820687174797058, 'coordinate': [np.float32(1151.5686), np.float32(991.91095), np.float32(1506.6023), np.float32(1110.8875)]}, {'cls_id': 22, 'label': 'text', 'score': 0.9866049885749817, 'coordinate': [np.float32(1151.6919), np.float32(1112.1301), np.float32(1507.1611), np.float32(1351.9504)]}]}}} </code></pre></details>

For detailed descriptions of the running results and saving interfaces, refer to the result explanation in 2.2 Python Script Integration.

<b>Note: </b> Since the default model of PaddleOCR-VL is relatively large, inference may be slow. For actual use, it is recommended to use 3. Improving Inference Performance with VLM Inference Services for faster inference.

2.2 Python Script Integration

The command line method is intended for quick testing and visualization. In real projects, you usually integrate the model through code. You can quickly run PaddleOCR-VL inference with just a few lines of code:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

# NVIDIA GPU
pipeline = PaddleOCRVL()
# Kunlunxin XPU
# pipeline = PaddleOCRVL(device="xpu")
# Hygon DCU
# pipeline = PaddleOCRVL(device="dcu")
# MetaX GPU
# pipeline = PaddleOCRVL(device="metax_gpu")
# Apple Silicon
# pipeline = PaddleOCRVL(device="cpu")
# Huawei Ascend NPU: please refer to Chapter 3 and use PaddlePaddle + vLLM for inference

# pipeline = PaddleOCRVL(use_doc_orientation_classify=True) # Use use_doc_orientation_classify to enable/disable document orientation classification model
# pipeline = PaddleOCRVL(use_doc_unwarping=True) # Use use_doc_unwarping to enable/disable document unwarping module
# pipeline = PaddleOCRVL(use_layout_detection=False) # Use use_layout_detection to enable/disable layout detection module

output = pipeline.predict("./paddleocr_vl_demo.png")
for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
    res.save_to_word(save_path="output") ## Save the current image's result in Word format

To switch to the transformers engine, use:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

pipeline = PaddleOCRVL(engine="transformers")
output = pipeline.predict("./paddleocr_vl_demo.png")
for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
    res.save_to_word(save_path="output") ## Save the current image's result in Word format

For PDF files, each page will be processed individually, and a separate Markdown file will be generated for each page. If you wish to perform cross-page table merging, reconstruct multi-level headings, or merge multi-page results, you can achieve this using the following method:

python
from pathlib import Path

from paddleocr import PaddleOCRVL

input_file = "./your_pdf_file.pdf"
output_dir = Path("./output")
output_dir.mkdir(parents=True, exist_ok=True)

pipeline = PaddleOCRVL()

output = pipeline.predict(input=input_file)

pages_res = list(output)

output = pipeline.restructure_pages(pages_res)
# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge multiple pages

for res in output:
    res.print() ## Print the structured prediction output
    res.save_to_json(save_path=output_dir) ## Save the current image's structured result in JSON format
    res.save_to_markdown(save_path=output_dir) ## Save the current image's result in Markdown format
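
If you do not need the cross-page restructuring shown above and want to limit memory usage for long PDFs, the predict_iter() method (described in the parameter reference later in this section) returns a generator, so each page result can be saved and discarded as soon as it is produced. A minimal sketch:

python
# Minimal sketch: process a long PDF page by page with predict_iter(), which
# returns a generator instead of materializing all page results at once.
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
for res in pipeline.predict_iter("./your_pdf_file.pdf"):
    res.save_to_markdown(save_path="./output")  # one Markdown file per page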

If you need to process multiple files, it is recommended to pass the directory path containing the files or a list of file paths to the predict method to maximize processing efficiency. For example:

python
# The `imgs` directory contains multiple images to be processed: file1.png, file2.png, file3.png
# Pass the directory path
output = pipeline.predict("imgs")
# Or pass a list of file paths
output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"])
# Both of the above methods are more efficient than the following approach:
# for file in ["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]:
#     output = pipeline.predict(file)

Note:

  • In the example code above, the use_doc_orientation_classify and use_doc_unwarping parameters are both set to False by default, meaning document orientation classification and document unwarping are disabled. If you need these features, set them to True manually.

The above Python script performs the following steps:

<details><summary>(1) Instantiate the pipeline object. Specific parameter descriptions are as follows:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tbody> <tr> <td><code>pipeline_version</code></td> <td> <b>Meaning:</b> Specifies the pipeline version.
<b>Description:</b> The currently available values are <code>"v1"</code> and <code>"v1.5"</code>.
</td> <td><code>str</code></td> <td>"v1.5"</td> </tr> <tr> <td><code>layout_detection_model_name</code></td> <td><b>Meaning:</b>Name of the layout area detection and ranking model.

<b>Description:</b> If set to <code>None</code>, the default model of the pipeline will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b>Directory path of the layout area detection and ranking model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>Score threshold for the layout model.

<b>Description:</b>

<ul> <li><b>float</b>: Any floating-point number between <code>0-1</code>;</li> <li><b>dict</b>: <code>{0:0.1}</code> The key is the class ID, and the value is the threshold for that class;</li> <li><b>None</b>: If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</li> </ul> <td><code>float|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>Whether to use post-processing NMS for layout detection.

<b>Description:</b> If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td> <b>Meaning:</b>Expansion coefficient for the detection box of the layout area detection model.

<b>Description:</b>

<ul> <li><b>float</b>: Any floating-point number greater than <code>0</code>;</li> <li><b>Tuple[float,float]</b>: The respective expansion coefficients in the horizontal and vertical directions;</li> <li><b>dict</b>: The key is of <b>int</b> type, representing the <code>cls_id</code>, and the value is of <code>tuple</code> type, such as <code>{0: (1.1, 2.0)}</code>, indicating that the center of the detection box for class 0 output by the model remains unchanged, with the width expanded by 1.1 times and the height expanded by 2.0 times;</li> <li><b>None</b>: If set to <code>None</code>, the parameter value initialized by the pipeline will be used.</li> </ul> <td><code>float|Tuple[float,float]|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>Merging mode for the detection boxes output by the model in layout detection.

<b>Description:</b>

<ul> <li><b>large</b> when set to large, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the outermost largest box is retained, and the overlapping inner boxes are deleted;</li> <li><b>small</b>, when set to small, it means that among the detection boxes output by the model, for overlapping and contained boxes, only the innermost contained small box is retained, and the overlapping outer boxes are deleted;</li> <li><b>union</b>,no filtering is performed on the boxes, and both inner and outer boxes are retained;</li></ul> If set to <code>None</code>, the initialized parameter value will be used. </td> <td><code>str|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_model_name</code></td> <td><b>Meaning:</b>Name of the multimodal recognition model.

<b>Description:</b> If set to <code>None</code>, the default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_model_dir</code></td> <td><b>Meaning:</b>Directory path of the multimodal recognition model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_backend</code></td> <td><b>Meaning:</b>Inference backend used by the multimodal recognition model.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_server_url</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the server URL.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_max_concurrency</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the maximum number of concurrent requests.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_api_model_name</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the model name of the service.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>vl_rec_api_key</code></td> <td><b>Meaning:</b>If the multimodal recognition model uses an inference service, this parameter is used to specify the API key of the service.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td><b>Meaning:</b>Name of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b>Directory path of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b>Name of the text image rectification model.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text image rectification model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b>Whether to load and use the document orientation classification module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to<code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to load and use the text image rectification module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to load and use the layout area detection and ranking module.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>True</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which is initialized to <code>False</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>Controls whether to format the content within <code>block_content</code> as Markdown.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as<code>False</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Controls whether to merge layout detection boxes for cross-column or staggered top-and-bottom column layouts.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as<code>True</code>.</td>

<td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.

<b>Description:</b> If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as <code>['number','footnote','header','header_image','footer','footer_image','aside_text']</code>.</td>

<td><code>list|None</code></td> <td></td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>Used to control whether to enable internal queues.

<b>Description:</b> When set to <code>True</code>, data loading (such as rendering PDF pages as images), layout detection model processing, and VLM inference will be executed asynchronously in separate threads, with data passed through queues, thereby improving efficiency. This approach is particularly efficient for PDF documents with many pages or directories containing a large number of images or PDF files. If set to <code>None</code>, the initialized default value will be used, which defaults to initialization as <code>True</code>.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b>The device used for inference.

<b>Description:</b> Supports specifying specific card numbers:<ul>

<li><b>CPU</b>: For example,<code>cpu</code> indicates using the CPU for inference;</li> <li><b>GPU</b>: For example,<code>gpu:0</code> indicates using the first GPU for inference;</li> <li><b>NPU</b>: For example,<code>npu:0</code> indicates using the first NPU for inference;</li> <li><b>XPU</b>: For example,<code>xpu:0</code> indicates using the first XPU for inference;</li> <li><b>MLU</b>: For example,<code>mlu:0</code> indicates using the first MLU for inference;</li> <li><b>DCU</b>: For example,<code>dcu:0</code> indicates using the first DCU for inference;</li> <li><b>MetaX GPU</b>: For example,<code>metax_gpu:0</code> indicates using the first MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: For example,<code>iluvatar_gpu:0</code> indicates using the first Iluvatar GPU for inference;</li> </ul>If not set, the initialized default value will be used. During initialization, the local GPU device 0 will be used preferentially. If it is not available, the CPU device will be used.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine_config</code></td> <td><b>Meaning:</b> Inference-engine configuration. <b>Description:</b> Recommended together with <code>engine</code>. For supported fields, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For CUDA 11.8 versions of PaddlePaddle, the compatible TensorRT version is 8.x (x>=6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>"fp32"</code> or <code>"fp16"</code>.</td> <td><code>str</code></td> <td><code>"fp32"</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>mkldnn_cache_capacity</code></td> <td> <b>Meaning:</b> MKL-DNN cache capacity. </td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>cpu_threads</code></td> <td><b>Meaning:</b> Number of threads used for inference on CPU.</td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>paddlex_config</code></td> <td><b>Meaning:</b> Path to the PaddleX pipeline configuration file.</td> <td><code>str</code></td> <td><code>None</code></td> </tr> </tbody> </table> </details> <details><summary>(2) Call the <code>predict()</code> method of the PaddleOCR-VL pipeline object for inference prediction. This method returns a list of results. The pipeline also provides the <code>predict_iter()</code> method. The two methods accept the same parameters and return results in the same form; the difference is that <code>predict_iter()</code> returns a <code>generator</code>, which processes and yields prediction results incrementally, making it suitable for large datasets or memory-constrained scenarios. Choose either method according to your actual needs. Below are the parameters of the <code>predict()</code> method and their descriptions:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tr> <td><code>input</code></td> <td><b>Meaning:</b> Data to be predicted, supporting multiple input types. Required.

<b>Description:</b>

<ul> <li><b>Python Var</b>: such as <code>numpy.ndarray</code> representing image data</li> <li><b>str</b>: <b>a local file path</b>, such as the path of an image or PDF file: <code>/root/data/img.jpg</code>; <b>a URL</b>, such as the network URL of an image or PDF file: <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/demo_paper.png">example</a>; or <b>a local directory</b> containing the images to be predicted, such as <code>/root/data/</code> (prediction for directories containing PDF files is currently not supported; PDF files must be specified by their full file path)</li> <li><b>list</b>: List elements should be of the aforementioned data types, such as <code>[numpy.ndarray, numpy.ndarray]</code>, <code>["/root/data/img1.jpg", "/root/data/img2.jpg"]</code>, <code>["/root/data1", "/root/data2"]</code>.</li> </ul> </td> <td><code>Python Var|str|list</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b> Whether to use the document orientation classification module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to use the text image rectification module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_layout_detection</code></td> <td><b>Meaning:</b>Whether to use the layout region detection and sorting module during inference.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_chart_recognition</code></td> <td><b>Meaning:</b>Whether to use the chart parsing function. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to use the seal recognition function. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_ocr_for_image_block</code></td> <td><b>Meaning:</b>Whether to perform OCR on text within image blocks. Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>float|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>float|Tuple[float,float]|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>str|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_shape_mode</code></td> <td> <b>Meaning:</b>Specifies the geometric representation mode for layout detection results. It defines how the boundaries of detected regions (e.g., text blocks, images, tables) are calculated and displayed.
<b>Description:</b> Value descriptions:
<ul>
  <li>
    <b>rect (rectangle)</b>:
    Outputs an axis-aligned bounding box (including x1, y1, x2, y2).
    Suitable for standard horizontally aligned layouts.
  </li>
  <li>
    <b>quad (quadrilateral)</b>:
    Outputs an arbitrary quadrilateral composed of four vertices.
    Suitable for regions with skew or perspective distortion.
  </li>
  <li>
    <b>poly (polygon)</b>:
    Outputs a closed contour composed of multiple coordinate points.
    Suitable for irregularly shaped or curved layout elements,
    offering the highest precision.
  </li>
  <li>
    <b>auto (automatic)</b>:
    The system automatically selects the most appropriate shape
    representation based on the complexity and confidence of the
    detected targets.
  </li>
</ul>
</td> <td><code>str</code></td> <td>"auto"</td> </tr> <tr> <td><code>use_queues</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>prompt_label</code></td> <td><b>Meaning:</b>The prompt type setting for the VL model, which takes effect only when <code>use_layout_detection=False</code>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>format_block_content</code></td> <td><b>Meaning:</b>The parameter meaning is basically the same as the instantiation parameter.

<b>Description:</b> Setting it to <code>None</code> means using the instantiation parameter; otherwise, this parameter takes precedence.</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>repetition_penalty</code></td> <td><b>Meaning:</b>The repetition penalty parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>temperature</code></td> <td><b>Meaning:</b>Temperature parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>top_p</code></td> <td><b>Meaning:</b>Top-p parameter used for VL model sampling.</td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>min_pixels</code></td> <td><b>Meaning:</b>The minimum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>max_pixels</code></td> <td><b>Meaning:</b>The maximum number of pixels allowed when the VL model preprocesses images.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>max_new_tokens</code></td> <td><b>Meaning:</b>The maximum number of tokens generated by the VL model.</td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_layout_blocks</code></td> <td><b>Meaning:</b>Control whether to merge the layout detection boxes for cross-column or staggered top and bottom columns.</td> <td><code>bool|None</code></td> <td></td> </tr> <tr> <td><code>markdown_ignore_labels</code></td> <td><b>Meaning:</b>Layout labels that need to be ignored in Markdown.</td> <td><code>list|None</code></td> <td></td> </tr> <tr> <td><code>vlm_extra_args</code></td> <td><b>Meaning:</b>Additional configuration parameters for the VLM. The currently supported custom parameters are as follows: <ul> <li><code>ocr_min_pixels</code>: Minimum resolution for OCR</li> <li><code>ocr_max_pixels</code>: Maximum resolution for OCR</li> <li><code>table_min_pixels</code>: Minimum resolution for tables</li> <li><code>table_max_pixels</code>: Maximum resolution for tables</li> <li><code>chart_min_pixels</code>: Minimum resolution for charts</li> <li><code>chart_max_pixels</code>: Maximum resolution for charts</li> <li><code>formula_min_pixels</code>: Minimum resolution for formulas</li> <li><code>formula_max_pixels</code>: Maximum resolution for formulas</li> <li><code>seal_min_pixels</code>: Minimum resolution for seals</li> <li><code>seal_max_pixels</code>: Maximum resolution for seals</li> </ul></td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> </table> </details> <details><summary>(3) Invoke the <code>restructure_pages()</code> method of the PaddleOCR-VL object to reconstruct pages from the multi-page results list of inference predictions. This method will return a reconstructed multi-page result or a merged single-page result. 
Below are the parameters of the <code>restructure_pages()</code> method and their descriptions:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Description</th> <th>Type</th> <th>Default Value</th> </tr> </thead> <tbody> <tr> <td><code>res_list</code></td> <td><b>Meaning:</b> The list of results predicted from a multi-page PDF inference.</td> <td><code>list|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>merge_tables</code></td> <td><b>Meaning:</b> Controls whether to merge tables across pages.</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>relevel_titles</code></td> <td><b>Meaning:</b> Controls whether to re-level multi-level titles across pages.</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>concatenate_pages</code></td> <td><b>Meaning:</b> Controls whether to concatenate multi-page results into one page.</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> </tbody> </table> </details> <details><summary>(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary> <table> <thead> <tr> <th>Method</th> <th>Method Description</th> <th>Parameter</th> <th>Parameter Type</th> <th>Parameter Description</th> <th>Default Value</th> </tr> </thead> <tr> <td rowspan="3"> <code>print()</code></td> <td rowspan="3">Print results to the terminal</td> <td><code>format_json</code></td> <td><code>bool</code></td> <td>Whether to format the output content using <code>JSON</code> indentation.</td> <td><code>True</code></td> </tr> <tr> <td><code>indent</code></td> <td><code>int</code></td> <td>Specify the indentation level to beautify the output <code>JSON</code> data, making it more readable. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>4</code></td> </tr> <tr> <td><code>ensure_ascii</code></td> <td><code>bool</code></td> <td>Controls whether non-<code>ASCII</code> characters are escaped as <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> retains the original characters. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>False</code></td> </tr> <tr> <td rowspan="3"> <code>save_to_json()</code></td> <td rowspan="3">Save the results as a json format file</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming.</td> <td><code>None</code></td> </tr> <tr> <td><code>indent</code></td> <td><code>int</code></td> <td>Specify the indentation level to beautify the output <code>JSON</code> data, making it more readable. Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>4</code></td> </tr> <tr> <td><code>ensure_ascii</code></td> <td><code>bool</code></td> <td>Controls whether non-<code>ASCII</code> characters are escaped as <code>Unicode</code>. When set to <code>True</code>, all non-<code>ASCII</code> characters will be escaped; <code>False</code> retains the original characters.
Only valid when <code>format_json</code> is <code>True</code>.</td> <td><code>False</code></td> </tr> <tr> <td><code>save_to_img()</code></td> <td>Save the visualized images of each intermediate module in png format</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td rowspan="3"> <code>save_to_markdown()</code></td> <td rowspan="3">Save each page in an image or PDF file as a markdown format file separately</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving. When it is a directory, the saved file name will be consistent with the input file type naming</td> <td><code>None</code></td> </tr> <tr> <td><code>pretty</code></td> <td><code>bool</code></td> <td>Whether to beautify the <code>markdown</code> output results, centering charts, etc., to make the <code>markdown</code> rendering more aesthetically pleasing.</td> <td><code>True</code></td> </tr> <tr> <td><code>show_formula_number</code></td> <td><code>bool</code></td> <td>Control whether to retain formula numbers in <code>markdown</code>. When set to <code>True</code>, all formula numbers are retained; <code>False</code> retains only the formulas</td> <td><code>False</code></td> </tr> <tr> <td><code>save_to_html()</code></td> <td>Save the tables in the file as html format files</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td><code>save_to_xlsx()</code></td> <td>Save the tables in the file as xlsx format files</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> <tr> <td><code>save_to_word()</code></td> <td>Save the layout parsing results as a Word (.docx) format file</td> <td><code>save_path</code></td> <td><code>str</code></td> <td>The file path for saving, supporting directory or file paths.</td> <td><code>None</code></td> </tr> </tr> </table>
  • Calling the print() method will print the results to the terminal. The content printed to the terminal is explained as follows:

    • input_path: (str) The input path of the image or PDF to be predicted.

    • page_index: (Union[int, None]) If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is None.

    • page_count: (Union[int, None]) If the input is a PDF file, it indicates the total number of pages in the PDF; otherwise, it is None.

    • width: (int) The width of the original input image.

    • height: (int) The height of the original input image.

    • model_settings: (Dict[str, bool]) Model parameters required for configuring PaddleOCR-VL.

      • use_doc_preprocessor: (bool) Controls whether to enable the document preprocessing sub-pipeline.
      • use_layout_detection: (bool) Controls whether to enable the layout detection module.
      • use_chart_recognition: (bool) Controls whether to enable the chart recognition function.
      • format_block_content: (bool) Controls whether to save the formatted markdown content in JSON.
      • markdown_ignore_labels: (List[str]) Labels of layout regions that need to be ignored in Markdown
    • doc_preprocessor_res: (Dict[str, Union[List[float], str]]) A dictionary of document preprocessing results, which exists only when use_doc_preprocessor=True.

      • input_path: (str) The image path accepted by the document preprocessing sub-pipeline. When the input is a numpy.ndarray, it is saved as None; here, it is None.
      • page_index: None. Since the input here is a numpy.ndarray, the value is None.
      • model_settings: (Dict[str, bool]) Model configuration parameters for the document preprocessing sub-pipeline.
        • use_doc_orientation_classify: (bool) Controls whether to enable the document image orientation classification sub-module.
        • use_doc_unwarping: (bool) Controls whether to enable the text image distortion correction sub-module.
      • angle: (int) The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    • parsing_res_list: (List[Dict]) A list of parsing results, where each element is a dictionary. The list order is the reading order after parsing.

      • block_bbox: (np.ndarray) The bounding box of the layout area.
      • block_label: (str) The label of the layout area, such as text, table, etc.
      • block_content: (str) The content within the layout area.
      • block_id: (int) The index of the layout area, used to display the layout sorting results.
      • block_order (int) The order of the layout area, used to display the layout reading order. For non-sorted parts, the default value is None.
  • Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the saved path will be save_path/{your_img_basename}_res.json. If a file is specified, it will be saved directly to that file. Since json files do not support saving numpy arrays, the numpy.array types within will be converted to list form.

    • input_path: (str) The input path of the image or PDF to be predicted.

    • page_index: (Union[int, None]) If the input is a PDF file, it indicates the current page number of the PDF; otherwise, it is None.

    • model_settings: (Dict[str, bool]) Model parameters required for configuring PaddleOCR-VL.

      • use_doc_preprocessor: (bool) Controls whether to enable the document preprocessing sub-pipeline.
      • use_layout_detection: (bool) Controls whether to enable the layout detection module.
      • use_chart_recognition: (bool) Controls whether to enable the chart recognition function.
      • format_block_content: (bool) Controls whether to save the formatted markdown content in JSON.
    • doc_preprocessor_res: (Dict[str, Union[List[float], str]]) A dictionary of document preprocessing results, which exists only when use_doc_preprocessor=True.

      • input_path: (str) The image path accepted by the document preprocessing sub-pipeline. When the input is a numpy.ndarray, it is saved as None; here, it is None.
      • page_index: None. Since the input here is a numpy.ndarray, the value is None.
      • model_settings: (Dict[str, bool]) Model configuration parameters for the document preprocessing sub-pipeline.
        • use_doc_orientation_classify: (bool) Controls whether to enable the document image orientation classification sub-module.
        • use_doc_unwarping: (bool) Controls whether to enable the text image distortion correction sub-module.
      • angle: (int) The prediction result of the document image orientation classification sub-module. When enabled, it returns the actual angle value.
    • parsing_res_list: (List[Dict]) A list of parsing results, where each element is a dictionary. The list order represents the reading order after parsing.

      • block_bbox: (np.ndarray) The bounding box of the layout region.
      • block_label: (str) The label of the layout region, such as text, table, etc.
      • block_content: (str) The content within the layout region.
      • block_id: (int) The index of the layout region, used to display the layout sorting results.
      • block_order (int) The order of the layout region, used to display the layout reading order. For non-sorted parts, the default value is None.
  • Calling the save_to_img() method will save the visualization results to the specified save_path. If a directory is specified, visualized images for layout region detection, global OCR, layout reading order, etc., will be saved. If a file is specified, it will be saved directly to that file. (Pipelines typically contain many result images, so it is not recommended to directly specify a specific file path, as multiple images will be overwritten, retaining only the last one.)

  • Calling the save_to_markdown() method will save the converted Markdown file to the specified save_path. The saved file path will be save_path/{your_img_basename}.md. If the input is a PDF file, it is recommended to directly specify a directory; otherwise, multiple markdown files will be overwritten.

<li>Additionally, it also supports obtaining visualized images and prediction results through attributes, as follows: <table> <thead> <tr> <th>Attribute</th> <th>Attribute Description</th> </tr> </thead> <tbody> <tr> <td><code>json</code></td> <td>Obtain the prediction result in <code>json</code> format</td> </tr> <tr> <td><code>img</code></td> <td>Obtain visualized images in <code>dict</code> format</td> </tr> <tr> <td><code>markdown</code></td> <td>Obtain markdown results in <code>dict</code> format</td> </tr> </tbody> </table> <ul> <li>The prediction result obtained through the <code>json</code> attribute is data of dict type, with relevant content consistent with that saved by calling the <code>save_to_json()</code> method.</li> <li>The prediction result returned by the <code>img</code> attribute is data of dict type. The keys are <code>layout_det_res</code>, <code>overall_ocr_res</code>, <code>text_paragraphs_ocr_res</code>, <code>formula_res_region1</code>, <code>table_cell_img</code>, and <code>seal_res_region1</code>, with corresponding values being <code>Image.Image</code> objects: used to display visualized images of layout region detection, OCR, OCR text paragraphs, formulas, tables, and seal results, respectively. If optional modules are not used, the dict only contains <code>layout_det_res</code>.</li> <li>The prediction result returned by the <code>markdown</code> attribute is data of dict type. The keys are <code>markdown_texts</code>, <code>markdown_images</code>, and <code>page_continuation_flags</code>, with corresponding values being markdown text, images displayed in Markdown (<code>Image.Image</code> objects), and a bool tuple used to identify whether the first element on the current page is the start of a paragraph and whether the last element is the end of a paragraph, respectively.</li> </ul> </li> </details>
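
To tie the steps above together, here is a minimal, illustrative sketch of the workflow described in this section. It assumes a local multi-page PDF named `demo.pdf` (a placeholder file name), default model settings, and that `PaddleOCRVL` is importable from the top-level `paddleocr` package; the method and parameter names follow the tables above, but treat this as a sketch rather than a canonical example.

```python
from paddleocr import PaddleOCRVL

# Instantiate the pipeline with default settings (see the instantiation parameters above).
pipeline = PaddleOCRVL()

# predict_iter() returns a generator, which keeps memory usage low for long PDFs;
# predict() would return the same results as a list.
results = list(pipeline.predict_iter("demo.pdf"))

for res in results:
    res.print()                                  # print the structured result to the terminal
    res.save_to_json(save_path="output/")        # one JSON file per page
    res.save_to_markdown(save_path="output/")    # one Markdown file per page

# Optionally restructure the per-page results, e.g. merging tables that span pages.
restructured = pipeline.restructure_pages(res_list=results, merge_tables=True)
```

Because `predict_iter()` is a generator, each page can also be processed inside the loop without materializing the full list first; the list is only built here so that `restructure_pages()` can see all pages at once.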

<a id="3-vlm"></a>

3. Improving Inference Performance with VLM Inference Services

Using only PaddlePaddle or Transformers usually does not provide optimal inference performance. This section mainly introduces how to improve PaddleOCR-VL inference performance through VLM inference services. You can either deploy your own VLM inference service based on backends such as vLLM, SGLang, FastDeploy, MLX-VLM, and llama.cpp, or directly use compatible managed services. This section corresponds to combinations of "Layout Detection Inference Method + VLM Inference Service". Its core idea is that the client continues to handle the other stages in the full workflow, such as layout detection, while only the VLM stage is delegated to a dedicated service.

3.1 Launching the VLM Inference Service

IMPORTANT: The services launched according to this section are responsible only for the VLM inference stage in the PaddleOCR-VL workflow and do not provide a complete end-to-end document parsing API. It is strongly discouraged to directly call such services through plain HTTP requests or OpenAI clients to process document images. If you need to deploy a service with the full PaddleOCR-VL capability, please refer to the service deployment section later in this document.

There are three methods to launch the VLM inference service; choose any one of them:

  • Method 1: Launch the service using the official Docker image. Currently supported:

    • FastDeploy
    • vLLM
  • Method 2: Launch the service by manually installing dependencies via the PaddleOCR CLI. Currently supported:

    • FastDeploy
    • vLLM
    • SGLang
  • Method 3: Launch the service directly using an inference acceleration framework (the pre-configured performance tuning parameters provided by PaddleOCR will not be applied). Currently supported:

    • FastDeploy
    • vLLM
    • MLX-VLM
    • llama.cpp

We strongly recommend using the Docker image to minimize potential environment-related issues.

In addition, cloud platforms such as SiliconFlow and Novita AI also provide managed services. If you choose to use such services, you can skip this subsection and directly read 3.2 Client Usage Methods.

3.1.1 Method 1: Using Docker Image

PaddleOCR provides Docker images for quickly launching vLLM or FastDeploy inference services. You can use the following commands to start the services (requires Docker version >= 19.03, a machine equipped with a GPU, and NVIDIA drivers supporting CUDA 12.6 or later):

=== "Launch vLLM Service"

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm
```

If you wish to start the service in an environment without internet access, replace `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu` (image size approximately 13 GB) in the above command with the offline version image `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu-offline` (image size approximately 15 GB).

=== "Launch FastDeploy Service"

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend fastdeploy
```

If you wish to start the service in an environment without internet access, replace `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu` (image size approximately 43 GB) in the above command with the offline version image `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-fastdeploy-server:latest-nvidia-gpu-offline` (image size approximately 45 GB).

When starting the vLLM or FastDeploy inference service, we provide a set of default parameter settings. If you need to adjust parameters such as GPU memory usage, you can configure additional parameters yourself. Please refer to 3.3.1 Server-Side Parameter Adjustment to create a configuration file, mount this file into the container, and specify it via the `--backend_config` option in the startup command. Taking vLLM as an example:

```shell
docker run \
    -it \
    --rm \
    --gpus all \
    --network host \
    -v "$(pwd)/vllm_config.yml":/tmp/vllm_config.yml \
    ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /tmp/vllm_config.yml
```

Here, `vllm_config.yml` refers to a configuration file on the host machine. The example assumes that you created this file in the current working directory (hence `$(pwd)`); if it is located elsewhere, replace `$(pwd)/vllm_config.yml` with the actual absolute path, since Docker's `-v` option treats a bare relative name as a named volume rather than a file to mount.

TIP: Images with the latest-xxx tag correspond to the latest version of PaddleOCR. If you want to use a specific version of the PaddleOCR image, you can replace latest in the tag with the desired version number: paddleocr<major>.<minor>. For example: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:paddleocr3.3-nvidia-gpu-offline

3.1.2 Method 2: Installation and Usage via PaddleOCR CLI

The PaddleOCR CLI has already resolved complex version compatibility issues. Instead of spending time studying framework documentation, you can install the necessary environment with a single command.

Since inference acceleration frameworks may conflict with packages already installed in the current environment, it is recommended to install them in a virtual environment:

```shell
# If a virtual environment is currently activated, deactivate it first using `deactivate`
# Create a virtual environment
python -m venv .venv_vlm
# Activate the environment
source .venv_vlm/bin/activate
```

vLLM and SGLang depend on FlashAttention, and installing FlashAttention may require CUDA compilation tools such as `nvcc`. If these tools are not available in your environment (for example, when using the `paddleocr-vl` image), you can obtain a prebuilt FlashAttention package (version 2.8.2 required) from this repository, install it first, and then proceed with subsequent commands. For example, in the `paddleocr-vl` image, run `python -m pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu128torch2.8-cp310-cp310-linux_x86_64.whl`. This step is not required for FastDeploy.

Install PaddleOCR and the dependencies of inference acceleration services, using vLLM as an example:

```shell
# Install PaddleOCR
python -m pip install "paddleocr[doc-parser]"
# Install inference acceleration service dependencies
paddleocr install_genai_server_deps vllm
```

The usage of paddleocr install_genai_server_deps is:

```shell
paddleocr install_genai_server_deps <inference-acceleration-framework-name>
```

Currently supported framework names are vllm, sglang, and fastdeploy, corresponding to vLLM, SGLang, and FastDeploy, respectively.

Both vLLM and SGLang installed through paddleocr install_genai_server_deps are CUDA 12.6 versions. Please ensure that your local NVIDIA driver supports this version or a later one.

WARNING: The transformers library versions required by vLLM, SGLang, and the Transformers engine are currently incompatible, so the Transformers engine cannot be installed in the same environment as vLLM or SGLang. If using Transformers + vLLM or Transformers + SGLang inference, please deploy the layout detection model and the VLM service in different environments.

After installation, you can launch the service using the paddleocr genai_server command:

```shell
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --port 8118
```

The parameters supported by this command are as follows:

<table>
<thead>
<tr> <th>Parameter</th> <th>Description</th> </tr>
</thead>
<tbody>
<tr> <td><code>--model_name</code></td> <td>Model name</td> </tr>
<tr> <td><code>--model_dir</code></td> <td>Model directory</td> </tr>
<tr> <td><code>--host</code></td> <td>Server hostname</td> </tr>
<tr> <td><code>--port</code></td> <td>Server port number</td> </tr>
<tr> <td><code>--backend</code></td> <td>Backend name, i.e., the name of the inference acceleration framework used; options are <code>vllm</code>, <code>sglang</code>, or <code>fastdeploy</code></td> </tr>
<tr> <td><code>--backend_config</code></td> <td>Can specify a YAML file containing backend configurations</td> </tr>
</tbody>
</table>

3.1.3 Method 3: Launching the Service Directly Using Inference Acceleration Frameworks

If you need to install a custom version of an inference framework and launch the service natively, please refer to the following guidelines. Please note that when launching natively, the pre-configured performance tuning parameters provided by PaddleOCR will not be applied.

3.2 Client Usage Methods

After launching the VLM inference service, the client can call the service through PaddleOCR. This section applies both to self-hosted VLM inference services launched in 3.1 and to compatible managed services provided by third parties. Please note that because the client still needs to call the layout detection model and complete the other stages in the workflow, it is still recommended to run the client on GPU or other acceleration devices to achieve more stable and efficient performance. Please refer to Section 1 for the client-side environment configuration. The configuration described in Section 3.1 applies only to starting the service and is not applicable to the client. If you want the client to invoke the full PaddleOCR-VL capability only through an HTTP interface, please directly refer to Section 4, "Service Deployment".

3.2.1 CLI Invocation

Specify the backend type (vllm-server, sglang-server, fastdeploy-server, mlx-vlm-server or llama-cpp-server) using --vl_rec_backend and the service address using --vl_rec_server_url, for example:

```shell
paddleocr doc_parser --input paddleocr_vl_demo.png --vl_rec_backend vllm-server --vl_rec_server_url http://localhost:8118/v1
```

In addition, you can specify the model name used by the service via --vl_rec_api_model_name, and specify the API key used for authentication via --vl_rec_api_key. Examples are as follows:

Using a service started with the default parameters of vllm serve:

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url http://localhost:8000/v1 \
    --vl_rec_api_model_name 'PaddlePaddle/PaddleOCR-VL-1.5'
```

SiliconFlow platform:

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url https://api.siliconflow.cn/v1 \
    --vl_rec_api_model_name 'PaddlePaddle/PaddleOCR-VL-1.5' \
    --vl_rec_api_key xxxxxx
```

Novita AI platform (currently only PaddleOCR-VL-0.9B is supported, i.e., the v1 model):

```shell
paddleocr doc_parser \
    --input paddleocr_vl_demo.png \
    --pipeline_version v1 \
    --vl_rec_backend vllm-server \
    --vl_rec_server_url https://api.novita.ai/openai \
    --vl_rec_api_model_name 'paddlepaddle/paddleocr-vl' \
    --vl_rec_api_key xxxxxx
```

3.2.2 Python API Invocation

When creating a PaddleOCRVL object, specify the backend type (vllm-server, sglang-server, fastdeploy-server, mlx-vlm-server or llama-cpp-server) using vl_rec_backend and the service address using vl_rec_server_url, for example:

```python
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://localhost:8118/v1")
```

In addition, you can specify the model name used by the service via vl_rec_api_model_name, and specify the API key used for authentication via vl_rec_api_key.

Using a service started with the default parameters of vllm serve:

```python
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://localhost:8000/v1",
    vl_rec_api_model_name="PaddlePaddle/PaddleOCR-VL-1.5",
)
```

SiliconFlow platform:

```python
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="https://api.siliconflow.cn/v1",
    vl_rec_api_model_name="PaddlePaddle/PaddleOCR-VL-1.5",
    vl_rec_api_key="xxxxxx",
)
```

Novita AI platform (currently only PaddleOCR-VL-0.9B is supported, i.e., the v1 model):

```python
pipeline = PaddleOCRVL(
    pipeline_version="v1",
    vl_rec_backend="vllm-server",
    vl_rec_server_url="https://api.novita.ai/openai",
    vl_rec_api_model_name="paddlepaddle/paddleocr-vl",
    vl_rec_api_key="xxxxxx",
)
```

3.3 Performance Tuning

The default configurations cannot guarantee optimal performance in all environments. If you encounter performance issues in actual use, you can try the following optimization methods.

3.3.1 Server-Side Parameter Adjustment

Different inference acceleration frameworks support different parameters. Refer to their official documentation for the available parameters and for guidance on when to adjust them.

The PaddleOCR VLM inference service supports parameter tuning through configuration files. The following example shows how to adjust the gpu-memory-utilization and max-num-seqs parameters for the vLLM server:

  1. Create a YAML file `vllm_config.yaml` with the following content:

     ```yaml
     gpu-memory-utilization: 0.3
     max-num-seqs: 128
     ```

  2. Specify the configuration file path when starting the service, for example, using the `paddleocr genai_server` command:

     ```shell
     paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --backend_config vllm_config.yaml
     ```

If using a shell that supports process substitution (like Bash), you can also pass configuration items directly without creating a configuration file:

```bash
paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --backend vllm --backend_config <(echo -e 'gpu-memory-utilization: 0.3\nmax-num-seqs: 128')
```

3.3.2 Client-Side Parameter Adjustment

PaddleOCR groups sub-images from single or multiple input images and sends concurrent requests to the server, so the number of concurrent requests significantly impacts performance.

  • For CLI and Python API, adjust the maximum number of concurrent requests using the vl_rec_max_concurrency parameter;
  • For service deployment, modify the VLRecognition.genai_config.max_concurrency field in the configuration file.

When there is a 1:1 client-to-VLM inference service ratio and sufficient server resources, increasing concurrency can improve performance. If the server needs to support multiple clients or has limited computing resources, reduce concurrency to avoid resource overload and service abnormalities.
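
As a concrete illustration of the client-side tuning above, the following sketch raises the maximum number of concurrent VLM requests via `vl_rec_max_concurrency`. It assumes a self-hosted vLLM service reachable at `http://localhost:8118/v1` (as started in 3.1) and that `PaddleOCRVL` is importable from the top-level `paddleocr` package; the value of 8 is only an example, not a recommendation.

```python
from paddleocr import PaddleOCRVL

# The client still runs layout detection locally; only VLM inference goes to the service.
pipeline = PaddleOCRVL(
    vl_rec_backend="vllm-server",
    vl_rec_server_url="http://localhost:8118/v1",
    vl_rec_max_concurrency=8,  # raise only if the VLM server has spare capacity
)
results = pipeline.predict("paddleocr_vl_demo.png")
```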

3.3.3 Common Hardware Performance Tuning Recommendations

The following configurations are for scenarios with a 1:1 client-to-VLM inference service ratio.

NVIDIA RTX 3060

  • Server-Side
    • vLLM: gpu-memory-utilization: 0.7
    • FastDeploy:
      • gpu-memory-utilization: 0.7
      • max-concurrency: 2048

4. Service Deployment

This step mainly introduces how to deploy PaddleOCR-VL as a service and invoke it. If concurrent request processing is not required, choose either of the following two methods:

  • Method 1: Deploy using Docker Compose (recommended).

  • Method 2: Manual Deployment.

Both methods can handle only one request at a time. If you need concurrent request processing, please refer to the High-Performance Service Deployment solution.

Note that the PaddleOCR-VL service described in this section differs from the VLM inference service in the previous section: the latter is responsible for only one part of the complete process (i.e., VLM inference) and is called as an underlying service by the former.

4.1 Method 1: Deployment Using Docker Compose (Recommended)

You can obtain the Compose file and the environment variables configuration file from here and here, respectively, and download them to your local machine. Then, in the directory containing the downloaded files, execute the following command to start the server, which listens on port 8080 by default:

```shell
# Must be executed in the directory containing the compose.yaml and .env files
docker compose up
```

After startup, you will see output similar to the following:

```text
paddleocr-vl-api             | INFO:     Started server process [1]
paddleocr-vl-api             | INFO:     Waiting for application startup.
paddleocr-vl-api             | INFO:     Application startup complete.
paddleocr-vl-api             | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

This solution accelerates VLM inference based on frameworks like vLLM, making it more suitable for production environment deployment. However, it requires the machine to be equipped with a GPU and the NVIDIA driver to support CUDA 12.6 or higher.

Additionally, after starting the server using this method, no internet connection is required except for pulling the image. For offline environment deployment, you can first pull the images involved in the Compose file on an online machine, export and transfer them to the offline machine for import, and then start the service in the offline environment.

Docker Compose starts two containers in sequence by reading the configurations in the .env and compose.yaml files, running the underlying VLM inference service and the PaddleOCR-VL service (Pipeline) respectively.

The meanings of each environment variable contained in the .env file are as follows:

  • API_IMAGE_TAG_SUFFIX: The tag suffix of the image used to start the pipeline service. The default is latest-nvidia-gpu-offline, indicating the use of the latest offline GPU image. To use an image corresponding to a specific version of PaddleOCR, replace latest with the desired version paddleocr<major>.<minor>, for example paddleocr3.3-nvidia-gpu-offline.
  • VLM_BACKEND: The VLM inference backend, currently supporting vllm and fastdeploy. The default is vllm.
  • VLM_IMAGE_TAG_SUFFIX: The tag suffix of the image used to start the VLM inference service. The default is latest-nvidia-gpu-offline, indicating the use of the latest offline GPU image. If you want to use a non-offline version of the image, you can remove the -offline suffix. To use an image corresponding to a specific version of PaddleOCR, replace latest with the desired version paddleocr<major>.<minor>, for example paddleocr3.3-nvidia-gpu-offline.

You can meet custom requirements by modifying .env and compose.yaml, for example:

<details> <summary>1. Change the port of the PaddleOCR-VL service</summary>

Edit <code>paddleocr-vl-api.ports</code> in the <code>compose.yaml</code> file to change the port. For example, if you need to change the service port to 8111, make the following modifications:

```diff
  paddleocr-vl-api:
    ...
    ports:
-     - 8080:8080
+     - 8111:8080
    ...
```
</details> <details> <summary>2. Specify the GPU used by the PaddleOCR-VL service</summary>

Edit <code>device_ids</code> in the <code>compose.yaml</code> file to change the GPU used. For example, if you need to use GPU card 1 for deployment, make the following modifications:

```diff
  paddleocr-vl-api:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
  paddleocr-vlm-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
-             device_ids: ["0"]
+             device_ids: ["1"]
              capabilities: [gpu]
    ...
```
</details> <details> <summary>3. Adjust VLM server-side configuration</summary>

If you want to adjust the VLM server-side configuration, please refer to <a href="#331-server-side-parameter-adjustment">3.3.1 Server-side Parameter Adjustment</a> to generate a configuration file.

After generating the configuration file, add the following <code>paddleocr-vlm-server.volumes</code> and <code>paddleocr-vlm-server.command</code> fields to your <code>compose.yaml</code>. Please replace <code>/path/to/your_config.yaml</code> with your actual configuration file path.

```yaml
  paddleocr-vlm-server:
    ...
    volumes:
      - /path/to/your_config.yaml:/home/paddleocr/vlm_server_config.yaml
    command: paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config /home/paddleocr/vlm_server_config.yaml
    ...
```
</details> <details> <summary>4. Change the VLM inference backend</summary>

Modify <code>VLM_BACKEND</code> in the <code>.env</code> file, for example, to change the VLM inference backend to <code>fastdeploy</code>:

```diff
  API_IMAGE_TAG_SUFFIX=latest-nvidia-gpu-offline
- VLM_BACKEND=vllm
+ VLM_BACKEND=fastdeploy
  VLM_IMAGE_TAG_SUFFIX=latest-nvidia-gpu-offline
```
</details> <details> <summary>5. Adjust pipeline configurations (such as model path, batch size, deployment device, etc.)</summary>

Refer to section <a href="#44-pipeline-configuration-adjustment-instructions">4.4 Pipeline Configuration Adjustment Instructions</a> in this document.

</details>

4.2 Method 2: Manual Deployment

Execute the following command to install the service deployment plugin via the PaddleX CLI:

The paddlex command is installed together with paddleocr. Therefore, if you have already installed PaddleOCR in the previous steps, you usually do not need to install PaddleX separately.

```shell
paddlex --install serving
```

Then, start the server using the PaddleX CLI:

```shell
paddlex --serve --pipeline PaddleOCR-VL
```

To switch to the transformers engine for service deployment, use:

```shell
paddlex --serve --pipeline PaddleOCR-VL --engine transformers
```

After startup, you will see output similar to the following, with the server listening on port 8080 by default:

```text
INFO:     Started server process [63108]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
```

The command-line options related to serving are as follows:

<table> <thead> <tr> <th>Name</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>--pipeline</code></td> <td>PaddleX pipeline registration name or pipeline configuration file path.</td> </tr> <tr> <td><code>--device</code></td> <td>Deployment device for the pipeline. By default, a GPU will be used if available; otherwise, a CPU will be used.</td> </tr> <tr> <td><code>--host</code></td> <td>Hostname or IP address to which the server is bound. Defaults to <code>0.0.0.0</code>.</td> </tr> <tr> <td><code>--port</code></td> <td>Port number on which the server listens. Defaults to <code>8080</code>.</td> </tr> <tr> <td><code>--use_hpip</code></td> <td>If specified, uses high-performance inference. Refer to the High-Performance Inference documentation for more information.</td> </tr> <tr> <td><code>--hpi_config</code></td> <td>High-performance inference configuration. Refer to the High-Performance Inference documentation for more information.</td> </tr> </tbody> </table>

If you need to adjust pipeline configurations (such as model path, batch size, deployment device, etc.), you can specify the `--pipeline` parameter as a custom configuration file path. For the correspondence between PaddleOCR pipelines and PaddleX pipeline registration names, as well as how to obtain and modify PaddleX pipeline configuration files, please refer to PaddleOCR and PaddleX. Furthermore, section 4.4 will introduce how to adjust the pipeline configuration based on common requirements.

4.3 Client-Side Invocation

Below are the API reference and examples of multi-language service invocation:

<details><summary>API Reference</summary> <p>Main operations provided by the service:</p> <ul> <li>The HTTP request method is POST.</li> <li>Both the request body and response body are JSON data (JSON objects).</li> <li>When the request is processed successfully, the response status code is<code>200</code>, and the properties of the response body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>logId</code></td> <td><code>string</code></td> <td>The UUID of the request.</td> </tr> <tr> <td><code>errorCode</code></td> <td><code>integer</code></td> <td>Error code. Fixed as <code>0</code>.</td> </tr> <tr> <td><code>errorMsg</code></td> <td><code>string</code></td> <td>Error description. Fixed as <code>"Success"</code>.</td> </tr> <tr> <td><code>result</code></td> <td><code>object</code></td> <td>Operation result.</td> </tr> </tbody> </table> <ul> <li>When the request is not processed successfully, the properties of the response body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>logId</code></td> <td><code>string</code></td> <td>The UUID of the request.</td> </tr> <tr> <td><code>errorCode</code></td> <td><code>integer</code></td> <td>Error code. Same as the response status code.</td> </tr> <tr> <td><code>errorMsg</code></td> <td><code>string</code></td> <td>Error description.</td> </tr> </tbody> </table> <p>The main operations provided by the service are as follows:</p> <ul> <li><b><code>infer</code></b></li> </ul> <p>Perform layout parsing.</p> <p><code>POST /layout-parsing</code></p> <ul> <li>The properties of the request body are as follows:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> <th>Required</th> </tr> </thead> <tbody> <tr> <td><code>file</code></td> <td><code>string</code></td> <td>The URL of an image file or PDF file accessible to the server, or the Base64-encoded result of the content of the aforementioned file types. </td> <td>Yes</td> </tr> <tr> <td><code>fileType</code></td> <td><code>integer</code>|<code>null</code></td> <td>File type.<code>0</code> represents a PDF file,<code>1</code> represents an image file. 
If this property is not present in the request body, the file type will be inferred from the URL.</td> <td>No</td> </tr> <tr> <td><code>useDocOrientationClassify</code></td> <td><code>boolean</code> | <code>null</code></td> <td>Please refer to the description of the <code>use_doc_orientation_classify</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useDocUnwarping</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_doc_unwarping</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useLayoutDetection</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_layout_detection</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useChartRecognition</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_chart_recognition</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useSealRecognition</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_seal_recognition</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>useOcrForImageBlock</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>use_ocr_for_image_block</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutThreshold</code></td> <td><code>number</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_threshold</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutNms</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_nms</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutUnclipRatio</code></td> <td><code>number</code>|<code>array</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_unclip_ratio</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutMergeBboxesMode</code></td> <td><code>string</code>|<code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>layout_merge_bboxes_mode</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>layoutShapeMode</code></td> <td><code>string</code></td> <td>Please refer to the description of the <code>layout_shape_mode</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>promptLabel</code></td> <td><code>string</code>|<code>null</code></td> <td>Please refer to the description of the <code>prompt_label</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>formatBlockContent</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>format_block_content</code> parameter in the 
<code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>repetitionPenalty</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>repetition_penalty</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>temperature</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>temperature</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>topP</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>top_p</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>minPixels</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>min_pixels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>maxPixels</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>max_pixels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>maxNewTokens</code></td> <td><code>number</code>|<code>null</code></td> <td>Please refer to the description of the <code>max_new_tokens</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>mergeLayoutBlocks</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Please refer to the description of the <code>merge_layout_blocks</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>markdownIgnoreLabels</code></td> <td><code>array</code>|<code>null</code></td> <td>Please refer to the description of the <code>markdown_ignore_labels</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>vlmExtraArgs</code></td> <td><code>object</code>|<code>null</code></td> <td>Please refer to the description of the <code>vlm_extra_args</code> parameter in the <code>predict</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>prettifyMarkdown</code></td> <td><code>boolean</code></td> <td>Whether to output beautified Markdown text. The default is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>showFormulaNumber</code></td> <td><code>boolean</code></td> <td>Whether to include formula numbers in the output Markdown text. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>restructurePages</code></td> <td><code>boolean</code></td> <td>Whether to restructure results across multiple pages. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>mergeTables</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>merge_tables</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object. Valid only when <code>restructurePages</code> is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>relevelTitles</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>relevel_titles</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object. 
Valid only when <code>restructurePages</code> is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>outputFormats</code></td> <td><code>array</code> | <code>null</code></td> <td>Optional. List of extra document formats to return. By default, no extra formats are returned. Currently only <code>"docx"</code> is supported.</td> <td>No</td> </tr> <tr> <td><code>visualize</code></td> <td><code>boolean</code>|<code>null</code></td> <td>Whether to return visualization result images and intermediate images during the processing.<ul style="margin: 0 0 0 1em; padding-left: 0em;"> <li>Pass <code>true</code>: Return images.</li> <li>Pass <code>false</code>: Do not return images.</li> <li>If this parameter is not provided in the request body or <code>null</code> is passed: Follow the setting in the configuration file <code>Serving.visualize</code>.</li> </ul>

For example, add the following field in the configuration file:

<pre><code>Serving: visualize: False</code></pre>Images will not be returned by default, and the default behavior can be overridden by the <code>visualize</code> parameter in the request body. If this parameter is not set in either the request body or the configuration file (or <code>null</code> is passed in the request body and the configuration file is not set), images will be returned by default.</td> <td>No</td> </tr> </tbody> </table> <ul> <li>When the request is processed successfully, the <code>result</code> in the response body has the following attributes:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>layoutParsingResults</code></td> <td><code>array</code></td> <td>Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each actual page processed in the PDF file.</td> </tr> <tr> <td><code>dataInfo</code></td> <td><code>object</code></td> <td>Input data information.</td> </tr> </tbody> </table> <p>Each element in <code>layoutParsingResults</code> is an <code>object</code> with the following attributes:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>prunedResult</code></td> <td><code>object</code></td> <td>A simplified version of the <code>res</code> field in the JSON representation of the results generated by the <code>predict</code> method of the object, with the <code>input_path</code> and <code>page_index</code> fields removed.</td> </tr> <tr> <td><code>markdown</code></td> <td><code>object</code></td> <td>Markdown results.</td> </tr> <tr> <td><code>outputImages</code></td> <td><code>object</code>|<code>null</code></td> <td>Refer to the <code>img</code> property description of the prediction results. The image is in JPEG format and encoded using Base64.</td> </tr> <tr> <td><code>inputImage</code></td> <td><code>string</code>|<code>null</code></td> <td>Input image. The image is in JPEG format and encoded using Base64.</td> </tr> <tr> <td><code>exports</code></td> <td><code>object</code> | <code>null</code></td> <td>Optional additional exports. Present only when <code>outputFormats</code> is set. 
Example: <code>{"docx": {"content": "..."}}</code>, where <code>content</code> is the Base64-encoded file content.</td> </tr> </tbody> </table> <p><code>markdown</code> is an <code>object</code> with the following properties:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td><code>text</code></td> <td><code>string</code></td> <td>Markdown text.</td> </tr> <tr> <td><code>images</code></td> <td><code>object</code></td> <td>Key-value pairs of relative paths to Markdown images and Base64-encoded images.</td> </tr> </tbody> </table> <ul> <li><b><code>restructurePages</code></b></li> </ul> <p>Restructure results across multiple pages.</p> <p><code>POST /restructure-pages</code></p> <ul> <li>The request body has the following properties:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> <th>Required</th> </tr> </thead> <tbody> <tr> <td><code>pages</code></td> <td><code>array</code></td> <td>An array of pages.</td> <td>Yes</td> </tr> <tr> <td><code>mergeTables</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>merge_tables</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>relevelTitles</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>relevel_titles</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>concatenatePages</code></td> <td><code>boolean</code></td> <td>Please refer to the description of the <code>concatenate_pages</code> parameter in the <code>restructure_pages</code> method of the PaddleOCR-VL object.</td> <td>No</td> </tr> <tr> <td><code>prettifyMarkdown</code></td> <td><code>boolean</code></td> <td>Whether to output beautified Markdown text. The default is <code>true</code>.</td> <td>No</td> </tr> <tr> <td><code>showFormulaNumber</code></td> <td><code>boolean</code></td> <td>Whether to include formula numbers in the output Markdown text. The default is <code>false</code>.</td> <td>No</td> </tr> <tr> <td><code>outputFormats</code></td> <td><code>array</code> | <code>null</code></td> <td>Optional extra export formats; same meaning as <code>outputFormats</code> on <code>infer</code>. Only <code>"docx"</code> is supported.</td> <td>No</td> </tr> </tbody> </table> <p>Each element in <code>pages</code> is an <code>object</code> with the following properties:</p> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>prunedResult</code></td> <td><code>object</code></td> <td>The <code>prunedResult</code> object returned by the <code>infer</code> operation.</td> </tr> <tr> <td><code>markdownImages</code></td> <td><code>object</code>|<code>null</code></td> <td>The <code>images</code> property of the <code>markdown</code> object returned by the <code>infer</code> operation.</td> </tr> </tbody> </table> <ul> <li>When the request is processed successfully, the <code>result</code> field in the response body has the following properties:</li> </ul> <table> <thead> <tr> <th>Name</th> <th>Type</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td><code>layoutParsingResults</code></td> <td><code>array</code></td> <td>The restructured layout parsing results. 
For the fields that every element contains, please refer to the description of the result returned by the <code>infer</code> operation (excluding visualization result images and intermediate images).</td> </tr> </tbody> </table> </details> <details><summary>Multi-Language Service Invocation Examples</summary> <details> <summary>Python</summary> <pre><code class="language-python"> import base64 import requests import pathlib BASE_URL = "http://localhost:8080" image_path = "./demo.jpg" # Encode the local image in Base64 with open(image_path, "rb") as file: image_bytes = file.read() image_data = base64.b64encode(image_bytes).decode("ascii") payload = { "file": image_data, # Base64-encoded file content or file URL "fileType": 1, # File type, 1 indicates an image file } response = requests.post(BASE_URL + "/layout-parsing", json=payload) assert response.status_code == 200, (response.status_code, response.text) result = response.json()["result"] pages = [] for i, res in enumerate(result["layoutParsingResults"]): pages.append({"prunedResult": res["prunedResult"], "markdownImages": res["markdown"].get("images")}) for img_name, img in res["outputImages"].items(): img_path = f"{img_name}_{i}.jpg" pathlib.Path(img_path).parent.mkdir(exist_ok=True) with open(img_path, "wb") as f: f.write(base64.b64decode(img)) print(f"Output image saved at {img_path}") payload = { "pages": pages, "concatenatePages": True, } response = requests.post(BASE_URL + "/restructure-pages", json=payload) assert response.status_code == 200, (response.status_code, response.text) result = response.json()["result"] res = result["layoutParsingResults"][0] print(res["prunedResult"]) md_dir = pathlib.Path("markdown") md_dir.mkdir(exist_ok=True) (md_dir / "doc.md").write_text(res["markdown"]["text"]) for img_path, img in res["markdown"]["images"].items(): img_path = md_dir / img_path img_path.parent.mkdir(parents=True, exist_ok=True) img_path.write_bytes(base64.b64decode(img)) print(f"Markdown document saved at {md_dir / 'doc.md'}") </code></pre></details> <details><summary>C++</summary> <pre><code class="language-cpp">#include &lt;iostream&gt; #include &lt;filesystem&gt; #include &lt;fstream&gt; #include &lt;vector&gt; #include &lt;string&gt; #include "cpp-httplib/httplib.h" // https://github.com/Huiyicc/cpp-httplib #include "nlohmann/json.hpp" // https://github.com/nlohmann/json #include "base64.hpp" // https://github.com/tobiaslocker/base64 namespace fs = std::filesystem; int main() { httplib::Client client("localhost", 8080); const std::string filePath = "./demo.jpg"; std::ifstream file(filePath, std::ios::binary | std::ios::ate); if (!file) { std::cerr << "Error opening file: " << filePath << std::endl; return 1; } std::streamsize size = file.tellg(); file.seekg(0, std::ios::beg); std::vector<char> buffer(size); if (!file.read(buffer.data(), size)) { std::cerr << "Error reading file." << std::endl; return 1; } std::string bufferStr(buffer.data(), static_cast<size_t>(size)); std::string encodedFile = base64::to_base64(bufferStr); nlohmann::json jsonObj; jsonObj["file"] = encodedFile; jsonObj["fileType"] = 1; auto response = client.Post("/layout-parsing", jsonObj.dump(), "application/json"); if (response && response->status == 200) { nlohmann::json jsonResponse = nlohmann::json::parse(response->body); auto result = jsonResponse["result"]; if (!result.is_object() || !result.contains("layoutParsingResults")) { std::cerr << "Unexpected response format." 
<< std::endl; return 1; } const auto& results = result["layoutParsingResults"]; for (size_t i = 0; i < results.size(); ++i) { const auto& res = results[i]; if (res.contains("prunedResult")) { std::cout << "Layout result [" << i << "]: " << res["prunedResult"].dump() << std::endl; } if (res.contains("outputImages") && res["outputImages"].is_object()) { for (auto& [imgName, imgBase64] : res["outputImages"].items()) { std::string outputPath = imgName + "_" + std::to_string(i) + ".jpg"; fs::path pathObj(outputPath); fs::path parentDir = pathObj.parent_path(); if (!parentDir.empty() && !fs::exists(parentDir)) { fs::create_directories(parentDir); } std::string decodedImage = base64::from_base64(imgBase64.get<std::string>()); std::ofstream outFile(outputPath, std::ios::binary); if (outFile.is_open()) { outFile.write(decodedImage.c_str(), decodedImage.size()); outFile.close(); std::cout << "Saved image: " << outputPath << std::endl; } else { std::cerr << "Failed to save image: " << outputPath << std::endl; } } } } } else { std::cerr << "Request failed." << std::endl; if (response) { std::cerr << "HTTP status: " << response->status << std::endl; std::cerr << "Response body: " << response->body << std::endl; } return 1; } return 0; } </code></pre></details> <details><summary>Java</summary> <pre><code class="language-java">import okhttp3.*; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.JsonNode; import com.fasterxml.jackson.databind.node.ObjectNode; import java.io.File; import java.io.FileOutputStream; import java.io.IOException; import java.util.Base64; import java.nio.file.Paths; import java.nio.file.Files; public class Main { public static void main(String[] args) throws IOException { String API_URL = "http://localhost:8080/layout-parsing"; String imagePath = "./demo.jpg"; File file = new File(imagePath); byte[] fileContent = java.nio.file.Files.readAllBytes(file.toPath()); String base64Image = Base64.getEncoder().encodeToString(fileContent); ObjectMapper objectMapper = new ObjectMapper(); ObjectNode payload = objectMapper.createObjectNode(); payload.put("file", base64Image); payload.put("fileType", 1); OkHttpClient client = new OkHttpClient(); MediaType JSON = MediaType.get("application/json; charset=utf-8"); RequestBody body = RequestBody.create(JSON, payload.toString()); Request request = new Request.Builder() .url(API_URL) .post(body) .build(); try (Response response = client.newCall(request).execute()) { if (response.isSuccessful()) { String responseBody = response.body().string(); JsonNode root = objectMapper.readTree(responseBody); JsonNode result = root.get("result"); JsonNode layoutParsingResults = result.get("layoutParsingResults"); for (int i = 0; i < layoutParsingResults.size(); i++) { JsonNode item = layoutParsingResults.get(i); int finalI = i; JsonNode prunedResult = item.get("prunedResult"); System.out.println("Pruned Result [" + i + "]: " + prunedResult.toString()); JsonNode outputImages = item.get("outputImages"); outputImages.fieldNames().forEachRemaining(imgName -> { try { String imgBase64 = outputImages.get(imgName).asText(); byte[] imgBytes = Base64.getDecoder().decode(imgBase64); String imgPath = imgName + "_" + finalI + ".jpg"; File outputFile = new File(imgPath); File parentDir = outputFile.getParentFile(); if (parentDir != null && !parentDir.exists()) { parentDir.mkdirs(); System.out.println("Created directory: " + parentDir.getAbsolutePath()); } try (FileOutputStream fos = new FileOutputStream(outputFile)) { fos.write(imgBytes); 
System.out.println("Saved image: " + imgPath); } } catch (IOException e) { System.err.println("Failed to save image: " + e.getMessage()); } }); } } else { System.err.println("Request failed with HTTP code: " + response.code()); } } } } </code></pre></details> <details><summary>Go</summary> <pre><code class="language-go">package main import ( "bytes" "encoding/base64" "encoding/json" "fmt" "io/ioutil" "net/http" "os" "path/filepath" ) func main() { API_URL := "http://localhost:8080/layout-parsing" filePath := "./demo.jpg" fileBytes, err := ioutil.ReadFile(filePath) if err != nil { fmt.Printf("Error reading file: %v\n", err) return } fileData := base64.StdEncoding.EncodeToString(fileBytes) payload := map[string]interface{}{ "file": fileData, "fileType": 1, } payloadBytes, err := json.Marshal(payload) if err != nil { fmt.Printf("Error marshaling payload: %v\n", err) return } client := &http.Client{} req, err := http.NewRequest("POST", API_URL, bytes.NewBuffer(payloadBytes)) if err != nil { fmt.Printf("Error creating request: %v\n", err) return } req.Header.Set("Content-Type", "application/json") res, err := client.Do(req) if err != nil { fmt.Printf("Error sending request: %v\n", err) return } defer res.Body.Close() if res.StatusCode != http.StatusOK { fmt.Printf("Unexpected status code: %d\n", res.StatusCode) return } body, err := ioutil.ReadAll(res.Body) if err != nil { fmt.Printf("Error reading response: %v\n", err) return } type Markdown struct { Text string `json:"text"` Images map[string]string `json:"images"` } type LayoutResult struct { PrunedResult map[string]interface{} `json:"prunedResult"` Markdown Markdown `json:"markdown"` OutputImages map[string]string `json:"outputImages"` InputImage *string `json:"inputImage"` } type Response struct { Result struct { LayoutParsingResults []LayoutResult `json:"layoutParsingResults"` DataInfo interface{} `json:"dataInfo"` } `json:"result"` } var respData Response if err := json.Unmarshal(body, &respData); err != nil { fmt.Printf("Error parsing response: %v\n", err) return } for i, res := range respData.Result.LayoutParsingResults { fmt.Printf("Result %d - prunedResult: %+v\n", i, res.PrunedResult) mdDir := fmt.Sprintf("markdown_%d", i) os.MkdirAll(mdDir, 0755) mdFile := filepath.Join(mdDir, "doc.md") if err := os.WriteFile(mdFile, []byte(res.Markdown.Text), 0644); err != nil { fmt.Printf("Error writing markdown file: %v\n", err) } else { fmt.Printf("Markdown document saved at %s\n", mdFile) } for path, imgBase64 := range res.Markdown.Images { fullPath := filepath.Join(mdDir, path) if err := os.MkdirAll(filepath.Dir(fullPath), 0755); err != nil { fmt.Printf("Error creating directory for markdown image: %v\n", err) continue } imgBytes, err := base64.StdEncoding.DecodeString(imgBase64) if err != nil { fmt.Printf("Error decoding markdown image: %v\n", err) continue } if err := os.WriteFile(fullPath, imgBytes, 0644); err != nil { fmt.Printf("Error saving markdown image: %v\n", err) } } for name, imgBase64 := range res.OutputImages { imgBytes, err := base64.StdEncoding.DecodeString(imgBase64) if err != nil { fmt.Printf("Error decoding output image %s: %v\n", name, err) continue } filename := fmt.Sprintf("%s_%d.jpg", name, i) if err := os.MkdirAll(filepath.Dir(filename), 0755); err != nil { fmt.Printf("Error creating directory for output image: %v\n", err) continue } if err := os.WriteFile(filename, imgBytes, 0644); err != nil { fmt.Printf("Error saving output image %s: %v\n", filename, err) } else { fmt.Printf("Output image saved at %s\n", filename) } 
} } } </code></pre></details> <details><summary>C#</summary> <pre><code class="language-csharp">using System; using System.IO; using System.Net.Http; using System.Text; using System.Threading.Tasks; using Newtonsoft.Json.Linq; class Program { static readonly string API_URL = "http://localhost:8080/layout-parsing"; static readonly string inputFilePath = "./demo.jpg"; static async Task Main(string[] args) { var httpClient = new HttpClient(); byte[] fileBytes = File.ReadAllBytes(inputFilePath); string fileData = Convert.ToBase64String(fileBytes); var payload = new JObject { { "file", fileData }, { "fileType", 1 } }; var content = new StringContent(payload.ToString(), Encoding.UTF8, "application/json"); HttpResponseMessage response = await httpClient.PostAsync(API_URL, content); response.EnsureSuccessStatusCode(); string responseBody = await response.Content.ReadAsStringAsync(); JObject jsonResponse = JObject.Parse(responseBody); JArray layoutParsingResults = (JArray)jsonResponse["result"]["layoutParsingResults"]; for (int i = 0; i < layoutParsingResults.Count; i++) { var res = layoutParsingResults[i]; Console.WriteLine($"[{i}] prunedResult:\n{res["prunedResult"]}"); JObject outputImages = res["outputImages"] as JObject; if (outputImages != null) { foreach (var img in outputImages) { string imgName = img.Key; string base64Img = img.Value?.ToString(); if (!string.IsNullOrEmpty(base64Img)) { string imgPath = $"{imgName}_{i}.jpg"; byte[] imageBytes = Convert.FromBase64String(base64Img); string directory = Path.GetDirectoryName(imgPath); if (!string.IsNullOrEmpty(directory) && !Directory.Exists(directory)) { Directory.CreateDirectory(directory); Console.WriteLine($"Created directory: {directory}"); } File.WriteAllBytes(imgPath, imageBytes); Console.WriteLine($"Output image saved at {imgPath}"); } } } } } } </code></pre></details> <details><summary>Node.js</summary> <pre><code class="language-js">const axios = require('axios'); const fs = require('fs'); const path = require('path'); const API_URL = 'http://localhost:8080/layout-parsing'; const imagePath = './demo.jpg'; const fileType = 1; function encodeImageToBase64(filePath) { const bitmap = fs.readFileSync(filePath); return Buffer.from(bitmap).toString('base64'); } const payload = { file: encodeImageToBase64(imagePath), fileType: fileType }; axios.post(API_URL, payload) .then(response => { const results = response.data.result.layoutParsingResults; results.forEach((res, index) => { console.log(`\n[${index}] prunedResult:`); console.log(res.prunedResult); const outputImages = res.outputImages; if (outputImages) { Object.entries(outputImages).forEach(([imgName, base64Img]) => { const imgPath = `${imgName}_${index}.jpg`; const directory = path.dirname(imgPath); if (!fs.existsSync(directory)) { fs.mkdirSync(directory, { recursive: true }); console.log(`Created directory: ${directory}`); } fs.writeFileSync(imgPath, Buffer.from(base64Img, 'base64')); console.log(`Output image saved at ${imgPath}`); }); } else { console.log(`[${index}] No outputImages.`); } }); }) .catch(error => { console.error('Error during API request:', error.message || error); }); </code></pre></details> <details><summary>PHP</summary> <pre><code class="language-php">&lt;?php $API_URL = "http://localhost:8080/layout-parsing"; $image_path = "./demo.jpg"; $image_data = base64_encode(file_get_contents($image_path)); $payload = array("file" => $image_data, "fileType" => 1); $ch = curl_init($API_URL); curl_setopt($ch, CURLOPT_POST, true); curl_setopt($ch, CURLOPT_POSTFIELDS, 
json_encode($payload)); curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json')); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); $response = curl_exec($ch); curl_close($ch); $result = json_decode($response, true)["result"]["layoutParsingResults"]; foreach ($result as $i => $item) { echo "[$i] prunedResult:\n"; print_r($item["prunedResult"]); if (!empty($item["outputImages"])) { foreach ($item["outputImages"] as $img_name => $img_base64) { $output_image_path = "{$img_name}_{$i}.jpg"; $directory = dirname($output_image_path); if (!is_dir($directory)) { mkdir($directory, 0777, true); echo "Created directory: $directory\n"; } file_put_contents($output_image_path, base64_decode($img_base64)); echo "Output image saved at $output_image_path\n"; } } else { echo "No outputImages found for item $i\n"; } } ?&gt; </code></pre></details> </details>

4.4 Pipeline Configuration Adjustment Instructions

NOTE: If you do not need to adjust pipeline configurations, you can ignore this section.

Adjusting the PaddleOCR-VL configuration for service deployment involves only three steps:

  1. Obtain the configuration file
  2. Modify the configuration file
  3. Apply the configuration file

4.4.1 Obtain the Configuration File

If you are deploying using Docker Compose:

Download the corresponding pipeline configuration file based on the backend you are using:

If you are deploying by manually installing dependencies:

Execute the following command to generate the pipeline configuration file:

```shell
paddlex --get_pipeline_config PaddleOCR-VL
```

4.4.2 Modify the Configuration File

Enhance VLM Inference Performance Using Acceleration Frameworks

To improve VLM inference performance using acceleration frameworks such as vLLM (refer to Section 2 for detailed instructions on starting the VLM inference service), modify the VLRecognition.genai_config.backend and VLRecognition.genai_config.server_url fields in the pipeline configuration file, as shown below:

```yaml
VLRecognition:
  ...
  genai_config:
    backend: vllm-server
    server_url: http://localhost:8118/v1
```

The Docker Compose solution already uses an acceleration framework by default.
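Before switching the pipeline over, it can be useful to confirm that the acceleration service is actually reachable at the configured server_url. The following is a minimal sketch, assuming a vLLM server that exposes the standard OpenAI-compatible interface on port 8118 as in the example above:

```python
# Reachability check for the VLM inference service (assumption: the server
# exposes the OpenAI-compatible /v1/models endpoint, as vLLM's server does,
# and listens on the port configured in server_url).
import requests

resp = requests.get("http://localhost:8118/v1/models", timeout=5)
resp.raise_for_status()
print(resp.json())  # lists the model(s) served by the backend
```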

Enable Document Image Preprocessing Functionality

The service started with default configurations does not support document preprocessing. If a client attempts to invoke this functionality, an error message will be returned. To enable document preprocessing, set use_doc_preprocessor to True in the pipeline configuration file and start the service using the modified configuration file.
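Once the service has been restarted with the modified configuration, a client can turn the preprocessing sub-features on per request through the documented useDocOrientationClassify and useDocUnwarping request fields. The following is a minimal sketch, assuming the service is listening on localhost:8080 as in the earlier invocation examples:

```python
import base64
import requests

# Encode the input image, then explicitly request document orientation
# classification and unwarping (both fields appear in the request table above).
with open("./demo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("ascii")

payload = {
    "file": image_data,
    "fileType": 1,
    "useDocOrientationClassify": True,
    "useDocUnwarping": True,
}
response = requests.post("http://localhost:8080/layout-parsing", json=payload)
response.raise_for_status()
print(response.json()["result"]["layoutParsingResults"][0]["prunedResult"])
```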

Disable Result Visualization Functionality

The service returns visualized results by default, which introduces additional overhead. To disable this functionality, add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  visualize: False
```

Additionally, you can set the visualize field to false in the request body to disable visualization for a single request.
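For example, a single request can opt out of visualization while the service-wide default stays unchanged. A minimal sketch, again assuming the service from the earlier examples:

```python
import base64
import requests

with open("./demo.jpg", "rb") as f:
    payload = {
        "file": base64.b64encode(f.read()).decode("ascii"),
        "fileType": 1,
        "visualize": False,  # this request returns no visualization images
    }

response = requests.post("http://localhost:8080/layout-parsing", json=payload)
response.raise_for_status()
# With visualization disabled for this request, outputImages should be null.
print(response.json()["result"]["layoutParsingResults"][0]["outputImages"])
```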

Configure Return of Image URLs

For visualized result images and images included in Markdown, the service returns them in Base64 encoding by default. To return images as URLs instead, add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  extra:
    file_storage:
      type: bos
      endpoint: https://bj.bcebos.com
      bucket_name: some-bucket
      ak: xxx
      sk: xxx
      key_prefix: deploy
    return_img_urls: True
    url_expires_in: 3600
```

Currently, storing generated images in Baidu Intelligent Cloud Object Storage (BOS) and returning URLs is supported. The parameters are described as follows:

  • endpoint: Access domain name (required).
  • ak: Baidu Intelligent Cloud Access Key (required).
  • sk: Baidu Intelligent Cloud Secret Key (required).
  • bucket_name: Storage bucket name (required).
  • key_prefix: Unified prefix for object keys.
  • connection_timeout_in_mills: Request timeout in milliseconds.
  • url_expires_in: URL validity period (in seconds). -1 indicates it never expires.

For more information on obtaining AK/SK and other details, refer to the Baidu Intelligent Cloud Official Documentation.

Limit the Number of PDF Pages Parsed

By default, the service processes the entire PDF file. In production environments, a PDF with too many pages may affect system stability, leading to processing timeouts or excessive resource usage. To ensure stable service operation, it is recommended to set a reasonable page limit based on actual needs. You can add the following configuration to the pipeline configuration file (Serving is a top-level field):

```yaml
Serving:
  extra:
    max_num_input_imgs: <page limit, e.g., 100>
```

When max_num_input_imgs is set to null, there will be no limit on the number of PDF pages.

4.4.3 Apply the Configuration File

If you are deploying using Docker Compose:

Set the services.paddleocr-vl-api.volumes field in the Compose file to mount the pipeline configuration file to the /home/paddleocr directory. For example:

```yaml
services:
  paddleocr-vl-api:
    ...
    volumes:
      # Use a relative path starting with ./ so Compose treats this as a bind mount.
      - ./pipeline_config_vllm.yaml:/home/paddleocr/pipeline_config_vllm.yaml
...
```

In a production environment, you can also build the image yourself and package the configuration file into the image.
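For instance, the configuration file can be copied into a derived image instead of being mounted at run time. The following is a minimal sketch; the base image name is a placeholder and should be replaced with the serving image referenced by your Compose file:

```dockerfile
# Hypothetical base image name; replace it with the PaddleOCR-VL serving image you use.
FROM <paddleocr-vl-serving-image>

# Package the custom pipeline configuration at the same path used by the volume mount above.
COPY pipeline_config_vllm.yaml /home/paddleocr/pipeline_config_vllm.yaml
```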

If you are deploying by manually installing dependencies:

When starting the service, specify the --pipeline parameter as the path to your custom configuration file.
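For example (a minimal sketch, assuming the service is launched with the paddlex --serve command covered earlier in this tutorial, and that the modified configuration was saved as ./PaddleOCR-VL.yaml):

```shell
# Hypothetical file name; point --pipeline at wherever you saved the modified configuration.
paddlex --serve --pipeline ./PaddleOCR-VL.yaml
```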

5. Model Fine-Tuning

If you find that PaddleOCR-VL does not meet accuracy expectations in specific business scenarios, we recommend using the ERNIEKit suite to perform supervised fine-tuning (SFT) on the VLM (e.g. PaddleOCR-VL-0.9B). For detailed instructions, refer to the ERNIEKit Official Documentation.

Currently, fine-tuning of layout detection and ranking models is not supported.