PP-ChatOCRv4-doc Pipeline Usage Tutorial

1. Introduction to PP-ChatOCRv4-doc Pipeline

PP-ChatOCRv4-doc is PaddlePaddle's intelligent analysis solution for documents and images. It combines LLM, MLLM, and OCR technologies to handle difficult document information extraction problems such as layout analysis, rare characters, multi-page PDFs, tables, and seal text recognition. Integrated with ERNIE Bot, it draws on large-scale data and knowledge to achieve high accuracy across a wide range of applications. The pipeline also provides flexible service deployment options on a variety of hardware, and it supports custom development: you can train and fine-tune models on your own datasets and integrate the trained models into the pipeline.

The PP-ChatOCRv4 pipeline includes the following 9 modules. Each module can be trained and run for inference independently, and each offers multiple models. For details, click the respective module to view its documentation.

Document Image Orientation Classification Module (Optional)
Text Image Unwarping Module (Optional)
Layout Detection Module
Table Structure Recognition Module (Optional)
Text Detection Module
Text Recognition Module
Text Line Orientation Classification Module (Optional)
Formula Recognition Module (Optional)
Seal Text Detection Module (Optional)

In this pipeline, you can choose the model to use based on the benchmark data below.

The inference time covers model inference only; it excludes pre- and post-processing. In the inference time columns labeled [Normal Mode / High-Performance Mode] (some tables label the first value Standard Mode), the Normal Mode values correspond to the local Paddle inference engines. Each module selects the appropriate local Paddle inference engine according to its default model name: models that support only the dynamic graph use paddle_dynamic, while models that support both static and dynamic graphs prefer paddle_static.
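As a rough illustration of the engine-selection rule described above, the sketch below maps a model's supported graph modes to an engine name. The function `select_engine` and the mode strings `"static"` and `"dynamic"` are hypothetical and not part of the PaddleX API; only the engine names `paddle_static` and `paddle_dynamic` come from the text. The text does not say what happens for a static-only model, so the sketch assumes `paddle_static` in that case.

```python
def select_engine(supported_modes: set[str]) -> str:
    """Return the local Paddle engine name for a model (illustrative only)."""
    if "static" in supported_modes:
        # Covers "both static and dynamic" (stated in the text) and
        # static-only (an assumption; the text does not cover this case).
        return "paddle_static"
    if "dynamic" in supported_modes:
        # Dynamic-graph-only models fall back to the dynamic engine.
        return "paddle_dynamic"
    raise ValueError(f"no supported graph mode in {supported_modes!r}")


print(select_engine({"dynamic"}))            # dynamic-only model
print(select_engine({"static", "dynamic"}))  # model supporting both modes
```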

<details> <summary><b>Document Image Orientation Classification Module (Optional):</b></summary> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>Top-1 Acc (%)</th> <th>GPU Inference Time (ms) [Standard Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Standard Mode / High-Performance Mode]</th> <th>Model Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>PP-LCNet_x1_0_doc_ori</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-LCNet_x1_0_doc_ori_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-LCNet_x1_0_doc_ori_pretrained.pdparams">Training Model</a></td> <td>99.06</td> <td>2.62 / 0.59</td> <td>3.24 / 1.19</td> <td>7</td> <td>Document image classification model based on PP-LCNet_x1_0, with four categories: 0°, 90°, 180°, and 270°.</td> </tr> </tbody> </table> </details> <details> <summary><b>Text Image Unwarp Module (Optional):</b></summary> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>CER</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>UVDoc</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/UVDoc_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/UVDoc_pretrained.pdparams">Training Model</a></td> <td>0.179</td> <td>19.05 / 19.05</td> <td>- / 869.82</td> <td>30.3</td> <td>High-precision Text Image Unwarping model.</td> </tr> </tbody> </table> </details> <details> <summary><b>Layout Detection Module Model:</b></summary> * <b>The layout detection model includes 20 common categories: document title, paragraph title, text, page number, abstract, table, references, 
footnotes, header, footer, algorithm, formula, formula number, image, table, seal, figure and table title, chart, sidebar text, and lists of references</b> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>mAP(0.5) (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Introduction</th> </tr> </thead> <tbody> <tr> <td>PP-DocLayout_plus-L</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocLayout_plus-L_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-DocLayout_plus-L_pretrained.pdparams">Training Model</a></td> <td>83.2</td> <td>53.03 / 17.23</td> <td>634.62 / 378.32</td> <td>126.01</td> <td>A higher-precision layout area localization model trained on a self-built dataset containing Chinese and English papers, PPT, multi-layout magazines, contracts, books, exams, ancient books, and research reports using RT-DETR-L.</td> </tr> </tbody> </table>

<b>The layout detection model includes 1 category: Block:</b>

<b>The layout detection model includes 23 common categories: document title, paragraph title, text, page number, abstract, table of contents, references, footnotes, header, footer, algorithm, formula, formula number, image, figure caption, table, table caption, seal, figure title, figure, header image, footer image, and sidebar text</b>

<table> <thead> <tr> <th>Model</th><th>Download Link</th> <th>mAP(0.5) (%)</th> <th>GPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>CPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>PP-DocLayout-L</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocLayout-L_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-DocLayout-L_pretrained.pdparams">Pretrained Model</a></td> <td>90.4</td> <td>33.59 / 33.59</td> <td>503.01 / 251.08</td> <td>123.76</td> <td>A high-precision layout area localization model trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using RT-DETR-L.</td> </tr> <tr> <td>PP-DocLayout-M</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocLayout-M_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-DocLayout-M_pretrained.pdparams">Pretrained Model</a></td> <td>75.2</td> <td>13.03 / 4.72</td> <td>43.39 / 24.44</td> <td>22.578</td> <td>A layout area localization model with balanced precision and efficiency, trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-L.</td> </tr> <tr> <td>PP-DocLayout-S</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-DocLayout-S_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-DocLayout-S_pretrained.pdparams">Pretrained Model</a></td> <td>70.9</td> <td>11.54 / 3.86</td> <td>18.53 / 6.29</td> <td>4.834</td> <td>A high-efficiency layout area localization model 
trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-S.</td> </tr> </tbody> </table>
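One way to use these figures is to pick the most accurate model that fits a latency budget. The sketch below hard-codes the mAP(0.5) and Normal Mode CPU inference times of the three PP-DocLayout models from the table above. The `LayoutModel` class, the `pick_model` helper, and the 50 ms budget are illustrative assumptions rather than PaddleX functionality, and real end-to-end latency also depends on your hardware and on pre- and post-processing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayoutModel:
    name: str
    map_50: float         # mAP(0.5) in percent, from the table above
    cpu_normal_ms: float  # CPU inference time, Normal Mode, in ms


MODELS = [
    LayoutModel("PP-DocLayout-L", 90.4, 503.01),
    LayoutModel("PP-DocLayout-M", 75.2, 43.39),
    LayoutModel("PP-DocLayout-S", 70.9, 18.53),
]


def pick_model(budget_ms: float) -> Optional[LayoutModel]:
    """Return the highest-mAP model within the latency budget, or None."""
    candidates = [m for m in MODELS if m.cpu_normal_ms <= budget_ms]
    return max(candidates, key=lambda m: m.map_50, default=None)


choice = pick_model(budget_ms=50.0)  # hypothetical 50 ms CPU budget
print(choice.name if choice else "no model fits the budget")
```

With a 50 ms budget this selects PP-DocLayout-M; tightening the budget to 20 ms would leave only PP-DocLayout-S.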

❗ The tables above list the <b>4 core models</b> that the layout detection module primarily supports. The module supports a total of <b>12 models</b>, including several predefined models with different category sets. The complete model list is as follows:

<details><summary> 👉 Details of Model List</summary>

<b>Table Layout Detection Model</b>

<b>3-Class Layout Detection Model, including Table, Image, and Stamp</b>

<b>5-Class English Document Area Detection Model, including Text, Title, Table, Image, and List</b>

<b>17-Class Area Detection Model, including 17 common layout categories: Paragraph Title, Image, Text, Number, Abstract, Content, Figure Caption, Formula, Table, Table Caption, References, Document Title, Footnote, Header, Algorithm, Footer, and Stamp</b>

<table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>mAP(0.5) (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Introduction</th> </tr> </thead> <tbody> <tr> <td>PicoDet-S_layout_17cls</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PicoDet-S_layout_17cls_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PicoDet-S_layout_17cls_pretrained.pdparams">Training Model</a></td> <td>87.4</td> <td>8.80 / 3.62</td> <td>17.51 / 6.35</td> <td>4.8</td> <td>A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-S.</td> </tr> <tr> <td>PicoDet-L_layout_17cls</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PicoDet-L_layout_17cls_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PicoDet-L_layout_17cls_pretrained.pdparams">Training Model</a></td> <td>89.0</td> <td>12.60 / 10.27</td> <td>43.70 / 24.42</td> <td>22.6</td> <td>A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L.</td> </tr> <tr> <td>RT-DETR-H_layout_17cls</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/RT-DETR-H_layout_17cls_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/RT-DETR-H_layout_17cls_pretrained.pdparams">Training Model</a></td> <td>98.3</td> <td>115.29 / 101.18</td> <td>964.75 / 964.75</td> <td>470.2</td> <td>A high-precision layout area localization model trained on 
a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H.</td> </tr> </table> </details> </details> <details> <summary><b>Table Structure Recognition Module Models (Optional):</b></summary> <table> <tr> <th>Model</th><th>Model Download Link</th> <th>Accuracy (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>SLANet</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/SLANet_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/SLANet_pretrained.pdparams">Training Model</a></td> <td>59.52</td> <td>23.96 / 21.75</td> <td>- / 43.12</td> <td>6.9</td> <td>SLANet is a table structure recognition model developed by Baidu PaddleX Team. The model significantly improves the accuracy and inference speed of table structure recognition by adopting a CPU-friendly lightweight backbone network PP-LCNet, a high-low-level feature fusion module CSP-PAN, and a feature decoding module SLA Head that aligns structural and positional information.</td> </tr> <tr> <td>SLANet_plus</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/SLANet_plus_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/SLANet_plus_pretrained.pdparams">Training Model</a></td> <td>63.69</td> <td>23.43 / 22.16</td> <td>- / 41.80</td> <td>6.9</td> <td>SLANet_plus is an enhanced version of SLANet, the table structure recognition model developed by Baidu PaddleX Team. 
Compared to SLANet, SLANet_plus significantly improves the recognition ability for wireless and complex tables and reduces the model's sensitivity to the accuracy of table positioning, enabling more accurate recognition even with offset table positioning.</td> </tr> </table> </details> <details> <summary><b>Text Detection Module Models</b></summary> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>Detection Hmean (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>PP-OCRv5_server_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_server_det_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_server_det_pretrained.pdparams">Training Model</a></td> <td>83.8</td> <td>89.55 / 70.19</td> <td>383.15 / 383.15</td> <td>84.3</td> <td>PP-OCRv5 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers</td> </tr> <tr> <td>PP-OCRv5_mobile_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_mobile_det_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_mobile_det_pretrained.pdparams">Training Model</a></td> <td>79.0</td> <td>10.67 / 6.36</td> <td>57.77 / 28.15</td> <td>4.7</td> <td>PP-OCRv5 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices</td> </tr> <tr> <td>PP-OCRv4_server_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_det_infer.tar">Inference Model</a>/<a 
href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_det_pretrained.pdparams">Training Model</a></td> <td>69.2</td> <td>127.82 / 98.87</td> <td>585.95 / 489.77</td> <td>109</td> <td>PP-OCRv4 server-side text detection model with higher accuracy, suitable for deployment on high-performance servers</td> </tr> <tr> <td>PP-OCRv4_mobile_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_mobile_det_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_mobile_det_pretrained.pdparams">Training Model</a></td> <td>63.8</td> <td>9.87 / 4.17</td> <td>56.60 / 20.79</td> <td>4.7</td> <td>PP-OCRv4 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices</td> </tr> </tbody> </table> </details> <details> <summary><b>Text Recognition Module Models</b></summary> <table> <tr> <th>Model</th><th>Model Download Links</th> <th>Recognition Avg Accuracy(%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Introduction</th> </tr> <tr> <td>PP-OCRv5_server_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_server_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_server_rec_pretrained.pdparams">Pretrained Model</a></td> <td>86.38</td> <td>8.46 / 2.36</td> <td>31.21 / 31.21</td> <td>81</td> <td rowspan="2">PP-OCRv5_rec is a next-generation text recognition model. 
It aims to support, with a single model, efficient and accurate recognition of four major languages (Simplified Chinese, Traditional Chinese, English, and Japanese) as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters. While maintaining recognition performance, it balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.</td> </tr> <tr> <td>PP-OCRv5_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>81.29</td> <td>5.43 / 1.46</td> <td>21.20 / 5.32</td> <td>16</td> </tr> <tr> <td>PP-OCRv4_server_rec_doc</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_rec_doc_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_rec_doc_pretrained.pdparams">Pretrained Model</a></td> <td>86.58</td> <td>8.69 / 2.78</td> <td>37.93 / 37.93</td> <td>182</td> <td>PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data, building upon PP-OCRv4_server_rec. It enhances the recognition capabilities for some Traditional Chinese characters, Japanese characters, and special symbols, supporting over 15,000 characters. 
In addition to improving document-related text recognition, it also enhances general text recognition capabilities.</td> </tr> <tr> <td>PP-OCRv4_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>78.74</td> <td>5.26 / 1.12</td> <td>17.48 / 3.61</td> <td>10.5</td> <td>A lightweight recognition model of PP-OCRv4 with high inference efficiency, suitable for deployment on various hardware devices, including edge devices.</td> </tr> <tr> <td>PP-OCRv4_server_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_rec_pretrained.pdparams">Pretrained Model</a></td> <td>85.19</td> <td>8.75 / 2.49</td> <td>36.93 / 36.93</td> <td>173</td> <td>The server-side model of PP-OCRv4, offering high inference accuracy and deployable on various servers.</td> </tr> <tr> <td>en_PP-OCRv4_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/en_PP-OCRv4_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/en_PP-OCRv4_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>70.39</td> <td>4.81 / 1.23</td> <td>17.20 / 4.18</td> <td>7.5</td> <td>An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and numeric character recognition.</td> </tr> </table> * <b>PP-OCRv5 Multi-Scenario Models</b> <table> <tr> <th>Model</th><th>Download Link</th> <th>Chinese Avg Accuracy (%)</th> <th>English Avg Accuracy (%)</th> 
<th>Traditional Chinese Avg Accuracy (%)</th> <th>Japanese Avg Accuracy (%)</th> <th>GPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>CPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>PP-OCRv5_server_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_server_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_server_rec_pretrained.pdparams">Pretrained Model</a></td> <td>86.38</td> <td>64.70</td> <td>93.29</td> <td>60.35</td> <td>8.46 / 2.36</td> <td>31.21 / 31.21</td> <td>81</td> <td>PP-OCRv5_server_rec is a new-generation text recognition model. It efficiently and accurately supports four major languages: Simplified Chinese, Traditional Chinese, English, and Japanese, as well as handwriting, vertical text, pinyin, and rare characters, offering robust and efficient support for document understanding.</td> </tr> <tr> <td>PP-OCRv5_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv5_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv5_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>81.29</td> <td>66.00</td> <td>83.55</td> <td>54.65</td> <td>5.43 / 1.46</td> <td>21.20 / 5.32</td> <td>16</td> <td>PP-OCRv5_mobile_rec is a new-generation text recognition model. It efficiently and accurately supports four major languages: Simplified Chinese, Traditional Chinese, English, and Japanese, as well as handwriting, vertical text, pinyin, and rare characters, offering robust and efficient support for document understanding.</td> </tr> </table>

<b>Chinese Recognition Models</b>

<table> <tr> <th>Model</th><th>Download Link</th> <th>Avg Accuracy (%)</th> <th>GPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>CPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>PP-OCRv4_server_rec_doc</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_rec_doc_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_rec_doc_pretrained.pdparams">Pretrained Model</a></td> <td>86.58</td> <td>8.69 / 2.78</td> <td>37.93 / 37.93</td> <td>182</td> <td>Based on PP-OCRv4_server_rec, trained on additional Chinese documents and PP-OCR mixed data. It supports over 15,000 characters including Traditional Chinese, Japanese, and special symbols, enhancing both document-specific and general text recognition accuracy.</td> </tr> <tr> <td>PP-OCRv4_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>78.74</td> <td>5.26 / 1.12</td> <td>17.48 / 3.61</td> <td>10.5</td> <td>Lightweight model of PP-OCRv4 with high inference efficiency, suitable for deployment on various edge devices.</td> </tr> <tr> <td>PP-OCRv4_server_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_rec_pretrained.pdparams">Pretrained Model</a></td> <td>85.19</td> <td>8.75 / 2.49</td> <td>36.93 / 36.93</td> <td>173</td> <td>Server-side model of PP-OCRv4 with high recognition accuracy, 
suitable for deployment on various servers.</td> </tr> <tr> <td>PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>72.96</td> <td>3.89 / 1.16</td> <td>8.72 / 3.56</td> <td>10.3</td> <td>Lightweight model of PP-OCRv3 with high inference efficiency, suitable for deployment on various edge devices.</td> </tr> </table> <table> <tr> <th>Model</th><th>Model Download Link</th> <th>Recognition Avg Accuracy (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>ch_SVTRv2_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/ch_SVTRv2_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/ch_SVTRv2_rec_pretrained.pdparams">Training Model</a></td> <td>68.81</td> <td>10.38 / 8.31</td> <td>66.52 / 30.83</td> <td>80.5</td> <td rowspan="1"> SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 6% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the A-list. 
</td> </tr> </table> <table> <tr> <th>Model</th><th>Model Download Link</th> <th>Recognition Avg Accuracy (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>ch_RepSVTR_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/ch_RepSVTR_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/ch_RepSVTR_rec_pretrained.pdparams">Training Model</a></td> <td>65.07</td> <td>6.29 / 1.57</td> <td>20.64 / 5.40</td> <td>48.8</td> <td rowspan="1"> The RepSVTR text recognition model is a mobile-oriented text recognition model based on SVTRv2. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 2.5% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the B-list, while maintaining similar inference speed. 
</td> </tr> </table> * <b>English Recognition Models</b> <table> <tr> <th>Model</th><th>Download Link</th> <th>Avg Accuracy (%)</th> <th>GPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>CPU Inference Time (ms) [Standard Mode / High Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>en_PP-OCRv4_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/en_PP-OCRv4_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/en_PP-OCRv4_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>70.39</td> <td>4.81 / 1.23</td> <td>17.20 / 4.18</td> <td>7.5</td> <td>Ultra-lightweight English recognition model trained on PP-OCRv4, supporting English and number recognition.</td> </tr> <tr> <td>en_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/en_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/en_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>70.69</td> <td>3.56 / 0.78</td> <td>8.44 / 5.78</td> <td>17.3</td> <td>Ultra-lightweight English recognition model trained on PP-OCRv3, supporting English and number recognition.</td> </tr> </table>
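The download links in these tables follow a common pattern, so they can be assembled from a model name. The sketch below is a convenience illustration, not an official PaddleX utility: the helper name `model_urls` is hypothetical, and the pattern is inferred from the links listed in this document, so it may not hold for every model or for future releases.

```python
BASE_URL = "https://paddle-model-ecology.bj.bcebos.com/paddlex"


def model_urls(model_name: str) -> dict[str, str]:
    """Build the inference and pretrained download URLs for a model name."""
    return {
        # Inference archives live under a versioned directory (paddle3.0.0).
        "inference": f"{BASE_URL}/official_inference_model/paddle3.0.0/{model_name}_infer.tar",
        # Pretrained weights are stored as a single .pdparams file.
        "pretrained": f"{BASE_URL}/official_pretrained_model/{model_name}_pretrained.pdparams",
    }


urls = model_urls("en_PP-OCRv4_mobile_rec")
print(urls["inference"])
print(urls["pretrained"])
```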

<b>Multilingual Recognition Models</b>

<table> <tr> <th>Model</th><th>Model Download Link</th> <th>Recognition Avg Accuracy(%)</th> <th>GPU Inference Time (ms) [Normal / High Performance]</th> <th>CPU Inference Time (ms) [Normal / High Performance]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> <tr> <td>korean_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/korean_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/korean_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>60.21</td> <td>3.73 / 0.98</td> <td>8.76 / 2.91</td> <td>9.6</td> <td>An ultra-lightweight Korean text recognition model trained based on PP-OCRv3, supporting Korean and digit recognition</td> </tr> <tr> <td>japan_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/japan_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/japan_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>45.69</td> <td>3.86 / 1.01</td> <td>8.62 / 2.92</td> <td>9.8</td> <td>An ultra-lightweight Japanese text recognition model trained based on PP-OCRv3, supporting Japanese and digit recognition</td> </tr> <tr> <td>chinese_cht_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/chinese_cht_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/chinese_cht_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>82.06</td> <td>3.90 / 1.16</td> <td>9.24 / 3.18</td> <td>10.8</td> <td>An ultra-lightweight Traditional Chinese text recognition model trained based on PP-OCRv3, supporting Traditional Chinese and digit 
recognition</td> </tr> <tr> <td>te_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/te_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/te_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>95.88</td> <td>3.59 / 0.81</td> <td>8.28 / 6.21</td> <td>8.7</td> <td>An ultra-lightweight Telugu text recognition model trained based on PP-OCRv3, supporting Telugu and digit recognition</td> </tr> <tr> <td>ka_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/ka_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/ka_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>96.96</td> <td>3.49 / 0.89</td> <td>8.63 / 2.77</td> <td>17.4</td> <td>An ultra-lightweight Kannada text recognition model trained based on PP-OCRv3, supporting Kannada and digit recognition</td> </tr> <tr> <td>ta_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/ta_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/ta_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>76.83</td> <td>3.49 / 0.86</td> <td>8.35 / 3.41</td> <td>8.7</td> <td>An ultra-lightweight Tamil text recognition model trained based on PP-OCRv3, supporting Tamil and digit recognition</td> </tr> <tr> <td>latin_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/latin_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a 
href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/latin_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>76.93</td> <td>3.53 / 0.78</td> <td>8.50 / 6.83</td> <td>8.7</td> <td>An ultra-lightweight Latin text recognition model trained based on PP-OCRv3, supporting Latin and digit recognition</td> </tr> <tr> <td>arabic_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/arabic_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/arabic_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>73.55</td> <td>3.60 / 0.83</td> <td>8.44 / 4.69</td> <td>17.3</td> <td>An ultra-lightweight Arabic script recognition model trained based on PP-OCRv3, supporting Arabic script and digit recognition</td> </tr> <tr> <td>cyrillic_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/cyrillic_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/cyrillic_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>94.28</td> <td>3.56 / 0.79</td> <td>8.22 / 2.76</td> <td>8.7</td> <td>An ultra-lightweight Cyrillic script recognition model trained based on PP-OCRv3, supporting Cyrillic script and digit recognition</td> </tr> <tr> <td>devanagari_PP-OCRv3_mobile_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/devanagari_PP-OCRv3_mobile_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/devanagari_PP-OCRv3_mobile_rec_pretrained.pdparams">Pretrained Model</a></td> <td>96.44</td> <td>3.60 / 0.78</td> <td>6.95 / 2.87</td> <td>8.7</td> <td>An ultra-lightweight Devanagari 
script recognition model trained based on PP-OCRv3, supporting Devanagari script and digits recognition</td> </tr> </table> </details> <details> <summary><b>Text Line Orientation Classification Module (Optional):</b></summary> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>Top-1 Accuracy (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>PP-LCNet_x0_25_textline_ori</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-LCNet_x0_25_textline_ori_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-LCNet_x0_25_textline_ori_pretrained.pdparams">Training Model</a></td> <td>98.85</td> <td>2.16 / 0.41</td> <td>2.37 / 0.73</td> <td>0.96</td> <td>Text line classification model based on PP-LCNet_x0_25, with two classes: 0 degrees and 180 degrees</td> </tr> </tbody> </table> </details> <details> <summary><b>Formula Recognition Module Models (Optional):</b></summary> <table> <tr> <th>Model</th><th>Model Download Link</th> <th>En-BLEU(%)</th> <th>Zh-BLEU(%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Introduction</th> </tr> <tr> <td>UniMERNet</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/UniMERNet_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/UniMERNet_pretrained.pdparams">Training Model</a></td> <td>85.91</td> <td>43.50</td> <td>1311.84 / 1311.84</td> <td>- / 8288.07</td> <td>1530</td> <td>UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. 
The model is trained on a dataset of one million samples, including simple formulas, complex formulas, scanned formulas, and handwritten formulas, significantly improving the recognition accuracy of real-world formulas.</td> </tr> <tr> <td>PP-FormulaNet-S</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-FormulaNet-S_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-FormulaNet-S_pretrained.pdparams">Training Model</a></td> <td>87.00</td> <td>45.71</td> <td>182.25 / 182.25</td> <td>- / 254.39</td> <td>224</td> <td rowspan="2">PP-FormulaNet is an advanced formula recognition model developed by the Baidu PaddlePaddle Vision Team. The PP-FormulaNet-S version uses PP-HGNetV2-B4 as its backbone network. Through parallel masking and model distillation techniques, it significantly improves inference speed while maintaining high recognition accuracy, making it suitable for applications requiring fast inference. 
The PP-FormulaNet-L version, on the other hand, uses Vary_VIT_B as its backbone network and is trained on a large-scale formula dataset, showing significant improvements in recognizing complex formulas compared to PP-FormulaNet-S.</td> </tr> <tr> <td>PP-FormulaNet-L</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-FormulaNet-L_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-FormulaNet-L_pretrained.pdparams">Training Model</a></td> <td>90.36</td> <td>45.78</td> <td>1482.03 / 1482.03</td> <td>- / 3131.54</td> <td>695</td> </tr> <tr> <td>PP-FormulaNet_plus-S</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-FormulaNet_plus-S_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-FormulaNet_plus-S_pretrained.pdparams">Training Model</a></td> <td>88.71</td> <td>53.32</td> <td>179.20 / 179.20</td> <td>- / 260.99</td> <td>248</td> <td rowspan="3">PP-FormulaNet_plus is an enhanced version of the formula recognition model developed by the Baidu PaddlePaddle Vision Team, building upon the original PP-FormulaNet. Compared to the original version, PP-FormulaNet_plus utilizes a more diverse formula dataset during training, including sources such as Chinese dissertations, professional books, textbooks, exam papers, and mathematics journals. This expansion significantly improves the model’s recognition capabilities. Among the models, PP-FormulaNet_plus-M and PP-FormulaNet_plus-L have added support for Chinese formulas and increased the maximum number of predicted tokens for formulas from 1,024 to 2,560, greatly enhancing the recognition performance for complex formulas. Meanwhile, the PP-FormulaNet_plus-S model focuses on improving the recognition of English formulas. 
With these improvements, the PP-FormulaNet_plus series models perform exceptionally well in handling complex and diverse formula recognition tasks. </td> </tr> <tr> <td>PP-FormulaNet_plus-M</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-FormulaNet_plus-M_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-FormulaNet_plus-M_pretrained.pdparams">Training Model</a></td> <td>91.45</td> <td>89.76</td> <td>1040.27 / 1040.27</td> <td>- / 1615.80</td> <td>592</td> </tr> <tr> <td>PP-FormulaNet_plus-L</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-FormulaNet_plus-L_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-FormulaNet_plus-L_pretrained.pdparams">Training Model</a></td> <td>92.22</td> <td>90.64</td> <td>1476.07 / 1476.07</td> <td>- / 3125.58</td> <td>698</td> </tr> <tr> <td>LaTeX_OCR_rec</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/LaTeX_OCR_rec_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/LaTeX_OCR_rec_pretrained.pdparams">Training Model</a></td> <td>74.55</td> <td>39.96</td> <td>1088.89 / 1088.89</td> <td>- / -</td> <td>99</td> <td>LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. 
It uses Hybrid ViT as the backbone network and a transformer as the decoder, significantly improving the accuracy of formula recognition.</td> </tr> </table> </details> <details> <summary><b>Seal Text Detection Module Models (Optional):</b></summary> <table> <thead> <tr> <th>Model</th><th>Model Download Link</th> <th>Detection Hmean (%)</th> <th>GPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>CPU Inference Time (ms) [Normal Mode / High-Performance Mode]</th> <th>Model Storage Size (MB)</th> <th>Description</th> </tr> </thead> <tbody> <tr> <td>PP-OCRv4_server_seal_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_server_seal_det_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_server_seal_det_pretrained.pdparams">Training Model</a></td> <td>98.40</td> <td>124.64 / 91.57</td> <td>545.68 / 439.86</td> <td>109</td> <td>PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers</td> </tr> <tr> <td>PP-OCRv4_mobile_seal_det</td> <td><a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_inference_model/paddle3.0.0/PP-OCRv4_mobile_seal_det_infer.tar">Inference Model</a>/<a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/official_pretrained_model/PP-OCRv4_mobile_seal_det_pretrained.pdparams">Training Model</a></td> <td>96.36</td> <td>9.70 / 3.56</td> <td>50.38 / 19.64</td> <td>4.7</td> <td>PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices</td> </tr> </tbody> </table> </details> <details> <summary> <b>Test Environment Description:</b></summary> <ul> <li><b>Performance Test Environment</b> <ul> <li><strong>Test Dataset: </strong> <ul> <li>Text Image Rectification Model: <a href="https://www3.cs.stonybrook.edu/~cvl/docunet.html">DocUNet</a></li> <li>Layout 
Region Detection Model: A self-built layout analysis dataset using PaddleOCR, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.</li> <li>Table Structure Recognition Model: A self-built English table recognition dataset using PaddleX.</li> <li>Text Detection Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 500 images for detection.</li> <li>Chinese Recognition Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.</li> <li>ch_SVTRv2_rec: Evaluation set A for "OCR End-to-End Recognition Task" in the <a href="https://aistudio.baidu.com/competition/detail/1131/0/introduction">PaddleOCR Algorithm Model Challenge</a></li> <li>ch_RepSVTR_rec: Evaluation set B for "OCR End-to-End Recognition Task" in the <a href="https://aistudio.baidu.com/competition/detail/1131/0/introduction">PaddleOCR Algorithm Model Challenge</a></li> <li>English Recognition Model: A self-built English dataset using PaddleX.</li> <li>Multilingual Recognition Model: A self-built multilingual dataset using PaddleX.</li> <li>Text Line Orientation Classification Model: A self-built dataset using PaddleOCR, covering various scenarios such as ID cards and documents, containing 1000 images.</li> <li>Seal Text Detection Model: A self-built dataset using PaddleOCR, containing 500 images of circular seal textures.</li> </ul> </li> <li><strong>Hardware Configuration:</strong> <ul> <li>GPU: NVIDIA Tesla T4</li> <li>CPU: Intel Xeon Gold 6271C @ 2.60GHz</li> </ul> </li> <li><strong>Software Environment:</strong> <ul> <li>Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6</li> <li>paddlepaddle-gpu 3.0.0 / paddleocr 3.0.3</li> </ul> </li> </ul> </li> <li><b>Inference Mode Description</b></li> </ul> <table border="1"> 
<thead> <tr> <th>Mode</th> <th>GPU Configuration </th> <th>CPU Configuration </th> <th>Acceleration Technology Combination</th> </tr> </thead> <tbody> <tr> <td>Normal Mode</td> <td>FP32 Precision / No TRT Acceleration</td> <td>FP32 Precision / 8 Threads</td> <td>Local Paddle inference engines (by default, the engine is selected according to the default model name; if both static and dynamic graph are available, <code>paddle_static</code> is preferred)</td> </tr> <tr> <td>High-Performance Mode</td> <td>Optimal combination of pre-selected precision types and acceleration strategies</td> <td>FP32 Precision / 8 Threads</td> <td>Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.)</td> </tr> </tbody> </table> </details>

<b>If you prioritize model accuracy, choose a model with higher accuracy. If you prioritize inference speed, select a model with faster inference. If you prioritize model storage size, choose a model with a smaller storage size.</b>

2. Quick Start

Before using the PP-ChatOCRv4-doc pipeline locally, ensure you have completed the installation of the PaddleOCR wheel package according to the PaddleOCR Local Installation Tutorial. If you prefer to install dependencies selectively, please refer to the relevant instructions in the installation documentation. The corresponding dependency group for this pipeline is ie.

Please note: If you encounter issues such as the program becoming unresponsive, unexpected program termination, running out of memory resources, or extremely slow inference during execution, please try adjusting the configuration according to the documentation, such as disabling unnecessary features or using lighter-weight models.

Before performing model inference, you first need to prepare the API key for the large language model. PP-ChatOCRv4 supports large model services on the Baidu Cloud Qianfan Platform as well as locally deployed services that expose the standard OpenAI interface. If using the Baidu Cloud Qianfan Platform, refer to Authentication and Authorization to obtain the API key. If using a locally deployed large model service, refer to the PaddleNLP Large Model Deployment Documentation to deploy the dialogue and vectorization interfaces, and fill in the corresponding base_url and api_key. If you need to use a multimodal large model for data fusion, refer to the OpenAI service deployment in the PaddleMIX Model Documentation to deploy the multimodal large model, and fill in the corresponding base_url and api_key.
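
To keep the API key out of source files, one option is to read it from an environment variable and build the configuration dictionaries from it. The following is a minimal sketch, not part of PaddleOCR: the helper name `build_llm_configs` and the variable name `QIANFAN_API_KEY` are assumptions made for illustration, while the dictionary fields mirror the Qianfan settings used in the Python example in Section 2.2.

```python
import os


def build_llm_configs(api_key_env="QIANFAN_API_KEY",
                      base_url="https://qianfan.baidubce.com/v2"):
    """Build chat and retriever configs from an environment variable.

    The variable name QIANFAN_API_KEY is an assumption of this sketch;
    PaddleOCR does not read it on its own.
    """
    api_key = os.environ.get(api_key_env)
    if not api_key:
        raise RuntimeError(f"Set {api_key_env} before running the pipeline")

    # Field values follow the Qianfan example in Section 2.2.
    chat_bot_config = {
        "module_name": "chat_bot",
        "model_name": "ernie-3.5-8k",
        "base_url": base_url,
        "api_type": "openai",
        "api_key": api_key,
    }
    retriever_config = {
        "module_name": "retriever",
        "model_name": "embedding-v1",
        "base_url": base_url,
        "api_type": "qianfan",
        "api_key": api_key,
    }
    return chat_bot_config, retriever_config
```

For a locally deployed service, pass its address as `base_url` and adjust `model_name` to match what the service exposes.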

Note: If local deployment of a multimodal large model is restricted due to the local environment, you can comment out the lines containing the mllm variable in the code and only use the large language model for information extraction.

2.1 Command Line Experience

After updating the configuration file, you can run quick inference with a single command. You can use the test file for testing:

bash

paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key

# Use a multimodal large model via --invoke_mllm and --pp_docbee_base_url
paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key --invoke_mllm True --pp_docbee_base_url http://127.0.0.1:8080/

The examples above use local Paddle inference engines. By default, each module selects the appropriate local Paddle inference engine according to the default model name: models that support only dynamic graph use paddle_dynamic, while models that support both static and dynamic graph prefer paddle_static. To run them, first install PaddlePaddle by following PaddlePaddle Framework Installation.

If you choose transformers as the inference engine, make sure the Transformers environment is configured by following Inference Engine and Configuration, and then run the following command:

bash

# Use the transformers engine for inference
paddleocr pp_chatocrv4_doc -i vehicle_certificate-1.png -k 驾驶室准乘人数 --qianfan_api_key your_api_key \
    --engine transformers

<details><summary><b>The command line supports more parameter configurations. Click to expand for a detailed explanation of the command line parameters.</b></summary> <table> <thead> <tr> <th>Parameter</th> <th>Description</th> <th>Type</th> <th>Default</th> </tr> </thead> <tbody> <tr> <td><code>input</code></td> <td><b>Meaning:</b>Data to be predicted, required.

<b>Description:</b> Accepts a <b>local path</b> to an image or PDF file, such as <code>/root/data/img.jpg</code>; a <b>URL link</b> to an image or PDF file, such as this <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/vehicle_certificate-1.png">Example</a>; or a <b>local directory</b> containing the images to be predicted, such as <code>/root/data/</code> (PDF files inside a directory are currently not supported; specify each PDF by its full file path).

</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>keys</code></td> <td><b>Meaning:</b>Keys for information extraction.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>save_path</code></td> <td> <b>Meaning:</b>Specify the path to save the inference results file.

<b>Description:</b> If not set, the inference results will not be saved locally.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>invoke_mllm</code></td> <td><b>Meaning:</b>Whether to load and use a multimodal large model.

<b>Description:</b> If not set, the default is <code>False</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>layout_detection_model_name</code></td> <td> <b>Meaning:</b>The name of the layout detection model.

<b>Description:</b> If not set, the default model in pipeline will be used. </td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b>The directory path of the layout detection model.

<b>Description:</b> If not set, the official model will be downloaded.

</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td> <b>Meaning:</b>The name of the document orientation classification model.

<b>Description:</b> If not set, the default model in pipeline will be used. </td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b>The directory path of the document orientation classification model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b>The name of the text image unwarping model.

<b>Description:</b> If not set, the default model in pipeline will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b>The directory path of the text image unwarping model. <b>Description:</b> If not set, the official model will be downloaded.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_detection_model_name</code></td> <td><b>Meaning:</b>Name of the text detection model.

<b>Description:</b> If not set, the pipeline's default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_detection_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text detection model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_recognition_model_name</code></td> <td><b>Meaning:</b>Name of the text recognition model.

<b>Description:</b> If not set, the pipeline's default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_recognition_model_dir</code></td> <td><b>Meaning:</b>Directory path of the text recognition model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_recognition_batch_size</code></td> <td><b>Meaning:</b>Batch size for the text recognition model.

<b>Description:</b> If not set, the default batch size will be <code>1</code>.</td>

<td><code>int</code></td> <td></td> </tr> <tr> <td><code>table_structure_recognition_model_name</code></td> <td><b>Meaning:</b>Name of the table structure recognition model.

<b>Description:</b> If not set, the pipeline's default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>table_structure_recognition_model_dir</code></td> <td><b>Meaning:</b>Directory path of the table structure recognition model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_text_detection_model_name</code></td> <td><b>Meaning:</b>The name of the seal text detection model.

<b>Description:</b> If not set, the pipeline's default model will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_text_detection_model_dir</code></td> <td><b>Meaning:</b>The directory path of the seal text detection model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_text_recognition_model_name</code></td> <td><b>Meaning:</b>The name of the seal text recognition model.

<b>Description:</b> If not set, the default model of the pipeline will be used.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_text_recognition_model_dir</code></td> <td><b>Meaning:</b>The directory path of the seal text recognition model.

<b>Description:</b> If not set, the official model will be downloaded.</td>

<td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_text_recognition_batch_size</code></td> <td><b>Meaning:</b>The batch size for the seal text recognition model.

<b>Description:</b> If not set, the batch size will default to <code>1</code>.</td>

<td><code>int</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b>Whether to load and use the document orientation classification module.

<b>Description:</b> If not set, the parameter value initialized by the pipeline will be used, which defaults to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b>Whether to load and use the text image unwarping module.

<b>Description:</b> If not set, the parameter value initialized by the pipeline will be used, which defaults to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_textline_orientation</code></td> <td><b>Meaning:</b>Whether to load and use the text line orientation classification module.

<b>Description:</b> If not set, the parameter value initialized by the pipeline will be used, which defaults to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b>Whether to load and use the seal text recognition sub-pipeline.

<b>Description:</b> If not set, the parameter value initialized by the pipeline will be used, defaulting to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>use_table_recognition</code></td> <td><b>Meaning:</b>Whether to load and use the table recognition sub-pipeline.

<b>Description:</b> If not set, the parameter's value initialized during pipeline setup will be used, defaulting to <code>True</code>.</td>

<td><code>bool</code></td> <td></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b>Score threshold for the layout model.

<b>Description:</b> Any value between <code>0-1</code>. If not set, the default value is used, which is <code>0.5</code>.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_nms</code></td> <td> <b>Meaning:</b>Whether to use Non-Maximum Suppression (NMS) as post-processing for layout detection.

<b>Description:</b> If not set, the parameter value initialized by the pipeline will be used, which defaults to <code>True</code>.

</td> <td><code>bool</code></td> <td></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b>Unclip ratio for detected boxes in layout detection model.

<b>Description:</b> Any float > <code>0</code>. If not set, the default is <code>1.0</code>.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b>The merging mode for the detection boxes output by the model in layout region detection.

<b>Description:</b>

<ul> <li><b>large</b>: When set to "large", only the largest outer bounding box will be retained for overlapping bounding boxes, and the inner overlapping boxes will be removed;</li> <li><b>small</b>: When set to "small", only the smallest inner bounding boxes will be retained for overlapping bounding boxes, and the outer overlapping boxes will be removed;</li> <li><b>union</b>: No filtering of bounding boxes will be performed, and both inner and outer boxes will be retained;</li> </ul>If not set, the default is <code>large</code>. </td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_det_limit_side_len</code></td> <td><b>Meaning:</b>Image side length limitation for text detection.

<b>Description:</b> Any integer greater than <code>0</code>. If not set, the pipeline's initialized value for this parameter (initialized to <code>960</code>) will be used.

</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>text_det_limit_type</code></td> <td><b>Meaning:</b>Type of side length limit for text detection.

<b>Description:</b> Supports <code>min</code> and <code>max</code>. <code>min</code> ensures that the shortest side of the image is not smaller than <code>text_det_limit_side_len</code>, and <code>max</code> ensures that the longest side of the image is not larger than <code>text_det_limit_side_len</code>. If not set, the pipeline's initialized value for this parameter (initialized to <code>max</code>) will be used.

</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>text_det_thresh</code></td> <td><b>Meaning:</b>Pixel threshold for text detection. In the output probability map, pixels with scores higher than this threshold will be considered text pixels.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the pipeline's initialized value for this parameter (<code>0.3</code>) will be used.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>text_det_box_thresh</code></td> <td><b>Meaning:</b>Text detection box threshold. If the average score of all pixels within the detected result boundary is higher than this threshold, the result will be considered a text region.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the pipeline's initialized value for this parameter (<code>0.6</code>) will be used.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>text_det_unclip_ratio</code></td> <td><b>Meaning:</b>Text detection expansion coefficient. This method is used to expand the text region—the larger the value, the larger the expanded area.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the pipeline's initialized value for this parameter (<code>2.0</code>) will be used.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>text_rec_score_thresh</code></td> <td><b>Meaning:</b>Text recognition threshold. Text results with scores higher than this threshold will be retained.

<b>Description:</b> Any floating-point number greater than <code>0</code>. If not set, the pipeline's initialized value for this parameter (<code>0.0</code>, i.e., no threshold) will be used.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>seal_det_limit_side_len</code></td> <td><b>Meaning:</b>Image side length limit for seal text detection.

<b>Description:</b> Any integer > <code>0</code>. If not set, the default is <code>736</code>.

</td> <td><code>int</code></td> <td></td> </tr> <tr> <td><code>seal_det_limit_type</code></td> <td><b>Meaning:</b>Limit type for image side in seal text detection.

<b>Description:</b> Supports <code>min</code> and <code>max</code>; <code>min</code> ensures shortest side ≥ <code>seal_det_limit_side_len</code>, <code>max</code> ensures longest side ≤ <code>seal_det_limit_side_len</code>. If not set, the default is <code>min</code>.

</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>seal_det_thresh</code></td> <td><b>Meaning:</b>Pixel threshold. Pixels with scores above this value in the probability map are considered text.

<b>Description:</b> Any float > <code>0</code>. If not set, the default is <code>0.2</code>.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>seal_det_box_thresh</code></td> <td><b>Meaning:</b>Box threshold. Boxes with average pixel scores above this value are considered text regions.

<b>Description:</b> Any float > <code>0</code>. If not set, the default is <code>0.6</code>.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>seal_det_unclip_ratio</code></td> <td><b>Meaning:</b>Expansion ratio for seal text detection. Higher value means larger expansion area.

<b>Description:</b> Any float > <code>0</code>. If not set, the default is <code>0.5</code>.

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>seal_rec_score_thresh</code></td> <td><b>Meaning:</b>Recognition score threshold. Text results above this value will be kept.

<b>Description:</b> Any float > <code>0</code>. If not set, the default is <code>0.0</code> (no threshold).

</td> <td><code>float</code></td> <td></td> </tr> <tr> <td><code>qianfan_api_key</code></td> <td><b>Meaning:</b>API key for the Qianfan Platform.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>pp_docbee_base_url</code></td> <td><b>Meaning:</b>URL for the multimodal large language model service.</td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b>The device used for inference.

<b>Description:</b> Specify a device type and, where applicable, a device index:

<ul> <li><b>CPU</b>: e.g., <code>cpu</code> indicates using CPU for inference;</li> <li><b>GPU</b>: e.g., <code>gpu:0</code> indicates using the 1st GPU for inference;</li> <li><b>NPU</b>: e.g., <code>npu:0</code> indicates using the 1st NPU for inference;</li> <li><b>XPU</b>: e.g., <code>xpu:0</code> indicates using the 1st XPU for inference;</li> <li><b>MLU</b>: e.g., <code>mlu:0</code> indicates using the 1st MLU for inference;</li> <li><b>DCU</b>: e.g., <code>dcu:0</code> indicates using the 1st DCU for inference;</li> <li><b>MetaX GPU</b>: e.g., <code>metax_gpu:0</code> indicates using the 1st MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: e.g., <code>iluvatar_gpu:0</code> indicates using the 1st Iluvatar GPU for inference;</li> </ul>If not set, the pipeline initialized value for this parameter will be used. During initialization, the local GPU device 0 will be preferred; if unavailable, the CPU device will be used. </td> <td><code>str</code></td> <td></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For CUDA 11.8 versions of PaddlePaddle, the compatible TensorRT version is 8.x (x>=6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>fp32</code> or <code>fp16</code>.</td> <td><code>str</code></td> <td><code>fp32</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td></td> </tr> </tbody> </table> </details>

The command prints the result to the terminal. For the key 驾驶室准乘人数 (the number of occupants permitted in the driver's cab), the printed output looks like this:

驾驶室准乘人数 2
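
The `min` and `max` values of `text_det_limit_type` (and `seal_det_limit_type`) described in the parameter table can be easier to reason about with a small calculation. The sketch below only illustrates the documented semantics; the function name `limit_side_scale` is hypothetical, and the real pipeline may additionally round the resized dimensions, which is not modeled here.

```python
def limit_side_scale(height, width, limit_side_len=960, limit_type="max"):
    """Return the resize factor implied by a side-length limit.

    Illustrative only: mirrors the documented meaning of `min` and `max`,
    not PaddleOCR's actual resize implementation.
    """
    if height <= 0 or width <= 0 or limit_side_len <= 0:
        raise ValueError("height, width and limit_side_len must be positive")

    if limit_type == "max":
        # Shrink only when the longest side exceeds the limit.
        longest = max(height, width)
        return limit_side_len / longest if longest > limit_side_len else 1.0

    if limit_type == "min":
        # Enlarge only when the shortest side is below the limit.
        shortest = min(height, width)
        return limit_side_len / shortest if shortest < limit_side_len else 1.0

    raise ValueError("limit_type must be 'min' or 'max'")


# A 1920x1080 page with the text detection defaults (max, 960) is halved.
print(limit_side_scale(1920, 1080))              # 0.5
# A 368x1000 seal crop with the seal defaults (min, 736) is doubled.
print(limit_side_scale(368, 1000, 736, "min"))   # 2.0
```

With `max`, large scans are downscaled to bound compute; with `min`, small crops are upscaled so that fine text stays detectable.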

2.2 Python Script Experience

The command line is convenient for quickly trying the pipeline and viewing results. In real projects, you will usually integrate the pipeline through code. You can download the Test File and use the following example code for inference:

```python
from paddleocr import PPChatOCRv4Doc

# Large language model (LLM) that answers the extraction questions.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

# Embedding model used to build the vector index for retrieval.
retriever_config = {
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key",  # your api_key
}

# Multimodal large model (MLLM) served locally.
mllm_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "PP-DocBee2",
    "base_url": "http://127.0.0.1:8080/",  # your local mllm service url
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

pipeline = PPChatOCRv4Doc()

# Step 1: run layout analysis and OCR on the input image.
visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

# Collect the visual information of each result.
visual_info_list = []
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]

# Step 2: build the vector index from the visual information.
vector_info = pipeline.build_vector(
    visual_info_list, flag_save_bytes_vector=True, retriever_config=retriever_config
)

# Step 3: query the MLLM directly on the image.
# The key "驾驶室准乘人数" means "cab seating capacity".
mllm_predict_res = pipeline.mllm_pred(
    input="vehicle_certificate-1.png",
    key_list=["驾驶室准乘人数"],
    mllm_chat_bot_config=mllm_chat_bot_config,
)
mllm_predict_info = mllm_predict_res["mllm_res"]

# Step 4: ask the LLM for the final answer, using all collected information.
chat_result = pipeline.chat(
    key_list=["驾驶室准乘人数"],
    visual_info=visual_info_list,
    vector_info=vector_info,
    mllm_predict_info=mllm_predict_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)
```

The example above uses the local Paddle inference engines by default: each module selects an engine according to its default model name. Models that support only dynamic graph use paddle_dynamic, while models that support both static and dynamic graph prefer paddle_static. Before running the example, install PaddlePaddle by following PaddlePaddle Framework Installation.
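
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not PaddleOCR's implementation: `select_local_engine` and its two boolean arguments are hypothetical names, and returning `paddle_static` for a static-only model is an assumption, since the text covers only the other two cases.

```python
def select_local_engine(supports_static: bool, supports_dynamic: bool) -> str:
    """Hypothetical helper that mirrors the engine selection rule described above."""
    if supports_static:
        # Both modes available (or, by assumption, static only): prefer static graph.
        return "paddle_static"
    if supports_dynamic:
        # Dynamic graph is the only supported mode.
        return "paddle_dynamic"
    raise ValueError("The model supports neither static nor dynamic graph.")


print(select_local_engine(supports_static=True, supports_dynamic=True))
print(select_local_engine(supports_static=False, supports_dynamic=True))
```

To override this automatic choice, pass `engine` explicitly when creating the pipeline.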

If you choose transformers as the inference engine, make sure the Transformers environment is configured by following Inference Engine and Configuration, and then run the following code:

```python
from paddleocr import PPChatOCRv4Doc

# Large language model (LLM) that answers the extraction questions.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

# Embedding model used to build the vector index for retrieval.
retriever_config = {
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key",  # your api_key
}

# Multimodal large model (MLLM) served locally.
mllm_chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "PP-DocBee2",
    "base_url": "http://127.0.0.1:8080/",  # your local mllm service url
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

# The only difference from the previous example: select the transformers engine.
pipeline = PPChatOCRv4Doc(
    engine="transformers",
)

# Step 1: run layout analysis and OCR on the input image.
visual_predict_res = pipeline.visual_predict(
    input="vehicle_certificate-1.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

# Collect the visual information of each result.
visual_info_list = []
for res in visual_predict_res:
    visual_info_list.append(res["visual_info"])
    layout_parsing_result = res["layout_parsing_result"]

# Step 2: build the vector index from the visual information.
vector_info = pipeline.build_vector(
    visual_info_list, flag_save_bytes_vector=True, retriever_config=retriever_config
)

# Step 3: query the MLLM directly on the image.
# The key "驾驶室准乘人数" means "cab seating capacity".
mllm_predict_res = pipeline.mllm_pred(
    input="vehicle_certificate-1.png",
    key_list=["驾驶室准乘人数"],
    mllm_chat_bot_config=mllm_chat_bot_config,
)
mllm_predict_info = mllm_predict_res["mllm_res"]

# Step 4: ask the LLM for the final answer, using all collected information.
chat_result = pipeline.chat(
    key_list=["驾驶室准乘人数"],
    visual_info=visual_info_list,
    vector_info=vector_info,
    mllm_predict_info=mllm_predict_info,
    chat_bot_config=chat_bot_config,
    retriever_config=retriever_config,
)
print(chat_result)
```

After running, the output is as follows:

{'chat_res': {'驾驶室准乘人数': '2'}}
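
Since the result is a plain dictionary, the extracted values can be read with ordinary dictionary access. The sketch below assumes the structure shown above (a `chat_res` mapping from each requested key to a string). `extract_fields` is a hypothetical helper, and `发动机号` (engine number) is a made-up key used only to show how a missing field is handled.

```python
def extract_fields(chat_result: dict, keys: list, default: str = "") -> dict:
    """Return the value of each requested key, or `default` when the key is absent."""
    chat_res = chat_result.get("chat_res", {})
    return {key: chat_res.get(key, default) for key in keys}


# Sample result copied from the output above.
chat_result = {"chat_res": {"驾驶室准乘人数": "2"}}

fields = extract_fields(chat_result, ["驾驶室准乘人数", "发动机号"])
seats = int(fields["驾驶室准乘人数"])  # values arrive as strings; convert when needed

print(fields)
print(seats)
```

Falling back to a default value keeps downstream code from raising `KeyError` when the model returns no answer for a key.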

The prediction process, API description, and output description for PP-ChatOCRv4 are as follows:

<details><summary>(1) Call <code>PPChatOCRv4Doc</code> to instantiate the PP-ChatOCRv4 pipeline object.</summary>

The relevant parameter descriptions are as follows:

<table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tbody> <tr> <td><code>layout_detection_model_name</code></td> <td><b>Meaning:</b> The name of the model used for layout region detection.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_detection_model_dir</code></td> <td><b>Meaning:</b> The directory path of the layout region detection model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_name</code></td> <td><b>Meaning:</b> The name of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_orientation_classify_model_dir</code></td> <td><b>Meaning:</b> The directory path of the document orientation classification model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_name</code></td> <td><b>Meaning:</b> The name of the document unwarping model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>doc_unwarping_model_dir</code></td> <td><b>Meaning:</b> The directory path of the document unwarping model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_detection_model_name</code></td> <td><b>Meaning:</b> The name of the text detection model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_detection_model_dir</code></td> <td><b>Meaning:</b> The directory path of the text detection model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_recognition_model_name</code></td> <td><b>Meaning:</b> The name of the text recognition model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_recognition_model_dir</code></td> <td><b>Meaning:</b> The directory path of the text recognition model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_recognition_batch_size</code></td> <td><b>Meaning:</b> The batch size for the text recognition model.

<b>Description:</b> If set to <code>None</code>, the batch size will default to <code>1</code>.</td>

<td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>table_structure_recognition_model_name</code></td> <td><b>Meaning:</b> The name of the table structure recognition model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>table_structure_recognition_model_dir</code></td> <td><b>Meaning:</b> The directory path of the table structure recognition model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_text_detection_model_name</code></td> <td><b>Meaning:</b> The name of the seal text detection model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_text_detection_model_dir</code></td> <td><b>Meaning:</b> The directory path of the seal text detection model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_text_recognition_model_name</code></td> <td><b>Meaning:</b> The name of the seal text recognition model.

<b>Description:</b> If set to <code>None</code>, the pipeline's default model will be used.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_text_recognition_model_dir</code></td> <td><b>Meaning:</b> The directory path of the seal text recognition model.

<b>Description:</b> If set to <code>None</code>, the official model will be downloaded.</td>

<td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_text_recognition_batch_size</code></td> <td><b>Meaning:</b> The batch size for the seal text recognition model.

<b>Description:</b> If set to <code>None</code>, the batch size will default to <code>1</code>.</td>

<td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b> Whether to load and use the document orientation classification module. <b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b> Whether to load and use the document unwarping module.

<b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_textline_orientation</code></td> <td><b>Meaning:</b> Whether to load and use the text line orientation classification function.

<b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b> Whether to load and use the seal text recognition sub-pipeline.

<b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_table_recognition</code></td> <td><b>Meaning:</b> Whether to load and use the table recognition sub-pipeline.

<b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_threshold</code></td> <td><b>Meaning:</b> Layout model score threshold.

<b>Description:</b>

<ul> <li><b>float</b>: Any float between <code>0</code> and <code>1</code>;</li> <li><b>dict</b>: e.g., <code>{0: 0.1}</code>, where the key is the class ID and the value is the threshold for that class;</li> <li><b>None</b>: If set to <code>None</code>, uses the pipeline default of <code>0.5</code>.</li> </ul> </td> <td><code>float|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_nms</code></td> <td><b>Meaning:</b> Whether to use Non-Maximum Suppression (NMS) as post-processing for layout detection.

<b>Description:</b> If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>True</code>).</td>

<td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_unclip_ratio</code></td> <td><b>Meaning:</b> Expansion factor for the detection boxes of the layout region detection model.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>Tuple[float,float]</b>: Expansion ratios in the horizontal and vertical directions;</li> <li><b>dict</b>: A dictionary with <b>int</b> keys representing <code>cls_id</code>, and <b>tuple</b> values, e.g., <code>{0: (1.1, 2.0)}</code> means width is expanded 1.1× and height 2.0× for class 0 boxes;</li> <li><b>None</b>: If set to <code>None</code>, uses the pipeline default of <code>1.0</code>.</li> </ul> </td> <td><code>float|Tuple[float,float]|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>layout_merge_bboxes_mode</code></td> <td><b>Meaning:</b> Method for filtering overlapping boxes in layout region detection.

<b>Description:</b>

<ul> <li><b>str</b>: <code>large</code>, <code>small</code>, or <code>union</code>, indicating whether to keep the larger box, the smaller box, or both when filtering overlapping boxes;</li> <li><b>dict</b>: A dictionary with <b>int</b> keys representing <code>cls_id</code>, and <b>str</b> values, e.g., <code>{0: "large", 2: "small"}</code> means use <code>large</code> mode for class 0 detection boxes and <code>small</code> mode for class 2 detection boxes;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>large</code>).</li> </ul> </td> <td><code>str|dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_det_limit_side_len</code></td> <td><b>Meaning:</b> Image side length limit for text detection.

<b>Description:</b>

<ul> <li><b>int</b>: Any integer greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>960</code>).</li> </ul> </td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_det_limit_type</code></td> <td><b>Meaning:</b> Type of side length limit for text detection.

<b>Description:</b>

<ul> <li><b>str</b>: Supports <code>min</code> and <code>max</code>. <code>min</code> ensures the shortest side of the image is not less than <code>text_det_limit_side_len</code>. <code>max</code> ensures the longest side of the image is not greater than <code>text_det_limit_side_len</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>max</code>).</li> </ul> </td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_det_thresh</code></td> <td><b>Meaning:</b> Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.3</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_det_box_thresh</code></td> <td><b>Meaning:</b> Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.6</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_det_unclip_ratio</code></td> <td><b>Meaning:</b> Text detection expansion factor, used to expand text regions; the larger the value, the larger the expanded area.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>2.0</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>text_rec_score_thresh</code></td> <td><b>Meaning:</b> Text recognition threshold. Text results with scores greater than this threshold will be kept.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.0</code>, i.e., no threshold).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_det_limit_side_len</code></td> <td><b>Meaning:</b> Image side length limit for seal text detection.

<b>Description:</b>

<ul> <li><b>int</b>: Any integer greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>736</code>).</li> </ul> </td> <td><code>int|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_det_limit_type</code></td> <td><b>Meaning:</b> Type of image side length limit for seal text detection.

<b>Description:</b>

<ul> <li><b>str</b>: Supports <code>min</code> and <code>max</code>. <code>min</code> ensures the shortest side of the image is not less than <code>seal_det_limit_side_len</code>. <code>max</code> ensures the longest side of the image is not greater than <code>seal_det_limit_side_len</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>min</code>).</li> </ul> </td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_det_thresh</code></td> <td><b>Meaning:</b> Detection pixel threshold. In the output probability map, pixels with scores greater than this threshold are considered text pixels.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.2</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_det_box_thresh</code></td> <td><b>Meaning:</b> Detection box threshold. If the average score of all pixels within a detection result's bounding box is greater than this threshold, the result is considered a text region.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.6</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_det_unclip_ratio</code></td> <td><b>Meaning:</b> Seal text detection expansion factor, used to expand text regions; the larger the value, the larger the expanded area.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.5</code>).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>seal_rec_score_thresh</code></td> <td><b>Meaning:</b> Seal text recognition threshold. Text results with scores greater than this threshold will be kept.

<b>Description:</b>

<ul> <li><b>float</b>: Any float greater than <code>0</code>;</li> <li><b>None</b>: If set to <code>None</code>, the value initialized by the pipeline for this parameter will be used (defaults to <code>0.0</code>, i.e., no threshold).</li></ul> </td> <td><code>float|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>retriever_config</code></td> <td><b>Meaning:</b> Configuration parameters for the vector retrieval large model.

<b>Description:</b> The configuration content is the following dictionary:

<pre><code>{
    "module_name": "retriever",
    "model_name": "embedding-v1",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "qianfan",
    "api_key": "api_key"  # Please set this to your actual API key
}</code></pre> </td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>mllm_chat_bot_config</code></td> <td><b>Meaning:</b> Configuration parameters for the multimodal large model.

<b>Description:</b> The configuration content is the following dictionary:

<pre><code>{
    "module_name": "chat_bot",
    "model_name": "PP-DocBee",
    "base_url": "http://127.0.0.1:8080/",  # Please set this to the actual URL of your multimodal large model service
    "api_type": "openai",
    "api_key": "api_key"  # Please set this to your actual API key
}</code></pre> </td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>chat_bot_config</code></td> <td><b>Meaning:</b> Configuration information for the large language model.

<b>Description:</b> The configuration content is the following dictionary:

<pre><code>{
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key"  # Please set this to your actual API key
}</code></pre> </td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>device</code></td> <td><b>Meaning:</b> Device used for inference.

<b>Description:</b> Supports specifying the device type and card number:

<ul> <li><b>CPU</b>: e.g., <code>cpu</code> indicates using CPU for inference;</li> <li><b>GPU</b>: e.g., <code>gpu:0</code> indicates using the 1st GPU for inference;</li> <li><b>NPU</b>: e.g., <code>npu:0</code> indicates using the 1st NPU for inference;</li> <li><b>XPU</b>: e.g., <code>xpu:0</code> indicates using the 1st XPU for inference;</li> <li><b>MLU</b>: e.g., <code>mlu:0</code> indicates using the 1st MLU for inference;</li> <li><b>DCU</b>: e.g., <code>dcu:0</code> indicates using the 1st DCU for inference;</li> <li><b>MetaX GPU</b>: e.g., <code>metax_gpu:0</code> indicates using the 1st MetaX GPU for inference;</li> <li><b>Iluvatar GPU</b>: e.g., <code>iluvatar_gpu:0</code> indicates using the 1st Iluvatar GPU for inference;</li> <li><b>None</b>: If set to <code>None</code>, the pipeline initialized value for this parameter will be used. During initialization, the local GPU device 0 will be preferred; if unavailable, the CPU device will be used.</li> </ul> </td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine</code></td> <td><b>Meaning:</b> Inference engine. <b>Description:</b> Supports <code>None</code> (the default), <code>paddle</code>, <code>paddle_static</code>, <code>paddle_dynamic</code>, and <code>transformers</code>. When left as <code>None</code>, PaddleOCR preserves the behavior of earlier versions, which in most configurations is equivalent to <code>paddle</code>. For detailed descriptions, supported values, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>engine_config</code></td> <td><b>Meaning:</b> Inference-engine configuration. <b>Description:</b> Recommended together with <code>engine</code>. 
For supported fields, compatibility rules, and examples, see <a href="../inference_engine.en.md">Inference Engine and Configuration</a>.</td> <td><code>dict|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>enable_hpi</code></td> <td><b>Meaning:</b> Whether to enable high-performance inference.</td> <td><code>bool</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_tensorrt</code></td> <td><b>Meaning:</b> Whether to enable the TensorRT subgraph engine of Paddle Inference.

<b>Description:</b> If the model does not support TensorRT acceleration, acceleration will not be used even if this flag is set.

For PaddlePaddle builds with CUDA 11.8, the compatible TensorRT version is 8.x (x >= 6). TensorRT 8.6.1.6 is recommended.

</td> <td><code>bool</code></td> <td><code>False</code></td> </tr> <tr> <td><code>precision</code></td> <td><b>Meaning:</b> Computation precision, such as <code>"fp32"</code> or <code>"fp16"</code>.</td> <td><code>str</code></td> <td><code>"fp32"</code></td> </tr> <tr> <td><code>enable_mkldnn</code></td> <td><b>Meaning:</b> Whether to enable MKL-DNN accelerated inference.

<b>Description:</b> If MKL-DNN is unavailable or the model does not support MKL-DNN acceleration, acceleration will not be used even if this flag is set.

</td> <td><code>bool</code></td> <td><code>True</code></td> </tr> <tr> <td><code>mkldnn_cache_capacity</code></td> <td> <b>Meaning:</b> MKL-DNN cache capacity. </td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>cpu_threads</code></td> <td><b>Meaning:</b> Number of threads used for inference on CPU.</td> <td><code>int</code></td> <td><code>10</code></td> </tr> <tr> <td><code>paddlex_config</code></td> <td><b>Meaning:</b> Path to the PaddleX pipeline configuration file.</td> <td><code>str|None</code></td> <td><code>None</code></td> </tr> </tbody> </table> </details> <details><summary>(2) Call the <code>visual_predict()</code> method of the PP-ChatOCRv4 pipeline object to obtain visual prediction results. This method returns a list of results. The pipeline also provides the <code>visual_predict_iter()</code> method. The two methods accept the same parameters and return the same results; the difference is that <code>visual_predict_iter()</code> returns a <code>generator</code>, which lets you process and retrieve prediction results step by step and suits large datasets or scenarios where memory must be saved. Choose either method based on your needs. The parameters of the <code>visual_predict()</code> method are described below:</summary> <table> <thead> <tr> <th>Parameter</th> <th>Parameter Description</th> <th>Parameter Type</th> <th>Default Value</th> </tr> </thead> <tr> <td><code>input</code></td> <td><b>Meaning:</b> Data to be predicted. Required; multiple input types are supported.

<b>Description:</b>

<ul> <li><b>Python Var</b>: e.g., image data represented by <code>numpy.ndarray</code>;</li> <li><b>str</b>: e.g., local path of an image file or PDF file: <code>/root/data/img.jpg</code>; <b>URL link</b>, e.g., network URL of an image file or PDF file: <a href="https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/vehicle_certificate-1.png">Example</a>; <b>Local directory</b>, which must contain images to be predicted, e.g., local path: <code>/root/data/</code> (Currently, prediction from directories containing PDF files is not supported; PDF files need to be specified by their full path);</li> <li><b>list</b>: List elements must be of the above types, e.g., <code>[numpy.ndarray, numpy.ndarray]</code>, <code>["/root/data/img1.jpg", "/root/data/img2.jpg"]</code>, <code>["/root/data1", "/root/data2"]</code>.</li> </ul> </td> <td><code>Python Var|str|list</code></td> <td></td> </tr> <tr> <td><code>use_doc_orientation_classify</code></td> <td><b>Meaning:</b> Whether to use the document orientation classification module during inference.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_doc_unwarping</code></td> <td><b>Meaning:</b> Whether to use the document image unwarping module during inference.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_textline_orientation</code></td> <td><b>Meaning:</b> Whether to use the text line orientation classification module during inference.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_seal_recognition</code></td> <td><b>Meaning:</b> Whether to use the seal text recognition sub-pipeline during inference.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> <td><code>use_table_recognition</code></td> <td><b>Meaning:</b> Whether to use the table recognition sub-pipeline during inference.</td> <td><code>bool|None</code></td> <td><code>None</code></td> </tr> <tr> 
<td><code>layout_threshold</code></td> <td><b>Meaning:</b> Same meaning as the instantiation parameters.