1. PP-OCRv6 Introduction

PP-OCRv6 is the latest generation of the PP-OCR universal text recognition solution. Built on the newly designed PPLCNetV4 unified backbone, it comes in tiny, small, and medium tiers that target edge/IoT, mobile/desktop, and server scenarios respectively. Language coverage is the headline change: the medium and small tiers support 50 languages with a single unified model, covering Simplified Chinese, Traditional Chinese, English, Japanese, and 46 Latin-script languages (the tiny tier supports 49, excluding Japanese). On our in-house multi-scenario benchmark, PP-OCRv6_medium improves recognition accuracy by 5.1 percentage points and detection Hmean by 4.6 percentage points over PP-OCRv5_server, with a 2.37× GPU inference speedup; with only 34.5M parameters, it also surpasses VLMs such as Qwen3-VL-235B and GPT-5.5 in average accuracy on this benchmark.

Main contributions:

  1. Unified and Scalable Model Family: A three-tier OCR model family spanning 1.5M to 34.5M parameters. The medium tier achieves 86.2% detection Hmean and 83.2% recognition accuracy, serving as production-ready infrastructure for industrial deployment and large-scale data pipelines.
  2. Tailored Lightweight Architectural Innovations: (i) LCNetV4: a MetaFormer-style lightweight backbone with structural reparameterization; (ii) RepLKFPN: a detection neck with dilated reparameterizable depthwise convolutions for large receptive fields; (iii) EncoderWithLightSVTR: a recognition neck with local-global attention and additive skip connections.
  3. Extensive Multi-Language and Scenario Generalization: A single model scaled to support 50 languages and diverse challenging industrial scenes (e.g., digital displays, dot-matrix characters, tire prints), significantly improving OCR performance in scenarios traditionally underserved by general-purpose VLMs.
<div align="center"> </div> <p align="center">Performance comparison between PP-OCRv6, PP-OCRv5, and Vision-Language Models. Left: text detection average Hmean (%); Right: text recognition weighted average accuracy (%).</p>

2. Key Technical Improvements

2.1 Unified Backbone: PPLCNetV4

LCNetV4Block: Following the MetaFormer paradigm, each layer is decomposed into a Token Mixer and a Channel Mixer. Given input feature $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$:

$$\hat{\mathbf{x}} = \text{SE}(\text{DW}(\mathbf{x})) + \mathbf{x}$$

$$\mathbf{y} = W_2\,\sigma(W_1\,\hat{\mathbf{x}}) + \hat{\mathbf{x}}$$

where $\text{DW}(\cdot)$ is a 3×3 depthwise convolution (Token Mixer), SE is an optional channel attention module, $W_1 \in \mathbb{R}^{2C \times C}$ and $W_2 \in \mathbb{R}^{C \times 2C}$ form the Channel Mixer with expansion ratio 2, and $\sigma$ is GELU activation.
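To make the dimensions in the two equations concrete, the sketch below counts the convolution weights of one block. It is a back-of-the-envelope illustration only: the function name is hypothetical, and biases, BatchNorm parameters, and the optional SE module are deliberately ignored.

```python
def lcnetv4_block_weights(channels: int, expansion: int = 2, dw_kernel: int = 3) -> dict:
    """Count conv weights in one block (biases, BN, and the optional SE are ignored)."""
    # Token Mixer: depthwise conv, one k x k filter per channel.
    token_mixer = channels * dw_kernel * dw_kernel
    # Channel Mixer: W1 is (2C x C) and W2 is (C x 2C) for expansion ratio 2.
    hidden = expansion * channels
    channel_mixer = hidden * channels + channels * hidden
    return {
        "token_mixer": token_mixer,
        "channel_mixer": channel_mixer,
        "total": token_mixer + channel_mixer,
    }


counts = lcnetv4_block_weights(channels=64)
print(counts)
```

Under these simplifications the Channel Mixer grows quadratically with the channel count while the Token Mixer grows linearly, so almost all of the block's weights sit in the two pointwise projections.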

Task-Adaptive Downsampling: The same backbone serves both tasks via different stride strategies—detection mode uses standard stride-2 spatial downsampling producing multi-scale feature maps (stride 4/8/16/32); recognition mode uses asymmetric stride $(2,1)$ at Stage 3/4, reducing height only while preserving width, followed by height-axis average pooling to produce 1-D sequential features for CTC/NRTR decoding.
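The effect of the two stride strategies on feature-map shapes can be sketched as follows. The per-level stride lists and the input sizes (640×640 for detection, 48×320 for recognition) are assumptions chosen for illustration; in particular, the text only states the asymmetric stride for Stage 3/4, so the symmetric strides used for the first two levels in recognition mode are a guess.

```python
from math import ceil


def stage_shapes(height, width, strides):
    """Return the (H, W) feature-map size after each level.

    `strides` holds one (stride_h, stride_w) pair per level; sizes are
    rounded up, as with a padded strided convolution.
    """
    shapes = []
    for stride_h, stride_w in strides:
        height = ceil(height / stride_h)
        width = ceil(width / stride_w)
        shapes.append((height, width))
    return shapes


# Assumed per-level strides; the first entry folds in the stem (overall stride 4).
DET_STRIDES = [(4, 4), (2, 2), (2, 2), (2, 2)]  # overall stride 4/8/16/32
REC_STRIDES = [(4, 4), (2, 2), (2, 1), (2, 1)]  # height-only downsampling at Stage 3/4

det = stage_shapes(640, 640, DET_STRIDES)
rec = stage_shapes(48, 320, REC_STRIDES)
seq_len = rec[-1][1]  # height-axis average pooling leaves a 1 x seq_len sequence

print("detection:", det)
print("recognition:", rec, "-> sequence length", seq_len)
```

Because width stops shrinking after the second level, the recognition sequence keeps one time step per 8 input pixels in this sketch, which is what lets narrow characters survive to the decoder.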

Comparison with LCNetV3:

| Design Aspect | LCNetV3 | LCNetV4 |
| --- | --- | --- |
| Architecture | MobileNet-style (DW→SE→PW) | MetaFormer (TokenMixer + ChannelMixer) |
| Channel Interaction | Single 1×1 PW Conv | Expand(2×)→Act→Compress + residual |
| Spatial Mixing | Plain DW Conv | RepDWConv (3×3 + 1×1 + identity) |
| BN Initialization | Standard | Zero-init on compress BN |

<div align="center"> </div> <p align="center">PPLCNetV4 backbone architecture.</p>
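The RepDWConv entry (3×3 + 1×1 + identity) relies on structural reparameterization: the parallel training-time branches can be folded into a single 3×3 depthwise kernel for inference. The sketch below shows this for one channel in plain Python. It is a generic illustration, not the PaddleOCR implementation, and it assumes any per-branch BatchNorm has already been folded into the kernel values.

```python
def dw_conv2d(image, kernel):
    """Single-channel 'same' cross-correlation with zero padding."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def merge_rep_dw(kernel3, scale1):
    """Fold the 1x1 branch and the identity into the centre tap of the 3x3 kernel."""
    merged = [row[:] for row in kernel3]
    merged[1][1] += scale1 + 1.0
    return merged


image = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 2.0], [2.0, 1.0, 0.0, 1.0]]
kernel3 = [[0.1, -0.2, 0.0], [0.3, 0.5, -0.1], [0.0, 0.2, 0.4]]
scale1 = 0.7  # a depthwise 1x1 branch is a single scalar per channel

# Training-time form: three parallel branches summed.
branch3 = dw_conv2d(image, kernel3)
train_out = [
    [branch3[r][c] + scale1 * image[r][c] + image[r][c] for c in range(len(image[0]))]
    for r in range(len(image))
]

# Inference-time form: one 3x3 depthwise convolution with the merged kernel.
infer_out = dw_conv2d(image, merge_rep_dw(kernel3, scale1))

max_diff = max(abs(a - b) for ra, rb in zip(train_out, infer_out) for a, b in zip(ra, rb))
print("max difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding, which is why the extra branches add training capacity without adding inference cost.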

2.2 Detection Module

  • RepLKFPN: Lightweight large-kernel FPN using DilatedReparamBlock (7×7 depthwise conv + dilated branches), 31% fewer parameters than PP-OCRv5's RSEFPN (118K vs 172K) with receptive field expanded from 3×3 to 7×7.
  • Auxiliary Deep Supervision: Prediction heads at P2, P3, P4 levels for stronger gradient signals during training.
  • DiceBCE Loss: Combined DiceLoss + Focal Loss for better per-pixel supervision on small and dense text.
<div align="center"> </div> <p align="center">PP-OCRv6 detection module architecture.</p>
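The idea behind combining a 7×7 depthwise kernel with dilated branches can be illustrated with kernel arithmetic: a small kernel with dilation d is equivalent to a sparse dense kernel of size (k-1)·d+1, and because convolution is linear, parallel branches can be summed into one large kernel. The sketch below is a generic illustration; the branch kernel size (3×3) and the dilations (1, 2, 3) are assumptions, not values taken from the PP-OCRv6 configuration.

```python
def dilated_to_dense(kernel, dilation):
    """Expand a k x k kernel with the given dilation into its dense equivalent."""
    k = len(kernel)
    size = (k - 1) * dilation + 1
    dense = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dense[i * dilation][j * dilation] = kernel[i][j]
    return dense


def add_centered(large, small):
    """Return a copy of `large` with `small` added at its centre (both odd-sized squares)."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[offset + i][offset + j] += value
    return merged


large7 = [[0.01 * (i + j) for j in range(7)] for i in range(7)]
small3 = [[1.0, 2.0, 1.0], [0.0, 3.0, 0.0], [-1.0, -2.0, -1.0]]

# For brevity every dilated branch reuses the same 3x3 weights here;
# in a real block each branch would have its own learned kernel.
merged = large7
for dilation in (1, 2, 3):  # effective sizes 3x3, 5x5, 7x7
    merged = add_centered(merged, dilated_to_dense(small3, dilation))

print(len(merged), "x", len(merged[0]), "merged kernel")
```

After merging, inference needs only the single 7×7 depthwise convolution, while training benefited from the sparse multi-scale branches.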

2.3 Recognition Module

  • EncoderWithLightSVTR Neck: Local context modeling (1×7 depthwise conv) + global self-attention (1-2 Transformer layers), with additive skip connections (instead of concatenation in PP-OCRv5) to reduce parameters.
  • Multi-Head Decoder: CTCHead for efficient parallel inference; NRTRHead for auxiliary training supervision (removed at inference).
  • Tiny Model Design: No neck (direct reshape + FC), trained with knowledge distillation from the medium model.
  • Multilingual Unification: Dictionary extended with ~200 diacritical characters, enabling single-model 50-language coverage.
<div align="center"> </div> <p align="center">PP-OCRv6 recognition module architecture.</p>
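Since only the CTC head is kept at inference, decoding reduces to a per-step argmax followed by the standard CTC collapse rule. The sketch below shows greedy CTC decoding in plain Python. The toy character set and the convention that the blank symbol sits at index 0 are assumptions for illustration, not the PP-OCRv6 dictionary.

```python
def ctc_greedy_decode(step_scores, charset, blank=0):
    """Pick the best class per time step, collapse repeats, then drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in step_scores]
    chars = []
    previous = None
    for index in best:
        # Emit a character only when it differs from the previous step and is not blank.
        if index != previous and index != blank:
            chars.append(charset[index])
        previous = index
    return "".join(chars)


CHARSET = ["<blank>", "a", "c", "t"]  # hypothetical 3-character dictionary plus blank

step_scores = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.20, 0.10, 0.60, 0.10],  # c (repeat, collapsed)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.20, 0.10, 0.10, 0.60],  # t (repeat, collapsed)
]

print(ctc_greedy_decode(step_scores, CHARSET))
```

A blank between two identical predictions is what allows genuinely doubled characters to survive the collapse step.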

3. Key Metrics

3.1 Text Detection

Text detection Hmean (%) on our in-house multi-scenario benchmark (16 categories):

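| Model | AVG | HW-CN | HW-EN | Print-CN | Print-EN | TC | Anc. | JP | Blur | Emo. | Warp | Pin. | Art. | Tab. | Rot. | Indus. | Gen. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PP-OCRv6_medium | 86.2 | 83.7 | 84.0 | 95.1 | 93.7 | 86.3 | 80.2 | 84.3 | 94.1 | 99.6 | 88.6 | 74.0 | 69.0 | 96.8 | 93.8 | 73.3 | 82.8 |
| PP-OCRv6_small | 84.1 | 80.5 | 87.1 | 94.2 | 93.6 | 85.7 | 72.6 | 82.3 | 92.6 | 99.7 | 87.6 | 69.6 | 65.3 | 95.6 | 93.7 | 67.6 | 78.2 |
| PP-OCRv6_tiny | 80.6 | 79.4 | 85.9 | 93.1 | 92.3 | 83.7 | 63.0 | 76.6 | 89.3 | 99.8 | 86.1 | 59.0 | 60.1 | 94.7 | 91.0 | 62.0 | 73.8 |
| PP-OCRv5_server | 81.6 | 80.3 | 84.1 | 94.5 | 91.7 | 81.5 | 67.6 | 77.2 | 90.1 | 96.2 | 87.6 | 67.1 | 67.3 | 97.1 | 80.0 | 64.3 | 79.7 |
| PP-OCRv5_mobile | 75.2 | 74.4 | 77.7 | 90.5 | 91.0 | 82.3 | 58.1 | 72.7 | 87.4 | 93.6 | 82.7 | 57.5 | 52.5 | 92.8 | 64.7 | 52.8 | 72.1 |
| Gemini-3.1-Pro | 46.8 | 53.4 | 56.5 | 47.3 | 47.6 | 39.0 | 45.8 | 38.2 | 50.0 | 68.1 | 44.6 | 40.6 | 65.2 | 26.9 | 22.1 | 52.5 | 50.2 |
| GPT-5.5 | 45.6 | 42.4 | 58.5 | 50.2 | 51.9 | 35.0 | 26.7 | 42.0 | 49.1 | 97.5 | 37.7 | 36.3 | 52.0 | 71.0 | 10.0 | 36.2 | 32.6 |
| Qwen3-VL-235B | 38.3 | 56.5 | 66.0 | 41.7 | 37.0 | 19.3 | 13.1 | 27.0 | 38.5 | 81.2 | 28.5 | 33.0 | 68.3 | 19.6 | 2.1 | 48.4 | 32.3 |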
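Hmean is the harmonic mean of detection precision and recall. The sketch below shows the arithmetic on made-up counts; these numbers are illustrative only and are not taken from the benchmark, and the rule for deciding when a predicted box matches a ground-truth box (for example an IoU threshold) is not specified here.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(matched: int, num_predictions: int, num_ground_truth: int):
    """Return (precision, recall, hmean) from box-matching counts."""
    precision = matched / num_predictions if num_predictions else 0.0
    recall = matched / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, hmean(precision, recall)


# Illustrative counts: 80 predicted boxes matched, out of 100 predictions and 90 labels.
precision, recall, score = detection_scores(80, 100, 90)
print(f"precision={precision:.3f} recall={recall:.3f} hmean={score:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot score well by over-predicting boxes (high recall, low precision) or by being overly conservative.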

3.2 Text Recognition

Text recognition accuracy (%) on our in-house multi-scenario benchmark (15 categories):

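| Model | W-Avg | HW-CN | HW-EN | Print-CN | Print-EN | TC | Anc. | JP | Conf. | Spec. | Gen. | Pin. | Art. | Indus. | Screen | Card |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PP-OCRv6_medium | 83.2 | 62.1 | 67.8 | 91.5 | 94.1 | 78.6 | 72.4 | 90.5 | 64.9 | 61.7 | 87.5 | 78.1 | 71.2 | 77.4 | 82.5 | 88.1 |
| PP-OCRv6_small | 81.3 | 57.6 | 61.1 | 90.5 | 93.3 | 77.0 | 71.1 | 88.2 | 64.1 | 60.2 | 85.7 | 75.9 | 68.4 | 76.4 | 79.7 | 86.9 |
| PP-OCRv6_tiny | 73.5 | 40.1 | 39.3 | 86.7 | 88.4 | 65.0 | 68.4 | 89.8 | 52.3 | 57.1 | 78.0 | 65.4 | 54.7 | 62.1 | 71.2 | 80.5 |
| PP-OCRv5_server | 78.1 | 58.0 | 59.6 | 90.1 | 85.1 | 74.7 | 60.4 | 73.7 | 59.4 | 56.8 | 86.5 | 74.4 | 64.0 | 70.2 | 68.1 | 87.6 |
| PP-OCRv5_mobile | 73.7 | 41.7 | 50.9 | 86.0 | 86.0 | 72.0 | 57.8 | 75.8 | 55.7 | 54.8 | 80.7 | 72.5 | 54.0 | 59.3 | 57.6 | 81.7 |
| Qwen3-VL-235B | 74.9 | 49.7 | 73.2 | 82.3 | 86.2 | 76.4 | 33.6 | 66.2 | 56.1 | 49.0 | 82.5 | 76.5 | 69.6 | 74.7 | 73.8 | 78.7 |
| Gemini-3.1-Pro | 71.4 | 46.4 | 73.0 | 80.0 | 90.5 | 69.5 | 18.0 | 67.2 | 54.4 | 50.3 | 74.6 | 75.9 | 63.1 | 69.1 | 73.2 | 75.9 |
| GPT-5.5 | 64.2 | 19.2 | 56.9 | 75.7 | 82.2 | 57.5 | 63.7 | 58.6 | 49.1 | 48.3 | 67.7 | 50.4 | 53.0 | 62.4 | 67.7 | 71.1 |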
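To see where the recognition gains over PP-OCRv5_server are concentrated, the per-category rows above can be differenced directly. The sketch below copies the PP-OCRv6_medium and PP-OCRv5_server rows from the table and ranks categories by gain in percentage points; it adds no new measurements.

```python
CATEGORIES = [
    "HW-CN", "HW-EN", "Print-CN", "Print-EN", "TC", "Anc.", "JP", "Conf.",
    "Spec.", "Gen.", "Pin.", "Art.", "Indus.", "Screen", "Card",
]
# Per-category recognition accuracy (%), copied from the table rows.
V6_MEDIUM = [62.1, 67.8, 91.5, 94.1, 78.6, 72.4, 90.5, 64.9, 61.7, 87.5, 78.1, 71.2, 77.4, 82.5, 88.1]
V5_SERVER = [58.0, 59.6, 90.1, 85.1, 74.7, 60.4, 73.7, 59.4, 56.8, 86.5, 74.4, 64.0, 70.2, 68.1, 87.6]

# Gain in percentage points, largest first.
gains = sorted(
    ((round(new - old, 1), name) for name, new, old in zip(CATEGORIES, V6_MEDIUM, V5_SERVER)),
    reverse=True,
)

for gain, name in gains[:3]:
    print(f"{name}: +{gain} points")

# Weighted-average gain, from the W-Avg column.
w_avg_gain = round(83.2 - 78.1, 1)
print("W-Avg gain:", w_avg_gain)
```

In this comparison PP-OCRv6_medium is ahead in every category, with the largest margins on the JP, Screen, and Anc. columns.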

3.3 End-to-End Inference Speed (s/image)

Tested on 200 images (general + document scenes), including image I/O, pre/post-processing, and model inference.

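| Hardware | Backend | PP-OCRv6_medium | PP-OCRv6_small | PP-OCRv6_tiny | PP-OCRv5_server | PP-OCRv5_mobile | PP-OCRv4_mobile |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA A100 | PaddlePaddle | 0.29 | 0.25 | 0.13 | 0.32 | 0.25 | 0.14 |
| NVIDIA A100 | TensorRT | -- | 0.32 | 0.16 | -- | 0.33 | 0.16 |
| NVIDIA V100 | PaddlePaddle | 0.72 | 0.49 | 0.21 | 0.66 | 0.50 | 0.25 |
| NVIDIA V100 | ONNX Runtime | 0.67 | 0.53 | 0.29 | 0.77 | 0.46 | 0.27 |
| NVIDIA V100 | TensorRT | 0.77 | 0.60 | 0.23 | 0.73 | 0.59 | 0.27 |
| Intel Xeon 8350C | PaddlePaddle | 2.05 | 0.79 | 0.32 | 2.04 | 0.80 | 0.62 |
| Intel Xeon 8350C | OpenVINO | 1.40 | 0.59 | 0.20 | 7.30 | 0.78 | 0.60 |
| Intel Xeon 8350C | ONNX Runtime | 3.31 | 0.61 | 0.22 | 6.36 | 0.61 | 0.49 |
| Apple M4 | PaddlePaddle | 8.82 | 3.07 | 0.96 | >10 | 5.82 | 5.65 |
| Apple M4 | ONNX Runtime | 5.55 | 1.29 | 0.35 | 7.20 | 1.10 | 1.02 |

  • PP-OCRv6_medium matches or outperforms PP-OCRv5_server in most configurations: 1.1× faster on A100 with PaddlePaddle (0.29s vs 0.32s), 1.15× on V100 with ONNX Runtime (0.67s vs 0.77s), and 5.2× on Intel Xeon with OpenVINO (1.40s vs 7.30s). The exceptions are V100 with PaddlePaddle (0.72s vs 0.66s) and V100 with TensorRT (0.77s vs 0.73s), where it is slightly slower.
  • PP-OCRv6_small matches PP-OCRv5_mobile in latency on most platforms while delivering higher accuracy, and is 1.9× faster on Apple M4 with PaddlePaddle (3.07s vs 5.82s).
  • PP-OCRv6_tiny is the fastest or tied-fastest model in every configuration except V100 with ONNX Runtime, where PP-OCRv4_mobile is marginally faster (0.27s vs 0.29s). It is 6.1× faster than PP-OCRv5_mobile on Apple M4 with PaddlePaddle (0.96s vs 5.82s), 3.9× faster on Intel Xeon with OpenVINO (0.20s vs 0.78s), and reaches 0.13s per image on A100.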
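The speedup figures quoted above are ratios of per-image latency (baseline divided by candidate). The sketch below recomputes a few of them from the table; only values already listed there are used.

```python
# Seconds per image, copied from the table: (PP-OCRv6_medium, PP-OCRv5_server).
LATENCY = {
    ("NVIDIA A100", "PaddlePaddle"): (0.29, 0.32),
    ("NVIDIA V100", "PaddlePaddle"): (0.72, 0.66),
    ("NVIDIA V100", "ONNX Runtime"): (0.67, 0.77),
    ("Intel Xeon 8350C", "OpenVINO"): (1.40, 7.30),
}


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Values above 1.0 mean the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


ratios = {
    key: round(speedup(v5_server, v6_medium), 2)
    for key, (v6_medium, v5_server) in LATENCY.items()
}

for (hardware, backend), ratio in ratios.items():
    print(f"{hardware} / {backend}: {ratio}x")
```

A ratio below 1.0, as for V100 with PaddlePaddle, indicates a configuration where PP-OCRv6_medium is slower than PP-OCRv5_server.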

4. Visualization

4.1 Detection Comparison

<div align="center"> </div> <p align="center">Text detection comparison. Left to right: PP-OCRv6_medium, PP-OCRv5_server, Gemini-3.1-Pro, GPT-5.5.</p>

4.2 Hallucination Comparison

<div align="center"> </div> <p align="center">PP-OCRv6_medium vs VLMs hallucination comparison. PP-OCRv6 faithfully reproduces visual text content, while VLMs introduce hallucinated corrections based on linguistic priors.</p>

4.3 End-to-End OCR Comparison

<div align="center"> </div> <div align="center"> </div> <div align="center"> </div> <p align="center">End-to-end OCR comparison between PP-OCRv6_medium and PP-OCRv5_server across Chinese, English, Japanese, artistic fonts, industrial characters, rotated text, pinyin, and dot-matrix characters.</p>

5. Quick Start

```python
from paddleocr import PaddleOCR

# Default: PP-OCRv6_medium
ocr = PaddleOCR(
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False,
)
result = ocr.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png")

for res in result:
    res.print()                 # print the recognition result
    res.save_to_img("output")   # save the visualization under output/
    res.save_to_json("output")  # save the structured result under output/
```
```bash
# CLI usage
paddleocr ocr -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png \
    --use_doc_orientation_classify False \
    --use_doc_unwarping False \
    --use_textline_orientation False
```
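Once a result has been saved as JSON, a common next step is to keep only the confidently recognized lines. The helper below is a sketch that assumes the saved result exposes parallel `rec_texts` and `rec_scores` lists; these field names are an assumption here, so check a file under `output/` before relying on them. The sample dictionary is a hand-written stand-in, not real model output.

```python
def confident_lines(result: dict, min_score: float = 0.8):
    """Return (text, score) pairs whose recognition score is at least `min_score`.

    Assumes `result` carries parallel `rec_texts` / `rec_scores` lists
    (field names are an assumption; verify against your saved JSON).
    """
    texts = result.get("rec_texts", [])
    scores = result.get("rec_scores", [])
    return [(text, score) for text, score in zip(texts, scores) if score >= min_score]


# Hand-written stand-in for a loaded JSON result.
sample = {
    "rec_texts": ["Boarding Pass", "GATE 12", "S3AT 4A"],
    "rec_scores": [0.99, 0.95, 0.42],
}

for text, score in confident_lines(sample):
    print(f"{score:.2f}  {text}")
```

To use it on a real file, load the JSON with `json.load` and pass the resulting dictionary (or the nested dictionary that holds these two lists) to `confident_lines`.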

Using Transformers Engine:

PP-OCRv6 supports inference via Hugging Face Transformers (requires `transformers>=5.8.0`):

```python
from paddleocr import TextRecognition

model = TextRecognition(
    model_name="PP-OCRv6_medium_rec",
    engine="transformers",
)
output = model.predict(input="general_ocr_rec_001.png", batch_size=1)
for res in output:
    res.print()
```

Using High-Performance Inference (ONNX Runtime backend):

Enable the high-performance inference plugin with `enable_hpi=True`:

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False,
    enable_hpi=True,
)
result = ocr.predict("general_ocr_002.png")
```

The HPI plugin requires additional installation. See High-Performance Inference Guide.

6. Deployment and Custom Development