Model Catalog

This document provides a comprehensive overview of all models and inference engines available in Docling, organized by processing stage.

Overview

Docling's document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog helps you understand:

  • What stages are available for document processing
  • Which model families power each stage
  • What specific models you can use
  • Which inference engines support each model

Stages and Models Overview

The following table shows all processing stages in Docling, their model families, and available models.

<table border="1" cellpadding="6" cellspacing="0"> <thead> <tr> <th>Stage</th> <th>Model Family</th> <th>Models</th> </tr> </thead> <tbody> <tr> <td rowspan="4"><strong>Layout</strong> <em>Document structure detection</em></td> <td rowspan="4">Object Detection (RT-DETR based)</td> <td> <ul> <li><code>docling-layout-heron</code> ⭐</li> <li><code>docling-layout-heron-101</code></li> <li><code>docling-layout-egret-medium</code></li> <li><code>docling-layout-egret-large</code></li> <li><code>docling-layout-egret-xlarge</code></li> <li><code>docling-layout-v2</code> (legacy)</li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engine:</strong> Transformers, ONNXRuntime (in progress)</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Detects document elements (paragraphs, tables, figures, headers, etc.)</td> </tr> <tr> <td colspan="2"><strong>Output:</strong> Bounding boxes with element labels (TEXT, TABLE, PICTURE, SECTION_HEADER, etc.)</td> </tr> <tr> <td rowspan="3"><strong>OCR</strong> <em>Text recognition</em></td> <td rowspan="3">Multiple OCR Engines</td> <td> <ul> <li><strong>Auto</strong> ⭐</li> <li><strong>Tesseract</strong> (CLI or Python bindings)</li> <li><strong>EasyOCR</strong></li> <li><strong>RapidOCR</strong> (ONNX, OpenVINO, PaddlePaddle)</li> <li><strong>macOS Vision</strong> (native macOS)</li> <li><strong>SuryaOCR</strong></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engines:</strong> Engine-specific</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Extracts text from images and scanned documents</td> </tr> <tr> <td rowspan="3"><strong>Table Structure</strong> <em>Table cell recognition</em></td> <td rowspan="3">TableFormer</td> <td> <ul> <li><code>TableFormer (accurate mode)</code> ⭐</li> <li><code>TableFormer (fast mode)</code></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engine:</strong> docling-ibm-models</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Recognizes table structure 
(rows, columns, cells) and relationships</td> </tr> <tr> <td rowspan="3"><strong>Table Structure</strong> <em>Table cell recognition</em></td> <td rowspan="3">Vision-Language Model (Granite Vision)</td> <td> <ul> <li><code>granite-4.0-3b-vision</code></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engine:</strong> Transformers</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> VLM-based table structure recognition using OTSL (Open Table Structure Language) output</td> </tr> <tr> <td rowspan="3"><strong>Table Structure</strong> <em>Table cell recognition</em></td> <td rowspan="3">Object Detection</td> <td> <ul> <li><em>Work in progress</em></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engine:</strong> TBD</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Alternative approach for table structure recognition using object detection</td> </tr> <tr> <td rowspan="3"><strong>Picture Classifier</strong> <em>Image type classification</em></td> <td rowspan="3">Image Classifier (Vision Transformer)</td> <td> <ul> <li><code>DocumentFigureClassifier-v2.5</code> ⭐</li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engine:</strong> Transformers</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Classifies pictures into categories (Chart, Diagram, Natural Image, etc.)</td> </tr> <tr> <td rowspan="4"><strong>VLM Convert</strong> <em>Full page conversion</em></td> <td rowspan="4">Vision-Language Models</td> <td> <ul> <li><strong>Granite-Docling-258M</strong> ⭐ (DocTags)</li> <li><strong>SmolDocling-256M</strong> (DocTags)</li> <li><strong>DeepSeek-OCR-3B</strong> (Markdown, API-only)</li> <li><strong>Granite-Vision-3.3-2B</strong> (Markdown)</li> <li><strong>Pixtral-12B</strong> (Markdown)</li> <li><strong>GOT-OCR-2.0</strong> (Markdown)</li> <li><strong>Phi-4-Multimodal</strong> (Markdown)</li> <li><strong>Qwen2.5-VL-3B</strong> (Markdown)</li> <li><strong>Nanonets-OCR2-3B</strong> (Markdown)</li> 
<li><strong>Gemma-3-12B/27B</strong> (Markdown, MLX-only)</li> <li><strong>Dolphin</strong> (Markdown)</li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engines:</strong> Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Converts entire document pages to structured formats (DocTags or Markdown)</td> </tr> <tr> <td colspan="2"><strong>Output Formats:</strong> DocTags (structured), Markdown (human-readable)</td> </tr> <tr> <td rowspan="3"><strong>Picture Description</strong> <em>Image captioning</em></td> <td rowspan="3">Vision-Language Models</td> <td> <ul> <li><strong>SmolVLM-256M</strong> ⭐</li> <li><strong>Granite-Vision-3.3-2B</strong></li> <li><strong>Pixtral-12B</strong></li> <li><strong>Qwen2.5-VL-3B</strong></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engines:</strong> Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Generates natural language descriptions of images and figures</td> </tr> <tr> <td rowspan="3"><strong>Code & Formula</strong> <em>Code/math extraction</em></td> <td rowspan="3">Vision-Language Models</td> <td> <ul> <li><strong>CodeFormulaV2</strong> ⭐</li> <li><strong>Granite-Docling-258M</strong></li> </ul> </td> </tr> <tr> <td colspan="2"><strong>Inference Engines:</strong> Transformers, MLX, AUTO_INLINE</td> </tr> <tr> <td colspan="2"><strong>Purpose:</strong> Extracts and recognizes code blocks and mathematical formulas</td> </tr> </tbody> </table>
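The Granite Vision table-structure row above mentions OTSL (Open Table Structure Language) output. As a rough illustration of how such a token stream maps onto a cell grid, here is a simplified sketch. This is not Docling's actual decoder: the real one handles more token types and attaches recognized text to each cell.

```python
# Simplified sketch of decoding an OTSL-style token stream into a cell grid.
# Token meanings (per the OTSL design):
#   fcel = cell with content, ecel = empty cell,
#   lcel = continuation of the cell to the left (column span),
#   ucel = continuation of the cell above (row span),
#   nl   = end of the current table row.

def otsl_to_grid(tokens):
    """Split a flat OTSL token list into rows at each 'nl' marker."""
    grid, row = [], []
    for tok in tokens:
        if tok == "nl":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:  # trailing row without a final 'nl'
        grid.append(row)
    return grid

# A 2x2 table whose second row is a single cell spanning both columns:
grid = otsl_to_grid(["fcel", "fcel", "nl", "fcel", "lcel", "nl"])
```

The appeal of OTSL for VLM output is that the token sequence is valid markup at every prefix, which keeps generation constrained and easy to validate.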

Inference Engines by Model Family

Object Detection Models (Layout)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| All Layout models | docling-ibm-models | CPU, CUDA, MPS, XPU |

Note: Layout models use a specialized RT-DETR-based object detection framework from docling-ibm-models.
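Downstream stages consume the layout model's labeled bounding boxes. As a hypothetical sketch of that pattern (the class and field names here are illustrative stand-ins, not Docling's actual output types), filtering detections by label and confidence might look like:

```python
from dataclasses import dataclass

# Illustrative stand-in for a layout detection; Docling's real output types
# differ, but carry the same information: a box plus an element label.
@dataclass
class Detection:
    label: str        # e.g. "TEXT", "TABLE", "PICTURE", "SECTION_HEADER"
    bbox: tuple       # (x0, y0, x1, y1) in page coordinates
    confidence: float

def tables_above_threshold(detections, min_conf=0.5):
    """Keep only TABLE detections the model is reasonably sure about."""
    return [d for d in detections
            if d.label == "TABLE" and d.confidence >= min_conf]

dets = [
    Detection("TEXT", (0, 0, 100, 20), 0.98),
    Detection("TABLE", (0, 30, 100, 90), 0.91),
    Detection("TABLE", (0, 95, 100, 120), 0.32),
]
tables = tables_above_threshold(dets)  # keeps only the 0.91 table
```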

TableFormer Models (Table Structure)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| TableFormer (fast) | docling-ibm-models | CPU, CUDA, XPU |
| TableFormer (accurate) | docling-ibm-models | CPU, CUDA, XPU |

Note: MPS is currently disabled for TableFormer due to performance issues.
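The per-family device restrictions above amount to a simple capability check. A minimal sketch of that logic (the lookup table mirrors the rows in this section; it is not a Docling API):

```python
# Supported devices per model family, mirroring the tables in this section.
SUPPORTED_DEVICES = {
    "layout": {"cpu", "cuda", "mps", "xpu"},
    "tableformer": {"cpu", "cuda", "xpu"},  # MPS disabled for performance reasons
    "picture_classifier": {"cpu", "cuda", "mps", "xpu"},
}

def pick_device(model_family, preferred):
    """Return the first preferred device the model family supports."""
    supported = SUPPORTED_DEVICES[model_family]
    for device in preferred:
        if device in supported:
            return device
    return "cpu"  # CPU is always available as a fallback

# On an Apple Silicon machine preferring MPS, TableFormer falls back to CPU:
device = pick_device("tableformer", ["mps", "cpu"])
```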

Image Classifier (Picture Classifier)

| Model | Inference Engine | Supported Devices |
|-------|------------------|-------------------|
| DocumentFigureClassifier-v2.5 | Transformers (ViT) | CPU, CUDA, MPS, XPU |

OCR Engines

| OCR Engine | Backend | Language Support | Notes |
|------------|---------|------------------|-------|
| Tesseract | CLI or tesserocr | 100+ languages | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ languages | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ languages | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ languages | Modern, good for complex layouts |
| Auto | Automatic | Varies | Automatically selects best available engine |
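The Auto engine's behaviour can be pictured as a preference-ordered fallback over the engines installed on the machine. A hypothetical sketch of that selection logic (the engine names match the table above, but the preference order and availability set are illustrative, not Docling's real detection code):

```python
def select_ocr_engine(available, preference=None):
    """Pick the first installed engine from a preference-ordered list.

    `available` is the set of engines detected on this machine; the
    default preference order below is illustrative only.
    """
    if preference is None:
        preference = ["macos_vision", "tesseract", "easyocr", "rapidocr", "suryaocr"]
    for engine in preference:
        if engine in available:
            return engine
    raise RuntimeError("No OCR engine available; install e.g. Tesseract or EasyOCR")

engine = select_ocr_engine({"tesseract", "easyocr"})  # -> "tesseract"
```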

Vision-Language Models (VLM)

VLM Convert Stage

| Preset ID | Model | Parameters | API (OpenAI-compatible) | Output Format |
|-----------|-------|------------|-------------------------|---------------|
| granite_docling | Granite-Docling-258M | 258M | Ollama | DocTags |
| smoldocling | SmolDocling-256M | 256M | | DocTags |
| deepseek_ocr | DeepSeek-OCR-3B | 3B | Ollama, LM Studio | Markdown |
| granite_vision | Granite-Vision-3.3-2B | 2B | Ollama, LM Studio | Markdown |
| pixtral | Pixtral-12B | 12B | | Markdown |
| got_ocr | GOT-OCR-2.0 | - | | Markdown |
| phi4 | Phi-4-Multimodal | - | | Markdown |
| qwen | Qwen2.5-VL-3B | 3B | | Markdown |
| nanonets_ocr2 | Nanonets-OCR2-3B | 3B | OpenAI-compatible, LM Studio | Markdown |
| gemma_12b | Gemma-3-12B | 12B | | Markdown |
| gemma_27b | Gemma-3-27B | 27B | | Markdown |
| dolphin | Dolphin | - | | Markdown |

nanonets_ocr2 includes preset API overrides for OpenAI-compatible runtimes and LM Studio, and can also be used with vLLM runtimes.

Picture Description Stage

| Preset ID | Model | Parameters | API (OpenAI-compatible) |
|-----------|-------|------------|-------------------------|
| smolvlm | SmolVLM-256M | 256M | LM Studio |
| granite_vision | Granite-Vision-3.3-2B | 2B | Ollama, LM Studio |
| pixtral | Pixtral-12B | 12B | |
| qwen | Qwen2.5-VL-3B | 3B | |

Code & Formula Stage

| Preset ID | Model | Parameters |
|-----------|-------|------------|
| codeformulav2 | CodeFormulaV2 | - |
| granite_docling | Granite-Docling-258M | 258M |

Usage Examples

Layout Detection

```python
from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use the Heron layout model (default)
layout_options = LayoutOptions(model_spec=DOCLING_LAYOUT_HERON)
```

Table Structure Recognition

```python
from docling.datamodel.pipeline_options import TableStructureOptions, TableFormerMode

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True,
)
```

Picture Classification

```python
from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions,
)

# Use the default picture classifier
classifier_options = DocumentPictureClassifierOptions.from_preset("document_figure_classifier_v2")
```

OCR

```python
from docling.datamodel.pipeline_options import TesseractOcrOptions

# Use Tesseract with English and German
ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
```

VLM Convert (Full Page)

```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with an auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Or force a specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions(),
)
```

Picture Description

```python
from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")
```

Code & Formula Extraction

```python
from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use the specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")
```

Notes

  • DocTags Format: A structured, XML-like format optimized for document understanding
  • Markdown Format: A human-readable format for general-purpose conversion
  • Model Updates: New models are added regularly; check the codebase for the latest additions
  • Engine Compatibility: Not all engines work on all platforms; AUTO_INLINE handles this automatically
  • Performance: Actual performance varies with hardware, document complexity, and model size