Docling vs OpenDataLoader Output Comparison

docs/hybrid/research/comparison-summary.md


Test Document

  • File: 01030000000045.pdf (1 page with table)

Element Count Comparison

| Category | Docling | OpenDataLoader |
|----------|---------|----------------|
| Tables | 1 | 1 |
| Text elements | 5 | 4 paragraphs |
| Images | 0 | 1 |
| Headings | (N/A - uses labels) | 1 |

Text Element Labels (Docling)

| Label | Count |
|-------|-------|
| caption | 1 |
| footnote | 1 |
| page_footer | 1 |
| page_header | 1 |
| text | 1 |

Table Structure Comparison

| Property | Docling | OpenDataLoader |
|----------|---------|----------------|
| Rows | 9 | 3 |
| Columns | 3 | 3 |
| Total cells | 26 | 9 |

Note: Docling reports 9 rows and 26 cells for this table, while OpenDataLoader reports 3 rows and 9 cells. Possible causes:

  • Different table detection algorithms
  • OpenDataLoader may have merged some rows
  • Different handling of header rows

Bounding Box Comparison (Table)

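| System | l/left | t/top | r/right | b/bottom | Origin |
|--------|--------|-------|---------|----------|--------|
| Docling | 53.22 | 439.98 | 373.94 | 234.74 | BOTTOMLEFT |
| OpenDataLoader | 54.0 | 234.44 | 372.73 | 440.21 | BOTTOMLEFT |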

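Coordinate mapping: Both systems use a BOTTOMLEFT origin, but they order the fields differently.

  • Docling: {l, t, r, b} where t=top, b=bottom
  • OpenDataLoader: [left, bottom, right, top]

The OpenDataLoader values in the table above are listed in that array order, so its second value (234.44) is the bottom edge and its fourth (440.21) is the top edge.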

So the actual coordinates match closely:

  • Left: 53.22 ≈ 54.0
  • Bottom: 234.74 ≈ 234.44
  • Right: 373.94 ≈ 372.73
  • Top: 439.98 ≈ 440.21
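
The reordering above can be sketched in a few lines of Python. This is only an illustration built on the values reported in this document: the function names and the 1.5-point tolerance are assumptions, not part of either tool's API, and the conversion is valid only when both boxes share a BOTTOMLEFT origin and the same units.

```python
from typing import Dict, List


def docling_bbox_to_odl(bbox: Dict[str, float]) -> List[float]:
    """Reorder a Docling-style {l, t, r, b} box into [left, bottom, right, top].

    Hypothetical helper: assumes the box already uses a BOTTOMLEFT origin,
    so no vertical flip is needed, only a change of field order.
    """
    return [bbox["l"], bbox["b"], bbox["r"], bbox["t"]]


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Largest per-edge difference between two boxes in the same field order."""
    return max(abs(x - y) for x, y in zip(a, b))


def boxes_close(a: List[float], b: List[float], tol: float = 1.5) -> bool:
    """True when every edge differs by at most `tol` (an assumed tolerance)."""
    return max_abs_diff(a, b) <= tol


# Table bounding boxes as reported in this comparison.
docling_table = {"l": 53.22, "t": 439.98, "r": 373.94, "b": 234.74}
odl_table = [54.0, 234.44, 372.73, 440.21]

converted = docling_bbox_to_odl(docling_table)
diff = max_abs_diff(converted, odl_table)
print(converted, round(diff, 2), boxes_close(converted, odl_table))
```

For the table in this document the largest per-edge difference is on the right edge (373.94 against 372.73, about 1.21), so the two boxes agree under the assumed tolerance.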

Schema Mapping Summary

| Docling Type | OpenDataLoader Type |
|--------------|---------------------|
| texts (label: text) | paragraph |
| texts (label: section_header) | heading |
| tables | table |
| pictures | image |
| texts (label: page_header) | paragraph (filtered as header) |
| texts (label: page_footer) | paragraph (filtered as footer) |
| texts (label: caption) | paragraph |
| texts (label: footnote) | paragraph |
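
One way to encode this mapping is a pair of lookup tables. The sketch below is hypothetical: the names `LABEL_TO_ODL_TYPE`, `COLLECTION_TO_ODL_TYPE`, `FILTERED_LABELS`, and `map_docling_item` are invented for illustration and exist in neither library. Unknown labels raise an error instead of falling back to a default, because this document does not say how they should be handled.

```python
from typing import Optional

# Docling text labels mapped to OpenDataLoader types, per the table above.
LABEL_TO_ODL_TYPE = {
    "text": "paragraph",
    "section_header": "heading",
    "page_header": "paragraph",
    "page_footer": "paragraph",
    "caption": "paragraph",
    "footnote": "paragraph",
}

# Non-text Docling collections mapped to OpenDataLoader types.
COLLECTION_TO_ODL_TYPE = {
    "tables": "table",
    "pictures": "image",
}

# Labels the table marks as filtered (page headers and footers).
FILTERED_LABELS = {"page_header", "page_footer"}


def map_docling_item(collection: str, label: Optional[str] = None) -> str:
    """Return the OpenDataLoader type for a Docling collection and label.

    Raises KeyError for any collection or label not listed in the table.
    """
    if collection == "texts":
        if label is None:
            raise KeyError("items in 'texts' need a label")
        return LABEL_TO_ODL_TYPE[label]
    return COLLECTION_TO_ODL_TYPE[collection]


print(map_docling_item("texts", "section_header"))
print(map_docling_item("tables"))
print("page_footer" in FILTERED_LABELS)
```

Keeping the filtered labels in a separate set leaves the type mapping itself one-to-one with the table, so the header and footer handling can change without touching it.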

Key Differences

  1. Type naming: Docling distinguishes text types with a label field; OpenDataLoader uses a type field.
  2. Table structure: Docling reports a finer row structure (9 rows against 3 for the same table).
  3. Coordinate format: Both use a BOTTOMLEFT origin, but the field order differs ({l, t, r, b} against [left, bottom, right, top]).
  4. Heading detection: Docling uses SectionHeaderItem with a level; OpenDataLoader uses a heading type with a level.