Docling Speed Optimization Plan

docs/hybrid/docling-speed-optimization-plan.md


Progress Tracker

| Task | Status | Completed | Result |
|---|---|---|---|
| Phase 0: Baseline measurement | ✅ completed | 2026-01-03 | 2.283s/doc |
| Phase 0: FastAPI experiment | ✅ completed | 2026-01-03 | 0.685s/doc (PASS < 0.8s) |
| Phase 0: subprocess experiment | ✅ completed | 2026-01-03 | 0.661s/doc (PASS < 1.0s) |
| Phase 0: Results comparison | ✅ completed | 2026-01-03 | 3.3x-3.5x speedup |
| Task 1.1: docling_subprocess_worker.py | ⏭️ skipped | - | FastAPI only |
| Task 1.2: hybrid_server.py | ✅ completed | 2026-01-03 | opendataloader-pdf-hybrid |
| Task 2.1: DoclingSubprocessClient.java | ⏭️ skipped | - | FastAPI only |
| Task 2.2: DoclingFastServerClient.java | ✅ completed | 2026-01-03 | - |
| Task 2.3: HybridClientFactory modification | ✅ completed | 2026-01-03 | docling-fast only |
| Task 3.1: pdf_parser modules | ✅ completed | 2026-01-03 | docling-fast only |
| Task 3.2: engine_registry.py | ✅ completed | 2026-01-03 | - |
| Task 3.3: run.py CLI options | ✅ completed | 2026-01-03 | - |
| Task 4.1: Full benchmark | ✅ completed | 2026-01-03 | See experiments/speed/ |
| Task 4.2: Results documentation | ✅ completed | 2026-01-03 | speed-experiment-2026-01-03.md |
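
The speedup range reported for the results comparison follows from the per-document averages in the tracker. A minimal sketch of that arithmetic, where the `speedup` helper is a hypothetical name used only for illustration:

```python
# Per-document averages (seconds) taken from the Progress Tracker.
BASELINE_S_PER_DOC = 2.283
RESULTS = {"fastapi": 0.685, "subprocess": 0.661}


def speedup(baseline: float, candidate: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline / candidate


for name, seconds in RESULTS.items():
    print(f"{name}: {speedup(BASELINE_S_PER_DOC, seconds):.1f}x")
# prints "fastapi: 3.3x" and "subprocess: 3.5x"
```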

Status Legend:

  • not_started - Not yet begun
  • 🔄 in_progress - Currently working
  • ✅ completed - Done and verified
  • ⏭️ skipped - Excluded from plan
  • ⏸️ blocked - Waiting on dependency
  • failed - Did not meet criteria
  • 🚫 discarded - Plan abandoned

1. Background

Current Problem

  • DoclingClient (docling-serve HTTP API): ~2 seconds per page
  • docling SDK direct call: ~0.5 seconds per document (user-reported)
  • HTTP overhead negates the speed benefits of hybrid mode

Goal

Implement alternative ways to call the docling SDK efficiently, then benchmark them against the current docling-serve baseline.


2. Experiment Phase (Phase 0)

Purpose

Validate the speed improvement hypothesis before full implementation

Experiment Targets

| Approach | Description |
|---|---|
| baseline | Current docling-serve (reference) |
| fastapi | Optimized FastAPI server |
| subprocess | Direct Python subprocess call |

Success Criteria

| Approach | Threshold | Condition |
|---|---|---|
| fastapi | < 0.8 sec/doc (average) | Based on 200 documents |
| subprocess | < 1.0 sec/doc (average) | Based on 200 documents |

Failure Conditions

  • If fastapi approach exceeds 0.8 sec/doc: Discard entire plan
  • If only subprocess fails: Exclude that approach only
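
The success and failure rules can be expressed as a small decision function. This is a sketch only: `evaluate_phase0` and the shape of its return value are assumptions made for illustration, not part of the experiment scripts.

```python
from typing import Dict

# Thresholds from the Success Criteria table (seconds per document).
THRESHOLDS = {"fastapi": 0.8, "subprocess": 1.0}


def evaluate_phase0(averages: Dict[str, float]) -> Dict[str, object]:
    """Apply the Phase 0 pass/fail rules to measured per-document averages."""
    passed = {name: averages[name] < limit for name, limit in THRESHOLDS.items()}
    if not passed["fastapi"]:
        # FastAPI missing its threshold discards the entire plan.
        return {"decision": "discard", "approaches": []}
    # Otherwise keep every approach that met its own threshold.
    kept = [name for name, ok in passed.items() if ok]
    return {"decision": "proceed", "approaches": kept}
```

With the measured averages from the tracker (0.685 and 0.661 sec/doc), this returns a `proceed` decision that keeps both approaches.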

Experiment Environment

  • Benchmark PDFs: tests/benchmark/pdfs/ (200 files)
  • Settings: do_ocr=true, do_table_structure=true
  • Measurement: total_time / document_count
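
The measurement rule (total time divided by document count) might be implemented along these lines. The function name and the `convert` callable are placeholders for whichever approach is under test; the actual benchmark scripts may be structured differently.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def average_seconds_per_doc(pdfs: Iterable[Path], convert: Callable[[Path], object]) -> float:
    """Return total wall-clock time divided by the number of documents."""
    paths = list(pdfs)
    if not paths:
        raise ValueError("no documents to benchmark")
    start = time.perf_counter()
    for path in paths:
        convert(path)  # placeholder for the approach being measured
    total_time = time.perf_counter() - start
    return total_time / len(paths)
```

A call such as `average_seconds_per_doc(sorted(Path("tests/benchmark/pdfs").glob("*.pdf")), convert)` would then cover the benchmark set.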

Experiment Scripts

scripts/experiments/
├── docling_baseline_bench.py     # docling-serve speed measurement
├── docling_fastapi_bench.py      # FastAPI server + client test
├── docling_subprocess_bench.py   # subprocess approach test
└── docling_speed_report.py       # Results comparison report

Experiment Execution

```bash
# 1. baseline (requires docling-serve running)
python scripts/experiments/docling_baseline_bench.py

# 2. fastapi (server auto-starts)
python scripts/experiments/docling_fastapi_bench.py

# 3. subprocess
python scripts/experiments/docling_subprocess_bench.py

# 4. compare results
python scripts/experiments/docling_speed_report.py
```

Results Recording

docs/hybrid/experiments/
└── speed-experiment-YYYY-MM-DD.md

3. Implementation Tasks (After Phase 0 Success)

Task 1: Python Scripts

Task 1.1: docling_subprocess_worker.py

| Item | Details |
|---|---|
| File | scripts/docling_subprocess_worker.py |
| Prerequisites | docling package installed |
| Description | stdin JSON → stdout JSON conversion |
| Success Criteria | Single PDF conversion succeeds, JSON output parseable |
| Test | echo '{"pdf_path":"/path/to.pdf"}' \| python scripts/docling_subprocess_worker.py |
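
Since this task was skipped, no worker exists to show; the following is a rough sketch of what the stdin JSON → stdout JSON contract could look like. The response fields (`status`, `document`, `message`) and the `run_worker` name are assumptions. The docling calls (`DocumentConverter().convert(...)` and `export_to_dict()`) are the SDK's documented conversion path, imported lazily so the I/O logic can be exercised without docling installed.

```python
import json
from typing import Callable, TextIO


def convert_with_docling(pdf_path: str) -> dict:
    """Convert one PDF with the docling SDK (imported lazily)."""
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    return result.document.export_to_dict()


def run_worker(stdin: TextIO, stdout: TextIO, convert: Callable[[str], dict]) -> int:
    """Read one JSON request from stdin and write one JSON response to stdout."""
    try:
        request = json.loads(stdin.read())
        response = {"status": "ok", "document": convert(request["pdf_path"])}
        code = 0
    except Exception as exc:  # report failures as JSON so the caller can parse them
        response = {"status": "error", "message": str(exc)}
        code = 1
    json.dump(response, stdout)
    stdout.write("\n")
    return code
```

A script entry point would call `run_worker(sys.stdin, sys.stdout, convert_with_docling)` and exit with the returned code.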

Task 1.2: hybrid_server.py

| Item | Details |
|---|---|
| File | python/opendataloader-pdf/src/opendataloader_pdf/hybrid_server.py |
| Prerequisites | pip install opendataloader-pdf[hybrid] |
| Description | POST /convert endpoint, DocumentConverter singleton |
| Success Criteria | curl PDF upload returns JSON response |
| Test | opendataloader-pdf-hybrid & then curl -F "[email protected]" http://localhost:5002/v1/convert/file |
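
The "DocumentConverter singleton" in the description presumably exists so that converter construction is paid once per server process rather than once per request. One way to sketch that pattern, independent of the web framework, is shown below; `LazySingleton` and `build_converter` are hypothetical names, and the actual hybrid_server.py may do this differently.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Build an expensive object once and hand the same instance to every caller."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._instance: Optional[T] = None
        self._lock = threading.Lock()

    def get(self) -> T:
        # Double-checked locking keeps concurrent requests from building twice.
        if self._instance is None:
            with self._lock:
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


def build_converter():
    """Assumed factory: constructs docling's DocumentConverter on first use."""
    from docling.document_converter import DocumentConverter

    return DocumentConverter()


# Nothing is imported or constructed until the first .get() call.
converter = LazySingleton(build_converter)
```

A request handler would call `converter.get()` and reuse the returned instance for every upload.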

Task 2: Java Client Implementation

Task 2.1: DoclingSubprocessClient.java

| Item | Details |
|---|---|
| File | java/.../hybrid/DoclingSubprocessClient.java |
| Prerequisites | Task 1.1 complete |
| Description | ProcessBuilder executes Python, stdin/stdout JSON |
| Success Criteria | Implements HybridClient interface, single PDF conversion succeeds |
| Test | DoclingSubprocessClientTest.java unit tests pass |

Task 2.2: DoclingFastServerClient.java

| Item | Details |
|---|---|
| File | java/.../hybrid/DoclingFastServerClient.java |
| Prerequisites | Task 1.2 complete |
| Description | OkHttp calls FastAPI server |
| Success Criteria | Implements HybridClient interface, single PDF conversion succeeds |
| Test | DoclingFastServerClientTest.java unit tests pass |

Task 2.3: HybridClientFactory Modification

| Item | Details |
|---|---|
| File | java/.../hybrid/HybridClientFactory.java |
| Prerequisites | Task 2.1, 2.2 complete |
| Description | Register docling-subprocess, docling-fast backends |
| Success Criteria | HybridClientFactory.getOrCreate("docling-fast", config) works |
| Test | Extend HybridClientFactoryTest.java |

Task 3: Benchmark Integration

Task 3.1: Add pdf_parser Modules

| Item | Details |
|---|---|
| Files | tests/benchmark/src/pdf_parser_opendataloader_hybrid_subprocess.py, tests/benchmark/src/pdf_parser_opendataloader_hybrid_fast.py |
| Prerequisites | Task 2.3 complete, JAR built |
| Success Criteria | Benchmark runs with --hybrid docling-subprocess option |

Task 3.2: Modify engine_registry.py

| Item | Details |
|---|---|
| File | tests/benchmark/src/engine_registry.py |
| Description | Register new engines |
| Success Criteria | New engines queryable from ENGINE_DISPATCH |
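
The plan names `ENGINE_DISPATCH` but does not show its shape. Assuming it is a mapping from engine name to a parser callable, registering a new engine could look like the sketch below. The decorator, the engine name `opendataloader-hybrid-fast`, and the parser signature are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry shape: engine name -> function that parses one PDF.
ENGINE_DISPATCH: Dict[str, Callable[[str], dict]] = {}


def register_engine(name: str) -> Callable:
    """Decorator that adds a parser function to ENGINE_DISPATCH under `name`."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        if name in ENGINE_DISPATCH:
            raise ValueError(f"engine already registered: {name}")
        ENGINE_DISPATCH[name] = func
        return func
    return decorator


@register_engine("opendataloader-hybrid-fast")
def parse_hybrid_fast(pdf_path: str) -> dict:
    # Placeholder body; a real parser module would run the JAR with the docling-fast backend.
    return {"engine": "opendataloader-hybrid-fast", "pdf": pdf_path}
```

Rejecting duplicate names at registration time makes a mistyped or copy-pasted engine name fail early instead of silently replacing an existing parser.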

Task 3.3: Add run.py CLI Options

| Item | Details |
|---|---|
| File | tests/benchmark/run.py |
| Description | Extend --hybrid choices |
| Success Criteria | ./scripts/bench.sh --hybrid docling-fast runs |
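
Extending the `--hybrid` choices is a small argparse change. The sketch below assumes run.py uses argparse and that `docling` is the pre-existing choice; neither is confirmed by this plan, and `build_parser` is a hypothetical name.

```python
import argparse

# Backend names from this plan; "docling" as the pre-existing choice is an assumption.
HYBRID_CHOICES = ["docling", "docling-fast", "docling-subprocess"]


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser whose --hybrid flag accepts only known backends."""
    parser = argparse.ArgumentParser(description="Run the PDF benchmark")
    parser.add_argument(
        "--hybrid",
        choices=HYBRID_CHOICES,
        default=None,
        help="hybrid backend to use (omit to run without hybrid mode)",
    )
    return parser
```

Using `choices` means an unknown backend name is rejected by the parser with a usage error before any benchmark work starts.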

Task 4: Final Validation

Task 4.1: Full Benchmark Execution

| Item | Details |
|---|---|
| Prerequisites | Task 3 complete |
| Execution | Benchmark 200 documents with all 3 approaches |
| Success Criteria | elapsed_per_doc comparison shows meaningful improvement |

Task 4.2: Results Documentation

| Item | Details |
|---|---|
| File | docs/hybrid/docling-speed-optimization-results.md |
| Content | Speed comparison table, recommended approach, usage guide |

4. Task Workflow

Phase 0: Experiment
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                    ┌─────────────────┐
                    │ baseline measure │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              ▼                              ▼
    ┌─────────────────┐           ┌─────────────────┐
    │ fastapi test    │           │ subprocess test │
    └────────┬────────┘           └────────┬────────┘
              │                              │
              └──────────────┬──────────────┘
                             ▼
                    ┌─────────────────┐
                    │ compare results │
                    │ < 0.8 sec/doc?  │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              ▼                              ▼
         [SUCCESS]                       [FAILURE]
       Proceed to                      Discard plan
        Phase 1

Phases 1-4: Implementation (parallelizable)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Task 1.1 ─────────────────► Task 2.1 ─┐
    (subprocess worker)        (Java client) │
                                            │
    Task 1.2 ─────────────────► Task 2.2 ─┼─► Task 2.3 ─► Task 3 ─► Task 4
    (fastapi server)           (Java client) │   (Factory)   (Bench)  (Validate)
                                            │
              ◄──── parallelizable ────►    │

Parallelizable Tasks

| Group | Tasks | Notes |
|---|---|---|
| Phase 0 | fastapi test, subprocess test | After baseline measurement |
| Phase 1 | Task 1.1, Task 1.2 | Independent |
| Phase 2 | Task 2.1, Task 2.2 | Depend on Task 1.1, 1.2 respectively |
| Phase 3 | Task 3.1, 3.2, 3.3 | After Task 2.3 complete |

Dependencies

Task 1.1 → Task 2.1 ─┐
                      ├─► Task 2.3 → Task 3.* → Task 4.*
Task 1.2 → Task 2.2 ─┘
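
The dependency graph can also be checked mechanically. As an illustration (not part of the plan's tooling), Python's standard `graphlib` module can derive which tasks are ready to run in parallel at each step; `parallel_batches` is a hypothetical helper name, and Task 3 and Task 4 are collapsed into single nodes for brevity.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (from the Dependencies diagram).
DEPENDENCIES = {
    "Task 2.1": {"Task 1.1"},
    "Task 2.2": {"Task 1.2"},
    "Task 2.3": {"Task 2.1", "Task 2.2"},
    "Task 3": {"Task 2.3"},
    "Task 4": {"Task 3"},
}


def parallel_batches(graph):
    """Yield groups of tasks whose prerequisites are all complete."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        yield ready
        sorter.done(*ready)


for batch in parallel_batches(DEPENDENCIES):
    print(batch)
```

The first two batches contain two tasks each (1.1 with 1.2, then 2.1 with 2.2), which matches the parallelizable groups listed for Phase 1 and Phase 2.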

5. File List

New Files

| File | Phase | Description |
|---|---|---|
| scripts/experiments/docling_baseline_bench.py | 0 | Baseline measurement |
| scripts/experiments/docling_fastapi_bench.py | 0 | FastAPI experiment |
| scripts/experiments/docling_subprocess_bench.py | 0 | Subprocess experiment |
| scripts/experiments/docling_speed_report.py | 0 | Results report |
| scripts/docling_subprocess_worker.py | 1 | Subprocess worker (skipped) |
| python/.../hybrid_server.py | 1 | FastAPI server (opendataloader-pdf-hybrid) |
| java/.../hybrid/DoclingSubprocessClient.java | 2 | Java client |
| java/.../hybrid/DoclingFastServerClient.java | 2 | Java client |
| tests/.../pdf_parser_opendataloader_hybrid_subprocess.py | 3 | Benchmark parser |
| tests/.../pdf_parser_opendataloader_hybrid_fast.py | 3 | Benchmark parser |

Modified Files

| File | Phase | Changes |
|---|---|---|
| java/.../hybrid/HybridClientFactory.java | 2 | Register new backends |
| tests/benchmark/src/engine_registry.py | 3 | Register engines |
| tests/benchmark/run.py | 3 | CLI options |

6. Risks and Mitigations

| Risk | Probability | Mitigation |
|---|---|---|
| FastAPI speed below threshold | Medium | Discard plan, explore other approaches |
| subprocess overhead | Medium | Consider process pooling |
| docling SDK version compatibility | Low | Pin version, test |
| Memory exhaustion | Low | Adjust batch size |

7. Checklist

Phase 0 Completion Criteria

  • Baseline speed measurement complete
  • FastAPI experiment: < 0.8 sec/doc
  • subprocess experiment: < 1.0 sec/doc
  • Experiment results documented

Overall Completion Criteria

  • All Tasks complete
  • Benchmark runs successfully with all 3 approaches
  • Speed improvement confirmed (vs baseline)
  • Results documented