
Hybrid PDF Processing System - Design Document

docs/hybrid/hybrid-mode-design.md


Overview

The hybrid PDF processing system combines Java heuristics with external AI backends. A per-page triage step routes each page: simple pages go to the fast Java path, while pages with complex tables or OCR needs go to the AI backend.

Key Decisions

| Item | Decision |
| --- | --- |
| CLI Option | `--hybrid <off\|docling\|hancom\|...>` |
| Default | off (Java-only, no external dependency) |
| First Backend | docling (docling-serve REST API) |
| Automation | Semi-automatic (benchmarking and analysis run automatically; code changes require approval) |
| Triage Strategy | Conservative (minimize false negatives (FN), accept false positives (FP), route uncertain pages to the backend) |
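
The conservative triage strategy can be illustrated with a minimal Java sketch. The class `ConservativeTriage`, the signals `hasTableLikeLines` and `textCoverage`, and the threshold value are all hypothetical and are not taken from the project. The sketch only shows the decision rule: a page stays on the Java path when every signal says it is simple, and anything uncertain goes to the backend.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ConservativeTriage {
    public enum Route { JAVA, BACKEND }

    /** Hypothetical per-page signals; the real triage inputs are not specified here. */
    public record PageSignals(boolean hasTableLikeLines, double textCoverage) {}

    // Assumed threshold: a page with little extractable text may need OCR.
    static final double MIN_TEXT_COVERAGE = 0.5;

    /** Routes a page to JAVA only when all signals are clearly "simple". */
    public static Route triage(PageSignals signals) {
        if (signals == null) return Route.BACKEND;             // no information: uncertain
        if (signals.hasTableLikeLines()) return Route.BACKEND; // possible table
        double coverage = signals.textCoverage();
        if (Double.isNaN(coverage) || coverage < MIN_TEXT_COVERAGE) {
            return Route.BACKEND;                              // possible scanned page
        }
        return Route.JAVA;
    }

    /** Batch triage; the TreeMap keeps results in ascending page order. */
    public static Map<Integer, Route> triageAllPages(Map<Integer, PageSignals> pages) {
        Map<Integer, Route> routes = new TreeMap<>();
        for (Map.Entry<Integer, PageSignals> e : pages.entrySet()) {
            routes.put(e.getKey(), triage(e.getValue()));
        }
        return routes;
    }

    public static void main(String[] args) {
        Map<Integer, PageSignals> pages = new HashMap<>();
        pages.put(1, new PageSignals(false, 0.9)); // plain text page
        pages.put(2, new PageSignals(true, 0.9));  // table-like lines detected
        pages.put(3, null);                        // signals unavailable
        System.out.println(triageAllPages(pages)); // {1=JAVA, 2=BACKEND, 3=BACKEND}
    }
}
```

Defaulting to `BACKEND` on missing or ambiguous signals accepts some unnecessary backend calls (false positives) in exchange for fewer missed tables (false negatives).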

CLI Usage

```bash
# Default: Java-only processing
opendataloader-pdf input.pdf
opendataloader-pdf --hybrid off input.pdf

# Use docling backend
opendataloader-pdf --hybrid docling input.pdf

# With custom backend URL
opendataloader-pdf --hybrid docling --hybrid-url http://localhost:5001 input.pdf

# Future backends
opendataloader-pdf --hybrid hancom input.pdf
```

Hybrid Options

| Option | Description |
| --- | --- |
| `--hybrid <name>` | Hybrid backend: off (default), docling, hancom, etc. |
| `--hybrid-url <url>` | Backend server URL (overrides default) |
| `--hybrid-timeout <ms>` | Request timeout in milliseconds (default: 0, no timeout) |
| `--hybrid-fallback` | Fall back to Java on backend error (default: true) |
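
One detail of the timeout option is worth sketching: a value of 0 means "no timeout", but the JDK's `HttpRequest.Builder.timeout` rejects non-positive durations, so a client built on `java.net.http` would have to skip the call rather than pass zero. The sketch below assumes such a client. `HybridRequestSketch` and `HybridOptions` are hypothetical names, and the request targets only the base URL from the CLI example, since no backend endpoint path is specified here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class HybridRequestSketch {
    /** Hypothetical holder for the four hybrid options. */
    public record HybridOptions(String backend, String url, long timeoutMs, boolean fallback) {}

    /** Builds a request, applying a timeout only when timeoutMs is positive. */
    public static HttpRequest buildRequest(HybridOptions opts) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(opts.url())).GET();
        if (opts.timeoutMs() > 0) {
            // timeout() throws IllegalArgumentException for zero or negative durations,
            // so the "0 = no timeout" default must leave it unset.
            builder.timeout(Duration.ofMillis(opts.timeoutMs()));
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HybridOptions opts = new HybridOptions("docling", "http://localhost:5001", 0L, true);
        System.out.println(buildRequest(opts).timeout().isPresent()); // false
    }
}
```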

Supported Backends

| Backend | Status | Description |
| --- | --- | --- |
| off | ✅ Default | Java-only, no external calls |
| docling-fast | ✅ Available | docling-serve (local) |
| hancom | 📋 Future (Priority) | Hancom Document AI |
| azure | 📋 Future | Azure Document Intelligence |
| google | 📋 Future | Google Document AI |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        PDF Input                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   ContentFilterProcessor                         │
│                   (existing: text filtering)                     │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                  TriageProcessor.triageAllPages()                │
│   - Batch triage all pages                                       │
│   - Output: Map<PageNumber, TriageResult>                        │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              │                               │
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│      JAVA Path          │     │     BACKEND Path        │
│  (parallel processing)  │     │  (single batch API call)│
│                         │     │                         │
│  ExecutorService        │     │  BackendClient          │
│  - TableBorderProcessor │     │  - Send all pages once  │
│  - TextLineProcessor    │     │  - Receive all results  │
│  - ParagraphProcessor   │     │  SchemaTransformer      │
└─────────────────────────┘     └─────────────────────────┘
              │                               │
              │         CONCURRENT            │
              └───────────────┬───────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Result Merger                                 │
│                  (preserve page order)                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│              Post-processing & Output Generation                 │
└─────────────────────────────────────────────────────────────────┘
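
The concurrent section of the diagram can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `HybridPipelineSketch`, the `String` page results, the pool size, and the two function parameters standing in for the Java processors and the `BackendClient` call are all hypothetical. It shows the Java path running one task per page, the backend path issuing a single batch call, and a merge keyed by page number so that page order is preserved.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class HybridPipelineSketch {
    /** Runs both paths concurrently and merges the results in page order. */
    public static Map<Integer, String> process(
            List<Integer> javaPages,
            List<Integer> backendPages,
            Function<Integer, String> javaProcessor,
            Function<List<Integer>, Map<Integer, String>> backendBatchCall) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // assumed pool size
        try {
            // Backend path: one batch call covering every routed page.
            CompletableFuture<Map<Integer, String>> backendFuture = backendPages.isEmpty()
                    ? CompletableFuture.completedFuture(Map.of())
                    : CompletableFuture.supplyAsync(() -> backendBatchCall.apply(backendPages), pool);

            // Java path: one task per page, running while the backend call is in flight.
            Map<Integer, CompletableFuture<String>> javaFutures = new HashMap<>();
            for (Integer page : javaPages) {
                javaFutures.put(page, CompletableFuture.supplyAsync(() -> javaProcessor.apply(page), pool));
            }

            // Merge: a TreeMap iterates in ascending page number.
            Map<Integer, String> merged = new TreeMap<>(backendFuture.join());
            javaFutures.forEach((page, future) -> merged.put(page, future.join()));
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> result = process(
                List.of(1, 3),
                List.of(2),
                page -> "java:" + page,
                pages -> {
                    Map<Integer, String> out = new HashMap<>();
                    for (Integer p : pages) out.put(p, "backend:" + p);
                    return out;
                });
        System.out.println(result); // {1=java:1, 2=backend:2, 3=java:3}
    }
}
```

Keying the merge on page number means neither path needs to know about the other's completion order.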

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| Backend unavailable | `--hybrid-fallback` (default: true) |
| Triage FN (missed tables) | Conservative threshold, benchmark monitoring |
| Schema mismatch | Step-by-step validation, type checking |
| Slow processing | Parallel execution, batch API calls |
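
The "Backend unavailable" mitigation can be sketched as a small wrapper. The names `FallbackSketch` and `withFallback` are hypothetical, and treating any `RuntimeException` as a backend error is an assumption. The wrapper tries the backend first and reruns the work on the Java path only when fallback is enabled.

```java
import java.util.function.Supplier;

public class FallbackSketch {
    /**
     * Returns the backend result, or the Java-path result if the backend
     * call fails and fallback is enabled. With fallback disabled, the
     * backend error propagates to the caller.
     */
    public static <T> T withFallback(Supplier<T> backendCall, Supplier<T> javaPath, boolean fallbackEnabled) {
        try {
            return backendCall.get();
        } catch (RuntimeException e) {
            if (!fallbackEnabled) {
                throw e;
            }
            return javaPath.get();
        }
    }

    public static void main(String[] args) {
        Supplier<String> failingBackend = () -> {
            throw new IllegalStateException("backend unavailable");
        };
        Supplier<String> javaPath = () -> "java result";
        System.out.println(withFallback(failingBackend, javaPath, true)); // java result
    }
}
```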