Triage Lab - Experiment Records

docs/hybrid/experiments/triage/triage-experiments.md

2.4.211.9 KB
Original Source

Triage Lab - Experiment Records

This skill manages experiment records and optimization history for triage logic.

Current Implementation

File: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/hybrid/TriageProcessor.java

Signal Priority (classifyPage method)

  1. hasTableBorder - TableBorder presence (confidence: 1.0)
  2. hasVectorTableSignal - Grid lines, border lines, line art (confidence: 0.95)
  3. hasTextTablePattern - Text patterns with consecutive validation (confidence: 0.9)
  4. hasSuspiciousPattern - Y-overlap or large gap detection (confidence: 0.85)
  5. lineToTextRatio > 0.3 - High line chunk ratio (confidence: 0.8)
  6. alignedLineGroups >= 5 - Aligned baseline groups (confidence: 0.7)
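
The priority order above can be read as a first-match cascade: the first signal that fires decides the page and supplies the confidence. The sketch below illustrates that reading in Python for brevity (the real logic lives in the Java `classifyPage` method). The `page` dictionary keys, the `Decision` type, and the `classify_page` function are hypothetical stand-ins, not the actual `TriageProcessor` API.

```python
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    signal: str
    confidence: float


# (name, confidence, predicate) in priority order, mirroring the list above.
# The dictionary keys are hypothetical stand-ins for the Java signal checks.
SIGNALS = [
    ("hasTableBorder", 1.0, lambda p: p.get("hasTableBorder", False)),
    ("hasVectorTableSignal", 0.95, lambda p: p.get("hasVectorTableSignal", False)),
    ("hasTextTablePattern", 0.9, lambda p: p.get("hasTextTablePattern", False)),
    ("hasSuspiciousPattern", 0.85, lambda p: p.get("hasSuspiciousPattern", False)),
    ("highLineRatio", 0.8, lambda p: p.get("lineToTextRatio", 0.0) > 0.3),
    ("alignedLineGroups", 0.7, lambda p: p.get("alignedLineGroups", 0) >= 5),
]


def classify_page(page: dict) -> Optional[Decision]:
    """Return the first signal that fires, or None if no signal fires."""
    for name, confidence, fired in SIGNALS:
        if fired(page):
            return Decision(name, confidence)
    return None


# A page with both a vector signal and many aligned groups is attributed
# to the higher-priority vector signal.
print(classify_page({"hasVectorTableSignal": True, "alignedLineGroups": 9}))
```

Under this reading, each false positive is attributed to exactly one signal (the first that fired), which is how the per-signal FP tables below can sum to the total FP count.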

Key Thresholds

| Parameter | Value | Location |
|-----------|-------|----------|
| LINE_RATIO_THRESHOLD | 0.3 | TriageProcessor:41 |
| ALIGNED_LINE_GROUPS_THRESHOLD | 5 | TriageProcessor:46 |
| GRID_GAP_MULTIPLIER | 3.0 | TriageProcessor:49 |
| MIN_LINE_COUNT_FOR_TABLE | 8 | TriageProcessor:57 |
| MIN_GRID_LINES | 3 | TriageProcessor:60 |
| MIN_CONSECUTIVE_PATTERNS | 2 | TriageProcessor:79 |

Experiment History

Experiment 001 (2026-01-03): FP Cause Analysis

Goal: Identify root causes of high False Positive rate

Baseline:

  • Documents: 200 (42 with tables)
  • TP: 41, TN: 48, FP: 110, FN: 1
  • Precision: 27.15%, Recall: 97.62%, F1: 42.49%
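
As a quick check, the baseline percentages follow directly from the confusion counts. The helper below is a minimal sketch using the standard definitions of precision, recall, and F1; the function name `triage_metrics` is illustrative and not part of the project.

```python
def triage_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Baseline counts from the list above.
tp, tn, fp, fn = 41, 48, 110, 1
assert tp + tn + fp + fn == 200  # all benchmark documents accounted for
assert tp + fn == 42             # documents that contain tables

p, r, f1 = triage_metrics(tp, fp, fn)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # prints "27.15% 97.62% 42.49%"
```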

FP by Signal:

| Signal | Count | % |
|--------|-------|-----|
| hasSuspiciousPattern | 65 | 59.1% |
| hasVectorTableSignal | 23 | 20.9% |
| hasTableBorder | 14 | 12.7% |
| hasTextTablePattern | 5 | 4.5% |
| alignedLineGroups | 2 | 1.8% |
| highLineRatio | 1 | 0.9% |

Root Cause: Y-overlap check in hasSuspiciousPattern is too sensitive

  • Condition previous.getTopY() < current.getBottomY() triggers on normal multi-column layouts
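
To see why this condition fires on multi-column pages, consider the toy model below. It assumes PDF-style coordinates in which Y grows upward and chunks arrive in reading order; both are assumptions about the pipeline, and the `Chunk` class is a hypothetical stand-in for the Java chunk objects. Within a column each line sits below the previous one, so the condition is false. At a column break the next chunk jumps back to the top of the page, so the condition becomes true even though no table is present.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    top_y: float
    bottom_y: float


def y_overlap_flag(previous: Chunk, current: Chunk) -> bool:
    # Same comparison as the condition quoted above.
    return previous.top_y < current.bottom_y


def count_flags(chunks: list[Chunk]) -> int:
    """Count consecutive chunk pairs that trip the condition."""
    return sum(y_overlap_flag(a, b) for a, b in zip(chunks, chunks[1:]))


# Hypothetical page: three text lines per column, Y measured from the bottom.
column = [Chunk(700, 688), Chunk(686, 674), Chunk(672, 660)]
two_columns = column + column  # reading order: left column, then right column

print(count_flags(column))       # 0: single-column text never trips it
print(count_flags(two_columns))  # 1: the jump to the second column trips it
```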

Experiments:

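| Config | Precision | Recall | F1 | FP | FN |
|--------|-----------|--------|-----|-----|-----|
| Baseline | 27.15% | 97.62% | 42.49% | 110 | 1 |
| Disable Y-overlap | 36.28% | 97.62% | 52.90% | ~69 | 1 |
| Only Reliable Signals | 50.67% | 90.48% | 64.96% | ~38 | 4 |
| Disable SuspiciousPattern | 39.22% | 95.24% | 55.56% | ~64 | 2 |
| Require 3+ patterns | 37.38% | 95.24% | 53.69% | ~67 | 2 |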
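
The FP values marked with "~" in the table above are approximate. Since the benchmark has 42 table documents, the counts implied by each row can be recovered from the reported precision and recall. The sketch below shows that arithmetic; `implied_counts` is an illustrative helper, and results can shift by one where the published percentages were rounded.

```python
TABLE_DOCS = 42  # documents with tables in the 200-document benchmark


def implied_counts(precision_pct: float, recall_pct: float,
                   positives: int = TABLE_DOCS) -> tuple[int, int, int]:
    """Recover (TP, FP, FN) from reported precision and recall percentages."""
    tp = round(recall_pct / 100 * positives)       # recall = TP / positives
    fp = round(tp / (precision_pct / 100) - tp)    # precision = TP / (TP + FP)
    fn = positives - tp
    return tp, fp, fn


rows = {
    "Baseline": (27.15, 97.62),
    "Disable Y-overlap": (36.28, 97.62),
    "Only Reliable Signals": (50.67, 90.48),
    "Disable SuspiciousPattern": (39.22, 95.24),
    "Require 3+ patterns": (37.38, 95.24),
}
for name, (precision, recall) in rows.items():
    print(name, implied_counts(precision, recall))
```

For "Disable Y-overlap" this gives FP = 72 rather than ~69, which agrees with the 72 remaining FPs that Experiment 002 starts from.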

Recommendation:

  • To maintain recall: Remove the Y-overlap check (Precision +9.13 percentage points, Recall unchanged)
  • To optimize F1: Use only reliable signals (F1 +22.47 percentage points, Recall -7.14 percentage points)

FN Documents:

  • 01030000000110: Missed by all experiments (needs investigation)
  • 01030000000122, 01030000000116, 01030000000117: Only detected by SuspiciousPattern

Applied: Y-overlap check removed (2026-01-03)


Experiment 002 (2026-01-03): Further FP Reduction

Goal: Reduce remaining 72 FPs after Y-overlap removal

Current FP by Signal (after Experiment 001):

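| Signal | Count | % |
|--------|-------|-----|
| hasSuspiciousPattern | 21 | 29.2% |
| hasTableBorder | 14 | 19.4% |
| hasVectorTableSignal | 13 | 18.1% |
| alignedLineGroups | 10 | 13.9% |
| unknown | 8 | 11.1% |
| hasTextTablePattern | 5 | 6.9% |
| highLineRatio | 1 | 1.4% |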

Experiment 2A: Gap Multiplier (hasSuspiciousPattern)

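| Gap | Precision | Recall | F1 | FP | FN |
|-----|-----------|--------|-----|-----|-----|
| 3.0 (current) | 37.86% | 92.86% | 53.79% | 64 | 3 |
| 4.0 | 37.86% | 92.86% | 53.79% | 64 | 3 |
| 5.0 | 37.86% | 92.86% | 53.79% | 64 | 3 |
| 6.0 | 37.86% | 92.86% | 53.79% | 64 | 3 |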

→ No effect (Y-overlap removal already optimized this signal)

Experiment 2B: AlignedLineGroups Threshold

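| Threshold | Precision | Recall | F1 | FP | FN |
|-----------|-----------|--------|-----|-----|-----|
| 3 (current) | 37.86% | 92.86% | 53.79% | 64 | 3 |
| 4 | 39.39% | 92.86% | 55.32% | 60 | 3 |
| 5 | 39.80% | 92.86% | 55.71% | 59 | 3 |
| 6 | 39.80% | 92.86% | 55.71% | 59 | 3 |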

Recommended: Threshold 5 (FP -5, Recall maintained)
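
One way to read this recommendation is as a constrained selection over the sweep: keep only rows that do not raise FN, pick the fewest FPs, and break ties with the smallest threshold. The sketch below encodes that rule. The tie-breaker is an assumption (the record does not state why 5 was preferred over 6), and `pick_threshold` is an illustrative helper.

```python
# (threshold, FP, FN) rows taken from the Experiment 2B table.
SWEEP = [(3, 64, 3), (4, 60, 3), (5, 59, 3), (6, 59, 3)]


def pick_threshold(sweep: list[tuple[int, int, int]], max_fn: int) -> int:
    """Pick the threshold with the fewest FPs among rows with FN <= max_fn."""
    eligible = [row for row in sweep if row[2] <= max_fn]
    # Fewest FPs first; the smallest threshold wins a tie (assumed rule).
    best = min(eligible, key=lambda row: (row[1], row[0]))
    return best[0]


print(pick_threshold(SWEEP, max_fn=3))  # prints 5
```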

Experiment 2C: Vector Signal Criteria

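| LineCount, GridLines | Precision | Recall | F1 | FP | FN |
|----------------------|-----------|--------|-----|-----|-----|
| 8, 3 (current) | 37.86% | 92.86% | 53.79% | 64 | 3 |
| 10, 4 | 38.24% | 92.86% | 54.17% | 63 | 3 |
| 12, 4 | 37.62% | 90.48% | 53.15% | 63 | 4 |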

→ Minimal effect (FP -1, higher values reduce Recall)

Recommendation:

  • Apply alignedLineGroups threshold 3 → 5
  • Expected: FP 64 → 59 (-5), Recall 92.86% (maintained), F1 +1.92%

Applied: alignedLineGroups threshold 3 → 5 (2026-01-03)

Actual Results:

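| Metric | Before (Exp 001) | After (Exp 002) | Change |
|--------|------------------|-----------------|--------|
| FP | 72 | 67 | -5 |
| FN | 1 | 1 | 0 |
| Precision | 36.28% | 37.96% | +1.68% |
| Recall | 97.62% | 97.62% | 0 |
| F1 | 52.90% | 54.67% | +1.77% |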

Next Steps:

  • Investigate hasTableBorder FPs (14 cases, external library)
  • Investigate unknown FPs (8 cases)

Experiment 003 (2026-01-03): VectorTableSignal & SuspiciousPattern Analysis

Goal: Further reduce FP 67 while maintaining high Recall

Current FP by Signal (after Experiment 002):

| Signal | Count | % |
|--------|-------|-----|
| hasVectorTableSignal | 23 | 34.3% |
| hasSuspiciousPattern | 19 | 28.4% |
| hasTableBorder | 14 | 20.9% |
| hasTextTablePattern | 5 | 7.5% |
| alignedLineGroups | 5 | 7.5% |
| highLineRatio | 1 | 1.5% |

VectorTableSignal Sub-signal Analysis (23 FPs):

| Sub-signal | Count |
|------------|-------|
| hasAlignedShortLines | 30 |
| hasTableBorderLines | 22 |
| hasGridLines | 16 |
| lineArt>=8 | 13 |
| hasRowSeparatorPattern | 12 |

hasAlignedShortLines is the primary cause of VectorSignal FPs

Experiments:

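| Config | Precision | Recall | F1 | FP | FN |
|--------|-----------|--------|-----|-----|-----|
| Current (Exp 002) | 37.96% | 97.62% | 54.67% | 67 | 1 |
| 003B: Disable VectorSignal | 40.00% | 90.48% | 55.47% | 57 | 4 |
| 003C: Grid OR BorderLines only | 40.21% | 92.86% | 56.12% | 58 | 3 |
| 003D: Disable SuspiciousPattern | 42.11% | 95.24% | 58.39% | 55 | 2 |
| 003F: Only Reliable Signals | 56.25% | 85.71% | 67.92% | 28 | 6 |
| 003G: Disable AlignedShortLines | 40.00% | 95.24% | 56.34% | 60 | 2 |
| 003I: Combined (NoAlign+NoSusp) | 44.71% | 90.48% | 59.84% | 47 | 4 |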

Analysis:

  • hasTableBorder (14 FPs): External library, cannot modify
  • hasVectorTableSignal (23 FPs): hasAlignedShortLines too aggressive
  • hasSuspiciousPattern (19 FPs): Gap detection catches non-table layouts

Recommendation:

  • Best for Recall: 003D (Disable SuspiciousPattern)
    • FP: 67 → 55 (-12), FN: 1 → 2 (+1), Recall: 95.24%
  • Best for F1: 003I (Combined)
    • FP: 67 → 47 (-20), FN: 1 → 4 (+3), F1: 59.84%

FN Documents (003D):

  • 01030000000122: Only detected by SuspiciousPattern (gap-based)
  • 01030000000110: Never detected (needs separate investigation)

Applied: 003D - Disabled hasSuspiciousPattern (2026-01-03)

Actual Results:

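| Metric | Before (Exp 002) | After (Exp 003) | Change |
|--------|------------------|-----------------|--------|
| FP | 67 | 55 | -12 |
| FN | 1 | 2 | +1 |
| Precision | 37.96% | 42.11% | +4.15% |
| Recall | 97.62% | 95.24% | -2.38% |
| F1 | 54.67% | 58.39% | +3.72% |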

Experiment 004 (2026-01-03): AlignedLineGroups Signal Analysis

Goal: Further reduce FP 55 while maintaining Recall 95.24%

Current FP by Signal (after Experiment 003):

| Signal | Count | % |
|--------|-------|-----|
| hasVectorTableSignal | 23 | 41.8% |
| hasTableBorder | 14 | 25.5% |
| alignedLineGroups | 12 | 21.8% |
| hasTextTablePattern | 5 | 9.1% |
| highLineRatio | 1 | 1.8% |

VectorTableSignal Sub-signal Analysis (23 FPs):

| Sub-signal | Count |
|------------|-------|
| hasAlignedShortLines | 16 |
| hasTableBorderLines | 10 |
| lineArt>=8 | 8 |
| hasRowSeparatorPattern | 7 |
| hasGridLines | 5 |

Experiments:

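| Config | Precision | Recall | F1 | FP | FN |
|--------|-----------|--------|-----|-----|-----|
| Current (Exp 003) | 42.11% | 95.24% | 58.39% | 55 | 2 |
| 004A: NoAlignedShortLines | 44.71% | 90.48% | 59.84% | 47 | 4 |
| 004B: Grid+BorderLines only | 45.12% | 88.10% | 59.68% | 45 | 5 |
| 004D: No alignedLineGroups | 48.19% | 95.24% | 64.00% | 43 | 2 |
| 004E: alignedLineGroups>=7 | 44.94% | 95.24% | 61.07% | 49 | 2 |
| 004G: NoAlignShort+Groups>=7 | 48.10% | 90.48% | 62.81% | 41 | 4 |
| 004I: Reliable Only | 54.41% | 88.10% | 67.27% | 31 | 5 |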

Analysis:

  • alignedLineGroups signal caused 12 FPs but detected no additional true tables
  • Disabling it removes all 12 FPs without any FN increase
  • Best option for maintaining Recall while improving Precision
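
The reasoning above amounts to measuring a signal's exclusive contribution: how many documents are flagged by that signal alone, split by ground truth. A signal whose exclusive detections are all false positives can be disabled without losing recall. The sketch below illustrates the idea on invented toy data; the record layout and the `exclusive_contribution` helper are hypothetical and are not the benchmark harness.

```python
def exclusive_contribution(docs: list[tuple[bool, set[str]]],
                           signal: str) -> tuple[int, int]:
    """Count documents flagged by `signal` alone as (true tables, non-tables)."""
    tp_only = fp_only = 0
    for has_table, fired in docs:
        if fired == {signal}:
            if has_table:
                tp_only += 1
            else:
                fp_only += 1
    return tp_only, fp_only


# Invented toy data: (document has a table, signals that fired).
docs = [
    (True, {"hasTableBorder", "alignedLineGroups"}),  # still caught without it
    (True, {"hasVectorTableSignal"}),
    (False, {"alignedLineGroups"}),                   # FP caused by it alone
    (False, {"alignedLineGroups"}),                   # FP caused by it alone
    (False, {"hasTableBorder"}),
]

# No true table depends on the signal, while two FPs come from it alone.
print(exclusive_contribution(docs, "alignedLineGroups"))  # prints (0, 2)
```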

Applied: 004D - Disabled alignedLineGroups signal (2026-01-03)

Actual Results:

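| Metric | Before (Exp 003) | After (Exp 004) | Change |
|--------|------------------|-----------------|--------|
| FP | 55 | 43 | -12 |
| FN | 2 | 2 | 0 |
| Precision | 42.11% | 48.19% | +6.08% |
| Recall | 95.24% | 95.24% | 0 |
| F1 | 58.39% | 64.00% | +5.61% |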

Next Steps:

  • Investigate hasVectorTableSignal FPs (23 remaining) - hasAlignedShortLines main cause
  • Investigate hasTableBorder FPs (14 cases, external library limitation)

Experiment 005 (2026-01-03): Large Image Signal for FN Reduction

Goal: Reduce FN by detecting pages with large images (potential table/chart images)

Background:

  • FN documents 01030000000110 and 01030000000122 carry their table or chart content as embedded images
    • 01030000000110: image covers 28.64% of the page area (graph image)
    • 01030000000122: image covers 11.73% of the page area (table image)

Implementation:

  • Added hasLargeImage signal to TriageProcessor
  • Detects ImageChunk objects and calculates max image area / page area ratio

Experiments:

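| Threshold | Precision | Recall | F1 | FP | FN |
|-----------|-----------|--------|-----|-----|-----|
| Baseline (no image) | 48.19% | 95.24% | 64.00% | 43 | 2 |
| 10% | 33.07% | 100% | 49.70% | 85 | 0 |
| 11% | 33.60% | 100% | 50.30% | 83 | 0 |
| 15% | 35.96% | 97.62% | 52.56% | 73 | 1 |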

Analysis:

  • 11% threshold achieves 100% Recall (all 42 table documents detected)
  • Trade-off: FP increases from 43 → 83 (+40), Precision drops from 48% → 34%
  • F1 decreases from 64% → 50% due to high FP increase
  • Many FPs are documents with decorative images, diagrams, photos

Experiment 005B: Adding Aspect Ratio Condition

Observation: The FN documents have wide images (width/height ratios of 1.79 and 3.68), while FP documents often have square or tall images (ratios of roughly 0.6 to 1.5).

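| Config | Precision | Recall | F1 | FP | FN |
|--------|-----------|--------|-----|-----|-----|
| Baseline (no image) | 48.19% | 95.24% | 64.00% | 43 | 2 |
| 11% only | 33.60% | 100% | 50.30% | 83 | 0 |
| 11% + ratio 1.7 | 42.86% | 100% | 60.00% | 56 | 0 |
| 11% + ratio 1.75 | 43.30% | 100% | 60.43% | 55 | 0 |
| 11% + ratio 2.0 | 46.59% | 97.62% | 63.08% | 47 | 1 |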

Final Configuration:

  • Image area >= 11% of page area
  • Image aspect ratio (width/height) >= 1.75
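
A minimal sketch of how the two conditions could combine is shown below. It assumes the check is applied to the largest image on the page (the record only states that the maximum image-to-page area ratio is computed; measuring the aspect ratio on that same image is an assumption). The `Box` type and `has_large_image` function are hypothetical and do not mirror the Java `ImageChunk` API.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    width: float
    height: float


MIN_AREA_RATIO = 0.11    # image area >= 11% of page area
MIN_ASPECT_RATIO = 1.75  # width / height >= 1.75


def has_large_image(page: Box, images: Iterable[Box]) -> bool:
    """Return True if the largest image is both large and wide enough."""
    valid = [img for img in images if img.width > 0 and img.height > 0]
    if not valid or page.width <= 0 or page.height <= 0:
        return False
    # Assumption: both conditions are evaluated on the largest image.
    largest = max(valid, key=lambda img: img.width * img.height)
    area_ratio = (largest.width * largest.height) / (page.width * page.height)
    aspect_ratio = largest.width / largest.height
    return area_ratio >= MIN_AREA_RATIO and aspect_ratio >= MIN_ASPECT_RATIO


page = Box(600, 800)
print(has_large_image(page, [Box(400, 200)]))  # True: 16.7% of page, ratio 2.0
print(has_large_image(page, [Box(300, 300)]))  # False: large but square
print(has_large_image(page, [Box(200, 100)]))  # False: wide but too small
```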

Trade-off:

  • Achieves 100% Recall (all 42 table documents detected)
  • FP increases from 43 → 55 (+12)
  • F1 decreases from 64% → 60.43% (-3.57%)

Applied: 11% + aspect ratio 1.75 (2026-01-03)


Template for New Experiments

```markdown
### Experiment XXX (YYYY-MM-DD): [Title]

**Goal**: [What are you trying to improve?]

**Changes**: [What did you modify?]

**Results**:
| Config | Precision | Recall | F1 | FP | FN |
|--------|-----------|--------|-----|-----|-----|
| Before | | | | | |
| After | | | | | |

**Conclusion**: [What did you learn? Should this be applied?]

**Next Steps**: [What to try next?]
```

How to Run Experiments

```bash
# Run triage accuracy test
./scripts/test-java.sh -Dtest=TriageProcessorIntegrationTest#testTriageAccuracyOnBenchmarkPDFs

# Debug specific document
./scripts/bench.sh --doc-id 01030000000110
```