OSS DeepDoc HTTP API Service

Serves the OSS DLA (Document Layout Analysis), OCR (Optical Character Recognition), and TSR (Table Structure Recognition) models through a unified HTTP API built on LitServe and ONNX Runtime.

Quick Start

```bash
# Build
docker build -f Dockerfile_deepdoc_oss -t deepdoc_oss:latest .

# Run (CPU only; no GPU required)
docker run -p 9390:9390 deepdoc_oss:latest

# Or via docker compose
docker compose -f docker/docker-compose.yml up -d
```

The service listens on port 9390 by default. Pass --port to change it:

```bash
python deepdoc/server/deepdoc_server.py --port 9000 --model-dir /path/to/models
```

Endpoints

All prediction endpoints accept JPEG images via `multipart/form-data`. The form field for file uploads is named `request`.

| Method | Path | Description |
| --- | --- | --- |
| GET | `/health` | Liveness probe. Returns `ok`. |
| GET | `/model` | Model metadata. Returns `{"model":"oss","version":"1.0"}`. |
| POST | `/predict/dla` | Document Layout Analysis. |
| POST | `/predict/tsr` | Table Structure Recognition. |
| POST | `/predict/ocr` | OCR. Use form field `operator=det` for detection or `operator=rec` for recognition. |
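
For clients that cannot shell out to curl, the multipart body can be assembled by hand. The following is a minimal sketch using only the Python standard library; `build_multipart` and `post_image` are hypothetical helper names, not part of this service, and the only facts taken from this README are the `request` file field and the JPEG content type.

```python
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, data, mime="image/jpeg"):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'.encode()
        + data
        + b'\r\n'
    )
    parts.append(f'--{boundary}--\r\n'.encode())
    return b''.join(parts), f'multipart/form-data; boundary={boundary}'


def post_image(url, image_bytes, fields=None, filename="image.jpg"):
    """POST a JPEG to a /predict/* endpoint and return the raw response body."""
    body, content_type = build_multipart(fields or {}, "request", filename, image_bytes)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Assuming the service runs on the default port, an OCR detection call would look like `post_image("http://localhost:9390/predict/ocr", jpeg_bytes, {"operator": "det"})`.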

POST /predict/dla

Analyzes a full page image and returns labelled layout regions.

Request

```bash
curl -X POST http://localhost:9390/predict/dla \
  -F "request=@page.jpg;type=image/jpeg"
```

Response

```json
{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}
```

| class_id | Label |
| --- | --- |
| 0 | title |
| 1 | text |
| 2 | reference |
| 3 | figure |
| 4 | figure caption |
| 5 | table |
| 6 | table caption |
| 8 | equation |

The OSS model uses 8 unique class IDs. IDs 7 and 9 are reserved for compatibility with the SaaS label scheme but are never produced by the OSS model.
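
To show how a client might consume this response, here is a small sketch that maps each row of `bboxes` to a labelled region. The label mapping comes from the table above; the `Region` type, the `decode_dla` name, the 0.5 score threshold, and the top-to-bottom sort are illustrative assumptions rather than anything the service prescribes.

```python
from typing import NamedTuple

# class_id -> label, as listed for /predict/dla (IDs 7 and 9 are reserved).
DLA_LABELS = {
    0: "title",
    1: "text",
    2: "reference",
    3: "figure",
    4: "figure caption",
    5: "table",
    6: "table caption",
    8: "equation",
}


class Region(NamedTuple):
    label: str
    score: float
    box: tuple  # (x0, y0, x1, y1)


def decode_dla(response, min_score=0.5):
    """Turn a parsed /predict/dla response into a sorted list of Regions."""
    regions = []
    for x0, y0, x1, y1, score, class_id in response["bboxes"]:
        if score < min_score:
            continue  # drop low-confidence detections (threshold is an assumption)
        label = DLA_LABELS.get(int(class_id), f"unknown({int(class_id)})")
        regions.append(Region(label, float(score), (x0, y0, x1, y1)))
    # Rough reading order: top edge first, then left edge.
    regions.sort(key=lambda r: (r.box[1], r.box[0]))
    return regions
```

Unknown IDs are kept with an `unknown(...)` label instead of being dropped, so a response that unexpectedly carries a reserved ID stays visible to the caller.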

POST /predict/tsr

Recognizes table structure from a cropped table image.

Request

```bash
curl -X POST http://localhost:9390/predict/tsr \
  -F "request=@table_crop.jpg;type=image/jpeg"
```

Response

```json
{
  "bboxes": [
    [x0, y0, x1, y1, score, class_id],
    ...
  ]
}
```

| class_id | Label |
| --- | --- |
| 0 | table |
| 1 | table column |
| 2 | table row |
| 3 | table column header |
| 4 | table projected row header |
| 5 | table spanning cell |
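
One common way to use row and column boxes is to intersect them into a grid of cell boxes. The sketch below does that under stated assumptions: the `cell_grid` name and the 0.5 threshold are hypothetical, spanning cells and headers are ignored, and nothing here claims to match the post-processing used elsewhere in the project.

```python
# class_id -> label, as listed for /predict/tsr.
TSR_LABELS = {
    0: "table",
    1: "table column",
    2: "table row",
    3: "table column header",
    4: "table projected row header",
    5: "table spanning cell",
}


def cell_grid(response, min_score=0.5):
    """Return grid[r][c] = (x0, y0, x1, y1) built from row and column boxes."""
    rows, cols = [], []
    for x0, y0, x1, y1, score, class_id in response["bboxes"]:
        if score < min_score:
            continue
        label = TSR_LABELS.get(int(class_id))
        if label == "table row":
            rows.append((x0, y0, x1, y1))
        elif label == "table column":
            cols.append((x0, y0, x1, y1))
    rows.sort(key=lambda b: b[1])  # top to bottom
    cols.sort(key=lambda b: b[0])  # left to right
    # A cell takes its x-range from the column and its y-range from the row.
    return [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]
```

The resulting cell boxes are in the coordinate system of the cropped table image, so they would need to be offset by the crop origin before being compared with page-level OCR boxes.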

POST /predict/ocr

The endpoint has two modes, selected by the `operator` form field.

Detection (operator=det)

Returns quadrilateral bounding boxes for detected text regions.

```bash
curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=det" \
  -F "request=@page.jpg;type=image/jpeg"
```

Response (5-level nested array):

```json
{
  "output": [
    [
      [
        [
          [[x0,y0],[x1,y1],[x2,y2],[x3,y3]],
          ...
        ]
      ]
    ]
  ]
}
```
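
Because the quadrilaterals sit several list levels deep, a client may prefer to walk the structure rather than index into it at a fixed depth. The following sketch does that; `is_quad`, `iter_quads`, and `quad_to_rect` are hypothetical helper names, and the only assumption about the payload is that each quadrilateral is a list of four `[x, y]` pairs.

```python
def is_quad(node):
    """True if node looks like [[x, y], [x, y], [x, y], [x, y]]."""
    return (
        isinstance(node, list)
        and len(node) == 4
        and all(
            isinstance(p, list)
            and len(p) == 2
            and all(isinstance(v, (int, float)) for v in p)
            for p in node
        )
    )


def iter_quads(node):
    """Yield every quadrilateral found anywhere inside a nested list."""
    if is_quad(node):
        yield node
    elif isinstance(node, list):
        for child in node:
            yield from iter_quads(child)


def quad_to_rect(quad):
    """Axis-aligned bounding rectangle (x0, y0, x1, y1) of a quadrilateral."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)
```

A typical use would be `[quad_to_rect(q) for q in iter_quads(response["output"])]` to obtain crop rectangles that can then be sent to the recognition mode.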

Recognition (operator=rec)

Recognizes text within a cropped region.

```bash
curl -X POST "http://localhost:9390/predict/ocr" \
  -F "operator=rec" \
  -F "request=@char_crop.jpg;type=image/jpeg"
```

Response (4-level nested array):

```json
{
  "output": [
    [
      [
        ["recognized text", 1.0],
        ...
      ]
    ]
  ]
}
```

The confidence value is always 1.0 because the OSS recognition model does not return per-character confidence scores.
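
A client can pull the text out of the recognition response with a short walk over the nested lists. This is a sketch only: `iter_texts` and `recognized_text` are hypothetical names, and the code assumes nothing beyond each result being a `[text, confidence]` pair.

```python
def iter_texts(node):
    """Yield (text, confidence) pairs found anywhere inside a nested list."""
    if (
        isinstance(node, list)
        and len(node) == 2
        and isinstance(node[0], str)
        and isinstance(node[1], (int, float))
    ):
        yield node[0], float(node[1])
    elif isinstance(node, list):
        for child in node:
            yield from iter_texts(child)


def recognized_text(response, sep=" "):
    """Join all recognized strings from a parsed /predict/ocr (rec) response."""
    return sep.join(text for text, _ in iter_texts(response["output"]))
```

Since the confidence is constant, it is not useful for filtering; the helper returns it only so the pair structure of the response is preserved.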

Error Responses

| Scenario | HTTP Status |
| --- | --- |
| Missing `operator` field (OCR) | 400 |
| Invalid `operator` value | 400 |
| Empty or corrupt image | 400 |
| Image exceeds 4096×4096 | 400 |
| Internal inference error | 500 |
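
On the client side, these statuses suggest two simple guards: check the image size before uploading, and separate request mistakes from server failures. The sketch below illustrates both. The function names are hypothetical, and it assumes the 4096 limit applies to each side independently, which this README does not state explicitly.

```python
MAX_SIDE = 4096  # size limit from the error table


def fits_size_limit(width, height, max_side=MAX_SIDE):
    """True if both sides are positive and within the limit.

    Assumption: the limit is per side, not a total pixel count.
    """
    return 0 < width <= max_side and 0 < height <= max_side


def classify_status(status):
    """Map an HTTP status from the service to a coarse outcome."""
    if 200 <= status < 300:
        return "ok"
    if status == 400:
        return "bad_request"  # fix the request; resending it unchanged will not help
    if status == 500:
        return "inference_error"  # failure inside the service
    return "unexpected"  # status not documented here
```

Checking dimensions locally avoids uploading a large image only to receive a 400.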

Models

All ONNX models are from the InfiniFlow/deepdoc HuggingFace repository (Apache 2.0 license):

| File | Size | Purpose |
| --- | --- | --- |
| `layout.onnx` | 75.7 MB | DLA (YOLOv10) |
| `det.onnx` | 4.7 MB | OCR text detection (PP-OCRv4) |
| `rec.onnx` | 10.8 MB | OCR text recognition (PP-OCRv4) |
| `tsr.onnx` | 12.2 MB | TSR (PaddleDetection) |
| `ocr.res` | 26 KB | OCR character dictionary |
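
Before starting the server with `--model-dir`, it can help to confirm that every file is present. The following sketch assumes the five files sit directly inside the model directory under the names listed above; that flat layout and the `missing_model_files` helper are assumptions for illustration.

```python
from pathlib import Path

# File names taken from the model table.
REQUIRED_FILES = ("layout.onnx", "det.onnx", "rec.onnx", "tsr.onnx", "ocr.res")


def missing_model_files(model_dir):
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all files were found; otherwise the returned names indicate what still needs to be downloaded.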

Architecture

```
deepdoc/server/
├── deepdoc_server.py     # LitServe entry point
├── endpoints/            # LitAPI endpoints (HTTP layer)
│   ├── dla_endpoint.py
│   ├── tsr_endpoint.py
│   └── ocr_endpoint.py
└── adapters/             # Model wrappers (inference + format conversion)
    ├── dla_adapter.py
    ├── tsr_adapter.py
    └── ocr_adapter.py
```

Endpoints → Adapters → deepdoc/vision/ (reused OSS model classes) → ONNX Runtime.
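
The split between the HTTP layer and the model wrappers can be illustrated with a dependency-free sketch. The method names `decode_request`, `predict`, and `encode_response` mirror the hooks of LitServe's `LitAPI`, but the classes below subclass nothing and are entirely hypothetical; they are not the code in `endpoints/` or `adapters/`.

```python
class FakeLayoutAdapter:
    """Stand-in for an adapter; a real one would wrap a model and ONNX Runtime."""

    def infer(self, image_bytes):
        # A real adapter would decode the JPEG and run inference here.
        return [[0.0, 0.0, 10.0, 10.0, 0.99, 0]]


class DlaEndpoint:
    """HTTP-layer logic only; inference is delegated to the adapter."""

    def __init__(self, adapter):
        self.adapter = adapter

    def decode_request(self, request):
        # `request` is assumed to be a mapping of form-field names to raw bytes.
        data = request.get("request")
        if not data:
            raise ValueError("empty image")  # would surface as HTTP 400
        return data

    def predict(self, image_bytes):
        return self.adapter.infer(image_bytes)

    def encode_response(self, bboxes):
        return {"bboxes": bboxes}


def handle(endpoint, request):
    """Run one request through decode -> predict -> encode."""
    image = endpoint.decode_request(request)
    return endpoint.encode_response(endpoint.predict(image))
```

Keeping request parsing in the endpoint and format conversion in the adapter means the adapter can be swapped for a stub like the one above when testing the HTTP layer.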