docs/en/architecture/overview.md
SeaTunnel is designed as a distributed multimodal data integration tool with the following core objectives:
SeaTunnel adopts a layered architecture that separates concerns and enables flexibility:
┌─────────────────────────────────────────────────────────────────┐
│ User Configuration Layer │
│ (HOCON Config / SQL / Web UI) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SeaTunnel API Layer │
│ (Source API / Sink API / Transform API / Table API) │
│ │
│ • SeaTunnelSource • CatalogTable │
│ • SeaTunnelSink • TableSchema │
│ • SeaTunnelTransform • SchemaChangeEvent │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Connector Ecosystem │
│ │
│ [Jdbc] [Kafka] [MySQL-CDC] [Elasticsearch] [Iceberg] ... │
│ (Connector Ecosystem) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Translation Layer │
│ (Adapts SeaTunnel API to Engine-Specific API) │
│ │
│ • FlinkSource/FlinkSink • SparkSource/SparkSink │
│ • Context Adapters • Serialization Adapters │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SeaTunnel │ │ Apache │ │ Apache │
│ Engine (Zeta)│ │ Flink │ │ Spark │
│ │ │ │ │ │
│ • Master │ │ • JobManager │ │ • Driver │
│ • Worker │ │ • TaskManager│ │ • Executor │
│ • Checkpoint │ │ • State │ │ • RDD/DS │
└──────────────┘ └──────────────┘ └──────────────┘
| Layer | Responsibility | Key Components |
|---|---|---|
| Configuration Layer | Job definition, parameter configuration | HOCON parser, SQL parser, config validation |
| API Layer | Unified abstraction for connectors | Source/Sink/Transform interfaces, CatalogTable |
| Connector Layer | Data source/sink implementations | Various connectors (JDBC, Kafka, CDC, etc.) |
| Translation Layer | Engine-specific adaptation | Flink/Spark adapters, context wrappers |
| Engine Layer | Job execution and resource management | Scheduling, fault tolerance, state management |
The API layer provides engine-independent abstractions:
Key Design: Separation of coordination (Enumerator) and execution (Reader) enables efficient parallel processing and fault tolerance.
Code Reference:
Key Design: Two-phase commit protocol (prepareCommit → commit) ensures exactly-once semantics.
Code Reference:
Code Reference:
Code Reference:
The native execution engine provides:
LogicalDag → PhysicalPlan → SubPlan (Pipeline) → PhysicalVertex → TaskGroup → SeaTunnelTask
Code Reference:
Enables engine portability through adapter pattern:
Code Reference:
All connectors follow a standardized structure:
connector-[name]/
├── src/main/java/.../
│ ├── [Name]Source.java # Implements SeaTunnelSource
│ ├── [Name]SourceReader.java # Implements SourceReader
│ ├── [Name]SourceSplitEnumerator.java
│ ├── [Name]SourceSplit.java
│ ├── [Name]Sink.java # Implements SeaTunnelSink
│ ├── [Name]SinkWriter.java # Implements SinkWriter
│ └── config/[Name]Config.java
└── src/main/resources/META-INF/services/
├── org.apache.seatunnel.api.table.factory.TableSourceFactory
└── org.apache.seatunnel.api.table.factory.TableSinkFactory
Discovery Mechanism: Java SPI (Service Provider Interface) for dynamic connector loading.
Data Source
│
▼
┌─────────────────────┐
│ SourceSplitEnumerator│ (Master Side)
│ • Generate Splits │
│ • Assign to Readers │
└─────────────────────┘
│ (Split Assignment)
▼
┌─────────────────────┐
│ SourceReader │ (Worker Side)
│ • Read from Split │
│ • Emit Records │
└─────────────────────┘
│
▼
SeaTunnelRow
│
▼
Transform Chain (Optional)
│
▼
SeaTunnelRow
│
▼
┌─────────────────────┐
│ SinkWriter │ (Worker Side)
│ • Buffer Records │
│ • Prepare Commit │
└─────────────────────┘
│ (CommitInfo)
▼
┌─────────────────────┐
│ SinkCommitter │ (Coordinator)
│ • Commit Changes │
└─────────────────────┘
│
▼
Data Sink
Jobs are divided into Pipelines (SubPlans):
Pipeline 1: [Source A] → [Transform 1] → [Sink A]
↓
Pipeline 2: [Source B] ───────→ [Transform 2] → [Sink B]
Each pipeline:
sequenceDiagram
participant Client
participant CoordinatorService
participant JobMaster
participant ResourceManager
Client->>CoordinatorService: Submit Job Config
CoordinatorService->>CoordinatorService: Parse Config → LogicalDag
CoordinatorService->>JobMaster: Create JobMaster
JobMaster->>JobMaster: Generate PhysicalPlan
JobMaster->>ResourceManager: Request Resources
ResourceManager->>JobMaster: Allocate Slots
JobMaster->>TaskExecutionService: Deploy Tasks
Task Initialization
Data Processing
Checkpoint Coordination
Commit Phase
Task State Transitions:
CREATED → INIT → WAITING_RESTORE → READY_START → STARTING → RUNNING
↓
FAILED ← ─────────────────────── → PREPARE_CLOSE → CLOSED
↓
CANCELED
Job State Transitions:
CREATED → SCHEDULED → RUNNING → FINISHED
↓ ↓
FAILED CANCELING → CANCELED
Checkpoint Mechanism:
Failover Strategy:
Two-Phase Commit Protocol:
Idempotency: SinkCommitter operations must be idempotent to handle retries
seatunnel/
├── seatunnel-api/ # Core API definitions
│ ├── source/ # Source API
│ ├── sink/ # Sink API
│ ├── transform/ # Transform API
│ └── table/ # Table and Schema API
│
├── seatunnel-connectors-v2/ # Connector implementations
│ ├── connector-jdbc/ # JDBC connector
│ ├── connector-kafka/ # Kafka connector
│ ├── connector-cdc-mysql/ # MySQL CDC connector
│ └── ... # connectors
│
├── seatunnel-transforms-v2/ # Transform implementations
│ ├── transform-sql/ # SQL transform
│ ├── transform-filter/ # Filter transform
│ └── ...
│
├── seatunnel-engine/ # SeaTunnel Engine (Zeta)
│ ├── seatunnel-engine-core/ # Core execution logic
│ ├── seatunnel-engine-server/ # Server components (Master/Worker)
│ └── seatunnel-engine-storage/ # Checkpoint storage
│
├── seatunnel-translation/ # Engine translation layers
│ ├── seatunnel-translation-flink/
│ └── seatunnel-translation-spark/
│
├── seatunnel-formats/ # Data format handlers
│ ├── seatunnel-format-json/
│ ├── seatunnel-format-avro/
│ └── ...
│
├── seatunnel-core/ # Job submission and CLI
└── seatunnel-e2e/ # End-to-end tests
To dive deeper into specific architectural components:
For practical guides: