This document explains the core design principles, philosophies, and trade-offs that shaped SeaTunnel's architecture. Understanding these principles helps contributors make consistent design decisions and users understand the system's strengths and limitations.
Principle: Decouple connector logic from execution engines.
Motivation:
Implementation:
Trade-offs:
Example: Connectors implement only the SeaTunnel API abstractions (Source/Sink/Transform). A translation layer adapts them to each execution engine, so connector logic is insulated from engine API changes.
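The decoupling described above can be sketched in a few lines. This is a minimal illustration, not SeaTunnel's actual API: `SimpleSource`, `PullEngineReader`, `BatchEngineInput`, and `Translation` are hypothetical names standing in for the connector abstraction, two engine-native interfaces, and the translation layer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for an engine-neutral source abstraction.
interface SimpleSource<T> {
    Iterator<T> open();
}

// Connector code: depends only on SimpleSource, never on an engine API.
class InMemorySource implements SimpleSource<String> {
    private final List<String> rows;

    InMemorySource(List<String> rows) {
        this.rows = rows;
    }

    @Override
    public Iterator<String> open() {
        return rows.iterator();
    }
}

// Hypothetical engine API #1: pulls one record at a time, null when exhausted.
interface PullEngineReader<T> {
    T next();
}

// Hypothetical engine API #2: asks for the whole input as one batch.
interface BatchEngineInput<T> {
    List<T> readAll();
}

// Translation layer: one adapter per engine, written once for all connectors.
final class Translation {
    static <T> PullEngineReader<T> toPullEngine(SimpleSource<T> source) {
        Iterator<T> it = source.open();
        return () -> it.hasNext() ? it.next() : null;
    }

    static <T> BatchEngineInput<T> toBatchEngine(SimpleSource<T> source) {
        return () -> {
            List<T> out = new ArrayList<>();
            source.open().forEachRemaining(out::add);
            return out;
        };
    }
}
```

If an engine changes its native interface, only the matching adapter in `Translation` needs to change; `InMemorySource` stays untouched.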
Principle: Separate control logic (coordination) from data processing (execution).
Motivation:
Implementation Principle:
Coordination Layer (Master-side):
Execution Layer (Worker-side):
Communication Mechanism:
Trade-offs:
Example:
The key reason for this design: Fault tolerance requires distinguishing between "control state" (assigned/pending splits) and "execution progress" (offset/position per split) to enable precise recovery and fast reassignment after failures.
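The distinction between control state and execution progress can be made concrete with a small sketch. The class names `CoordinatorState` and `ReaderProgress` are hypothetical and chosen for illustration; the real engine tracks considerably more information.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Control state, owned by the coordinator: which splits are pending or assigned.
class CoordinatorState {
    final Deque<String> pendingSplits = new ArrayDeque<>();
    final Map<Integer, List<String>> assignedSplits = new HashMap<>(); // readerId -> splitIds

    // Hands the next pending split to a reader, or returns null if none is left.
    String assignNext(int readerId) {
        String split = pendingSplits.poll();
        if (split != null) {
            assignedSplits.computeIfAbsent(readerId, k -> new ArrayList<>()).add(split);
        }
        return split;
    }

    // On reader failure, its splits return to the pending queue for reassignment.
    void onReaderFailed(int readerId) {
        List<String> lost = assignedSplits.remove(readerId);
        if (lost != null) {
            pendingSplits.addAll(lost);
        }
    }
}

// Execution progress, owned by the workers: the position reached within each split.
class ReaderProgress {
    final Map<String, Long> offsetBySplit = new HashMap<>();

    void advance(String splitId, long offset) {
        offsetBySplit.put(splitId, offset);
    }

    // A split that was never checkpointed restarts from offset 0.
    long resumeOffset(String splitId) {
        return offsetBySplit.getOrDefault(splitId, 0L);
    }
}
```

After a failure, the coordinator decides who reads a split (control state), while the checkpointed offset decides where reading resumes (execution progress).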
Principle: Divide data sources into independently processable splits.
Motivation:
Implementation:
Trade-offs:
Example:
```java
// JDBC Source: Split by partition or chunk
class JdbcSourceSplit implements SourceSplit {
    private final String splitId;
    private final String query; // SELECT * FROM table WHERE id >= ? AND id < ?
    private final long startOffset;
    private final long endOffset;
    // Constructor and splitId() accessor omitted for brevity
}

// File Source: Split by file or byte range
class FileSplit implements SourceSplit {
    private final String filePath;
    private final long startOffset;
    private final long length;
    // Constructor and splitId() accessor omitted for brevity
}
```
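As a complement to the split classes above, the following sketch shows one way an id range might be cut into chunked splits. `RangeSplit` and `RangeSplitter` are hypothetical names, and the half-open range convention `[start, end)` mirrors the `id >= ? AND id < ?` query in the JDBC example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class: a half-open id range [start, end).
final class RangeSplit {
    final String splitId;
    final long start;
    final long end;

    RangeSplit(String splitId, long start, long end) {
        this.splitId = splitId;
        this.start = start;
        this.end = end;
    }
}

final class RangeSplitter {
    // Cuts [min, maxExclusive) into chunks of at most chunkSize ids each.
    // Overflow near Long.MAX_VALUE is ignored in this sketch.
    static List<RangeSplit> split(long min, long maxExclusive, long chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<RangeSplit> splits = new ArrayList<>();
        for (long start = min; start < maxExclusive; start += chunkSize) {
            long end = Math.min(start + chunkSize, maxExclusive);
            splits.add(new RangeSplit("split-" + splits.size(), start, end));
        }
        return splits;
    }
}
```

Because each split covers a disjoint range, readers can process them in parallel and a failed split can be retried on its own.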
Principle: Guarantee exactly-once end-to-end data delivery.
Motivation:
Implementation Principle:
Two-phase commit protocol separates data writing into two independent phases:
Prepare Phase:
Commit Phase:
Abort Handling:
Trade-offs:
Example: A typical exactly-once implementation follows this pattern: the writer first generates commit info (committable credentials), and after the checkpoint succeeds, the coordinator performs the final commit. This delays side effects (changes visible to external systems) until after checkpoint success, which avoids duplicate visible writes during failure recovery.
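The prepare/commit/abort pattern can be sketched as follows. All names here (`StagingStore`, `CommitInfo`, `TwoPhaseWriter`, `TwoPhaseCommitter`) are hypothetical; an in-memory list stands in for the external system, and a real committer would also need `commit` to be idempotent so it can be retried safely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical external system: rows become visible only when committed.
class StagingStore {
    final List<String> visible = new ArrayList<>();
}

// Hypothetical commit info: the rows a writer staged since the last checkpoint.
class CommitInfo {
    final List<String> stagedRows;

    CommitInfo(List<String> stagedRows) {
        this.stagedRows = stagedRows;
    }
}

class TwoPhaseWriter {
    private List<String> buffer = new ArrayList<>();

    void write(String row) {
        buffer.add(row); // staged only, no visible side effect yet
    }

    // Prepare phase: hand over what was staged and start a fresh buffer.
    CommitInfo prepareCommit() {
        CommitInfo info = new CommitInfo(buffer);
        buffer = new ArrayList<>();
        return info;
    }
}

class TwoPhaseCommitter {
    private final StagingStore store;

    TwoPhaseCommitter(StagingStore store) {
        this.store = store;
    }

    // Commit phase: called only after the checkpoint has succeeded.
    void commit(CommitInfo info) {
        store.visible.addAll(info.stagedRows);
    }

    // Abort handling: discard staged rows so they never become visible.
    void abort(CommitInfo info) {
        info.stagedRows.clear();
    }
}
```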
Principle: Treat schema as explicit, typed metadata propagated through pipelines.
Motivation:
Implementation:
- CatalogTable encapsulates complete table metadata
- TableSchema defines structure (columns, primary key, constraints)
- SchemaChangeEvent represents DDL changes (ADD/DROP/MODIFY columns)

Trade-offs:
Example:
```java
// Source produces typed schema
CatalogTable catalogTable = CatalogTable.of(
    tableId,
    TableSchema.builder()
        .column("id", DataTypes.BIGINT())
        .column("name", DataTypes.STRING())
        .primaryKey("id")
        .build()
);

// Transform validates and modifies schema
public CatalogTable getProducedCatalogTable() {
    return inputCatalogTable.copy(
        TableSchema.builder()
            .column("id", DataTypes.BIGINT())
            .column("name_upper", DataTypes.STRING()) // Transformed
            .build()
    );
}
```
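Treating DDL changes as explicit events means a schema can be evolved step by step. The sketch below illustrates the idea with hypothetical types (`ChangeKind`, `ColumnChange`, `SchemaEvolution`) and a plain ordered map of column name to type name; it does not reproduce the real `SchemaChangeEvent` class hierarchy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

enum ChangeKind { ADD, DROP, MODIFY }

// Hypothetical, simplified schema change event for a single column.
final class ColumnChange {
    final ChangeKind kind;
    final String column;
    final String type; // unused for DROP

    ColumnChange(ChangeKind kind, String column, String type) {
        this.kind = kind;
        this.column = column;
        this.type = type;
    }
}

final class SchemaEvolution {
    // Returns a new schema (column name -> type name) with the change applied.
    static Map<String, String> apply(Map<String, String> schema, ColumnChange change) {
        Map<String, String> next = new LinkedHashMap<>(schema);
        switch (change.kind) {
            case ADD:
                if (next.containsKey(change.column)) {
                    throw new IllegalStateException("column already exists: " + change.column);
                }
                next.put(change.column, change.type);
                break;
            case DROP:
                if (next.remove(change.column) == null) {
                    throw new IllegalStateException("no such column: " + change.column);
                }
                break;
            case MODIFY:
                if (!next.containsKey(change.column)) {
                    throw new IllegalStateException("no such column: " + change.column);
                }
                next.put(change.column, change.type); // keeps the column's position
                break;
        }
        return next;
    }
}
```

Rejecting invalid changes early (adding a duplicate, dropping a missing column) is the kind of validation that explicit schema metadata makes possible before any data is written.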
Principle: Connectors are plugins loaded dynamically with isolated dependencies.
Motivation:
Implementation:
Trade-offs:
Example:
```text
seatunnel-engine/lib/    # Core libraries
connector-jdbc/lib/      # JDBC driver (isolated)
connector-kafka/lib/     # Kafka client (isolated)

# Each connector loaded by separate ClassLoader
ConnectorClassLoader(connector-jdbc)  -> loads mysql-connector-java-8.0.26.jar
ConnectorClassLoader(connector-kafka) -> loads kafka-clients-3.0.0.jar
```
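Per-connector isolation of this kind can be approximated with the JDK's `java.net.URLClassLoader`. The helper below is a sketch under that assumption; `ConnectorLoaders` is a hypothetical name, and SeaTunnel's own class loader may resolve classes differently (for example, by changing the parent-delegation order).

```java
import java.io.File;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

final class ConnectorLoaders {
    // Builds one class loader for the jars in a connector's lib directory.
    // Jars listed here are invisible to loaders built for other directories.
    static URLClassLoader forLibDir(File libDir, ClassLoader parent) {
        List<URL> jars = new ArrayList<>();
        File[] files = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (files != null) { // null when the directory does not exist
            for (File jar : files) {
                try {
                    jars.add(jar.toURI().toURL());
                } catch (MalformedURLException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return new URLClassLoader(jars.toArray(new URL[0]), parent);
    }
}
```

Calling `forLibDir` once for `connector-jdbc/lib` and once for `connector-kafka/lib` yields two independent loaders that share only the classes of their common parent, so conflicting dependency versions in the two directories do not collide.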
Principle: Decouple state management from storage implementation.
Motivation:
Implementation:
- CheckpointStorage abstraction (FileSystem, HDFS, S3, OSS)

Trade-offs:
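To illustrate how state management can stay independent of the storage backend, here is a minimal sketch. `SimpleCheckpointStorage` and `InMemoryCheckpointStorage` are hypothetical names, not the real CheckpointStorage interface; a file system, HDFS, S3, or OSS backend would implement the same two methods.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical storage abstraction: callers never see where the bytes live.
interface SimpleCheckpointStorage {
    void store(String jobId, long checkpointId, byte[] state);

    Optional<byte[]> loadLatest(String jobId);
}

// One possible backend, useful for tests; state is lost when the process exits.
class InMemoryCheckpointStorage implements SimpleCheckpointStorage {
    private final Map<String, ConcurrentSkipListMap<Long, byte[]>> byJob = new ConcurrentHashMap<>();

    @Override
    public void store(String jobId, long checkpointId, byte[] state) {
        byJob.computeIfAbsent(jobId, k -> new ConcurrentSkipListMap<>())
                .put(checkpointId, state.clone());
    }

    @Override
    public Optional<byte[]> loadLatest(String jobId) {
        ConcurrentSkipListMap<Long, byte[]> checkpoints = byJob.get(jobId);
        if (checkpoints == null || checkpoints.isEmpty()) {
            return Optional.empty();
        }
        // The highest checkpoint id is the most recent one.
        return Optional.of(checkpoints.lastEntry().getValue().clone());
    }
}
```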
Principle: Support synchronizing multiple tables in a single job.
Motivation:
Implementation:
- MultiTableSource / MultiTableSink wrap individual table sources/sinks
- TablePath routes records to correct table

Trade-offs:
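Routing by table path can be sketched as a lookup from path to per-table sink. The names `RoutedRecord`, `TableBuffer`, and `MultiTableRouter` are hypothetical, and a plain string such as `db.orders` stands in for the TablePath type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record that carries the path of the table it belongs to.
final class RoutedRecord {
    final String tablePath; // e.g. "db.orders"
    final String payload;

    RoutedRecord(String tablePath, String payload) {
        this.tablePath = tablePath;
        this.payload = payload;
    }
}

// Hypothetical per-table sink that simply buffers rows.
class TableBuffer {
    final List<String> rows = new ArrayList<>();

    void write(String payload) {
        rows.add(payload);
    }
}

// Wraps the individual table sinks and dispatches each record by its table path.
class MultiTableRouter {
    private final Map<String, TableBuffer> sinks = new LinkedHashMap<>();

    void register(String tablePath, TableBuffer sink) {
        sinks.put(tablePath, sink);
    }

    void write(RoutedRecord record) {
        TableBuffer sink = sinks.get(record.tablePath);
        if (sink == null) {
            throw new IllegalArgumentException("no sink registered for table " + record.tablePath);
        }
        sink.write(record.payload);
    }
}
```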
Choice: Favor simplicity and correctness over extreme performance optimization.
Rationale:
Evidence:
Choice: Provide reasonable defaults while allowing advanced customization.
Rationale:
Implementation:
jdbc://host:port/db)

Choice: General-purpose API with specialized implementations.
Rationale:
Example:
- SourceSplitEnumerator general enough for files, databases, and message queues

Choice: Offer both exactly-once (high latency) and at-least-once (low latency) modes.
Rationale:
Configuration:
```hocon
env {
  checkpoint.mode = "EXACTLY_ONCE"  # or "AT_LEAST_ONCE"
  checkpoint.interval = 60000       # ms
}
```
SeaTunnel V1 (pre-2.3.0) had significant architectural limitations:
SeaTunnel V2 (2.3.0+) redesigned the architecture:
| Aspect | V1 | V2 |
|---|---|---|
| API | Engine-specific | Unified SeaTunnel API |
| Connectors | Duplicated code | Single implementation |
| Fault Tolerance | Engine-dependent | Explicit checkpoint protocol |
| Schema | Implicit | Explicit CatalogTable |
| Multi-Table | Not supported | Native support |
| Engine Support | Spark, Flink | Spark, Flink, Zeta |
| Exactly-Once | Partial | End-to-end with 2PC |
V1 and V2 connectors coexist but use different APIs:
- seatunnel-connectors/ (deprecated)
- seatunnel-connectors-v2/ (recommended)

V2 is the future; V1 is in maintenance mode.
Alternative: Single component handles both split generation and reading.
Decision: Separate components.
Reasoning:
Alternative: Two-level (Writer → Committer) or direct Writer commit.
Decision: Optional three-level commit.
Reasoning:
Many sinks only need Writer + Committer; AggregatedCommitter is for complex cases (e.g., Hive table commit requiring single global operation).
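The role of the third level can be sketched as follows: commit infos from all writers are combined and published in one global operation. `FileCommitInfo`, `AggregatedFileCommit`, and `SimpleAggregatedCommitter` are hypothetical names, and an in-memory list stands in for the target table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-writer commit info: the files one writer staged.
final class FileCommitInfo {
    final List<String> stagedFiles;

    FileCommitInfo(List<String> stagedFiles) {
        this.stagedFiles = stagedFiles;
    }
}

// Hypothetical aggregated commit: everything to publish for one checkpoint.
final class AggregatedFileCommit {
    final List<String> allFiles;

    AggregatedFileCommit(List<String> allFiles) {
        this.allFiles = allFiles;
    }
}

class SimpleAggregatedCommitter {
    final List<String> published = new ArrayList<>();
    int globalCommits = 0;

    // Merges the commit infos of all parallel writers into one unit.
    AggregatedFileCommit combine(List<FileCommitInfo> infos) {
        List<String> all = new ArrayList<>();
        for (FileCommitInfo info : infos) {
            all.addAll(info.stagedFiles);
        }
        return new AggregatedFileCommit(all);
    }

    // Runs once per checkpoint, regardless of how many writers contributed.
    void commit(AggregatedFileCommit aggregated) {
        published.addAll(aggregated.allFiles);
        globalCommits++;
    }
}
```

With only a per-writer committer, three parallel writers would trigger three separate commits against the target; the aggregated variant performs exactly one.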
Alternative: Directly generate physical execution plan from config.
Decision: Two-stage planning.
Reasoning:
Alternative: Single global task graph.
Decision: Jobs divided into pipelines.
Reasoning:
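One way to picture the division into pipelines is to treat each pipeline as a group of vertices that are connected to each other in the job graph. That reading is an assumption made for this sketch, not a statement of how the engine defines pipelines; `PipelineSplitter` is a hypothetical name and the grouping uses a standard union-find.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PipelineSplitter {
    // edges: pairs {from, to} of vertex indexes in [0, vertexCount).
    // Returns the vertex groups, each group being one independent pipeline.
    static List<List<Integer>> split(int vertexCount, int[][] edges) {
        int[] parent = new int[vertexCount];
        for (int i = 0; i < vertexCount; i++) {
            parent[i] = i;
        }
        for (int[] edge : edges) {
            parent[find(parent, edge[0])] = find(parent, edge[1]);
        }
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int v = 0; v < vertexCount; v++) {
            groups.computeIfAbsent(find(parent, v), k -> new ArrayList<>()).add(v);
        }
        return new ArrayList<>(groups.values());
    }

    // Finds the representative of v's group, shortening paths along the way.
    private static int find(int[] parent, int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }
}
```

Groups that share no edge can be scheduled, checkpointed, and restarted independently, which is the practical benefit of not treating the job as a single global task graph.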
Alternative: Rely entirely on Flink/Spark checkpoint mechanisms.
Decision: Explicit SeaTunnel checkpoint protocol.
Reasoning:
However, for Flink translation, SeaTunnel checkpoints align with Flink checkpoints to avoid duplication.
SeaTunnel's architecture reflects careful trade-offs between competing concerns:
The V2 redesign addressed major V1 limitations while establishing principles for long-term evolution. Understanding these design philosophies helps contributors make consistent decisions and users understand SeaTunnel's strengths and appropriate use cases.