Iceberg Benchmark Results

Benchmarks for the iceberg output using a local REST catalog backed by MinIO (S3-compatible).

See internal/impl/iceberg/bench/ for the benchmark configs and run instructions.

Write Throughput — CPU & Batch Size Scaling

Synthetic events generated at maximum speed (generate input with count: 0, interval: ""), written to a single Iceberg table. Varying GOMAXPROCS and batching.count.
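For reference, a minimal sketch of the pipeline shape described above. Only count, interval, and batching.count are named on this page; the mapping values are illustrative stand-ins for the listed fields, and the iceberg output's catalog/table settings are omitted because their exact field names are not given here (see internal/impl/iceberg/bench/ for the real configs):

```yaml
input:
  generate:
    count: 0       # unbounded; each run is stopped at 1,000,000 messages
    interval: ""   # emit as fast as possible
    mapping: |
      root.id = uuid_v4()
      root.user_id = random_int(max: 10000)
      root.event_type = "click"
      root.value = random_int(max: 1000)
      root.info = "synthetic benchmark payload"
      root.ts = now()

output:
  iceberg:
    # Catalog URI, warehouse/table, and MinIO credentials omitted here;
    # the bench configs contain the real settings.
    batching:
      count: 10000   # the knob varied below (1000 / 5000 / 10000)
```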

Environment: Intel Core i7-10850H @ 2.70GHz, 32 GB RAM, WSL2 (Linux 6.6.87.2), x86_64, MinIO + REST catalog running in Docker (localhost)

Dataset: Synthetic events, ~142 B per message (id, user_id, event_type, value, info, ts) — measured at the pipeline processor. Actual bytes written to MinIO will differ due to Parquet columnar compression.

Count: 1,000,000 messages per run

msg/sec

| GOMAXPROCS | batch=1000 | batch=5000 | batch=10000 |
|---|---|---|---|
| 1 | 757 | 3,105 | 5,442 |
| 2 | 1,186 | 4,408 | 6,763 |
| 4 | 1,147 | 4,758 | 8,483 |
| 8 | 1,056 | 4,107 | 8,231 |
(unbounded)

kB/sec

| GOMAXPROCS | batch=1000 | batch=5000 | batch=10000 |
|---|---|---|---|
| 1 | 106 | 435 | 774 |
| 2 | 166 | 618 | 961 |
| 4 | 161 | 667 | 1,206 |
| 8 | 148 | 576 | 1,170 |
(unbounded)

Observations:

  • Batch size is the dominant factor: throughput at 1 core scales from 757 (batch=1000) → 3,105 (batch=5000) → 5,442 (batch=10000) msg/sec. Each batch is one catalog commit round-trip, so fewer commits means dramatically higher throughput (a rough per-commit estimate follows this list).
  • batch=5000 and batch=10000 benefit from more cores up to 4, then regress at 8 — the catalog commit overhead is reduced enough that CPU parallelism helps, but 8 cores reintroduces contention.
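A rough per-commit estimate from the 1-core row supports this: batch=1000 means ~1,000 commits for the 1,000,000-message run, and 1,000,000 / 757 msg/sec ≈ 1,320 s, i.e. roughly 1.3 s per batch cycle; batch=10000 means ~100 commits over 1,000,000 / 5,442 msg/sec ≈ 184 s, roughly 1.8 s per batch cycle. The cost of a commit cycle stays in the same ballpark, so larger batches simply amortise it over ten times as many messages.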

Write Throughput — Batch Size & max_in_flight Scaling

Fixed at GOMAXPROCS=4, varying batching.count and max_in_flight to measure the impact of concurrent catalog commits.
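Both knobs map directly onto the output config. A minimal sketch, with the same caveat as above that only batching.count and max_in_flight are named on this page:

```yaml
output:
  iceberg:
    max_in_flight: 32   # concurrent in-flight batch writes / catalog commits
    batching:
      count: 10000      # messages per batch, i.e. per commit
```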

Environment: Intel Core i7-10850H @ 2.70GHz, 32 GB RAM, WSL2 (Linux 6.6.87.2), x86_64, MinIO + REST catalog running in Docker (localhost)

Dataset: Synthetic events, ~142 B per message

Count: 1,000,000 messages per run

msg/sec

| max_in_flight | batch=5000 | batch=10000 |
|---|---|---|
| 4 | 4,758 | 8,483 |
| 8 | 7,105 | 13,839 |
| 16 | 12,973 | 23,316 |
| 32 | 20,462 | 34,835 |
| 64 | 34,993 | 33,703 |
| 128 | 33,911 | 33,742 |

MB/sec

| max_in_flight | batch=5000 | batch=10000 |
|---|---|---|
| 4 | 0.67 | 1.21 |
| 8 | 1.00 | 2.00 |
| 16 | 1.80 | 3.30 |
| 32 | 2.90 | 5.00 |
| 64 | 5.00 | 4.80 |
| 128 | 4.80 | 4.80 |

Observations:

  • max_in_flight is the most impactful knob: at batch=10000, throughput scales from 8,483 (MIF=4) → 13,839 (MIF=8) → 23,316 (MIF=16) → 34,835 (MIF=32) msg/sec — a 4x gain by increasing concurrent commits.
  • The ceiling is ~34K msg/sec (~5 MB/sec), hit at MIF=32 for batch=10000 and MIF=64 for batch=5000. This is the MinIO throughput limit, not the connector (see the cross-check after this list).
  • batch=5000 and batch=10000 converge at high MIF values — both plateau at ~34K msg/sec when given enough concurrent commits. batch=10000 reaches the ceiling with fewer in-flight requests (MIF=32 vs MIF=64).
  • Sweet spot: batch=10000, MIF=32 — reaches maximum throughput with the least concurrency overhead.
  • The fundamental insight from both sections: the Iceberg write bottleneck is catalog commit latency. The connector itself is not the bottleneck — throw more concurrent commits at it (max_in_flight) and throughput keeps climbing until MinIO saturates.
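As a cross-check of that ceiling: 34,835 msg/sec × ~142 B/msg ≈ 4.9 MB/sec, which matches the ~5 MB/sec plateau in the MB/sec table above.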

Comparison: Kafka Connect vs Redpanda Connect Iceberg

Environment: Intel Core i7-10850H @ 2.70GHz, 32 GB RAM, WSL2 (Linux 6.6.87.2), x86_64

Dataset: 10,000,000 synthetic events, MinIO + Iceberg REST catalog in Docker

Both connectors use a 10s commit window and 16 Kafka partitions. The transformation computes 5 derived fields per message (event_id, value_usd, value_tier, ts_ms, is_high_value).
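The derived-field names come from this page, but the derivation logic below is purely illustrative of what such a mapping looks like; the real transformation lives in the bench configs:

```yaml
pipeline:
  processors:
    - mapping: |
        root = this
        root.event_id = uuid_v4()                 # illustrative derivations only
        root.value_usd = this.value / 100.0       # assumed conversion factor
        root.value_tier = if this.value > 500 { "high" } else { "low" }
        root.is_high_value = this.value > 500
        root.ts_ms = now().ts_unix_milli()        # assumed timestamp source
```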

Results

Sink only

| Connector | Throughput |
|---|---|
| Kafka Connect (Tabular) | 84,745 msg/s |
| Redpanda Connect | 61,349 msg/s |

Transform + sink

| Connector | Kafka CPUs | Throughput |
|---|---|---|
| Kafka Connect (Tabular) | unbounded | 37,037 msg/s |
| Redpanda Connect | unbounded | 47,272 msg/s |
| Redpanda Connect | 1 | 45,248 msg/s |
| Redpanda Connect | 2 | 48,829 msg/s |

Notes

  • Kafka Connect sink-only is fastest in isolation — 16 tasks consuming pre-transformed data directly into Iceberg.
  • Kafka Connect with transformation requires a separate Redpanda Connect (RPCN) pre-processing step that writes to an intermediate Kafka topic (bench-events-transformed); Kafka Connect then reads from that topic and sinks to Iceberg. The two-stage I/O cuts throughput by more than half.
  • Redpanda Connect handles transformation and Iceberg writes in a single pipeline — no intermediate topic, no extra Kafka round-trip (a rough shape of this pipeline is sketched after this list).
  • End-to-end (the realistic scenario): Redpanda Connect is ~1.3x faster than Kafka Connect (47k vs 37k msg/s).
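For context, the single-pipeline shape is roughly the following. The broker address, topic, and consumer group are placeholders, the kafka_franz input is an assumption about which Kafka input was used, and the iceberg output settings carry the same caveats as the sketches above:

```yaml
input:
  kafka_franz:
    seed_brokers: [ "localhost:9092" ]    # placeholder broker address
    topics: [ "bench-events" ]            # placeholder topic name
    consumer_group: "rpcn-iceberg-bench"  # placeholder consumer group

pipeline:
  processors:
    - mapping: |
        root = this
        root.event_id = uuid_v4()
        # ...remaining derived fields as in the mapping sketch above

output:
  iceberg:
    max_in_flight: 32
    batching:
      count: 10000
      period: 10s   # mirrors the 10s commit window used in the comparison
```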