PIP-465: Split IO Connectors into Separate Repository

Background Knowledge

Apache Pulsar ships ~30 IO connectors (Kafka, Kinesis, Cassandra, Elasticsearch, JDBC, Debezium, etc.) as part of its main repository. These connectors are packaged as NAR files and bundled into a pulsar-all Docker image alongside the core broker, client, and functions runtime.

Each connector brings its own dependency tree — often large and conflicting with other connectors or with Pulsar's core dependencies. The connectors interact with Pulsar exclusively through the stable pulsar-io-core API, making them natural candidates for independent development and release.

Motivation

The primary goal of this PIP is to make development of Pulsar easier by shrinking the core codebase. Removing ~30 connectors and their dependency trees from the main repository is expected to substantially reduce compile time, test execution time, and CI resource consumption, and to improve CI stability.

Build and CI impact. Compiling and packaging ~30 connector NARs adds significant time to every CI run and local build, even when a developer is only working on the broker or client. The connectors collectively bring hundreds of transitive dependencies into the build graph, which slows down dependency resolution, inflates vulnerability reports (OWASP checks must scan connector dependencies), and creates version conflicts that require careful management in the main repository's BOM. Removing them dramatically reduces the surface area of the build.

Release coupling. Connectors are tied to the Pulsar release cycle. A bug fix in a single connector (e.g., updating the Elasticsearch client) requires waiting for the next Pulsar release. Conversely, a Pulsar patch release must rebuild all connectors even when none of them changed. After the split, connectors are released on their own cadence, independent of Pulsar releases, similar to what we already do for client SDKs (Go, Python, Node.js).

Low integration risk. The pulsar-io-core API that connectors depend on has been very stable for a long time. There have been no breaking changes to the connector API in years, so there is essentially no risk of integration pain from this split.

Docker image bloat. The pulsar-all image bundles every connector NAR, weighing in at ~2.9 GB — a very large image that most deployments don't need. Users typically deploy only 1-2 connectors but pay the image pull cost for all of them. The main reason users chose pulsar-all over pulsar was to get the tiered-storage offloaders — this PIP addresses that by packaging the offloader NARs directly into the pulsar image. Users who need specific connectors can still build tailored images by adding just the connector NARs they need on top of apachepulsar/pulsar.

Independent velocity. Connector maintainers should be able to release new connector versions against a stable Pulsar API without coordinating with the core release train.

Goals

In Scope

  • Create apache/pulsar-connectors repository containing all IO connector modules, with their own Gradle build, version catalog, and CI pipeline. The repository is forked from the main Pulsar repository to preserve full git history.

  • Remove connector modules from the main Pulsar repository. Retain only:

    • pulsar-io-core (the connector API)
    • pulsar-io-data-generator (minimal connector used in integration tests)
    • The functions runtime and worker that load connectors at runtime

  • Remove the pulsar-all Docker image. The image is too large and most users don't need all connectors in a single image. The pulsar image becomes the single official image. Tiered-storage offloader NARs, the main reason users chose pulsar-all, are included directly in the pulsar image.

  • Independent connector releases. The pulsar-connectors repository has its own versioning and release cadence, independent from Pulsar releases — similar to what we already do for client SDKs. It can release new connector versions against any compatible Pulsar release.

  • Connector distribution packaging. The connectors repository produces a single release containing all connector NARs, as a distribution tarball that users can deploy into an existing Pulsar installation.

Out of Scope

  • Changing the connector API (pulsar-io-core)
  • Changing how the functions worker discovers and loads connector NARs
  • A connector marketplace or registry (future enhancement)
  • Splitting out tiered-storage offloaders into their own repository

High Level Design

The split creates two repositories from what is currently one:

apache/pulsar (main repo)
├── pulsar-io/core/          # Connector API (retained)
├── pulsar-io/data-generator/ # Test connector (retained)
├── pulsar-functions/        # Runtime + worker (retained)
├── docker/pulsar/           # Single Docker image
└── (broker, client, etc.)

apache/pulsar-connectors (new repo)
├── aerospike/
├── aws/
├── cassandra/
├── debezium/
│   ├── core/
│   ├── mysql/
│   ├── postgres/
│   └── ...
├── elastic-search/
├── jdbc/
│   ├── core/
│   ├── postgres/
│   └── ...
├── kafka/
├── kafka-connect-adaptor/
├── kinesis/
├── rabbitmq/
├── ... (all other connectors)
├── distribution/io/         # Distribution packaging
└── docs/                    # Connector docs generation

The connectors repository consumes Pulsar artifacts (pulsar-io-core, pulsar-client, etc.) as external Maven dependencies, not as source dependencies. This ensures connectors build against the published API and don't accidentally depend on internals.

Detailed Design

Repository Structure

The new pulsar-connectors repository is forked from the main Pulsar repository to preserve git history, then trimmed to contain only connector-related modules. Connectors are promoted from nested pulsar-io/<name> paths to top-level <name>/ directories for a flatter structure.

Build Configuration

The connectors repository has its own:

  • settings.gradle.kts with all connector modules
  • gradle/libs.versions.toml with connector-specific dependency versions
  • pulsar-dependencies/ platform module pinning Pulsar artifact versions
  • build.gradle.kts root build with shared configuration
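
As a rough illustration of the first item, a settings.gradle.kts for the flattened layout might look like the sketch below. The project names are assumptions derived from the directory tree in the High Level Design section, only a subset of connectors is listed, and the use of Maven Central as the source of Pulsar artifacts is likewise an assumption. The actual file in the repository may differ.

```kotlin
// settings.gradle.kts (illustrative sketch, not the actual file)
rootProject.name = "pulsar-connectors"

dependencyResolutionManagement {
    repositories {
        // Assumption: published Pulsar artifacts are resolved from Maven Central
        mavenCentral()
    }
}

// Platform module that pins Pulsar artifact versions
include("pulsar-dependencies")

// Top-level connector modules (subset shown)
include("aerospike")
include("cassandra")
include("kafka")
include("kinesis")

// Connector families with nested modules, e.g. debezium/core -> :debezium:core
include("debezium:core", "debezium:mysql", "debezium:postgres")
include("jdbc:core", "jdbc:postgres")

// Distribution packaging (distribution/io)
include("distribution:io")
```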

Pulsar core artifacts are declared as dependencies with a configurable version:

```kotlin
// pulsarVersion: the configurable Pulsar version to build against
implementation("org.apache.pulsar:pulsar-io-core:${pulsarVersion}")
```
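
One way the pulsar-dependencies platform module could pin these versions in a single place is sketched below. This is an assumption about how the pieces might fit together, not the repository's actual build: the Gradle property name pulsarVersion, the project path :pulsar-dependencies, and the choice of the java-platform plugin are all hypothetical.

```kotlin
// pulsar-dependencies/build.gradle.kts (hypothetical sketch)
plugins {
    `java-platform`
}

// Assumption: the version is supplied as a Gradle property,
// e.g. in gradle.properties or via -PpulsarVersion=<version>
val pulsarVersion: String = providers.gradleProperty("pulsarVersion").get()

dependencies {
    constraints {
        // Pin the published Pulsar artifacts that connectors build against
        api("org.apache.pulsar:pulsar-io-core:$pulsarVersion")
        api("org.apache.pulsar:pulsar-client:$pulsarVersion")
    }
}

// ---------------------------------------------------------------
// kafka/build.gradle.kts (hypothetical connector module)
// plugins {
//     `java-library`
// }
//
// dependencies {
//     // Import the version constraints from the platform module
//     implementation(platform(project(":pulsar-dependencies")))
//     // No version here: it is resolved from the platform constraints
//     implementation("org.apache.pulsar:pulsar-io-core")
// }
```

With a layout like this, building the connectors against a different Pulsar release would only require changing the one property, which fits the goal of releasing connectors against any compatible Pulsar version.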

Versioning Strategy

The initial release of pulsar-connectors will use the same version as the next Pulsar release (whether that is 4.3 or 5.0), to make the transition clear. After that, the connectors repository follows its own independent release cadence. All connectors are released together as a single release (not individually), and each release specifies which Pulsar versions it is compatible with.

Docker Image Changes

The pulsar-all image is removed. It bundled all connector NARs alongside the broker, producing a very large image that most deployments didn't need. The main reason users chose pulsar-all over pulsar was to get the tiered-storage offloaders. With this change:

  • Tiered-storage offloader NARs move into the pulsar image, eliminating the primary reason for pulsar-all to exist
  • The pulsar Docker image becomes the single official image, containing the broker, functions runtime, and tiered-storage offloader NARs
  • Users who need specific connectors can build tailored images by adding just the connector NARs they need on top of apachepulsar/pulsar, or mount them via volume mounts

CI and Testing

  • The main Pulsar repository's CI no longer builds or tests connectors
  • The connectors repository has its own CI that builds and tests all connectors
  • Integration tests that exercise specific connectors (e.g., Cassandra sink, Kafka source) move to the connectors repository
  • The main repository retains integration tests using data-generator for testing the connector loading and runtime machinery

Migration for Users

Users who currently use the pulsar-all Docker image:

  1. Switch to the pulsar Docker image
  2. Download needed connector NARs from the connectors release
  3. Mount NARs into the container (e.g., via volume mount to /pulsar/connectors/)

Users who build from source:

  1. Build the main Pulsar repository as before (faster, since connectors are gone)
  2. Build the connectors repository separately if needed

Public-facing Changes

Docker Images

| Before | After |
|---|---|
| pulsar: core only | pulsar: core + tiered-storage offloaders |
| pulsar-all: core + all connectors + offloaders | (removed) |

Artifacts

  • All connector NARs move from the main Pulsar release to a single unified release from the pulsar-connectors repository
  • All other Pulsar artifacts remain unchanged

Configuration

No changes to broker, client, or functions worker configuration.

Backward & Forward Compatibility

Backward Compatibility

The connector API (pulsar-io-core) does not change. Existing connector NARs continue to work with the functions worker without modification.

The pulsar-io-core API has been very stable for years with no breaking changes, so connectors built against older API versions will continue to work with newer Pulsar releases and vice versa.

Forward Compatibility

New connector releases can target older Pulsar versions, as long as the pulsar-io-core API they depend on is compatible. Given the long track record of API stability, this is expected to work seamlessly across Pulsar 4.x releases.

Security Considerations

The split does not change the security model. Connectors continue to be loaded through the same NAR classloader isolation mechanism.

Separating connector dependencies from the main repository also improves security posture by reducing the attack surface of the core Pulsar build and making connector dependency updates independently releasable.

Links

  • Mailing List discussion thread: [link]
  • Mailing List voting thread: [link]