docs/2.developers/7.templates/ETL/180.kafka-alternative.md
::
Apache Kafka is a distributed event store for publishing and consuming data streams. It is a de facto choice for building real-time streaming data pipelines. However, Kafka's complexity and high infrastructure demands make it a costly and challenging solution for message queuing. It requires dedicated server clusters, ongoing maintenance, and fine-tuning, which can add significant expense and operational overhead.
Deploying and maintaining self-managed Apache Kafka clusters for minimal use cases will cost you over $20,000 monthly according to Confluent estimates. Even using managed services like Confluent Cloud will be roughly $1,800 per month—and that's just for the service itself, not including the additional costs of risk management and operations. What if there's a simpler, more cost-effective way to achieve sub-second latency for your streaming applications—all while leveraging your existing S3 storage and keeping your system straightforward?
Introducing a Kafka-free architecture using Pathway (GitHub repo) alongside Delta Tables on S3-compatible storage, such as MinIO. Pathway is a stream processing engine with a Rust engine that allows you to build efficient, real-time data pipelines in Python for both on-premise and cloud environments (AWS, Google Cloud, Azure). In this architecture, Pathway manages the reading and writing of data to and from Delta Tables, creating a seamless, scalable, and cost-effective streaming solution that leverages your existing S3 storage—eliminating the need for Kafka.
In these benchmarks, Pathway achieved stable near-real-time latency for workloads up to 250,000 messages per second. Reducing the batch size provides sub-second latency for workloads up to 60,000 messages per second, demonstrating performance suitable for many real-time applications while significantly reducing complexity and cost compared to Kafka.
When using Pathway in conjunction with Delta Lake and MinIO, you will:
In this article, we'll guide you through setting up a streaming data pipeline, present benchmark results, and explore how you can do this setup to simplify your infrastructure while delivering the performance you need.
:article-toc-without-title
Kafka is a common tool for handling high-throughput, real-time data streams, but it comes with significant challenges:
Costs
Complexity
You don't need to get into the complexities of Kafka to achieve sub-second latency.
Instead of using Kafka as a message queue, we propose to use Delta Tables on S3-compatible storage, such as MinIO, to manage your data streams. Pathway provides the streaming capabilities that Delta Tables alone lack: Pathway continuously monitors, reads, and processes new data entries and updates onto the Delta Tables. By handling the reading and writing of data to and from these Delta Tables, Pathway provides an easy-to-deploy and efficient way to process data streams on S3.
With Pathway in integration with your existing S3 storage, you can handle up to 60,000 messages per second with 0.7s (p99) latency, and even 250k messages per second with near-real-time latency. Here you will see how you will get:
While the achieved performance of Pathway and Delta Tables over MinIO—sub-second latency for workloads up to 60,000 messages per second and near-real-time latency (1-4 seconds) for workloads up to 250,000 messages per second—is an order of magnitude lower than the best possible with Kafka, it is still more than sufficient for a wide range of applications, such as:
This is not an exhaustive list, but it illustrates the wide range of scenarios this setup serves as an effective Python Kafka alternative for real-time data processing needs.
Using Pathway with S3-compatible storage like MinIO allows you to create a streaming architecture that is both effective and straightforward. Unlike traditional Kafka setups, which involve managing clusters, brokers, and zookeepers, a Pathway-based solution simplifies your infrastructure by relying on your existing S3 storage. This architecture reduces your maintenance overhead, costs, and the need for complex configurations, all while delivering reliable, near-real-time data processing.
::
This simplified architecture focuses on three main components:
The MinIO Enterprise Object Store (EOS) is a high-performance, Kubernetes-native, S3-compatible object store offering scalable storage for unstructured data like files, videos, images, and backups. EOS delivers S3-like infrastructure across public clouds, private clouds, on-prem and the edge. It offers a rich suite of enterprise features targeting security, resiliency, data protection, and scalability. MinIO's EOS is commonly used to build streaming data pipelines and AI data lakes because it is highly scalable, performant and durable but can also handle backup and archival workloads - all from a single platform.
S3 storage, short for Simple Storage Service, is a cloud-based solution that lets you store and retrieve any amount of data from anywhere on the web. You can use MinIO as a drop-in replacement for Amazon S3, enjoying the same features such as scalability, durability, and flexibility. This compatibility allows you to handle data storage needs efficiently without relying on Amazon's infrastructure, making MinIO a great alternative for your cloud storage requirements.
The main difference between S3 storage and a file system is how they manage and access the data. In S3 storage, data is represented as objects within buckets, each object having a unique key and metadata. Inside a bucket, objects are accessed in a key-value store fashion: you access them by their key, and there is no extension or ordering.
Delta Tables is an ACID-compliant storage layer implemented through Delta Lake. Delta Tables are a way to make object storage behave like database tables: They track data changes, ensure consistency, support schema evolution, and enable high-performance analytics. Delta Tables are especially useful for real-time data ingestion and processing, providing an append-only mechanism for writing and reading data.
Here's a step-by-step guide to building this pipeline with Pathway, Delta Lake, and S3/MinIO.
::callout{type="basic"} #summary Code Walkthrough
#content
MINIO_S3_ACCESS_KEY and the MINIO_S3_SECRET_ACCESS_KEY.Get the sources from our GitHub repository. The project has two directories:
.
├── minio-ETL/
└── benchmarks/
Navigate to minio-ETL/ folder, where you’ll find the pipeline project files:
.
├── .env
├── base.py
├── etl.py
├── producer.py
├── read-results.py
└── README.md
.env is the environment file for setting MinIO credentials.base.pycontains basic configuration settings for accessing MinIO.etl.py is the main Pathway ETL pipeline, handling data loading and writing to Delta Tables on MinIO.producer.py generates messages with timestamps from two time zones (New York and Paris) and writes them to the respective Delta Tables.read-results.py reads the output data stream from the ETL pipeline and saves it to a CSV file.Open .env and enter your MinIO credentials:
MINIO_S3_ACCESS_KEY = *******
MINIO_S3_SECRET_ACCESS_KEY = *******
In base.py, add your bucket name and endpoint URL. In producer.py, update the storage_option dictionary with these settings, including the AWS_REGION.
Pathway here reads and writes data streams directly from Delta Tables on S3-compatible storage.
Define the data ingestion setup in etl.py to read timestamps from Delta Tables:
timestamps_timezone_1 = pw.io.deltalake.read(
base_path + "timezone1",
schema=InputStreamSchema,
s3_connection_settings=s3_connection_settings,
autocommit_duration_ms=1000,
)
timestamps_timezone_2 = pw.io.deltalake.read(
base_path + "timezone2",
schema=InputStreamSchema,
s3_connection_settings=s3_connection_settings,
autocommit_duration_ms=1000,
)
Unify timestamps across different time zones with this transformation logic:
def process_timestamps(stream1, stream2):
unified_stream = pw.transform.unify_timezones(stream1, stream2)
return unified_stream
timestamps_unified = process_timestamps(timestamps_timezone_1, timestamps_timezone_2)
Write the transformed data back to Delta Tables on MinIO:
pw.io.deltalake.write(
timestamps_unified,
base_path + "timezone_unified",
s3_connection_settings=s3_connection_settings,
min_commit_frequency=1000
)
cat results.csv
You should obtain something like this:
timestamp,message,time,diff
1724403133066.217,"0",1724403268648,1
1724403138388.917,"1",1724403269748,1
1724403144896.706,"3",1724403270848,1
1724403147095.393,"4",1724403271948,1
1724403149295.165,"5",1724403272948,1
1724403151499.115,"6",1724403274048,1
1724403153736.456,"7",1724403275148,1
1724403140576.744,"2",1724403276248,1
1724403158229.244,"9",1724403276248,1
Bravo! You’ve successfully built a streaming pipeline on MinIO using Pathway and Delta Tables—an effective alternative to traditional messaging systems like Kafka. ::
Pathway is often used by developers for ultra-low latency use cases, such as Formula 1 racing analytics and mission-critical applications for organizations like Intel and NATO. However, in many scenarios, ultra-low latency isn't essential, and achieving sub-second or near-real-time latency is sufficient (like the use cases listed above) while giving you the advantage of working with the existing infrastructure itself.
The benchmark experiments below are to see if Pathway performs as a Python Kafka alternative with your existing S3-compatible storage.
In these tests, messages are written to a Delta Table hosted on MinIO (a popular S3-compatible storage system) and then read back. Latency is measured from the moment a message is sent to when it's successfully processed by Pathway.
<!-- https://www.canva.com/design/DAGW1SzKjZg/2k0z2_GoLXP8SeCjVF5dFQ/edit?utm_content=DAGW1SzKjZg&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton -->::
The workloads simulate scenarios with message rates ranging from 10k to 250k messages per second. Each test runs for 5 minutes, excluding the first 30 seconds, to allow the system to stabilize for more accurate measurements. MinIO is hosted on the same network as the test machine to minimize external network delays. Pathway is configured to run in single-threaded mode for consistency.
The first 30 seconds were excluded from the results because the producer starts writing entries before the consumer begins. This means initial entries have higher latency as the consumer catches up with data already written. Excluding this period provides a more accurate measurement of steady-state performance.
To replicate these tests, you can find the benchmarking script here.
::
For higher message rates, up to 250,000 per second, latency increased but remained within acceptable bounds for applications where near-real-time processing is sufficient.
Furthermore, we’ve recommended necessary modifications based on the workloads you’re dealing with.
Latency in this setup is influenced by several factors, including time spent creating and processing batches, network communication between Pathway and S3 storage, and processing overheads such as data parsing, system I/O, and other computational tasks.
Initially, the system uses default settings for batch sizes and commit intervals. By default, batches may be accumulated for up to one second before being committed to the Delta Table. This batching helps with throughput but can increase latency.
To improve latency, considering your workload, you can adjust autocommit_duration_ms and min_commit_frequency:
autocommit_duration_ms: Set this to 100 milliseconds. Reduce batch sizes and decrease the time messages spend waiting before being processed.Here's how this setup affects performance:
| Workload (messages/sec) | p50 (s) | p75 (s) | p85 (s) | p95 (s) | p99 (s) |
|---|---|---|---|---|---|
| 10,000 | 0.26 | 0.35 | 0.36 | 0.53 | 0.67 |
| 30,000 | 0.28 | 0.31 | 0.32 | 0.38 | 0.64 |
| 70,000 | 1.24 | 16.06 | 24.82 | 35.66 | 41.32 |
As you can see, the optimization significantly improves latency at rates of 10,000 to 30,000 messages per second, and the improvement holds up until around 60,000 messages per second. The timings are stable, with only minor fluctuations (around 50 ms), caused by network and batching delays. The improvement is about four times for median processing time and up to three times for the 99th percentile.
However, at rates above 60,000 messages per second, the system starts to fail:
::
While latency improves, throughput suffers. The system now makes ten times more network calls and synchronization operations, such as polling for the latest Delta Table versions, which reduces the maximum throughput it can handle. Additionally, this increases the impact of network errors: if access to the Delta Table is lost even briefly, it may be more difficult for the system to recover.
autocommit_duration_ms and min_commit_frequency to find the optimal balance between latency and throughput for your use case. Ensure your testing environment is adequately provisioned, as CPU, memory, and network capacity can affect performance.Pathway integrates seamlessly with AWS S3, providing a powerful Kafka alternative for AWS users. Available on the AWS Marketplace, Pathway simplifies deployment within your AWS environment.
To set up Pathway with AWS S3, deploy Pathway from the AWS Marketplace and configure it with your AWS credentials. Use Pathway's Delta Lake connectors to read from and write to Delta Tables stored in S3. This allows you to run your streaming pipeline, leveraging AWS's scalability and reliability without the complexity of Kafka.
You can read the documentation for it here.
Pathway serves as an efficient Kafka alternative on the Google Cloud Platform (GCP). By deploying Pathway on GCP, you can build streaming pipelines without the need to manage Kafka clusters.
To set up Pathway on Google Cloud, deploy it using Google Kubernetes Engine (GKE). Configure Pathway to interact with Google Cloud Storage, enabling you to process streaming data effectively. This setup leverages Google's scalable infrastructure without the complexity of Kafka.
You can read more about deploying Pathway on GCP in our GCP deployment guide.
Pathway offers a compelling Kafka alternative for Microsoft Azure users. Deploying Pathway on Azure allows you to build streaming pipelines without the overhead of managing Kafka infrastructure.
To set up Pathway on Azure, deploy it using Azure Container Instances. Configure Pathway to interact with Azure storage services, enabling you to process streaming data efficiently. This setup utilizes Azure's scalable infrastructure without the complexity of Kafka.
You can read more about deploying Pathway on Azure in our Azure deployment guide.
Kafka is widely used for real-time data streaming but is complex and costly to manage, often requiring substantial infrastructure and operational overhead. If you're already using S3-compatible storage like MinIO, you can achieve sub-second latency for your streaming applications without Kafka by using Pathway.
Pathway offers a lightweight alternative that integrates seamlessly with your existing S3 storage, using Delta Tables to manage data streams efficiently. Benchmarks demonstrate that Pathway can handle up to 60,000 messages per second with sub-second latency and up to 250,000 messages per second with near-real-time latency.
This setup simplifies your architecture, reduces costs, and is suitable for various applications like IoT sensor data collection, financial transactions, web analytics, and logistics tracking. The article provides a step-by-step guide to building a streaming pipeline using Pathway and MinIO, and explores how Pathway serves as a Kafka alternative on AWS, Google Cloud Platform, and Azure, leveraging each platform's storage services to process streaming data effectively without the complexity of Kafka. While this Kafka-free architecture relies on Pathway for managing data flows and interacting with Delta Tables, it can be seamlessly extended to handle advanced stream processing tasks. Thanks to Pathway, you can easily turn your S3 storage into a full data streaming processing pipeline.
Need help?
Pathway is trusted by thousands of developers and enterprises, including Intel, NATO, Formula 1 Teams, CMA CGM, and more. Reach out for assistance with your enterprise applications. Contact us here to discuss your project needs or to request a demo.