python/custreamz/README.md
Built as an extension to Python streamz, cuStreamz provides GPU-accelerated abstractions for streaming data. cuStreamz can be used alongside Python streamz or as a standalone library for ingesting streaming data into cudf dataframes.
The most common use of cuStreamz is accelerated data ingestion into a cudf dataframe. cuStreamz currently supports ingestion from Apache Kafka in the following message formats: Avro, CSV, JSON, Parquet, and ORC.
For example, the following snippet consumes CSV data from a Kafka topic named custreamz_tips and generates a cudf dataframe.
See the Apache Kafka Quickstart to learn how to install Kafka, create the custreamz_tips topic, and insert the tips data into it.
from custreamz import kafka
# Full list of configurations can be found at: https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
kafka_configs = {
    "metadata.broker.list": "localhost:9092",
    "group.id": "custreamz-client",
}
# Create a reusable Kafka consumer client (the "datasource")
consumer = kafka.Consumer(kafka_configs)
# Read 10,000 messages from `custreamz_tips` topic in CSV format.
tips_df = consumer.read_gdf(topic="custreamz_tips",
                            partition=0,
                            start=0,
                            end=10000,
                            message_format="csv")
print(tips_df.head())
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100
# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())
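To make the aggregation above concrete without a GPU or a running Kafka broker, here is a minimal CPU-only sketch of the same tip-percentage calculation using only the Python standard library. It is not part of cuStreamz. The sample rows are hypothetical, the helper name `mean_tip_percentage_by_size` is invented for illustration, and the sketch assumes the CSV carries a header row with `total_bill`, `tip`, and `size` columns.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows, for illustration only (not real tips data).
SAMPLE_CSV = """total_bill,tip,size
10.00,1.00,2
20.00,4.00,2
50.00,5.00,4
"""


def mean_tip_percentage_by_size(csv_text):
    """Return a dict mapping dining party size to mean tip percentage."""
    percentages = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Same formula as the dataframe example: tip / total_bill * 100
        pct = float(row["tip"]) / float(row["total_bill"]) * 100
        percentages[int(row["size"])].append(pct)
    # Average the per-row percentages within each party size
    return {size: sum(vals) / len(vals)
            for size, vals in sorted(percentages.items())}


result = mean_tip_percentage_by_size(SAMPLE_CSV)
print(result)
```

A reference like this can be handy for sanity-checking the GPU result on a small sample of messages.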
A "hello world" of using cuStreamz with python streamz can be found here
A more detailed example of parsing haproxy logs is also available.
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you're running. This provides a ready-to-run Docker container with cuStreamz already installed.
cuStreamz can be installed with conda (via miniforge) from the rapidsai channel:
Release:
conda install -c rapidsai cudf_kafka custreamz
Nightly:
conda install -c rapidsai-nightly cudf_kafka custreamz
See the Get RAPIDS version picker for more OS and version info.
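After installing, a quick check along the following lines can confirm that the packages are importable in the active environment. This is only a sketch: the helper `missing_packages` is hypothetical, and the import names `cudf`, `cudf_kafka`, and `custreamz` are assumed to match the conda package names. It looks the modules up without importing them, so it also runs on a machine without a GPU.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; adjust if your environment differs.
required = ["cudf", "cudf_kafka", "custreamz"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All cuStreamz packages were found.")
```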