Back to Hudi

๐Ÿš€ Spark + Hudi + MinIO + Hive Metastore Docker Demo

hudi-notebooks/README.md

0.5.33.6 KB
Original Source
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

๐Ÿš€ Spark + Hudi + MinIO + Hive Metastore Docker Demo

This project provides a ready-to-use Docker Compose environment for running Apache Spark with Hudi, Hive Metastore, and MinIO (S3-compatible storage) for data lake development and testing. JupyterLab is included for interactive development.

๐Ÿ› ๏ธ Services

  • spark-hudi: Spark with Hudi and JupyterLab
  • hive-metastore: Hive Metastore (backed by Derby)
  • minio: S3-compatible object storage

๐Ÿ“‚ Directory Structure

  • Dockerfile.spark / Dockerfile.hive: Custom Dockerfiles for Spark and Hive
  • build.sh: Build all Docker images
  • run_spark_hudi.sh: Start/stop/restart the stack
  • conf/: Configuration files for Spark, Hive, and Hudi
  • notebooks/: Jupyter notebooks (mounted in Spark container)
  • data/: Persistent data for MinIO, Spark event logs, etc.

โšก Quick Start

1. Build Docker Images

sh
./build.sh

2. Start the Environment

sh
./run_spark_hudi.sh start

3. Stop the Environment

sh
./run_spark_hudi.sh stop

4. Restart the Environment

sh
./run_spark_hudi.sh restart

๐ŸŒ Accessing Services

โš™๏ธ Configuration

  • Spark, Hive, and Hudi configs are in conf/ and automatically copied into containers.
  • S3 access keys and endpoints are set for MinIO and referenced in Spark/Hive configs.

๐Ÿ“’ Example: Using JupyterLab

  1. Open http://localhost:8888 in your browser.
  2. Use the provided notebooks or create your own to interact with Spark and Hudi tables.
  3. Run Spark jobs that write/read Hudi datasets on MinIO S3.

๐Ÿงน Cleaning Up

To remove all containers and volumes:

sh
docker-compose down -v

๐Ÿ“– Notes

  • The Hive Metastore here uses Derby DB for simplicity. For production-like setups, replace Derby with MySQL/Postgres.
  • Spark jars include Hudi Spark bundle + Hadoop AWS jars to enable MinIO S3 and Hudi integration.

๐Ÿ›Ž๏ธ Support

๐Ÿ“š Further Reading

Spark Quick Start Guide Python/Rust Quick Start Guide

๐Ÿค Contributing

Please check out our contribution guide to learn more about how to contribute. For code contributions, please refer to the developer setup.