# Hudi Docker Compose Environment (`docker/thirdparties/docker-compose/hudi`)
This directory contains the Docker Compose configuration for setting up a Hudi test environment with Spark, Hive Metastore, MinIO (S3-compatible storage), and PostgreSQL.
## Configuration

### Container Naming

Container names are prefixed with `CONTAINER_UID` from `custom_settings.env` (default: `doris--`). For example, `CONTAINER_UID="doris--bender--"` produces container names such as `doris--bender--hudi-spark`.

### Ports (hudi.env.tpl)

- `HIVE_METASTORE_PORT`: Port for Hive Metastore Thrift service (default: 19083)
- `MINIO_API_PORT`: MinIO S3 API port (default: 19100)
- `MINIO_CONSOLE_PORT`: MinIO web console port (default: 19101)
- `SPARK_UI_PORT`: Spark web UI port (default: 18080)

### Storage (hudi.env.tpl)

- `MINIO_ROOT_USER`: MinIO access key (default: `minio`)
- `MINIO_ROOT_PASSWORD`: MinIO secret key (default: `minio123`)
- `HUDI_BUCKET`: S3 bucket name for Hudi data (default: `datalake`)

⚠️ Important: Hadoop versions must match Spark's built-in Hadoop version.
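As a sketch of how these settings interact (the variable names come from `custom_settings.env` and `hudi.env.tpl`; the values below are examples, not defaults):

```shell
# Hypothetical overrides for a local run. CONTAINER_UID and the port
# variables are real setting names; the values here are examples only.
CONTAINER_UID="doris--bender--"   # container name prefix
HIVE_METASTORE_PORT=29083         # avoid clashing with another stack on 19083
MINIO_API_PORT=29100

# With this prefix, the Spark container would be named:
echo "${CONTAINER_UID}hudi-spark"
```

With the example prefix above, all commands later in this README that reference `doris--hudi-spark` would use `doris--bender--hudi-spark` instead.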
### JAR Dependencies (hudi.env.tpl)

All JAR file versions and URLs are configurable:

- `HUDI_BUNDLE_VERSION` / `HUDI_BUNDLE_URL`: Hudi Spark bundle
- `HADOOP_AWS_VERSION` / `HADOOP_AWS_URL`: Hadoop S3A filesystem support
- `AWS_SDK_BUNDLE_VERSION` / `AWS_SDK_BUNDLE_URL`: AWS Java SDK Bundle v1 (1.12.x series; required for Hadoop 3.3.4 S3A support)
- `POSTGRESQL_JDBC_VERSION` / `POSTGRESQL_JDBC_URL`: PostgreSQL JDBC driver

Note: `hadoop-common` is already included in Spark's built-in Hadoop distribution, so it is not configured here.

## Usage

```shell
# Start Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi

# Stop Hudi environment
./docker/thirdparties/run-thirdparties-docker.sh -c hudi --stop
```
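Startup is asynchronous: JARs are downloaded and the preinstalled SQL scripts run before a `SUCCESS` marker is written. A minimal polling sketch — `check_ready` is a stand-in probe supplied by the caller, not part of the environment:

```shell
# Poll until the environment's init marker appears, or give up.
# check_ready is injected by the caller; against the real environment it
# would run: docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS
wait_for_init() {
  local tries=$1 i=0
  while [ "$i" -lt "$tries" ]; do
    if check_ready; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out"
  return 1
}
```

Example use: `check_ready() { docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS; }; wait_for_init 60`.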
## Adding Data

⚠️ Important: To ensure data consistency across Docker restarts, add data only through SQL scripts. Data added through the spark-sql interactive shell is temporary and will not survive a container restart.

Add new SQL files to the `scripts/create_preinstalled_scripts/hudi/` directory, following the numbered naming convention (`01_config_and_database.sql`, `02_create_user_activity_log_tables.sql`, etc.). Scripts may use the `${HIVE_METASTORE_URIS}` and `${HUDI_BUCKET}` placeholders, which are substituted at startup.

Example: create `08_create_custom_table.sql`:
```sql
USE regression_hudi;

CREATE TABLE IF NOT EXISTS my_hudi_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING hudi
TBLPROPERTIES (
    type = 'cow',
    primaryKey = 'id',
    preCombineField = 'created_at',
    hoodie.datasource.hive_sync.enable = 'true',
    hoodie.datasource.hive_sync.metastore.uris = '${HIVE_METASTORE_URIS}',
    hoodie.datasource.hive_sync.mode = 'hms'
)
LOCATION 's3a://${HUDI_BUCKET}/warehouse/regression_hudi/my_hudi_table';

INSERT INTO my_hudi_table VALUES
    (1, 'Alice', TIMESTAMP '2024-01-01 10:00:00'),
    (2, 'Bob', TIMESTAMP '2024-01-02 11:00:00');
```
After adding SQL files, restart the container to execute them:
```shell
docker restart doris--hudi-spark
```
After starting the Hudi Docker environment, you can create a Hudi catalog in Doris to access Hudi tables:
```sql
-- Create Hudi catalog
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<externalEnvIp>:19083',
    's3.endpoint' = 'http://<externalEnvIp>:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);

-- Switch to Hudi catalog
SWITCH hudi_catalog;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Query Hudi table
SELECT * FROM user_activity_log_cow_partition LIMIT 10;
```
Configuration Parameters:
- `hive.metastore.uris`: Hive Metastore Thrift service address (default port: 19083)
- `s3.endpoint`: MinIO S3 API endpoint (default port: 19100)
- `s3.access_key`: MinIO access key (default: `minio`)
- `s3.secret_key`: MinIO secret key (default: `minio123`)
- `s3.region`: S3 region (default: `us-east-1`)
- `use_path_style`: Use path-style access for MinIO (required: `true`)

Replace `<externalEnvIp>` with your actual external environment IP address (e.g., `127.0.0.1` for localhost).
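Since the statement must be edited for each host, a small sketch that renders it for a concrete IP can help (the `EXTERNAL_IP` variable and heredoc are illustrative, not part of the environment):

```shell
# Render the CREATE CATALOG statement for a concrete host IP.
# EXTERNAL_IP is an example value; substitute your machine's address.
EXTERNAL_IP=127.0.0.1
sql=$(cat <<EOF
CREATE CATALOG IF NOT EXISTS hudi_catalog PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://${EXTERNAL_IP}:19083',
    's3.endpoint' = 'http://${EXTERNAL_IP}:19100',
    's3.access_key' = 'minio',
    's3.secret_key' = 'minio123',
    's3.region' = 'us-east-1',
    'use_path_style' = 'true'
);
EOF
)
printf '%s\n' "$sql"
```

The printed statement can then be pasted into a Doris MySQL client session.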
⚠️ Note: The methods below are for debugging purposes only. Data created through spark-sql interactive shell will not persist after Docker restart. To add persistent data, use SQL scripts as described in the "Adding Data" section.
```shell
docker exec -it doris--hudi-spark bash

/opt/spark/bin/spark-sql \
    --master 'local[*]' \
    --name hudi-debug \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.sql.catalogImplementation=hive \
    --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
    --conf spark.sql.warehouse.dir=s3a://datalake/warehouse
```
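spark-sql also accepts `-e '<statement>'` and `-f <file>` for non-interactive runs, which is handy for scripted debugging. A sketch that only assembles the command line so it can be inspected without a running container (the real call would drop the `echo` and keep the full set of `--conf` flags shown above):

```shell
# Assemble (but do not run) a non-interactive spark-sql invocation.
# Flags are abbreviated here; a real call needs the full --conf set above.
build_spark_sql_cmd() {
  echo /opt/spark/bin/spark-sql --master 'local[*]' \
    --conf spark.sql.catalogImplementation=hive \
    -e "$1"
}
build_spark_sql_cmd 'SHOW DATABASES;'
```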
```sql
-- Show databases
SHOW DATABASES;

-- Use database
USE regression_hudi;

-- Show tables
SHOW TABLES;

-- Describe table structure
DESCRIBE EXTENDED user_activity_log_cow_partition;

-- Query data
SELECT * FROM user_activity_log_cow_partition LIMIT 10;

-- Check Hudi table properties
SHOW TBLPROPERTIES user_activity_log_cow_partition;

-- View Spark configuration
SET -v;

-- Check Hudi-specific configurations
SET hoodie.datasource.write.hive_style_partitioning;
```
Access the Spark Web UI at http://localhost:18080 (or the configured `SPARK_UI_PORT`).
```shell
# View Spark container logs
docker logs doris--hudi-spark --tail 100 -f

# View Hive Metastore logs
docker logs doris--hudi-metastore --tail 100 -f

# View MinIO logs
docker logs doris--hudi-minio --tail 100 -f

# Access MinIO console
# URL: http://localhost:19101 (or configured MINIO_CONSOLE_PORT)
# Username: minio (or MINIO_ROOT_USER)
# Password: minio123 (or MINIO_ROOT_PASSWORD)

# Or use MinIO client
docker exec -it doris--hudi-minio-mc mc ls myminio/datalake/warehouse/regression_hudi/
```
## Troubleshooting

- Spark container / init scripts:
  - Check logs: `docker logs doris--hudi-spark`
  - Verify the init marker exists: `docker exec doris--hudi-spark test -f /opt/hudi-scripts/SUCCESS`
- JAR downloads:
  - Inspect the cache: `docker exec doris--hudi-spark ls -lh /opt/hudi-cache/`
  - Check `hudi.env.tpl` for correct version numbers
- Hive Metastore:
  - Check the container is running: `docker ps | grep metastore`
  - Confirm readiness: `docker logs doris--hudi-metastore | grep "Metastore is ready"`
- MinIO:
  - Check the container is running: `docker ps | grep minio`
  - Check credentials in `hudi.env.tpl`
  - List buckets: `docker exec doris--hudi-minio-mc mc ls myminio/`
- PostgreSQL (Metastore DB):
  - Check the container is running: `docker ps | grep metastore-db`
  - Probe readiness: `docker exec doris--hudi-metastore-db pg_isready -U hive`

## Directory Structure

```
hudi/
├── hudi.yaml.tpl        # Docker Compose template
├── hudi.env.tpl         # Environment variables template
├── scripts/
│   ├── init.sh          # Initialization script
│   ├── create_preinstalled_scripts/
│   │   └── hudi/        # SQL scripts (01_config_and_database.sql, 02_create_user_activity_log_tables.sql, ...)
│   └── SUCCESS          # Initialization marker (generated)
└── cache/               # Downloaded JAR files (generated)
```
Generated files (`.yaml`, `.env`, `cache/`, `SUCCESS`) are ignored by Git. Template files use `${VARIABLE_NAME}` syntax for variable substitution.
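A minimal illustration of that substitution, using `sed` (the actual `init.sh` may use a different mechanism, e.g. `envsubst`):

```shell
# Illustrative only: how a ${VARIABLE_NAME} placeholder in a .tpl file
# could be rendered. The real init.sh may substitute differently.
HIVE_METASTORE_PORT=19083
template='hive.metastore.uris=thrift://metastore:${HIVE_METASTORE_PORT}'
rendered=$(printf '%s' "$template" | sed "s/\${HIVE_METASTORE_PORT}/${HIVE_METASTORE_PORT}/")
printf '%s\n' "$rendered"
```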