hudi-notebooks/notebooks/04-schema-evolution.ipynb
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Welcome to this hands-on guide to Schema Evolution with Apache Hudi. In a modern data lake, schemas are rarely static. As business requirements change, we need to be able to modify our data's structure, for example by adding new columns or changing data types, without breaking our data pipelines. Hudi provides powerful, built-in features to manage these changes gracefully, turning your data lake into a flexible, database-like environment.
In this notebook, we will demonstrate the following key schema evolution concepts:
Hudi allows several schema changes to be applied safely as long as they adhere to backward-compatibility rules.
Supported changes include:
- Adding new nullable columns; existing records read back with null for the new column
- Promoting a column to a wider type, for example int to long or float to double
Hudi also offers an experimental, more flexible schema-on-read mode that applies only during queries. It allows operations like renaming, dropping, or reordering columns, including nested ones, when reading.
To enable it, set hoodie.schema.on.read.enable=true, as we do in the rename example later in this notebook.
Supported transformations during read include:
- Renaming columns
- Dropping columns
- Reordering columns
- Widening column types
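As an illustrative aside, the backward-compatible type promotions can be sketched as a small compatibility check. The promotion table below is a simplified assumption for illustration, not Hudi's actual implementation:

```python
# Simplified sketch of backward-compatible type promotions,
# loosely modeled on Hudi's rules (int -> long, float -> double, etc.).
# This table is an illustrative assumption, not Hudi's actual code.
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
    "string": {"string"},
}

def is_backward_compatible(old_type: str, new_type: str) -> bool:
    """True if a column can evolve from old_type to new_type without rewriting data."""
    return new_type in PROMOTIONS.get(old_type, {old_type})

print(is_backward_compatible("int", "long"))   # the change made later in this notebook
print(is_backward_compatible("long", "int"))   # narrowing is not allowed
```

The key property is that promotions are one-directional: widening a type never loses data, while narrowing could, so only widening is accepted.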
First, we begin by importing our necessary libraries and starting a SparkSession configured to work with Hudi and MinIO.
%run utils.py
Now, let's start the SparkSession. We'll give it the app name 'Hudi Schema Evolution' and configure it to use our Hudi and MinIO settings.
spark = get_spark_session("Hudi Schema Evolution")
We'll start with a simple dataset of ride information. This will be the foundation of our Hudi table.
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType
initial_data = [
("2025-08-10 08:15:30", "uuid-101", "rider-X", "driver-A", 18.50, "new_york", 900), # 15 mins
("2025-08-10 09:22:10", "uuid-102", "rider-Y", "driver-B", 22.75, "chicago", 1200), # 20 mins
("2025-08-10 10:05:45", "uuid-103", "rider-Z", "driver-C", 14.60, "boston", 1100), # ~18.3 mins
("2025-08-10 11:25:25", "uuid-104", "rider-W", "driver-D", 18.90, "seattle", 850), # ~14.2 mins
("2025-08-10 11:55:30", "uuid-105", "rider-V", "driver-E", 20.40, "miami", 1000) # ~16.7 mins
]
# Schema for our dataset
initial_schema = StructType([
StructField("ts", StringType(), False),
StructField("uuid", StringType(), False),
StructField("rider", StringType(), True),
StructField("driver", StringType(), True),
StructField("fare", DoubleType(), True),
StructField("city", StringType(), True),
StructField("trip_duration", IntegerType(), True), # Initially int
])
initial_df = spark.createDataFrame(initial_data, initial_schema)
initial_df.printSchema()
display(initial_df)
Now, let's create a new Hudi table using this data. This table is the starting point for all our schema changes. While the concepts of schema evolution apply to both Copy-on-Write (COW) and Merge-on-Read (MOR) tables, this notebook demonstrates the process using a COW table.
table_name = "rides_schema_evolution"
base_path = "s3a://warehouse/hudi-schema-evolution"
hudi_conf = {
"hoodie.table.name": table_name,
"hoodie.datasource.write.recordkey.field": "uuid",
"hoodie.datasource.write.table.type": "COPY_ON_WRITE",
"hoodie.datasource.write.precombine.field": "ts",
"hoodie.datasource.write.partitionpath.field": "city"
}
initial_df.write.format("hudi") \
.options(**hudi_conf) \
.mode("overwrite") \
.save(f"{base_path}/{table_name}")
# Register a temp view to easily query the table
spark.read.format("hudi").load(f"{base_path}/{table_name}").createOrReplaceTempView(table_name)
Now, imagine we need to add a new column, ride_status, to our data. Hudi can handle this seamlessly. We'll create a new DataFrame that includes this column for an existing record and upsert it into our table.
# Create a new DataFrame with a new column for one record
new_column_data = [
("2025-08-10 08:15:30", "uuid-101", "rider-X", "driver-A", 18.50, "new_york", 900, "completed")
]
new_columns_schema = StructType([
StructField("ts", StringType(), False),
StructField("uuid", StringType(), False),
StructField("rider", StringType(), True),
StructField("driver", StringType(), True),
StructField("fare", DoubleType(), True),
StructField("city", StringType(), True),
StructField("trip_duration", IntegerType(), True), # Initially int
StructField("ride_status", StringType(), True),
])
new_column_df = spark.createDataFrame(new_column_data, new_columns_schema)
new_column_df.printSchema()
display(new_column_df)
We'll now upsert the new data into our existing Hudi table. Hudi will detect the new schema, merge the data, and update the table's schema automatically. The records that don't have a value for ride_status will have null in the new column.
new_column_df.write.format("hudi") \
.options(**hudi_conf) \
.mode("append") \
.save(f"{base_path}/{table_name}")
# Query the updated table to see the new column and its schema
updated_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_df.printSchema()
display(updated_df.select("uuid", "fare", "ride_status"))
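The null-backfill behavior can be sketched in plain Python, independent of Spark: when an upsert introduces a new column, records that don't carry it simply read back as None. The `upsert` helper and record dicts below are illustrative, not Hudi's API:

```python
# Plain-Python sketch of what the upsert above does to the schema:
# records are keyed by `uuid`; columns missing from an old record
# are backfilled with None under the widened schema.
def upsert(table: dict, incoming: list) -> dict:
    table = {k: dict(v) for k, v in table.items()}
    # The merged schema is the union of all column names seen so far.
    columns = set()
    for record in list(table.values()) + incoming:
        columns.update(record.keys())
    for record in incoming:
        table[record["uuid"]] = dict(record)
    # Backfill columns that a record does not carry.
    for record in table.values():
        for col in columns:
            record.setdefault(col, None)
    return table

table = {"uuid-102": {"uuid": "uuid-102", "fare": 22.75}}
new = [{"uuid": "uuid-101", "fare": 18.50, "ride_status": "completed"}]
merged = upsert(table, new)
print(merged["uuid-102"]["ride_status"])  # None: old record backfilled
```

This mirrors what the query above shows: only uuid-101 has a ride_status value; every other row gets null.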
Next, let's see how Hudi handles a data type change. We'll widen the trip_duration column from an int to a more precise long. To ensure the write succeeds, we will disable schema reconciliation and enable schema evolution in the write configuration before upserting trips_updated_df.
from pyspark.sql.types import LongType
from pyspark.sql.functions import lit
updated_data = [
("2025-08-10 09:30:40", "uuid-102", "rider-Y", "driver-B", 22.75, "chicago", 1350, "completed"), # New value with long type
("2025-08-10 15:35:10", "uuid-106", "rider-Z", "driver-C", 14.60, "boston", 1500, "completed") # New record
]
updated_schema = StructType([
StructField("ts", StringType(), False),
StructField("uuid", StringType(), False),
StructField("rider", StringType(), True),
StructField("driver", StringType(), True),
StructField("fare", DoubleType(), True),
StructField("city", StringType(), True),
StructField("trip_duration", LongType(), True), # upgraded to long
StructField("ride_status", StringType(), True),
])
trips_updated_df = spark.createDataFrame(updated_data, updated_schema)
# Update Hudi config: disable schema reconciliation and enable schema evolution
hudi_conf_update = hudi_conf.copy()
hudi_conf_update.update({
"hoodie.datasource.write.reconcile.schema": "false",
"hoodie.datasource.write.schema.evolution.enable": "true"
})
After upserting this new data, Hudi will automatically handle the schema change. The trip_duration column will now be a long in the table's schema, and the values from the existing records will be correctly promoted to the new type.
# Upsert with new schema
trips_updated_df.write.format("hudi") \
.options(**hudi_conf_update) \
.mode("append") \
.save(f"{base_path}/{table_name}")
# Query the table and check the schema
updated_schema_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_schema_df.printSchema()
display(updated_schema_df.select("uuid", "trip_duration"))
Finally, let's rename a column. Renaming is not a backward-compatible change, so it relies on Hudi's experimental schema-on-read support.
Step 1: Load the Hudi Table and Rename the Column
# Load the Hudi table
hudi_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
# Rename trip_duration → duration_in_seconds
renamed_df = hudi_df.withColumnRenamed("trip_duration", "duration_in_seconds")
Step 2: Write Back to Hudi (Upsert)
When writing back, ensure schema reconciliation and schema-on-read are enabled so Hudi registers the new column name.
# Update Hudi config to enable schema reconciliation and schema-on-read
hudi_conf_rename = hudi_conf.copy()
hudi_conf_rename.update({
"hoodie.datasource.write.reconcile.schema": "true",
"hoodie.schema.on.read.enable": "true",
"hoodie.datasource.write.schema.on.read.enable": "true"
})
# Upsert with new schema
renamed_df.write.format("hudi") \
.options(**hudi_conf_rename) \
.mode("append") \
.save(f"{base_path}/{table_name}")
Step 3: Verify Schema Change
updated_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_df.printSchema()
# Should show duration_in_seconds instead of trip_duration
display(updated_df.select("uuid", "fare", "duration_in_seconds", "ride_status"))
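Conceptually, schema-on-read can resolve a rename because columns are tracked by a stable field id, so the rename only updates the name-to-id mapping rather than rewriting data files. The sketch below is a conceptual model of that idea, with hypothetical names, not Hudi's actual metadata layout:

```python
# Illustrative sketch: under schema-on-read, a column is identified by a
# stable field id, so renaming only changes the latest id -> name mapping.
# This is a conceptual model, not Hudi's internal representation.
file_schema = {1: "trip_duration"}          # id -> name as written in old files
latest_schema = {1: "duration_in_seconds"}  # id -> name after the rename

def read_record(raw: dict) -> dict:
    """Project a record written under file_schema onto the latest schema."""
    name_to_id = {name: fid for fid, name in file_schema.items()}
    return {latest_schema[name_to_id[name]]: value for name, value in raw.items()}

old_record = {"trip_duration": 900}
print(read_record(old_record))  # {'duration_in_seconds': 900}
```

Because resolution happens at read time, old files never need to be rewritten; queries simply present the column under its latest name.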
stop_spark_session()