hudi-notebooks/notebooks/04-schema-evolution.ipynb

python
#  Licensed to the Apache Software Foundation (ASF) under one
#  or more contributor license agreements.  See the NOTICE file
#  distributed with this work for additional information
#  regarding copyright ownership.  The ASF licenses this file
#  to you under the Apache License, Version 2.0 (the
#  "License"); you may not use this file except in compliance
#  with the License.  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.

Schema Evolution with Apache Hudi: Concepts and Practical Use

Welcome to this hands-on guide to Schema Evolution with Apache Hudi. In a modern data lake, schemas are rarely static. As business requirements change, we need to be able to modify our data's structure, such as adding new columns or changing data types, without breaking our data pipelines. Hudi provides powerful, built-in features to manage these changes gracefully, turning your data lake into a flexible, database-like environment.

In this notebook, we will demonstrate the following key schema evolution concepts:

1. Schema Evolution on Write

Hudi can safely apply a number of schema changes at write time, as long as they adhere to its backward-compatibility rules.

Supported Changes:

  • Adding new nullable columns at root or nested levels
  • Promoting a field’s data type within the compatibility matrix (e.g., int → long)
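The promotion rules can be pictured as a small compatibility map. The sketch below only illustrates the idea (types may widen, never narrow); the authoritative matrix is defined inside Hudi itself, so treat the specific entries here as assumptions.

```python
# Illustrative compatibility map for write-time type promotion.
# NOT Hudi's internal code; it just mirrors the "widen only" rule.
PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
}

def is_compatible_promotion(old_type: str, new_type: str) -> bool:
    """True if a column may evolve from old_type to new_type."""
    return new_type in PROMOTIONS.get(old_type, {old_type})

print(is_compatible_promotion("int", "long"))   # widening is allowed
print(is_compatible_promotion("long", "int"))   # narrowing is rejected
```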

2. Schema Evolution on Read (Experimental)

Hudi also offers an experimental, more flexible mode of evolution that is resolved at query time. It allows operations such as renaming, deleting, or modifying nested columns when reading.

To enable it, set the following configuration:

  • hoodie.schema.on.read.enable=true

Supported transformations during read include:

  • Add/delete/modify/move columns
  • Rename columns
  • Operations on nested columns of the Array type
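Per the Hudi documentation, recent Hudi/Spark versions also expose these on-read operations through Spark SQL DDL. The statements below are only a sketch: the table and column names (`rides`, `tip_amount`, `tip`) are hypothetical, exact syntax and support depend on your Hudi and Spark versions, and each statement would be executed via `spark.sql(...)` against a live session.

```python
# Sketch of the DDL forms the Hudi docs describe for schema-on-read
# evolution. Table/column names are hypothetical placeholders.
on_read_ddl = [
    "SET hoodie.schema.on.read.enable=true",
    "ALTER TABLE rides ADD COLUMNS (tip_amount double)",         # add a column
    "ALTER TABLE rides RENAME COLUMN tip_amount TO tip",         # rename a column
    "ALTER TABLE rides ALTER COLUMN trip_duration TYPE bigint",  # promote a type
    "ALTER TABLE rides DROP COLUMN tip",                         # delete a column
]
for stmt in on_read_ddl:
    print(stmt)
```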

Setting up the Environment

We begin by importing the necessary libraries and starting a SparkSession configured to work with Hudi and MinIO.

python
%run utils.py

Now, let's start the SparkSession. We'll give it the app name 'Hudi Schema Evolution' and configure it to use our Hudi and MinIO settings.

python
spark = get_spark_session("Hudi Schema Evolution")

Initial Data and Table Creation

We'll start with a simple dataset of ride information. This will be the foundation of our Hudi table.

python
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType

initial_data = [
    ("2025-08-10 08:15:30", "uuid-101", "rider-X", "driver-A", 18.50, "new_york", 900),   # 15 mins
    ("2025-08-10 09:22:10", "uuid-102", "rider-Y", "driver-B", 22.75, "chicago", 1200),   # 20 mins
    ("2025-08-10 10:05:45", "uuid-103", "rider-Z", "driver-C", 14.60, "boston", 1100),    # 18 mins
    ("2025-08-10 11:25:25", "uuid-104", "rider-W", "driver-D", 18.90, "seattle", 850),    # 14 mins
    ("2025-08-10 11:55:30", "uuid-105", "rider-V", "driver-E", 20.40, "miami", 1000)      # 16.6 mins
]

# Schema for our dataset
initial_schema = StructType([
    StructField("ts", StringType(), False),
    StructField("uuid", StringType(), False),
    StructField("rider", StringType(), True),
    StructField("driver", StringType(), True),
    StructField("fare", DoubleType(), True),
    StructField("city", StringType(), True),
    StructField("trip_duration", IntegerType(), True),  # Initially int
])

initial_df = spark.createDataFrame(initial_data, initial_schema)
initial_df.printSchema()
display(initial_df)

Now, let's create a new Hudi table using this data. This table is the starting point for all our schema changes. While the concepts of schema evolution apply to both Copy-on-Write (COW) and Merge-on-Read (MOR) tables, this specific notebook demonstrates the process using a COW table.

python
table_name = "rides_schema_evolution"
base_path = "s3a://warehouse/hudi-schema-evolution"

hudi_conf = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "city"
}

initial_df.write.format("hudi") \
    .options(**hudi_conf) \
    .mode("overwrite") \
    .save(f"{base_path}/{table_name}")

# Register a temp view to easily query the table
spark.read.format("hudi").load(f"{base_path}/{table_name}").createOrReplaceTempView(table_name)

On-Write Evolution Example: Add a Column & Promote Type

A. Adding a New Column

Now, imagine we need to add a new column, ride_status, to our data. Hudi can handle this seamlessly. We'll create a new DataFrame that includes this column for an existing record and upsert it into our table.

python
# Create a new DataFrame with a new column for one record
new_column_data = [
    ("2025-08-10 08:15:30", "uuid-101", "rider-X", "driver-A", 18.50, "new_york", 900, "completed")
]

new_columns_schema = StructType([
    StructField("ts", StringType(), False),
    StructField("uuid", StringType(), False),
    StructField("rider", StringType(), True),
    StructField("driver", StringType(), True),
    StructField("fare", DoubleType(), True),
    StructField("city", StringType(), True),
    StructField("trip_duration", IntegerType(), True),  # Initially int
    StructField("ride_status", StringType(), True),
])

new_column_df = spark.createDataFrame(new_column_data, new_columns_schema)
new_column_df.printSchema()
display(new_column_df)

We'll now upsert the new data into our existing Hudi table. Hudi will detect the new schema, merge the data, and update the table's schema automatically. The records that don't have a value for ride_status will have null in the new column.

python
new_column_df.write.format("hudi") \
    .options(**hudi_conf) \
    .mode("append") \
    .save(f"{base_path}/{table_name}")

# Query the updated table to see the new column and its schema
updated_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_df.printSchema()
display(updated_df.select("uuid", "fare", "ride_status"))
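The null-filling behaviour can also be sketched in plain Python. This is purely a conceptual illustration of what a reader sees for records written before the column existed, not Hudi's implementation:

```python
# Conceptual sketch only: reading a pre-evolution record against the
# evolved schema null-fills the column that did not exist yet.
table_columns = ["uuid", "fare", "ride_status"]  # evolved table schema

old_record = {"uuid": "uuid-103", "fare": 14.60}  # written before ride_status existed
new_record = {"uuid": "uuid-101", "fare": 18.50, "ride_status": "completed"}

def project(record, columns):
    """Project a stored record onto the evolved schema."""
    return {col: record.get(col) for col in columns}

print(project(old_record, table_columns))  # ride_status comes back as None
print(project(new_record, table_columns))
```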

B. Changing a Column's Data Type

Next, let's see how Hudi handles a data type change. We'll promote the trip_duration column from an int to the wider long type. To ensure the write succeeds, we enable Hudi's write-time schema evolution support when writing trips_updated_df.

python
from pyspark.sql.types import LongType
from pyspark.sql.functions import lit

updated_data = [
    ("2025-08-10 09:30:40", "uuid-102", "rider-Y", "driver-B", 22.75, "chicago", 1350, "completed"),  # New value with long type
    ("2025-08-10 15:35:10", "uuid-106", "rider-Z", "driver-C", 14.60, "boston", 1500, "completed")    # New record
]

updated_schema = StructType([
    StructField("ts", StringType(), False),
    StructField("uuid", StringType(), False),
    StructField("rider", StringType(), True),
    StructField("driver", StringType(), True),
    StructField("fare", DoubleType(), True),
    StructField("city", StringType(), True),
    StructField("trip_duration", LongType(), True),  # upgraded to long
    StructField("ride_status", StringType(), True),
])

trips_updated_df = spark.createDataFrame(updated_data, updated_schema)

# Enable write-time schema evolution (with schema reconciliation off)
hudi_conf_update = hudi_conf.copy()
hudi_conf_update.update({
    "hoodie.datasource.write.reconcile.schema": "false",
    "hoodie.datasource.write.schema.evolution.enable": "true"
})

After upserting this new data, Hudi will automatically handle the schema change. The trip_duration column will now be a LongType in the table's schema, and the values from the earlier records will be correctly promoted to the new type.

python
# Upsert with new schema
trips_updated_df.write.format("hudi") \
    .options(**hudi_conf_update) \
    .mode("append") \
    .save(f"{base_path}/{table_name}")

# Query the table and check the schema
updated_schema_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_schema_df.printSchema()
display(updated_schema_df.select("uuid", "trip_duration"))

On-Read Evolution Example

Renaming a Column

Step 1: Load the Hudi Table & Rename the Column

python
# Load the Hudi table
hudi_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")

# Rename trip_duration → duration_in_seconds
renamed_df = hudi_df.withColumnRenamed("trip_duration", "duration_in_seconds")

Step 2: Write Back to Hudi (Upsert)

When writing back, ensure schema reconciliation is enabled so Hudi registers the new column name.

python
# Update Hudi config to enable schema reconciliation and schema-on-read
hudi_conf_rename = hudi_conf.copy()
hudi_conf_rename.update({
    "hoodie.datasource.write.reconcile.schema": "true",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.datasource.write.schema.on.read.enable": "true"
})

# Upsert with new schema
renamed_df.write.format("hudi") \
    .options(**hudi_conf_rename) \
    .mode("append") \
    .save(f"{base_path}/{table_name}")
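Conceptually, the schema reconciliation enabled above merges the incoming batch schema with the table's current schema, so columns missing from the batch are retained rather than dropped. A simplified pure-Python illustration of the idea (not Hudi's actual logic, which also reconciles types, nullability, and nested fields):

```python
# Simplified illustration of schema reconciliation: keep every existing
# table column (so none are silently dropped) and append any genuinely
# new columns from the incoming batch.
def reconcile(table_cols, batch_cols):
    return list(table_cols) + [c for c in batch_cols if c not in table_cols]

table_cols = ["ts", "uuid", "fare", "ride_status"]
batch_cols = ["ts", "uuid", "fare", "tip"]  # batch omits ride_status, adds tip

print(reconcile(table_cols, batch_cols))
# ['ts', 'uuid', 'fare', 'ride_status', 'tip']
```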

Step 3: Verify Schema Change

python
updated_df = spark.read.format("hudi").load(f"{base_path}/{table_name}")
updated_df.printSchema()

# Should show duration_in_seconds instead of trip_duration
display(updated_df.select("uuid", "fare", "duration_in_seconds", "ride_status"))
python
stop_spark_session()