<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

# Upgrade Guides

## DataFusion 50.0.0

### `ListingTable` automatically detects Hive partitioned tables

DataFusion 50.0.0 automatically infers Hive partitions when using the `ListingTableFactory` and `CREATE EXTERNAL TABLE`. Previously, when creating a `ListingTable`, datasets that use Hive partitioning (e.g. `/table_root/column1=value1/column2=value2/data.parquet`) would not have the Hive partition columns reflected in the table's schema or data. The previous behavior can be restored by setting the `datafusion.execution.listing_table_factory_infer_partitions` configuration option to `false`. See issue #17049 for more details.
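If you depend on the previous behavior, the option can be disabled before creating the table. A minimal sketch (the table name and location here are hypothetical, and this assumes DataFusion's standard `SET` statement for session configuration):

```sql
-- Restore the pre-50.0.0 behavior: do not infer Hive partition columns
SET datafusion.execution.listing_table_factory_infer_partitions = false;

CREATE EXTERNAL TABLE my_table
STORED AS PARQUET
LOCATION '/table_root/';
```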

### MSRV updated to 1.86.0

The Minimum Supported Rust Version (MSRV) has been updated to 1.86.0. See #17230 for details.

### `ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits now require `PartialEq`, `Eq`, and `Hash` traits

To address the error-proneness of the `ScalarUDFImpl::equals`, `AggregateUDFImpl::equals`, and `WindowUDFImpl::equals` methods and to make it easier to implement function equality correctly, the `equals` and `hash_value` methods have been removed from the `ScalarUDFImpl`, `AggregateUDFImpl`, and `WindowUDFImpl` traits. They are replaced by the requirement to implement the `PartialEq`, `Eq`, and `Hash` traits on any type implementing `ScalarUDFImpl`, `AggregateUDFImpl`, or `WindowUDFImpl`. Please see issue #16677 for more details.

Most scalar functions are stateless and contain only a `signature` field. These can be migrated using regular expressions:

- search for `\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\})`,
- replace with `#[derive(Debug, PartialEq, Eq, Hash)]$1`,
- review all the changes and make sure only function structs were changed.

### `AsyncScalarUDFImpl::invoke_async_with_args` returns `ColumnarValue`

In order to enable single-value optimizations and be consistent with other user-defined function APIs, the `AsyncScalarUDFImpl::invoke_async_with_args` method now returns a `ColumnarValue` instead of an `ArrayRef`.

To upgrade, change the return type of your implementation:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
        return array_ref; // old code
    }
}
# */
```

To return a `ColumnarValue`:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ColumnarValue> {
        ..
        return ColumnarValue::from(array_ref); // new code
    }
}
# */
```

See #16896 for more details.

### `ProjectionExpr` changed from type alias to struct

ProjectionExpr has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.

Before:

```rust,ignore
pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);
```

After:

```rust,ignore
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
    pub expr: Arc<dyn PhysicalExpr>,
    pub alias: String,
}
```

To upgrade your code:

- Replace tuple construction `(expr, alias)` with `ProjectionExpr::new(expr, alias)` or `ProjectionExpr { expr, alias }`
- Replace tuple field access `.0` and `.1` with `.expr` and `.alias`
- Update pattern matching from `(expr, alias)` to `ProjectionExpr { expr, alias }`

This mainly impacts uses of `ProjectionExec`.

This change was made in #17398.

### `SessionState`, `SessionConfig`, and `OptimizerConfig` now return `&Arc<ConfigOptions>` instead of `&ConfigOptions`

To provide broader access to `ConfigOptions` and reduce required clones, some APIs have been changed to return a `&Arc<ConfigOptions>` instead of a `&ConfigOptions`. This allows sharing the same `ConfigOptions` across multiple threads without needing to clone the entire `ConfigOptions` structure unless it is modified.

Most users will not be impacted by this change, since the Rust compiler typically dereferences the `Arc` automatically when needed. However, in some cases you may have to change your code to explicitly call `as_ref()`. For example, from:

```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options();
#  */
```

To

```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options().as_ref();
#  */
```

See PR #16970 for details.

### API Change to `AsyncScalarUDFImpl::invoke_async_with_args`

The `invoke_async_with_args` method of the `AsyncScalarUDFImpl` trait has been updated to remove the `_option: &ConfigOptions` parameter, simplifying the API now that `ConfigOptions` can be accessed through the `ScalarFunctionArgs` parameter.

You can change your code like this:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
        _option: &ConfigOptions,
    ) -> Result<ArrayRef> {
        ..
    }
    ...
}
# */
```

To this:

```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
    async fn invoke_async_with_args(
        &self,
        args: ScalarFunctionArgs,
    ) -> Result<ArrayRef> {
        let options = &args.config_options;
        ..
    }
    ...
}
# */
```

### Schema Rewriter Module Moved to New Crate

The `schema_rewriter` module and its associated symbols have been moved from `datafusion_physical_expr` to a new crate, `datafusion_physical_expr_adapter`. This affects the following symbols:

- `DefaultPhysicalExprAdapter`
- `DefaultPhysicalExprAdapterFactory`
- `PhysicalExprAdapter`
- `PhysicalExprAdapterFactory`

To upgrade, change your imports to:

```rust
use datafusion_physical_expr_adapter::{
    DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory,
    PhysicalExprAdapter, PhysicalExprAdapterFactory
};
```

### Upgrade to arrow 56.0.0 and parquet 56.0.0

This version of DataFusion upgrades the underlying Apache Arrow implementation to version 56.0.0. See the release notes for more details.

### Added `ExecutionPlan::reset_state`

In order to fix a bug in DataFusion 49.0.0 where dynamic filters (currently only generated in the presence of a query such as `ORDER BY ... LIMIT ...`) produced incorrect results in recursive queries, a new method, `reset_state`, has been added to the `ExecutionPlan` trait.

Any `ExecutionPlan` that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state. See #17028 for more details and an example implementation for `SortExec`.

### Nested Loop Join input sort order cannot be preserved

The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. In micro-benchmarks, the new implementation is up to 5x faster and, in extreme cases, uses only 1% of the memory of the previous implementation.

However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.

See #16996 for details.

### Add `as_any()` method to `LazyBatchGenerator`

To help with protobuf serialization, the `as_any()` method has been added to the `LazyBatchGenerator` trait. This means you will need to add `as_any()` to your implementation of `LazyBatchGenerator`:

```rust
# /* comment to avoid running
impl LazyBatchGenerator for MyBatchGenerator {
    fn as_any(&self) -> &dyn Any {
        self
    }

    ...
}
# */
```

See #17200 for details.

### Refactored `DataSource::try_swapping_with_projection`

We refactored `DataSource::try_swapping_with_projection` to simplify the method and minimize leakage across the `ExecutionPlan` <-> `DataSource` abstraction layer. Reimplementation for any custom `DataSource` should be relatively straightforward; see #17395 for more details.

### `FileOpenFuture` now uses `DataFusionError` instead of `ArrowError`

The `FileOpenFuture` type alias has been updated to use `DataFusionError` instead of `ArrowError` for its error type. This change affects the `FileOpener` trait and any implementations that work with file streaming operations.

Before:

```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
```

After:

```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;
```

If you have custom implementations of `FileOpener` or work directly with `FileOpenFuture`, you'll need to update your error handling to use `DataFusionError` instead of `ArrowError`. The `FileStreamState` enum's `Open` variant has also been updated accordingly. See #17397 for more details.

### FFI user defined aggregate function signature change

The Foreign Function Interface (FFI) signature for user defined aggregate functions has been updated to call `return_field` instead of `return_type` on the underlying aggregate function. This is to support metadata handling with these aggregate functions. This change should be transparent to most users. If you have written unit tests that call `return_type` directly, you may need to change them to call `return_field` instead.

This update is a breaking change to the FFI API. The current best practice when using the FFI crate is to ensure that all libraries that are interacting are using the same underlying Rust version. Issue #17374 has been opened to discuss stabilization of this interface so that these libraries can be used across different DataFusion versions.

See #17407 for details.

### Added `PhysicalExpr::is_volatile_node`

We added a method to `PhysicalExpr` that marks a `PhysicalExpr` as volatile:

```rust,ignore
impl PhysicalExpr for MyRandomExpr {
    fn is_volatile_node(&self) -> bool {
        true
    }
}
```

We've shipped this with a default value of `false` to minimize breakage, but we highly recommend that implementers of `PhysicalExpr` explicitly opt into a behavior, even if it is returning `false`.

You can see more discussion and example implementations in #17351.