docs/source/library-user-guide/upgrading/50.0.0.md
DataFusion 50.0.0 automatically infers Hive partitions when using the ListingTableFactory and CREATE EXTERNAL TABLE. Previously,
when creating a ListingTable, datasets that use Hive partitioning (e.g.
/table_root/column1=value1/column2=value2/data.parquet) would not have the Hive columns reflected in
the table's schema or data. The previous behavior can be
restored by setting the datafusion.execution.listing_table_factory_infer_partitions configuration option to false.
See issue #17049 for more details.
MSRV updated to 1.86.0The Minimum Supported Rust Version (MSRV) has been updated to 1.86.0.
See #17230 for details.
ScalarUDFImpl, AggregateUDFImpl and WindowUDFImpl traits now require PartialEq, Eq, and Hash traitsTo address error-proneness of ScalarUDFImpl::equals, AggregateUDFImpl::equalsand
WindowUDFImpl::equals methods and to make it easy to implement function equality correctly,
the equals and hash_value methods have been removed from ScalarUDFImpl, AggregateUDFImpl
and WindowUDFImpl traits. They are replaced the requirement to implement the PartialEq, Eq,
and Hash traits on any type implementing ScalarUDFImpl, AggregateUDFImpl or WindowUDFImpl.
Please see issue #16677 for more details.
Most of the scalar functions are stateless and have a signature field. These can be migrated
using regular expressions
\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\}),#[derive(Debug, PartialEq, Eq, Hash)]$1,AsyncScalarUDFImpl::invoke_async_with_args returns ColumnarValueIn order to enable single value optimizations and be consistent with other
user defined function APIs, the AsyncScalarUDFImpl::invoke_async_with_args method now
returns a ColumnarValue instead of a ArrayRef.
To upgrade, change the return type of your implementation
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ColumnarValue> {
..
return array_ref; // old code
}
}
# */
To return a ColumnarValue
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ColumnarValue> {
..
return ColumnarValue::from(array_ref); // new code
}
}
# */
See #16896 for more details.
ProjectionExpr changed from type alias to structProjectionExpr has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.
Before:
pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);
After:
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
pub expr: Arc<dyn PhysicalExpr>,
pub alias: String,
}
To upgrade your code:
(expr, alias) with ProjectionExpr::new(expr, alias) or ProjectionExpr { expr, alias }.0 and .1 with .expr and .alias(expr, alias) to ProjectionExpr { expr, alias }This mainly impacts use of ProjectionExec.
This change was done in #17398
SessionState, SessionConfig, and OptimizerConfig returns &Arc<ConfigOptions> instead of &ConfigOptionsTo provide broader access to ConfigOptions and reduce required clones, some
APIs have been changed to return a &Arc<ConfigOptions> instead of a
&ConfigOptions. This allows sharing the same ConfigOptions across multiple
threads without needing to clone the entire ConfigOptions structure unless it
is modified.
Most users will not be impacted by this change since the Rust compiler typically
automatically dereference the Arc when needed. However, in some cases you may
have to change your code to explicitly call as_ref() for example, from
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options();
# */
To
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options().as_ref();
# */
See PR #16970
AsyncScalarUDFImpl::invoke_async_with_argsThe invoke_async_with_args method of the AsyncScalarUDFImpl trait has been
updated to remove the _option: &ConfigOptions parameter to simplify the API
now that the ConfigOptions can be accessed through the ScalarFunctionArgs
parameter.
You can change your code like this
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ArrayRef> {
..
}
...
}
# */
To this:
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
) -> Result<ArrayRef> {
let options = &args.config_options;
..
}
...
}
# */
The schema_rewriter module and its associated symbols have been moved from datafusion_physical_expr to a new crate datafusion_physical_expr_adapter. This affects the following symbols:
DefaultPhysicalExprAdapterDefaultPhysicalExprAdapterFactoryPhysicalExprAdapterPhysicalExprAdapterFactoryTo upgrade, change your imports to:
use datafusion_physical_expr_adapter::{
DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory,
PhysicalExprAdapter, PhysicalExprAdapterFactory
};
56.0.0 and parquet 56.0.0This version of DataFusion upgrades the underlying Apache Arrow implementation
to version 56.0.0. See the release notes
for more details.
ExecutionPlan::reset_stateIn order to fix a bug in DataFusion 49.0.0 where dynamic filters (currently only generated in the presence of a query such as ORDER BY ... LIMIT ...)
produced incorrect results in recursive queries, a new method reset_state has been added to the ExecutionPlan trait.
Any ExecutionPlan that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state.
See #17028 for more details and an example implementation for SortExec.
The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. From the micro-benchmarks: this change introduces up to 5X speed-up and uses only 1% memory in extreme cases compared to the previous implementation.
However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.
See #16996 for details.
as_any() method to LazyBatchGeneratorTo help with protobuf serialization, the as_any() method has been added to the LazyBatchGenerator trait. This means you will need to add as_any() to your implementation of LazyBatchGenerator:
# /* comment to avoid running
impl LazyBatchGenerator for MyBatchGenerator {
fn as_any(&self) -> &dyn Any {
self
}
...
}
# */
See #17200 for details.
DataSource::try_swapping_with_projectionWe refactored DataSource::try_swapping_with_projection to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer.
Reimplementation for any custom DataSource should be relatively straightforward, see #17395 for more details.
FileOpenFuture now uses DataFusionError instead of ArrowErrorThe FileOpenFuture type alias has been updated to use DataFusionError instead of ArrowError for its error type. This change affects the FileOpener trait and any implementations that work with file streaming operations.
Before:
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
After:
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;
If you have custom implementations of FileOpener or work directly with FileOpenFuture, you'll need to update your error handling to use DataFusionError instead of ArrowError. The FileStreamState enum's Open variant has also been updated accordingly. See #17397 for more details.
The Foreign Function Interface (FFI) signature for user defined aggregate functions
has been updated to call return_field instead of return_type on the underlying
aggregate function. This is to support metadata handling with these aggregate functions.
This change should be transparent to most users. If you have written unit tests to call
return_type directly, you may need to change them to calling return_field instead.
This update is a breaking change to the FFI API. The current best practice when using the FFI crate is to ensure that all libraries that are interacting are using the same underlying Rust version. Issue #17374 has been opened to discuss stabilization of this interface so that these libraries can be used across different DataFusion versions.
See #17407 for details.
PhysicalExpr::is_volatile_nodeWe added a method to PhysicalExpr to mark a PhysicalExpr as volatile:
impl PhysicalExpr for MyRandomExpr {
fn is_volatile_node(&self) -> bool {
true
}
}
We've shipped this with a default value of false to minimize breakage but we highly recommend that implementers of PhysicalExpr opt into a behavior, even if it is returning false.
You can see more discussion and example implementations in #17351.