<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

# DataFusion 48.0.0

## `Expr::Literal` has optional metadata

The `Expr::Literal` variant now includes optional metadata, which allows Arrow field metadata to be carried through to support extension types and other uses.

This means code such as:

```rust
# /* comment to avoid running
match expr {
...
  Expr::Literal(scalar) => ...
...
}
# */
```

should be updated to:

```rust
# /* comment to avoid running
match expr {
...
  Expr::Literal(scalar, _metadata) => ...
...
}
# */
```

Likewise, constructing an `Expr::Literal` now requires the metadata argument as well. The `lit` function has not changed and returns an `Expr::Literal` with no metadata.
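For example, a minimal sketch of constructing literals under the new API (assuming the metadata is passed as an `Option`):

```rust
# /* comment to avoid running
// `lit` is unchanged and produces a literal with no metadata:
let expr = lit(5i32);

// Constructing the variant directly now takes the metadata as a
// second argument; pass `None` when there is none:
let expr = Expr::Literal(ScalarValue::Int32(Some(5)), None);
# */
```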

## `Expr::WindowFunction` is now Boxed

`Expr::WindowFunction` is now a `Box<WindowFunction>` instead of a `WindowFunction` directly. This change reduces the size of `Expr` and improves performance when planning queries (see #16207 for details).

This is a breaking change, so you will need to update your code if you match on `Expr::WindowFunction` directly. For example, if you have code like this:

```rust
# /* comment to avoid running
match expr {
  Expr::WindowFunction(WindowFunction {
    params:
      WindowFunctionParams {
        partition_by,
        order_by,
        ..
      },
  }) => {
    // Use partition_by and order_by as needed
  }
  _ => {
    // other expr
  }
}
# */
```

You will need to change it to:

```rust
# /* comment to avoid running
match expr {
  Expr::WindowFunction(window_fun) => {
    let WindowFunction {
      fun,
      params: WindowFunctionParams {
        args,
        partition_by,
        ..
      },
    } = window_fun.as_ref();
    // Use fun, args, and partition_by as needed
  }
  _ => {
    // other expr
  }
}
# */
```

## The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow

The mapping of the SQL `VARCHAR` type has been changed from `Utf8` to `Utf8View`, which improves performance for many string operations. You can read more about `Utf8View` in the DataFusion blog post on German-style strings.

This means that when you create a table with a `VARCHAR` column, it will now use `Utf8View` as the underlying data type. For example:

```sql
> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.001 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8View  | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.
```

You can restore the old behavior of using `Utf8` by changing the `datafusion.sql_parser.map_varchar_to_utf8view` configuration setting. For example:

```sql
> set datafusion.sql_parser.map_varchar_to_utf8view = false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.014 seconds.

> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column   | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
```

## `ListingOptions` default for `collect_stat` changed from `true` to `false`

This makes it agree with the default for `SessionConfig`. Most users won't be impacted by this change, but if you were using `ListingOptions` directly and relied on the default value of `collect_stat` being `true`, you will need to explicitly set it to `true` in your code:

```rust
# /* comment to avoid running
ListingOptions::new(Arc::new(ParquetFormat::default()))
    .with_collect_stat(true)
    // other options
# */
```

## Processing `FieldRef` instead of `DataType` for user defined functions

In order to support metadata handling and extension types, the user defined function traits now use `FieldRef` rather than a `DataType` and nullability flag. This gives a single interface for both of those parameters and additionally allows access to field metadata, which can be used for extension types.

To upgrade structs which implement `ScalarUDFImpl`: if you have implemented `return_type_from_args`, you need to implement `return_field_from_args` instead. If your function does not need to handle metadata, this should be a straightforward repackaging of the output data into a `FieldRef`. The name you specify on the field is not important; it will be overwritten during planning. `ReturnInfo` has been removed, so you will need to remove all references to it.

`ScalarFunctionArgs` now contains a field called `arg_fields`. You can use this to access the metadata associated with the columnar values during invocation.
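As a sketch of the shape of the new method (the struct name `MyFunc` and the returned `Utf8` type are illustrative, not part of the API):

```rust
# /* comment to avoid running
impl ScalarUDFImpl for MyFunc {
    // ... other trait methods ...

    // Replaces the old `return_type_from_args`. The name given to the
    // Field is ignored and overwritten during planning.
    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
        Ok(Arc::new(Field::new(self.name(), DataType::Utf8, true)))
    }
}
# */
```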

To upgrade user defined aggregate functions, there is now a function `return_field` that allows you to specify both the metadata and the nullability of your function. You are not required to implement it if you do not need to handle metadata.

The largest change to aggregate functions is in the accumulator arguments: both `AccumulatorArgs` and `StateFieldsArgs` now contain `FieldRef` rather than `DataType`.

To upgrade window functions, `ExpressionArgs` now contains input fields instead of input data types. When setting these fields, the name of the field is not important since it is overwritten during the planning stage. All you should need to do is wrap your existing data types in fields, with nullability set depending on your use case.

## Physical expressions return `Field`

To support the metadata handling changes in user defined functions, the `PhysicalExpr` trait must now specify a return `Field` based on the input schema. To upgrade structs which implement `PhysicalExpr`, you need to implement the `return_field` function. There are numerous examples in the `physical-expr` crate.
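A minimal sketch of what an implementation might look like (the struct `MyExpr` and field name are illustrative; consult the examples in the `physical-expr` crate for real implementations):

```rust
# /* comment to avoid running
impl PhysicalExpr for MyExpr {
    // ... other trait methods ...

    fn return_field(&self, input_schema: &Schema) -> Result<FieldRef> {
        // The field name is not significant here
        Ok(Arc::new(Field::new(
            "my_expr",
            self.data_type(input_schema)?,
            self.nullable(input_schema)?,
        )))
    }
}
# */
```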

## `FileFormat::supports_filters_pushdown` replaced with `FileSource::try_pushdown_filters`

To support more general filter pushdown, `FileFormat::supports_filters_pushdown` was replaced with `FileSource::try_pushdown_filters`. If you implemented a custom `FileFormat` that uses a custom `FileSource`, you will need to implement `FileSource::try_pushdown_filters`. See `ParquetSource::try_pushdown_filters` for an example of how to implement this.

`FileFormat::supports_filters_pushdown` has been removed.

## `ParquetExec`, `AvroExec`, `CsvExec`, and `JsonExec` Removed

`ParquetExec`, `AvroExec`, `CsvExec`, and `JsonExec` were deprecated in DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal process described in the API Deprecation Guidelines because all the tests cover the new `DataSourceExec` rather than the older structures. As `DataSource` evolved, the old structures began to show signs of "bit rotting" (not working, with no one noticing due to the lack of test coverage).

## `PartitionedFile` added as an argument to the `FileOpener` trait

This is necessary to properly fix filter pushdown for filters that combine partition columns and file columns (e.g. `day = username['dob']`).

If you implemented a custom `FileOpener`, you will need to add the `PartitionedFile` argument, but you are not required to use it in any way.
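A sketch of the updated method, assuming the new argument is added to `open` (the struct `MyOpener` and the parameter name `file` are illustrative):

```rust
# /* comment to avoid running
impl FileOpener for MyOpener {
    fn open(&self, file_meta: FileMeta, file: PartitionedFile) -> Result<FileOpenFuture> {
        // `file` can simply be ignored if you do not need access to
        // partition values for this file
        todo!()
    }
}
# */
```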