Apache DataFusion 40.0.0 Changelog

<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

This release consists of 263 commits from 64 contributors. See credits at the end of this changelog for more information.

Breaking changes:

  • Convert StringAgg to UDAF #10945 (lewiszlw)
  • Convert bool_and & bool_or to UDAF #11009 (jcsherin)
  • Convert Average to UDAF #10942 #10964 (dharanad)
  • fix: remove the Sized requirement on ExecutionPlan::name() #11047 (waynexia)
  • Return &Arc reference to inner trait object #11103 (linhr)
  • Support COPY TO Externally Defined File Formats, add FileType trait #11060 (devinjdangelo)
  • expose table name in proto extension codec #11139 (leoyvens)
  • fix(typo): unqualifed to unqualified #11159 (waynexia)
  • Consolidate Filter::remove_aliases into Expr::unalias_nested #11001 (alamb)
  • Convert nth_value to UDAF #11287 (jcsherin)
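
For readers unsure what a change like #11103 means for calling code, here is a minimal sketch of the general pattern, reading the title as "accessors hand back a borrowed `&Arc<dyn Trait>` rather than a cloned `Arc`". The trait `FunctionImpl`, the types `Upper` and `Wrapper`, and the helper `share` are hypothetical stand-ins invented for illustration; they are not DataFusion's types, and the actual signatures should be checked against the PR.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a trait object held behind an `Arc`.
trait FunctionImpl {
    fn name(&self) -> &str;
}

/// Hypothetical concrete implementation.
struct Upper;

impl FunctionImpl for Upper {
    fn name(&self) -> &str {
        "upper"
    }
}

/// Hypothetical wrapper that owns a shared trait object.
struct Wrapper {
    inner: Arc<dyn FunctionImpl>,
}

impl Wrapper {
    fn new(inner: Arc<dyn FunctionImpl>) -> Self {
        Self { inner }
    }

    /// Borrow the shared pointer; the reference count is not touched.
    fn inner(&self) -> &Arc<dyn FunctionImpl> {
        &self.inner
    }
}

/// Callers that need ownership opt in to the clone explicitly.
fn share(wrapper: &Wrapper) -> Arc<dyn FunctionImpl> {
    Arc::clone(wrapper.inner())
}
```

Under this pattern, code that previously received an owned `Arc` from the accessor would add an explicit `Arc::clone(...)` where it needs ownership, while read-only callers avoid the reference-count update.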

Implemented enhancements:

  • feat: Add support for Int8 and Int16 data types in data page statistics #10931 (Weijun-H)
  • feat: add CliSessionContext trait for cli #10890 (tshauck)
  • feat(optimizer): handle partial anchored regex cases and improve doc #10977 (waynexia)
  • feat: support uint data page extraction #11018 (tshauck)
  • feat: propagate EmptyRelation for more join types #10963 (tshauck)
  • feat: Add method to add analyzer rules to SessionContext #10849 (pingsutw)
  • feat: Support duplicate column names in Joins in Substrait consumer #11049 (Blizzara)
  • feat: Add support for Timestamp data types in data page statistics. #11123 (efredine)
  • feat: Add support for Binary/LargeBinary/Utf8/LargeUtf8 data types in data page statistics #11136 (PsiACE)
  • feat: Support Map type in Substrait conversions #11129 (Blizzara)
  • feat: Conditionally allow to keep partition_by columns when using PARTITIONED BY enhancement #11107 (hveiga)
  • feat: enable "substring" as a UDF in addition to "substr" #11277 (Blizzara)

Fixed bugs:

  • fix: use total ordering in the min & max accumulator for floats #10627 (westonpace)
  • fix: Support double quotes in date_part #10833 (Weijun-H)
  • fix: Ignore nullability of list elements when consuming Substrait #10874 (Blizzara)
  • fix: Support NOT <field> IN (<subquery>) via anti join #10936 (akoshchiy)
  • fix: CTEs defined in a subquery can escape their scope #10954 (jonahgao)
  • fix: Fix the incorrect null joined rows for SMJ outer join with join filter #10892 (viirya)
  • fix: gcd returns negative results #11099 (jonahgao)
  • fix: LCM panicked due to overflow #11131 (jonahgao)
  • fix: Support dictionary type in parquet metadata statistics. #11169 (efredine)
  • fix: Ignore nullability in Substrait structs #11130 (Blizzara)
  • fix: typo in comment about FinalPhysicalPlan #11181 (c8ef)
  • fix: Support Substrait's compound names also for window functions #11163 (Blizzara)
  • fix: Incorrect LEFT JOIN evaluation result on OR conditions #11203 (viirya)
  • fix: Be more lenient in interpreting input args for builtin window functions #11199 (Blizzara)
  • fix: correctly handle Substrait windows with rows bounds (and validate executability of test plans) #11278 (Blizzara)
  • fix: When consuming Substrait, temporarily rename clashing duplicate columns #11329 (Blizzara)

Documentation updates:

  • Minor: Clarify SessionContext::state docs #10847 (alamb)
  • Minor: Update SIGMOD paper reference url #10860 (alamb)
  • docs(variance): Correct typos in comments #10844 (pingsutw)
  • Add missing code close tick in LiteralGuarantee docs #10859 (adriangb)
  • Minor: Add more docs and examples for Transformed and TransformedResult #11003 (alamb)
  • doc: Update links in the documentation #11044 (Weijun-H)
  • Minor: Examples cleanup + more docs in pruning example #11086 (alamb)
  • Minor: refine documentation pointing to examples #11110 (alamb)
  • Fix running in Docker instructions #11141 (findepi)
  • docs: add example for custom file format with COPY TO #11174 (tshauck)
  • Fix docs wordings #11226 (findepi)
  • Fix count() docs around including null values #11293 (findepi)

Other:

  • chore: Prepare 39.0.0-rc1 #10828 (andygrove)
  • Remove expr_fn::sum and replace them with function stub #10816 (jayzhan211)
  • Debug print as many fields as possible for SessionState #10818 (lewiszlw)
  • Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API #10802 (alamb)
  • Remove Built-in sum and Rename to lowercase sum #10831 (jayzhan211)
  • Convert stddev and stddev_pop to UDAF #10834 (goldmedal)
  • Introduce expr builder for aggregate function #10560 (jayzhan211)
  • chore: Improve change log generator #10841 (andygrove)
  • Support user defined ParquetAccessPlan in ParquetExec, validation to ParquetAccessPlan::select #10813 (alamb)
  • Convert VariancePopulation to UDAF #10836 (mknaw)
  • Convert approx_median to UDAF #10840 (goldmedal)
  • MINOR: use workspace deps in proto-common (upgrade object store dependency) #10848 (waynexia)
  • Minor: add Window::try_new_with_schema constructor #10850 (sadboy)
  • Add support for reading CSV files with comments #10467 (bbannier)
  • Convert approx_distinct to UDAF #10851 (Lordworms)
  • minor: add proto-common crate to release instructions #10858 (andygrove)
  • Implement TPCH substrait integration test, support tpch_1 #10842 (Lordworms)
  • Remove unnecessary passing around of suffix: &str in pruning.rs's RequiredColumns #10863 (adriangb)
  • chore: Make DFSchema::datatype_is_logically_equal function public #10867 (advancedxy)
  • Bump braces from 3.0.2 to 3.0.3 in /datafusion/wasmtest/datafusion-wasm-app #10865 (dependabot[bot])
  • Docs: Add unnest to SQL Reference #10839 (gloomweaver)
  • Support correct output column names and struct field names when consuming/producing Substrait #10829 (Blizzara)
  • Make Logical Plans more readable by removing extra aliases #10832 (MohamedAbdeen21)
  • Minor: Improve ListingTable documentation #10854 (alamb)
  • Extending join fuzz tests to support join filtering #10728 (edmondop)
  • replace and(_, not(_)) with and_not(*) #10885 (RTEnzyme)
  • Disabling test for semi join with filters #10887 (edmondop)
  • Minor: Update min_statistics and max_statistics to be helpers, update docs #10866 (alamb)
  • Remove Interval column test // parquet extraction #10888 (marvinlanhenke)
  • Minor: SMJ fuzz tests fix for rowcounts #10891 (comphead)
  • Move Count to functions-aggregate, update MSRV to rust 1.75 #10484 (jayzhan211)
  • refactor: fetch statistics for a given ParquetMetaData #10880 (NGA-TRAN)
  • Move FileSinkExec::metrics to the correct place #10901 (joroKr21)
  • Refine ParquetAccessPlan comments and tests #10896 (alamb)
  • ci: fix clippy failures on main #10903 (jonahgao)
  • Minor: disable flaky fuzz test #10904 (comphead)
  • Remove builtin count #10893 (jayzhan211)
  • Move Regr_* functions to use UDAF #10898 (eejbyfeldt)
  • Docs: clarify when the parquet reader will read from object store when using cached metadata #10909 (alamb)
  • Minor: Fix bench.sh tpch data #10905 (alamb)
  • Minor: use venv in benchmark compare #10894 (tmi)
  • Support explicit type and name during table creation #10273 (duongcongtoai)
  • Simplify Join Partition Rules #10911 (berkaysynnada)
  • Move Literal to physical-expr-common #10910 (lewiszlw)
  • chore: update some error messages for clarity #10916 (jeffreyssmith2nd)
  • Initial Extract parquet data page statistics API #10852 (marvinlanhenke)
  • Add contains function, and support in datafusion substrait consumer #10879 (Lordworms)
  • Minor: Improve arrow_statistics tests #10927 (alamb)
  • Minor: Remove prefer_hash_join env variable for clickbench #10933 (jayzhan211)
  • Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF #10917 (goldmedal)
  • refactor: remove extra default in max rows #10941 (tshauck)
  • chore: Improve performance of Parquet statistics conversion #10932 (Weijun-H)
  • Add catalog::resolve_table_references #10876 (leoyvens)
  • Convert BitAnd, BitOr, BitXor to UDAF #10930 (dharanad)
  • refactor: improve PoolType argument handling for CLI #10940 (tshauck)
  • Minor: remove potential string copy from Column::from_qualified_name #10947 (alamb)
  • Fix: StatisticsConverter counts for missing columns #10946 (marvinlanhenke)
  • Add initial support for Utf8View and BinaryView types #10925 (XiangpengHao)
  • Use shorter aliases in CSE #10939 (peter-toth)
  • Substrait support for ParquetExec round trip for simple select #10949 (xinlifoobar)
  • Support to unparse ScalarValue::IntervalMonthDayNano to String #10956 (goldmedal)
  • Minor: Return option from row_group_row_count #10973 (marvinlanhenke)
  • Minor: Add routine to debug join fuzz tests #10970 (comphead)
  • Support to unparse ScalarValue::TimestampNanosecond to String #10984 (goldmedal)
  • build(deps-dev): bump ws from 8.14.2 to 8.17.1 in /datafusion/wasmtest/datafusion-wasm-app #10988 (dependabot[bot])
  • Minor: reuse Rows buffer in GroupValuesRows #10980 (alamb)
  • Add example for writing SQL analysis using DataFusion structures #10938 (LorrensP-2158466)
  • Push down filter for Unnest plan #10974 (jayzhan211)
  • Add parquet page stats for float{16, 32, 64} #10982 (tmi)
  • Fix file_stream_provider example compilation failure on windows #10975 (lewiszlw)
  • Stop copying LogicalPlan and Exprs in CommonSubexprEliminate (2-3% planning speed improvement) #10835 (alamb)
  • chore: Update documentation link in PhysicalOptimizerRule comment #11002 (Weijun-H)
  • Push down filter plan for unnest on non-unnest column only #10991 (jayzhan211)
  • Minor: add test for pushdown past unnest #11017 (alamb)
  • Update docs for protoc minimum installed version #11006 (jcsherin)
  • propagate error instead of panicking on out of bounds in physical-expr/src/analysis.rs #10992 (LorrensP-2158466)
  • Add drop_columns to dataframe api #11010 (Omega359)
  • Push down filter plan for non-unnest column #11019 (jayzhan211)
  • Consider timezones with UTC and +00:00 to be the same #10960 (marvinlanhenke)
  • Deprecate OptimizerRule::try_optimize #11022 (lewiszlw)
  • Relax combine partial final rule #10913 (mustafasrepo)
  • Compute gcd with u64 instead of i64 because of overflows #11036 (LorrensP-2158466)
  • Add distinct_on to dataframe api #11012 (Omega359)
  • chore: add test to show current behavior of AT TIME ZONE for string vs. timestamp #11056 (appletreeisyellow)
  • Boolean parquet get datapage stat #11054 (LorrensP-2158466)
  • Using display_name for Expr::Aggregation #11020 (Lordworms)
  • Minor: Convert Count's name to lowercase #11028 (jayzhan211)
  • Minor: Move function::Hint to datafusion-expr crate to avoid physical-expr dependency for datafusion-function crate #11061 (jayzhan211)
  • Support to unparse ScalarValue::TimestampMillisecond to String #11046 (pingsutw)
  • Support to unparse IntervalYearMonth and IntervalDayTime to String #11065 (goldmedal)
  • SMJ: fix streaming row concurrency issue for LEFT SEMI filtered join #11041 (comphead)
  • Add advanced_parquet_index.rs example of indexing into parquet files #10701 (alamb)
  • Add Expr::column_refs to find column references without copying #10948 (alamb)
  • Give OptimizerRule::try_optimize default implementation and cleanup duplicated custom implementations #11059 (lewiszlw)
  • Fix FormatOptions::CSV propagation #10912 (svranesevic)
  • Support parsing SQL strings to Exprs #10995 (xinlifoobar)
  • Support dictionary data type in array_to_string #10908 (EduardoVega)
  • Implement min/max for interval types #11015 (maxburke)
  • Improve LIKE performance for Dictionary arrays #11058 (Lordworms)
  • handle overflow in gcd and return this as an error #11057 (LorrensP-2158466)
  • Convert Correlation to UDAF #11064 (pingsutw)
  • Migrate more code from Expr::to_columns to Expr::column_refs #11067 (alamb)
  • decimal support for unparser #11092 (y-f-u)
  • Improve CommonSubexprEliminate identifier management (10% faster planning) #10473 (peter-toth)
  • Change wildcard qualifier type from String to TableReference #11073 (linhr)
  • Allow access to UDTF in SessionContext #11071 (linhr)
  • Strip table qualifiers from schema in UNION ALL for unparser #11082 (phillipleblanc)
  • Update ListingTable to use StatisticsConverter #11068 (xinlifoobar)
  • to_timestamp functions should preserve timezone #11038 (maxburke)
  • Rewrite array operator to function in parser #11101 (jayzhan211)
  • Resolve empty relation opt for join types #11066 (LorrensP-2158466)
  • Add composed extension codec example #11095 (lewiszlw)
  • Minor: Avoid some repetition in to_timestamp #11116 (alamb)
  • Minor: fix ScalarValue::new_ten error message (cites one not ten) #11126 (gstvg)
  • Deprecate Expr::column_refs #11115 (alamb)
  • Overflow in negate operator #11084 (LorrensP-2158466)
  • Minor: Add Architectural Goals to the docs #11109 (alamb)
  • Fix overflow in pow #11124 (LorrensP-2158466)
  • Support to unparse Time scalar value to String #11121 (goldmedal)
  • Support to unparse TimestampSecond and TimestampMicrosecond to String #11120 (goldmedal)
  • Add standalone example for OptimizerRule #11087 (alamb)
  • Fix overflow in factorial #11134 (LorrensP-2158466)
  • Temporary Fix: Query error when grouping by case expressions #11133 (jonahgao)
  • Fix nullability of return value of array_agg #11093 (eejbyfeldt)
  • Support filter for List #11091 (jayzhan211)
  • [MINOR]: Fix some minor silent bugs #11127 (mustafasrepo)
  • Minor Fix for Logical and Physical Expr Conversions #11142 (berkaysynnada)
  • Support Date Parquet Data Page Statistics #11135 (dharanad)
  • fix flaky array query slt test #11140 (leoyvens)
  • Support Decimal and Decimal256 Parquet Data Page Statistics #11138 (Lordworms)
  • Implement comparisons on nested data types such that distinct/except would work #11117 (rtyler)
  • Minor: dont panic with bad arguments to round #10899 (tmi)
  • Minor: reduce replication for nested comparison #11149 (alamb)
  • [Minor]: Remove datafusion-functions-aggregate dependency from physical-expr crate #11158 (mustafasrepo)
  • adding config to control Varchar behavior #11090 (Lordworms)
  • minor: consolidate gcd related tests #11164 (jonahgao)
  • Minor: move batch spilling methods to lib.rs to make it reusable #11154 (comphead)
  • Move schema projection to where it's used in ListingTable #11167 (adriangb)
  • Make running in docker instruction be copy-pastable #11148 (findepi)
  • Rewrite array @> array and array <@ array in sql_expr_to_logical_expr #11155 (jayzhan211)
  • Minor: make some physical_optimizer rules public #11171 (askalt)
  • Remove pr_benchmarks.yml #11165 (alamb)
  • Optionally display schema in explain plan #11177 (alamb)
  • Minor: Add more support for ScalarValue::Float16 #11156 (Lordworms)
  • Minor: fix SQLOptions::with_allow_ddl comments #11166 (alamb)
  • Update sqllogictest requirement from 0.20.0 to 0.21.0 #11189 (dependabot[bot])
  • Support Time Parquet Data Page Statistics #11187 (dharanad)
  • Adds support for Dictionary data type statistics from parquet data pages. #11195 (efredine)
  • [Minor]: Make sort_batch public #11191 (mustafasrepo)
  • Introduce user defined SQL planner API #11180 (jayzhan211)
  • Convert grouping to udaf #11147 (Rachelint)
  • Make statistics_from_parquet_meta a sync function #11205 (adriangb)
  • Allow user defined SQL planners to be registered #11208 (samuelcolvin)
  • Recursive unnest #11062 (duongcongtoai)
  • Document how to test examples in user guide, add some more coverage #11178 (alamb)
  • Minor: Move MemoryCatalog*Provider into a module, improve comments #11183 (alamb)
  • Add standalone example of using the SQL frontend #11088 (alamb)
  • Add Optimizer Sanity Checker, improve sortedness equivalence properties #11196 (mustafasrepo)
  • Implement user defined planner for extract #11215 (xinlifoobar)
  • Move basic SQL query examples to user guide #11217 (alamb)
  • Support FixedSizedBinaryArray Parquet Data Page Statistics #11200 (dharanad)
  • Implement ScalarValue::Map #11224 (goldmedal)
  • Remove unmaintained python pre-commit configuration #11255 (findepi)
  • Enable clone_on_ref_ptr clippy lint on execution crate #11239 (lewiszlw)
  • Minor: Improve documentation about pushdown join predicates #11209 (alamb)
  • Minor: clean up data page statistics tests and fix bugs #11236 (efredine)
  • Replacing pattern matching through downcast with trait method #11257 (edmondop)
  • Update substrait requirement from 0.34.0 to 0.35.0 #11206 (dependabot[bot])
  • Enhance short circuit handling in CommonSubexprEliminate #11197 (peter-toth)
  • Add bench for data page statistics parquet extraction #10950 (marvinlanhenke)
  • Register SQL planners in SessionState constructor #11253 (dharanad)
  • Support DuckDB style struct syntax #11214 (jayzhan211)
  • Enable clone_on_ref_ptr clippy lint on expr crate #11238 (lewiszlw)
  • Optimize PushDownFilter to avoid recreating schema columns #11211 (alamb)
  • Remove outdated rewrite_expr.rs example #11085 (alamb)
  • Implement TPCH substrait integration test, support tpch_2 #11234 (Lordworms)
  • Enable clone_on_ref_ptr clippy lint on physical-expr crate #11240 (lewiszlw)
  • Add standalone AnalyzerRule example that implements row level access control #11089 (alamb)
  • Replace println! with assert! if possible in DataFusion examples #11237 (Nishi46)
  • minor: format Expr::get_type() #11267 (jonahgao)
  • Fix hash join for nested types #11232 (eejbyfeldt)
  • Infer count() aggregation is not null #11256 (findepi)
  • Remove unnecessary qualified names #11292 (findepi)
  • Fix running examples readme #11225 (findepi)
  • Minor: Add ConstExpr::from and use in physical optimizer #11283 (alamb)
  • Implement TPCH substrait integration test, support tpch_3 #11298 (Lordworms)
  • Implement user defined planner for position #11243 (xinlifoobar)
  • Upgrade to arrow 52.1.0 (and fix clippy issues on main) #11302 (alamb)
  • AggregateExec: Take grouping sets into account for InputOrderMode #11301 (thinkharderdev)
  • Add user_defined_sql_planners(..) to FunctionRegistry #11296 (Omega359)
  • use safe cast in propagate_constraints #11297 (Lordworms)
  • Minor: Remove clone in optimizer #11315 (jayzhan211)
  • minor: Add PhysicalSortExpr::new #11310 (andygrove)
  • Fix data page statistics when all rows are null in a data page #11295 (efredine)
  • Made UserDefinedFunctionPlanner to uniform the usages #11318 (xinlifoobar)
  • Implement user defined planner for create_struct & create_named_struct #11273 (dharanad)
  • Improve stats convert performance for Binary/String/Boolean arrays #11319 (Rachelint)
  • Fix typos in datafusion-examples/datafusion-cli/docs #11259 (lewiszlw)
  • Minor: Fix Failing TPC-DS Test #11331 (berkaysynnada)
  • HashJoin can preserve the right ordering when join type is Right #11276 (berkaysynnada)
  • Update substrait requirement from 0.35.0 to 0.36.0 #11328 (dependabot[bot])
  • Support to unparse logical plans with timestamp cast to string #11326 (sgrebnov)
  • Implement user defined planner for sql_substring_to_expr #11327 (xinlifoobar)
  • Improve volatile expression handling in CommonSubexprEliminate #11265 (peter-toth)
  • Support IS NULL and IS NOT NULL on Unions #11321 (samuelcolvin)
  • Implement TPCH substrait integration test, support tpch_4 and tpch_5 #11311 (Lordworms)
  • Enable clone_on_ref_ptr clippy lint on physical-plan crate #11241 (lewiszlw)
  • Remove any aliases in Filter::try_new rather than erroring #11307 (samuelcolvin)
  • Improve DataFrame Users Guide #11324 (alamb)
  • chore: Rename UserDefinedSQLPlanner to ExprPlanner #11338 (andygrove)
  • Revert "remove derive(Copy) from Operator (#11132)" #11341 (alamb)
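
Several entries above concern integer overflow in math functions (gcd in #11036 and #11057, plus negate, pow, and factorial). The sketch below illustrates the general technique those titles point at: compute on unsigned magnitudes and report overflow as an error instead of panicking or wrapping. It uses only the Rust standard library; the function names and the `String` error type are assumptions made for this example, not DataFusion's implementation.

```rust
use std::convert::TryFrom;

/// Euclid's algorithm on unsigned magnitudes, so that |i64::MIN| is representable.
fn gcd_u64(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

/// gcd for signed inputs; the result is never negative.
/// Returns `Err` when the result (2^63) does not fit in an i64.
fn checked_gcd(x: i64, y: i64) -> Result<i64, String> {
    let g = gcd_u64(x.unsigned_abs(), y.unsigned_abs());
    i64::try_from(g).map_err(|_| format!("gcd({}, {}) overflows i64", x, y))
}

/// lcm for signed inputs; returns `Err` instead of panicking on overflow.
fn checked_lcm(x: i64, y: i64) -> Result<i64, String> {
    if x == 0 || y == 0 {
        return Ok(0);
    }
    let g = gcd_u64(x.unsigned_abs(), y.unsigned_abs());
    // Divide before multiplying to keep the intermediate value small.
    let l = (x.unsigned_abs() / g)
        .checked_mul(y.unsigned_abs())
        .ok_or_else(|| format!("lcm({}, {}) overflows u64", x, y))?;
    i64::try_from(l).map_err(|_| format!("lcm({}, {}) overflows i64", x, y))
}
```

With these helpers, `checked_gcd(-12, 18)` yields `Ok(6)`, while `checked_gcd(i64::MIN, 0)` yields an error because the mathematical result, 2^63, has no `i64` representation.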

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    41	Andrew Lamb
    17	Jay Zhan
    12	Lordworms
    12	张林伟
    10	Arttu
     9	Jax Liu
     9	Lorrens Pantelis
     8	Piotr Findeisen
     7	Dharan Aditya
     7	Jonah Gao
     7	Xin Li
     6	Andy Grove
     6	Marvin Lanhenke
     6	Trent Hauck
     5	Alex Huang
     5	Eric Fredine
     5	Mustafa Akur
     5	Oleks V
     5	dependabot[bot]
     4	Adrian Garcia Badaracco
     4	Berkay Şahin
     4	Kevin Su
     4	Peter Toth
     4	Ruihang Xia
     4	Samuel Colvin
     3	Bruce Ritchie
     3	Edmondo Porcu
     3	Emil Ejbyfeldt
     3	Heran Lin
     3	Leonardo Yvens
     3	jcsherin
     3	tmi
     2	Duong Cong Toai
     2	Liang-Chi Hsieh
     2	Max Burke
     2	kamille
     1	Albert Skalt
     1	Andrey Koshchiy
     1	Benjamin Bannier
     1	Bo Lin
     1	Chojan Shang
     1	Chunchun Ye
     1	Dan Harris
     1	Devin D'Angelo
     1	Eduardo Vega
     1	Georgi Krastev
     1	Hector Veiga
     1	Jeffrey Smith II
     1	Kirill Khramkov
     1	Matt Nawara
     1	Mohamed Abdeen
     1	Nga Tran
     1	Nishi
     1	Phillip LeBlanc
     1	R. Tyler Croy
     1	RT_Enzyme
     1	Sava Vranešević
     1	Sergei Grebnov
     1	Weston Pace
     1	Xiangpeng Hao
     1	advancedxy
     1	c8ef
     1	gstvg
     1	yfu

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.