Apache DataFusion 41.0.0 Changelog

<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

This release consists of 245 commits from 69 contributors. See credits at the end of this changelog for more information.

Breaking changes:

  • make unparser Dialect trait Send + Sync #11504 (y-f-u)
  • Implement physical plan serialization for csv COPY plans, add as_any, Debug to FileFormatFactory #11588 (Lordworms)
  • Consistent API to set parameters of aggregate and window functions (AggregateExt --> ExprFunctionExt) #11550 (timsaucer)
  • Rename ColumnOptions to ParquetColumnOptions #11512 (alamb)
  • Rename input_type --> input_types on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs #11666 (lewiszlw)
  • Rename RepartitionExec metric repart_time to repartition_time #11703 (alamb)
  • Remove AggregateFunctionDefinition #11803 (lewiszlw)
  • Skipping partial aggregation when it is not helping for high cardinality aggregates #11627 (korowa)
  • Optionally create name of aggregate expression from expressions #11776 (lewiszlw)

Performance related:

  • feat: Optimize CASE expression for "column or null" use case #11534 (andygrove)
  • feat: Optimize CASE expression for usage where then and else values are literals #11553 (andygrove)
  • perf: Optimize IsNotNullExpr #11586 (andygrove)

Implemented enhancements:

  • feat: Add fail_on_overflow option to BinaryExpr #11400 (andygrove)
  • feat: add UDF to_local_time() #11347 (appletreeisyellow)
  • feat: switch to using proper Substrait types for IntervalYearMonth and IntervalDayTime #11471 (Blizzara)
  • feat: support UDWFs in Substrait #11489 (Blizzara)
  • feat: support unnest in GROUP BY clause #11469 (JasonLi-cn)
  • feat: support COUNT() #11229 (tshauck)
  • feat: consume and produce Substrait type extensions #11510 (Blizzara)
  • feat: Error when a SHOW command is passed in with an accompanying non-existent variable #11540 (itsjunetime)
  • feat: support Map literals in Substrait consumer and producer #11547 (Blizzara)
  • feat: add bounds for unary math scalar functions #11584 (tshauck)
  • feat: Add support for cardinality function on maps #11801 (Weijun-H)
  • feat: support Utf8View type in starts_with function #11787 (tshauck)
  • feat: Expose public method for optimizing physical plans #11879 (andygrove)

Fixed bugs:

  • fix: Fix eq properties regression from #10434 #11363 (suremarc)
  • fix: make sure JOIN ON expression is boolean type #11423 (jonahgao)
  • fix: regexp_replace fails when pattern or replacement is a scalar NULL #11459 (Weijun-H)
  • fix: unparser generates wrong sql for derived table with columns #11505 (y-f-u)
  • fix: make UnKnownColumns not equal to other physical exprs #11536 (jonahgao)
  • fix: fixes trig function order by #11559 (tshauck)
  • fix: CASE with NULL #11542 (Weijun-H)
  • fix: panic and incorrect results in LogFunc::output_ordering() #11571 (jonahgao)
  • fix: expose the fluent API fn for approx_distinct instead of the module #11644 (Michael-J-Ward)
  • fix: don't try to coerce list for regex match #11646 (tshauck)
  • fix: regr_count now returns Uint64 #11731 (Michael-J-Ward)
  • fix: set null_equals_null to false when convert_cross_join_to_inner_join #11738 (jonahgao)
  • fix: Add additional required expression for natural join #11713 (Lordworms)
  • fix: hash join tests with forced collisions #11806 (korowa)
  • fix: collect_columns quadratic complexity #11843 (crepererum)

Documentation updates:

  • Minor: Add link to blog to main DataFusion website #11356 (alamb)
  • Add to_local_time() in function reference docs #11401 (appletreeisyellow)
  • Minor: Consolidate specification doc sections #11427 (alamb)
  • Combine the Roadmap / Quarterly Roadmap sections #11426 (alamb)
  • Minor: Add an example for backtrace pretty print #11450 (goldmedal)
  • Docs: Document creating new extension APIs #11425 (alamb)
  • Minor: Clarify which parquet options are used for reading/writing #11511 (alamb)
  • Support newlines_in_values CSV option #11533 (connec)
  • chore: Minor cleanup simplify_demo() example #11576 (kavirajk)
  • Move DataFusion Query Optimizer to library user guide #11563 (devesh-2002)
  • Fix typo in doc of Partitioning #11612 (waruto210)
  • Doc: A tiny typo in scalar function's doc #11620 (2010YOUY01)
  • Change default Parquet writer settings to match arrow-rs (except for compression & statistics) #11558 (wiedld)
  • Rename functions-array to functions-nested #11602 (goldmedal)
  • Add parser option enable_options_value_normalization #11330 (xinlifoobar)
  • Add reference to #comet channel in Arrow Rust Discord server #11637 (ajmarcus)
  • Extract catalog API to separate crate, change TableProvider::scan to take a trait rather than SessionState #11516 (findepi)
  • doc: why nullable of list item is set to true #11626 (jcsherin)
  • Docs: adding explicit mention of test_utils to docs #11670 (edmondop)
  • Ensure statistic defaults in parquet writers are in sync #11656 (wiedld)
  • Merge string-view2 branch: reading from parquet up to 2x faster for some ClickBench queries (not on by default) #11667 (alamb)
  • Doc: Add Sail to known users list #11791 (shehabgamin)
  • Move min and max to user defined aggregate function, remove AggregateFunction / AggregateFunctionDefinition::BuiltIn #11013 (edmondop)
  • Change name of MAX/MIN udaf to lowercase max/min #11795 (edmondop)
  • doc: Add support for map and make_map functions #11799 (Weijun-H)
  • Improve readme page in crates.io #11809 (lewiszlw)
  • refactor: remove unneeded mut for session context #11864 (sunng87)

Other:

  • Prepare 40.0.0 Release #11343 (andygrove)
  • Support NULL literals in where clause #11266 (xinlifoobar)
  • Implement TPCH substrait integration test, support tpch_6, tpch_10, t… #11349 (Lordworms)
  • Fix bug when pushing projection under joins #11333 (jonahgao)
  • Minor: some cosmetics in filter.rs, fix clippy due to logical conflict #11368 (comphead)
  • Update prost-derive requirement from 0.12 to 0.13 #11355 (dependabot[bot])
  • Minor: update dashmap 6.0.1 #11335 (alamb)
  • Improve and test dataframe API examples in docs #11290 (alamb)
  • Remove redundant unalias_nested calls for creating Filter's #11340 (alamb)
  • Enable clone_on_ref_ptr clippy lint on optimizer #11346 (lewiszlw)
  • Update termtree requirement from 0.4.1 to 0.5.0 #11383 (dependabot[bot])
  • Introduce resources_err! error macro #11374 (comphead)
  • Enable clone_on_ref_ptr clippy lint on common #11384 (lewiszlw)
  • Track parquet writer encoding memory usage on MemoryPool #11345 (wiedld)
  • Minor: remove clones and unnecessary Arcs in from_substrait_rex #11337 (alamb)
  • Minor: Change no-statement error message to be clearer #11394 (itsjunetime)
  • Change array_agg to return null on no input rather than empty list #11299 (jayzhan211)
  • Minor: return "not supported" for COUNT DISTINCT with multiple arguments #11391 (jonahgao)
  • Enable clone_on_ref_ptr clippy lint on sql #11380 (lewiszlw)
  • Move configuration information out of example usage page #11300 (alamb)
  • chore: reuse a single function to create the Substrait TPCH consumer test contexts #11396 (Blizzara)
  • refactor: change error type for "no statement" #11411 (crepererum)
  • Implement prettier SQL unparsing (more human readable) #11186 (MohamedAbdeen21)
  • Move overlay planning to ExprPlanner #11398 (dharanad)
  • Coerce types for all union children plans when eliminating nesting #11386 (gruuya)
  • Add customizable equality and hash functions to UDFs #11392 (joroKr21)
  • Implement ScalarFunction MAKE_MAP and MAP #11361 (goldmedal)
  • Improve CommonSubexprEliminate rule with surely and conditionally evaluated stats #11357 (peter-toth)
  • fix(11397): surface proper errors in ParquetSink #11399 (wiedld)
  • Minor: Add note about SQLLancer fuzz testing to docs #11430 (alamb)
  • Trivial: use arrow csv writer's timestamp_tz_format #11407 (tmi)
  • Improved unparser documentation #11395 (alamb)
  • Avoid calling shutdown after failed write of AsyncWrite #11415 (joroKr21)
  • Short term way to make AggregateStatistics still work when min/max is converted to udaf #11261 (Rachelint)
  • Implement TPCH substrait integration test, support tpch_13, tpch_14,16 #11405 (Lordworms)
  • Minor: fix github action labeler rules #11428 (alamb)
  • Minor: change internal error to not supported error for nested field … #11446 (alamb)
  • Minor: change Datafusion --> DataFusion in docs #11439 (alamb)
  • Support serialization/deserialization for custom physical exprs in proto #11387 (lewiszlw)
  • remove termtree dependency #11416 (Kev1n8)
  • Add SessionStateBuilder and extract out the registration of defaults #11403 (Omega359)
  • integrate consumer tests, implement tpch query 18 to 22 #11462 (Lordworms)
  • Docs: Explain the usage of logical expressions for create_aggregate_expr #11458 (jayzhan211)
  • Return scalar result when all inputs are constants in map and make_map #11461 (Rachelint)
  • Enable clone_on_ref_ptr clippy lint on functions* #11468 (lewiszlw)
  • minor: non-overlapping repart_time and send_time metrics #11440 (korowa)
  • Minor: rename row_groups.rs to row_group_filter.rs #11481 (alamb)
  • Support alternate formats for unparsing datetime to timestamp and interval #11466 (y-f-u)
  • chore: Add criterion benchmark for CaseExpr #11482 (andygrove)
  • Initial support for StringView, merge changes from string-view development branch #11402 (alamb)
  • Replace to_lowercase with to_string in sql example #11486 (lewiszlw)
  • Minor: Make execute_input_stream Accessible for Any Sinking Operators #11449 (berkaysynnada)
  • Enable clone_on_ref_ptr clippy lints on proto #11465 (lewiszlw)
  • upgrade sqlparser 0.47 -> 0.48 #11453 (MohamedAbdeen21)
  • Add extension hooks for encoding and decoding UDAFs and UDWFs #11417 (joroKr21)
  • Remove element's nullability of array_agg function #11447 (jayzhan211)
  • Get expr planners when creating new planner #11485 (jayzhan211)
  • Support alternate format for Utf8 unparsing (CHAR) #11494 (sgrebnov)
  • implement retract_batch for xor accumulator #11500 (drewhayward)
  • Refactor: more clearly delineate between TableParquetOptions and ParquetWriterOptions #11444 (wiedld)
  • chore: fix typos of common and core packages #11520 (JasonLi-cn)
  • Move spill related functions to spill.rs #11509 (findepi)
  • Add tests that show the different defaults for ArrowWriter and TableParquetOptions #11524 (wiedld)
  • Create datafusion-physical-optimizer crate #11507 (lewiszlw)
  • Minor: Assert test_enabled_backtrace requirements to run #11525 (comphead)
  • Move handling of NULL literals in where clause to type coercion pass #11491 (xinlifoobar)
  • Update parquet page pruning code to use the StatisticsExtractor #11483 (alamb)
  • Enable SortMergeJoin LeftAnti filtered fuzz tests #11535 (comphead)
  • chore: fix typos of expr, functions, optimizer, physical-expr-common,… #11538 (JasonLi-cn)
  • Minor: Remove clone in PushDownFilter #11532 (jayzhan211)
  • Minor: avoid a clone in type coercion #11530 (alamb)
  • Move array ArrayAgg to a UserDefinedAggregate #11448 (jayzhan211)
  • Move MAKE_MAP to ExprPlanner #11452 (goldmedal)
  • chore: fix typos of sql, sqllogictest and substrait packages #11548 (JasonLi-cn)
  • Prevent bigger files from being checked in #11508 (findepi)
  • Add dialect param to use double precision for float64 in Postgres #11495 (Sevenannn)
  • Minor: move SessionStateDefaults into its own module #11566 (alamb)
  • refactor: rewrite mega type to an enum containing both cases #11539 (LorrensP-2158466)
  • Move sql_compound_identifier_to_expr to ExprPlanner #11487 (dharanad)
  • Support SortMergeJoin spilling #11218 (comphead)
  • Fix unparser invalid sql for query with order #11527 (y-f-u)
  • Provide DataFrame API for map and move map to functions-array #11560 (goldmedal)
  • Move OutputRequirements to datafusion-physical-optimizer crate #11579 (xinlifoobar)
  • Minor: move Column related tests and rename column.rs #11573 (jonahgao)
  • Fix SortMergeJoin antijoin flaky condition #11604 (comphead)
  • Improve Union Equivalence Propagation #11506 (mustafasrepo)
  • Migrate OrderSensitiveArrayAgg to be a user defined aggregate #11564 (jayzhan211)
  • Minor: Disable flaky SMJ antijoin filtered test until the fix #11608 (comphead)
  • support Decimal256 type in datafusion-proto #11606 (leoyvens)
  • Chore/fifo tests cleanup #11616 (ozankabak)
  • Fix Internal Error for an INNER JOIN query #11578 (xinlifoobar)
  • test: get file size by func metadata #11575 (zhuliquan)
  • Improve unparser MySQL compatibility #11589 (sgrebnov)
  • Push scalar functions into cross join #11528 (lewiszlw)
  • Remove ArrayAgg Builtin in favor of UDF #11611 (jayzhan211)
  • refactor: simplify DFSchema::field_with_unqualified_name #11619 (jonahgao)
  • Minor: Use upstream concat_batches from arrow-rs #11615 (alamb)
  • Fix: signum function bug when input is 0.0 #11580 (getChan)
  • Enforce uniqueness of named_struct field names #11614 (dharanad)
  • Minor: unnecessary row_count calculation in CrossJoinExec and NestedLoopsJoinExec #11632 (alamb)
  • ExprBuilder for Physical Aggregate Expr #11617 (jayzhan211)
  • Minor: avoid copying order by exprs in planner #11634 (alamb)
  • Unify CI and pre-commit hook settings for clippy #11640 (findepi)
  • Parsing SQL strings to Exprs with the qualified schema #11562 (Lordworms)
  • Add some zero column tests covering LIMIT, GROUP BY, WHERE, JOIN, and WINDOW #11624 (Kev1n8)
  • Refactor/simplify window frame utils #11648 (ozankabak)
  • Minor: use ready! macro to simplify FilterExec #11649 (alamb)
  • Temporarily pin toolchain version to avoid clippy errors #11655 (findepi)
  • Fix clippy errors for Rust 1.80 #11654 (findepi)
  • Add CsvExecBuilder for creating CsvExec #11633 (connec)
  • chore(deps): update sqlparser requirement from 0.48 to 0.49 #11630 (dependabot[bot])
  • Add support for USING to SQL unparser #11636 (wackywendell)
  • Run CI with latest (Rust 1.80), add ticket references to commented out tests #11661 (alamb)
  • Use AccumulatorArgs::is_reversed in NthValueAgg #11669 (jcsherin)
  • Implement physical plan serialization for json Copy plans #11645 (Lordworms)
  • Minor: improve documentation on SessionState #11642 (alamb)
  • Add LimitPushdown optimization rule and CoalesceBatchesExec fetch #11652 (alihandroid)
  • Update to arrow/parquet 52.2.0 #11691 (alamb)
  • Minor: Rename RepartitionMetrics::repartition_time to RepartitionMetrics::repart_time to match metric #11478 (alamb)
  • Update cache key used in rust CI script #11641 (findepi)
  • Fix bug in remove_join_expressions #11693 (jonahgao)
  • Initial changes to support using udaf min/max for statistics and opti… #11696 (edmondop)
  • Handle nulls in approx_percentile_cont #11721 (Dandandan)
  • Reduce repetition in try_process_group_by_unnest and try_process_unnest #11714 (JasonLi-cn)
  • Minor: Add example for ScalarUDF::call #11727 (alamb)
  • Use cargo release in bench.sh #11722 (alamb)
  • expose some fields on session state #11716 (waynexia)
  • Make DefaultSchemaAdapterFactory public #11709 (adriangb)
  • Check hashes first during probing the aggr hash table #11718 (Rachelint)
  • Implement physical plan serialization for parquet Copy plans #11735 (Lordworms)
  • Support cross-timezone timestamp comparison via coercion #11711 (jeffreyssmith2nd)
  • Minor: Improve documentation for AggregateUDFImpl::state_fields #11740 (lewiszlw)
  • Do not push down Sorts if it violates the sort requirements #11678 (alamb)
  • Use upstream StatisticsConverter from arrow-rs in DataFusion #11479 (alamb)
  • Fix plan_to_sql: Add wildcard projection to SELECT statement if no projection was set #11744 (LatrecheYasser)
  • Use upstream DataType::from_str in arrow-cast #11254 (alamb)
  • Fix documentation warnings, make CsvExecBuilder and Unparsed pub #11729 (alamb)
  • [Minor] Add test for only nulls (empty) as input in APPROX_PERCENTILE_CONT #11760 (Dandandan)
  • Add TrackedMemoryPool with better error messages on exhaustion #11665 (wiedld)
  • Derive Debug for logical plan nodes #11757 (lewiszlw)
  • Minor: add "clickbench extended" queries to slt tests #11763 (alamb)
  • Minor: Add comment explaining rationale for hash check #11750 (alamb)
  • Fix bug that COUNT(DISTINCT) on StringView panics #11768 (XiangpengHao)
  • [Minor] Refactor approx_percentile #11769 (Dandandan)
  • minor: always time batch_filter even when the result is an empty batch #11775 (andygrove)
  • Improve OOM message when a single reservation request fails to get more bytes. #11771 (wiedld)
  • [Minor] Short circuit ApplyFunctionRewrites if there are no function rewrites #11765 (gruuya)
  • Fix #11692: Improve doc comments within macros #11694 (Rafferty97)
  • Extract CoalesceBatchesStream to a struct #11610 (alamb)
  • refactor: move ExecutionPlan and related structs into dedicated mod #11759 (waynexia)
  • Minor: Add references to github issue in comments #11784 (findepi)
  • Add docs and rename param for Signature::numeric #11778 (matthewmturner)
  • Support planning Map literal #11780 (goldmedal)
  • Support LogicalPlan Debug differently than Display #11774 (lewiszlw)
  • Remove redundant Aggregate when DISTINCT & GROUP BY are in the same query #11781 (mertak-synnada)
  • Minor: add ticket reference and fmt #11805 (alamb)
  • Improve MSRV CI check to print out problems to log #11789 (alamb)
  • Improve log func tests stability #11808 (lewiszlw)
  • Add valid Distinct case for aggregation #11814 (mertak-synnada)
  • Don't implement create_sliding_accumulator repeatedly #11813 (lewiszlw)
  • chore(deps): update rstest requirement from 0.21.0 to 0.22.0 #11811 (dependabot[bot])
  • Minor: Update expected output due to logical conflict #11824 (alamb)
  • Pass scalar to eq inside nullif #11697 (simonvandel)
  • refactor: move aggregate_statistics to datafusion-physical-optimizer #11798 (Weijun-H)
  • Minor: refactor probe check into function should_skip_aggregation #11821 (alamb)
  • Minor: consolidate path_partition test into core_integration #11831 (alamb)
  • Move optimizer integration tests to core_integration #11830 (alamb)
  • Bump deprecated version of SessionState::new_with_config_rt to 41.0.0 #11839 (kezhuw)
  • Fix partial aggregation skipping with Decimal aggregators #11833 (alamb)
  • Fix bug with zero-sized buffer for StringViewArray #11841 (XiangpengHao)
  • Reduce clone of Statistics in ListingTable and PartitionedFile #11802 (Rachelint)
  • Add LogicalPlan::CreateIndex #11817 (lewiszlw)
  • Update object_store to 0.10.2 #11860 (danlgrca)
  • Add skipped_aggregation_rows metric to aggregate operator #11706 (alamb)
  • Cast Utf8View to Utf8 to support || from StringViewArray #11796 (dharanad)
  • Improve nested loop join code #11863 (lewiszlw)
  • [Minor]: Refactor to use Result.transpose() #11882 (djanderson)
  • support ANY() op #11849 (samuelcolvin)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    48	Andrew Lamb
    20	张林伟
     9	Jay Zhan
     9	Jonah Gao
     8	Andy Grove
     8	Lordworms
     8	Piotr Findeisen
     8	wiedld
     7	Oleks V
     6	Jax Liu
     5	Alex Huang
     5	Arttu
     5	JasonLi
     5	Trent Hauck
     5	Xin Li
     4	Dharan Aditya
     4	Edmondo Porcu
     4	dependabot[bot]
     4	kamille
     4	yfu
     3	Daniël Heres
     3	Eduard Karacharov
     3	Georgi Krastev
     2	Chris Connelly
     2	Chunchun Ye
     2	June
     2	Marco Neumann
     2	Marko Grujic
     2	Mehmet Ozan Kabak
     2	Michael J Ward
     2	Mohamed Abdeen
     2	Ruihang Xia
     2	Sergei Grebnov
     2	Xiangpeng Hao
     2	jcsherin
     2	kf zheng
     2	mertak-synnada
     1	Adrian Garcia Badaracco
     1	Alexander Rafferty
     1	Alihan Çelikcan
     1	Ariel Marcus
     1	Berkay Şahin
     1	Bruce Ritchie
     1	Devesh Rahatekar
     1	Douglas Anderson
     1	Drew Hayward
     1	Jeffrey Smith II
     1	Kaviraj Kanagaraj
     1	Kezhu Wang
     1	Leonardo Yvens
     1	Lorrens Pantelis
     1	Matthew Cramerus
     1	Matthew Turner
     1	Mustafa Akur
     1	Namgung Chan
     1	Ning Sun
     1	Peter Toth
     1	Qianqian
     1	Samuel Colvin
     1	Shehab Amin
     1	Simon Vandel Sillesen
     1	Tim Saucer
     1	Wendell Smith
     1	Yasser Latreche
     1	Yongting You
     1	danlgrca
     1	tmi
     1	waruto
     1	zhuliquan

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.