<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

5.0.0 (2021-08-10)

Full Changelog

Breaking changes:

  • Box ScalarValue::List, reduce size by half #788 (alamb)
  • JOIN conditions are order dependent #778 (seddonm1)
  • Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
  • #723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
  • Update API for extension planning to include logical plan #643 (alamb)
  • Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
  • fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
  • fix join column handling logic for On and Using constraints #605 (houqp)
  • Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
  • Support reading from NdJson formatted data sources #404 (heymind)
  • Add metrics to RepartitionExec #398 (andygrove)
  • Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
  • Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
  • Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
  • [Datafusion] NOW() function support #288 (msathis)
  • Implement select distinct #262 (Dandandan)
  • Refactor datafusion/src/physical_plan/common.rs build_file_list to take fewer params and reuse code #253 (Jimexist)
  • Support qualified columns in queries #55 (houqp)
  • Read CSV format text from stdin or memory #54 (heymind)
  • Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)
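
For readers curious what "use atomics rather than a Mutex" (#25, #30) means in practice, here is a minimal sketch of a lock-free counter. `CounterMetric` and `count_in_parallel` are hypothetical names made up for this illustration; this is not DataFusion's actual `SQLMetric` code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a metric counter (assumption: not the real `SQLMetric`).
#[derive(Debug, Default)]
pub struct CounterMetric {
    value: AtomicUsize,
}

impl CounterMetric {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add `n` to the counter without taking a lock.
    pub fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    /// Read the current value.
    pub fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

/// Spawn `threads` workers that each add 1 to a shared counter `per_thread` times,
/// then return the final count after all workers have been joined.
pub fn count_in_parallel(threads: usize, per_thread: usize) -> usize {
    let metric = Arc::new(CounterMetric::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let metric = Arc::clone(&metric);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    metric.add(1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    metric.value()
}
```

Calling `count_in_parallel(4, 1000)` returns `4000`: every increment is a single atomic operation, so no update is lost even though no thread ever holds a lock.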

Implemented enhancements:

  • Allow extension nodes to correctly plan physical expressions with relations #642
  • Filters aren't passed down to table scans in a union #557
  • Support pruning for boolean columns #490
  • Implement SQLMetrics for RepartitionExec #397
  • DataFusion benchmarks should show executed plan with metrics after query completes #396
  • Use published versions of arrow rather than github shas #393
  • Add Compare to GroupByScalar #364
  • Reusable "row group pruning" logic #363
  • Add an Order Preserving merge operator #362
  • Implement Postgres compatible now() function #251
  • COUNT DISTINCT does not support dictionary types #249
  • Use standard make_null_array for CASE #222
  • Implement date_trunc() function #203
  • COUNT DISTINCT does not support Float64 #199
  • Update SQLMetric to use atomics rather than a Mutex #30
  • Implement PartialOrd for ScalarValue #838 (viirya)
  • Support date datatypes in max/min #820 (viirya)
  • Implement vectorized hashing for DictionaryArray types #812 (alamb)
  • Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
  • Implement streaming versions of Dataframe.collect methods #789 (andygrove)
  • impl from str for column and scalar #762 (Jimexist)
  • impl fmt::Display for PlanType #752 (Jimexist)
  • Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
  • Support table columns alias #735 (Dandandan)
  • Derive PartialEq for datasource enums #734 (alamb)
  • Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
  • Update to use arrow 5.0 #721 (alamb)
  • #554: Lead/lag window function with offset and default value arguments #687 (jgoday)
  • dedup using join column in wildcard expansion #678 (houqp)
  • Implement metrics for HashJoinExec #664 (andygrove)
  • Show physical plan with metrics in benchmark #662 (andygrove)
  • Allow non-equijoin filters in join condition #660 (Dandandan)
  • Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
  • Add support for leading field in interval #647 (Dandandan)
  • Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
  • Ballista: Implement scalable distributed joins #634 (andygrove)
  • implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
  • Improve "field not found" error messages #625 (andygrove)
  • Support modulus op #577 (gangliao)
  • implement std::default::Default for execution config #570 (Jimexist)
  • to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
  • Filter push down for Union #559 (Dandandan)
  • Implement window functions with partition_by clause #558 (Jimexist)
  • support table alias in join clause #547 (houqp)
  • Not equal predicate in physical_planning pruning #544 (jgoday)
  • add error handling and boundary checking for window frames #530 (Jimexist)
  • Implement window functions with order_by clause #520 (Jimexist)
  • support group by column positions #519 [sql] (jychen7)
  • Implement constant folding for CAST #513 (msathis)
  • Add window frame constructs - alternative #506 (Jimexist)
  • Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
  • Add support for boolean columns in pruning logic #500 (alamb)
  • #215 resolve aliases for group by exprs #485 (jychen7)
  • Support anti join #482 (Dandandan)
  • Support semi join #470 (Dandandan)
  • add order by construct in window function and logical plans #463 (Jimexist)
  • Remove redundant filters (e.g. c > 5 AND c > 5 --> c > 5) #436 (jgoday)
  • fix: display the content of debug explain #434 (NGA-TRAN)
  • implement lead and lag built-in window function #429 (Jimexist)
  • add support for ndjson for datafusion-cli #427 (Jimexist)
  • add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
  • export both now and random functions #389 (Jimexist)
  • Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
  • Sort preserving merge (#362) #379 (tustvold)
  • Add support for multiple partitions with SortExec (#362) #378 (tustvold)
  • add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
  • Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
  • Implement readable explain plans for physical plans #337 (alamb)
  • Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
  • Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
  • add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
  • Implement hash partitioned aggregation #320 (Dandandan)
  • Support COUNT(DISTINCT timestamps) #319 (charlibot)
  • add random SQL function #303 (Jimexist)
  • allow datafusion cli to take -- comments #296 (Jimexist)
  • Add json print format mode to datafusion cli #295 (Jimexist)
  • Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
  • Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
  • allow datafusion-cli to take a file param #285 (Jimexist)
  • add param validation for datafusion-cli #284 (Jimexist)
  • [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
  • Implement count distinct for dictionary arrays #256 (alamb)
  • Count distinct floats #252 (pjmore)
  • Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
  • Allow table providers to indicate their type for catalog metadata #205 (returnString)
  • Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
  • Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
  • [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
  • [ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)
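
The redundant-filter removal listed above (#436) can be pictured with a small sketch. It assumes a hypothetical, minimal `Expr` type that stores predicates as text, so two predicates count as duplicates only when their text matches exactly; DataFusion's real logical expressions and optimizer rule are richer than this.

```rust
/// Hypothetical, minimal boolean expression type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// An opaque predicate such as `c > 5`, kept as text for simplicity.
    Pred(String),
    /// Logical conjunction of two expressions.
    And(Box<Expr>, Box<Expr>),
}

/// Helper: build a predicate from its text.
pub fn pred(text: &str) -> Expr {
    Expr::Pred(text.to_string())
}

/// Helper: build `left AND right`.
pub fn and(left: Expr, right: Expr) -> Expr {
    Expr::And(Box::new(left), Box::new(right))
}

/// Collect the operands of a (possibly nested) AND chain, left to right.
fn split_conjunction(expr: &Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(left, out);
            split_conjunction(right, out);
        }
        other => out.push(other.clone()),
    }
}

/// Drop repeated operands from an AND chain, keeping the first occurrence of each
/// in its original order, then rebuild a left-deep AND chain.
pub fn remove_duplicate_filters(expr: &Expr) -> Expr {
    let mut parts = Vec::new();
    split_conjunction(expr, &mut parts);

    let mut unique: Vec<Expr> = Vec::new();
    for part in parts {
        if !unique.contains(&part) {
            unique.push(part);
        }
    }

    let mut iter = unique.into_iter();
    // Every expression contributes at least one operand, so `next()` cannot fail here.
    let first = iter.next().expect("at least one operand");
    iter.fold(first, |acc, part| and(acc, part))
}
```

With these helpers, `remove_duplicate_filters(&and(pred("c > 5"), pred("c > 5")))` yields `pred("c > 5")`, mirroring the `c > 5 AND c > 5 --> c > 5` example in the entry.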

Fixed bugs:

  • Projection pushdown removes unqualified column names even when they are used #617
  • Panic while running join datatypes/schema.rs:165:10 #601
  • Indentation is incorrect for joins in formatted physical plans #345
  • Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list' #314
  • When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
  • Incorrect answers with SELECT DISTINCT queries #250
  • Intermittent failure in CI join_with_hash_collision #227
  • Concat from Dataframe API no longer accepts multiple expressions #226
  • Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
  • Qualified field resolution too strict #810 [sql] (seddonm1)
  • Better join order resolution logic #797 [sql] (seddonm1)
  • Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
  • Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
  • #723 limit pruning rule to simple expression #764 (lvheyang)
  • #699 fix return type conflict when calling builtin math functions #716 (lvheyang)
  • Fix Date32 and Date64 parquet row group pruning #690 (alamb)
  • Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
  • use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
  • honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
  • fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
  • RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
  • [fix] select * on empty table #613 (rdettai)
  • fix 592, support alias in window functions #607 (Jimexist)
  • RepartitionExec should not error if output has hung up #576 (alamb)
  • Fix pruning on not equal predicate #561 (alamb)
  • hash float arrays using primitive unsigned integer type #556 (houqp)
  • Return errors properly from RepartitionExec #521 (alamb)
  • refactor sort exec stream and combine batches #515 (Jimexist)
  • Fix display of execution time in datafusion-cli #514 (Dandandan)
  • Wrong aggregation arguments error. #505 (jgoday)
  • fix window aggregation with alias and add integration test case #454 (Jimexist)
  • fix: don't duplicate existing filters #409 (e-dard)
  • Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
  • Fix indented display for multi-child nodes #358 (alamb)
  • Fix SQL planner to support multibyte column names #357 (agatan)
  • Fix wrong projection 'optimization' #268 (Dandandan)
  • Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
  • Count distinct boolean #230 (pjmore)
  • Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)
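
As background for the float hashing fix above (#556): Rust's `f64` does not implement `Hash`, so a common approach is to hash the value's bit pattern as an unsigned integer. The sketch below shows that general technique with the standard library only; the function names are hypothetical and this is not the code from the pull request.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one `f64` by hashing its IEEE 754 bit pattern as a `u64`.
/// Note: `0.0` and `-0.0` have different bit patterns, so they hash as distinct keys.
pub fn hash_f64(value: f64) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.to_bits().hash(&mut hasher);
    hasher.finish()
}

/// Hash every element of a slice of floats, one hash per element.
pub fn hash_f64_slice(values: &[f64]) -> Vec<u64> {
    values.iter().map(|v| hash_f64(*v)).collect()
}
```

Equal bit patterns always produce equal hashes, so in `hash_f64_slice(&[1.5, 2.0, 1.5])` the first and third entries match.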

Documentation updates:

Performance improvements:

  • Speed up inlist for strings and primitives #813 (Dandandan)
  • perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
  • Optimize min/max queries with table statistics #719 (b41sh)
  • perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
  • Optimize count(*) with table statistics #620 (Dandandan)
  • optimize window function's find_ranges_in_range #595 (Jimexist)
  • Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
  • Use repartition in window functions to speed up #569 (Jimexist)
  • Constant fold / optimize to_timestamp function during planning #387 (msathis)
  • Speed up create_batch_from_map #339 (Dandandan)
  • Simplify math expression code (use unary kernel) #309 (Dandandan)
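
The statistics-based optimizations above (#620, #719) rest on a simple idea: when the data source already knows the answer, skip the scan. Here is a rough sketch for `COUNT(*)`, using a hypothetical `TableStats` struct and assuming the stored row count is exact; DataFusion's actual statistics type and optimizer rule may differ.

```rust
/// Hypothetical, simplified table statistics (assumption: not DataFusion's real struct).
#[derive(Debug, Clone, Default)]
pub struct TableStats {
    /// Exact row count if the source knows it, otherwise `None`.
    pub num_rows: Option<usize>,
}

/// Answer `COUNT(*)` from statistics when a row count is available;
/// otherwise fall back to `scan`, which counts rows by reading the data.
pub fn count_star<F>(stats: &TableStats, scan: F) -> usize
where
    F: FnOnce() -> usize,
{
    match stats.num_rows {
        Some(n) => n,
        None => scan(),
    }
}
```

When `num_rows` is `Some(42)`, `count_star` returns `42` and never invokes the `scan` closure; with `None` it runs the scan and returns its result.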

Closed issues:

  • Confirm git tagging strategy for releases #770
  • arrow::util::pretty::pretty_format_batches missing #769
  • move the assert_batches_eq! macros to a non part of datafusion #745
  • fix an issue where aliases are not respected in generating downstream schemas in window expr #592
  • make the planner to print more succinct and useful information in window function explain clause #526
  • move window frame module to be in logical_plan #517
  • use a more rust idiomatic way of handling nth_value #448
  • create a test with more than one partition for window functions #435
  • COUNT DISTINCT does not support Boolean #202
  • Read CSV format text from stdin or memory #198
  • Fix null handling hash join #195
  • Allow TableProviders to indicate their type for the information schema #191
  • Make DataFrame extensible #190
  • TPC-H Query 19 #170
  • TPC-H Query 7 #161
  • Upgrade hashbrown to 0.10 #151
  • Implement vectorized hashing for hash aggregate #149
  • More efficient LEFT join implementation #143
  • Implement vectorized hashing #142
  • RFC Roadmap for 2021 (DataFusion) #140
  • Implement hash partitioning #131
  • Grouping by column position #110
  • [Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
  • [Rust] Add support for JSON data sources #103
  • [Rust] Implement metrics framework #95
  • Publicly export Arrow crate from datafusion #36
  • Implement hash-partitioned hash aggregate #27
  • Consider using GitHub pages for DataFusion/Ballista documentation #18
  • Update "repository" in Cargo.toml #16

Merged pull requests:

  • Use RawTable API in hash join #827 (Dandandan)
  • Add test for window functions on dictionary #823 (alamb)
  • Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
  • Move hash_array into hash_utils.rs #807 (alamb)
  • Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
  • fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
  • Test for parquet pruning disabling #754 (alamb)
  • Add explain verbose with limit push down #751 (Jimexist)
  • Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
  • Show optimized physical and logical plans in EXPLAIN #744 (alamb)
  • update python crate to support latest pyo3 syntax and GIL semantics #741 (Jimexist)
  • update python crate dependencies #740 (Jimexist)
  • provide more details on required .parquet file extension error message #729 (Jimexist)
  • split up window functions into a dedicated module with separate files #724 (Jimexist)
  • Use pytest in integration test #715 (Jimexist)
  • replace once iter chain with array::IntoIter #704 (houqp)
  • avoid iterator materialization in column index lookup #703 (houqp)
  • Fix build with 1.52.1 #696 (alamb)
  • Fix test output due to logical merge conflict #694 (alamb)
  • add more integration tests #668 (Jimexist)
  • Bump arrow and parquet versions to 4.4 #654 (toddtreece)
  • Add query 15 to TPC-H queries #645 (Dandandan)
  • Improve error message and comments #641 (alamb)
  • add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
  • round trip TPCH queries in tests #630 (houqp)
  • use Into<String> as argument type wherever applicable #615 (houqp)
  • reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
  • fix clippy warnings #581 (Jimexist)
  • Add benchmarks to window function queries #564 (Jimexist)
  • reuse code for now function expr creation #548 (houqp)
  • turn on clippy rule for needless borrow #545 (Jimexist)
  • Refactor hash aggregate's planner building code #539 (Jimexist)
  • Cleanup Repartition Exec code #538 (alamb)
  • reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
  • remove redundant into_iter() calls #527 (Jimexist)
  • Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
  • Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
  • Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
  • Avoid warnings in tests when compiling without default features #489 (alamb)
  • update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
  • use prettier check in CI #453 (Jimexist)
  • Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
  • Fixed typo / logical merge conflict #433 (jorgecarleitao)
  • include test data and add aggregation tests in integration test #425 (Jimexist)
  • Add some padding around the logo #411 (parthsarthy)
  • Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
  • refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
  • Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
  • Update arrow dependencies again #341 (alamb)
  • Update arrow-rs deps #317 (alamb)
  • Update PR template by commenting out instructions #315 (alamb)
  • fix clippy warning #286 (Jimexist)
  • add integration test to compare datafusion-cli against psql #281 (Jimexist)
  • Update arrow deps #269 (alamb)
  • Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
  • Enable redundant_field_names clippy lint #261 (Dandandan)
  • fix clippy lint #259 (alamb)
  • Move datafusion-cli to new crate #231 (Dandandan)
  • Make test join_with_hash_collision deterministic #229 (Dandandan)
  • Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
  • Use standard make_null_array for CASE #223 (alamb)
  • update arrow-rs deps to latest master #216 (alamb)
  • MINOR: Remove empty rust dir #61 (andygrove)