<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

16.0.0 (2023-01-12)

Full Changelog

Breaking changes:

  • Remove unused ExecutionPlan::relies_input_order (has been replaced with required_input_ordering) #4856 (alamb)
  • Add DataFrame::into_view instead of implementing TableProvider (#2659) #4778 (tustvold)

Implemented enhancements:

  • Support custom window frame with AVG aggregate function #4845
  • add sqllogicaltest for tpch and remove some duplicated test. #4801
  • Catalog Snapshot Isolation #4697
  • Support select .. FROM 'parquet.file' in datafusion-cli #4580

Fixed bugs:

  • Regression: write_csv result has incorrect formatting #4876
  • Incorrect results for join condition against current master branch #4844
  • Match Postgres for stddev and variance on less than 3 values #4843
  • JOIN ... USING (columns) works incorrectly with multiple columns (joined-over columns are missing in the output) #4674
  • ROW_NUMBER window function inconsistent across partitions in multi-threaded runtime #4673
  • SELECT ... FROM (tbl1 UNION tbl2) wrongly works like SELECT DISTINCT ... FROM (tbl1 UNION tbl2) #4667
  • DataFrame TableProvider Circular Reference #2659

Documentation updates:

Closed issues:

  • Remove tests from sql_integration that were ported to sqllogictest #4498
  • How to register a http url to the object_store #4491
  • optimizer: support unsigned <-> decimal for unwrap_cast_in_comparison rule #4287
  • Add SQL support for NATURAL JOIN #117
  • [Datafusion] Datafusion queries involving a column name that begins with a number produces unexpected results #108

Merged pull requests:

  • docs: improve Column::normalize_with_schemas docs #4871 (crepererum)
  • Skip EliminateCrossJoin rule when meet non-empty join filter #4869 (ygf11)
  • Support for SQL Natural Join #4863 [sql] (Jefffrey)
  • Minor: Move test data into datafusion/core/tests/data #4855 (alamb)
  • Covariance single row input & null skipping #4852 (korowa)
  • Document ability to select directly from files in datafusion-cli #4851 (alamb)
  • Fix push_down_projection through a distinct #4849 (Jefffrey)
  • Support using var/var_pop/stddev/stddev_pop in window expressions with custom frames #4848 (jonmmease)
  • Update variance/stddev to work with single values #4847 (jonmmease)
  • Implement retract_batch for AvgAccumulator #4846 (jonmmease)
  • Support wildcard select on multiple column using joins #4840 [sql] (Jefffrey)
  • Orthogonalize distribution and sort enforcement rules into EnforceDistribution and EnforceSorting #4839 (mustafasrepo)
  • support select .. FROM 'parquet.file' in datafusion-cli #4838 (unconsolable)
  • Remove tests from sql_integration that were ported to sqllogictest #4836 (matthewwillian)
  • add tpch sqllogicaltest and remove some duplicated test #4802 (jackwener)

16.0.0-rc1 (2023-01-07)

Full Changelog

Breaking changes:

Implemented enhancements:

  • Move the ExtractEquijoinPredicate behind the SubqueryFilterToJoin #4759
  • Remove the config datafusion.execution.coalesce_target_batch_size #4756
  • SimplifyExpressions will fail when rebuild equijoin with alias #4754
  • Provide a constructor for the ConfigOptions with HashMap<String, String> #4752
  • Non-deprecated support for planning SQL without DDL #4720
  • Add regression tests for planning TPC-DS queries #4718
  • Move the extracting join keys logic to optimizer #4710
  • Support compression in IPCWriter #4708
  • Support prepared statement parameter type inference #4700
  • PruningPredicate Use Physical not Logical Predicate #4695
  • Support for executing infinite files #4692
  • Add a sort rule to remove unnecessary SortExecs from physical plan #4686
  • Install protoc automatically when building datafusion/proto crate #4684
  • Make DfSchema wrap SchemaRef #4680
  • Reorder the physical plan optimizer rules #4678
  • Inconsistent behavior with PostgreSQL to decide Window Expressions ordering #4641
  • Returns error too late when parsing invalid file compression type. #4636
  • Make OptimizerConfig a Trait #4631
  • Move Optimize onto DataFrame #4626
  • Make LogicalPlanBuilder Consuming #4622
  • Make DataFrame Consuming #4621
  • rules don't need to recursion inside themselves #4613
  • [window function] support min max with self define sliding window. #4603
  • Add try_optimize for all_rules #4598
  • Refine the physical plan serialization and deserialization #4597
  • Normalize datafusion configuration names #4595
  • Add need_data_exchange in the ExecutionPlan to indicate whether a physical operator needs data exchange #4585
  • Bump Datafusion sql-parser dependency to 0.28 #4573
  • tpch test exist duplicated #4563
  • user-defined aggregate function as window function #4552
  • Convert a Prepare Logical Plan into a Logical Plan with all parameters replaced with values #4550
  • FileStream requires fake ObjectStore when ParquetFileReaderFactory is used #4533
  • Avoid reading the entire file in ChunkedStore #4524
  • Enrich filter statistics predictions with estimated column boundaries #4518
  • Show window frame info in physical plan #4509
  • Add sqllogictest auto labeler #4507
  • Optimize is_distinct_from / is_not_distinct_from #4482
  • Add window func related logic plan to proto ability. #4480
  • Make window function related struct public. #4479
  • Improve partition file explain plan display to show groupings #4466
  • Add support for non-column key for equijoin when eliminating cross join to inner join #4442
  • Remove the schema checking from CrossJoinExec::try_new #4431
  • Initial support for prepared statement #4426
  • Add support for NTILE built-in Window Function #4403
  • Add Support for MIN, MAX Aggregate Functions when run with custom window frames #4402
  • Support INSERT INTO statement #4397
  • Enhancement: split the SQL planner into smaller modules #4392
  • Proposal: Improve the join keys of logical plan #4389
  • Add MergeSubqueryAlias rule #4383
  • Optimizer rule support subqueryAlias #4381
  • Rewrite simple regex expressions #4370
  • Revisit get_statistics_with_limit() method in datasource mod #4323
  • Support for type coercion for a (Timestamp, Utf8) pair #4311
  • replace the operation about decimal to the arrow-rs kernel #4289
  • change date_part return types to f64 #3997
  • Better api for setting ConfigOptions from SessionContext #3908
  • Make ConfigOptions easier to work with #3886
  • An asynchronous version of CatalogList/CatalogProvider/SchemaProvider #3777
  • Allow configs to be set with string values #3500
  • support scientific notation for SQL literals #3448
  • Adopt physical plan serde from arrow-ballista #3257
  • Improve codebase readability and error messages by consistently handling downcasting #3152
  • Re-enable where_clauses_object_safety #3081
  • optimize/simplify the literal data type and remove unnecessary cast, try_cast #3031
  • Move datafusion-substrait crate into arrow-datafusion repo #2646
  • [enhancement] rules don't need to recursion inside themselves #2620
  • Add support for GROUPING SETS syntax in SQL planner #2469
  • Optimize EXISTS subquery expressions by rewriting as semi-join #2351
  • Add Delta Lake TableProvider #525
  • Support window functions with window frame #361

Fixed bugs:

  • PushdownFilter rule exist bug will cause filter change wrong #4822
  • Unlimited memory consumption in RepartitionExec #4816
  • Physical Optimizer Config Mutation Doesn't Take Effect #4806
  • cargo test failed error: linking with cc failed: exit status: 1 #4790
  • Parquet files generated by DataFusion cannot be read by Apache Spark #4782
  • datafusion-physical-expr doesn't compile when blake3/traits-preview is enabled #4781
  • Multiple ways to express like / ilike / not like / not ilike #4765
  • SessionState::optimize and SessionState::create_physical_plan Don't Update Query Start Time #4747
  • Page Filtering Incorrectly Handles Pages with Different Row Counts #4744
  • cargo test failing on master due to tpcds_logical_q41 stackoverflow #4728
  • PruningPredicate Different Evaluation Context from Query #4693
  • Skipping optimizer rule due to create_name not supporting wildcard #4681
  • Create physical plan bug: got Arrow schema with 1 and DataFusion schema with 0 #4677
  • Timestamp <-> Date32 compare doesn't work #4672
  • Wrongly use the function clamp #4654
  • Fix the clippy errors #4653
  • Filter Null Keys Update Not Taking Effect #4638
  • Should not generate duplicate sort keys from Window expr's partition by keys #4635
  • common_sub_expression_eliminate exists bug #4575
  • Confusing "Bare" in doesn't exist messages #4571
  • having shouldn't include alias in projection #4556
  • wrong comment about having #4554
  • drop view t1, t2, ... and drop table t1, t2, ... silently ignores arguments past the first #4531
  • Extract from timestamp doesn't support nanosecond #4528
  • prepare_select_exprs don't need outer_query_schema #4526
  • Table names with periods are not handled correctly #4513
  • Push_down_projection push redundant column. #4486
  • Planner don't generate SubqueryAlias #4483
  • Planner generate replicated Projection | SubqueryAlias #4481
  • apply_table_alias will ignore alias_name when columns is empty. #4454
  • Fix output_ordering of WindowAggExec #4438
  • Incorrect error for plus/minus operations over timestamps and dates #4420
  • Optimization rule filter_push_down causes FieldNotFound error #4401
  • Should not convert a normal non-inner join to Cross Join when there are non-equal Join conditions #4363
  • MemoryConsumer::try_grow Underflow #4328
  • Potential MemoryManager Deadlock #4325
  • create external table should fail to parse if syntax is incorrect #4262
  • Nullif func states support for Boolean type, but fails if this is attempted #4205
  • ProjectionPushDown rule don't consider the alias in projection. #4174
  • Stack overflow planning complex query #4065
  • Can not use extract <part> on the value of now() #3980
  • Bug with intervals and logical and/or #3944
  • CoalesceBatches doesn't provide correct elapsed_compute info in metrics #3894
  • Panicked at to_timestamp_micros function when the timestamp is too large. #3832
  • Optimizer casts decimals to different values on different platforms #3791
  • CSV inference reads in the whole file to memory, regardless of row limit #3658
  • after type coercion CommonSubexprEliminate will produce invalid projection #3635
  • panic at attempt to multiply with overflow when doing math on Decimal128 columns #3437
  • Precedence bug with date comparison to date plus interval #3408
  • Median aggregation using DataFrame panics: "AggregateState is not a scalar aggregate" #3105
  • date_part doesn't work for now() #3096
  • hash_join panics when join keys have different data types #2877
  • Memory manager triggers unnecessary spills #2829
  • Address performance/execution plan of TPCH query 9 #77

Documentation updates:

  • Add a new open source project that uses DataFusion as its query engine #4768 (francis-du)

Closed issues:

  • move the tests in planner #4798
  • Make it easier to update sqllogictest test expected output ("test script completion mode") #4570
  • Make ConfigOption names into an Enum #4517
  • Implement null / empty string handling for sqllogictest #4500
  • Write a blog about parquet predicate pushdown #3464
  • Ensure column names are equivalent with or without optimization #1123

Merged pull requests:

  • Bump tokio from 1.23.0 to 1.23.1 in /datafusion-cli #4835 (dependabot[bot])
  • Fix a few links in roadmap.md #4833 (romanz)
  • DataFusion 16.0.0 release prep: Update version + add changelog #4831 [sql] (andygrove)
  • feat: use arrow row format for hash-group-by #4830 (crepererum)
  • refactor: split relation of planner into one part. #4829 [sql] (jackwener)
  • bugfix: remove cnf_rewrite in push_down_filter #4825 (jackwener)
  • minor: add some comments to row group pruning tests #4823 (alamb)
  • Handle trailing tbl column in TPCH benchmarks #4821 (tustvold)
  • fix: account for memory in RepartitionExec #4820 (crepererum)
  • Fix clippy #4817 (tustvold)
  • Add test cases: row group filter with missing statistics for decimal data type #4810 (liukun4515)
  • Move default catalog and schema onto ConfigOptions (#3887) #4805 (tustvold)
  • remove duplicated test #4800 (jackwener)
  • Update sqlparser requirement from 0.29 to 0.30 #4799 [sql] (dependabot[bot])
  • rewrite the function ensure_any_column_reference_is_unambiguous #4797 [sql] (HaoYang670)
  • Uncomment nanoseconds tests after sql parser upgrade #4789 (comphead)
  • fix: ListingSchemaProvider directory paths (related: #4204) #4788 (cfraz89)
  • Minimize stack space required to plan deeply nested binary expressions #4787 [sql] (alamb)
  • Minor: Refactor some sql planning code into functions #4785 [sql] (alamb)
  • Make datafusion-physical-expr compatible with blake3/traits-preview feature. #4784 (BoredPerson)
  • refactor: split expression of planner into one part. #4783 [sql] (jackwener)
  • Fix Stack overflow in sql planning in debug builds #4779 [sql] (alamb)
  • Pipeline-friendly Bounded Memory Window Executor #4777 (mustafasrepo)
  • Implement OptimizerConfig for SessionState #4775 (tustvold)
  • refactor: extract parse_value #4774 [sql] (jackwener)
  • Structify ConfigOptions (#4517) #4771 (tustvold)
  • Update sqlparser to 29.0.0 #4770 [sql] (alamb)
  • Refactor extract_join_keys and move the ExtractEquijoinPredicate rule #4760 (ygf11)
  • Remove the config datafusion.execution.coalesce_target_batch_size and use datafusion.execution.batch_size instead #4757 (yahoNanJing)
  • Add alias check for equijoin in from_plan #4755 (ygf11)
  • Take the top level schema into account when creating UnionExec #4753 (HaoYang670)
  • Set query_execution_start_time on snapshot from SessionContext (#4747) #4750 (tustvold)
  • minor: Improve docstrings #4748 [sql] (alamb)
  • Append generated column to the schema instead of prepending for WindowAggExec #4746 (mustafasrepo)
  • Minor: comments about coercion in physical planner #4745 (alamb)
  • Simplify parquet filter predicate test, fix Page Filtering Incorrectly Handles Pages with Different Row Counts #4743 (tustvold)
  • support byte array for decimal in parquet page and row group filters #4742 (liukun4515)
  • revert some code for #4726 / remove unnecessary coercion in physical plans #4741 (liukun4515)
  • Cleanup InformationSchema plumbing #4740 (tustvold)
  • Minor: use a common method to check the validate of equijoin predicate #4739 (ygf11)
  • minor: Support more data type for null_counts in the PruningStatistics #4738 (liukun4515)
  • Extended datatypes & signatures support for NULLIF function #4737 (korowa)
  • minor: improve debug logging for pruning predicates #4736 (alamb)
  • refactor: parallelize parquet_exec test case single_file #4735 (waynexia)
  • fix: add one more projection to recover output schema #4733 (waynexia)
  • remove SubqueryFilterToJoin #4731 (jackwener)
  • Create writer with arrow::ipc::IPCWriteOptions #4730 (askoa)
  • Implement cast between Date and Timestamp #4726 (comphead)
  • Dynamic information_schema configuration and port more tests #4722 (alamb)
  • Add TPC-DS query planning regression tests #4719 (andygrove)
  • Minor: refactor streaming CSV inference code #4717 (alamb)
  • Reorder the physical plan optimizer rules, extract GlobalSortSelection, make Repartition optional #4714 (yahoNanJing)
  • Eagerly construct PagePruningPredicate #4713 (tustvold)
  • Move the extract_join_keys to optimizer #4711 [sql] (ygf11)
  • Avoid to bypass try_new/new() to build plan directly and cleanup filter #4702 (jackwener)
  • MINOR: Remove where_clause_object_safety clippy ignore (#3081) #4696 (tustvold)
  • Support for executing infinite files and boundedness-aware join reordering rule #4694 (metesynnada)
  • Unnecessary SortExec removal rule from Physical Plan #4691 (mustafasrepo)
  • minor: rename the github actions #4689 (jackwener)
  • FOLLOWUP: remove more recursion in optimizer rules. #4687 (jackwener)
  • Add line that prevents display_name from being called on Wildcard #4682 (andre-cc-natzka)
  • Deprecate SessionContext::create_logical_plan (#4617) #4679 (tustvold)
  • Support NTILE window function #4676 (berkaycpp)
  • Support min max aggregates in window functions with sliding windows #4675 (berkaycpp)
  • Refactor Expr::AggregateFunction and Expr::WindowFunction to use struct #4671 [sql] (Jefffrey)
  • Support type coercion for equijoin #4666 (ygf11)
  • Add --complete auto completion mode to sqllogictests #4665 (alamb)
  • Fix CoalesceBatches elapsed_compute metric #4664 (Jefffrey)
  • Refactor Expr::Sort to use struct #4663 [sql] (Jefffrey)
  • More descriptive error for plus/minus between timestamps/dates #4662 (Jefffrey)
  • Stream CSV file during schema inference #4661 (Jefffrey)
  • Refine the logical and physical plan serialization and deserialization #4659 (yahoNanJing)
  • Use thiserror in sqllogictest error #4657 (xudong963)
  • fix cargo clippy warning #4652 [sql] (jackwener)
  • Improve group by hash performance: avoid group-key/-state clones for hash-groupby #4651 (crepererum)
  • remove recursion in optimizer rules #4650 (jackwener)
  • replace the arithmetic op for decimal array op decimal array using arrow kernel #4648 (liukun4515)
  • simplify regex expressions #4646 (crepererum)
  • Avoid generate duplicate sort Keys from Window Expressions, fix bug when decide Window Expressions ordering #4643 [sql] (mingmwang)
  • Refactor Expr::TryCast to use a struct #4642 [sql] (ygf11)
  • add ILIKE support #4639 (crepererum)
  • Detect invalid (unsupported) compression types when parsing #4637 [sql] (HaoYang670)
  • unwrap_cast_in_comparison.rs: support uint <-> decimal #4634 (liukun4515)
  • MINOR: Fix incorrect config definitions #4623 (andygrove)
  • FOLLOWUP: remove optimize() #4619 (jackwener)
  • Optimizer: avoid every rule must recursive children in optimizer #4618 (jackwener)
  • fix: run logical optimizer rules for TableScan expressions #4614 (crepererum)
  • refactor: relax the signature of register_* in SessionContext #4612 (waynexia)
  • Remove the function consume_token from the parser #4609 [sql] (HaoYang670)
  • Make SchemaProvider::table async #4607 (tustvold)
  • Lazy system tables #4606 (tustvold)
  • Refactor: Change equijoin keys from column to expression in logical join #4602 [sql] (ygf11)
  • refactor: extract assert_optimized_plan_eq from UT. #4600 (jackwener)
  • add try_optimize() for all rules. #4599 (jackwener)
  • Normalize datafusion configuration names #4596 (yahoNanJing)
  • Fix the bugs in parsing COMPRESSION TYPE #4590 [sql] (HaoYang670)
  • Minor: Remove datafusion-core dev dependency from datafusion-sql #4589 [sql] (alamb)
  • Improve error handling for array downcasting #4588 (retikulum)
  • Update to arrow v29 #4587 [sql] (tustvold)
  • Add need_data_exchange in the ExecutionPlan to indicate whether a physical operator needs data exchange #4586 (yahoNanJing)
  • Move subset of select tests to sqllogic #4583 (ajayaa)
  • bugfix: just allow having use expr in groupby or aggr #4579 [sql] (jackwener)
  • Output sqllogictests with arrow display rather than CSV writer #4578 (alamb)
  • Minor: Add test case for reduce cross join #4577 (ygf11)
  • refactor: remove redundant outer_query_schema #4576 [sql] (jackwener)
  • Preserve the TryCast expression in columnize_expr #4574 [sql] (byteink)
  • Remove Confusing "Bare" in does not exist messages #4572 [sql] (alamb)
  • Minor: Add tests for date interval predicate handling #4569 (alamb)
  • Update sqlparser requirement from 0.27 to 0.28 #4568 [sql] (alamb)
  • Avoid materializing local variables when creating sortMergeJoinExec #4566 (HaoYang670)
  • Minor: Fix logical conflict #4565 (alamb)
  • feat: support nested loop join with the initial version #4562 [sql] (liukun4515)
  • feat: prepare logical plan to logical plan without params/placeholders #4561 [sql] (NGA-TRAN)
  • Write faster kernel for is_distinct #4560 (comphead)
  • refactor code about query -> plan for subqueries #4559 [sql] (jackwener)
  • fix: remove wrong comment about having #4555 [sql] (jackwener)
  • feat: user-defined aggregate function(UDAF) as window function #4553 [sql] (MichaelScofield)
  • Fix date_part/extract functions to support now() #4548 (comphead)
  • bump sqllogictest to 0.9.0 #4547 (xxchan)
  • minor: Remove more clones from the planner #4546 [sql] (alamb)
  • Add tests for coercion of timestamps to strings #4545 (alamb)
  • MINOR: move sqllogictest to dev-dependencies #4544 (alamb)
  • MINOR: add some comments about intended use of ChunkedStore #4541 (alamb)
  • fix: remove TODOs linked to arrow#3147 #4540 (crepererum)
  • refactor: remove redundant build_join_schema() #4538 (jackwener)
  • Move some create/drop tests to ddl.slt #4535 (alamb)
  • Minor: Avoid cloning as many Ident during SQL planning #4534 [sql] (alamb)
  • shouldn't add outer_query_schema in sql_select_to_rex #4527 [sql] (jackwener)
  • Avoid reading the entire file in ChunkedStore #4525 (metesynnada)
  • Simplify MemoryManager #4522 (tustvold)
  • Fix limited statistic collection across files with no stats #4521 (isidentical)
  • refactor: make Ctes a struct to also store data types provided by prepare stmt #4520 [sql] (NGA-TRAN)
  • Enrich filter statistics with known column boundaries #4519 (isidentical)
  • Remove Option from window frame #4516 [sql] (mustafasrepo)
  • Make nightly clippy happy #4515 [sql] (xudong963)
  • Remove interior mutability of MemTable #4514 (xudong963)
  • Make window function related struct public for ballista. #4511 (Ted-Jiang)
  • minor: rename push_down_limit #4510 (jackwener)
  • Add get_window_frame in window_expr, show frame info in window_agg_exec #4508 (Ted-Jiang)
  • Add sqllogictest auto labeler #4506 (mvanschellebeeck)
  • Add some more aggregate sqllogictests and remove rust tests #4505 (mvanschellebeeck)
  • Remove sqllogictests CI run #4504 (mvanschellebeeck)
  • Refactor code for insert in sqllogictest #4503 (xudong963)
  • Add empty string normalization to sqllogictests #4501 (alamb)
  • sqllogictest: A logging and command line filter #4497 (alamb)
  • Support insert into statement in sqllogictest #4496 (xudong963)
  • Improve error handling for array downcasting #4493 (retikulum)
  • Unify most of SessionConfig settings into ConfigOptions #4492 (alamb)
  • feat: support prepare statement #4490 [sql] (NGA-TRAN)
  • Minor: Update docstrings and comments to aggregate code #4489 (alamb)
  • Fix panic in median "AggregateState is not a scalar aggregate" #4488 (alamb)
  • fix push_down_projection push redundant columns. #4487 (jackwener)
  • Add window func related logic plan to proto ability. #4485 (Ted-Jiang)
  • fix Planner don't generate SubqueryAlias and generate duplicated SubqueryAlias #4484 [sql] (jackwener)
  • Improve parquet partition_file output display #4467 (alamb)
  • minor: remove redundant unwrap() #4463 (jackwener)
  • Fix Cte in from clause with duplicated cte name #4461 [sql] (xudong963)
  • Replace &Option<T> with Option<&T> part 2 #4458 (askoa)
  • Fix output_partitioning(), output_ordering(), equivalence_properties() in WindowAggExec, shift the Column indexes #4455 (mingmwang)
  • fix push_down_filter for pushing filters on grouping columns rather than aggregate columns #4447 (jackwener)
  • Add support for non-column key for equijoin when eliminating cross join to inner join #4443 [sql] (ygf11)
  • Remove the schema checking when creating CrossJoinExec #4432 (HaoYang670)
  • date_part support fractions of second #4385 (comphead)
  • Minor: use upstream RowSelection code from arrow intersect_row_selection #4340 (alamb)
  • Support type coercion for timestamp and utf8 #4312 (andre-cc-natzka)