<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

15.0.0 (2022-12-01)

Full Changelog

Breaking changes:

  • Expose remaining parquet config options into ConfigOptions (try 2) #4427 (alamb)
  • Config Cleanup: Remove TaskProperties and KV structure, keep key=value serialization #4382 (alamb)
  • add {TDigest,ScalarValue,Accumulator}::size #4342 (crepererum)
  • API-break: Support SubqueryAlias and remove Alias in Projection #4333 [sql] (jackwener)
  • split try_new_with_schema_alias from original code #4284 (jackwener)
  • Collapse statistics in normal explain plan #4157 (alamb)
  • Linearize binary expressions to reduce proto tree complexity #4115 (isidentical)
  • support SET Timezone #4107 [sql] (waitingkuo)

Implemented enhancements:

  • Refactor Built-in, Aggregate window functions to increase code reuse. #4440
  • Helper to get "root" error #4435
  • Do NOT convert intermediate/source errors to strings. #4434
  • Estimate the total_byte_size of the filter expression's result when selectivity is available #4374
  • refactor the code of the HashJoin #4356
  • CoalesceBatchesExec reports no ordering #4331
  • Introduce tournament tree to achieve better k-way sort-merging #4300
  • Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4299
  • Remove the macro rule unary_scalar_expr from expr_fn.rs #4298
  • Remove Alias-in-Projection, replace it with SubqueryAlias #4291
  • reimplement reduce_outer_join #4270
  • Reimplement filter_push_down #4266
  • Reimplement eliminate_limit #4264
  • Reimplement limit_push_down #4263
  • Make a data driven SQL testing tool (so we can reuse duckdb test suite, example) #4248
  • upgrade chrono to 0.4.23 #4224
  • support scan non-string columns partitioned parquet files #4218
  • Allow optimizer rules to skip optimizing plans #4209
  • Support specifying schema when creating tables #4183
  • Improve ergonomics of creating ListingOptions #4178
  • Add ability to specify external sort information for ParquetExec #4169
  • Add another method to collect referenced columns from an expression #4152
  • Improve EXPLAIN ANALYZE output for parquet exec #4144
  • TableProviderFactory::create should have Optional<DFSchemaRef> parameter #4142
  • Support more expressions in equality join #4140
  • JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin based on Stats #4139
  • Allow TPCH tooling to create a combined result for easier processing by outside tools #4127
  • Allow additional options when creating an external table #4125
  • reuse code utils::optimize_children instead of redundant implementation #4120
  • Add test field to PR template #4113
  • Allow for automatic registration of ListingTables #4111
  • Add CI check that configs.md is up-to-date #4108
  • Support SET timezone to non-UTC time zone #4106
  • Parquet predicates contains and true expressions #4091
  • Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4077
  • add support for .xz compressed files #4074
  • add a feature gate to make support for compressed files optional #4073
  • Support serializing more deeply nested AND / OR expressions #4066
  • Use f64::total_cmp instead of OrderedFloat #4051
  • Add documentation to make it clear that decimal support is still experimental #4036
  • Simplify Pushed Down Predicates #4020
  • Improve HashJoinExec metrics #4009
  • Move physical plan serde from Ballista to DataFusion #3949
  • Support SubqueryAlias better in planner #3927
  • A framework for expression boundary analysis (and statistics) #3898
  • Replace Filter: Boolean(false) with EmptyRelation #3864
  • Implement statistics estimation for FilterExec #3845
  • Support parquet page filtering for more types: String, Binary(Decimal), Int96 #3833
  • Allow configuring parquet filter pushdown dynamically #3821
  • Unable to register tables in non-cloud S3 servers #3640
  • support more data type in prune for cast/try_cast #3442
  • Disable spill to disk globally #3264
  • Consider to categorize Operator #3216
  • Replace Projection.alias with SubqueryAlias #2212
  • [Optimizer] Eliminate the distinct #2045
  • beautify datafusion's site: https://datafusion.apache.org/ #1819
  • split datafusion-logical-plan sub-module #1755
  • convert outer join to inner join to improve performance #1585
  • Add sqllogictest for datafusion #1453
  • Add additional simplification rules #1406
  • support more subqueries #1209
  • Add baseline metrics for remaining execution plan nodes #1019
  • Make ExecutionPlan implementations immutable #987
  • Architecture overview may be insufficient in README #980
  • Add a separate configuration setting for parallelism of scanning parquet files #924
  • Support hash repartition elimination #41
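
Item #4051 above swaps `OrderedFloat` for `f64::total_cmp`. As a minimal sketch of what that standard-library method provides (this is not DataFusion's code, and `sort_floats` is a hypothetical helper introduced only for illustration), the example below sorts floats that include NaN and signed zeros using a total order:

```rust
use std::cmp::Ordering;

/// Hypothetical helper: sort floats with the IEEE 754 total order.
/// `total_cmp` always returns an `Ordering`, so no wrapper type is needed.
fn sort_floats(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

fn main() {
    // Force a positive NaN so its place in the total order is well defined.
    let nan = f64::NAN.copysign(1.0);
    let mut values: Vec<f64> = vec![3.5, nan, -1.0, 0.0, -0.0, f64::NEG_INFINITY];
    sort_floats(&mut values);
    println!("{:?}", values);

    // `partial_cmp` gives up on NaN; `total_cmp` still yields an ordering.
    assert_eq!(nan.partial_cmp(&1.0), None);
    assert_eq!(nan.total_cmp(&1.0), Ordering::Greater);
    // `total_cmp` also orders -0.0 before +0.0.
    assert_eq!((-0.0f64).total_cmp(&0.0), Ordering::Less);
}
```

`f64::total_cmp` has been stable since Rust 1.62, so the sketch needs no external crates.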

Fixed bugs:

  • pyarrow CI failed #4448
  • UnwrapCastInComparison has a bug #4430
  • The CLI panics when passing an invalid explain query #4378
  • HashJoin should return Err when the right side input stream produces Err #4362
  • Optimizer check errors if resulting schema has different metadata #4346
  • Panic with function to_hex #4339
  • LimitPushDown pushdown into limit, result is wrong #4308
  • DESCRIBE statement issue with qualified table references #4303
  • Panic with window function LAST_VALUE #4297
  • CI failed in Compare to postgres #4294
  • Field alias can't work in where clause #4288
  • Some valid filters are not pushed down to parquet scan #4282
  • The type renaming pub type NullColumnarValue = ColumnarValue makes no sense #4271
  • Current limit_push_down can't support cross_join #4256
  • Cargo test fail #4253
  • RightSemi/RightAnti HashJoin has bug, the left_indices is never populated, causing failure to apply join filters. #4247
  • Clippy failures #4245
  • Cannot query s3 data from datafusion-cli #4239
  • Bug parsing interval with negative values #4237
  • cargo test reports errors on the master branch. #4236
  • Doc of the expression function log2 is incorrect #4231
  • HashJoin with mode PartitionMode:CollectLeft has bug and can produce wrong result #4230
  • Add ambiguous check when generate projection plan #4210
  • What happened for NDJSON support on CLI? #4198
  • Add ambiguous check when generate join plan #4197
  • Clippy failing on master : error: use of deprecated associated function chrono::NaiveDate::from_ymd: use from_ymd_opt() instead #4187
  • Reimplement the eliminate_cross_join #4176
  • Incorrect handling of column names #4166
  • Update release scripts to support datafusion-benchmarks #4134
  • Bug in interpreting correctly parsed SQL with aliases #4123
  • The percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4103
  • Panic when using array_agg #4080
  • Wrong result for FIRST_VALUE AND LAST_VALUE window functions #4076
  • Round error when casting float to decimal #4071
  • Predicate still has cast when comparing Timestamp(Nano, None) to a timestamp literal, so can't be pushed down or used for pruning #3938
  • Revisit required_child_distribution(), output_partitioning(), output_ordering() implementations in ExecutionPlan's implementations #3653
  • Can't push down projection after do type coercion #3583
  • In some circumstances cast expression is not working #3499
  • output_partitioning() and output_ordering() implementations are wrong in some physical plan implementations with alias #3400
  • Interval Literal doesn't work for timeunit less than millisecond #3204
  • INTERVAL literal with duplicated interval types should raise error #3183
  • Error occurs when only using partition columns in query #1999
  • regex_match does not compile using the g flag #1429
  • between with NULL literals does not work: can't be evaluated because there isn't a common type to coerce the types to #1193
  • [Datafusion] Error with CAST: Unsupported SQL type Time #193

Closed issues:

  • SQL level coverage for when memory limit is exceeded #4404
  • Throw error (not panic) if a listing table specifies a missing partition column #4350
  • Page index pruning fails on complex_expr #4317
  • optimize limit-full join in the limit push down rule #4275
  • infer_schema function is not working with s3 Urls or http endpoints #4269
  • Add support for binary boolean operators with nulls #4241
  • Add additional testing to parquet predicate pushdown integration tests #4087
  • Add metrics for parquet page level skipping #4086
  • Add parquet page index pushdown metrics #4058
  • Throw a runtime error if the memory allocated to GroupByHash exceeds a limit #3940
  • support unsigned numeric data type in UnwrapCastInBinaryComparison rule #3702
  • Support type cast in union #2125
  • [EPIC] Memory Limited Sort (Externalized / Spill) #1568
  • Maintain partition information in Union #189
  • Add coercion support for NULL literals #185

Merged pull requests:

  • Make datafusion-sql depend on arrow-schema instead of arrow #4456 [sql] (mbrobbel)
  • replace the comparator for decimal array op scalar using arrow kernel #4453 (liukun4515)
  • Fix pyarrow test #4450 (mvanschellebeeck)
  • Replace &Option<T> with Option<&T> #4446 [sql] (askoa)
  • Improve error handling for array downcasting #4445 (retikulum)
  • Refactor Builtin Window Function Implementation #4441 (mustafasrepo)
  • feat: DataFusionError::find_root #4437 (crepererum)
  • fix: do NOT convert errors to strings but keep the type #4436 (crepererum)
  • The CLI panics when passing an invalid explain query #4429 (comphead)
  • [minor] use arrow kernel concat_batches instead of combine_batches #4423 (Ted-Jiang)
  • fix panic on to_hex function for negative numbers #4422 (retikulum)
  • Optimize filter executor in pull-based executor #4421 (xudong963)
  • optimize limit push for join case #4411 (liukun4515)
  • Add integration test for erroring when memory limits are hit #4406 (alamb)
  • feat: ResourceExhausted for memory limit in AggregateStream #4405 (crepererum)
  • Update to arrow 28 #4400 [sql] (tustvold)
  • Update rstest requirement from 0.15.0 to 0.16.0 #4399 (dependabot[bot])
  • Add sqllogictests (v0) #4395 (mvanschellebeeck)
  • improve hashjoin execution metrics #4394 (AssHero)
  • Add with_new_inputs for LogicalPlan #4393 (jackwener)
  • Clean the code in limit.rs. #4391 (HaoYang670)
  • Move physical plan serde from Ballista to DataFusion #4390 (Kikkon)
  • Fix page index pruning fail on complex_expr #4387 (Ted-Jiang)
  • Add check for nested types in equivalent names and types #4380 (alamb)
  • refine the code of build schema for ambiguous check, factor this out into a function #4379 [sql] (AssHero)
  • Refactor the Hash Join #4377 (liukun4515)
  • Minor: Fix typos in the documentation #4376 (martin-g)
  • Include byte size estimates in the filter statistics #4375 (isidentical)
  • HashJoin should return Err when the right side input stream produces Err, add more join UTs to cover different join types #4373 [sql] (mingmwang)
  • feat: ResourceExhausted for memory limit in GroupedHashAggregateStream #4371 (crepererum)
  • Use limit() function instead of show_limit() in the first example #4369 (martin-g)
  • Update env_logger requirement from 0.9 to 0.10 #4367 (dependabot[bot])
  • reimplement push_down_filter to remove global-state #4365 (jackwener)
  • Support using the Scheduler in tpch benchmark #4361 (xudong963)
  • Adding more dataframe examples to read csv files #4360 (DataPsycho)
  • minor: correct name and typo #4359 (jackwener)
  • Do not log error if page index can not be evaluated #4358 (alamb)
  • Clean the expr_fn - use scalar_expr to create unary scalar expr functions, remove macro unary_scalar_functions #4357 (HaoYang670)
  • Throw error (not panic) if a listing table specifies a missing partition column #4354 (doki23)
  • Improve error handling and add some more types for proper downcasting #4352 (retikulum)
  • Add check to avoid underflow in memory manager #4351 (askoa)
  • Improve error messages when memory is exhausted while sorting #4348 (alamb)
  • Do not error in optimizer if resulting schema has different metadata #4347 (alamb)
  • minor: improve optimizer logging and do not repeat rule name #4345 (alamb)
  • minor: fix typos in test names #4344 [sql] (alamb)
  • Minor: Add docstrings to EliminateOuterJoins optimizer pass #4343 (alamb)
  • Minor: refactor: isolate common memory accounting utils #4341 (crepererum)
  • minor: make plan_from_tables return one plan instead of Vec #4336 [sql] (jackwener)
  • enhancement: when fetch == 0, pushdown limit 0 instead of skip+fetch. #4334 (jackwener)
  • Teach optimizer that CoalesceBatchesExec does not destroy output order #4332 (alamb)
  • Add ability to disable DiskManager #4330 (tustvold)
  • Update cli.md #4329 (psvri)
  • fix bug: right semi join can't support the filter #4327 (liukun4515)
  • reimplement eliminate_limit to remove global-state. #4324 (jackwener)
  • Refine Err propagation and avoid unwrap in transform closures #4318 (mingmwang)
  • Add a checker to confirm physical optimizer rules will keep the physical plan schema immutable #4316 (mingmwang)
  • Refactor downcasting functions with downcastvalue macro and improve error handling of ListArray downcasting #4313 (retikulum)
  • minor: add another test case to cover join ambiguous check #4305 [sql] (ygf11)
  • Fix DESCRIBE statement qualified table issue #4304 [sql] (gruuya)
  • Use tournament loser tree for k-way sort-merging, increase merge speed by 50% #4301 (richox)
  • Pin Python setuptools in the CI to fix integration tests #4296 (isidentical)
  • Support SubqueryAlias in optimizer, physical planner. #4293 (jackwener)
  • minor: avoid a clone into string when checking ambiguous #4292 [sql] (ygf11)
  • replace the comparison op for decimal array op using the arrow-rs kernel #4290 (liukun4515)
  • MINOR: replace {..} with (_), typo, remove outdated TODO #4286 (jackwener)
  • Reduce Expr copies in ParquetExec #4283 (alamb)
  • Fix issue in filter pushdown with overloaded projection index #4281 (thinkharderdev)
  • Skip useless pruning predicates in ParquetExec #4280 (alamb)
  • Push down more predicates into ParquetExec #4279 (alamb)
  • Fix EXPLAIN plan for ParquetExec to show pruning_predicate #4278 (alamb)
  • reimplement limit_push_down to remove global-state, enhance optimize and simplify code. #4276 (jackwener)
  • Bump actions/labeler from 4.0.2 to 4.1.0 #4274 (dependabot[bot])
  • Remove the type alias NullColumnarValue #4273 (HaoYang670)
  • reimplement eliminate_outer_join #4272 (jackwener)
  • Fix bugs in parsing with header row and partitioned by #4268 [sql] (HaoYang670)
  • improve error messages while downcasting UInt32Array, UInt64Array and BooleanArray #4261 (retikulum)
  • add ambiguous check for projection #4260 [sql] (AssHero)
  • Add ambiguous check for join #4258 [sql] (ygf11)
  • support cross_join in limit_push_down #4257 (jackwener)
  • Support parquet page filtering on min_max for decimal128 and string columns #4255 (Ted-Jiang)
  • fix conflict and UT, cleanup redundant legacy code #4252 (jackwener)
  • Minor: remove unnecessary clone() in planner #4249 [sql] (alamb)
  • Fix nightly clippy failures #4246 (mvanschellebeeck)
  • Improve Error Handling and Readability for downcasting Float32Array, Float64Array, StringArray #4244 (retikulum)
  • Use defaults for ListingOptions builder #4243 (mvanschellebeeck)
  • Support binary boolean operators with nulls #4242 (Ted-Jiang)
  • Fixing doc of the expression #4240 (Creampanda)
  • Fix negative interval parsing bug #4238 (Jefffrey)
  • remove duplicate or redundant code #4235 (jackwener)
  • add a checker to confirm optimizer can keep plan schema immutable. #4233 (jackwener)
  • Fix the percentile argument for ApproxPercentileCont must be Float64, not Decimal128(2, 1) #4228 (comphead)
  • refactor how we create listing tables #4227 (timvw)
  • Update sqlparser requirement from 0.26 to 0.27 #4226 [sql] (alamb)
  • upgrade required chrono version to 0.4.23 #4225 (waitingkuo)
  • Support types other than String for partition columns on ListingTables #4221 (doki23)
  • [CBO] JoinSelection Rule, select HashJoin Partition Mode based on the Join Type and available statistics, option for SortMergeJoin #4219 (mingmwang)
  • Remove alias in Union #4212 (jackwener)
  • Add try_optimize method #4208 (andygrove)
  • Provide a builder for ListingOptions with fixups #4207 (alamb)
  • Avoid error with empty iterators used for ScalarValue::iter_to_array #4206 (GrandChaman)
  • Improve error message for regexp_match 'g' flag #4203 (Jefffrey)
  • Return ResourceExhausted errors when memory limit is exceeded in GroupedHashAggregateStreamV2 (Row Hash) #4202 (crepererum)
  • Add additional expr boolean simplification rules #4200 (Jefffrey)
  • Update to arrow and parquet 27.0.0 #4199 [sql] (tustvold)
  • Support create table with explicit column definitions #4194 [sql] (doki23)
  • Support all equality predicates in equality join #4193 [sql] (ygf11)
  • add propagate_empty_relation optimizer rule #4192 (jackwener)
  • fix clippy #4190 [sql] (jackwener)
  • Fix clippy by avoiding deprecated functions in chrono #4189 (alamb)
  • Disallow duplicate interval types during parsing #4188 (Jefffrey)
  • Parse nanoseconds for intervals #4186 (Jefffrey)
  • Add rule to reimplement Eliminate cross join and remove it in planner #4185 [sql] (jackwener)
  • [FOLLOWUP] Enforcement Rule: resolve review comments, refactor adjust_input_keys_ordering() #4184 (mingmwang)
  • Simplify boolean parquet pushdown predicate #4182 (Jefffrey)
  • Minor: consolidate parquet custom_reader integration test into parquet_exec #4175 (alamb)
  • minor: remove redundant println and cleanup #4173 (jackwener)
  • Add ability to specify external sort information for ListingTables #4170 (alamb)
  • Improve Error Handling and Readability for downcasting Decimal128Array #4168 (retikulum)
  • Minor: Remove completed comment on parquet row group pruning #4167 (alamb)
  • Update hashbrown requirement from 0.12 to 0.13 #4164 (dependabot[bot])
  • MINOR: enable dyn_cmp_dict feature on arrow for physical expr crate #4163 (isidentical)
  • Derive filter statistic estimates from the predicate expression #4162 (isidentical)
  • Minor: pass ParquetFileMetrics to build_row_filter in parquet #4161 (alamb)
  • Minor: Extract parquet row group pruning code into its own module #4160 (alamb)
  • Full support for time32 and time64 literal values (ScalarValue) #4156 (andre-cc-natzka)
  • Window frame GROUPS mode support #4155 (zembunia)
  • Improve error messages while downcasting Int64Array #4154 (retikulum)
  • Add another method to collect referenced columns from an expression #4153 [sql] (ygf11)
  • Remove BoxedAsyncFileReader #4150 (tustvold)
  • Support unsigned integers in unwrap_cast_in_comparison Optimizer rule #4149 (alamb)
  • Add support for DataType::Timestamp casts in unwrap_cast_in_comparison optimizer pass #4148 (alamb)
  • Add additional testing for unwrap_cast_in_comparison #4147 (alamb)
  • improve error messages while downcasting Int32Array #4146 (retikulum)
  • Minor: Update docstring on unwrap_cast_in_comparison #4145 (alamb)
  • add schema parameter to table provider factory create method #4143 (milenkovicm)
  • fix: shouldn't pass alias through into subquery. #4141 [sql] (jackwener)
  • Preserve the Cast expression in columnize_expr #4137 [sql] (HaoYang670)
  • Set versions to dependencies with path in benchmarks Cargo.toml file #4136 (ArkashaJavelin)
  • Fix links #4135 (mvanschellebeeck)
  • Use f64::total_cmp instead of OrderedFloat #4133 (comphead)
  • Add parquet integration tests for explicitly smaller page sizes, page pruning #4131 (alamb)
  • Consolidate ParquetExec tests in parquet_exec integration test #4130 (alamb)
  • Minor: Use upstream BooleanArray::true_count #4129 (alamb)
  • Combined TPCH runs & uniformed summaries for benchmarks #4128 (isidentical)
  • Enable TableProviderFactories to receive additional options when creating an external table #4126 [sql] (timvw)
  • Add CI check that configs.md is up-to-date #4124 (mvanschellebeeck)
  • [Part3] Partition and Sort Enforcement, Enforcement rule implementation #4122 (mingmwang)
  • reuse code utils::optimize_children but affect inline. #4121 (jackwener)
  • reuse code utils::optimize_children instead of redundant implementation #4119 (jackwener)
  • Allow listing tables to be created via TableFactories #4112 (avantgardnerio)
  • Update SQL reference to state that decimal support is currently experimental #4109 (andygrove)
  • Add metrics for parquet page level skipping #4105 (Ted-Jiang)
  • Add parser option for parsing SQL numeric literals as decimal #4102 [sql] (andygrove)
  • Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4079 (yahoNanJing)
  • Custom window frame support extended to built-in window functions #4078 (mustafasrepo)
  • Enable tests for page index filtering in parquet filter pushdown test #4062 (alamb)
  • [Part2] Partition and Sort Enforcement, ExecutionPlan enhancement #4043 (mingmwang)
  • add support for xz file compression and compression feature #3993 [sql] (Jimexist)
  • Expression boundary analysis framework #3912 (isidentical)
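
Entry #4301 above moves k-way sort-merging to a tournament loser tree. For orientation only, here is a minimal sketch of k-way merging itself. It uses the standard library's `BinaryHeap` rather than a loser tree, so it is not the implementation from that pull request; the function name `k_way_merge` and the `i64` inputs are assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: merge already-sorted inputs into one sorted vector.
/// Each heap entry is (value, input index, position within that input).
fn k_way_merge(inputs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    // Seed the heap with the first element of every non-empty input.
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&first) = input.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }

    let total: usize = inputs.iter().map(|v| v.len()).sum();
    let mut merged = Vec::with_capacity(total);
    // `Reverse` turns the max-heap into a min-heap, so `pop` yields the
    // smallest pending value across all inputs.
    while let Some(Reverse((value, idx, pos))) = heap.pop() {
        merged.push(value);
        // Refill from the same input the popped value came from.
        if let Some(&next) = inputs[idx].get(pos + 1) {
            heap.push(Reverse((next, idx, pos + 1)));
        }
    }
    merged
}

fn main() {
    let inputs = vec![vec![1, 4, 9], vec![2, 3, 10], vec![], vec![0, 5]];
    let merged = k_way_merge(&inputs);
    assert_eq!(merged, vec![0, 1, 2, 3, 4, 5, 9, 10]);
    println!("{:?}", merged);
}
```

Both a binary heap and a loser tree do O(log k) work per output element. A loser tree typically needs fewer comparisons per step, which is the kind of constant-factor gain the pull request title refers to.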