<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

6.0.0 (2021-11-13)

Full Changelog

Breaking changes:

  • Removed deprecated with_concurrency #1200 (rdettai)
  • File partitioning for ListingTable #1141 (rdettai)
  • Add function volatility to Signature #1071 [sql] (pjmore)
  • fix: allow duplicate field names in table join, fix output with duplicated names #1023 (houqp)
  • Make TableProvider.scan() and PhysicalPlanner::create_physical_plan() async #1013 (rdettai)
  • Reorganize table providers by table format #1010 (rdettai)
  • Make Metrics::labels() public #999 (alamb)
  • Rename NthValue::{first_value,last_value,nth_value} to satisfy clippy in Rust 1.55 #986 (alamb)
  • Move CBOs and Statistics to physical plan #965 (rdettai)
  • Update to sqlparser v 0.10.0 #934 [sql] (alamb)
  • FilePartition and PartitionedFile for scanning flexibility #932 [sql] (yjshen)
  • Improve SQLMetric APIs, port existing metrics #908 (alamb)
  • Add support for EXPLAIN ANALYZE #858 [sql] (alamb)
  • Rename concurrency to target_partitions #706 (andygrove)
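
To illustrate why a function signature might carry a volatility category (#1071), here is a minimal, self-contained sketch. The three variant names follow the immutable / stable / volatile classification, but `FoldPhase` and `fold_phase` are hypothetical helpers written for this example; they are not DataFusion APIs, and the real optimizer may draw the lines differently.

```rust
/// Volatility categories (immutable / stable / volatile).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    /// Same inputs always give the same output (e.g. `abs`).
    Immutable,
    /// Output is fixed for the duration of one query (e.g. `now()`).
    Stable,
    /// Output may change on every call (e.g. `random()`).
    Volatile,
}

/// Hypothetical: when may a call with constant arguments be replaced
/// by its precomputed result?
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FoldPhase {
    /// Safe to evaluate once, at any time.
    Anytime,
    /// Safe to evaluate once per query execution.
    PerQuery,
    /// Must be evaluated each time it is called.
    Never,
}

/// Map a volatility category to the earliest point at which folding is safe.
fn fold_phase(v: Volatility) -> FoldPhase {
    match v {
        Volatility::Immutable => FoldPhase::Anytime,
        Volatility::Stable => FoldPhase::PerQuery,
        Volatility::Volatile => FoldPhase::Never,
    }
}

fn main() {
    let examples = [
        ("abs", Volatility::Immutable),
        ("now", Volatility::Stable),
        ("random", Volatility::Volatile),
    ];
    for (name, v) in examples.iter() {
        println!("{}: {:?} -> {:?}", name, v, fold_phase(*v));
    }
}
```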

Implemented enhancements:

  • Add booleans support to the CASE statement #1156
  • Implement General Purpose Constant Folding with the Expression Evaluator #1070
  • Mark volatility categories of functions #1069
  • Add "show" support to DataFrame API #937
  • Add support for TRIM BOTH/LEADING/TRAILING #935
  • Add "baseline" metrics to all built in operators #866
  • Add SQL support for referencing fields in structs #119
  • add filename completer for create table statement #1278 (Jimexist)
  • Add drop table support #1266 [sql] (viirya)
  • Dataframe supports except and update readme #1261 (xudong963)
  • Implement EXCEPT & EXCEPT DISTINCT #1259 [sql] (xudong963)
  • Add DataFrame support for INTERSECT and update readme #1258 (xudong963)
  • use arrow 6.1.0 #1255 (Jimexist)
  • fix 1250, add editor support for datafusion cli with validation #1251 (Jimexist)
  • Add support for create table as via MemTable #1243 [sql] (Dandandan)
  • Add cli show columns command to describe tables #1231 (Jimexist)
  • datafusion-cli to add list table command #1229 (Jimexist)
  • datafusion cli to handle EoF and interrupt signal #1225 (Jimexist)
  • add \q as quit command and add ? for help #1224 (Jimexist)
  • Add algebraic simplifications to constant_folding #1208 (matthewmturner)
  • Improve GetIndexedFieldExpr adding utf8 key based access for struct v… #1204 [sql] (Igosuki)
  • Fix between in select query #1202 [sql] (capkurmagati)
  • Move code to fold Stable functions like now() from Simplifier to ConstEvaluator #1176 (alamb)
  • DataFrame supports window function #1167 [sql] (xudong963)
  • add values list expression #1165 [sql] (Jimexist)
  • Add booleans support to the CASE statement #1161 (xudong963)
  • Improve error messages when operations are not supported #1158 (alamb)
  • Generic constant expression evaluation #1153 (alamb)
  • python lit function to support bool and byte vec #1152 (Jimexist)
  • [nit] simplify datafusion optimizer module codes #1146 (panarch)
  • Add ScalarValue support for arbitrary list elements #1142 (jonmmease)
  • Multiple files per partitions for CSV Avro Json #1138 (rdettai)
  • Implement INTERSECT & INTERSECT DISTINCT #1135 [sql] (xudong963)
  • Simplify file struct abstractions #1120 (rdettai)
  • Implement is [not] distinct from #1117 [sql] (Dandandan)
  • Clean up spawned task on drop for RepartitionExec, SortPreservingMergeExec, WindowAggExec #1112 (crepererum)
  • add hyperloglog implementation (add and count) #1095 (Jimexist)
  • Add ScalarValue::Struct variant #1091 (jonmmease)
  • add digest(utf8, method) function and refactor all current hash digest functions #1090 (Jimexist)
  • [crypto] add blake3 algorithm to digest function #1086 (Jimexist)
  • [crypto] add blake2b and blake2s functions #1081 (Jimexist)
  • [nit] make schema qualifier error message in field lookup more readable #1079 (Jimexist)
  • [window function] add percent_rank window function #1077 (Jimexist)
  • [window function] add cume_dist implementation #1076 (Jimexist)
  • Add a LogicalPlanBuilder::schema() function #1075 (alamb)
  • Add support for UNION [DISTINCT] sql #1068 [sql] (xudong963)
  • fix: fix joins on Float32/Float64 columns bug #1054 (francis-du)
  • Update sqlparser-rs to 0.11 #1052 [sql] (alamb)
  • Support querying CSV files without providing the schema #1050 [sql] (xudong963)
  • remove hard coded partition count in ballista logicalplan deserialization #1044 (xudong963)
  • feat: add lit_timestamp_nanosecond #1030 (NGA-TRAN)
  • Ignore metadata on schema merge #1024 (Smurphy000)
  • add ExecutionConfig.with_optimizer_rules #1022 (seddonm1)
  • Add baseline execution stats to WindowAggExec and UnionExec, and fixup CoalescePartitionsExec #1018 (alamb)
  • Derive PartialOrd for Expr #1015 (alamb)
  • Indexed field access for List #1006 [sql] (Igosuki)
  • Add metrics for Limit and Projection, and CoalesceBatches #1004 (alamb)
  • Update DataFusion to arrow 6.0 #984 (alamb)
  • Implement Display for Expr, improve operator display #971 [sql] (matthewmturner)
  • Add metrics for FilterExec #960 (alamb)
  • Change compound column field name rules #952 (waynexia)
  • ObjectStore API to read from remote storage systems #950 (yjshen)
  • Add baseline metrics to SortPreservingMergeExec #948 (alamb)
  • Add support for TRIM LEADING/TRAILING/BOTH syntax #947 [sql] (adsharma)
  • fixes #933 replace placeholder fmt_as for ExecutionPlan impls #939 (tiphaineruy)
  • Add metrics for SortExec + HashAggregateExec #938 (alamb)
  • Add some additional asserts in utils::from_plan #930 (alamb)
  • Avro Table Provider #910 [sql] (Igosuki)
  • Add BaselineMetrics, Timestamp metrics, add for CoalescePartitionsExec, rename output_time -> elapsed_compute #909 (alamb)
  • add cross join support to ballista #891 (houqp)
  • Add Ballista support to DataFusion CLI #889 (andygrove)
  • support like on DictionaryArray #876 (b41sh)
  • Register table based on known schema without file IO #872 (Dandandan)
  • Add support for PostgreSQL regex match #870 [sql] (b41sh)
  • Include planning time in datafusion-cli printing #860 (Dandandan)
  • Implement basic common subexpression eliminate optimization #792 (waynexia)
  • Impl ops::Not for expr #763 (Jimexist)
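
The `percent_rank` (#1077) and `cume_dist` (#1076) window functions listed above follow the standard SQL definitions: `percent_rank = (rank - 1) / (rows - 1)` and `cume_dist = (rows up to and including the current row's peers) / rows`. The sketch below computes both for one already-sorted partition. The function name `percent_rank_and_cume_dist` and the `i64` key type are assumptions for illustration; this is not the DataFusion implementation.

```rust
/// Returns `(percent_rank, cume_dist)` for each row of a single partition.
/// `sorted` must already be ordered by the window's ORDER BY key.
fn percent_rank_and_cume_dist(sorted: &[i64]) -> Vec<(f64, f64)> {
    let n = sorted.len();
    let mut out = Vec::with_capacity(n);
    let mut start = 0;
    while start < n {
        // Find the exclusive end of the current group of peers (tied keys).
        let mut end = start + 1;
        while end < n && sorted[end] == sorted[start] {
            end += 1;
        }
        // rank is the 1-based index of the first peer, so rank - 1 == start.
        let percent_rank = if n > 1 {
            start as f64 / (n - 1) as f64
        } else {
            0.0
        };
        // All peers share the count of rows with key <= theirs.
        let cume_dist = end as f64 / n as f64;
        for _ in start..end {
            out.push((percent_rank, cume_dist));
        }
        start = end;
    }
    out
}

fn main() {
    let keys = [10, 20, 20, 30];
    for (key, (pr, cd)) in keys.iter().zip(percent_rank_and_cume_dist(&keys)) {
        println!("key={} percent_rank={} cume_dist={}", key, pr, cd);
    }
}
```

Tied rows receive the same values because both functions are defined over peer groups, not individual row positions.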

Fixed bugs:

  • Can not use between in the select list: #1196
  • ORDER BY does not work with literals: Sort operation is not applicable to scalar value 'foo' #1195
  • window functions with NULL literals in partition by and order by do not work: Internal("Sort operation is not applicable to scalar value NULL") #1194
  • Operation name not included in internal errors -- Internal("Data type Boolean not supported for binary operation on dyn arrays") #1157
  • Physical plan explain UNION query says "ExecutionPlan(PlaceHolder)" #933
  • Can not use LIKE on DictionaryArray encoded strings #815
  • physical_plan::repartition::tests::repartition_with_dropping_output_stream failing locally #614
  • Fix some BuiltinScalarFunction panics with zero arguments #1249 (capkurmagati)
  • fix: not do boolean folding on NULL and/or expr #1245 (NGA-TRAN)
  • ignore case of with header row in sql when creating external table #1237 [sql] (lichuan6)
  • fix: Min/Max aggregation data type should not be dictionary #1235 (NGA-TRAN)
  • Fix build with --no-default-features #1219 (alamb)
  • Prevent "future cannot be sent between threads safely" compilation error #1155 (jonmmease)
  • Clean up spawned task on drop for AnalyzeExec, CoalescePartitionsExec, HashAggregateExec #1121 (crepererum)
  • Clean up spawned task on SortStream drop #1105 (crepererum)
  • fix UNION ALL bug: thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', ./src/datatypes/schema.rs:165:10 #1088 (xudong963)
  • python: fix generated table name in dataframe creation #1078 (houqp)
  • fix subquery alias #1067 [sql] (xudong963)
  • fix pattern handling in regexp_match function #1065 (houqp)
  • fix: joins on Timestamp columns #1055 (francis-du)
  • Fix metric name typo #943 (alamb)
  • EXPLAIN ANALYZE should run all Optimizer passes #929 (alamb)
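
Boolean folding around NULL (#1245) needs care because SQL uses three-valued logic: `NULL AND false` is `false`, but `NULL AND true` is `NULL`, so an expression containing a NULL operand cannot be collapsed without looking at the other side. The sketch below models NULL as `None`; `sql_and` and `sql_or` are hypothetical names for this example and do not reflect how DataFusion implements the fix.

```rust
/// SQL three-valued AND; `None` stands for NULL.
fn sql_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // A definite false on either side decides the result.
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        // Otherwise at least one side is NULL and the result is unknown.
        _ => None,
    }
}

/// SQL three-valued OR; `None` stands for NULL.
fn sql_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        // A definite true on either side decides the result.
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

fn main() {
    let values = [Some(true), Some(false), None];
    for b in values.iter() {
        println!(
            "NULL AND {:?} = {:?}, NULL OR {:?} = {:?}",
            b,
            sql_and(None, *b),
            b,
            sql_or(None, *b)
        );
    }
}
```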

Performance improvements:

  • Improve avro reader performance by avoiding some cloning on avro_rs::Value #1206 (Igosuki)
  • optimize build profile for datafusion python binding, cli and ballista #1137 (houqp)
  • Avoid stack overflow by reducing stack usage of BinaryExpr::evaluate in debug builds #1047 (alamb)
  • Add ScalarValue::eq_array optimized comparison function #844 (alamb)
  • Rework GroupByHash for faster performance and support grouping by nulls #808 (alamb)

Closed issues:

  • InList expr with NULL literals do not work #1190
  • update the homepage README to include values, approx_distinct, etc. #1171
  • [Python]: Inconsistencies with Python package name #1011
  • Wanting to contribute to project where to start? #983
  • delete redundant code #973
  • How to build DataFusion python wheel #853
  • Add support for partition pruning #204
  • [Datafusion] Support joins on TimestampMillisecond columns #187
  • TPC-H Query 21 #173
  • TPC-H Query 13 #164
  • TPC-H Query 8 #162
  • implement split_part(string, delimiter, position) #157
  • Join Statement: Schema contains duplicate unqualified field name #155
  • ParquetTable should avoid scanning all files twice #136
  • Add support for reading partitioned Parquet files #133
  • Add support for Parquet schema merging #132
  • Catalog abstraction #126
  • Optimizer rules should work with qualified column names #125
  • Add optional qualifier to Expr::Column #121
  • Implement modulus expression #99
  • [Rust] Add constant folding to expressions during logically planning #98
  • [Rust] Implement pretty print for physical query plan #93
  • Can not group by boolean columns (add boolean to valid keys of groupBy) #91
  • improve performance of building literal arrays #90
  • [rust][datafusion] optimize count(*) queries on parquet sources #89
  • Produce a design for a metrics framework #21

Merged pull requests:

  • Add timezone string to stabilize test #1265 (viirya)
  • numerical_coercion pattern match optimize #1256 (Jimexist)
  • fix and update window function sql tests #1059 (Jimexist)
  • reduce ScalarValue from trait boilerplate with macro #989 (houqp)