Back to Datafusion

7.0.0

dev/changelog/7.0.0.md

53.1.040.7 KB
Original Source
<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->

7.0.0 (2022-02-14)

Full Changelog

Breaking changes:

  • Consolidate various configurations options, remove unrelated batch_size #1565
  • Extract logical plans in LogicalPlan as independent struct #1228
  • Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
  • Update to arrow 8.0.0 #1673 (alamb)
  • Remove non idiomatic DataFusionError::into_arrow_external_error in favor of From conversion #1645 (alamb)
  • Remove Accumulator::update and Accumulator::merge #1582 (Jimexist)
  • implement Hash for various types and replace PartialOrd #1580 (Jimexist)
  • Replace DatafusionError with GenericError in ObjectStore interface #1541 (matthewmturner)
  • Make FLOAT SQL type map to Float32 rather than Float64 #1423 [sql] (liukun4515)
  • Map REAL SQL type to Float32 rather than Float64 to be consistent with pg #1390 [sql] (hntd187)

Implemented enhancements:

  • Create new datafusion_expr crate #1753
  • Create new datafusion_common crate #1752
  • API to get Expr's type and nullability without a DFSchema #1725
  • Cleaner API to create Expr::ScalarFunction programatically #1718
  • Introduce a Vec<u8> based row-wise representation for DataFusion #1708
  • Simplify creating new ListingTable #1705
  • Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
  • Public Expr simplification API #1694
  • Query Optimizer: Add OUTER --> INNER join conversion #1670
  • Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
  • Remove DataFusionError::into_arrow_external_error in favor of From conversion #1644
  • Include join type in display implementation for logical plan #1620
  • Switch datafusion to using eq_dyn_scalar, etc kernels #1610
  • Proposal: Remove Accumulator::update and Accumulator::merge #1549
  • Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
  • Add approx_quantile support #1538
  • support sorting decimal data type #1522
  • Keep all datafusion's packages up to date with Dependabot #1472
  • ExecutionContext support init ExecutionContextState with new(state: Arc<Mutex<ExecutionContextState>>) method #1439
  • support the decimal scalar value #1393
  • Documentation for using scalar functions with the DataFrame API #1364
  • Support boolean == boolean and boolean != boolean operators #1159
  • Support DataType::Decimal(15, 2) in TPC-H benchmark #174
  • Make MemoryStream public #150
  • Add support for Parquet schema merging #132
  • Add SQL support for IN expression #118
  • Add logging to datafusion-cli #1789 (alamb)
  • Add approx_median() aggregate function #1729 (realno)
  • Add join type for logical plan display #1674 [sql] (xudong963)
  • Fix null comparison for Parquet pruning predicate #1595 (viirya)
  • Add corr aggregate function #1561 (realno)
  • Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
  • Add approx_quantile() aggregation function #1539 (domodwyer)
  • Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
  • Add stddev and variance #1525 (realno)
  • Add rem operation for Expr #1467 (liukun4515)
  • support decimal data type in create table #1431 [sql] (liukun4515)
  • Ordering by index in select expression #1419 [sql] (hntd187)
  • Add support for ORDER BY on unprojected columns #1415 (viirya)
  • Support decimal for min and max aggregate #1407 (liukun4515)
  • Consolidate ConstantFolding and SimplifyExpression #1375 (alamb)
  • Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
  • Implement array_agg aggregate function #1300 (viirya)
  • Add a command to switch output format in cli #1284 (capkurmagati)
  • Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163 (alamb)

Fixed bugs:

  • Unsupported data type in hasher: Timestamp(Second, None) #1768
  • SQL column identifiers should be converted to lowercase when unquoted #1746
  • Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
  • datafusion doesn't process predicate pushdown correctly when there is outer join #1586
  • casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1576
  • CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
  • ORDER BY min(x) results in error Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.") #1479
  • Sort discards field metadata on the output schema #1476
  • Datafusion should not strip out timezone information from existing types #1454
  • Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
  • Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
  • Field names containing period such as f.c1 cannot be named in SQL query #1432
  • Select * returns an unexpected result #1412
  • Turn off unused default features of chrono and ahash #1398
  • real data type is float32 in PG database, but in the datafusion it is as float64 #1380
  • TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
  • ProjectionExec Loses Field Metadata #1361
  • Support Filter on unprojected columns #1351
  • NULLS ORDER is inconsistent with postgres #1343
  • Fix bug while merging RecordBatch, add SortPreservingMerge fuzz tester #1678 (alamb)
  • fix a cte block with same name for many times #1639 [sql] (xudong963)
  • fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
  • Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
  • Fix SortExec discards field metadata on the output schema #1477 (alamb)
  • fix calculate in many_to_many_hash_partition test. #1463 (Ted-Jiang)
  • Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
  • Support identifiers with . in them #1449 [sql] (alamb)
  • Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
  • [Minor] Fix send_time metric for hash-repartition #1421 (Dandandan)
  • fix: Select * returns an unexpected result #1413 [sql] (xudong963)
  • Make cli handle multiple whitespaces #1388 (capkurmagati)
  • Metadata is kept in projections for non-derived columns #1378 (hntd187)
  • Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
  • Change the arg names and make parameters more meaningful #1357 (liukun4515)
  • collect table stats by default for listing table #1347 (houqp)
  • fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
  • Avoid changing expression names during constant folding #1319 (viirya)
  • improve error message for invalid create table statement #1294 [sql] (houqp)
  • Forbid creating the table with the same name #1288 (liukun4515)

Documentation updates:

Performance improvements:

  • Parquet pruning predicate for IS NULL #1591
  • Fix predicate pushdown for outer joins #1618 (james727)
  • fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
  • Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
  • Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
  • Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)

Closed issues:

  • Add release compile to CI #1728
  • DiskManager and TempFiles getting created several times per query #1690
  • Add a test for the pyarrow feature in CI #1635
  • SQL tests for when sorting exceeded available memory and had to spill to disk #1573
  • Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
  • Consolidate the SortExec code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
  • Track memory usage in Non Limited Operators #1569
  • [Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
  • Consolidate Projection for Schema and RecordBatch #1425
  • Support Sort on unprojected columns #1372
  • Unused code in hash_aggregate #1362
  • Why use the expr types before coercion to get the result type? #1358
  • A problem about the projection_push_down optimizer gathers valid columns #1312
  • apply constant folding to LogicalPlan::Values #1170
  • reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372
  • Why does DataFusion throw a Tokio 0.2 runtime error? #176
  • TPC-H Query 14 #165
  • Length kernel returns bytes not character length #156
  • Split the logical operators out into separate source files #115

Merged pull requests:

  • Fixup some doc warnings #1811 (alamb)
  • Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
  • Update CHANGELOG.md, update release scripts #1807 (alamb)
  • Update versions for split crates #1803 (matthewmturner)
  • Improve the error message and UX of tpch benchmark program #1800 (alamb)
  • rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
  • Update to sqlparser 0.14 #1796 [sql] (alamb)
  • [split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
  • Update datafusion versions #1793 (matthewmturner)
  • Less verbose plans in debug logging #1787 (alamb)
  • [split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
  • Introduce Row format backed by raw bytes #1782 (yjshen)
  • rewrite predicates before pushing to union inputs #1781 (korowa)
  • Update datafusion to use arrow 9.0.0 #1775 (alamb)
  • [split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
  • #1768 Support TimeUnit::Second in hasher #1769 (jychen7)
  • TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
  • [split/8] move Accumulator and ColumnarValue to datafusion-expr #1765 (Jimexist)
  • [split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
  • [split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
  • [split/9+12] move udf, udaf, Expr to datafusion-expr module #1762 [sql] (Jimexist)
  • [split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
  • [split/4] move scalar value to datafusion-common #1760 (Jimexist)
  • [split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
  • [split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
  • Use ordered-float 2.10 #1756 (andygrove)
  • [split/1] split datafusion-common module #1751 (Jimexist)
  • use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
  • fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
  • Move more tests out of context.rs #1743 (alamb)
  • Move optimize test out of context.rs #1742 (alamb)
  • Fix typos in crate documentation #1739 (r4ntix)
  • add cargo check --release to ci #1737 (xudong963)
  • Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
  • Create built-in scalar functions programmatically #1734 (HaoYang670)
  • Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
  • API to get Expr's type and nullability without a DFSchema #1726 (alamb)
  • minor: fix cargo run --release error #1723 (xudong963)
  • substitute parking_lot::Mutex for std::sync::Mutex #1720 (xudong963)
  • Convert boolean case expressions to boolean logic #1719 (tustvold)
  • Add Expression Simplification API #1717 (alamb)
  • Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
  • make select_to_plan clearer #1714 [sql] (xudong963)
  • Add upper bound for public function signature #1713 (HaoYang670)
  • Add tests and CI for optional pyarrow module #1711 (wjones127)
  • Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
  • refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
  • Fuzz test for spillable sort #1706 (yjshen)
  • Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700 (alamb)
  • Implement TableProvider for DataFrameImpl #1699 (cpcloud)
  • Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
  • Lazy TempDir creation in DiskManager #1695 (alamb)
  • Add MemTrackingMetrics to ease memory tracking for non-limited memory consumers #1691 (yjshen)
  • (minor) Reduce memory manager and disk manager logs from info! to debug! #1689 (alamb)
  • Make SortPreservingMergeStream stable on input stream order #1687 (alamb)
  • Incorporate dyn scalar kernels #1685 (matthewmturner)
  • Move information_schema tests out of execution/context.rs to sql_integration tests #1684 (alamb)
  • Add a new metric type: Gauge + CurrentMemoryUsage to metrics #1682 (yjshen)
  • refactor array_agg to not to have update and merge #1681 (Jimexist)
  • Use NamedTempFile rather than String in DiskManager #1680 (alamb)
  • upgrade clap to version 3 #1672 (Jimexist)
  • Improve configuration and resource use of MemoryManager and DiskManager #1668 (alamb)
  • feat: Support quarter granularity in date_trunc function #1667 (ovr)
  • Fix can not load parquet table form spark in datafusion-cli. #1665 (Ted-Jiang)
  • Make MemoryManager and MemoryStream public #1664 (yjshen)
  • [Cleanup] Move AggregatedMetricsSet to metrics for further reuse #1663 (yjshen)
  • fix: substr - correct behaivour with negative start pos #1660 (ovr)
  • suppport bitwise and as an example #1653 [sql] (liukun4515)
  • refine match pattern related code #1650 (xudong963)
  • update md-5, sha2, blake2 #1647 (xudong963)
  • Add DataFusionError -> ArrowError conversion #1643 (alamb)
  • Add spill_count and spilled_bytes to BaselineMetrics, test sort with spill #1641 (yjshen)
  • support hash decimal array and group by #1640 (liukun4515)
  • Consolidate Schema and RecordBatch projection #1638 (alamb)
  • Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
  • Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
  • Optimize SortPreservingMergeStream to avoid SortKeyCursor sharing #1624 (yjshen)
  • Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
  • feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
  • feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
  • consolidate binary_expr coercion rule code into binary_rule.rs module #1607 (alamb)
  • Fix comparison of dictionary arrays #1606 (alamb)
  • add test for decimal to decimal #1603 (liukun4515)
  • update nightly version #1597 (Jimexist)
  • Consolidate sort and external_sort #1596 (yjshen)
  • support from_slice for binary, string, and boolean array types #1589 (Jimexist)
  • add from_slice trait to ease arrow2 migration #1588 (Jimexist)
  • Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
  • Rename sql integration tests from mod to sql_integration #1575 (alamb)
  • minor: improve the benchmark readme #1567 (xudong963)
  • Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562 (yjshen)
  • Update to rust 1.58 #1557 (xudong963)
  • support mathematics operation for decimal data type #1554 (liukun4515)
  • Address clippy warnings #1553 (sergey-melnychuk)
  • enhance arithmetic operation for array with scalar #1552 (liukun4515)
  • Remove unused update and merge implementations from Aggregates and supporting ScalarValue arithmetic #1550 (alamb)
  • Add batch operations to stddev #1547 (realno)
  • Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
  • Update to arrow-7.0.0 #1523 (alamb)
  • Fix ORDER BY on aggregate #1506 (viirya)
  • Add example on how to query multiple parquet files #1497 (nitisht)
  • Refactor testing modules #1491 (hntd187)
  • add rfcs for datafusion #1490 (xudong963)
  • support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
  • Minor: Rename predicate_builder --> pruning_predicate for consistency #1481 (alamb)
  • Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
  • Avoid send empty batches for Hash partitioning. #1459 (Ted-Jiang)
  • Planner code cleanup #1450 [sql] (alamb)
  • Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
  • Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
  • support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
  • Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
  • use 0.13 sql parser #1435 (Jimexist)
  • Minor Code cleanups #1428 (alamb)
  • Clarify communication on bi-weekly sync #1427 (alamb)
  • support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
  • Fix bugs with nullability during rewrites: Combine simplify and Simplifier #1401 (alamb)
  • Minimize features #1399 (carols10cents)
  • Update rust vesion to 1.57 #1395 [sql] (xudong963)
  • support decimal scalar value #1394 (liukun4515)
  • Add coercion rules for AggregateFunctions #1387 (liukun4515)
  • upgrade the arrow-rs version #1385 (liukun4515)
  • add array agg name #1382 (liukun4515)
  • Make tests for simplify and Simplifer consistent #1376 (alamb)
  • Refactor: Consolidate expression simplification code in simplify_expression.rs #1374 (alamb)
  • remove unused code in hash_aggregate #1370 (ic4y)
  • Use BufReader for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
  • Add unit test for constant folding on values #1355 (viirya)
  • Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
  • Moved aggr_test_schema to test_utils #1338 (rdettai)
  • upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
  • Update release instructions #1331 (alamb)
  • #1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
  • Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
  • Extract EmptyRelation, Limit, Values from LogicalPlan #1325 (liukun4515)
  • Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
  • Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
  • Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
  • Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
  • Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
  • Add PSQL comparison tests for except, intersect #1292 (mrob95)
  • Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
  • Add statement helper command to cli #1285 (matthewmturner)
  • Python bindings for window functions #819 [sql] (jgoday)