7.0.0 - Datafusion

7.0.0 (2022-02-14)

Breaking changes:

Consolidate various configuration options, remove unrelated batch_size #1565
Extract logical plans in LogicalPlan as independent struct #1228
Update ExecutionPlan to know about sortedness, and make the repartitioning optimizer pass respect the invariants #1776 (alamb)
Update to arrow 8.0.0 #1673 (alamb)
Remove non-idiomatic DataFusionError::into_arrow_external_error in favor of From conversion #1645 (alamb)
Remove Accumulator::update and Accumulator::merge #1582 (Jimexist)
implement Hash for various types and replace PartialOrd #1580 (Jimexist)
Replace DatafusionError with GenericError in ObjectStore interface #1541 (matthewmturner)
Make FLOAT SQL type map to Float32 rather than Float64 #1423 [sql] (liukun4515)
Map REAL SQL type to Float32 rather than Float64 to be consistent with pg #1390 [sql] (hntd187)

Implemented enhancements:

Create new datafusion_expr crate #1753
Create new datafusion_common crate #1752
API to get Expr's type and nullability without a DFSchema #1725
Cleaner API to create Expr::ScalarFunction programmatically #1718
Introduce a Vec<u8> based row-wise representation for DataFusion #1708
Simplify creating new ListingTable #1705
Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
Public Expr simplification API #1694
Query Optimizer: Add OUTER --> INNER join conversion #1670
Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
Remove DataFusionError::into_arrow_external_error in favor of From conversion #1644
Include join type in display implementation for logical plan #1620
Switch datafusion to using eq_dyn_scalar, etc kernels #1610
Proposal: Remove Accumulator::update and Accumulator::merge #1549
Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
Add approx_quantile support #1538
support sorting decimal data type #1522
Keep all datafusion's packages up to date with Dependabot #1472
ExecutionContext support init ExecutionContextState with new(state: Arc<Mutex<ExecutionContextState>>) method #1439
support the decimal scalar value #1393
Documentation for using scalar functions with the DataFrame API #1364
Support boolean == boolean and boolean != boolean operators #1159
Support DataType::Decimal(15, 2) in TPC-H benchmark #174
Make MemoryStream public #150
Add support for Parquet schema merging #132
Add SQL support for IN expression #118
Add logging to datafusion-cli #1789 (alamb)
Add approx_median() aggregate function #1729 (realno)
Add join type for logical plan display #1674 [sql] (xudong963)
Fix null comparison for Parquet pruning predicate #1595 (viirya)
Add corr aggregate function #1561 (realno)
Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
Add approx_quantile() aggregation function #1539 (domodwyer)
Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
Add stddev and variance #1525 (realno)
Add rem operation for Expr #1467 (liukun4515)
support decimal data type in create table #1431 [sql] (liukun4515)
Ordering by index in select expression #1419 [sql] (hntd187)
Add support for ORDER BY on unprojected columns #1415 (viirya)
Support decimal for min and max aggregate #1407 (liukun4515)
Consolidate ConstantFolding and SimplifyExpression #1375 (alamb)
Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
Implement array_agg aggregate function #1300 (viirya)
Add a command to switch output format in cli #1284 (capkurmagati)
Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163 (alamb)

Fixed bugs:

Unsupported data type in hasher: Timestamp(Second, None) #1768
SQL column identifiers should be converted to lowercase when unquoted #1746
Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
datafusion doesn't process predicate pushdown correctly when there is outer join #1586
casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1576
CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
ORDER BY min(x) results in error Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.") #1479
Sort discards field metadata on the output schema #1476
Datafusion should not strip out timezone information from existing types #1454
Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
Field names containing period such as f.c1 cannot be named in SQL query #1432
Select * returns an unexpected result #1412
Turn off unused default features of chrono and ahash #1398
real data type is float32 in PG database, but in the datafusion it is as float64 #1380
TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
ProjectionExec Loses Field Metadata #1361
Support Filter on unprojected columns #1351
NULLS ORDER is inconsistent with postgres #1343
Fix bug while merging RecordBatch, add SortPreservingMerge fuzz tester #1678 (alamb)
fix a cte block with same name for many times #1639 [sql] (xudong963)
fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
Fix SortExec discards field metadata on the output schema #1477 (alamb)
fix calculate in many_to_many_hash_partition test. #1463 (Ted-Jiang)
Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
Support identifiers with . in them #1449 [sql] (alamb)
Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
[Minor] Fix send_time metric for hash-repartition #1421 (Dandandan)
fix: Select * returns an unexpected result #1413 [sql] (xudong963)
Make cli handle multiple whitespaces #1388 (capkurmagati)
Metadata is kept in projections for non-derived columns #1378 (hntd187)
Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
Change the arg names and make parameters more meaningful #1357 (liukun4515)
collect table stats by default for listing table #1347 (houqp)
fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
Avoid changing expression names during constant folding #1319 (viirya)
improve error message for invalid create table statement #1294 [sql] (houqp)
Forbid creating the table with the same name #1288 (liukun4515)

Documentation updates:

Clarify docs about Accumulator::update and Accumulator::update_batch #1542 (alamb)
Fix duplicated cargo run --example parquet_sql #1482 (sergey-melnychuk)
add documentation to Datafusion cli's new commands #1348 (liukun4515)
fix some clippy warnings from nightly channel #1277 [sql] (Jimexist)

Performance improvements:

Parquet pruning predicate for IS NULL #1591
Fix predicate pushdown for outer joins #1618 (james727)
fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)

Closed issues:

Add release compile to CI #1728
DiskManager and TempFiles getting created several times per query #1690
Add a test for the pyarrow feature in CI #1635
SQL tests for when sorting exceeded available memory and had to spill to disk #1573
Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
Consolidate the SortExec code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
Track memory usage in Non Limited Operators #1569
[Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
Consolidate Projection for Schema and RecordBatch #1425
Support Sort on unprojected columns #1372
Unused code in hash_aggregate #1362
Why use the expr types before coercion to get the result type? #1358
A problem about the projection_push_down optimizer gathers valid columns #1312
apply constant folding to LogicalPlan::Values #1170
reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372
Why does DataFusion throw a Tokio 0.2 runtime error? #176
TPC-H Query 14 #165
Length kernel returns bytes not character length #156
Split the logical operators out into separate source files #115

Merged pull requests:

Fixup some doc warnings #1811 (alamb)
Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
Update CHANGELOG.md, update release scripts #1807 (alamb)
Update versions for split crates #1803 (matthewmturner)
Improve the error message and UX of tpch benchmark program #1800 (alamb)
rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
Update to sqlparser 0.14 #1796 [sql] (alamb)
[split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
Update datafusion versions #1793 (matthewmturner)
Less verbose plans in debug logging #1787 (alamb)
[split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
Introduce Row format backed by raw bytes #1782 (yjshen)
rewrite predicates before pushing to union inputs #1781 (korowa)
Update datafusion to use arrow 9.0.0 #1775 (alamb)
[split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
#1768 Support TimeUnit::Second in hasher #1769 (jychen7)
TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
[split/8] move Accumulator and ColumnarValue to datafusion-expr #1765 (Jimexist)
[split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
[split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
[split/9+12] move udf, udaf, Expr to datafusion-expr module #1762 [sql] (Jimexist)
[split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
[split/4] move scalar value to datafusion-common #1760 (Jimexist)
[split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
[split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
Use ordered-float 2.10 #1756 (andygrove)
[split/1] split datafusion-common module #1751 (Jimexist)
use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
Move more tests out of context.rs #1743 (alamb)
Move optimize test out of context.rs #1742 (alamb)
Fix typos in crate documentation #1739 (r4ntix)
add cargo check --release to ci #1737 (xudong963)
Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
Create built-in scalar functions programmatically #1734 (HaoYang670)
Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
API to get Expr's type and nullability without a DFSchema #1726 (alamb)
minor: fix cargo run --release error #1723 (xudong963)
substitute parking_lot::Mutex for std::sync::Mutex #1720 (xudong963)
Convert boolean case expressions to boolean logic #1719 (tustvold)
Add Expression Simplification API #1717 (alamb)
Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
make select_to_plan clearer #1714 [sql] (xudong963)
Add upper bound for public function signature #1713 (HaoYang670)
Add tests and CI for optional pyarrow module #1711 (wjones127)
Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
Fuzz test for spillable sort #1706 (yjshen)
Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700 (alamb)
Implement TableProvider for DataFrameImpl #1699 (cpcloud)
Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
Lazy TempDir creation in DiskManager #1695 (alamb)
Add MemTrackingMetrics to ease memory tracking for non-limited memory consumers #1691 (yjshen)
(minor) Reduce memory manager and disk manager logs from info! to debug! #1689 (alamb)
Make SortPreservingMergeStream stable on input stream order #1687 (alamb)
Incorporate dyn scalar kernels #1685 (matthewmturner)
Move information_schema tests out of execution/context.rs to sql_integration tests #1684 (alamb)
Add a new metric type: Gauge + CurrentMemoryUsage to metrics #1682 (yjshen)
refactor array_agg to not have update and merge #1681 (Jimexist)
Use NamedTempFile rather than String in DiskManager #1680 (alamb)
upgrade clap to version 3 #1672 (Jimexist)
Improve configuration and resource use of MemoryManager and DiskManager #1668 (alamb)
feat: Support quarter granularity in date_trunc function #1667 (ovr)
Fix cannot load parquet table from spark in datafusion-cli. #1665 (Ted-Jiang)
Make MemoryManager and MemoryStream public #1664 (yjshen)
[Cleanup] Move AggregatedMetricsSet to metrics for further reuse #1663 (yjshen)
fix: substr - correct behaviour with negative start pos #1660 (ovr)
support bitwise and as an example #1653 [sql] (liukun4515)
refine match pattern related code #1650 (xudong963)
update md-5, sha2, blake2 #1647 (xudong963)
Add DataFusionError -> ArrowError conversion #1643 (alamb)
Add spill_count and spilled_bytes to BaselineMetrics, test sort with spill #1641 (yjshen)
support hash decimal array and group by #1640 (liukun4515)
Consolidate Schema and RecordBatch projection #1638 (alamb)
Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
Optimize SortPreservingMergeStream to avoid SortKeyCursor sharing #1624 (yjshen)
Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
consolidate binary_expr coercion rule code into binary_rule.rs module #1607 (alamb)
Fix comparison of dictionary arrays #1606 (alamb)
add test for decimal to decimal #1603 (liukun4515)
update nightly version #1597 (Jimexist)
Consolidate sort and external_sort #1596 (yjshen)
support from_slice for binary, string, and boolean array types #1589 (Jimexist)
add from_slice trait to ease arrow2 migration #1588 (Jimexist)
Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
Rename sql integration tests from mod to sql_integration #1575 (alamb)
minor: improve the benchmark readme #1567 (xudong963)
Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562 (yjshen)
Update to rust 1.58 #1557 (xudong963)
support mathematics operation for decimal data type #1554 (liukun4515)
Address clippy warnings #1553 (sergey-melnychuk)
enhance arithmetic operation for array with scalar #1552 (liukun4515)
Remove unused update and merge implementations from Aggregates and supporting ScalarValue arithmetic #1550 (alamb)
Add batch operations to stddev #1547 (realno)
Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
Update to arrow-7.0.0 #1523 (alamb)
Fix ORDER BY on aggregate #1506 (viirya)
Add example on how to query multiple parquet files #1497 (nitisht)
Refactor testing modules #1491 (hntd187)
add rfcs for datafusion #1490 (xudong963)
support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
Minor: Rename predicate_builder --> pruning_predicate for consistency #1481 (alamb)
Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
Avoid sending empty batches for Hash partitioning. #1459 (Ted-Jiang)
Planner code cleanup #1450 [sql] (alamb)
Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
use 0.13 sql parser #1435 (Jimexist)
Minor Code cleanups #1428 (alamb)
Clarify communication on bi-weekly sync #1427 (alamb)
support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
Fix bugs with nullability during rewrites: Combine simplify and Simplifier #1401 (alamb)
Minimize features #1399 (carols10cents)
Update rust version to 1.57 #1395 [sql] (xudong963)
support decimal scalar value #1394 (liukun4515)
Add coercion rules for AggregateFunctions #1387 (liukun4515)
upgrade the arrow-rs version #1385 (liukun4515)
add array agg name #1382 (liukun4515)
Make tests for simplify and Simplifier consistent #1376 (alamb)
Refactor: Consolidate expression simplification code in simplify_expression.rs #1374 (alamb)
remove unused code in hash_aggregate #1370 (ic4y)
Use BufReader for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
Add unit test for constant folding on values #1355 (viirya)
Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
Moved aggr_test_schema to test_utils #1338 (rdettai)
upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
Update release instructions #1331 (alamb)
#1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
Extract EmptyRelation, Limit, Values from LogicalPlan #1325 (liukun4515)
Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
Add PSQL comparison tests for except, intersect #1292 (mrob95)
Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
Add statement helper command to cli #1285 (matthewmturner)
Python bindings for window functions #819 [sql] (jgoday)