docs/en/using_starrocks/skew_join_v2.md
Skew Join V2 is an advanced optimization feature in StarRocks that addresses data skew issues in JOIN operations by broadcasting skew values. This feature significantly improves query performance when dealing with heavily skewed data distributions.
Data skew occurs when certain values in join columns appear much more frequently than others, leading to uneven data distribution across nodes and causing performance bottlenecks. Skew Join V2 solves this problem by:
Skew Join V2 creates a hybrid execution plan that combines:
By default, the Skew Join V2 optimization is disabled, and its predecessor, Skew Join V1, is enabled. If you find that the performance of Skew Join V1 is not satisfactory, you must disable Skew Join V1 before enabling Skew Join V2.
-- Disable Skew Join V1
SET enable_optimize_skew_join_v1 = false;
-- Enable Skew Join V2
SET enable_optimize_skew_join_v2 = true;
After enabling Skew Join V2, you can then update your queries based on the syntax with explicitly specified skew values.
:::note Currently, Skew Join V2 does not support automatic plan rewrite based on statistics. Only hint-based manual SQL rewrite is supported. :::
Compared to the query rewrite method of Skew Join V1, Skew Join V2 offers a new syntax that allows you to explicitly specify the skew values with Broadcast.
Syntax:
SELECT select_list FROM
table1 JOIN [skew|table1.column(skew_value1, skew_value2, ...)] table2
ON join_condition
[WHERE where_clause]
Parameters:
[skew|table1.column(skew_value1, skew_value2, ...)]: The skew hint. It includes:
table1.column: The column from the left table that contains skew values.skew_value1, skew_value2, ...: A comma-separated list of values that cause data skew.:::note Do not omit the brackets in the skew hint. :::
Create test tables.
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL(10,2)
) DUPLICATE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;
CREATE TABLE customers (
customer_id INT,
customer_name VARCHAR(100),
city VARCHAR(50)
) DUPLICATE KEY(customer_id)
DISTRIBUTED BY HASH(customer_id) BUCKETS 8;
Insert sample data with skew.
INSERT INTO orders VALUES
(1, 1001, '2024-01-01', 100.00),
(2, 1001, '2024-01-02', 200.00),
(3, 1001, '2024-01-03', 150.00),
(4, 1002, '2024-01-01', 300.00),
(5, 1003, '2024-01-01', 250.00);
INSERT INTO customers VALUES
(1001, 'John Doe', 'New York'),
(1002, 'Jane Smith', 'Los Angeles'),
(1003, 'Bob Johnson', 'Chicago');
Query data with Skew Join V2.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN [skew|o.customer_id(1001)] c
ON o.customer_id = c.customer_id;
Skew Join V2 supports multiple skew values.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN [skew|o.customer_id(1001, 1002, 1003)] c
ON o.customer_id = c.customer_id;
Skew Join V2 supports various data types for skew values.
-- String values
SELECT t1.id, t2.name
FROM table1 t1
JOIN [skew|t1.category('electronics', 'clothing', 'books')] t2
ON t1.category = t2.category;
-- Date values
SELECT t1.id, t2.event_name
FROM events t1
JOIN [skew|t1.event_date('2024-01-01', '2024-01-02')] t2
ON t1.event_date = t2.event_date;
-- Numeric values
SELECT t1.id, t2.region
FROM sales t1
JOIN [skew|t1.region_id(1, 2, 3)] t2
ON t1.region_id = t2.region_id;
Skew Join V2 supports complex Join conditions.
-- Join condition with complex expressions
SELECT t1.id, t2.value
FROM table1 t1
JOIN [skew|t1.key(abs(t1.key))] t2
ON abs(t1.key) = abs(t2.key);
-- Multiple Join conditions
SELECT t1.id, t2.name
FROM table1 t1
JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category AND t1.region = t2.region;
Skew Join V2 supports various Join types.
-- Left Join
SELECT t1.id, t2.name
FROM table1 t1
LEFT JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
-- Left Semi Join
SELECT t1.id
FROM table1 t1
LEFT SEMI JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
-- Left Anti Join
SELECT t1.id
FROM table1 t1
LEFT ANTI JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
Before using Skew Join V2, identify the values that cause data skew:
-- Analyze data distribution
SELECT customer_id, COUNT(*) as cnt
FROM orders
GROUP BY customer_id
ORDER BY cnt DESC
LIMIT 10;
Follow these rules to choose the appropriate skew values:
skew hint will prevent the optimizer from reordering the Join. The Join will be executed in the order specified in the SQL, and the optimizer will not attempt to change the Join order or swap the left and right tables of the Join node containing the hint.Execute EXPLAIN VERBOSE against the query to collect its execution plan:
EXPLAIN VERBOSE
SELECT ... FROM ... JOIN [skew|...] ...;
Check the following fields in the plan:
SplitCastDataSink: Data splitting.BROADCAST and SHUFFLE: Distribution type.UNION: Union results of both Join types.