docs/en/best_practices/query_tuning/schema_tuning.md
This document provides practical tips and best practices for optimizing query performance in StarRocks through effective schema design and foundational table choices. By understanding how different table types, keys, and distribution strategies impact query execution, you can significantly improve both speed and resource efficiency. Use these guidelines to make informed decisions when designing schemas, selecting table types, and tuning your StarRocks environment for high-performance analytics.
StarRocks supports four table types: Duplicate Key table, Aggregate table, Unique Key table, and Primary Key table. All of them are sorted by KEY.
AGGREGATE KEY: When records with the same AGGREGATE KEY is loaded into StarRocks, the old and new records are aggregated. Currently, Aggregate tables supports the following aggregate functions: SUM, MIN, MAX, and REPLACE. Aggregate tables support aggregating data in advance, facilitating business statements and multi-dimensional analyses.DUPLICATE KEY: You only need to specify the sort key for a DUPLICATE KEY table. Records with the same DUPLICATE KEY exist at the same time. It is suitable for analyses that do not involve aggregating data in advance.UNIQUE KEY: When records with the same UNIQUE KEY is loaded into StarRocks, the new record overwrites the old one. A UNIQUE KEY tables is similar to an Aggregate table with REPLACE function. Both are suitable for analyses involving constant updates.PRIMARY KEY: Primary Key tables guarantee the uniqueness of records, and allow you to perform realtime updating.CREATE TABLE site_visit
(
siteid INT,
city SMALLINT,
username VARCHAR(32),
pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, city, username)
DISTRIBUTED BY HASH(siteid);
CREATE TABLE session_data
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
browser CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid);
CREATE TABLE sales_order
(
orderid BIGINT,
status TINYINT,
username VARCHAR(32),
amount BIGINT DEFAULT '0'
)
UNIQUE KEY(orderid)
DISTRIBUTED BY HASH(orderid);
CREATE TABLE sales_order
(
orderid BIGINT,
status TINYINT,
username VARCHAR(32),
amount BIGINT DEFAULT '0'
)
PRIMARY KEY(orderid)
DISTRIBUTED BY HASH(orderid);
To speed up queries, tables with the same distribution can use a common bucketing column. In that case, data can be joined locally without being transferred across the cluster during the join operation.
CREATE TABLE colocate_table
(
visitorid SMALLINT,
sessionid BIGINT,
visittime DATETIME,
city CHAR(20),
province CHAR(20),
ip varchar(32),
browser CHAR(20),
url VARCHAR(1024)
)
DUPLICATE KEY(visitorid, sessionid)
DISTRIBUTED BY HASH(sessionid, visitorid)
PROPERTIES(
"colocate_with" = "group1"
);
For more information about colocate join and replica management, see Colocate join
StarRocks supports star schema, which is more flexible in modelling than flat tables. You can create a view to replace flat tables during modelling and then query data from multiple tables to accelerate queries.
Flat tables have the following drawbacks:
If you have high requirements on query concurrency or low latency, you can still use flat tables.
StarRocks supports two levels of partitioning: the first level is RANGE partition and the second level is HASH bucket.
RANGE partition: RANGE partition is used to divide data into different intervals (can be understood as dividing the original table into multiple sub-tables). Most users choose to set partitions by time, which has the following advantages:
HASH bucket: Divides data into different buckets according to the hash value.
StarRocks stores data in an ordered manner and builds sparse indexes at a granularity of 1024 rows.
StarRocks selects a fixed-length prefix (currently 36 bytes) in the schema as the sparse index.
When creating a table, it is recommended to place common filter fields at the beginning of the schema declaration. Fields with the highest differentiation and query frequency must be placed first.
A VARCHAR field must placed at the end of a sparse index because the index gets truncated from the VARCHAR field. If the VARCHAR field appears first, the index may be less than 36 bytes.
Use the above site_visit table as an example. The table has four columns: siteid, city, username, pv. The sort key contains three columns siteid, city, username, which occupy 4, 2, and 32 bytes respectively. So the prefix index (sparse index) can be the first 30 bytes of siteid + city + username.
In addition to sparse indexes, StarRocks also provides bloomfilter indexes, which are effective for filtering columns with high discrimination. If you want to place VARCHAR fields before other fields, you can create bloomfilter indexes.
StarRocks adopts Bitmap Indexing technology to support inverted indexes that can be applied to all columns of the Duplicate Key table and the key column of the Aggregate table and Unique Key table. Bitmap Index is suitable for columns with a small value range, such as gender, city, and province. As the range expands, the bitmap index expands in parallel.
A rollup is essentially a materialized index of the original table (base table). When creating a rollup, only some columns of the base table can be selected as the schema, and the order of the fields in the schema can be different from that of the base table. Below are some use cases of using a rollup:
Data aggregation in the base table is not high, because the base table has fields with high differentiation. In this case, you may consider selecting some columns to create rollups. Use the above site_visit table as an example:
site_visit(siteid, city, username, pv)
siteid may lead to poor data aggregation. If you need to frequently calculate PVs by city, you can create a rollup with only city and pv.
ALTER TABLE site_visit ADD ROLLUP rollup_city(city, pv);
The prefix index in the base table cannot be hit, because the way the base table is built cannot cover all the query patterns. In this case, you may consider creating a rollup to adjust the column order. Use the above session_data table as an example:
session_data(visitorid, sessionid, visittime, city, province, ip, browser, url)
If there are cases where you need to analyze visits by browser and province in addition to visitorid, you can create a separate rollup:
ALTER TABLE session_data
ADD ROLLUP rollup_browser(browser,province,ip,url)
DUPLICATE KEY(browser,province);
There are three ways to change schemas in StarRocks: sorted schema change, direct schema change, and linked schema change.
Sorted schema change: Change the sorting of a column and reorder the data. For example, deleting a column in a sorted schema leads to data reorder.
ALTER TABLE site_visit DROP COLUMN city;
Direct schema change: Transform the data instead of reordering it, for example, changing the column type or adding a column to a sparse index.
ALTER TABLE site_visit MODIFY COLUMN username varchar(64);
Linked schema change: Complete changes without transforming data, for example, adding columns.
ALTER TABLE site_visit ADD COLUMN click bigint SUM default '0';
It is recommended to choose an appropriate schema when you create tables to accelerate schema changes.