Fast analytics in StarRocks begin with a table layout that matches your query patterns. This guide distills hands‑on experience into clear rules for partitioning. Keep this advice close when modeling a new table or refactoring an old one: each section gives purpose‑driven criteria, design heuristics, and operational guard‑rails so you can avoid costly re‑partitioning down the road.
Understanding the distinction between partitioning and bucketing is fundamental when designing performant StarRocks tables. While both help manage large datasets, they serve different purposes:
| Aspect | Partitioning | Bucketing (Hash/Random) |
|---|---|---|
| Primary goal | Coarse‑grain data pruning and lifecycle control (TTL, archiving). | Fine‑grain parallelism and data locality inside each partition. |
| Planner visibility | Partitions are catalog objects; the FE can skip them via predicates. | Only equality predicates support bucket pruning. |
| Lifecycle ops | `DROP PARTITION` is metadata‑only, ideal for GDPR deletes and monthly roll‑off. | Buckets can't be dropped; they change only with `ALTER TABLE … MODIFY DISTRIBUTED BY`. |
| Typical count | 10^2–10^4 per table (days, weeks, tenants). | 10–120 per partition; `BUCKETS xxx` tunes this. |
| Skew handling | Merge or split partitions; consider a composite/hybrid scheme. | Raise the bucket count, hash on a compound key, isolate "whales", or use random bucketing. |
| Red flags | >100 k partitions can introduce a significant FE memory footprint. | >200 k tablets per BE; tablets exceeding 10 GB may hit compaction issues. |
| Table type | Partition? | Typical key |
|---|---|---|
| Fact / event stream | Yes | `date_trunc('day', event_time)` |
| Huge dimension (billions of rows) | Sometimes | Time or business‑key change date |
| Small dimension / lookup | No | Rely on hash distribution |
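As a sketch of the "no partitioning" row, a small lookup table can rely on hash distribution alone; the schema below is illustrative:

```sql
-- Small dimension: no PARTITION BY clause at all.
-- Hash distribution spreads rows across tablets for parallelism.
CREATE TABLE country_dim (
    country_code VARCHAR(2),
    country_name STRING
)
DUPLICATE KEY(country_code)
DISTRIBUTED BY HASH(country_code);
```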
Add `tenant_id` into the key when you need to manage data on a per‑tenant basis. `PARTITION BY tenant_id, date_trunc('day', dt)` prunes perfectly but creates #tenants × #days partitions; keep the total below ≈ 100 k, or FE memory and BE compaction will suffer. The granularity of `PARTITION BY date_trunc('day', dt)` should be adjusted to the use case: you can use `'hour'`, `'day'`, `'month'`, etc. See date_trunc.
| Granularity | Use when | Pros | Cons |
|---|---|---|---|
| Daily (default) | Most BI & reporting | Few partitions (365/yr); simple TTL | Less precise for "last 3 h" queries |
| Hourly | Daily ingest exceeds ~2 × the target tablet size; IoT bursts | Hot‑spot isolation; 24 partitions/day | 8,760 partitions/yr |
| Weekly / Monthly | Historical archive | Tiny metadata; easy merges | Coarser pruning |
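Switching granularity is just a different unit in the `date_trunc` partition expression; the fragments below are illustrative:

```sql
-- Hourly: isolates hot spots, but ~8,760 partitions per year.
PARTITION BY date_trunc('hour', event_time)

-- Monthly: tiny metadata, coarser pruning.
PARTITION BY date_trunc('month', event_time)
```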
```sql
CREATE TABLE click_stream (
    user_id BIGINT,
    event_time DATETIME,
    url STRING,
    ...
)
DUPLICATE KEY(user_id, event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS xxx;
```
Recommended for most SaaS workloads: it prunes on time and keeps each tenant's rows co‑located.
```sql
CREATE TABLE metrics (
    tenant_id INT,
    dt DATETIME,
    metric_name STRING,
    v DOUBLE
)
PRIMARY KEY(tenant_id, dt, metric_name)
PARTITION BY date_trunc('day', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS xxx;
```
When tenant‑specific DML/DDL is necessary, or a few very large tenants are present, partition by tenant as well, but be cautious of partition explosion.
```sql
CREATE TABLE activity (
    tenant_id INT,
    dt DATETIME,
    id BIGINT,
    ...
)
DUPLICATE KEY(dt, id)
PARTITION BY tenant_id, date_trunc('month', dt)
DISTRIBUTED BY HASH(id) BUCKETS xxx;
```
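With the tenant in the partition key, tenant‑level lifecycle operations on `activity` stay metadata‑only; the partition name below is hypothetical, since expression partitioning generates names automatically:

```sql
-- Remove one tenant's month of data (e.g. a GDPR delete)
-- without rewriting any other tenant's partitions.
ALTER TABLE activity DROP PARTITION p42_202401;
```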