docs/en/faq/shared_data_faq.md
This topic provides answers to some frequently asked questions about shared-data clusters.
Check the BE log (be.INFO) to identify the exact cause. Common causes include:
- Incorrect storage volume configuration (for example, `aws_s3_path`, `endpoint`, or authentication settings).

Other errors:
Error message: "Error 1064 (HY000): Unexpected exception: Failed to create shards. INVALID_ARGUMENT: shard info cannot be empty"
Cause: This error is often caused when automatic bucket inference is used while no CN or BE nodes are alive. This issue is fixed in v3.2.
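On affected versions, a possible workaround is to specify the bucket count explicitly so that automatic bucket inference is not triggered. A minimal sketch (the table and column names below are hypothetical):

```sql
-- Hypothetical table; an explicit BUCKETS clause avoids automatic bucket inference.
CREATE TABLE example_tbl (
    id BIGINT,
    dt DATE
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 16;
```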
Excessive bucket numbers (especially in partitioned tables) cause StarRocks to create many tablets. The system needs to write a tablet metadata file for each tablet to the object storage, and the high latency of object storage can drastically increase total creation time. You may consider:
- Reducing the number of buckets.
- Tuning the BE configuration item `create_tablet_worker_count`.

StarRocks supports two DROP TABLE modes:
- `DROP TABLE xxx`: moves the table metadata to the FE recycle bin (data is not deleted).
- `DROP TABLE xxx FORCE`: immediately deletes the table metadata and data.

If cleanup fails, check:
- Whether `DROP TABLE xxx FORCE` was used.
- The configuration items `catalog_trash_expire_second` and `trash_file_expire_time_sec`.

Run the following command to get the storage path:
```sql
SHOW PROC '/dbs/<database_name>';
```
Example:

```sql
mysql> SHOW PROC '/dbs/load_benchmark';
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------+
| TableId | TableName | IndexNum | PartitionColumnName | PartitionNum | State | Type | LastConsistencyCheckTime | ReplicaCount | PartitionType | StoragePath |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------+
| 17152 | store_sales | 1 | NULL | 1 | NORMAL | CLOUD_NATIVE | NULL | 64 | UNPARTITIONED | s3://starrocks-common/xxxxxxxxx-xxxx_load_benchmark-1699408425544/5ce4ee2c-98ba-470c-afb3-8d0bf4795e48/17152 |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+--------------------------------------------------------------------------------------------------------------+
1 row in set (0.18 sec)
```
In versions earlier than v3.1.4, all data files of a table are stored together under a single directory.
From v3.1.4 onwards, data is organized by partition. The same command displays the table root path, which now contains subdirectories named after partition IDs, and each partition directory holds the subdirectories data/ (segment data files) and meta/ (tablet metadata files).
Common causes include:
- Improper `datacache.partition_duration` settings, causing caching failures.

You need to analyze the Query Profile first to identify the root cause.
In shared-data clusters, data is stored remotely, so Data Cache is crucial. If queries become unexpectedly slow, check Query Profile metrics such as `CompressedBytesReadRemote` and `IOTimeRemote`.
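A minimal sketch of collecting a profile for a slow query, assuming session-level profiling and the profile analysis statements are available in your StarRocks version:

```sql
-- Enable profiling for the current session, then run the slow query.
SET enable_profile = true;
-- List recently collected profiles and inspect one by its query ID.
SHOW PROFILELIST;
ANALYZE PROFILE FROM '<query_id>';
```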
Cache misses may be caused by:
- Improper `datacache.partition_duration` settings preventing caching.

Without adequate compaction, many historical data versions remain, increasing the number of segment files accessed during queries. This increases I/O and slows down queries.
You can diagnose insufficient compaction by:

- Checking the Compaction Score for relevant partitions (see the sketch after this list). The Compaction Score should remain below ~10; excessively high Compaction Scores often indicate compaction failures.
- Reviewing Query Profile metrics such as `SegmentsReadCount`. If segment counts are high, compaction may be lagging or stuck.
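A hedged example of checking per-partition Compaction Scores, assuming your version exposes them through `information_schema.partitions_meta` (the view and its `MAX_CS`/`AVG_CS` columns may differ across releases; verify against your version's documentation):

```sql
-- Inspect compaction scores for a table's partitions, highest first.
SELECT DB_NAME, TABLE_NAME, PARTITION_NAME, MAX_CS, AVG_CS
FROM information_schema.partitions_meta
WHERE TABLE_NAME = '<table_name>'
ORDER BY MAX_CS DESC;
```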
Tablets distribute data across Compute Nodes. Poor bucketing or skewed bucket keys can cause queries to run on only a subset of nodes.
Recommendations:
- Choose a bucket count roughly equal to total data size / (1–5 GB).

If `datacache.partition_duration` is set too small, data from “cold” partitions may not be cached, causing repeated remote reads. In the Query Profile, if `CompressedBytesReadRemote` or `IOCountRemote` is non-zero, this may be the reason. Tune `datacache.partition_duration` accordingly.
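For example, since `datacache.partition_duration` is a table property, you can widen the caching window on an existing table (the table name and the duration value below are illustrative):

```sql
-- Hypothetical table; allow data from partitions up to 1 month old to be cached.
ALTER TABLE example_tbl SET ("datacache.partition_duration" = "1 month");
```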
Check whether the Compute Nodes under the warehouse can access the object storage endpoint.
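To identify the nodes to check, you can list the compute nodes and their status first (a sketch; depending on your deployment you may need `SHOW BACKENDS` instead):

```sql
-- List CN nodes and their Alive status.
SHOW COMPUTE NODES;
```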
Obtain the visible version (the `VisibleVersion` column) by running:

```sql
SHOW PARTITIONS FROM <table_name>;
```
Execute the following statement to retrieve tablet metadata:
```sql
ADMIN EXECUTE ON <backend_id>
'System.print(StorageEngine.get_lake_tablet_metadata_json(<tablet_id>, <version>))';
```
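To fill in the placeholders, you can look up the tablet IDs and the node serving each tablet first (a sketch; the exact output columns vary by version):

```sql
-- List tablets of the table along with the ID of the node serving each tablet.
SHOW TABLET FROM <table_name>;
-- List node IDs to match against the node ID shown above.
SHOW COMPUTE NODES;
```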
StarRocks serializes transaction commits, so high ingestion rates may hit limits.
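A hedged way to see whether commits are queuing is to list the running transactions for the database (the proc path below takes the database ID, which you can find via `SHOW PROC '/dbs'`):

```sql
-- Find the database ID, then list its currently running transactions.
SHOW PROC '/dbs';
SHOW PROC '/transactions/<db_id>/running';
```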
Monitor the following aspects:
Key behaviors include:
To clean up stuck compaction tasks, follow these steps:
Check the version information of the partition, and compare `CompactVersion` and `VisibleVersion`.
```sql
SHOW PARTITIONS FROM <table_name>;
```
Check compaction task status.
```sql
SHOW PROC '/compactions';
SELECT * FROM information_schema.be_cloud_native_compactions WHERE TXN_ID = <TxnID>;
```
Cancel expired tasks.
a. Disable compaction and migration.
ADMIN SET FRONTEND CONFIG ("lake_compaction_max_tasks" = "0");
ADMIN SET FRONTEND CONFIG ('tablet_sched_disable_balance' = 'true');
ADMIN SHOW FRONTEND CONFIG LIKE 'lake_compaction_max_tasks';
ADMIN SHOW FRONTEND CONFIG LIKE 'tablet_sched_disable_balance';
b. Restart all BE nodes.
c. Verify that all compaction tasks have failed.
```sql
SHOW PROC '/compactions';
```
d. Re-enable compaction and migration.
ADMIN SET FRONTEND CONFIG ("lake_compaction_max_tasks" = "-1");
ADMIN SET FRONTEND CONFIG ('tablet_sched_disable_balance' = 'false');
ADMIN SHOW FRONTEND CONFIG LIKE 'lake_compaction_max_tasks';
ADMIN SHOW FRONTEND CONFIG LIKE 'tablet_sched_disable_balance';
If ingestion happens in one warehouse but compaction runs in another (for example, `default_warehouse`), compaction must pull data across warehouses with no cache available, slowing it down.
Solution:
- Set `lake_enable_vertical_compaction_fill_data_cache` to `true`.

Storage usage on object storage includes all historical versions, while `SHOW DATA` output only reflects the latest version.
However, if the difference is excessively large, it could be caused by the following issues:
You may consider tuning compaction or vacuum thread pools if necessary.
The query may be using a data version already compacted and vacuumed.
To resolve this, you can increase file retention by modifying the FE configuration `lake_autovacuum_grace_period_minutes`, and then retry the query.
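A hedged example of extending the retention window, assuming the item is mutable at runtime (the value is in minutes and purely illustrative; confirm the default and valid range for your version):

```sql
-- Keep files of older data versions for 60 minutes after they become unreferenced.
ADMIN SET FRONTEND CONFIG ("lake_autovacuum_grace_period_minutes" = "60");
```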
An excessive number of small files may cause performance degradation. Common causes include:
Compaction will eventually merge small files, but tuning bucket count and batching ingestion helps prevent performance degradation.