docs/rfcs/2025-07-23-global-gc-worker.md
This RFC proposes the integration of a garbage collection (GC) mechanism within the Compaction process. This mechanism aims to manage and remove stale files that are no longer actively used by any system component, thereby reclaiming storage space.
With the introduction of features such as table repartitioning, a substantial number of Parquet files can become obsolete. Furthermore, failures during manifest updates may result in orphaned files that are never referenced by the system. Therefore, a periodic garbage collection mechanism is essential to reclaim storage space by systematically removing these unused files.
The garbage collection process will be integrated directly into the Compaction process. Upon the completion of a Compaction for a given region, the GC worker will be automatically triggered. Its primary function will be to identify and subsequently delete obsolete files that have persisted beyond their designated retention period. This integration ensures that garbage collection is performed in close conjunction with data lifecycle management, effectively leveraging the compaction process's inherent knowledge of file states.
This design prioritizes correctness and safety by explicitly linking GC execution to a well-defined operational boundary: the successful completion of a compaction cycle.
The GC worker operates as an integral part of the Compaction process. Once a Compaction for a specific region is completed, the GC worker is automatically triggered. Running it on a datanode is preferred because it avoids having to configure object storage access in the metasrv.
The detailed process is as follows:
The following flowchart illustrates the GC worker's process:
flowchart TD
A[Compaction Completed] --> B[Trigger GC Worker]
B --> C[Scan Region Manifest]
C --> D[Identify File Types]
D --> E[Unused Files<br/>Never recorded in manifest]
D --> F[Obsolete Files<br/>Previously in manifest<br/>but marked for removal]
E --> G[Check Lingering Time]
F --> G
G --> H{File exceeds<br/>configured lingering time?}
H -->|No| I[Skip deletion]
H -->|Yes| J[Check Temporary Manifest]
J --> K{File in use by<br/>active queries?}
K -->|Yes| L[Retain file<br/>Wait for next GC cycle]
K -->|No| M[Safely delete file]
I --> N[End GC cycle]
L --> N
M --> O[Update Manifest]
O --> N
N --> P[Wait for next Compaction]
P --> A
style A fill:#e1f5fe
style B fill:#f3e5f5
style M fill:#e8f5e8
style L fill:#fff3e0
An obsolete file is permanently deleted only if two conditions are met:
Because the GC worker runs as part of the Compaction process, the risk of accidentally deleting newly created SST files that have not yet been recorded in the manifest is significantly reduced. As a result, "Unused Files" no longer need special handling as a category prone to accidental deletion. Files that are genuinely unused (i.e., never referenced by any manifest, including temporary ones) can be safely deleted after a configurable maximum lingering time.
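The deletion decision shown in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `FileKind`, `Candidate`, `should_delete`, the per-kind `lingering_secs` mapping, and the `protected_paths` set (files listed in live temporary manifests) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class FileKind(Enum):
    UNUSED = "unused"      # never recorded in the region manifest
    OBSOLETE = "obsolete"  # previously in the manifest, later marked for removal


@dataclass(frozen=True)
class Candidate:
    path: str
    kind: FileKind
    # Assumed reference point (seconds) from which the lingering time is
    # measured, e.g. removal time for obsolete files.
    since: float


def should_delete(candidate, now, lingering_secs, protected_paths):
    """Return True only if both flowchart checks pass.

    1. The file has exceeded the lingering time configured for its kind.
    2. The file is not referenced by any live temporary manifest.
    """
    if now - candidate.since <= lingering_secs[candidate.kind]:
        return False  # skip deletion in this GC cycle
    return candidate.path not in protected_paths
```

A file that fails either check is simply left in place and reconsidered in the next GC cycle, which matches the "Skip deletion" and "Retain file" branches of the flowchart.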
For debugging and auditing purposes, a comprehensive list of recently deleted files can be maintained.
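One simple way to keep such a list is a bounded in-memory log that drops the oldest entries first. The sketch below is only an assumption about how this could look; `DeletionLog` and its `capacity` parameter are hypothetical and not part of the proposal.

```python
from collections import deque


class DeletionLog:
    """Keeps the most recent `capacity` deletions for debugging and auditing."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry once it is full.
        self._entries = deque(maxlen=capacity)

    def record(self, path, deleted_at):
        self._entries.append((path, deleted_at))

    def recent(self):
        # Oldest retained entry first, newest last.
        return list(self._entries)
```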
To prevent the GC worker from inadvertently deleting files that are actively being utilized by long-running analytical queries, a robust protection mechanism is introduced. This mechanism relies on temporary manifests that are actively kept "alive" by the queries using them.
When a long-running query is detected (e.g., by a slow query recorder), it will write a temporary manifest to the region's manifest directory. This manifest lists all files required for the query. However, simply creating this file is not enough, as a query runner might crash, leaving the temporary manifest orphaned and preventing garbage collection indefinitely.
To address this, the following "heartbeat" mechanism is implemented:
This approach ensures that only files for genuinely active queries are protected. The lifecycle of the temporary manifest is managed dynamically: it is created when a long query starts, kept alive through periodic updates, and is either deleted by the query upon normal completion or automatically cleaned up by the GC worker if the query terminates unexpectedly.
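The lifecycle described above can be illustrated with a small sketch. It assumes that each temporary manifest carries a last-heartbeat timestamp and that the GC worker treats a manifest as expired once that timestamp is older than some timeout; `TempManifest`, `touch`, `sweep`, and `ttl_secs` are hypothetical names, and the actual expiry rule is not specified in this RFC.

```python
from dataclasses import dataclass


@dataclass
class TempManifest:
    query_id: str
    files: frozenset       # files the long-running query still needs
    last_heartbeat: float  # seconds; refreshed periodically by the query


def touch(manifest, now):
    """Called by a running query to keep its temporary manifest alive."""
    manifest.last_heartbeat = now


def sweep(manifests, now, ttl_secs):
    """Split manifests into live ones and the set of files they protect.

    Manifests whose heartbeat is older than `ttl_secs` are treated as
    orphaned (e.g. the query crashed) and are dropped from the result.
    """
    live = [m for m in manifests if now - m.last_heartbeat <= ttl_secs]
    protected = set()
    for m in live:
        protected |= m.files
    return live, protected
```

With this shape, a crashed query stops refreshing its manifest, the manifest ages past the timeout, and its files become eligible for deletion in a later GC cycle.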
This mechanism may be too complex to implement at once. We can consider a two-phased approach:
One potential race condition with region migration is illustrated below:
sequenceDiagram
participant gc_worker as GC Worker(same dn as region 1)
participant region1 as Region 1 (Leader → Follower)
participant region2 as Region 2 (Follower → Leader)
participant region_dir as Region Directory
gc_worker->>region1: Start GC, get region manifest
activate region1
region1-->>gc_worker: Region 1 manifest
deactivate region1
gc_worker->>region_dir: Scan region directory
Note over region1,region2: Region Migration Occurs
region1-->>region2: Downgrade to Follower
region2-->>region1: Becomes Leader
region2->>region_dir: Add new file
gc_worker->>region_dir: Continue scanning
gc_worker-->>region_dir: Discovers new file
Note over gc_worker: New file not in Region 1's manifest
gc_worker->>gc_worker: Mark file as orphan(incorrectly)
This could cause the GC worker to incorrectly mark the new file as an orphan and delete it if the configured lingering time for orphan files (files not recorded anywhere, as either in use or unused) is not long enough.
A good enough solution could be to use a lock that prevents the GC worker from running on a region while that region is being migrated, and vice versa.
The race condition between the GC worker and repartition also needs to be considered carefully. For now, holding a lock against both region migration and repartition for the duration of the GC run could be a simple solution.
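The mutual exclusion described above might look like the following sketch. It is an in-process illustration only: a real deployment would need a lock visible to every node involved, and `RegionLocks`, `try_acquire`, `release`, and `hold` are hypothetical names rather than an existing API.

```python
import threading
from contextlib import contextmanager


class RegionLocks:
    """At most one of GC, region migration, or repartition per region."""

    def __init__(self):
        self._guard = threading.Lock()
        self._holders = {}  # region_id -> name of the operation holding it

    def try_acquire(self, region_id, op):
        with self._guard:
            if region_id in self._holders:
                return False  # another operation is active on this region
            self._holders[region_id] = op
            return True

    def release(self, region_id, op):
        with self._guard:
            # Only the current holder may release the region.
            if self._holders.get(region_id) == op:
                del self._holders[region_id]

    @contextmanager
    def hold(self, region_id, op):
        # Yields True if the lock was taken, False if the caller must skip.
        acquired = self.try_acquire(region_id, op)
        try:
            yield acquired
        finally:
            if acquired:
                self.release(region_id, op)
```

A GC run would wrap its work in `with locks.hold(region_id, "gc") as ok:` and skip the region when `ok` is false, leaving the files for the next cycle instead of blocking the migration.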
This section summarizes the key aspects and trade-offs of the proposed integrated GC worker, highlighting its advantages and potential challenges.
| Aspect | Current Proposal (Integrated GC) |
|---|---|
| Implementation Complexity | Medium. Requires careful integration with the compaction process and the slow query recorder for temporary manifest management. |
| Reliability | High. Integration with compaction and leveraging temporary manifests from long-running queries significantly mitigates the risk of incorrect deletion. Accurate management of lingering times for obsolete files and prevention of accidental deletion of newly created SSTs enhance data safety. |
| Performance Overhead | Low to Medium. The GC worker runs post-compaction, minimizing direct impact on write paths. Overhead from temporary manifest management by the slow query recorder is expected to be acceptable for long-running queries. |
| Impact on Other Components | Moderate. Requires modifications to the compaction process to trigger GC and the slow query recorder to manage temporary manifests. This introduces some coupling but enhances overall data safety. |
| Deletion Strategy | State- and Time-Based. Obsolete files are deleted based on a configurable lingering time, which is paused if the file is referenced by a temporary manifest. Unused files (never in a manifest) are also subject to a lingering time. |
This section outlines key areas requiring further discussion and defines potential avenues for future development.
Instead of integrating the GC worker directly into the Compaction process, a standalone GC service could be implemented. This service would operate independently, periodically scanning the storage for obsolete and unused files based on manifest information and predefined retention policies.
Pros:
Cons:
This alternative could be implemented in the future if the integrated GC worker proves insufficient or if there is a need for more advanced GC strategies.
This alternative would involve immediate deletion of files once they are removed from the manifest, without a lingering time.
Pros:
Cons: