doc/administration/housekeeping.md
{{< details >}}
{{< /details >}}
GitLab supports and automates housekeeping tasks in Git repositories to ensure that they can be served as efficiently as possible. Housekeeping tasks include:
[!warning] Do not manually execute Git commands to perform housekeeping in Git repositories that are controlled by GitLab. Doing so may lead to corrupt repositories and data loss.
Gitaly can perform housekeeping tasks in a Git repository in two ways:
The "eager" housekeeping strategy executes housekeeping tasks in a repository independent of the repository state. This is the default strategy as used by the manual trigger and the push-based trigger.
The eager housekeeping strategy is controlled by the GitLab application. Depending on the trigger that caused the housekeeping job to run, GitLab asks Gitaly to perform specific housekeeping tasks. Gitaly performs these tasks even if the repository is in an optimized state. As a result, this strategy can be inefficient in large repositories where performing the housekeeping tasks may be slow.
{{< history >}}

- `optimized_housekeeping`. Enabled by default.
- `optimized_housekeeping` removed.

{{< /history >}}
The heuristical (or "opportunistic") housekeeping strategy analyzes the repository's state and executes housekeeping tasks only when it finds one or more data structures are insufficiently optimized. This is the strategy used by scheduled housekeeping.
Heuristical housekeeping uses the following information to decide on the tasks it needs to run:
The decision whether any of the analyzed data structures need to be optimized is based on the size of the repository:
Gitaly does this to offset the fact that optimizing those data structures takes more time the bigger they get. It is especially important in large monorepos (which receive a lot of traffic) to avoid optimizing them too frequently.
You can change how often Gitaly is asked to optimize a repository.
There are different ways in which GitLab runs housekeeping tasks:
Administrators of repositories can manually trigger housekeeping tasks in a repository. In general, this is not required because GitLab runs housekeeping tasks automatically. The manual trigger can be useful when either:
To trigger housekeeping tasks manually:
This starts an asynchronous background worker for the project's repository. The background worker asks Gitaly to perform a number of optimizations.
Housekeeping also removes unreferenced LFS files from your project every 200 pushes, which frees up storage space for your project.
Unreachable objects are pruned as part of scheduled housekeeping. However, you can trigger manual pruning as well. Triggering housekeeping prunes unreachable objects with a grace period of two weeks. When you manually trigger the pruning of unreachable objects, the grace period is reduced to 30 minutes.
[!warning] Pruning unreachable objects does not guarantee the removal of leaked secrets and other sensitive information. For information on how to remove secrets that were committed but not pushed, see the remove a secret from your commits tutorial. Additionally, you can remove blobs individually. Refer to that documentation for possible consequences of performing that operation.
If a concurrent process (like `git push`) has created an object but hasn't yet created a reference to it, your repository can become corrupted if a reference to the object is added after the object is deleted. The grace period exists to reduce the likelihood of such race conditions. For example, a project that frequently receives pushes of many large objects over a slow connection carries a much higher pruning risk than a project that can be accessed only from inside a company network over a fast connection. Consider the project's usage profile before using this option, and run it during a quiet period.
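The effect of the two grace periods can be illustrated with a minimal sketch. The `prunable?` helper and the constant names are hypothetical and are not part of GitLab or Gitaly. The sketch only models the rule that an unreachable object becomes eligible for pruning once it is older than the grace period.

```ruby
# Grace periods stated in the text above, in seconds.
SCHEDULED_GRACE_PERIOD = 14 * 24 * 60 * 60 # two weeks
MANUAL_GRACE_PERIOD = 30 * 60              # 30 minutes

# Hypothetical helper, not a GitLab or Gitaly API: an unreachable object is
# eligible for pruning only once it is older than the grace period.
def prunable?(written_at, now:, grace_period:)
  now - written_at > grace_period
end

now = Time.now
# An object written one hour ago by a push that has not yet updated its reference.
written_at = now - 3600

puts prunable?(written_at, now: now, grace_period: SCHEDULED_GRACE_PERIOD) # prints false
puts prunable?(written_at, now: now, grace_period: MANUAL_GRACE_PERIOD)    # prints true
```

In this model, an object from a slow push that started an hour ago is still protected by the two-week grace period but not by the 30-minute one, which is why a quiet period matters for manual pruning.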
To trigger a manual prune of unreachable objects:
{{< details >}}
{{< /details >}}
While GitLab automatically performs housekeeping tasks based on the number of pushes, it does not maintain repositories that don't receive any pushes at all. As a result, dormant repositories or repositories that are only getting read requests may not benefit from improvements in the repository housekeeping strategy.
Administrators can enable a background job that performs housekeeping in all repositories at a customizable interval to remedy this situation. This background job processes all repositories hosted by a Gitaly node in a random order and eagerly performs housekeeping tasks on them. The Gitaly node stops processing repositories if the run takes longer than the configured duration.
Background maintenance of Git repositories is configured in Gitaly. By default, Gitaly performs background repository maintenance every day at 12:00 noon for a duration of 10 minutes.
You can change this default in Gitaly configuration.
For environments with Gitaly Cluster (Praefect), the scheduled housekeeping start time can be staggered across Gitaly nodes so the scheduled housekeeping is not running simultaneously on multiple nodes.
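One way to plan staggered start times is sketched below. The `staggered_start_times` helper, its parameters, and the node names `gitaly-1` to `gitaly-3` are hypothetical and only illustrate the arithmetic. The resulting `start_hour` and `start_minute` values would still have to be set in each node's own Gitaly configuration.

```ruby
MINUTES_PER_DAY = 24 * 60

# Hypothetical planning helper: each node starts `gap_minutes` after the
# previous one, wrapping around midnight when needed.
def staggered_start_times(node_names, first_hour:, first_minute: 0, gap_minutes: 60)
  node_names.each_with_index.map do |name, index|
    total = (first_hour * 60 + first_minute + index * gap_minutes) % MINUTES_PER_DAY
    [name, { start_hour: total / 60, start_minute: total % 60 }]
  end.to_h
end

schedule = staggered_start_times(%w[gitaly-1 gitaly-2 gitaly-3], first_hour: 23, gap_minutes: 60)
schedule.each do |node, start|
  puts format('%s: start_hour=%d start_minute=%d', node, start[:start_hour], start[:start_minute])
end
```

Choosing a gap that is at least as long as the configured duration keeps the runs on different nodes from overlapping.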
If a scheduled housekeeping run reaches the duration specified, the running tasks are
gracefully canceled. On subsequent scheduled housekeeping runs, Gitaly randomly shuffles
the repository list to process.
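The time-boxed, shuffled processing described above can be modeled with a short sketch. This is an illustrative model only and not Gitaly's implementation. The method name, its parameters, and the injectable `clock` are assumptions made for the example. As a simplification, the deadline is checked only between repositories, whereas Gitaly cancels running tasks when the duration is reached.

```ruby
# Illustrative model only; names and parameters are hypothetical.
def run_scheduled_housekeeping(repositories, max_duration:,
                               clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  deadline = clock.call + max_duration
  processed = []

  # A random order gives every repository a chance across runs,
  # even when a single run cannot cover all of them.
  repositories.shuffle.each do |repo|
    break if clock.call >= deadline

    yield repo
    processed << repo
  end

  processed
end

# Usage with a fake clock: each repository "takes" one time unit.
fake_time = 0
done = run_scheduled_housekeeping(%w[a b c d], max_duration: 2, clock: -> { fake_time }) do |_repo|
  fake_time += 1
end
puts done.size # prints 2
```

Because the order changes on every run, repositories that were skipped in one run are not systematically skipped in the next.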
The following snippet enables daily background repository maintenance starting at
23:00 for 1 hour for the default storage:
{{< tabs >}}
{{< tab title="Self-compiled (source)" >}}
```toml
[daily_maintenance]
start_hour = 23
start_minute = 0
duration = "1h"
storages = ["default"]
```
Use the following snippet to completely disable background repository maintenance:
```toml
[daily_maintenance]
disabled = true
```
{{< /tab >}}
{{< tab title="Linux package (Omnibus)" >}}
```ruby
gitaly['configuration'] = {
  daily_maintenance: {
    disabled: false,
    start_hour: 23,
    start_minute: 0,
    duration: '1h',
    storages: ['default'],
  },
}
```
Use the following snippet to completely disable background repository maintenance:
```ruby
gitaly['configuration'] = {
  daily_maintenance: {
    disabled: true,
  },
}
```
{{< /tab >}}
{{< /tabs >}}
When the scheduled housekeeping is executed, you can see the following entries in your Gitaly log:
```json
# When the scheduled housekeeping starts
{"level":"info","msg":"maintenance: daily scheduled","pid":197260,"scheduled":"2023-09-27T13:10:00+13:00","time":"2023-09-27T00:08:31.624Z"}

# When the scheduled housekeeping completes
{"actual_duration":321181874818,"error":null,"level":"info","max_duration":"1h0m0s","msg":"maintenance: daily completed","pid":197260,"time":"2023-09-27T00:15:21.182Z"}
```
The `actual_duration` (in nanoseconds) indicates how long the scheduled maintenance
took to execute. In the previous example, the scheduled housekeeping completed
in just over 5 minutes.
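The conversion from nanoseconds can be checked with a small script. This is a minimal sketch that reuses the completion entry shown above. The `maintenance_duration_seconds` helper is a hypothetical name, not part of GitLab or Gitaly.

```ruby
require 'json'

# Completion entry copied from the log example above.
LOG_LINE = '{"actual_duration":321181874818,"error":null,"level":"info",' \
           '"max_duration":"1h0m0s","msg":"maintenance: daily completed",' \
           '"pid":197260,"time":"2023-09-27T00:15:21.182Z"}'

NANOSECONDS_PER_SECOND = 1_000_000_000

# Hypothetical helper: returns the run duration in seconds, or nil if the
# line is not a "daily completed" entry.
def maintenance_duration_seconds(line)
  entry = JSON.parse(line)
  return nil unless entry['msg'] == 'maintenance: daily completed'

  entry['actual_duration'].to_f / NANOSECONDS_PER_SECOND
end

seconds = maintenance_duration_seconds(LOG_LINE)
minutes, remainder = seconds.divmod(60)
puts format('%dm %.1fs', minutes, remainder) # prints 5m 21.2s
```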
{{< details >}}
{{< /details >}}
Object pool repositories are used by GitLab to deduplicate objects across forks of a repository. When creating the first fork, we:
Any forks of this repository can now link against the object pool and thus only have to keep objects that diverge from the primary repository.
GitLab needs to perform special housekeeping operations in object pools:
These housekeeping operations are performed by the specialized
`FetchIntoObjectPool` RPC, which handles all of these special tasks while also
executing the regular housekeeping tasks we execute for standard Git
repositories.
Object pools are optimized automatically whenever the primary member is garbage collected. Therefore, the cadence can be configured using the same Git GC period in that project.
If you need to manually invoke the RPC from a Rails console,
you can call `project.pool_repository.object_pool.fetch`. This is a potentially
long-running task, though Gitaly times out after about 8 hours.