On Storage and limits management for groups, we want to facilitate a method for easily viewing the amount of storage consumed by a group, and allow easy management.
In GitLab, we update the project storage statistics through a callback every time the project is saved.
The summary of those statistics per namespace is then retrieved
by the `Namespaces#with_statistics` scope. Analyzing this query, we noticed that:

- It takes up to `1.2` seconds for namespaces with over `15k` projects.

Additionally, the pattern that is currently used to update the project statistics (the callback) doesn't scale adequately. It is currently one of the largest database transactions on production, the one that takes the most time overall. We can't add one more query to it, as doing so increases the transaction's length.
Because of all of the above, we can't apply the same pattern to store
and update the namespaces statistics, as the namespaces table is one
of the largest tables on GitLab.com. Therefore, we needed to find a performant
alternative method.
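To make the callback pattern described above concrete, here is a plain-Ruby sketch. All class and method names are hypothetical stand-ins; the real implementation lives in Rails model callbacks, not in plain Ruby:

```ruby
# Plain-Ruby sketch of the callback pattern described above.
# `Project` and `ProjectStatistics` are hypothetical stand-ins for
# the real Rails models and their `after_save` callbacks.
class ProjectStatistics
  attr_accessor :storage_size

  def initialize
    @storage_size = 0
  end
end

class Project
  attr_reader :statistics

  def initialize
    @statistics = ProjectStatistics.new
  end

  # Every save triggers a statistics refresh, analogous to an
  # ActiveRecord `after_save` callback. Because this runs inside the
  # same database transaction as the save, adding more queries here
  # lengthens that transaction, which is why the pattern does not
  # scale to namespace-level aggregation.
  def save(repository_size:)
    refresh_statistics(repository_size)
    true
  end

  private

  def refresh_statistics(repository_size)
    statistics.storage_size = repository_size
  end
end
```

The aggregation problem is that the per-namespace summary must then sum these per-project rows on every read, which is what `Namespaces#with_statistics` does.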
### Attempt A: PostgreSQL materialized view

The model could be updated through a refresh strategy based on a project routes SQL and a materialized view:

```sql
SELECT split_part("rs".path, '/', 1) AS root_path,
  COALESCE(SUM(ps.storage_size), 0) AS storage_size,
  COALESCE(SUM(ps.repository_size), 0) AS repository_size,
  COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
  COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
  COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
  COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
  COALESCE(SUM(ps.packages_size), 0) AS packages_size,
  COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
  COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
FROM "projects"
  INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
  INNER JOIN project_statistics ps ON ps.project_id = projects.id
GROUP BY root_path
```

We could then refresh the view with:

```sql
REFRESH MATERIALIZED VIEW root_namespace_storage_statistics;
```
While this implied a single-query update (and probably a fast one), it has some downsides:

- A materialized view can't be refreshed for a single namespace; the whole view must be recomputed on every refresh.
- `REFRESH MATERIALIZED VIEW` takes an exclusive lock on the view while it runs, blocking reads unless `CONCURRENTLY` is used.
### Attempt B: An update through a CTE

Similar to Attempt A: the model is updated through a refresh strategy, this time based on a Common Table Expression:
```sql
WITH refresh AS (
  SELECT split_part("rs".path, '/', 1) AS root_path,
    COALESCE(SUM(ps.storage_size), 0) AS storage_size,
    COALESCE(SUM(ps.repository_size), 0) AS repository_size,
    COALESCE(SUM(ps.wiki_size), 0) AS wiki_size,
    COALESCE(SUM(ps.lfs_objects_size), 0) AS lfs_objects_size,
    COALESCE(SUM(ps.build_artifacts_size), 0) AS build_artifacts_size,
    COALESCE(SUM(ps.pipeline_artifacts_size), 0) AS pipeline_artifacts_size,
    COALESCE(SUM(ps.packages_size), 0) AS packages_size,
    COALESCE(SUM(ps.snippets_size), 0) AS snippets_size,
    COALESCE(SUM(ps.uploads_size), 0) AS uploads_size
  FROM "projects"
    INNER JOIN routes rs ON rs.source_id = projects.id AND rs.source_type = 'Project'
    INNER JOIN project_statistics ps ON ps.project_id = projects.id
  GROUP BY root_path)
UPDATE namespace_storage_statistics
SET storage_size = refresh.storage_size,
  repository_size = refresh.repository_size,
  wiki_size = refresh.wiki_size,
  lfs_objects_size = refresh.lfs_objects_size,
  build_artifacts_size = refresh.build_artifacts_size,
  pipeline_artifacts_size = refresh.pipeline_artifacts_size,
  packages_size = refresh.packages_size,
  snippets_size = refresh.snippets_size,
  uploads_size = refresh.uploads_size
FROM refresh
  INNER JOIN routes rs ON rs.path = refresh.root_path AND rs.source_type = 'Namespace'
WHERE namespace_storage_statistics.namespace_id = rs.source_id
```
Same benefits and downsides as Attempt A.
### Attempt C: Get rid of the model and store the statistics in Redis

We could get rid of the model that stores the statistics in aggregated form and instead use a Redis Set. This would be the boring solution and the fastest one to implement, as GitLab already includes Redis as part of its architecture.

The downside of this approach is that Redis does not provide the same persistence and consistency guarantees as PostgreSQL, and this is information we can't afford to lose in a Redis failure.
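As a minimal sketch of this idea: keep one aggregated counter per root namespace, incremented on every project-statistics change. A plain Hash stands in for Redis below (real code would go through GitLab's Redis wrapper); the class and key names are hypothetical:

```ruby
# Sketch of the Redis-based idea from Attempt C: one aggregated
# counter per root namespace, bumped on every change. A plain Hash
# stands in for Redis here; `hincrby`/`hget` mirror the semantics of
# the Redis HINCRBY and HGET commands. All names are hypothetical.
class RedisLikeStore
  def initialize
    # key => { field => integer counter }
    @data = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  end

  # Equivalent of `HINCRBY key field delta`: atomic in real Redis,
  # a plain addition in this in-memory stand-in.
  def hincrby(key, field, delta)
    @data[key][field] += delta
  end

  def hget(key, field)
    @data[key][field]
  end
end

store = RedisLikeStore.new
# On each project update, bump the root namespace's counters:
store.hincrby("namespace:1:statistics", "storage_size", 500)
store.hincrby("namespace:1:statistics", "storage_size", 250)
```

The persistence problem is visible even in the sketch: the counters exist only in memory, so losing the store means losing the aggregates with no way to distinguish "zero" from "lost".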
### Attempt D: Tag the root namespace and its children

Directly relate the root namespace to its child namespaces, so whenever a namespace is created without a parent, it is tagged with the root namespace ID:
| ID | root ID | parent ID |
|---|---|---|
| 1 | 1 | NULL |
| 2 | 1 | 1 |
| 3 | 1 | 2 |
To aggregate the statistics inside a namespace, we'd execute something like:

```sql
SELECT COUNT(...)
FROM projects
WHERE namespace_id IN (
  SELECT id
  FROM namespaces
  WHERE root_id = X
)
```
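The same aggregation can be sketched in plain Ruby over in-memory rows (the data and helper below are hypothetical, with no database involved), showing why the `root_id` column turns a hierarchy walk into a flat filter:

```ruby
# In-memory sketch of the `root_id` aggregation above: every
# namespace row carries the ID of its root, so aggregating a whole
# hierarchy is a flat filter instead of a recursive traversal.
# The rows and the helper name are hypothetical.
namespaces = [
  { id: 1, root_id: 1, parent_id: nil },  # root namespace
  { id: 2, root_id: 1, parent_id: 1 },    # subgroup
  { id: 3, root_id: 1, parent_id: 2 },    # nested subgroup
  { id: 4, root_id: 4, parent_id: nil }   # an unrelated root
]

projects = [
  { namespace_id: 2, storage_size: 100 },
  { namespace_id: 3, storage_size: 200 },
  { namespace_id: 4, storage_size: 999 }
]

# Equivalent of:
#   WHERE namespace_id IN (SELECT id FROM namespaces WHERE root_id = X)
def total_storage(namespaces, projects, root_id)
  ids = namespaces.select { |n| n[:root_id] == root_id }.map { |n| n[:id] }
  projects
    .select { |p| ids.include?(p[:namespace_id]) }
    .sum { |p| p[:storage_size] }
end
```

No recursion over `parent_id` is needed: the `root_id` tag flattens the tree for this one query.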
Even though this approach would make aggregating much easier, it has some major downsides:

- We'd have to migrate the `namespaces` table to add and backfill the new column. Because of the size of the table, the update would take approximately 153h, see https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/29772.

### Attempt E: Update the namespaces statistics asynchronously

This approach consists of continuing to use the incremental statistics updates we already have, but refreshing them through Sidekiq jobs and in different transactions:
1. Create a second table (`namespace_aggregation_schedules`) with two columns: `id` and `namespace_id`.
1. Whenever the statistics of a project change, insert a row into `namespace_aggregation_schedules`:
   - We don't insert a new row if there's already one related to the root namespace.
   - Keeping in mind the length of the transaction that involves updating `project_statistics` (https://gitlab.com/gitlab-org/gitlab/-/issues/29070), the insertion should be done in a different transaction and through a Sidekiq job.
1. After inserting the row, schedule another worker to be executed asynchronously at two different moments:
   - One enqueued for immediate execution and another one scheduled in `1.5h`.
   - A `1.5h` lease is taken on Redis, on a key based on the root namespace ID, so the aggregation runs at most once every `1.5h` per root namespace.
   - The worker aggregates the statistics and deletes the row from `namespace_aggregation_schedules` after the update.
1. Another Sidekiq job traverses any remaining rows in the `namespace_aggregation_schedules` table and schedules jobs for every pending row.
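The scheduling flow above can be sketched in plain Ruby. The `Set` stands in for the `namespace_aggregation_schedules` table, the Hash for the Redis exclusive lease, and `flush!` for the Sidekiq aggregation worker; all names are hypothetical:

```ruby
require "set"

# In-memory sketch of Attempt E's scheduling flow. The Set stands in
# for the `namespace_aggregation_schedules` table, the Hash for the
# Redis exclusive lease, and `flush!` for the Sidekiq worker.
# All names are hypothetical.
class AggregationScheduler
  LEASE_TIMEOUT = 1.5 * 60 * 60 # 1.5h, in seconds

  def initialize
    @pending = Set.new # pending root namespace IDs (the "table")
    @leases  = {}      # root namespace ID => lease expiry time
  end

  # Step 2: insert a row only if none exists for this root namespace
  # (the Set deduplicates for us).
  def schedule(root_namespace_id)
    @pending.add(root_namespace_id)
  end

  # Step 3: the worker takes an exclusive lease, aggregates, then
  # deletes the row. Returns false if another job holds the lease,
  # which is what caps the refresh rate at once per LEASE_TIMEOUT.
  def flush!(root_namespace_id, now: Time.now)
    expiry = @leases[root_namespace_id]
    return false if expiry && expiry > now

    @leases[root_namespace_id] = now + LEASE_TIMEOUT
    # ...aggregate the statistics for the root namespace here...
    @pending.delete(root_namespace_id)
    true
  end

  def pending?(root_namespace_id)
    @pending.include?(root_namespace_id)
  end
end
```

The immediate-plus-delayed double scheduling in the real design exists so that a change is reflected quickly when no lease is held, while bursts of changes collapse into at most one aggregation per lease window.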
This implementation has the following benefits:

- It doesn't add any more queries to the transaction that updates `project_statistics`.

The only downside of this approach is that the namespaces' statistics are updated up to 1.5 hours after the change is done,
which means there's a time window in which the statistics are inaccurate. Because we're still not
enforcing storage limits, this is not a major problem.
## Conclusion

Updating the storage statistics asynchronously was the least problematic and most performant approach for aggregating the root namespaces.

All the details regarding this use case can be found on:

Performance of the namespace storage statistics was measured in staging and production (GitLab.com). All results were posted on https://gitlab.com/gitlab-org/gitlab-foss/-/issues/64092. No problems have been reported so far.