doc/development/database_review.md
This page is specific to database reviews. Refer to our code review guide for broader advice and best practices for code review in general.
A database review is required for:

- Changes that touch the database schema or perform data migrations, including files in:
  - `db/`
  - `lib/gitlab/background_migration/`
  - `lib/gitlab/database/`
- Metrics that use `count`, `distinct_count`, `estimate_batch_distinct_count`, and `sum`.
  These metrics could have complex queries over large tables.
  See the Analytics Instrumentation Guide for implementation details.
- Changes that use the `update`, `upsert`, `delete`, `update_all`, `upsert_all`, `delete_all`, or `destroy_all`
  methods on an ActiveRecord object.

A database reviewer is expected to look out for overly complex queries in the change and review those more closely. If the author does not point out specific queries for review and there are no overly complex queries, it is enough to concentrate on reviewing the migration only.
You must provide the following artifacts when you request a ~database review. If your merge request description does not include these items, the review is reassigned back to the author.
If new migrations are introduced, database reviewers must review the output of both migrating (db:migrate)
and rolling back (db:rollback) for all migrations.
We have automated tooling for
GitLab (provided by the
db:check-migrations pipeline job) that provides this output in the CI job logs.
It is not required for the author to provide this output in the merge request description,
but doing so may be helpful for reviewers. The bot also checks that migrations are correctly
reversible.
If new queries have been introduced or existing queries have been updated, you are required to provide:

- The raw SQL for each changed or added query.
- The query plan for each raw SQL query.

Refer to Preparation when adding or modifying queries for how to provide this information.
A merge request author's role is to:
- Apply the ~database label.

A database reviewer's role is to:
A database maintainer's role is to:
Review workload is distributed using reviewer roulette (example). The MR author should request a review from the suggested database reviewer. When they sign off, they hand over to the suggested database maintainer.
If reviewer roulette did not suggest a database reviewer and maintainer,
make sure you have applied the ~database label and rerun the
danger-review CI job, or pick someone from the
@gl-database team.
To make reviewing easier and therefore faster, take the following preparations into account.
- Ensure `db/structure.sql` is updated as documented, and additionally ensure that the relevant version files under
  `db/schema_migrations` were added or removed.
- Make migrations reversible by using the `change` method, or include a `down` method when using `up`.
- Check that the `db:check-migrations` pipeline job has run successfully and the migration rollback behaves as expected.
- Check that the `db:check-schema` job has run successfully and no unexpected schema changes are introduced in a rollback. This job may only trigger a warning if the schema was changed.
- Add tests for the migration in `spec/migrations` if necessary. See Testing Rails migrations at GitLab for more details.
- When adding an index, test its execution using `CREATE INDEX CONCURRENTLY` in Database Lab and add the execution time to the MR description. If the execution takes longer than
  10 minutes, the index should be moved to a post-migration.
  Keep in mind that in this case you may need to split the migration and the application changes into separate releases to ensure the index
  is in place when the code that needs it is deployed.
- Trigger the database testing pipeline (`db:gitlabcom-database-testing`) in the `test` stage.
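The reversibility requirement above can be sketched outside Rails with a minimal stand-in. Note that `FakeMigration` and its DDL helpers are hypothetical illustrations for this sketch, not the `ActiveRecord::Migration` API:

```ruby
# Hypothetical stand-in for ActiveRecord::Migration, used only to
# illustrate that `down` must undo exactly what `up` applied.
class FakeMigration
  attr_reader :statements

  def initialize
    @statements = []
  end

  def add_column(table, column, type)
    @statements << "ALTER TABLE #{table} ADD COLUMN #{column} #{type}"
  end

  def remove_column(table, column)
    @statements << "ALTER TABLE #{table} DROP COLUMN #{column}"
  end

  def migrate(direction)
    public_send(direction) # :up or :down
  end
end

# A migration that defines both directions explicitly.
class AddStateToIssues < FakeMigration
  def up
    add_column(:issues, :state, :integer)
  end

  def down
    remove_column(:issues, :state)
  end
end

migration = AddStateToIssues.new
migration.migrate(:up)
migration.migrate(:down)
puts migration.statements
```

In a real migration, a `change` method lets Rails derive the `down` direction automatically for supported operations; an explicit `down` is needed when `up` uses operations Rails cannot reverse on its own.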
Data migrations are inherently risky. Additional actions are required to reduce the possibility of error that would result in corruption or loss of production data.
Include in the MR description:
- If the migration deletes data, apply the label ~data-deletion.

Write the raw SQL in the MR description. Preferably format it
with pgFormatter or
https://paste.depesz.com, use regular quotes
(for example, `"projects"."id"`), and avoid smart quotes (for example, `“projects”.“id”`).
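As a small illustration of the quoting point, smart quotes (which often sneak in when copying SQL from rich-text editors) can be normalized back to regular quotes before pasting. The `normalize_quotes` helper is hypothetical, invented for this sketch:

```ruby
# Replace smart double and single quotes with the regular quotes
# PostgreSQL expects in identifiers and string literals.
def normalize_quotes(sql)
  sql.tr('“”', '"').tr('‘’', "'")
end

puts normalize_quotes('SELECT “projects”.“id” FROM “projects”')
# => SELECT "projects"."id" FROM "projects"
```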
In case of queries generated dynamically by using parameters, there should be one raw SQL query for each variation.
For example, a finder for issues that may take an optional project filter as a parameter should include both the version of the query over issues alone and the one that joins issues and projects and applies the filter.
Some finders or other methods can generate a very large number of permutations. There is no need to exhaustively add all the possible generated queries; include the one with all the parameters, plus one for each type of query generated.
For example, if joins or a `GROUP BY` clause are optional, also include the versions without the `GROUP BY` clause and with fewer joins, while keeping the appropriate filters for the remaining tables.
If a query is always used with a limit and an offset, always include them, using the maximum allowed limit and a non-zero offset.
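As an illustration of "one raw SQL query for each variation", here is a hypothetical finder with an optional project filter (the method name and SQL are made up for this sketch, not actual GitLab code). Each branch produces a distinct query that needs its own raw SQL and query plan in the MR description:

```ruby
# Hypothetical finder: with no filter it queries issues alone;
# with a project_id it joins issues to projects and filters.
def issues_query(project_id: nil)
  if project_id
    'SELECT "issues".* FROM "issues" ' \
      'INNER JOIN "projects" ON "projects"."id" = "issues"."project_id" ' \
      "WHERE \"projects\".\"id\" = #{Integer(project_id)}"
  else
    'SELECT "issues".* FROM "issues"'
  end
end

# One raw SQL query per variation goes in the MR description:
puts issues_query
puts issues_query(project_id: 278964)
```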
When reviewing queries, we want to assess the SQL query that is actually executed in the database, including any scopes, pagination, and limits. It can be difficult or sometimes even impossible to infer the exact query from reading the Rails code. To get the most accurate SQL, try using one of these methods:
Use the performance bar:
Manually test your feature, then click the pg section of the performance bar
to see the SQL queries executed. To see the queries for an API request, find and select
the request using the drop-down menu on the right-hand side.
Check development.log: Test your feature, then look inside log/development.log
in the GitLab directory to see all the queries executed by the application.
This does not include queries executed in a Sidekiq worker.
Run specs with ActiveRecord::Base.logger: You can log queries to stdout by adding
these blocks to your RSpec tests:
```ruby
before do
  ActiveRecord::Base.logger = Logger.new($stdout)
end

after do
  ActiveRecord::Base.logger = nil
end
```
Run the tests with bundle exec rspec <test_file> and the queries will appear
in the test output. This will only produce correct queries in integration tests.
The output of unit tests may not be correct if there is an additional component modifying
the ActiveRecord relation, such as pagination middleware.
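Outside a full Rails setup, the logging pattern above can be sketched with a plain `Logger`. This is illustrative only: in real specs, `ActiveRecord::Base.logger` emits the SQL lines for you, whereas here the query line is written by hand:

```ruby
require "logger"
require "stringio"

# Capture log output in memory instead of $stdout so we can inspect it.
buffer = StringIO.new
logger = Logger.new(buffer)

# In a real spec, ActiveRecord would emit lines like this for every
# query executed by the code under test.
logger.debug('SELECT "projects".* FROM "projects" WHERE "projects"."id" = 1')

queries = buffer.string.lines.grep(/SELECT/)
puts queries
```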
You can use the `explain` command in the postgres.ai chatbot; the `explain` command runs
`EXPLAIN ANALYZE`. Create links to the plan using explain.depesz.com or explain.dalibo.com. Be sure to paste both the plan and the query used in the form.

To produce a query plan with enough data, you can use the IDs of:
- the `gitlab-org` namespace (`namespace_id = 9970`), for queries involving a group.
- the `gitlab-org/gitlab-foss` (`project_id = 13083`) or `gitlab-org/gitlab` (`project_id = 278964`) projects, for queries involving a project. The `project_namespace_id` of these projects may be required to create a query plan: these are `15846663` (for `gitlab-org/gitlab`) and `15846626` (for `gitlab-org/gitlab-foss`).
- the `gitlab-qa` user (`user_id = 1614863`), for queries involving a user. Alternatively, use your own `user_id`, or the `user_id` of a user with a long history within the project or group being used to generate the query plan.

Make sure the plan processes real data: no query plan should return 0 records, or fewer records than the provided limit (if a limit is included). If a query is used in batching, identify and provide a proper example batch with adequate included results.
> [!note]
> The `UPDATE` statement always reports 0 returned records. To identify the rows it updates, check the child node lines below the top-level node.
> For example, the following `UPDATE` statement returns 0 records, but we can see that it updates 1 row from the line starting with `->  Index Scan`:
```sql
EXPLAIN UPDATE p_ci_pipelines SET updated_at = current_timestamp WHERE id = 1606117348;

 ModifyTable on public.p_ci_pipelines  (cost=0.58..3.60 rows=0 width=0) (actual time=5.977..5.978 rows=0 loops=1)
   Buffers: shared hit=339 read=4 dirtied=4
   WAL: records=20 fpi=4 bytes=21800
   I/O Timings: read=4.920 write=0.000
   ->  Index Scan using ci_pipelines_pkey on public.ci_pipelines p_ci_pipelines_1  (cost=0.58..3.60 rows=1 width=18) (actual time=0.041..0.044 rows=1 loops=1)
         Index Cond: (p_ci_pipelines_1.id = 1606117348)
         Buffers: shared hit=8
         I/O Timings: read=0.000 write=0.000
```
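Following the note above, the actual number of updated rows can be read from the `actual ... rows=N` counter on the `->  Index Scan` child node rather than the top-level `rows=0`. A small sketch (the `actual_rows` helper is hypothetical, written only for this example):

```ruby
# An abbreviated EXPLAIN ANALYZE output for an UPDATE statement.
PLAN = <<~TEXT
  ModifyTable on public.p_ci_pipelines  (cost=0.58..3.60 rows=0 width=0) (actual time=5.977..5.978 rows=0 loops=1)
    ->  Index Scan using ci_pipelines_pkey on public.ci_pipelines p_ci_pipelines_1  (cost=0.58..3.60 rows=1 width=18) (actual time=0.041..0.044 rows=1 loops=1)
TEXT

# The top-level ModifyTable node reports rows=0 for UPDATE; the child
# scan node's "actual ... rows=N" shows how many rows were touched.
def actual_rows(plan)
  scan_line = plan.lines.find { |line| line.include?("->") }
  scan_line[/actual time=[^)]*rows=(\d+)/, 1].to_i
end

puts actual_rows(PLAN)
# => 1
```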
If your queries belong to a new feature in GitLab.com and thus don't return data in production, you can modify the data in a Database Lab thin clone, which supports data manipulation (`exec UPDATE issues SET ...`) and creation of new tables and columns (`exec ALTER TABLE issues ADD COLUMN ...`).

More information on how to find the number of actual returned records is available in Understanding EXPLAIN plans.
Be careful when inferring queries from an `ActiveRecord::Relation` returned by finders and scopes:
PostgreSQL query plans are dependent on all the final parameters,
including limits and other things that may be added before final execution.
One way to be sure of the actual query executed is to check
`log/development.log`.

- Remove any instances of `dependent: ...` that may no longer be necessary.
- Review the columns used in `WHERE`, `ORDER BY`, `GROUP BY`, and `JOIN`s.
- When adding new tables, populate them with fixtures in `db/fixtures/development/`. These fixtures are also used
  to ensure that upgrades complete successfully,
  so it's important that new tables are always populated.
- Avoid redundant indexes: `index(column_A, column_B, column_C)` makes the indexes `index(column_A, column_B)` and `index(column_A)` redundant.

Using `update`, `upsert`, `delete`, `update_all`, `upsert_all`, `delete_all`, or `destroy_all`
ActiveRecord methods requires extra care because they modify data and can perform poorly, or they
can destroy data if improperly scoped. These methods are also
incompatible with Common Table Expression (CTE) statements.
Danger will comment on a merge request diff when these methods are used.
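The scoping risk can be illustrated with an in-memory stand-in. `FakeRelation` is invented for this sketch and is not ActiveRecord; it only mimics how a missing `where` clause widens the scope of a destructive call:

```ruby
# In-memory stand-in for an ActiveRecord relation, illustrating why
# destructive methods must be scoped before they run.
class FakeRelation
  def initialize(rows)
    @rows = rows
  end

  def where(**conditions)
    FakeRelation.new(@rows.select { |row| conditions.all? { |k, v| row[k] == v } })
  end

  def delete_all
    @rows.count # rows that would be deleted in a single SQL statement
  end
end

issues = FakeRelation.new([
  { id: 1, project_id: 1 },
  { id: 2, project_id: 1 },
  { id: 3, project_id: 2 },
])

# Properly scoped: deletes only the intended project's rows.
puts issues.where(project_id: 1).delete_all # => 2

# Improperly scoped: deletes every row in the table.
puts issues.delete_all # => 3
```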
Follow documentation for preparation when adding or modifying queries to add the raw SQL query and query plan to the merge request description, and request a database review.
You can use `scripts/database/migrate.rb`
to make this process more efficient.

- Check that the database testing pipeline (`db:gitlabcom-database-testing`) is passing.
- Check that `db/structure.sql` contains only changes related to migrations in this merge request, with no unrelated schema modifications.
- Check the `#down` method.
- Check that the relevant version files under `db/schema_migrations` were added or removed.
- Raise anything that needs broader attention in the `#f_upcoming_release` Slack channel.
- Check the query plans in the `gitlab-com-database-testing` comment (titled `Database Migrations (on the main database)` and similar) to make sure they adhere to our query performance guidelines.