doc/development/database/adding_database_indexes.md
Indexes can be used to speed up database queries, but when should you add a new index? Traditionally the answer to this question has been to add an index for every column used for filtering or joining data. For example, consider the following query:
SELECT *
FROM projects
WHERE user_id = 2;
Here we are filtering by the user_id column and as such a developer may decide
to index this column.
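Following that traditional approach, the index might be created like this (a sketch; the index name is illustrative):

```sql
-- Hypothetical index to support the query above; the name follows the
-- common index_<table>_on_<column> convention.
CREATE INDEX index_projects_on_user_id
ON projects (user_id);
```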
While in certain cases indexing columns using the above approach may make sense,
it can actually have a negative impact. Whenever you write data to a table, any
existing indexes must also be updated. The more indexes there are, the slower this
can potentially become. Indexes can also take up significant disk space, depending
on the amount of data indexed and the index type. For example, PostgreSQL offers
GIN indexes which can be used to index certain data types that cannot be
indexed by regular B-tree indexes. These indexes, however, generally take up more
space and are slower to update compared to B-tree indexes.
Because of all this, it's important to consider the following when adding a new index:
In some situations, an index might not be required:
- The table is small (fewer than 1,000 records) and it's not expected to grow exponentially in size.

Additionally, wide indexes are not required to match all filter criteria of queries. We just need to cover enough columns so that the index lookup has a small enough selectivity.
The first step is to make sure your query re-uses as many existing indexes as possible. For example, consider the following query:
SELECT *
FROM todos
WHERE user_id = 123
AND state = 'open';
Now imagine we already have an index on the user_id column but not on the
state column. One may think this query performs badly due to state being
unindexed. In reality the query may perform just fine given the index on
user_id can filter out enough rows.
The best way to determine if indexes are re-used is to run your query using
EXPLAIN ANALYZE. Depending on the joined tables and the columns being used for filtering,
you may find an extra index doesn't make much, if any, difference.
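For the todos query above, that check might look like this (a sketch using the table and values from the example):

```sql
-- Inspect the actual plan and timings; look at which index (if any) is scanned
-- and how many rows are filtered out after the index lookup.
EXPLAIN ANALYZE
SELECT *
FROM todos
WHERE user_id = 123
AND state = 'open';
```

If the plan shows an index scan on the user_id index with only a few rows removed by the state filter afterwards, an extra index on state is unlikely to help.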
In short:

- Try to write your query so that it re-uses as many existing indexes as possible.
- Run the query using EXPLAIN ANALYZE and study the output to find the most ideal query.

Partial indexes are indexes with a WHERE clause that limits them to a subset of matching rows.
They can offer several advantages over full indexes, including:
Partial indexes work best for queries that always filter on known conditions and target a specific subset of data. Common use cases include:
- WHERE column IS NOT NULL
- WHERE feature_enabled = true
- WHERE deleted_at IS NULL
- WHERE status IN ('queued', 'running')

Before creating any new partial index, first examine existing indexes for potential reuse or modification. Since each index incurs maintenance overhead, prioritize adapting current indexes over adding new ones.
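For instance, a partial index covering only actively processed rows might look like this (a sketch; the table, column, and index names are hypothetical):

```sql
-- Index only the rows in states the application actually queries for;
-- completed or failed rows are excluded from the index entirely.
CREATE INDEX index_jobs_on_runner_id_active
ON jobs (runner_id)
WHERE status IN ('queued', 'running');
```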
Consider the following application code which introduces a new count query:
def namespace_count
NamespaceSetting.where(duo_features_enabled: duo_settings_value).count
end
def duo_settings_value
params['duo_settings_value'] == 'default_on'
end
where namespace_settings is a table with 1 million records,
and duo_features_enabled is a nullable Boolean column.
Let's assume that we recently introduced this column and it was not backfilled.
This means we know that the majority of the records in the namespace_settings table have a NULL
value for duo_features_enabled. We can also see that duo_settings_value will only either yield
true or false.
Indexing all rows would be inefficient as we mostly have NULL values. Instead,
we can introduce a partial index that targets only the data of interest:
CREATE INDEX index_namespace_settings_on_duo_features_enabled_not_null
ON namespace_settings (duo_features_enabled)
WHERE duo_features_enabled IS NOT NULL;
Now we have an index that is just a small fraction of the full index size and the query planner can effectively skip over hundreds of thousands of irrelevant records.
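To confirm the space savings, you can check the index size directly (a sketch using the index name from the example above):

```sql
-- Reports the on-disk size of the index in human-readable form.
SELECT pg_size_pretty(
  pg_relation_size('index_namespace_settings_on_duo_features_enabled_not_null')
);
```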
A database may decide not to use an index when a regular sequential scan (iterating over all rows) is faster, which is often the case for small tables.
Consider adding an index if a table is expected to grow, and your query has to filter a lot of rows.
You may not want to add an index if the table size is small (<1,000 records),
or if existing indexes already filter out enough rows.
Indexes have to be updated on every table write. In the case of PostgreSQL, all existing indexes are updated whenever data is written to a table. As a result, having many indexes on the same table slows down writes. It's therefore important to balance query performance with the overhead of maintaining an extra index.
For example, if adding an index reduces SELECT timings by 5 milliseconds but increases INSERT/UPDATE/DELETE timings by 10 milliseconds, the new index may not be worth it. A new index is more valuable when SELECT timings are reduced and INSERT/UPDATE/DELETE timings are unaffected.
GitLab enforces a limit of 15 indexes per table. This limit helps keep write overhead and storage consumption under control.
[!note] If you need to add an index to a table that already has 15 indexes, consider removing or consolidating existing indexes first.
We have RuboCop checks (PreventIndexCreation) that prevent adding new indexes to selected tables
that are frequently accessed.
This is due to LockManager LWLock contention.
For the same reason, there are also RuboCop checks (AddColumnsToWideTables) against adding
new columns to these tables.
To minimize the risk of creating unnecessary indexes, add the index in the same merge request as the application code change that needs it, if possible.
The migrations that create indexes are usually short, and do not significantly increase a merge request's size. Doing so allows backend and database reviewers to review more efficiently without switching contexts between merge requests or commits.
The authoritative guide is the migration style guide. When in doubt, consult the guide.
Here are some common scenarios with a recommended choice as a quick reference.
Use a post-deployment migration. Existing queries already work without the added indexes, and are not critical to operating the application.
If indexing takes a long time to finish (a post-deployment migration should take less than 10 minutes) consider indexing asynchronously.
Always examine the query plans for new or updated queries. First, confirm they do not time out or significantly exceed the recommended query timings without a dedicated index.
If the queries don't time out or breach the query timings, you may not need a dedicated index.
Queries that time out or breach query timings require different actions, depending on whether they do so only on GitLab.com, or for all GitLab instances. Most features require a dedicated index only for GitLab.com, one of the largest GitLab installations.
Use two MRs to create the index in a post-deployment migration and make the application code change:
[!note] If you can use a feature flag, you might be able to use a single MR to make the code changes behind the feature flag. Include the post-deployment migration at the same time. After the post-deployment migration executes, you can enable the feature flag.
For GitLab.com, we execute post-deployment migrations throughout a single release through continuous integration:
- At time t, a group of merge requests are merged and ready to deploy.
- At t+1, the regular migrations from the group are executed on GitLab.com's staging and production database.
- At t+2, the application code changes from the group start deploying in a rolling manner.

After the application code changes are fully deployed, the release manager can choose to execute post-deployment migrations at their discretion, at a much later time. The post-deployment migration executes one time per day, pending GitLab.com availability. For this reason, you need confirmation that the post-deployment migrations included in the first MR were executed before merging the second MR.
It's not possible to check query performance directly on GitLab Self-Managed instances. PostgreSQL produces an execution plan based on the data distribution, so predicting query performance there is difficult.
If you are concerned about the performance of a query on GitLab Self-Managed instances and decide that GitLab Self-Managed instances must have an index, follow these recommendations:
For this reason, an application must not assume a database schema applied by the post-deployment migrations has shipped in the same release. The application code should continue to work without the indexes added in the post-deployment migrations in the same release.
You have two options depending on how long it takes to create the index:
PostgreSQL's unique index acts as a constraint. Adding one to an existing table can be tricky.
Unless the table is absolutely guaranteed to be tiny for GitLab.com and GitLab Self-Managed instances, you must use multiple post-deployment migrations over multiple releases to:
Refer to the multi-release approach outlined in the section for adding a NOT NULL constraint.
PostgreSQL's unique index, unlike the regular constraints, cannot be introduced in a non-validated state. You must use PostgreSQL's partial unique index and the application validation to enforce the desired uniqueness for new and updated records while the removal and fix are in progress.
The details of the work might vary and require different approaches. Consult the Database team, reviewers, or maintainers to plan the work.
For more information, see Unique constraints in Cells.
Unused indexes should be dropped because they increase maintenance overhead, consume disk space, and can degrade query planning efficiency without providing any performance benefit. However, dropping an index that's still used could result in query performance degradation or timeouts, potentially leading to incidents. It's important to verify the index is unused on both GitLab.com and GitLab Self-Managed instances prior to removal.
To see which indexes are candidates for removal, you can run the following query:
SELECT relname AS table_name,
       indexrelname AS index_name,
       idx_scan,
       idx_tup_read,
       idx_tup_fetch,
       pg_size_pretty(pg_relation_size(indexrelname::regclass))
FROM pg_stat_all_indexes
WHERE schemaname = 'public'
  AND idx_scan = 0
  AND idx_tup_read = 0
  AND idx_tup_fetch = 0
ORDER BY pg_relation_size(indexrelname::regclass) DESC;
This query outputs a list containing all indexes that have not been used since the stats were last reset and sorts them by index size in descending order. More information on the meaning of the various columns can be found at https://www.postgresql.org/docs/16/monitoring-stats.html.
For GitLab.com, you can check the latest generated production reports
on postgres.ai and inspect the H002 Unused Indexes file.
[!warning] These reports only show indexes that have no recorded usage since the last statistics reset. They do not guarantee that the indexes are never used.
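Because "unused" only means "unused since the last statistics reset", it helps to check how long the counters have been accumulating (a sketch):

```sql
-- Shows when cumulative statistics for the current database were last reset;
-- a recent reset can make a genuinely used index look idle.
SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();
```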
This section contains resources to help you evaluate an index and confirm that it's safe to remove. Note that this is only a suggested guide and is not exhaustive. Ultimately, the goal is to gather enough data to justify dropping the index.
Be aware that certain factors can give the false impression that an index is unused, such as:
Start by gathering all the metadata available for the index, verifying its name and definition.
Run \d+ <PARENT_INDEX_NAME> in Database Lab.
Run the following query to see the full parent-child index structure in more detail:
SELECT
parent_idx.relname AS parent_index,
child_tbl.relname AS child_table,
child_idx.relname AS child_index,
dep.deptype,
pg_get_indexdef(child_idx.oid) AS child_index_def
FROM
pg_class parent_idx
JOIN pg_depend dep ON dep.refobjid = parent_idx.oid
JOIN pg_class child_idx ON child_idx.oid = dep.objid
JOIN pg_index i ON i.indexrelid = child_idx.oid
JOIN pg_class child_tbl ON i.indrelid = child_tbl.oid
WHERE
parent_idx.relname = '<PARENT_INDEX_NAME>';
For GitLab.com, you can view index usage data in Grafana.
Query the metric pg_stat_user_indexes_idx_scan filtered by the relevant index(es) for at least the last 6 months.
The query below shows index usage rate across all database instances combined.
sum by (indexrelname) (rate(pg_stat_user_indexes_idx_scan{env="gprd", relname=~"<TABLE_NAME_REGEX>", indexrelname=~"<INDEX_NAME_REGEX>"}[30d]))
For partitioned tables, we must check that all child indexes are unused prior to dropping the parent.
If the data shows that an index has zero or negligible usage, it's a strong candidate for removal. However, keep in mind that this is limited to usage on GitLab.com. We should still investigate all related queries to ensure it can be safely removed for GitLab Self-Managed instances.
An index that shows low usage may still be dropped if we can confirm that other existing indexes would sufficiently support the queries using it. PostgreSQL decides which index to use based on data distribution statistics, so in certain situations it may slightly prefer one index over another even if both indexes adequately support the query, which may account for the occasional usage.
The following are ways to find all queries that may utilize the index. It's important to understand the context in which the queries are or may be executed so that we can determine whether the index is still needed or can be safely replaced by other indexes.
Investigate the origins of the index.
Examine queries output from running the rspec:merge-auto-explain-logs CI job. The collected query plans are available in the job artifact auto_explain/auto_explain.ndjson.gz.

Examine queries recorded in PostgreSQL logs on Kibana.
Generally, you can filter for json.sql values that contain the table name and key column(s) from the index definition. Example KQL:
json.sql: <TABLE_NAME> AND json.sql: *<COLUMN_NAME>*
While there are many factors that affect index usage, the query's filtering and ordering clauses often have the most influence. A general guideline is to find queries whose conditions align with the index structure. For example, PostgreSQL is more likely to utilize a B-Tree index for queries that filter on the index's leading column(s) and satisfy its partial predicate (if any).
Caveat: We only keep the last 7 days of logs and this data does not apply to GitLab Self-Managed usage.
Manually search through the GitLab codebase.
After collecting the relevant queries, you can then obtain EXPLAIN plans to help you assess if a query relies on the index in question. For this process, it's necessary to have a good understanding of how indexes support queries and how their usage is affected by data distribution changes. We recommend seeking guidance from a database domain expert to help with your assessment.
When dropping or replacing an index, developers sometimes assume that an existing composite index can serve as a replacement. However, PostgreSQL B-tree indexes are most efficient when queries filter on the leading (leftmost) columns of the index. A composite index cannot efficiently support queries that filter only on non-leading columns.
For example, consider the ssh_signatures table with the following indexes:
- index_ssh_signatures_on_commit_sha on (commit_sha)
- index_ssh_signatures_on_project_id_and_commit_sha on (project_id, commit_sha)

The composite index (project_id, commit_sha) can efficiently support:
Queries filtering on both project_id and commit_sha:
SELECT * FROM ssh_signatures WHERE project_id = 1 AND commit_sha = 'abc123';
Queries filtering only on project_id (the leading column):
SELECT * FROM ssh_signatures WHERE project_id = 1;
However, the composite index cannot efficiently support queries filtering only on commit_sha:
SELECT * FROM ssh_signatures WHERE commit_sha = 'abc123';
For this query, the single-column index index_ssh_signatures_on_commit_sha is required. Dropping it would
cause the query to perform poorly, potentially leading to timeouts or incidents in production.
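Before dropping anything, you can confirm this with an execution plan (a sketch using the example table):

```sql
-- With index_ssh_signatures_on_commit_sha present, expect an index scan.
-- If that index were dropped, this query would likely fall back to a
-- sequential scan, because commit_sha is not the leading column of the
-- composite index.
EXPLAIN
SELECT * FROM ssh_signatures WHERE commit_sha = 'abc123';
```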
Before dropping any index, you must verify that all queries using that index can be efficiently served by other existing indexes. Pay special attention to composite index column ordering and ensure that queries filter on the leading columns. For more information on how PostgreSQL uses composite indexes, see the PostgreSQL documentation on multicolumn indexes.
If you're dropping an index that you think is unused, check the index usage stats.
Indexes with complex definitions must be explicitly named rather than relying on the implicit naming behavior of migration methods. In short, that means you must provide an explicit name argument for an index created with one or more of the following options:
- where
- using
- order
- length
- type
- opclass

Check our Constraints naming conventions page.
As Rails is database-independent, it generates an index name only from the required options of all indexes: table name and column names. For example, imagine the following two indexes are created in a migration:
def up
add_index :my_table, :my_column
add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
end
Creation of the second index would fail, because Rails would generate the same name for both indexes.
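In SQL terms, both add_index calls translate to DDL with the same generated index name, so the second statement fails (a sketch of the equivalent SQL):

```sql
CREATE INDEX index_my_table_on_my_column ON my_table (my_column);

-- Rails derives the identical name from the table and column names alone,
-- so creating the second (partial) index fails:
-- ERROR:  relation "index_my_table_on_my_column" already exists
CREATE INDEX index_my_table_on_my_column ON my_table (my_column)
WHERE my_column IS NOT NULL;
```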
This naming issue is further complicated by the behavior of the index_exists? method.
It considers only the table name, column names, and uniqueness specification
of the index when making a comparison. Consider:
def up
unless index_exists?(:my_table, :my_column, where: 'my_column IS NOT NULL')
add_index :my_table, :my_column, where: 'my_column IS NOT NULL'
end
end
The call to index_exists? returns true if any index exists on
:my_table and :my_column, and index creation is bypassed.
The add_concurrent_index helper is a requirement for creating indexes
on populated tables. Because it cannot be used inside a transactional
migration, it has a built-in check that detects if the index already
exists. In the event a match is found, index creation is skipped.
Without an explicit name argument, Rails can return a false positive
for index_exists?, causing a required index to not be created
properly. By always requiring a name for certain types of indexes, the
chance of error is greatly reduced.
The easiest way to test for existence of an index by name is to use the index_name_exists? method, but the index_exists? method can also be used with a name option. For example:
class MyMigration < Gitlab::Database::Migration[2.1]
INDEX_NAME = 'index_name'
def up
# an index must be conditionally created due to schema inconsistency
unless index_exists?(:table_name, :column_name, name: INDEX_NAME)
add_index :table_name, :column_name, name: INDEX_NAME
end
end
def down
# no op
end
end
Keep in mind that concurrent index helpers like add_concurrent_index, remove_concurrent_index, and remove_concurrent_index_by_name already perform existence checks internally.
There may be times when an index is only needed temporarily.
For example, in a migration, a column of a table might be conditionally updated. To query which columns must be updated while staying within the query performance guidelines, an index is needed that would otherwise not be used.
In these cases, consider a temporary index. To specify a temporary index, prefix the index name with tmp_ and follow the naming conventions.

A temporary migration would look like:
INDEX_NAME = 'tmp_index_projects_on_owner_where_emails_disabled'
def up
# Temporary index to be removed in 13.9 https://gitlab.com/gitlab-org/gitlab/-/issues/1234
add_concurrent_index :projects, :creator_id, where: 'emails_disabled = false', name: INDEX_NAME
end
def down
remove_concurrent_index_by_name :projects, INDEX_NAME
end
Sometimes it is necessary to add an index to support a batched background migration. This is commonly done by creating two post-deployment migrations: one to add the index, and one to queue the batched background migration.
In most cases, no additional work is needed. The new index is created and is used as expected when queuing and executing the batched background migration.
Expression indexes,
however, do not generate statistics for the new index on creation. Autovacuum
eventually runs ANALYZE, and updates the statistics so the new index is used.
Run ANALYZE explicitly only if it is needed right after the index
is created, such as in the background migration scenario described above.
To trigger ANALYZE after the index is created, update the index creation migration
to analyze the table:
# in db/post_migrate/
INDEX_NAME = 'tmp_index_projects_on_owner_and_lower_name_where_emails_disabled'
TABLE = :projects
disable_ddl_transaction!
def up
add_concurrent_index TABLE, '(creator_id, lower(name))', where: 'emails_disabled = false', name: INDEX_NAME
connection.execute("ANALYZE #{TABLE}")
end
ANALYZE should only be run in post deployment migrations and should not target
large tables.
If this behavior is needed on a larger table, ask for assistance in the #database Slack channel.
You cannot create indexes concurrently on partitioned tables. However, creating indexes non-concurrently holds a write lock on the table being indexed, and long-held write locks must be avoided in a hot system to prevent service disruption.
As a workaround, the Database team has provided add_concurrent_partitioned_index.
This helper creates indexes on partitioned tables without holding a write lock.
Under the hood, add_concurrent_partitioned_index creates the index on each partition using CONCURRENTLY, and then attaches the partition indexes to the index on the parent table.

A Rails migration example:
# in db/post_migrate/
class AddIndexToPartitionedTable < Gitlab::Database::Migration[2.1]
include Gitlab::Database::PartitioningMigrationHelpers
disable_ddl_transaction!
TABLE_NAME = :table_name
COLUMN_NAMES = [:partition_id, :id]
INDEX_NAME = :index_name
def up
add_concurrent_partitioned_index(TABLE_NAME, COLUMN_NAMES, name: INDEX_NAME)
end
def down
remove_concurrent_partitioned_index_by_name(TABLE_NAME, INDEX_NAME)
end
end
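Conceptually, the helper's work corresponds to PostgreSQL's standard pattern for indexing partitioned tables (a sketch with hypothetical table and partition names; the helper's exact implementation may differ):

```sql
-- 1. Create an index on the parent only; it starts out invalid and takes
--    no long-held write lock on the partitions.
CREATE INDEX index_name ON ONLY table_name (partition_id, id);

-- 2. Create a matching index on each partition, concurrently.
CREATE INDEX CONCURRENTLY index_name_partition_1
ON table_name_partition_1 (partition_id, id);

-- 3. Attach each partition index; the parent index becomes valid once
--    all partitions are attached.
ALTER INDEX index_name ATTACH PARTITION index_name_partition_1;
```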
For very large tables, index creation can be a challenge to manage.
While add_concurrent_index creates indexes in a way that does not block
ordinary traffic, it can still be problematic when index creation runs for
many hours. Necessary database operations like autovacuum cannot run, and
on GitLab.com, the deployment process is blocked waiting for index
creation to finish.
To limit impact on GitLab.com, a process exists to create indexes asynchronously during weekend hours. Due to generally lower traffic and fewer deployments, index creation can proceed at a lower level of risk.
An example of creating an index using
the asynchronous index helpers can be seen in the block below. This migration
enters the index name and definition into the postgres_async_indexes
table. The process that runs on weekends pulls indexes from this
table and attempts to create them.
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
# TODO: Index to be created synchronously in https://gitlab.com/gitlab-org/gitlab/-/issues/XXXXX
def up
prepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
def down
unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
For a partitioned table, use:
# in db/post_migrate/
include Gitlab::Database::PartitioningMigrationHelpers
PARTITIONED_INDEX_NAME = 'index_p_ci_builds_on_some_column'
# TODO: Partitioned index to be created synchronously in https://gitlab.com/gitlab-org/gitlab/-/issues/XXXXX
def up
prepare_partitioned_async_index :p_ci_builds, :some_column, name: PARTITIONED_INDEX_NAME
end
def down
unprepare_partitioned_async_index :p_ci_builds, :some_column, name: PARTITIONED_INDEX_NAME
end
Async indexes are only supported for GitLab.com environments,
so prepare_async_index and prepare_partitioned_async_index are no-ops for other environments.
[!note]
prepare_partitioned_async_index only creates the indexes for partitions asynchronously. It doesn't attach the partition indexes to the partitioned table. In the next step for the partitioned table, add_concurrent_partitioned_index not only adds the index synchronously but also attaches the partition indexes to the partitioned table.
Verify that the post-deploy migration was executed on GitLab.com using ChatOps with
/chatops gitlab run auto_deploy status <merge_sha>. If the output returns db/gprd,
the post-deploy migration has been executed in the production database. For more information, see
How to determine if a post-deploy migration has been executed on GitLab.com.
In the case of an index created asynchronously, wait until the next week so that the index can be created over a weekend.
Async indexes are scheduled to run every 12th minute during weekends (12 * * * 0,6). The configuration is set in chef-repo and omnibus.
If the index is not created after a weekend, check the status of queued index operations by running the below query in a recent DB thin clone.
SELECT definition, created_at, attempts, last_error FROM postgres_async_indexes
WHERE definition ILIKE 'CREATE%'
ORDER BY attempts ASC, id ASC;
Use Database Lab to check if creation was successful.
Ensure the output does not indicate the index is invalid.
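One way to check for invalid indexes is to query pg_index directly (a sketch):

```sql
-- Lists indexes that exist but are not valid, for example after a failed
-- CREATE INDEX CONCURRENTLY; the planner does not use an invalid index.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```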
After the index is verified to exist on the production database, create a second
merge request that adds the index synchronously. The schema changes must be
updated and committed to structure.sql in this second merge request.
The synchronous migration results in a no-op on GitLab.com, but you should still add the
migration as expected for other installations. The below block
demonstrates how to create the second migration for the previous
asynchronous example.
[!warning] Verify that the index exists in production before merging a second migration with
add_concurrent_index. If the second migration is deployed before the index has been created, the index is created synchronously when the second migration executes.
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
disable_ddl_transaction!
def up
add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
end
def down
remove_concurrent_index_by_name :ci_builds, INDEX_NAME
end
# in db/post_migrate/
include Gitlab::Database::PartitioningMigrationHelpers
PARTITIONED_INDEX_NAME = 'index_p_ci_builds_on_some_column'
disable_ddl_transaction!
def up
add_concurrent_partitioned_index :p_ci_builds, :some_column, name: PARTITIONED_INDEX_NAME
end
def down
remove_concurrent_partitioned_index_by_name :p_ci_builds, PARTITIONED_INDEX_NAME
end
You must test the database index changes locally before creating a merge request.
Use the asynchronous index helpers on your local environment to test changes for creating an index:
- Enable the feature flags by running Feature.enable(:database_async_index_creation) and Feature.enable(:database_reindexing) in the Rails console.
- Run bundle exec rails db:migrate so that it creates an entry in the postgres_async_indexes table.
- Run bundle exec rails gitlab:db:execute_async_index_operations:all so that the index is created asynchronously on all databases.
- To verify the index, open the PostgreSQL console using gdk psql and run \d <index_name> to check that your newly created index exists.
For a partitioned table, run \d gitlab_partitions_dynamic.<table_name> to check the indexes on its partitions.

For very large tables, index destruction can be a challenge to manage.
While remove_concurrent_index removes indexes in a way that does not block
ordinary traffic, it can still be problematic if index destruction runs for
many hours. Necessary database operations like autovacuum cannot run, and
the deployment process on GitLab.com is blocked while waiting for index
destruction to finish.
To limit the impact on GitLab.com, use the following process to remove indexes asynchronously during weekend hours. Due to generally lower traffic and fewer deployments, index destruction can proceed at a lower level of risk.
For example, to destroy an index using the asynchronous index helpers:
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
# TODO: Index to be destroyed synchronously in https://gitlab.com/gitlab-org/gitlab/-/issues/XXXXX
def up
prepare_async_index_removal :ci_builds, :some_column, name: INDEX_NAME
end
def down
unprepare_async_index :ci_builds, :some_column, name: INDEX_NAME
end
This migration enters the index name and definition into the postgres_async_indexes
table. The process that runs on weekends pulls indexes from this table and attempts
to remove them.
You must test the database index changes locally before creating a merge request. Include the output of the test in the merge request description.
Verify that the post-deploy migration was executed on GitLab.com using ChatOps with
/chatops gitlab run auto_deploy status <merge_sha>. If the output returns db/gprd,
the post-deploy migration has been executed in the production database. For more information, see
How to determine if a post-deploy migration has been executed on GitLab.com.

After you verify the index no longer exists in the production database, create a second
merge request that removes the index synchronously. The schema changes must be
updated and committed to structure.sql in this second merge request.
The synchronous migration results in a no-op on GitLab.com, but you should still add the
migration as expected for other installations. For example, to
create the second migration for the previous asynchronous example:
[!warning] Verify that the index no longer exists in production before merging a second migration with
remove_concurrent_index_by_name. If the second migration is deployed before the index has been destroyed, the index is destroyed synchronously when the second migration executes.
# in db/post_migrate/
INDEX_NAME = 'index_ci_builds_on_some_column'
disable_ddl_transaction!
def up
remove_concurrent_index_by_name :ci_builds, name: INDEX_NAME
end
def down
add_concurrent_index :ci_builds, :some_column, name: INDEX_NAME
end
To test changes for removing an index, use the asynchronous index helpers on your local environment:
- Enable the feature flag by running Feature.enable(:database_reindexing) in the Rails console.
- Run bundle exec rails db:migrate, which should create an entry in the postgres_async_indexes table.
- Run bundle exec rails gitlab:db:reindex to destroy the index asynchronously.
- To verify, open the PostgreSQL console using gdk psql and run \d <index_name> to check that the destroyed index no longer exists.