doc/development/database/batching_best_practices.md
This document gives an overview about the available batching strategies we use at GitLab. We list the pros and cons of each strategy so engineers can pick the ideal approach for their use case.
When dealing with a large volume of records, reading, updating or deleting the records in one database query can be challenging; the operation could easily time out. To avoid this problem, we should process the records in batches. Batching usually happens in background jobs, where runtime constraints are more relaxed than during web requests.
In rare cases (older features), batching also happens in web requests. However, for new features this is discouraged due to the short web request timeout (60 seconds by default). As a guideline, using background jobs (Sidekiq workers) should be considered as the first option when implementing a feature that needs to process a large volume of records.
Batching performance is closely related to pagination performance since the underlying libraries and database queries are essentially the same. When implementing batching it's important to be familiar with the pagination performance guidelines and the documentation related to our batching utilities.
There are two main aspects to consider when implementing batching in background jobs: total runtime and data modification volume.
Background jobs shouldn't run for a long time. A Sidekiq process can crash or it can be forcefully stopped (for example, on restart or deployment). Additionally, due to our error budget rules, after 5 minutes of runtime, error budget violations will be added to the group where the feature is registered. When implementing batching in background jobs, make sure that you're familiar with our guidelines related to idempotent jobs
Updating or deleting a large volume of records can increase database replication lag and it can add extra strain to the primary database. It's advisable to limit the total number of records we process (or batch over) within the background job.
To address the potential issues mentioned above the following measures should be considered:
When applying limits, it's important to mention that long-running background jobs should implement a "continue later" mechanism where a new job is scheduled after the limit is reached to continue the work where the batching was stopped. This is important when a job is so long that it's very likely that it won't fit into the 5 minutes runtime.
An example implementation of runtime limiting using the Gitlab::Metrics::RuntimeLimiter class:
def perform(project_id)
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(1)
project.issues.each_batch(of: :iid) do |scope|
scope.update_all(updated_at: Time.current)
break if runtime_limiter.over_time?
end
end
The batching in the code snippet stops when 3 minutes of runtime is reached. The problem now is that we have no way to continue the processing. To do that, we need to schedule a new background job with enough information to continue the processing. In the snippet, we batch issues within a project by the iid column. For the next job, we need to provide the project ID and the last processed iid values. This information we often call as the cursor.
def perform(project_id, iid = nil)
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(project_id)
# Restore the previous iid if present
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
max_iid = scope.maximum(:iid)
scope.update_all(updated_at: Time.current)
if runtime_limiter.over_time?
MyJob.perform_in(2.minutes, project_id, max_iid)
break
end
end
end
Implementing a "continue later" mechanism can add significant complexity to the implementation. Hence, before committing to this work, analyze the existing data in the production database and try to extrapolate data growth. A few examples:
pending todos for a given user as done does not need a "continue later" mechanism.
When a very large volume of updates happen in the background job, it's advisable (not a strict requirement) to add some sleep to the code and limit the total number of records we update. This reduces the pressure on the primary databases and gives a small window for potential database migrations to acquire heavier locks.
def perform(project_id, iid = nil)
max_updates = 100_000 # Allow maximum N updates
updates = 0
status = :completed
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
project = Project.find(project_id)
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
max_iid = scope.maximum(:iid)
updates += scope.update_all(updated_at: Time.current)
if runtime_limiter.over_time? || updates >= max_updates
MyJob.perform_in(2.minutes, project_id, max_iid)
status = :limit_reached
break
end
# Adding sleep when we expect long running batching that modifies large volume of data
sleep 0.01
end
end
For traceability purposes, it's a good practice to expose metrics so we can see how the batching performs in Kibana:
log_extra_metadata_on_done(:result, {
status: :limit_reached, # or :completed
updated_rows: updates
})
Scheduling the next job in the example above is not crash safe (the job can be lost), for very important tasks this approach is not suitable. A safe and common pattern is using a scheduled worker that executes the work based on a cursor. The cursor can be persisted in the DB or in Redis depending on the consistency requirements. This means that the cursor is no longer passed via the job arguments.
The frequency of the scheduled worker can be adjusted depending on the urgency of the task. We have examples when a scheduled worker is enqueued every minute to process urgent items.
Example: process all issues in a project.
def perform
project_id, iid = load_cursor # Load cursor from Redis
return unless project_id # Nothing was enqueued
project = Project.find(project_id)
project.issues.where('iid > ?', iid || 0).each_batch(of: :iid) do |scope|
# Do something with issues.
# Break here, set interrupted flag if time limit is up.
# Set iid to the last processed value.
end
# Continue the work later
push_cursor(project_id, iid) if interrupted?
end
private
def load_cursor
# Take 1 element, not crash safe.
raw_cursor = Gitlab::Redis::SharedState.with do |redis|
redis.lpop('my_cursor')
end
return unless raw_cursor
cursor = Gitlab::Json.parse(raw_cursor)
[cursor['project_id'], cursor['iid']]
end
def push_cursor(project_id, iid)
# Work is not finished, put the cursor at the beginning of the list so the next job can pick it up.
Gitlab::Redis::SharedState.with do |redis|
redis.lpush('my_cursor', Gitlab::Json.dump({ project_id: project_id, iid: iid }))
end
end
In the application code, you can put an item in the queue after the database transaction commits (see transaction guidelines for more details):
def execute
ApplicationRecord.transaction do
user.save!
Event.create!(user: user, issue: issue)
end
# Application could crash here
MyRedieQueue.add(user: user, issue: issue)
end
This approach is not crash-safe, the item would not be enqueued if the application crashes right after the transaction commits.
Pros:
Cons:
An alternative approach would be storing the queue in the PostgreSQL database. In this case, we can implement the transactional outbox pattern which ensures consistency in case of application (web or worker) crashes.
Pros:
Cons:
Example: set up reliable way of sending emails
# In a service
def execute
ApplicationRecord.transaction do
user.save!
Event.create!(user: user, issue: issue)
IssueEmailWorkerQueue.insert!(user: user, issue: issue)
end
end
The IssueEmailWorkerQueue record stores all necessary information for executing a job. In the scheduled background job we can process the table in a specific order.
def perform
runtime_limiter = Gitlab::Metrics::RuntimeLimiter.new(3.minutes)
items = EmailWorkerQueue.order(:id).take(25)
items.each do |item|
# Do something with the item
end
end
[!note] To avoid parallel processing of records, you might need to wrap the execution with a distributed Redis lock.
Example Redis lock usage:
class MyJob
include ApplicationWorker
include Gitlab::ExclusiveLeaseHelpers
MAX_TTL = 2.5.minutes.to_i # It should be similar to the runtime limit.
def perform
in_lock('my_lock_key', ttl: MAX_TTL, retries: 0) do
# Do the work here.
end
end
end
Sidekiq jobs can consume substantial database resources. If your job only batches over data but does not modify anything in the database, consider setting attributes favoring database replicas. See the documentation for the Sidekiq worker attributes.
To keep the examples easy to follow, we omit the code for limiting the runtime.
[!note] Some examples include an optional variable assignment to the
cursorvariable. This is optional step which can be used when implementing the "continue later" mechanism.
The strategy leverages the fact that after updating or deleting records in the database, the exact same query will return different records. This strategy can only be used when we want to delete or update certain records.
Example:
loop do
# Requires an index on project_id
delete_count = project.issues.limit(1000).delete_all
break if delete_count == 0 # Exit the loop when there are not records to be deleted
end
Pros:
Cons:
DELETE or UPDATE query is a must. There are some issues with CTEs when updating or deleting records.break logic has a bug we might end up in an infinite loop.It's possible to make the loop-based approach process records in a specific order:
loop do
# Requires a composite index on (project_id, created_at)
delete_count = project.issues.limit(1000).order(created_at: :desc).delete_all
break if delete_count == 0
end
With the index mentioned in the previous example, we can also use timestamp conditions:
loop do
# Requires a composite index on (project_id, created_at)
delete_count = project
.issues
.where('created_at < ?', 1.month.ago)
.limit(1000)
.order(created_at: :desc)
.delete_all
break if delete_count == 0
end
We can use a single, unique column (primary key or column which has a unique index) for batching with the EachBatch module. This is one of the most commonly used batching strategy in GitLab.
# Requires a composite index on (project_id, id).
# EachBatch uses the primary key by default for the batching.
cursor = nil
project.issues.where('id > ?', cursor || 0).each_batch do |batch|
issues = batch.to_a
cursor = issues.last.id # For the next job
# do something with the issues records
end
Pros:
Cons:
ORDER BY column (ID) must be unique in the context of the query.timestamp column condition or other complex conditions (IN, NOT EXISTS) are present.EachBatch requires a unique database column (usually the ID column) however, there are rare cases when the feature needs to batch over a non-unique column. Example: bump all project timestamp values which have at least one issue.
One approach is to batch over the "parent" table, in this case using the Project model.
cursor = nil
# Uses the primary key index
Project.where('id > ?', cursor || 0).each_batch do |batch|
cursor = batch.maximum(:id) # For the next job
project_ids = batch
.where('EXISTS (SELECT 1 FROM issues WHERE projects.id=issues.project_id)')
.pluck(:id)
Project.where(id: project_ids).update_all(update_all: Time.current)
end
Pros:
Cons:
The batching query runs a full table scan over the projects table which might be wasteful, alternatively, we can use the distinct_each_batch helper method:
# requires an index on (project_id)
Issue.distinct_each_batch(column: :project_id) do |scope|
project_ids = scope.pluck(:project_id)
cursor = project_ids.last # For the next job
Project.where(id: project_ids).update_all(update_all: Time.current)
end
Pros:
Cons:
Keyset-based batching allows you to iterate over records in a specific order where multi-column sorting is also possible. The most common use cases are when we need to process data ordered via a timestamp column.
Example: delete issue records older than one year.
def perform
cursor = load_cursor || {}
# Requires a composite index on (created_at, id) columns
scope = Issue.where('created_at > ?', 1.year.ago).order(:created_at, :id)
iterator = Gitlab::Pagination::Keyset::Iterator.new(scope: scope, cursor: cursor)
iterator.each_batch(of: 100) do |records|
loaded_records = records.to_a
loaded_records.each { |record| record.destroy } # Calling destroy so callbacks are invoked
end
cursor = iterator.send(:cursor) # Store the cursor after this step, for the next job
end
With keyset-based batching, you could adjust the ORDER BY clause to match the column configuration of an existing index. Consider the following index:
CREATE INDEX issues_search_index ON issues (project_id, state, created_at, id)
This index cannot be used by the snippet above because the ORDER BY column list doesn't match exactly the column list in the index definition. However, if we alter the ORDER BY clause then the index would be picked up by the query planner:
# Note: this is a different sort order but at least we can use an existing index
scope = Issue.where('created_at > ?', 1.year.ago).order(:project_id, :state, :created_at, :id)
Pros:
Cons:
ORDER BY column will be stored in the cursor).This batching technique uses offset pagination when loading new records. Offset pagination should be used only as a last resort when the given query cannot be paginated via EachBatch or via keyset-pagination. One reason for choosing this technique is when there is no suitable index available for the SQL query to use a different batching technique. Example: in a background job we load too many records without limit and it started to time out. The order of the records are important.
def perform(project_id)
# We have a composite index on (project_id, created_at) columns
issues = Issue
.where(project_id: project_id)
.order(:created_at)
.to_a
# do something with the issues
end
As the number of issues within the project grows, the query gets slower and eventually times out. Using a different batching technique such as keyset-pagination is not possible because the ORDER BY clause is depending on a timestamp column which is not unique (see the tie-breaker section). Ideally, we should order on the created_at, id columns, however we don't have that index available. In a time-sensitive scenario (such as an incident) it might not be feasible to introduce a new index right away so as a last resort we can attempt offset pagination.
def perform(project_id)
page = 1
loop do
issues = Issue.where(project_id: project_id).order(:created_at).page(page).to_a
page +=1
break if issues.empty?
# do something with the issues
end
end
The snippet above can be a short term fix until a proper solution is in place. Offset pagination gets slower as the page number increases which means that there might be a chance where the offset paginated query times out the same way as the original query. The chances are reduced to some extent by the database buffer cache which keeps the previously loaded records in memory; Thus, the consecutive (short-term) lookup of the same rows will not have very high impact on the performance.
Pros:
Cons:
We have several features where we need to query data in the top-level namespace and its subgroups. There are outlier group hierarchies which contain several thousand subgroups or projects. Querying such hierarchies can easily lead to database statement timeouts when additional subqueries or joins are added.
Example: iterate over issues in a group
group = Group.find(9970)
Issue.where(project_id: group.all_project_ids).each_batch do |scope|
# Do something with issues
end
The example above will load all subgroups, all projects and all issues in the group hierarchy which will very likely lead to database statement timeout. The query above can be slightly improved with database indexes as a short-term solution.
When you need to process records in a specific order in a group, you can use the in-operator optimization which can provide better performance than using a standard each_batch based batching strategy.
You can see an example for batching over records in the group hierarchy.
Pros:
Cons:
This technique can be used when we always have to batch from the top-level group (group without parent group). In this case we can leverage the following index in the namespaces table:
"index_on_namespaces_namespaces_by_top_level_namespace" btree ((traversal_ids[1]), type, id) -- traversal_ids[1] is the top-level group id
Example batching query:
Namespace.where('traversal_ids[1] = ?', 9970).where(type: 'Project').each_batch do |project_namespaces|
project_ids = Project.where(project_namespace_id: project_namespaces.select(:id)).pluck(:id)
cursor = project_namespaces.last.id # For the next job
project_ids.each do |project_id|
Issue.where(project_id: project_id).each_batch(column: :iid) do |issues|
# do something with the issues
end
end
end
Pros:
EachBatch.Cons:
Using the NamespaceEachBatch class allows us to batch a specific branch of the group hierarchy (tree).
# current_id: id of the namespace record where we iterate from
# depth: depth of the tree where the iteration was stopped previously. Initially, it should be the same as the current_id
cursor = { current_id: 9970, depth: [9970] } # This can be any namespace id
# Instantiate the object to iterate over project namespaces only.
iterator = Gitlab::Database::NamespaceEachBatch.new(namespace_class: Namespaces::ProjectNamespace, cursor: cursor)
# Requires a composite index on (parent_id, id) columns
iterator.each_batch(of: 100) do |ids, new_cursor|
cursor = new_cursor # For the next job, contains the new current_id and depth values
project_ids = Project.where(project_namespace_id: ids)
project_ids.each do |project_id|
Issue.where(project_id: project_id).each_batch(column: :iid) do |issues|
# do something with the issues
end
end
end
Pros:
Cons:
We consider complex queries where the query contains multiple filters and joins. Most of the time these queries cannot be batched easily. A few examples:
JOIN to filter out rows.IN filters or complex AND or OR conditions.