doc/development/sidekiq/compatibility_across_updates.md
The arguments for a Sidekiq job are stored in a queue while it is scheduled for execution. During an online update, this could lead to several possible situations:
On GitLab.com, we do not currently have a Sidekiq deployment in the canary stage. This means that a new worker that can be scheduled from an HTTP endpoint may be scheduled from canary but not run on Sidekiq until the full production deployment is complete. This can be several hours after the job is scheduled. For some workers, this is not a problem. For others, particularly latency-sensitive jobs, this results in a poor user experience.
This only applies to new worker classes when they are first introduced. As we recommend using feature flags as a general development process, it's best to control the entire change (including scheduling of the new Sidekiq worker) with a feature flag.
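As a minimal sketch of gating the scheduling of a new worker behind a feature flag, consider the following plain Ruby example. `ExampleFlags`, `ExampleQueue`, `handle_request`, and the flag name `:schedule_new_example_worker` are hypothetical stand-ins invented for illustration. They are not the application's feature flag API or Sidekiq's queue.

```ruby
# Hypothetical stand-in for the application's feature flag check.
class ExampleFlags
  def initialize(enabled_flags = [])
    @enabled_flags = enabled_flags
  end

  def enabled?(name)
    @enabled_flags.include?(name)
  end
end

# Hypothetical in-memory queue standing in for the real job queue.
class ExampleQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def push(worker_name, args)
    @jobs << [worker_name, args]
  end
end

# Only enqueue the new worker when the flag is enabled. Otherwise keep
# the existing behavior, so no job is scheduled before Sidekiq can run it.
def handle_request(object_id, flags:, queue:)
  if flags.enabled?(:schedule_new_example_worker)
    queue.push('NewExampleWorker', [object_id])
    :enqueued
  else
    :legacy_path
  end
end
```

With the flag disabled, nothing reaches the queue, so the flag can stay off until the Sidekiq deployment that contains the new worker class has finished.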
Jobs need to be backward and forward compatible between consecutive versions of the application. Adding or removing an argument may cause problems.
During any deployment, there's a period of time where some application nodes have been updated while others haven't. If an updated node queues a job with new arguments, but an older Sidekiq node processes it, the job will fail due to an argument mismatch.
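The argument mismatch can be reproduced in plain Ruby without Sidekiq. In this sketch, `OutdatedExampleWorker` and `run_queued_job` are hypothetical names. `run_queued_job` only imitates the way the stored argument list is splatted into `perform`.

```ruby
# Hypothetical worker as it exists on a Sidekiq node that is not yet updated.
class OutdatedExampleWorker
  def perform(object_id)
    "processed #{object_id}"
  end
end

# Imitates a job runner calling `perform` with the argument list
# that was stored in the queue when the job was enqueued.
def run_queued_job(worker_class, args)
  worker_class.new.perform(*args)
rescue ArgumentError => e
  "failed: #{e.message}"
end

# Job enqueued by a node that has not been updated: one argument.
run_queued_job(OutdatedExampleWorker, [42])

# Job enqueued by an updated node that already passes a new argument.
# The outdated worker raises ArgumentError, so the job fails.
run_queued_job(OutdatedExampleWorker, [42, 'extra'])
```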
For GitLab.com, this can occur if there are multiple deployments in the same milestone. Most self-managed deployments update all nodes sequentially in a single deployment cycle each release, so we need to spread the changes across multiple releases.
Before you remove arguments from the perform_async and perform methods, deprecate them. The
following example deprecates and then removes arg2 from the perform_async method:
Provide a default value (usually nil) and use a comment to mark the
argument as deprecated in the coming minor release. (Release M)
class ExampleWorker
  # Keep arg2 parameter for backwards compatibility.
  def perform(object_id, arg1, arg2 = nil)
    # ...
  end
end
One minor release later, stop using the argument in perform_async. (Release M+1)
ExampleWorker.perform_async(object_id, arg1)
At the next major release, remove the value from the worker class. (Next major release)
class ExampleWorker
  def perform(object_id, arg1)
    # ...
  end
end
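The reason for the intermediate release can be checked with a small plain Ruby sketch. `ReleaseMExampleWorker`, `NextMajorExampleWorker`, and `safe_to_run?` are hypothetical names used only to compare the two method signatures. They do not involve Sidekiq itself.

```ruby
# Worker signature during the deprecation period (Release M).
class ReleaseMExampleWorker
  # Keep arg2 parameter for backwards compatibility.
  def perform(object_id, arg1, arg2 = nil)
    "processed #{object_id} with #{arg1}"
  end
end

# Worker signature after the argument is removed (next major release).
class NextMajorExampleWorker
  def perform(object_id, arg1)
    "processed #{object_id} with #{arg1}"
  end
end

# Returns true when the worker can run with the stored argument list.
def safe_to_run?(worker_class, args)
  worker_class.new.perform(*args)
  true
rescue ArgumentError
  false
end

safe_to_run?(ReleaseMExampleWorker, [1, 'a', 'b'])  # job enqueued with arg2
safe_to_run?(ReleaseMExampleWorker, [1, 'a'])       # job enqueued without arg2
safe_to_run?(NextMajorExampleWorker, [1, 'a', 'b']) # fails if such jobs remain
```

The Release M signature accepts both argument lists. The final signature rejects jobs that still carry arg2, which is why all callers must stop passing it first.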
There are two options for safely adding new arguments to Sidekiq workers:
This approach requires multiple releases.
Add the argument to the worker with a default value (Release M).
class ExampleWorker
  def perform(object_id, new_arg = nil)
    # ...
  end
end
Add the new argument to all the invocations of the worker (Release M+1).
ExampleWorker.perform_async(object_id, new_arg)
Remove the default value (Release M+2).
class ExampleWorker
  def perform(object_id, new_arg)
    # ...
  end
end
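One way to reason about these steps is to compare each version's `perform` signature with the argument lists that may still be in the queue. The following sketch uses Ruby's `Method#parameters` for that comparison. The class names and the `accepts_args?` helper are hypothetical and exist only for this illustration.

```ruby
class WorkerBeforeChange
  def perform(object_id); end
end

class WorkerAtReleaseM
  def perform(object_id, new_arg = nil); end
end

class WorkerAtReleaseMPlus2
  def perform(object_id, new_arg); end
end

# Returns true when `perform` can be called with the given positional
# arguments, based on its required, optional, and splat parameters.
def accepts_args?(worker_class, args)
  params = worker_class.instance_method(:perform).parameters
  required = params.count { |type, _name| type == :req }
  optional = params.count { |type, _name| type == :opt }
  has_rest = params.any? { |type, _name| type == :rest }

  args.size >= required && (has_rest || args.size <= required + optional)
end

old_job = [1]        # enqueued before callers pass new_arg
new_job = [1, 'new'] # enqueued after callers pass new_arg

accepts_args?(WorkerBeforeChange, new_job)    # false: why callers wait until M+1
accepts_args?(WorkerAtReleaseM, old_job)      # true
accepts_args?(WorkerAtReleaseM, new_job)      # true
accepts_args?(WorkerAtReleaseMPlus2, old_job) # false: why the default stays until M+2
```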
This approach doesn't require multiple releases if an existing worker already uses a parameter hash.
Use a parameter hash in the worker to allow future flexibility.
class ExampleWorker
  def perform(object_id, params = {})
    # ...
  end
end
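When reading values from a parameter hash, keep in mind that Sidekiq serializes job arguments as JSON, so symbol keys arrive as strings. The following sketch imitates that round trip with Ruby's `json` library. `ParamsHashExampleWorker`, `simulate_enqueue_and_run`, and the `new_option` key are hypothetical names chosen for illustration.

```ruby
require 'json'

class ParamsHashExampleWorker
  def perform(object_id, params = {})
    # Read with a string key and a fallback, so jobs enqueued
    # before the option existed still run.
    new_option = params.fetch('new_option', 'default')
    [object_id, new_option]
  end
end

# Imitates enqueueing (arguments are dumped to JSON) and later
# execution (arguments are parsed back and splatted into `perform`).
def simulate_enqueue_and_run(worker_class, *args)
  stored_args = JSON.parse(JSON.generate(args))
  worker_class.new.perform(*stored_args)
end

# Job enqueued without the new option.
simulate_enqueue_and_run(ParamsHashExampleWorker, 1)

# Job enqueued with the new option. The symbol key becomes a string.
simulate_enqueue_and_run(ParamsHashExampleWorker, 1, { new_option: 'custom' })
```

Because both argument lists match the same `perform` signature, a new key can be added to the hash without changing the number of positional arguments.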
To remove a worker class, follow these steps over three minor releases:
Remove any code that enqueues the jobs.
For example, if there is a UI component or an API endpoint that a user can interact with that results in the worker instance getting enqueued, make sure those surface areas are either removed or updated in a way that the worker instance is no longer enqueued.
This ensures that instances related to the worker class are no longer being enqueued.
Ensure that neither the frontend nor the backend code relies on any of the work that used to be done by the worker.
In the relevant worker classes, replace the contents of the perform method with a no-op, while keeping any arguments intact.
For example, if you're working with the following ExampleWorker:
class ExampleWorker
  def perform(object_id)
    SomeService.run!(object_id)
  end
end
Implementing the no-op might look like this:
class ExampleWorker
  def perform(object_id); end
end
By implementing this no-op, you can avoid unnecessary cycles once any deprecated jobs that are still enqueued eventually get processed.
Add a migration (not a post-deployment migration) that uses sidekiq_remove_jobs:
class RemoveMyDeprecatedWorkersJobInstances < Gitlab::Database::Migration[2.1]
  # Always use `disable_ddl_transaction!` while using the `sidekiq_remove_jobs` method,
  # as we had multiple production incidents due to `idle-in-transaction` timeout.
  disable_ddl_transaction!

  DEPRECATED_JOB_CLASSES = %w[
    MyDeprecatedWorkerOne
    MyDeprecatedWorkerTwo
  ]

  def up
    Gitlab::SidekiqSharding::Validator.allow_unrouted_sidekiq_calls do
      # If the job has been scheduled via `sidekiq-cron`, we must also remove
      # it from the scheduled worker set using the key used to define the cron
      # schedule in config/initializers/1_settings.rb.
      job_to_remove = Sidekiq::Cron::Job.find('my_deprecated_worker')

      # The job may be removed entirely:
      job_to_remove.destroy if job_to_remove

      # The job may be disabled:
      job_to_remove.disable! if job_to_remove
    end

    # Removes scheduled instances from Sidekiq queues
    sidekiq_remove_jobs(job_klasses: DEPRECATED_JOB_CLASSES)
  end

  def down
    # This migration removes any instances of deprecated workers and cannot be undone.
  end
end
Delete the worker class file and follow the guidance in our Sidekiq queues documentation around running Rake tasks to regenerate/update related files.
For the same reasons that removing workers is dangerous, care should be taken when renaming queues.
When renaming queues, use the sidekiq_queue_migrate helper migration method
in a post-deployment migration:
class MigrateTheRenamedSidekiqQueue < Gitlab::Database::Migration[2.1]
  restrict_gitlab_migration gitlab_schema: :gitlab_main_org
  disable_ddl_transaction!

  def up
    sidekiq_queue_migrate 'old_queue_name', to: 'new_queue_name'
  end

  def down
    sidekiq_queue_migrate 'new_queue_name', to: 'old_queue_name'
  end
end
You must rename the queue in a post-deployment migration, not in a standard migration. Otherwise, it runs too early, before all the workers that schedule these jobs have stopped running. See also other examples.
We should treat this similarly to adding a new worker. That means we only start scheduling the newly named worker after the Sidekiq deployment finishes.
To ensure backward and forward compatibility between consecutive versions of the application, follow these steps over three minor releases:
Create the newly named worker, and have the old worker call the new worker's #perform method. Introduce a feature flag to control when we start scheduling the new worker. (Release M)
Any old worker jobs that are still in the queue will delegate to the new worker. When this version is deployed, it is no longer relevant which version of the job is scheduled or which Sidekiq handles it: an old Sidekiq uses the old worker's full implementation, and a new Sidekiq delegates to the new worker.
Enable the feature flag for GitLab.com, and after that prepare an MR to enable it by default. (Release M+1)
Remove the old worker class and the feature flag. (Release M+2)
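The Release M state of these steps might look like the following plain Ruby sketch. `RenamedExampleWorker`, `LegacyNamedExampleWorker`, and `worker_to_schedule` are hypothetical names, and the boolean argument stands in for the real feature flag check.

```ruby
# The newly named worker holds the actual implementation.
class RenamedExampleWorker
  def perform(object_id)
    "processed #{object_id}"
  end
end

# The old worker stays in place so that already queued jobs still run,
# but it only delegates to the new worker's #perform method.
class LegacyNamedExampleWorker
  def perform(object_id)
    RenamedExampleWorker.new.perform(object_id)
  end
end

# Picks the class to schedule. `new_name_enabled` is a stand-in for
# the feature flag that controls scheduling of the new worker.
def worker_to_schedule(new_name_enabled)
  new_name_enabled ? RenamedExampleWorker : LegacyNamedExampleWorker
end

# Either class produces the same result for the same arguments.
worker_to_schedule(false).new.perform(7)
worker_to_schedule(true).new.perform(7)
```

Because both classes run the same implementation, the result does not depend on which name was used when the job was enqueued.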