doc/development/testing_guide/unhealthy_tests.md
This page provides technical reference for understanding and debugging flaky tests in GitLab. For process information about flaky test management, monitoring, and best practices, see the Flaky Tests handbook page.
A flaky test is a test that sometimes fails, but passes eventually if you retry it enough times.
To reproduce a flaky test locally:

1. Find the seed from the CI job log, or run `while :; do bin/rspec <spec> || break; done` in a loop to find a seed.
1. Run `bin/rspec --seed <previously found> --require ./config/initializers/macos.rb --bisect <spec>`. `let_it_be` is a common source of problems.
1. Once fixed, run `scripts/rspec_check_order_dependence` to ensure the spec can be run in random order.
1. Run `while :; do bin/rspec <spec> || break; done` in a loop again (and grab lunch) to verify it's no longer flaky.

When a flaky test is blocking development on `master`, it should be quarantined to prevent impacting other developers.
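The reason recording the seed works is that RSpec derives its random spec order from that seed, so replaying the seed replays the order. The idea can be sketched in plain Ruby (this is an illustration of seeded shuffling, not RSpec's internals; the file names are made up):

```ruby
# Given the same seed, a seeded RNG always produces the same shuffle,
# which is why `--seed <value>` makes a randomized spec order reproducible.
def ordered_specs(spec_files, seed)
  spec_files.shuffle(random: Random.new(seed))
end

specs = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb]

run1 = ordered_specs(specs, 42)
run2 = ordered_specs(specs, 42)

puts run1 == run2 # => true: same seed, same order
```

Because the order is a pure function of the seed, a failure that only occurs under one ordering can be replayed as many times as needed while bisecting.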
The Test Quarantine Process handbook page
provides comprehensive guidance on the quarantine process, including:
For immediate quarantine needs, use the fast quarantine process for rapid merging. For implementation details on how to quarantine tests in your codebase, refer to the handbook page.
Failing tests are automatically retried once in a separate RSpec process.
For more information, see Automatic retry of failing tests in a separate process.
Description: Data state has leaked from a previous test. The actual cause is probably not the flaky test here.
Difficulty to reproduce: Moderate. Usually, running the same spec files until the one that's failing reproduces the problem.
Resolution: Fix the previous tests and/or places where the test data or environment is modified, so that it's reset to a pristine state after each test.
Examples:
A record created with `let_it_be` is shared between test examples, while some test modifies the model,
either deliberately or unintentionally, causing out-of-sync data in test examples. This can result in a `PG::QueryCanceled: ERROR` in the subsequent test examples or retries.
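This failure mode can be illustrated outside RSpec with a record that is created once and shared across "examples", as with `let_it_be` (names and structure are hypothetical):

```ruby
# A record created once and shared across examples, as with `let_it_be`.
SharedRecord = Struct.new(:name)

record = SharedRecord.new("original")

# Example 1 mutates the shared record and passes.
example_one = -> { record.name = "modified"; record.name == "modified" }

# Example 2 assumes the pristine state. It passes when run on its own,
# but fails whenever example 1 runs first: state has leaked.
example_two = -> { record.name == "original" }

puts example_one.call # => true
puts example_two.call # => false
```

Whether the second example passes depends entirely on whether the first one ran before it, which is exactly why these failures appear and disappear with test ordering.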
For more information about state leakages and resolution options, see GitLab testing best practices.

A test using `let_it_be` depended on a stub defined in a `before` block. `let_it_be` executes during `before(:all)`, so the stub was not yet set. This exposed the tests to the actual method call, which happened to use a method cache.

Description: The test assumes the dataset is in a particular (usually limited) state or order, which might not be true depending on when the test runs during the test suite.
Difficulty to reproduce: Moderate, as the amount of data needed to reproduce the issue might be difficult to achieve locally. Ordering issues are easier to reproduce by running the tests several times.
Resolution:
Examples:
- The test fails on `master` if the order of tests changes.
- The test assumes a record has a specific ID (for example, `42`). If the test is run early in the test
  suite, it might pass as not enough records were created before it, but as soon as it runs
  later in the suite, there could be a record that actually has the ID `42`, and the test would
  start to fail.
- The test doesn't use `ORDER BY`, the database is not given a deterministic ordering, or a data race can happen
  in the tests.

Description: The SQL query limit has been reached, triggering `Gitlab::QueryLimiting::Transaction::ThresholdExceededError`.
Difficulty to reproduce: Moderate. This failure may depend on the state of the query cache, which can be impacted by the order of specs.
Resolution: See query count limits docs.
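The general shape of such a limiter is a per-transaction counter that raises once the budget is crossed. A minimal plain-Ruby sketch of the idea (the threshold, class shape, and method names here are illustrative, not GitLab's actual implementation):

```ruby
# Minimal sketch of a query-count limiter: each executed query increments
# a counter, and crossing the threshold raises, similar in spirit to
# Gitlab::QueryLimiting::Transaction::ThresholdExceededError.
class ThresholdExceededError < StandardError; end

class QueryCounter
  attr_reader :count

  def initialize(threshold)
    @threshold = threshold
    @count = 0
  end

  def record_query!(sql)
    @count += 1
    raise ThresholdExceededError, "#{@count} queries executed (limit: #{@threshold})" if @count > @threshold
  end
end

counter = QueryCounter.new(3)
3.times { counter.record_query!("SELECT 1") } # within budget, no error

message = nil
begin
  counter.record_query!("SELECT 1") # the fourth query exceeds the limit
rescue ThresholdExceededError => e
  message = e.message
end

puts message # => "4 queries executed (limit: 3)"
```

Because the counter depends on how many queries actually execute, anything that changes cache hits (such as spec order) can move a test across the threshold, which is what makes this failure flaky.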
</details>

<details>
<summary><strong>Random input</strong> - <code>flaky-test::random input</code></summary>

Description: The test uses random values that sometimes match the expectations, and sometimes don't.
Difficulty to reproduce: Easy, as the test can be modified locally to use the "random value" used at the time the test failed.
Resolution: Once the problem is reproduced, it should be easy to debug and fix either the test or the app.
Examples:
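A contrived sketch of the failure mode and the fix: seeding the RNG replays the exact value from the failing run, turning an occasional failure into a deterministic one (all names and the seed are hypothetical):

```ruby
# A value under test that depends on a random input.
def generate_discount(rng)
  rng.rand(100)
end

# Unseeded: a different value every run, so an expectation like
# `expect(value).to be < 90` fails only occasionally.
flaky_value = generate_discount(Random.new)

# Seeded with the seed recorded from the failing run: the same value
# every time, so the failure can be reproduced and debugged at will.
reproduced = generate_discount(Random.new(1234))
again      = generate_discount(Random.new(1234))

puts reproduced == again # => true
```

Once the failing value is pinned, it is usually obvious whether the expectation or the application code is wrong.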
Description: The DOM selector used in the test is unreliable.
Difficulty to reproduce: Moderate to difficult, depending on whether the DOM selector is duplicated, appears after a delay, and so on. Adding a delay in the API or controller could help reproduce the issue.
Resolution: It depends on the problem. It could be waiting for requests to finish, scrolling down the page, and so on.
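"Waiting for requests to finish" boils down to polling a condition with a deadline instead of asserting against a page that is still loading. Capybara does this for you; a minimal plain-Ruby sketch of the mechanism (the helper name and timings are hypothetical):

```ruby
# Poll a condition until it becomes true or a deadline passes, instead of
# asserting immediately against a page that may still be rendering.
def wait_until(timeout: 1, interval: 0.01)
  deadline = Time.now + timeout
  loop do
    return true if yield
    return false if Time.now > deadline
    sleep interval
  end
end

# Simulate a DOM element that only appears after a short delay.
rendered_at = Time.now + 0.05
element_present = -> { Time.now >= rendered_at }

# An immediate assertion here would be flaky; polling is not.
result = wait_until(timeout: 1) { element_present.call }
puts result # => true
```

The same structure underlies most "wait for X" helpers: the assertion only fails once the deadline passes, not on the first unlucky poll.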
Examples:
- An `element not found` error.

Description: The test assumes a specific date or time.
Difficulty to reproduce: Easy to moderate, depending on whether the test consistently fails after a certain date, or only fails at a given time or date.
Resolution: Freezing the time is usually a good solution.
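In Rails specs this is done with ActiveSupport helpers such as `freeze_time` and `travel_to`. The underlying idea is making the clock an input instead of an ambient global; a dependency-injection sketch in plain Ruby (method and variable names are hypothetical):

```ruby
require "time"

# Without a frozen clock, a spec computing "due tomorrow" can fail when
# run close to midnight. Injecting the clock makes it deterministic,
# which is what helpers like ActiveSupport's `freeze_time` do for you.
def due_date(clock: -> { Time.now })
  clock.call + 24 * 60 * 60
end

# Freeze "now" at one second before midnight, the worst case.
frozen_now = Time.parse("2024-01-31 23:59:59 UTC")
result = due_date(clock: -> { frozen_now })

puts result.utc.strftime("%Y-%m-%d") # => "2024-02-01"
```

With the clock frozen, the expectation is stable no matter what wall-clock time the suite runs at.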
Examples:
Description: The test fails from time to time due to infrastructure issues.
Difficulty to reproduce: Hard. It's really hard to reproduce CI infrastructure issues. It might be possible by using containers locally.
Resolution: Starting a conversation with the Infrastructure department in a dedicated issue is usually a good idea.
Examples:
Description: A flaky test issue arising from timing-related factors, such as delays, eventual consistency, asynchronous operations, or race conditions. These issues may stem from shortcomings in the test logic, the system under test, or their interaction. While tests can sometimes address these issues through improved synchronization, they may also reveal underlying system bugs that require resolution.
Difficulty to reproduce: Moderate. It can be reproduced, for example, in feature tests by attempting to reference an element on a page that is not yet rendered, or in unit tests by failing to wait for an asynchronous operation to complete.
Resolution: In the end-to-end test suite, use an `eventually` matcher.
Examples:
It could help to split the large RSpec files into multiple files to narrow down the context and identify the problematic tests.
Reproducing a job failure in CI always helps with troubleshooting why and how a test fails. This requires running the same test files in the same spec order. Since we use Knapsack to distribute tests across parallelized jobs, and files can be distributed differently between two pipelines, we can hardcode this job distribution through the following steps:
1. Reset your `gitlab-org/gitlab` branch to the same commit to ensure we are running with the same copy of the project.
1. In the CI job log, find the `Running command: bundle exec rspec` line; the last argument of this command should contain a list of filenames. Copy this list.
1. Edit `tooling/lib/tooling/parallel_rspec_runner.rb`, where the test file distribution happens. Have a look at this merge request as an example: store the file list you copied from step 2 into a `TEST_FILES` constant and have RSpec run this list by updating the `rspec_command` method as done in the example MR.
1. Skip `spec/tooling/lib/tooling/parallel_rspec_runner_spec.rb` so it doesn't cause your pipeline to fail early.
1. Hardcode the spec ordering in the `spec/support/rspec_order.rb` file by hard coding `Kernel.srand` with the value shown in the originally failing job, as done in merge request 128428. You can find the `srand` value in the job log by searching for `Randomized with seed`, which is followed by this value.

To identify ordering issues in a single file, read about how to reproduce a flaky test locally.
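The patch to the runner amounts to ignoring the dynamic Knapsack distribution and running a fixed list. A heavily simplified sketch of that shape (the real method lives in `tooling/lib/tooling/parallel_rspec_runner.rb` and differs in detail; the spec file names below are made up):

```ruby
# Hypothetical, simplified version of hardcoding a spec list: instead of
# asking Knapsack for this node's file distribution, run a fixed list
# copied from the failing job's `bundle exec rspec` command line.
TEST_FILES = %w[
  spec/models/user_spec.rb
  spec/models/project_spec.rb
].freeze

def rspec_command(files)
  %w[bundle exec rspec] + files
end

puts rspec_command(TEST_FILES).join(" ")
# => "bundle exec rspec spec/models/user_spec.rb spec/models/project_spec.rb"
```

Combined with a hardcoded `Kernel.srand`, this pins both the file set and the in-file ordering, which is everything the failing job's randomization contributed.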
Some flaky tests can fail depending on the order they run with other tests. For example:
To identify the ordering issues across different files, you can use scripts/rspec_bisect_flaky,
which gives us the minimal test combination to reproduce the failure:
First obtain the list of specs that ran before the flaky test. You can search
for the list under Knapsack node specs: in the CI job output log.
Save the list of specs as a file, and run:
cat knapsack_specs.txt | xargs scripts/rspec_bisect_flaky
If there is an order-dependency issue, the script above will print the minimal reproduction.
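The underlying idea of such a script can be sketched as a greedy reduction over the preceding specs: drop one spec at a time and keep the drop whenever the failure still reproduces (this is an illustration of the concept, not the actual algorithm of `scripts/rspec_bisect_flaky`; the spec names and failure predicate are made up):

```ruby
# Greedy reduction: try removing each preceding spec, and keep the removal
# whenever the failure still reproduces, ending with a minimal combination.
def minimize(all_specs, &still_fails)
  minimal = all_specs.dup
  all_specs.each do |candidate|
    reduced = minimal - [candidate]
    minimal = reduced if still_fails.call(reduced)
  end
  minimal
end

# Pretend the flaky failure needs "b_spec.rb" to run beforehand.
still_fails = ->(specs) { specs.include?("b_spec.rb") }

puts minimize(%w[a_spec.rb b_spec.rb c_spec.rb], &still_fails).inspect
# => ["b_spec.rb"]
```

In the real script, evaluating `still_fails` means actually running RSpec over the candidate list, so shrinking the list early saves a lot of wall-clock time.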
We collect information about test duration in a ClickHouse database. The data is visualized in a Grafana dashboard.
In this issue, we defined thresholds for test duration that can act as a guide.
For tests that are above the thresholds, we automatically report slowness occurrences in Test issues so that groups can improve them.
For tests that are slow for a legitimate reason, add `allowed_to_be_slow: true` to skip issue creation.
| Date | Feature tests | Controllers and Requests tests | Unit | Other | Method |
|---|---|---|---|---|---|
| 2023-02-15 | 67.42 seconds | 44.66 seconds | - | 76.86 seconds | Top slow test eliminating the maximum |
| 2023-06-15 | 50.13 seconds | 19.20 seconds | 27.12 seconds | 45.40 seconds | Avg for top 100 slow tests |
The following patterns are the most common causes of test slowness identified during systematic improvement efforts. Each links to detailed guidance in the testing best practices guide.
Waiting full timeout for expected-absent elements — Capybara waits the full
default timeout before concluding an element is absent. Use have_no_testid
instead of not_to have_testid, and wait: 0 for generic matchers inside
already-loaded containers. See Avoid waiting for elements you expect to be absent.
Using all() instead of find() — all() does not raise on missing
elements and does not benefit from Capybara's smart waiting. Block iteration
over all() results is particularly slow. See
Avoid all() with .first or block iteration.
Slow shared examples with wide inclusion — A shared example that is slow multiplies its cost across every file that includes it. See Performance impact of slow shared examples.
Triggering real external operations — Specs that shell out to compile binaries or run Git commands inherit that wall-clock cost even when the logic is already unit-tested. See Mock expensive external operations.
Factory cascades — Unnecessarily deep factory associations silently multiply database writes. See Optimize factory usage.
Unnecessary :js tag — Running specs with a full JavaScript browser when
an HTML response would suffice. See
Don't request capabilities you don't need.
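The cost of the first pattern above comes from how polling interacts with negated matchers: proving absence by retrying a positive finder burns the whole timeout, while a matcher that asserts absence directly can succeed on the first poll. A plain-Ruby sketch of the two strategies (Capybara's internals differ; the timings are illustrative):

```ruby
# Poll a condition until true or until the deadline passes.
def poll(timeout: 0.2, interval: 0.01)
  deadline = Time.now + timeout
  loop do
    return true if yield
    return false if Time.now > deadline
    sleep interval
  end
end

element_present = false # the element never appears

# `not_to have_testid`-style: retry the positive check until the timeout
# expires, then negate the (false) result. Pays the full wait.
t0 = Time.now
negated = !poll(timeout: 0.2) { element_present }
slow = Time.now - t0

# `have_no_testid`-style: poll the absence condition, which is already
# true, so the check succeeds on the first iteration.
t0 = Time.now
direct = poll(timeout: 0.2) { !element_present }
fast = Time.now - t0

puts negated && direct # => true (both assertions pass)
puts fast < slow       # => true (direct absence check is far quicker)
```

Multiplied across hundreds of feature specs, the difference between "first poll" and "full timeout" is substantial, which is why the absence-specific matchers are preferred.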