buildscripts/monitor_build_status/README.md
TL;DR: During times of high BF volume, code approvals and merges in 10gen/mongo master will be restricted to changes that help reduce BFs, bugs, and performance regressions, or that pay down technical debt.
The master branch should remain stable to develop the Server efficiently, and to be within 30 days of releasing at all times. If it becomes too unstable, or "too red," we want to aggressively focus on getting it back into the green. As a side benefit to releasability, a "greener" build should make patch build failures more meaningful. This will also reduce release time stress by having the release time period look and feel more like normal business.
Each team carries a quota (see below for details). When a team exceeds its quota, it enters a "code lockdown".
During a "code lockdown," Code Owners are expected to approve only work that closes BFs or helps us reduce or avoid the next Blocking state, i.e. work aimed at fixing a BF, a class of BFs, bugs, performance regressions, etc.
If your PR does not meet these criteria, it may be pending for some time until the system becomes unblocked. There are, of course, reasonable exceptions, described below.
All feature work stops during a "code lockdown." In exceptional circumstances VPs can approve exceptions.
We understand that in many cases addressing the larger BF problem requires refactoring, modularity improvements, changes to our tests, and paying down other kinds of technical debt. During a "code lockdown" this work is expressly permitted and mergeable, with the guidance that teams index heavily on risk when deciding what to work on. If a piece of work feels like it makes the BF problem worse before it gets better, talk to your director about how to proceed.
Allowable Examples (not exclusive):
- If a team is in a lockdown but the rest of the org is not, its focus should likely skew towards work that expedites its lockdown exit.
- If the org is in a lockdown but a team doesn’t have BFs to work on, it should balance helping other teams with the work it has identified as addressing the underlying BF problem.
- The higher the risk of the work, the more involvement the Staff+ engineers and the Director/VP should have in the decision about what is ok to merge and what isn’t.
Code Owners should join the #10gen-mongo-code-lockdown Slack channel to receive daily updates on the status of the build. The channel posts daily metrics, along with instructions whenever there is a state change.
If we change to a blocking state, code owners should use their discretion to only approve changes that are allowed (see above). If we exit the blocking state, code owners should approve PRs as usual.
Currently monitored thresholds:
| Quota | Team (older than 48h) | VP (older than 7d) | Global (older than 7d) |
| ----- | --------------------- | ------------------ | ---------------------- |
| Hot   | 6                     | 16                 | 60                     |
| Cold  | 10                    | 32                 | 100                    |
Source-of-truth implementation: etc/code_lockdown.yml.
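As a rough illustration of how the thresholds above might be applied, the following sketch compares open BF counts against the table. It is not the actual implementation: the `THRESHOLDS` dictionary is a snapshot copied from this README (the authoritative values live in etc/code_lockdown.yml), the `BfCounts` container and function names are hypothetical, and treating "exceeds" as strictly greater than the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Snapshot of the table in this README. The authoritative values live in
# etc/code_lockdown.yml and may differ from this copy.
THRESHOLDS = {
    "team": {"hot": 6, "cold": 10},      # BFs older than 48h
    "vp": {"hot": 16, "cold": 32},       # BFs older than 7d
    "global": {"hot": 60, "cold": 100},  # BFs older than 7d
}


@dataclass(frozen=True)
class BfCounts:
    """Open BF counts for one scope (hypothetical container for this sketch)."""

    hot: int
    cold: int


def exceeded_quotas(scope: str, counts: BfCounts) -> List[str]:
    """Return the quota names ("hot", "cold") whose threshold is exceeded.

    Assumption: a quota is exceeded only when the count is strictly
    greater than the threshold.
    """
    limits = THRESHOLDS[scope]
    return [name for name in ("hot", "cold") if getattr(counts, name) > limits[name]]


def is_locked_down(scope: str, counts: BfCounts) -> bool:
    """A scope is treated as locked down if any of its quotas is exceeded."""
    return bool(exceeded_quotas(scope, counts))
```

For example, under these assumptions `is_locked_down("team", BfCounts(hot=7, cold=3))` is true, while a team sitting exactly at 6 hot and 10 cold BFs is not locked down.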
A dashboard is available at go/issue-quota.
This shows relevant JIRA queries for a more live and interactive view of the state.
Some teams may fix a BF in master, but are "waiting for fix" on older branches, which keeps the BF counted against the thresholds. Guidance here is currently evolving.
If the build failure does not occur frequently, it can be marked as P5-Trivial, and it won’t count towards your team’s build failures for the purposes of blocking merges.
As we iterate on our processes for this, the exclude-from-master-quota label can be used to exclude BFs that should not be included in these quotas. The expectation is that this is an interim solution as we improve our processes, especially around BFs that remain open pending backports.
Specifically:
- ... exclude-from-master-quota label to the ticket.
- ... P5 - Trivial and apply the keep-trivial label.
- ... Priority to P5 - Trivial and apply the keep-trivial-X.Y label appropriately.

For any new proposals, changes to thresholds, or concerns regarding their application, please escalate to your Director/VP. We want advocacy from all levels to make this a successful change to our engineering culture.
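The exclusion rules described in this section (the P5 - Trivial priority and the exclude-from-master-quota label) can be pictured as a simple filter over BF tickets. This is only a sketch of the idea: the `BfTicket` type, its fields, and the ticket keys are hypothetical, and the real script may apply these rules differently, for example directly in its Jira queries. The keep-trivial labels are not modeled here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

EXCLUDE_LABEL = "exclude-from-master-quota"
TRIVIAL_PRIORITY = "P5 - Trivial"


@dataclass(frozen=True)
class BfTicket:
    """Minimal stand-in for a Jira BF ticket (hypothetical, for illustration)."""

    key: str
    priority: str
    labels: FrozenSet[str] = frozenset()


def counts_toward_quota(ticket: BfTicket) -> bool:
    """Return False for tickets this README describes as excluded from quotas."""
    if ticket.priority == TRIVIAL_PRIORITY:
        return False
    if EXCLUDE_LABEL in ticket.labels:
        return False
    return True


def quota_tickets(tickets: Iterable[BfTicket]) -> List[BfTicket]:
    """Keep only the tickets that would count against a quota."""
    return [t for t in tickets if counts_toward_quota(t)]
```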
Run the following to read about supported options:
python buildscripts/monitor_build_status/cli.py --help
For Jira API authentication, use the JIRA_AUTH_PAT env variable. More about Jira Personal Access Tokens (PATs) can be found here.
Use your PAT to run the following and output its results:
JIRA_AUTH_PAT=<auth-token> python buildscripts/monitor_build_status/cli.py
The above will not send notifications to the Slack channel.
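If you want to run the CLI from another Python script, a small wrapper can fail fast when the token is missing. The sketch below only builds the command and environment; the helper names are hypothetical, and it assumes the script is launched from the repository root with the same interpreter.

```python
import sys
from typing import Dict, List, Mapping, Sequence, Tuple

CLI_PATH = "buildscripts/monitor_build_status/cli.py"


def require_pat(environ: Mapping[str, str]) -> str:
    """Return the Jira PAT from the environment or raise a clear error."""
    pat = environ.get("JIRA_AUTH_PAT", "").strip()
    if not pat:
        raise RuntimeError("JIRA_AUTH_PAT is not set; create a Jira PAT first")
    return pat


def build_invocation(
    environ: Mapping[str, str], extra_args: Sequence[str] = ()
) -> Tuple[List[str], Dict[str, str]]:
    """Build the command line and environment for running the CLI."""
    require_pat(environ)  # fail before spawning a process
    cmd = [sys.executable, CLI_PATH, *extra_args]
    return cmd, dict(environ)


# Example usage, run from the repository root:
#   import os, subprocess
#   cmd, env = build_invocation(os.environ)
#   subprocess.run(cmd, env=env, check=True)
```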
Slack notifications use a webhook from the Devprod Correctness Slack app (rather than user credentials) for security. The webhook URL is read from the mongo-code-lockdown-webhook Evergreen expansion, which points to the #10gen-mongo-code-lockdown Slack channel.
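For readers unfamiliar with Slack incoming webhooks, the sketch below shows the general mechanism: a JSON body with a `text` field is POSTed to the webhook URL. It is not the code used by this script. The `SLACK_WEBHOOK_URL` environment variable and the function names are assumptions made for illustration; in Evergreen the URL comes from the mongo-code-lockdown-webhook expansion.

```python
import json
import os
import urllib.request
from typing import Optional


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a plain-text Slack message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(text: str, webhook_url: Optional[str] = None) -> bool:
    """Send the message if a webhook URL is available, otherwise just print it.

    SLACK_WEBHOOK_URL is a hypothetical variable name used only in this sketch.
    """
    url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not url:
        print(text)  # local run: no notification is sent
        return False
    with urllib.request.urlopen(build_slack_request(url, text), timeout=10) as resp:
        return resp.status == 200
```

Keeping the request construction separate from the network call makes the message format easy to check without posting to a real channel.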