docs/react-v9/contributing/rfcs/shared/build-system/04-build-shield.md
We need a designated build "firefighter" in each time zone who's responsible for addressing any urgent build-related issues and keeping things unblocked.
This is a separate rotation from regular shield, and at least with the initial set of responsibilities proposed, it's a much lower average time commitment.
Until now, build "firefighting" has de facto fallen to one or two particular people in each time zone, which has multiple disadvantages:
Having build shield as a designated, rotating responsibility will help address all of these issues. It will also help more people gain experience with build troubleshooting.
(Longer-term, we'd like to start scheduling ongoing efforts for build improvements/upgrades, but this is an incremental step in the meantime. And even if we get the build in a better state overall, random "fires" will still happen sometimes, and someone needs to be on point to investigate.)
Initially this will rotate within the build v-team, but once we have better documentation (see later in doc) we may be able to expand it to the team as a whole.
Since the focus is addressing urgent issues, we need one person per time zone.
Initial Redmond rotation:
Initial Europe rotation:
Initially we'll try a shift duration of 1 week.
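As a rough illustration of how a one-week, per-time-zone rotation could be computed, here is a minimal sketch. The roster names, the `rotations` table, and the `buildShieldOn` helper are all hypothetical placeholders and not part of this proposal; the real rosters are the ones listed above.

```typescript
// Hypothetical sketch: rosters and function names are placeholders, not the real rotation.
type Region = "Redmond" | "Europe";

const MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

// Placeholder rosters (assumed names, for illustration only).
const rotations: Record<Region, string[]> = {
  Redmond: ["alice", "bob", "carol"],
  Europe: ["dave", "erin"],
};

/** Returns who is on build shield in each region on `date`, assuming 1-week shifts. */
function buildShieldOn(date: Date, rotationStart: Date): Record<Region, string> {
  const weeksElapsed = Math.floor(
    (date.getTime() - rotationStart.getTime()) / MS_PER_WEEK
  );
  const pick = (names: string[]): string => {
    // Double modulo keeps the index non-negative for dates before rotationStart.
    const index = ((weeksElapsed % names.length) + names.length) % names.length;
    const name = names[index];
    if (name === undefined) {
      throw new Error("Rotation roster is empty");
    }
    return name;
  };
  return { Redmond: pick(rotations.Redmond), Europe: pick(rotations.Europe) };
}
```

Each region cycles through its own roster independently, so rosters of different lengths are fine: there is still exactly one person on point per time zone in any given week.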
The time allocation for the person on shift would be up to 25% of their week (this can change later if it's not enough).
Some weeks you'll get lucky and have almost nothing to do, but other weeks something will "blow up" in a way that requires much more time to address. If the time goes beyond ~25% (or whatever number we decide above) it's worth calling out to management and/or pulling in other people as needed.
Starting out, the responsibility of build shield is limited to ensuring that PR/CI/release builds and local development stay unblocked (and documenting any failures that occur).
This includes monitoring for:
Prioritization is roughly as follows. As with normal shield, it's fine to pull in others for consultation/help as needed, especially if it's a particularly complex or urgent issue, or if a more in-depth fix turns out to be necessary.
| Issue type | Priority |
|---|---|
| Release build failed | 🔥 Investigate ASAP (fix issue, or retry if intermittent) |
| CI build fails > 50% of the time | 🔥 Fix ASAP (or notify team if external issue) |
| PR builds fail > 50% of the time on non-user errors (CI okay) | 🔥 ^ |
| Local builds broken for most/all of team (PR/CI okay) | 🔥 Fix or at least find workaround ASAP |
| Published package broken | 🔥 Fix ASAP (usually due to dep issue or missing file) |
| CI build intermittently broken | ⚠️ Try to investigate/fix (greater frequency => higher priority), or file an issue if it will take significant time |
| PR builds intermittently fail on non-user errors (CI okay) | ⚠️ ^ |
| Local builds broken for one person/scenario | ⚠️ ^ but at least get them unblocked |
| v7/8 website release failed (uncommon) | ⚠️ Re-run failed stage, or see internal wiki |
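
The table above can be read as a small decision function. The sketch below is only an illustration: the `BuildIssue` shape, the `triage` name, and the `"urgent"` / `"investigate"` labels (standing in for 🔥 and ⚠️) are assumptions made for this example, not an existing tool or API.

```typescript
// Hypothetical sketch of the prioritization table; all names here are illustrative.
type BuildKind =
  | "release"
  | "ci"
  | "pr" // PR failures caused by non-user errors only
  | "local"
  | "publishedPackage"
  | "legacyWebsiteRelease"; // v7/8 website release

// "urgent" corresponds to 🔥 in the table, "investigate" to ⚠️.
type Priority = "urgent" | "investigate";

interface BuildIssue {
  kind: BuildKind;
  /** Fraction of recent runs failing (0 to 1); relevant for CI and PR builds. */
  failureRate?: number;
  /** For local builds: true if most or all of the team is affected. */
  teamWide?: boolean;
}

function triage(issue: BuildIssue): Priority {
  switch (issue.kind) {
    case "release":
    case "publishedPackage":
      return "urgent";
    case "ci":
    case "pr":
      // Above 50% failures is a fire; anything less counts as intermittent.
      return (issue.failureRate ?? 0) > 0.5 ? "urgent" : "investigate";
    case "local":
      return issue.teamWide ? "urgent" : "investigate";
    case "legacyWebsiteRelease":
      return "investigate";
  }
}
```

For example, `triage({ kind: "ci", failureRate: 0.6 })` yields `"urgent"`, while a local build broken for one person yields `"investigate"` (though per the table, the person should still be unblocked).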
Another way to think about it is that whenever there's an automated failure notification, it's build shield's responsibility to at least click the link and look at the error message ASAP (and add a note to the failure post about what the error was).
Note that PR builds (and local builds) are a little different: there are no automated notifications for them and no way to meaningfully track metrics (PR builds are expected to fail a lot). So in those cases you're relying on team members noticing and posting about issues.
The second responsibility is documenting failures and troubleshooting procedures, with the goals of:
Where/how to document:
At least initially, build shield is NOT responsible for general build improvements/maintenance, such as:
(If time allows and you'd like to work on any of these things, that's great, but it's not the primary expectation.)
We may in the future make build shield a larger responsibility (with larger time allocation) to address these types of things in a more systematic manner. But to make this a feasible, incremental improvement/experiment, we're keeping the scope small initially.
The regular shield person has more than enough to do already, so it probably doesn't make sense to add this to their responsibilities.
As mentioned earlier, we'd like to start scheduling ongoing build improvements (and when that happens, some of the build shield responsibility could move there), but setting up a separate rotation in the meantime is an incremental improvement.