Incident Post-Mortem: Builds Failing

Date: 2025-07-17
Duration: 6:29pm to 11:05pm
Impact: 53 build failures across 22 projects
Severity: Critical
Prepared by: Noah Crowley, Abhi Aiyer, Ose Dania

Issue Summary

Changes were made to improve the availability of the underlying Knative infrastructure, which resulted in persistent volume claims (PVC) being disabled. As a result, builds were failing until PVC functionality was re-enabled.

Timeline

All times in Pacific Time (PT)

6:23pm: Changes were made to Knative infrastructure to improve availability
6:29pm: First build failure occurred
10:26pm: Build failures identified by staff, response initiated
10:46pm: SME paged
10:49pm: Identified cause of failures: Persistent volume claims disabled
10:59pm: Last build failure occurred
11:02pm: Persistent volume claims re-enabled
11:05pm: Incident recovered

Root Cause Analysis

In order to reduce deployment times and improve performance, our runners use PVC claims to mount Filestore volumes. When changes intended to improve availability were made to the underlying Knative infrastructure, the new configuration inadvertently disabled PVC claims for Knative services. When build jobs complete, they automatically deploy a new runner with the updated build. Because the runner configuration includes a read/write PVC, but PVCs were disabled, the conflict caused Knative to fail when creating the new runner service. This in turn caused the build job to fail.

Resolution

Knative configuration was updated to re-enable PVCs.

Impact Assessment

Users Affected: 53 builds across 22 projects
Services Affected: Mastra Cloud builder, runner
Duration: 4h 36m

Action Items

Immediate Actions (0-7 days)

Add business logic alerts for failed builds
Log build failures as errors
Reconcile IaC with existing configuration
Implement Incident.io

Short-term Improvements (8-30 days)

Add synthetic monitor that triggers a build
Staging Environment

Lessons Learned

What Went Well

Issue was identified by staff
SME was able to identify proximate cause through error messages

What Went Wrong

No validation of business functionality after infrastructure changes
Limited ability to page engineers
Drift between IaC and production configuration

Where We Got Lucky

Issue was identified by staff
Issue was not during prime business hours

Prevention Measures

Improvements to logs, metrics, and alerting.

Implementation of incident management tooling, including on-call and pagers.