Back to Mastra

Incident Post-Mortem: Builds Failing

communications/postmortems/2025-07-17.md

2025-12-182.6 KB
Original Source

Incident Post-Mortem: Builds Failing

  • Date: 2025-07-17
  • Duration: 6:29pm to 11:05pm
  • Impact: 53 build failures across 22 projects
  • Severity: Critical
  • Prepared by: Noah Crowley, Abhi Aiyer, Ose Dania

Issue Summary

Changes were made to improve the availability of the underlying Knative infrastructure, which resulted in persistent volume claims (PVC) being disabled. As a result, builds were failing until PVC functionality was re-enabled.

Timeline

All times in Pacific Time (PT)

  • 6:23pm: Changes were made to Knative infrastructure to improve availability
  • 6:29pm: First build failure occurred
  • 10:26pm: Build failures identified by staff, response initiated
  • 10:46pm: SME paged
  • 10:49pm: Identified cause of failures: Persistent volume claims disabled
  • 10:59pm: Last build failure occurred
  • 11:02pm: Persistent volume claims re-enabled
  • 11:05pm: Incident recovered

Root Cause Analysis

In order to reduce deployment times and improve performance, our runners use PVC claims to mount Filestore volumes. When changes intended to improve availability were made to the underlying Knative infrastructure, the new configuration inadvertently disabled PVC claims for Knative services. When build jobs complete, they automatically deploy a new runner with the updated build. Because the runner configuration includes a read/write PVC, but PVCs were disabled, the conflict caused Knative to fail when creating the new runner service. This in turn caused the build job to fail.

Resolution

Knative configuration was updated to re-enable PVCs.

Impact Assessment

  • Users Affected: 53 builds across 22 projects
  • Services Affected: Mastra Cloud builder, runner
  • Duration: 4h 36m

Action Items

Immediate Actions (0-7 days)

  • Add business logic alerts for failed builds
  • Log build failures as errors
  • Reconcile IaC with existing configuration
  • Implement Incident.io

Short-term Improvements (8-30 days)

  • Add synthetic monitor that triggers a build
  • Staging Environment

Lessons Learned

What Went Well

  • Issue was identified by staff
  • SME was able to identify proximate cause through error messages

What Went Wrong

  • No validation of business functionality after infrastructure changes
  • Limited ability to page engineers
  • Drift between IaC and production configuration

Where We Got Lucky

  • Issue was identified by staff
  • Issue was not during prime business hours

Prevention Measures

Improvements to logs, metrics, and alerting.

Implementation of incident management tooling, including on-call and pagers.