communications/postmortems/2025-04-29.md
Date: 2025-04-29
Duration: 2:00 PM to 7:30 PM PT (5.5 hours)
Impact: Mastra Cloud Build and Deployment
Severity: Critical
Prepared by: Abhi Aiyer, Yujohn Nattrass, Gavin Minami
At 2:00PM PT, the Mastra Cloud Builder service began failing to start new build jobs. Kubernetes Jobs were failing to initialize with E2BIG errors, preventing users from building and deploying their Mastra applications. The rate of successful builds had been decreasing over time.
This incident affected all users attempting to build or deploy applications using Mastra Cloud during the outage period.
The engineering team explored multiple mitigation strategies, including ConfigMaps and file-based configuration. The investigation was initially challenging because the Mastra server deployments continued to function normally, which suggested the issue was not Kubernetes-related. Further analysis revealed that Kubernetes was automatically generating environment variables for service discovery; as job pods accumulated, the set of injected variables grew until it exceeded the environment variable size limit. The condition had developed gradually and only manifested as a critical issue when increased platform usage pushed the system past the threshold.
All times in Pacific Time (PT)
The root cause of this incident was Kubernetes automatically generating environment variables for service discovery whose combined size exceeded the maximum allowed, producing E2BIG errors.

The Mastra Cloud Builder service uses Kubernetes Jobs to execute build tasks. As the number of jobs in the cluster grew over time, Kubernetes automatically injected service-discovery environment variables (of the form *_SERVICE_HOST and *_SERVICE_PORT) into each new pod. Eventually, the total size of these auto-generated variables exceeded the system limit (approximately 2 MB on Linux), causing new job pods to fail during initialization with E2BIG errors.
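A common mitigation for this failure mode (a sketch, not necessarily the exact change the team shipped) is to set `enableServiceLinks: false` on the Job's pod template. This stops Kubernetes from injecting the per-Service environment variables while leaving DNS-based service discovery intact:

```yaml
# Hypothetical Job manifest; the names (mastra-builder-example, the
# builder image) are illustrative, not the actual Mastra Cloud config.
apiVersion: batch/v1
kind: Job
metadata:
  name: mastra-builder-example
spec:
  template:
    spec:
      # Disable the auto-injected *_SERVICE_HOST / *_SERVICE_PORT
      # variables (supported since Kubernetes 1.13). Services remain
      # discoverable through cluster DNS.
      enableServiceLinks: false
      containers:
        - name: builder
          image: example.com/mastra/builder:latest
      restartPolicy: Never
```

Because cluster DNS still resolves Service names, disabling service links bounds the pod's environment size without losing service discovery.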
The engineering team was initially misled because:
Misleading error context: We assumed the E2BIG error was related to user-defined environment variables, when in fact it was caused by Kubernetes-generated variables.
Selective failure pattern: Only the Builder service jobs were failing, while the Mastra server deployments continued to function normally. This led us to initially rule out Kubernetes as the source of the problem.
Gradual onset: The issue had likely existed for some time but only became critical when increased usage pushed the number of jobs over the threshold.
Two critical system deficiencies contributed to the extended duration of this incident:
Lack of proper error handling: The application did not have adequate error handling for job initialization failures, causing complete build failures rather than graceful degradation.
Insufficient monitoring: A significant drop in the build success rate within a 15-minute window should have triggered an alert, but no such alert was in place.
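The missing alert described above could be sketched as a Prometheus alerting rule. The metric names (`builds_succeeded_total`, `builds_started_total`) and the 80% threshold are assumptions for illustration, presuming the builder exports build counters:

```yaml
# Hypothetical PrometheusRule; metric names and threshold are assumptions.
groups:
  - name: mastra-builder
    rules:
      - alert: BuildSuccessRateDrop
        # Fire when fewer than 80% of builds started in the last
        # 15 minutes succeeded.
        expr: |
          sum(rate(builds_succeeded_total[15m]))
            / sum(rate(builds_started_total[15m])) < 0.80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Builder success rate below 80% over the last 15 minutes"
```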
These actions directly addressed the root cause by preventing the environment variable size from exceeding system limits while maintaining necessary service discovery functionality.