The One Where K8s Jobs Got Too Big

Date: 2025-04-29

Duration: 2:00PM to 7:30PM

Impact: Mastra Cloud Build and Deployment

Severity: [Critical]

Prepared by: [Abhi Aiyer, Yujohn Nattrass, Gavin Minami]

Issue Summary

At 2:00PM PT, the Mastra Cloud Builder service began failing to start new build jobs. Kubernetes Jobs were failing to initialize with E2BIG errors, preventing users from building and deploying their Mastra applications. The rate of successful builds had been decreasing over time.

This incident affected all users attempting to build or deploy applications using Mastra Cloud during the outage period.

The engineering team explored multiple mitigation strategies, including configMaps and file-based configuration approaches. The investigation was initially challenging because the Mastra server deployments continued to function normally, suggesting the issue might not be Kubernetes-related. Further analysis revealed that Kubernetes was automatically generating environment variables for service discovery, creating an ever-growing list of job pods that eventually exceeded the environment variable size limit. This condition had developed gradually over time and only manifested as a critical issue when increased platform usage pushed the system beyond the threshold.

Timeline

All times in Pacific Time (PT)

2:00PM: Builder service began failing with E2BIG errors
2:30PM: Statuspage set for Builder service
3:30PM: Abhi and Gavin theorize what could have changed to cause the issue
4:00PM: Yujohn and Abhi investigate env var mitigations
6:00PM: Able to run a container in isolation to inspect the environment variables pulled in
6:45PM: Discovered service discovery was the issue, tons of pod names were being added as environment variables
7:30PM: Disabled service links in all k8s jobs and deployed fix.
8:00PM: Builder service restored and operational

Root Cause Analysis

The root cause of this incident was Kubernetes automatically generating environment variables for service discovery that exceeded the maximum allowed size (E2BIG error).

The Mastra Cloud Builder service uses Kubernetes Jobs to execute build tasks. As the number of jobs in the cluster grew over time, Kubernetes automatically added environment variables for each job pod to facilitate service discovery. Eventually, the total size of these auto-generated environment variables exceeded the system limit (approximately 2MB), causing new job pods to fail during initialization with E2BIG errors.

The engineering team was initially misled because:

Misleading error context: We assumed the E2BIG error was related to user-defined environment variables, when in fact it was caused by Kubernetes-generated variables.
Selective failure pattern: Only the Builder service jobs were failing, while the Mastra server deployments continued to function normally. This led us to initially rule out Kubernetes as the source of the problem.
Gradual onset: The issue had likely existed for some time but only became critical when increased usage pushed the number of jobs over the threshold.

Three critical system deficiencies contributed to the extended duration of this incident:

Lack of proper error handling: The application did not have adequate error handling for job initialization failures, causing complete build failures rather than graceful degradation.
Insufficient monitoring: When a change in build success % happen within a 15min period, an alert should have been triggered.

Resolution

Service discovery configuration: Remove service discovery as it is unneeded for the builder service.

These actions directly addressed the root cause by preventing the environment variable size from exceeding system limits while maintaining necessary service discovery functionality.

Impact Assessment

Users Affected: All users attempting to build Mastra Cloud projects during the outage period
Services Affected: Mastra Cloud Builder service, preventing all build and deployment operations
Duration: 6 hours
Business Impact:
- Degraded user experience for all Mastra Cloud customers
- Potential loss of trust from community members who reported issues in Discord

Action Items

Immediate Actions (0-7 days)

Configure monitoring alerts: Set up alerts for job failure rates

Lessons Learned

What Went Well

Thorough investigation: Despite initial misleading errors, we continued investigating until finding the true root cause.
Effective resolution: Once the root cause was identified, we implemented a solution that addressed the immediate problem and prevented future occurrences.

What Went Wrong

Misleading assumptions: Initial assumption that E2BIG errors were related to user configurations led troubleshooting in the wrong direction.
Misunderstanding of Kubernetes internals: The team was unaware that Kubernetes auto-generates environment variables for service discovery, which delayed identifying the root cause.