Incident Post-Mortem: Knative Net Controller Issues

Date: 2025-07-16
Duration: 12:22PM PT to 3:00PM PT
Impact: Requests could not be routed to Mastra Cloud services, causing API downtime.
Severity: Critical
Prepared by: Abhi, Ose, Noah

Issue Summary

The Knative net-controller started crash looping because it could not finish resolving the ingresses within the configured liveness probe period. As a result, the API runner pods failed their health checks and also went into a crash loop.

Timeline

All times in Pacific Time (PT)

[12:22PM]: Customer reported that the Mastra Cloud dashboard was failing API requests
[1:32PM]: Identified that the net-controller was crash looping on "Priming ingress"
[2:20PM]: Modified the net-controller liveness probe
[2:40PM]: Monitored the net-controller and applied the final configuration

Root Cause Analysis

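The net-controller pod restarted during a node rebalancing activity. After the restart it could not respond to liveness checks: with roughly 1,000 ksvc ingresses, the Kubernetes control plane API calls made while "Priming ingresses" were slow, so the liveness probe failed before priming completed and the pod was restarted again.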
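
To make the failure mode concrete, the sketch below compares an estimated priming time with the time a liveness probe allows before the container is restarted. Every number here is an illustrative assumption, not a measurement from this incident: the per-ingress latency, both sets of probe settings, and the function names are hypothetical. It also assumes ingresses are primed one at a time and approximates the restart deadline as initialDelaySeconds + periodSeconds * failureThreshold.

```python
from dataclasses import dataclass


@dataclass
class LivenessProbe:
    """Subset of Kubernetes probe fields that determine the restart deadline."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def restart_deadline_seconds(self) -> int:
        # Rough upper bound: the container is restarted after
        # failure_threshold consecutive failed probes.
        return self.initial_delay_seconds + self.period_seconds * self.failure_threshold


def priming_seconds(ingress_count: int, seconds_per_ingress: float) -> float:
    """Estimate priming time, assuming ingresses are handled sequentially."""
    return ingress_count * seconds_per_ingress


def survives_priming(probe: LivenessProbe, ingress_count: int,
                     seconds_per_ingress: float) -> bool:
    """True if priming is expected to finish before the restart deadline."""
    return priming_seconds(ingress_count, seconds_per_ingress) < probe.restart_deadline_seconds()


# Hypothetical values for illustration only.
short_probe = LivenessProbe(initial_delay_seconds=10, period_seconds=10, failure_threshold=3)
extended_probe = LivenessProbe(initial_delay_seconds=120, period_seconds=30, failure_threshold=6)

print("short probe survives:", survives_priming(short_probe, 1000, 0.2))
print("extended probe survives:", survives_priming(extended_probe, 1000, 0.2))
```

Under these assumed numbers the short probe allows about 40 seconds while priming needs about 200, which is the pattern that produces a crash loop.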

Resolution

Extended the net-controller probe timings and added more resources to the net-controller.
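
As a rough sketch of what such a change can look like, the snippet below builds a strategic merge patch that extends a container's liveness probe and raises its resources. The container name and every value are hypothetical placeholders, not the configuration applied during this incident. Only the field names (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, resources.requests, resources.limits) come from the Kubernetes API.

```python
import json


def build_probe_patch(container: str, probe: dict, resources: dict) -> dict:
    """Build a strategic merge patch for one container in a Deployment."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name when the patch is merged.
                            "name": container,
                            "livenessProbe": probe,
                            "resources": resources,
                        }
                    ]
                }
            }
        }
    }


patch = build_probe_patch(
    container="controller",  # hypothetical container name
    probe={
        # Hypothetical timings; tune them to the observed priming duration.
        "initialDelaySeconds": 120,
        "periodSeconds": 30,
        "timeoutSeconds": 10,
        "failureThreshold": 6,
    },
    resources={
        # Hypothetical sizes.
        "requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"cpu": "1", "memory": "1Gi"},
    },
)

print(json.dumps(patch, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl patch deployment <name> -n <namespace> --patch-file patch.json`, where the deployment name and namespace depend on the installed Knative networking layer. Keeping the patch in version control would also serve the infrastructure-as-code action item.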

Impact Assessment

Users Affected: All
Services Affected: Mastra Runner API
Duration: About 2 hours 40 minutes (12:22PM to 3:00PM PT)
Business Impact: N/A

Action Items

Make the Knative infrastructure highly available
Add alerts for when Knative components are misbehaving
Manage the Knative cluster with infrastructure as code
Create a runbook for diagnosing "Customer Issue"
Create a runbook for Knative components
Set up synthetic testing
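
As one possible starting point for the synthetic testing item, the sketch below polls a list of endpoints and returns the ones that do not answer with a 2xx status. The URLs and function names are hypothetical, and the demonstration uses a stubbed fetch function so it runs without network access. A real check would run on a schedule from outside the cluster and feed an alerting system.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return 0


def run_synthetic_check(urls: Iterable[str],
                        fetch: Callable[[str], int] = http_status) -> list[str]:
    """Return the URLs that did not respond with a 2xx status."""
    return [url for url in urls if not 200 <= fetch(url) < 300]


# Demonstration with stubbed responses; the endpoints are hypothetical.
fake_statuses = {
    "https://api.example.invalid/health": 503,
    "https://dashboard.example.invalid/": 200,
}
failing = run_synthetic_check(list(fake_statuses), fetch=fake_statuses.get)
print("failing endpoints:", failing)
```

Passing the fetch function as a parameter keeps the check testable without a live endpoint.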

Lessons Learned

What Went Well

We were able to debug the situation quickly
The team had previous Knative experience

What Went Wrong

No alerts or metrics on Knative components
No infrastructure as code to show how the cluster was set up initially
New employees are still ramping up
No runbooks for handling this type of issue
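
To illustrate the kind of alert that was missing, the sketch below flags a component as crash looping when it has restarted several times within a short window. The window, the threshold, and the function name are hypothetical, and the restart timestamps are assumed to be collected elsewhere, for example from pod status.

```python
from datetime import datetime, timedelta


def is_crash_looping(restart_times: list[datetime], now: datetime,
                     window: timedelta = timedelta(minutes=10),
                     threshold: int = 3) -> bool:
    """True when at least `threshold` restarts fall within `window` before `now`."""
    recent = [t for t in restart_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


# Hypothetical restart history for illustration only.
now = datetime(2025, 1, 1, 12, 0, 0)
restarts = [now - timedelta(minutes=m) for m in (1, 3, 6, 45)]

print("crash looping:", is_crash_looping(restarts, now))
```

With the sample history, three restarts fall inside the ten-minute window, so the check fires.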

Where We Got Lucky

We are good at what we do
We are in Beta