communications/postmortems/2025-07-16.md
Date: 2025-07-16
Duration: 12:22PM PST to 3:00PM PST
Impact: Mastra Cloud services could not be routed to causing API downtime.
Severity: Critical
Prepared by: Abhi, Ose, Noah
The KNative net-controller started crash looping because it could not resolve the ingresses within the configured liveness probe period. This caused api runner pods to fail healthchecks and also crashloop.
All times in Pacific Time (PT)
Net Controller pod restarted during a node rebalancing activity. Could not respond to liveness checks given 1k ksvc ingresses causing slow K8 Control Plane API calls when "Priming ingresses"
Extended probes and added more resources