Back to Mastra

Incident Post-Mortem: KNative Net Controller issues

communications/postmortems/2025-07-16.md

2025-12-181.9 KB
Original Source

Incident Post-Mortem: KNative Net Controller issues

Date: 2025-07-16 Duration: 12:22PM PST to 3:00PM PST Impact: Mastra Cloud services could not be routed to causing API downtime.
Severity: Critical Prepared by: Abhi, Ose, Noah

Issue Summary

The KNative net-controller started crash looping because it could not resolve the ingresses within the configured liveness probe period. This caused api runner pods to fail healthchecks and also crashloop.

Timeline

All times in Pacific Time (PT)

  • [12:22PM]: Report from customer that Mastra Cloud dashboard failing API requests
  • [1:32PM]: Identified net-controller is crash looping on "Priming ingress"
  • [2:20PM]: Modified liveness probe
  • [2:40PM]: Monitored and applied final configuration to net-controller

Root Cause Analysis

Net Controller pod restarted during a node rebalancing activity. Could not respond to liveness checks given 1k ksvc ingresses causing slow K8 Control Plane API calls when "Priming ingresses"

Resolution

Extended probes and added more resources

Impact Assessment

  • Users Affected: All
  • Services Affected: Mastra Runner API
  • Duration: 3 hours
  • Business Impact: N/A

Action Items

  • Make knative infrastructure highly available
  • Add alerts when knative components are misbehaving
  • Infrastructure as code for knative cluster
  • Create runbook for diagnosing "Customer Issue"
  • Create runbook for knative components
  • Synthetic testing

Lessons Learned

What Went Well

  • Able to debug situation quickly
  • Previous KNative experience

What Went Wrong

  • No alerts or metrics on Knative components
  • No infrastructure as code to know how the cluster was setup initially
  • New employees are still ramping up
  • No runbooks for operating these type of issues

Where We Got Lucky

  • We are good at what we do
  • We are in Beta