communications/postmortems/2025-04-27.md
Date: 2025-04-27
Duration: 12:00am to 9:00am
Impact: Mastra Cloud API + Dashboard
Severity: High
Prepared by: Abhi Aiyer, Yujohn Nattrass
At 12:00am PT, the Mastra Cloud API and dashboard began returning 500 errors. Dashboard views that depended on API requests rendered no UI and gave no indication of an error. The Cloud API did not trigger any alerts, and the status page was not updated.
This incident affected all users attempting to access the Mastra Cloud API and dashboard during the outage period. Without proper error notifications or status updates, users had no visibility into the system's status or expected resolution time.
The lack of alerting mechanisms also delayed the internal response, as the engineering team was not automatically notified of the service disruption. This resulted in extended downtime until the issue was manually discovered and addressed.
All times in Pacific Time (PT)
The root cause of this incident was the Clickhouse Analytics Database entering hibernation mode, which caused cascading failures throughout the Mastra Cloud API and dashboard.
A server restart on Google Cloud Run at 12:00am PT triggered the initial connection attempt to the Clickhouse database. However, because the Clickhouse instance had entered hibernation due to lack of payment (T_T), these connection attempts failed. Clickhouse had only been introduced on 4/2, so the integration was still new in the codebase. The API server was unable to handle these database connection failures gracefully and returned 500 errors to users instead of falling back to appropriate behavior.
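To make "appropriate fallback behavior" concrete, here is a minimal sketch of one way to isolate analytics queries so that a failed or hanging query yields a fallback value instead of a 500. It is not the fix that was shipped. The names `withAnalyticsFallback`, `getDashboard`, and `fetchUsage` are hypothetical, and the 2-second timeout is an arbitrary assumption.

```typescript
// Hypothetical result type: real data, or a fallback plus the reason analytics failed.
type AnalyticsResult<T> =
  | { ok: true; data: T }
  | { ok: false; data: T; reason: string };

// Runs an analytics query. On rejection, synchronous throw, or timeout it
// resolves with the fallback instead of rejecting, so callers never see an exception.
function withAnalyticsFallback<T>(
  query: () => Promise<T>,
  fallback: T,
  timeoutMs = 2000, // assumed budget, tune for the real workload
): Promise<AnalyticsResult<T>> {
  return new Promise<AnalyticsResult<T>>((resolve) => {
    const degrade = (err: unknown) => {
      clearTimeout(timer);
      const reason = err instanceof Error ? err.message : String(err);
      resolve({ ok: false, data: fallback, reason });
    };
    // Only the first resolve() call takes effect, so a late query result is ignored.
    const timer = setTimeout(
      () => degrade(new Error(`analytics query timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    try {
      query().then((data) => {
        clearTimeout(timer);
        resolve({ ok: true, data });
      }, degrade);
    } catch (err) {
      degrade(err);
    }
  });
}

// Hypothetical handler: the response is still a 200 when analytics is unavailable,
// with a flag the UI can use to show a "data unavailable" notice instead of a blank page.
async function getDashboard(fetchUsage: () => Promise<number[]>) {
  const usage = await withAnalyticsFallback(fetchUsage, []);
  return { status: 200, usage: usage.data, analyticsDegraded: !usage.ok };
}
```

Returning the failure reason alongside the fallback keeps the error available for logging while the rest of the response is served.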
Three critical system deficiencies contributed to the extended duration of this incident:
Lack of proper error handling: The application did not have adequate error handling for database connection failures, causing complete API failures rather than degraded service with non-analytics features still functioning.
Insufficient monitoring: No alerts were configured to detect the Clickhouse database hibernation state or the resulting API failures. This prevented automated detection of the issue.
Missing status page integration: This new system was not properly integrated with the status page, resulting in no automatic updates to communicate the service disruption to users.
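The monitoring and status page gaps above could both be fed from a single dependency health summary. The sketch below is illustrative only: `probe`, `summarizeHealth`, the `critical` flag, and the three status names are assumptions, not part of the existing system. The `ping` callback stands in for whatever lightweight check the real database client offers.

```typescript
type DependencyState = "up" | "down";
type OverallStatus = "operational" | "degraded" | "outage";

interface DependencyCheck {
  name: string; // e.g. "clickhouse" (illustrative)
  critical: boolean; // true if the API cannot serve anything without it
  state: DependencyState;
}

// Runs one health probe and converts any failure into a "down" state
// rather than letting the exception escape.
async function probe(
  name: string,
  critical: boolean,
  ping: () => Promise<unknown>,
): Promise<DependencyCheck> {
  try {
    await ping();
    return { name, critical, state: "up" };
  } catch {
    return { name, critical, state: "down" };
  }
}

// Collapses individual checks into one status that an alert rule
// or a status page updater could consume.
function summarizeHealth(checks: DependencyCheck[]): {
  status: OverallStatus;
  failing: string[];
} {
  const failing = checks.filter((c) => c.state === "down");
  let status: OverallStatus = "operational";
  if (failing.some((c) => c.critical)) {
    status = "outage";
  } else if (failing.length > 0) {
    status = "degraded";
  }
  return { status, failing: failing.map((c) => c.name) };
}
```

Under these assumptions, a hibernating analytics database marked non-critical would surface as `degraded` with `failing: ["clickhouse"]`, which is enough to page someone and post a status update without waiting for manual discovery.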
Additionally, the initial investigation was slowed by a misleading error about PORT 8080 connectivity, which was a symptom rather than the root cause. This highlights a need for improved logging and diagnostics to more quickly identify the true source of failures.
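One possible explanation for the PORT 8080 message, stated here as an assumption rather than a finding: Cloud Run reports a generic port-related error when a container exits or stalls before it starts listening, so a dependency failure during startup can show up as a port problem. A hedged sketch of a startup order that avoids this is to listen first, then connect to dependencies, logging each failure by name. `startServer`, `listen`, and the dependency list are hypothetical names for illustration.

```typescript
interface Dependency {
  name: string;
  connect: () => Promise<void>;
}

interface StartupReport {
  listening: boolean;
  failedDependencies: { name: string; error: string }[];
}

// Binds the port before touching any dependency, then records which
// dependency failed so the log names the real cause.
async function startServer(
  listen: () => Promise<void>,
  dependencies: Dependency[],
  log: (line: string) => void = console.error,
): Promise<StartupReport> {
  await listen(); // the platform's port check can now succeed

  const failedDependencies: StartupReport["failedDependencies"] = [];
  for (const dep of dependencies) {
    try {
      await dep.connect();
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      // One structured line per failure, naming the dependency explicitly.
      log(JSON.stringify({ severity: "ERROR", dependency: dep.name, message: error }));
      failedDependencies.push({ name: dep.name, error });
    }
  }
  return { listening: true, failedDependencies };
}
```

With this ordering, a hibernating analytics database would produce a log line naming that dependency while the server stays up, instead of a startup crash that surfaces only as a port error.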
Immediate reactivation: The Clickhouse database was manually reactivated, restoring the connection between the API server and the analytics database.
Plan upgrade: The database plan was upgraded to prevent future hibernation due to inactivity or resource constraints.
These actions directly addressed the root cause by restoring database connectivity and implementing measures to prevent future hibernation events.