The One Where Clickhouse Took a Nap

Date: 2025-04-27

Duration: 12:00am to 9:00am

Impact: Mastra Cloud API + Dashboard

Severity: [High]

Prepared by: [Abhi Aiyer, Yujohn Nattrass]

Issue Summary

At 12:00am PT, the Mastra Cloud API and dashboard began throwing 500 errors. API requests within Mastra Cloud's dashboards were rendering no UI with no indication of an error. The Cloud API did not trigger any alerts and the statuspage did not get updated.

This incident affected all users attempting to access the Mastra Cloud API and dashboard during the outage period. Without proper error notifications or status updates, users had no visibility into the system's status or expected resolution time.

The lack of alerting mechanisms also delayed the internal response, as the engineering team was not automatically notified of the service disruption. This resulted in extended downtime until the issue was manually discovered and addressed.

Timeline

All times in Pacific Time (PT)

12:00am: Server restart occurred on Google Cloud Run
2:00am: Discord community began reporting issues with service uptime
7:30am: Abhi noticed Discord reports, checked alerts/statuspage and found no alerts triggered, began investigating user-reported issues
8:00am: API identified as down, initially diagnosed as failing to connect to PORT 8080 (later determined to be a red herring)
8:08am: During rebuild, logs revealed the actual issue was failing to connect to the Analytics DB (Clickhouse)
8:14am: Root cause identified - Clickhouse DB was found to be in hibernation state
8:21am: Clickhouse DB reactivated and plan upgraded
8:26am: Mastra Cloud services fully restored and operational

Root Cause Analysis

The root cause of this incident was the Clickhouse Analytics Database entering hibernation mode, which caused cascading failures throughout the Mastra Cloud API and dashboard.

A server restart on Google Cloud Run at 12:00am PT triggered the initial connection attempt to the Clickhouse database. However, since the Clickhouse instance had entered hibernation due to lack of payment (T_T), these connection attempts failed. Using Clickhouse was just introduced 4/2, so fresh in the codebase. The API server was unable to gracefully handle these database connection failures, resulting in 500 errors being returned to users instead of appropriate fallback behavior.

Three critical system deficiencies contributed to the extended duration of this incident:

Lack of proper error handling: The application did not have adequate error handling for database connection failures, causing complete API failures rather than degraded service with non-analytics features still functioning.
Insufficient monitoring: No alerts were configured to detect the Clickhouse database hibernation state or the resulting API failures. This prevented automated detection of the issue.
Missing status page integration: This new system was not properly integrated with the status page, resulting in no automatic updates to communicate the service disruption to users.

Additionally, the initial investigation was slowed by a misleading error about PORT 8080 connectivity, which was a symptom rather than the root cause. This highlights a need for improved logging and diagnostics to more quickly identify the true source of failures.

Resolution

Immediate reactivation: The Clickhouse database was manually reactivated, restoring the connection between the API server and the analytics database.
Plan upgrade: The database plan was upgraded to prevent future hibernation due to inactivity or resource constraints.

These actions directly addressed the root cause by restoring database connectivity and implementing measures to prevent future hibernation events.

Impact Assessment

Users Affected: All users attempting to access Mastra Cloud services during the outage period
Services Affected: Mastra Cloud API and Dashboard, including all dependent analytics features
Duration: 8 hours and 26 minutes (12:00am to 8:26am PT)
Business Impact:
- Degraded user experience for all Mastra Cloud customers
- Potential loss of trust from community members who reported issues in Discord

Action Items

Immediate Actions (0-7 days)

Implement database connection resilience: Add proper error handling for clickhouse connection failures.
Configure monitoring alerts: Set up alerts for Clickhouse, Mastra API
Update status page integration: Ensure components are properly registered and visible on the status page.

Lessons Learned

What Went Well

Rapid resolution once identified: Once the root cause was identified, we were able to reactivate the database and restore service within minutes.
Community engagement: The Discord community provided valuable early detection of the issue when automated monitoring failed.
Systematic troubleshooting: We followed a methodical approach to identify the root cause despite initial misleading errors.

What Went Wrong

Delayed detection: The incident was only discovered after users reported issues, resulting in extended downtime.
Insufficient monitoring: Critical infrastructure components lacked proper monitoring and alerting.
Error handling deficiencies: The application failed completely instead of gracefully degrading when the analytics database was unavailable.
Misleading error messages: Initial error messages about PORT 8080 led troubleshooting in the wrong direction.