Incident Post-Mortem: Cloudflare Job Marked Active Deployments as Inactive

communications/postmortems/2025-06-23.md

  • Date: 2025-06-23
  • Duration: 2025-06-19 12:24 to 2025-06-22 09:30
  • Impact: 170 projects
  • Severity: [Critical]
  • Prepared by: [Yujohn Nattrass]

Issue Summary

In Mastra Cloud, a Cloudflare job cleans up inactive builds. The job had been failing to run for several days, but it ran successfully once on June 19, 2025. It was supposed to archive only inactive deployments, but it also archived active ones. The cause was a bug in our code: the job looked up active deployments in batches, and one batch failed while the others succeeded. Because the active deployments in the failed batch were missing from the results, the job mistook them for inactive. We triggered a rebuild of all affected projects to redeploy them. We also created a new job that fetches deployments from the database and excludes each project's active build from cleanup.

Timeline

All times in Pacific Time (PT)

  • 2025-06-19 12:24: Cloudflare job executed deployment cleanup and marked some active deployments as inactive
  • 2025-06-19 14:05: First user report that an active deployment had been marked as archived
  • 2025-06-20 02:29: Investigation traced the issue to the Cloudflare job, and the job was stopped from running
  • 2025-06-22 09:30: Triggered rebuild for all affected projects

Root Cause Analysis

This incident occurred due to improper error handling in our deployment cleanup process. When querying the routing database to identify active deployments, one batch query silently failed. Instead of aborting the operation, the system proceeded with incomplete information, resulting in misclassification of active deployments.

Specifically:

  1. The cleanup job queries our routing database for builds in batches
  2. For each batch, it identifies which deployments are currently active
  3. The batch containing active builds for 170 projects encountered a database timeout error
  4. The system incorrectly categorized these deployments as inactive since they weren't present in the "active" results
  5. These deployments were subsequently archived, making them inaccessible

The core issue was a combination of silent batch failure and lack of validation safeguards before performing destructive operations.
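
To illustrate the failure mode described above, here is a minimal sketch of a batched lookup that fails closed: if any batch query raises, the whole cleanup run aborts instead of continuing with a partial "active" set. The names `collect_active_ids`, `plan_archive`, `BatchQueryError`, and the `query_active` callback are hypothetical and are not taken from the actual job; the real routing database client is not shown in this post-mortem.

```python
from typing import Callable, Iterable, List, Sequence, Set


class BatchQueryError(Exception):
    """Raised when any batch lookup fails, so the whole cleanup run aborts."""


def collect_active_ids(
    batches: Iterable[Sequence[str]],
    query_active: Callable[[Sequence[str]], Set[str]],
) -> Set[str]:
    """Return the union of active IDs, or raise if any single batch fails."""
    active: Set[str] = set()
    for index, batch in enumerate(batches):
        try:
            # query_active is a stand-in for the routing database lookup.
            active |= query_active(batch)
        except Exception as exc:
            # Fail closed: a partial "active" set must never drive archiving.
            raise BatchQueryError(f"batch {index} failed; aborting cleanup") from exc
    return active


def plan_archive(
    all_ids: Sequence[str],
    batches: Iterable[Sequence[str]],
    query_active: Callable[[Sequence[str]], Set[str]],
) -> List[str]:
    """List deployments to archive, but only if every batch was answered."""
    active = collect_active_ids(batches, query_active)
    return [dep_id for dep_id in all_ids if dep_id not in active]
```

With this shape, a timeout in one batch produces no archive plan at all, rather than a plan that treats the unanswered deployments as inactive.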

Resolution

We created a new job that fetches deployments from the builds database and excludes each project's active build. The job explicitly checks whether each build is active or inactive, instead of finding the active builds and assuming the rest are inactive.
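
The explicit check could look roughly like the sketch below. It assumes, purely for illustration, that each build record carries a project ID and that a mapping from project ID to active build ID is available; the `Build` type, `select_archivable`, and the mapping are hypothetical, and the real schema may differ. The key property is that a build is archived only when its project's active build is positively known and is a different build.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Build:
    build_id: str
    project_id: str


def select_archivable(
    builds: List[Build],
    active_build_by_project: Dict[str, Optional[str]],
) -> List[Build]:
    """Return builds that are positively known not to be their project's active build."""
    archivable: List[Build] = []
    for build in builds:
        if build.project_id not in active_build_by_project:
            # Status unknown (for example, the lookup failed): never archive.
            continue
        if build.build_id == active_build_by_project[build.project_id]:
            # This is the project's active build: keep it.
            continue
        # A value of None means the project is known to have no active build.
        archivable.append(build)
    return archivable
```

Unlike the earlier approach, a missing entry here means "skip", so a failed lookup can only cause too little cleanup, never the archiving of a live deployment.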

Impact Assessment

  • Projects Affected: 170 out of 863 projects
  • Services Affected: Mastra Cloud
  • Duration: June 19, 2025 12:24 to June 22, 2025 09:30
  • Business Impact: Affected deployments were down until they were rebuilt, either by a manual trigger or by a GitHub webhook.

Action Items

Immediate Actions (0-7 days)

  • Triggered rebuild for all affected projects
  • Created a new job that fetches deployments from the database and excludes each project's active build

Lessons Learned

What Went Well

  • Development of an improved job implementation that directly checks the database for active build status
  • Effective containment by stopping the Cloudflare job from continuing to archive active deployments
  • Comprehensive remediation through rebuilding 170 affected projects

What Went Wrong

  • Lack of validation checks to confirm active deployment status before marking deployments as inactive
  • Failure handling that silently ignored batch query timeouts instead of alerting on them
  • Extended recovery time due to manual rebuild process
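
As one possible shape for the missing validation and alerting, the sketch below re-checks each candidate immediately before archiving it and logs any failed status check instead of swallowing it. The `is_active` and `archive` callbacks and the logger name are assumptions made for illustration and do not reflect the actual job's interfaces.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("deployment_cleanup")


def archive_with_recheck(
    candidates: Iterable[str],
    is_active: Callable[[str], bool],
    archive: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Archive only candidates confirmed inactive; return (archived, skipped)."""
    archived: List[str] = []
    skipped: List[str] = []
    for dep_id in candidates:
        try:
            active = is_active(dep_id)
        except Exception:
            # Surface the failure loudly and leave the deployment untouched.
            logger.exception("status check failed for %s; skipping", dep_id)
            skipped.append(dep_id)
            continue
        if active:
            logger.warning("%s is still active; skipping", dep_id)
            skipped.append(dep_id)
            continue
        archive(dep_id)
        archived.append(dep_id)
    return archived, skipped
```

The returned `skipped` list gives an operator or an alerting hook something concrete to act on after each run.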