communications/postmortems/2025-06-23.md
Date: 2025-06-23
Duration: 2025-06-19 12:24 to 2025-06-22 09:30
Impact: 170 projects
Severity: [Critical]
Prepared by: [Yujohn Nattrass]
In Mastra Cloud, we have a job that cleans up inactive builds. The Cloudflare job has been failing to run for the past few days; however, it ran successfully once on June 19, 2025. The job was supposed to clean up only inactive deployments and mark them as archived, but it also archived active deployments. The cause was a bug in our code: we checked for active deployments in batches, and one of the batches failed while the others succeeded. As a result, we missed some active deployments and mistook them for inactive ones. We triggered a rebuild for all the affected projects to redeploy them. We also created a new job that will fetch the deployments from the database and filter out builds that are the project's active build.
All times in Pacific Time (PT)
This incident occurred due to improper error handling in our deployment cleanup process. When querying the routing database to identify active deployments, one batch query silently failed. Instead of aborting the operation, the system proceeded with incomplete information, resulting in misclassification of active deployments.
Specifically:
The core issue was a combination of silent batch failure and lack of validation safeguards before performing destructive operations.
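The failure mode described above can be illustrated with a minimal sketch. It is not the actual cleanup code: the `BatchResult` shape and the `collectActiveIds` name are hypothetical, chosen only to show the idea of aborting when any batch lookup fails instead of continuing with a partial list of active deployments.

```typescript
// Hypothetical result shape for one batch lookup against the routing database.
type BatchResult =
  | { ok: true; activeIds: string[] }
  | { ok: false; error: string };

// Merge batch results into one set of active deployment IDs.
// Throws if any batch failed, so the caller never archives based on partial data.
function collectActiveIds(results: BatchResult[]): Set<string> {
  const failed = results.filter((r) => !r.ok);
  if (failed.length > 0) {
    throw new Error(
      `${failed.length} of ${results.length} batches failed; aborting cleanup`
    );
  }

  const active = new Set<string>();
  for (const r of results) {
    if (r.ok) {
      for (const id of r.activeIds) {
        active.add(id);
      }
    }
  }
  return active;
}
```

With a guard like this, a single failed batch stops the run before any destructive step, rather than letting unlisted deployments be treated as inactive.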
We created a new job that will fetch the deployments from the builds database and filter out builds that are the project's active build. The job will explicitly check whether each build is active or inactive, instead of finding active builds and assuming the rest are inactive.
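As a rough sketch of the explicit check, assuming hypothetical record shapes (`Build`, and a map from project ID to that project's active build ID; the real schema is not described here), the selection could look like this. A build is selected only when its project's active build is known and differs from it.

```typescript
// Hypothetical record shape; the real builds schema is not described in this postmortem.
interface Build {
  id: string;
  projectId: string;
}

// Return only builds that are verifiably NOT their project's active build.
// `activeBuildByProject` maps a project ID to that project's active build ID (assumed input).
function selectInactiveBuilds(
  builds: Build[],
  activeBuildByProject: Map<string, string>
): Build[] {
  return builds.filter((build) => {
    const activeId = activeBuildByProject.get(build.projectId);
    // Unknown project: the active build cannot be verified, so keep the build.
    if (activeId === undefined) {
      return false;
    }
    return build.id !== activeId;
  });
}
```

The design choice in this sketch is that missing information leads to a build being kept, so a lookup gap cannot cause an active deployment to be archived.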