docs/handbook/engineering/postmortems/2026-03-16-infrastructure-upgrade.mdx
On March 16–17, 2026, Activepieces experienced a service disruption lasting approximately 12–24 hours (with the first ~6 hours being the most severe) after rolling out a new worker architecture on Kubernetes. Two cascading issues — a persistent volume provisioning problem and a dedicated worker misconfiguration — caused flow execution failures for most cloud customers and one enterprise customer.
All times are in UTC.
Mar 14–15 (Pre-incident): As part of the infrastructure upgrade, we first migrated enterprise dedicated workers one at a time, keeping them isolated from changes to shared infrastructure. We then began rolling out the new architecture to shared workers.
One enterprise customer's dedicated worker had `trustedEnvironment` incorrectly set to `false` after the namespace migration, blocking npm libraries in their sandbox.

The Kubernetes persistent volumes (provisioned through PVCs) allocated to the new shared workers filled up quickly after deployment. Once they were full, no shell access was available to diagnose or remediate the issue. In addition, no rollback plan had been prepared for the Kubernetes deployment, which delayed recovery.
When moving enterprise dedicated workers to the new server, a code change accidentally set trustedEnvironment to false for one enterprise customer. This disabled npm package support in the sandbox, causing that customer's flows to fail. This code path had no test coverage at the time, so the misconfiguration went undetected until flows started failing.
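A regression check of roughly the following shape could catch this kind of change before it ships. This is a minimal sketch, not the actual Activepieces code: the `DedicatedWorkerSettings` type and the `migrateWorker` and `checkMigration` functions are hypothetical names introduced for illustration, and the only behaviour assumed from the incident is that the `trustedEnvironment` flag must survive a migration unchanged.

```typescript
// Hypothetical shape of a dedicated worker's settings (an assumption, not the real schema).
interface DedicatedWorkerSettings {
  customerId: string;
  namespace: string;
  trustedEnvironment: boolean;
}

// Moves a worker to a new namespace. Every other field is expected to carry over unchanged.
function migrateWorker(
  settings: DedicatedWorkerSettings,
  targetNamespace: string,
): DedicatedWorkerSettings {
  return { ...settings, namespace: targetNamespace };
}

// Compares settings before and after a migration and returns a list of problems.
// An empty list means the sandbox trust level was preserved.
function checkMigration(
  before: DedicatedWorkerSettings,
  after: DedicatedWorkerSettings,
): string[] {
  const problems: string[] = [];
  if (before.trustedEnvironment !== after.trustedEnvironment) {
    problems.push(
      `trustedEnvironment changed from ${before.trustedEnvironment} ` +
        `to ${after.trustedEnvironment} for ${before.customerId}`,
    );
  }
  return problems;
}

// Usage: a correct migration yields no problems.
const before: DedicatedWorkerSettings = {
  customerId: "customer-a", // placeholder ID
  namespace: "old-namespace",
  trustedEnvironment: true,
};
const after = migrateWorker(before, "new-namespace");
console.log(checkMigration(before, after)); // an empty list: the flag survived the move
```

Comparing the settings before and after the migration, rather than asserting a hard-coded value, keeps the check valid for any customer regardless of what their intended trust level is.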
The `trustedEnvironment` setting for dedicated workers was not covered by tests, allowing a misconfiguration to ship undetected.

| Action Item | Status |
|---|---|
| Implement a documented and tested rollback plan for all infrastructure migrations (GIT-911) | To do |
| Add test coverage for worker trust level and sandbox configuration | Done |
| Support canary deployments | To do |
Test coverage was added for the `trustedEnvironment` code path that caused Issue 2, ensuring configuration changes are validated automatically.
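For the canary deployments action item, one common approach is to route a fixed percentage of traffic to the new workers using a stable hash, so that the same project always lands on the same side during a rollout. The sketch below illustrates that idea only. Keying on a project ID, the `canaryBucket` and `usesNewWorkers` names, and the choice of hash are all assumptions for illustration, not a description of how Activepieces plans to implement it.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units.
// Used here only to spread IDs evenly across buckets, not for security.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps an ID to a stable bucket in the range [0, 100).
function canaryBucket(projectId: string): number {
  return fnv1a(projectId) % 100;
}

// Returns true when the project falls inside the canary cohort.
// canaryPercent = 0 routes nobody to the new workers; 100 routes everybody.
function usesNewWorkers(projectId: string, canaryPercent: number): boolean {
  if (canaryPercent < 0 || canaryPercent > 100) {
    throw new RangeError("canaryPercent must be between 0 and 100");
  }
  return canaryBucket(projectId) < canaryPercent;
}

// Usage: start with a small cohort and widen it as confidence grows.
const projectId = "project-123"; // placeholder ID
console.log(usesNewWorkers(projectId, 5));
console.log(usesNewWorkers(projectId, 100));
```

Because the bucket depends only on the ID, raising the percentage only ever adds projects to the cohort, and lowering it back to 0 acts as a rollback switch for routing.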