We sincerely apologise for the outage you experienced.
What happened:
Yesterday, we rolled out an update to support ARM architecture across our platform.
As part of that, a migration was initiated today to extend support to x86 architecture within our database schemas.
This triggered a higher-than-expected volume of database rescheduling events, which in turn put excessive pressure on the underlying GCP infrastructure. Some nodes failed under this load, compounding the issue and causing degraded service for several customers.
Timeline of events (UTC):
15:36 — Northflank engineers were alerted to failing health checks on customer endpoints.
15:41 — Elevated 503 errors were detected from a specific cluster.
15:42 — An internal incident was formally declared.
15:57 — Replacement infrastructure began provisioning to recover failed nodes.
16:32 — All customer addons returned to a healthy state, with the exception of four databases which required manual intervention.
Thank you for your patience—we know how critical uptime is, and we are reviewing and improving our systems to ensure this doesn’t happen again.
Will Stewart
CEO @ Northflank