Northflank - Infrastructure stability issues – Incident details

All systems operational

Infrastructure stability issues

Resolved
Partial outage 0
Started 13 days agoLasted about 1 hour

Affected

Northflank Platform

Partial outage from 4:04 PM to 5:13 PM

Services

Partial outage from 4:04 PM to 5:13 PM

Addons

Partial outage from 4:04 PM to 5:13 PM

Networking

Partial outage from 4:04 PM to 5:13 PM

Updates
  • Resolved
    Resolved

    We sincerely apologise for the outage you experienced.

    What happened:
    Yesterday, we rolled out an update to support ARM architecture across our platform.
    As part of that, a migration was initiated today to extend support to x86 architecture within our database schemas.
    This triggered a higher-than-expected volume of database rescheduling events, which in turn put excessive pressure on the underlying GCP infrastructure. Some nodes failed under this load, compounding the issue and causing degraded service for several customers.

    Timeline of events (UTC):

    • 15:36 — Northflank engineers were alerted to failing health checks on customer endpoints.

    • 15:41 — Elevated 503 errors were detected from a specific cluster.

    • 15:42 — An internal incident was formally declared.

    • 15:57 — Replacement infrastructure began provisioning to recover failed nodes.

    • 16:32 — All customer addons returned to a healthy state, with the exception of four databases which required manual intervention.


    Thank you for your patience—we know how critical uptime is, and we are reviewing and improving our systems to ensure this doesn’t happen again.

    Will Stewart
    CEO @ Northflank

  • Monitoring
    Monitoring