Railway says Google Cloud suspension triggered 8-hour platform-wide outage

The team says an automated Google Cloud account suspension cascaded beyond Google Cloud and left every Railway workload unreachable until early on May 20.

Why it matters

A provider-level suspension took out Railway's control plane and exposed it as a single point of failure beneath an otherwise multi-environment footprint. Founders running platforms can study the incident to harden control planes, route caches, and provider escalation paths.

Railway, the all-in-one cloud provider, says an automated action by Google Cloud incorrectly suspended its production account on May 19, triggering a platform-wide outage that lasted about eight hours, according to an incident report published May 20.

The report was authored by Railway team members Chandrika Khanduri and Cody De Arkland, who wrote that Google Cloud placed the account in a suspended status without prior outreach and that the action affected many customers across the provider. Railway runs customer workloads across multiple environments, including its own Railway Metal and burst capacity on AWS, but its edge proxies rely on a Google Cloud-hosted control plane. When that control plane went dark, cached routes began to expire and the outage cascaded beyond Google Cloud until every workload was unreachable.

What happened

Railway says the suspension hit at 22:20 UTC on May 19, immediately taking the Railway dashboard, API, and databases offline. Users saw 503 errors such as "no healthy upstream" and could not log in. As route caches expired, workloads running outside Google Cloud began returning 404s because the network control plane could no longer resolve routes to active instances. At peak impact, all regions and tiers were unreachable.
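
The cache-expiry failure mode is easier to see in code. The sketch below shows one common mitigation, a "stale-on-error" cache that keeps serving the last known route when the control plane cannot be reached. It is not taken from Railway's report, which does not describe how its edge proxies cache routes; the class, the `lookup` callback, and the `ControlPlaneUnavailable` exception are all hypothetical names used for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) lookup callback when the control plane is down."""


@dataclass
class CachedRoute:
    upstream: str      # address of the instance serving this hostname
    fetched_at: float  # clock reading when the route was last confirmed


class StaleOnErrorRouteCache:
    """Route cache that prefers a stale answer over no answer."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, CachedRoute] = {}

    def resolve(self, hostname: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(hostname)

        # Fresh entry: answer from cache without touching the control plane.
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.upstream

        try:
            upstream = self._lookup(hostname)
        except ControlPlaneUnavailable:
            # Control plane is down: fall back to the expired entry if one
            # exists, instead of dropping the route and answering 404.
            return entry.upstream if entry is not None else None

        self._entries[hostname] = CachedRoute(upstream, now)
        return upstream
```

The trade-off is that a stale route can point at an instance that has since moved or been removed, so a real deployment would likely pair this with health checks and an upper bound on how long stale entries are trusted.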

The team filed a P0 with Google Cloud at 22:22 and had account access restored by 22:29, but compute instances remained stopped and persistent disks were inaccessible. Disks came back to a ready state by 23:54, yet recovery remained blocked on Google Cloud networking. Edge traffic resumed at 01:38 UTC on May 20 as networking recovered, orchestration and build systems followed shortly after, and the incident moved to monitoring at 06:14. Railway marked the incident resolved at 07:58.
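
For readers who want the timeline as elapsed time, the short script below converts the timestamps quoted above into offsets from the 22:20 UTC suspension. The timestamps come from the article; the helper names are illustrative.

```python
from datetime import timedelta


def offset(day: int, hhmm: str) -> timedelta:
    """Time since 00:00 UTC on May 19 for a given day of May and HH:MM."""
    hours, minutes = map(int, hhmm.split(":"))
    return timedelta(days=day - 19, hours=hours, minutes=minutes)


START = offset(19, "22:20")  # suspension hits

MILESTONES = {
    "P0 filed": offset(19, "22:22"),
    "account access restored": offset(19, "22:29"),
    "disks ready": offset(19, "23:54"),
    "edge traffic resumed": offset(20, "01:38"),
    "moved to monitoring": offset(20, "06:14"),
    "marked resolved": offset(20, "07:58"),
}


def elapsed(label: str) -> timedelta:
    """Time between the suspension and the named milestone."""
    return MILESTONES[label] - START


for label in MILESTONES:
    total_minutes = int(elapsed(label).total_seconds()) // 60
    print(f"{label}: {total_minutes // 60}h {total_minutes % 60:02d}m after the suspension")
```

By this arithmetic, edge traffic resumed 3 hours 18 minutes after the suspension, the move to monitoring came at 7 hours 54 minutes, and the incident was marked resolved at 9 hours 38 minutes. The move to monitoring is the milestone closest to the roughly eight-hour figure, though the article does not say which endpoints that figure measures.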

The recovery got messy

Bringing the platform back created secondary pressure. With caches cleared during the outage, a backlog of queued deployments began to flood the system, so Railway paused deploys and drained the queue gradually to avoid overloading services. GitHub then rate-limited Railway's OAuth and webhook integrations, temporarily blocking some logins and builds, and users were prompted to re-accept terms of service because those records had been reset as a side effect.
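
Draining a backlog gradually can be as simple as releasing work in small batches with a pause between them. The sketch below illustrates that idea only; Railway has not published how it drained its queue, and the function name, batch size, and pause length are assumptions chosen for the example.

```python
import time
from collections import deque
from typing import Callable, Deque


def drain_in_batches(
    queue: Deque[str],
    start_deploy: Callable[[str], None],
    batch_size: int,
    pause_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Start queued deploys oldest-first, a few at a time.

    Returns the number of batches that were released.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    batches = 0
    while queue:
        # Release at most batch_size deploys, oldest first.
        for _ in range(min(batch_size, len(queue))):
            start_deploy(queue.popleft())
        batches += 1

        # Give downstream services time to absorb the batch before the next one.
        if queue:
            sleep(pause_seconds)
    return batches
```

For example, calling `drain_in_batches(backlog, start_deploy, batch_size=50, pause_seconds=10)` would release 50 deploys every ten seconds. A production system would more likely adjust the batch size from live load signals than use fixed values.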

Owning the blast radius

Railway's post is explicit about the architectural trap: a single upstream provider action took down its Google Cloud control plane, and that single point of failure propagated to environments that were otherwise healthy. "We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage," the team wrote, adding that the report may be updated pending Google Cloud's internal review.

Railway positions itself as an all-in-one alternative to stitching together deploy, network, scale, and observability tools, with automatic previews, a visual canvas for stacks, and managed databases. That value proposition relies on a control plane that keeps edge routing and orchestration in sync. When that control plane runs inside a single provider's boundary, even a multi-environment footprint spanning Railway Metal and AWS is vulnerable to a control-plane outage.

In the wake of the incident, Railway says it will detail changes aimed at preventing a repeat and is continuing to review the event with Google Cloud. For operators choosing a platform, the writeup reads like a case study in isolating control planes, setting sane cache behavior, and planning for provider-level surprises, even when workloads are spread across multiple clouds.
