This week, a significant portion of the Web fell over when, on Tuesday, a massive outage at Fastly affected around 85% of its network and took down the sites it powers.
The near-total collapse — which was quickly identified and remedied — took out sites including GitHub, Stack Overflow, PayPal, Shopify, Stripe, Reddit, Amazon, and CNN. Furthermore, it was all but impossible to express rage on Twitter because the server that handles the social network’s emojis was also affected.
“This outage was broad and severe, and we’re truly sorry for the impact to our customers and everyone who relies on them.”
– Nick Rockwell, Senior VP of Engineering and Infrastructure, Fastly Inc.
The incident occurred at around 10:00 UTC (06:00 EDT) and prompted mass “Error 503” messages. Fastly identified it in less than a minute and resolved it within an hour.
Initial analysis indicates that the whole episode was triggered by a single customer updating their settings (in a perfectly valid way). You know those nightmares you have about clicking the wrong button and deleting the whole Web? Yeah, imagine being that person. The precise combination of settings triggered a bug that Fastly’s QA had missed in a software update and that had been sitting in production code since May 12th.
If you’ve ever visited a serious server center, you’ll know the kind of security they employ in defense against potential criminal attacks. The only center I’ve visited in person was inside a nuclear-proof bunker and involved multiple security checks, and I wasn’t even allowed into the really secure part. But it turns out, all the terrorists need to do to crash the global economy is open a CDN account and update their settings.
Fastly actually reacted far faster than its competitors did during previous CDN mass outages, which is one possible reason its share price soared this week. But it is still trapped in a cycle of competition in which fast and cheap are easily compared, and good is somewhat abstract…until it’s not.
Most of us feel like seasoned hands at the Web when the truth is we’re very early adopters. It will be a century or more before the Web is truly integrated into society. Still, we are building the foundations now, and future generations need those foundations to be robust. We need less focus on clawing back a few pennies, less focus on delivering sites 3 nanoseconds before a user opens their browser, and a greater focus on resilience.
Like everyone, I love eye-peelingly fast sites, and I’m more than happy to get a good deal, but personally, I don’t feel either of those things is worth waking up to an Error 503 on a site I’m responsible for.
Image via Unsplash.