The Internet Was Supposed to Be Decentralized and Resilient. So Why All These Massive Outages?
While everyone (except the especially nerdy) was sleeping, a paradoxical consolidation has crept into the systems underlying websites, commerce platforms, and business applications. That consolidation has effectively created mini anti-internets woven throughout, and when those anti-internets have large enough problems, they crash systems globally.
Such a disruption happened this past Thursday, taking out a wide range of major corporate websites and back-office systems, from FedEx, Bank of Montreal, British Airways, Royal Bank, HSBC, and Airbnb all the way to the PlayStation and Steam game platforms.
The outages were determined to have been caused by system disruptions at Akamai and Oracle, two key providers of internet "cloud" infrastructure services, with the root cause ultimately being Akamai's service disruption.
Akamai's Edge DNS service helps route web browsers to their correct destinations and, in doing so, provides redundancy, some measure of failover, and security services.
Akamai released: "We have implemented a fix for this issue, and based on current observations, the service is resuming normal operations. We will continue to monitor to ensure that the impact has been fully mitigated." — a little less than an hour after the outages had started.
The determination was that a "software configuration update triggered a bug in the DNS system," that the disruption lasted "up to an hour," and that it was not the result of a cyberattack.
... this is not about severing your ties with cloud infrastructure providers; rather, it's about how you rely on them
Most affected sites and services were restored in less than an hour. But the damage was done. The question is: how can the decentralized genius of the internet be brought down by such a routine (and boring) issue? The answer: scaled-up systems such as Akamai's Edge DNS effectively circumvent that very decentralized genius of the internet -- they are anti-internets.
At this point it's worth remembering what the internet was invented for: to create a self-directed, multi-routing network that, by its very architecture, can find a route to its destination server no matter what -- almost. Even if parts of the network are interrupted or unavailable, the routing infrastructure (DNS among other systems) knows how to find a viable path, even if it's a less-than-ideal one. Hence the genius of its resilience.
The moment you pull a critical service such as DNS out of that distributed design and centralize it, or put too many dependent technologies under one platform or infrastructure, you've created an all-too-important consolidated service acting as the primary point of truth. The moment that centralized service becomes unstable, it has taken on so much responsibility that what's left of the real internet isn't enough to keep you going. If you don't appear entirely broken, you're at least broken enough to be pretty much useless.
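One quick way to gauge your own exposure: look up the authoritative nameservers for your domain and check whether they all sit under a single provider. Here is a minimal sketch using the dnspython package (an assumption on my part; the `dig` command or any DNS library would do the same job), with `example.com` as a placeholder domain:

```python
# Minimal sketch: list a domain's authoritative nameservers and flag the case
# where every NS record points at the same provider -- a single point of failure.
import dns.resolver  # third-party package: pip install dnspython

def authoritative_nameservers(domain: str) -> list[str]:
    """Return the authoritative nameserver hostnames for a domain."""
    answer = dns.resolver.resolve(domain, "NS")
    return sorted(str(record.target).rstrip(".") for record in answer)

if __name__ == "__main__":
    servers = authoritative_nameservers("example.com")  # placeholder domain
    print(servers)
    # Crude check: compare the parent domains the nameservers live under.
    providers = {name.split(".", 1)[-1] for name in servers}
    if len(providers) == 1:
        print("All nameservers live under one provider -- a single point of failure.")
```

If that list comes back with one provider's name on every line, your DNS is exactly the kind of consolidated service this article is talking about.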
In 2019, the average cost of critical server outages ranged between $301,000 and $400,000 USD, according to a 2020 Statista report. While this covers a broad spectrum of companies and sizes, the message is clear: online systems of all flavours are an increasingly critical piece of economic infrastructure, and they deserve attention accordingly.
Bottom Up, Not Top Down
The challenge with this model starts with understanding what exactly is being centralized, and where in the stack powering your CMS, website, commerce site, or company applications its magic kicks in.
That's the crux of the issue: CDN and broader cloud infrastructure service providers have morphed into magical pixie dust providers, at least in popular culture. It's become commonplace to assume that "we use cloud provider X" means you're "just protected" from issues ... somehow. Because it's magic pixie dust.
But they're good services, aren't they?!
They're great providers, yes! We use them in strategic places here and there, and we encourage businesses to use them (details to come on "how"). So what's the problem, you ask, weary reader? It's all your eggs in one basket, or the baby out with the bathwater, whichever expression you prefer. Increasingly it appears, in a totally non-scientific study, that organizations are treating cloud infrastructure providers as the de facto guardians of all things redundancy. In other words, they believe that if they sign on that one dotted line, the cloud infrastructure provider will do the rest, regardless of any harmful event short of nuclear annihilation.
And that's simply not so. Clearly.
Take charge of your own stack redundancy. As mentioned earlier, this is not about severing ties with cloud infrastructure providers; rather, it's about how you rely on them. At a simple, high level, most applications need at least a base stack of technical services (a quick health-check sketch follows the list):
- website application: the part you see when you first visit, or when you log in as a customer or administrator
- DNS services: this is the less known or obvious part, but your website or system services need to be reachable by the world. DNS, the Domain Name System, holds the records that determine how all of your services get routed when someone, or something, calls on your site or services. More about this later.
- databases: where content lives, including inventory data, customer data, logs, purchases, and all the other important pieces of information under the hood; your website or application needs the databases to feed it information in real time
- assets: with many systems, assets such as images and code initially live within the website application itself. As audience and demand grow and become more complex, these can be broken out (abstracted) from the main application into their own space, freeing up resources for the web application to do one thing: respond to customers and staff.
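To make those layers concrete, here is a minimal health-check sketch that touches each of them in turn. Every hostname, port, and URL below is a hypothetical placeholder (a real setup would plug in its own endpoints and monitoring tooling); the point is simply that each layer is a separately checkable, and separately failable, service:

```python
# Minimal sketch of a stack-level health check covering the four layers above.
# All hostnames, ports, and URLs are hypothetical placeholders.
import socket
import urllib.request

CHECKS = {
    # DNS: does the name resolve at all?
    "dns": lambda: socket.gethostbyname("www.example.com"),
    # Website application: does the site answer over HTTPS?
    "web app": lambda: urllib.request.urlopen("https://www.example.com", timeout=5),
    # Database: is the database port reachable? (5432 is a placeholder port)
    "database": lambda: socket.create_connection(("db.example.com", 5432), timeout=5).close(),
    # Assets: does the asset/CDN host serve a known file?
    "assets": lambda: urllib.request.urlopen("https://cdn.example.com/logo.png", timeout=5),
}

def run_checks() -> None:
    for name, check in CHECKS.items():
        try:
            check()
            print(f"{name}: OK")
        except OSError as error:  # socket and urllib network errors are OSError subclasses
            print(f"{name}: FAILED ({error})")

if __name__ == "__main__":
    run_checks()
```

Nothing fancy, but running something like this against each layer (and each provider) makes the dependency map visible before an outage does it for you.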
So ... how do we bring together the magic of an already redundant internet with an online presence or application that uses that redundancy to its advantage rather than circumventing it? The answer is both easy and complicated (if you've read this far, you probably already understand this).
Breaking this down into primary service points: there are the applications or services themselves (as above), and there are the routing points *to* those applications or services. Then there is what happens when the primary provider of a service stops functioning. This brings us to the first decision point: identify each point of service where a failure means an interruption to your site, application, or services, in whole or in part. For example:
- If primary DNS services stop, is there failover outside of your primary provider? (hint: there can be)
- If your web server becomes unavailable, is there a secondary presence outside of your primary provider? (hint: there can be)
- If your database server becomes unavailable, is there a synced clone outside of your primary provider ready to go? (hint: there can be)
- If your systems call on services such as streaming, image and asset libraries (CDN), or external data lookups, are there secondary clones outside of your primary provider? (hint: there can be; see the sketch below)
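The same principle can be applied in application code, not just in infrastructure. Below is a minimal sketch of client-side failover between a primary and a secondary endpoint hosted with different providers. The URLs are hypothetical placeholders, and in practice the switch would more often happen through secondary DNS or a load balancer, but the shape of the logic is the same:

```python
# Minimal sketch of application-level failover: try a primary endpoint first,
# then fall back to a secondary copy hosted with a different provider.
# The URLs are hypothetical placeholders for two independently hosted endpoints.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://api.primary-provider.example.com/status",
    "https://api.secondary-provider.example.com/status",
]

def fetch_with_failover(urls: list[str], timeout: float = 5.0) -> bytes:
    """Return the first successful response body, trying each URL in order."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as error:
            last_error = error  # this provider is unavailable; try the next one
    raise RuntimeError(f"all endpoints failed, last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover(ENDPOINTS))
```

None of this is exotic; it's the same "find another route" thinking the internet itself was built on, pushed one layer up into your own stack.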
If it's not already obvious, each piece of your site, application, or ecosystem that relies 100% on a single cloud provider is exposed to precisely the kind of cloud outage that has been occurring with greater frequency. By analyzing your key data interaction points, you can take greater control of your ecosystem's robustness and your stack's flexibility. And the best thing? It doesn't have to cost a penny more than you're already spending; it just takes more focused awareness.