So it’s been a couple of hours since the latest major AWS outage was set to RESOLVED. This was a major outage in the AWS us-east-1 region, an incident that went on for more than 15 hours!

What resulted was a global outage across several major service providers, including accounting software, messaging software, even banking systems. Estimates of the direct losses from this outage run into the millions, if not billions, of dollars.

During and after this somewhat major event (it is a major event if you’re one of these businesses), I saw several reactions and several coping mechanisms. There are now multiple videos on the internet trying to explain the impact and why it’s AWS’s fault.

I’m here to present a somewhat different narrative.

In my somewhat educated opinion, the majority share of the fault lies with the service providers whose services went down because a single region went offline. To be specific, it wasn’t even an entire region; it was mainly a couple of key services that saw the most downtime.

There are multiple points I want to present here.

  1. Any single region of ANY cloud provider is bound to go down at some point. There’s no such thing as 100% availability

  2. Cloud is a shared responsibility model, it’s not “just someone else’s computer”. Uptime of the Cloud is the provider’s responsibility, Uptime IN the Cloud is yours!

  3. This is a business problem, not a technical one

  4. No, this isn’t why the Cloud is bad. You just have to decide which single point of failure you will accept

I just want to say that I’m not just shilling for the cloud. Of course, there’s a financial incentive for me to talk about the cloud; I am a Cloud Architect after all. I do think there has to be a justification for why your software should be on the cloud as opposed to being deployed on a simple VPS. But on the other hand, if you are already in the cloud, then you should do things correctly. If not, that’s just bad engineering, bad architecture, and bad risk management.

There’s no such thing as 100% availability

This is the first thing anyone learns about the Cloud, but there seems to be an increasing trend of selling the Cloud as if being always available were one of its traits. No one should, in theory, be doing this, but when you have those quotas, when that commission is almost within reach, what’s one misconception made as an offhand comment?

It is almost impossible, engineering-wise, to provide 100% availability for a service with just one data center, or even one region. There are so many single points of failure across the entire stack that the only thing that can be guaranteed is a failure, which you might notice is the opposite of 100% availability.

To compound the situation, us-east-1 is well known to be unstable. There’s probably a long list of technical reasons for this, but the most plausible is technical debt. The moment you go live, you become reluctant to change things, and that naturally means a lot of early-stage bugs and wrong design decisions have to be maintained rather than fixed. It also means newer features get rolled out on top of the older technical-debt stack, making things more complex as time goes on.

Also, scale-wise, major outages in the Cloud take a long time to recover from. This isn’t a circular argument. If the event isn’t over within a few minutes, it’s almost guaranteed the downtime will run to multiple hours. This is because, if the outage goes on long enough, the affected upstream services accumulate a huge backlog of (constantly retrying) requests waiting for them to come back up. When a service does come back up, it gets immediately swamped by these hungry requests, so much so that past incidents have struggled to recover for this reason alone. Outages like this usually have cascading effects because of it, so naturally any outage long enough to build up such a backlog will take considerable time to recover from.
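On the client side, the standard discipline for not contributing to that post-recovery stampede is capped exponential backoff with jitter. Here is a minimal sketch of the idea; the `call_with_backoff` wrapper and the `fetch_invoice` call it wraps are purely illustrative, not taken from any particular SDK.

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with capped exponential backoff and full jitter,
    so a fleet of clients doesn't hammer a recovering service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # A real implementation would only retry transient/throttling errors.
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Full jitter: sleep a random duration up to the capped exponential delay.
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)

# Hypothetical usage: `fetch_invoice` stands in for any call to a dependency
# that might be down or overloaded.
# result = call_with_backoff(lambda: fetch_invoice("inv-123"))
```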

With these well-known points, it should be common knowledge to prepare for a region outage, whether you’re on AWS, Azure, or GCP (or even Vultr, given all that techfluencer chatter on Twitter). There are all sorts of services and features in hyperscaler cloud providers to help with this. For example, in AWS, you can build multi-AZ and multi-region resilient architectures fairly easily. This is just a matter of architecture and engineering. This is something you should do; resilience IN the Cloud doesn’t happen just because you moved into the Cloud.
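To make the multi-region point slightly more concrete, here is a hedged sketch of one such pattern: a client-side read that falls back to a replica region when the primary fails. It assumes a DynamoDB global table; the table name, key schema, and regions are purely illustrative, and a production design would more likely lean on managed failover (health checks, DNS routing) than a hand-rolled loop.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Assumed setup: a DynamoDB global table named "orders", replicated to both
# regions below. Table name, key schema, and regions are illustrative only.
REGIONS = ["us-east-1", "eu-west-1"]

def get_order(order_id: str):
    """Read from the first healthy region, falling back to a replica if the
    primary region (or its DynamoDB endpoint) is unavailable."""
    last_error = None
    for region in REGIONS:
        table = boto3.resource("dynamodb", region_name=region).Table("orders")
        try:
            response = table.get_item(Key={"order_id": order_id})
            return response.get("Item")  # None if the item doesn't exist
        except (BotoCoreError, ClientError) as err:
            last_error = err  # note the failure and try the next region
    raise RuntimeError("all configured regions failed") from last_error
```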

Shared Responsibility Model

This is why operating in the Cloud is a Shared Responsibility Model. High availability OF the Cloud is not the same thing as high availability IN the Cloud. The same goes for any other quality you’re expecting from operating in the Cloud: security, performance, fault tolerance, and so on.

Like I mentioned before, the growing trend of selling the Cloud for its “availability” also seems to produce the side effect of treating the Cloud as a black box that solves all complexity. This is not the case; you still have to practice proper architecture and engineering in the Cloud. It’s just another utility service.

Business Problem, NOT Technical

This is why the decision to expect and prepare for regional failure is a business problem, NOT a technical one. The cost, in terms of engineering, architecture, and actual money, of preparing for downtime needs to be weighed against the business loss of not preparing. No one can make this decision other than the business itself.

So, if you’re a bank, and your service went down because of a DNS failure in one AWS region, then there are a few possibilities for how you ended up in this situation.

  1. You knew the business loss and accepted it, deciding not to spend on fault tolerance. Perfectly reasonable, although not if you’re a bank.
  2. You didn’t know about the single point of failure, so you didn’t know about the business risk. Again, a possible scenario, but it shows you lack visibility into your own business.
  3. You were informed about the dependency, but didn’t know the business impact, so you decided it wasn’t worth the effort because “the cloud will always be available”. Honestly, having seen decisions like this in real life, this would be the worst path to take.

Either way, preparing for an outage in one or more of the services in your dependency tree is a business decision. So whatever dollar amount gets calculated as lost because of the AWS outage lies with the service provider, not really with AWS (excluding the service credits AWS will be handing out).

Don’t get me wrong, there’s enough fault with AWS over this outage. Cloud services are increasingly becoming a utility, and there are probably availability and security controls coming down the line, considering how concentrated critical services have become on a handful of Cloud providers.

However, that doesn’t take away from the fact that business continuity planning can only be done by the business, not by the utility provider.

Cloud isn’t “Over Engineered”

There’s going to be an uptick in articles and videos about how this incident is indicative of how the Cloud is “over engineered” and why we should all pack up our deployments and leave for VPSs on Vultr or some other random compute provider of the year.

The most neutral take I have on this is that most people who think the Cloud is over engineered are not really experienced enough to make that claim. The Cloud is not just compute, it’s not just a way to host your SPA. It’s a collection of services at various levels, enabled by scale and diversity of use cases.

I honestly think that people mistake complexity for over engineering. They are not the same. Complexity is a necessary quality of a system in some cases. Over engineering is about whether the complexity of the system matches the complexity of the use case.

So no, this is not the signal to move away from the Cloud. The Cloud is not over engineered. You just have to improve how you engineer IN the Cloud.

Conclusion

It’s also important to note that the above points don’t apply to every scenario. Some key global services did go down, and those wouldn’t technically be covered by a multi-region approach. Also, it’s interesting that authentication for most AWS services, as well as the main Amazon store itself, went down. There is probably a technical-debt and complexity-related reason for the Amazon store outage; however, I do think it’s time AWS engineered a multi-region solution for services like IAM Identity Center.

Having said that, the above points do apply to almost 80%-90% of the deployments that will be done in the Cloud. These are mostly architectural and business concerns rather than pure engineering ones, so the drive for a resilient Cloud strategy should be a top-down approach.