Amazon Cloud Outage: The Post Mortem Continues

It has been a little over a week since the Amazon cloud outage. Pundits continue to weigh in and will likely do so for some time to come. An internet search for “Amazon cloud outage” returns over 450,000 hits, several thousand of which were in the last 24 hours. It seems there is plenty of blame to go around.

I had the good fortune to be teaching Learning Tree’s introductory Cloud Computing course last week in Los Angeles. Naturally this topic came up when we were discussing barriers to cloud adoption. One of the students offered an analogy that I thought was quite appropriate. While perhaps not perfect, it compares public clouds vs. on-premises data centers to flying vs. driving. Statistically you are much safer flying than driving. Each time a plane crashes, however, it makes headline news because of the magnitude of the disaster. In contrast, we hear very little about the countless traffic fatalities that occur on a daily basis.

Amazon has released its official response in “message 65648”. It seems that the root cause of the outage was the failure of some Elastic Block Store (EBS) volumes within a single Availability Zone in the US East Region. Last week Amazon notified all affected customers (including yours truly) by email and indicated that there would be a 10-day credit equal to 100% of their usage of EBS volumes and EC2 and RDS instances within the affected Availability Zone. In my case that is acceptable. Businesses that depended on Amazon’s cloud for their revenue may be harder to mollify.

What has become clear is that moving to the cloud does not absolve an organization of ultimate responsibility for ensuring that its systems perform. Some organizations, such as Netflix, were able to survive the outage (albeit not without some pain) through careful up-front planning and architecture. For others the Amazon outage was disastrous. The key, it seems, was a healthy dose of paranoia about the cloud, plus proper disaster recovery and contingency planning right from the beginning. These are good lessons to learn, and for many the lessons were learned the hard way.
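The kind of architecture that let Netflix ride out the outage boils down, in spirit, to redundancy with failover: deploy replicas in more than one Availability Zone and route around the one that fails. Here is a minimal, hypothetical sketch of that idea; the endpoint names and the health check are invented for illustration and are not any Amazon API:

```python
# Hypothetical sketch of zone-aware failover: try replicas in order,
# skipping any that fail a health check. Endpoint names are made up.

def first_healthy(replicas, is_healthy):
    """Return the first replica that passes its health check, or None."""
    for replica in replicas:
        if is_healthy(replica):
            return replica
    return None

# Replicas deployed across independent Availability Zones.
replicas = ["db.us-east-1a.example.com",
            "db.us-east-1b.example.com",
            "db.us-west-1a.example.com"]

# Simulate an EBS failure taking out everything in one zone.
down = {"db.us-east-1a.example.com"}
endpoint = first_healthy(replicas, lambda r: r not in down)
print(endpoint)  # fails over to the us-east-1b replica
```

The point of the sketch is that the failover logic must exist *before* the outage; applications that assumed a single zone would always be available had nowhere to route to when it was not.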

This incident is certainly an embarrassment for Amazon. It will, perhaps, cause some to proceed more cautiously in adopting cloud technologies. It is doubtful, however, that there will be a reversal of the trend toward cloud adoption. The benefits of cloud computing continue to outweigh the risks for many organizations, especially if those risks are well-managed.

In fact, Forrester Research estimates that the market for cloud services will grow from $41 billion in 2010 to over $240 billion by 2020. The Amazon incident is a setback, to be sure, but it is only a speed bump on the road to cloud computing. Amazon and consumers of cloud services will use the experience to build better systems, improve their planning, and take steps to ensure that something like this does not happen again.

So, I will continue to look at cloud computing solutions for use at Learning Tree and for my consulting clients. The genie is out of the bottle, and once that has happened there is no going back. We are reminded, however, that disasters can and do occur. The cloud does not change that, and the onus is still on system developers, administrators, and owners to ensure fail-safe conditions subject to the organization’s Recovery Point and Recovery Time Objectives.
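Those two objectives translate directly into design numbers: the worst-case data loss is roughly your backup interval, which must fit inside the Recovery Point Objective, and your measured restore time must fit inside the Recovery Time Objective. A small sketch with made-up figures (the specific hours are hypothetical, chosen only to illustrate the check):

```python
from datetime import timedelta

def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Worst-case data loss is roughly the backup interval, so it must
    not exceed the RPO; the restore time must not exceed the RTO."""
    return backup_interval <= rpo and restore_time <= rto

# Hypothetical figures: hourly backups and a 3-hour restore drill,
# measured against a 4-hour RPO and a 2-hour RTO.
ok = meets_objectives(backup_interval=timedelta(hours=1),
                      restore_time=timedelta(hours=3),
                      rpo=timedelta(hours=4),
                      rto=timedelta(hours=2))
print(ok)  # False: the 3-hour restore blows the 2-hour RTO
```

An outage like Amazon’s is precisely the event these objectives exist for; organizations that had never measured their actual restore time found out during the outage whether their plan held up.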

