Last week I discussed the June 2012 derecho and Amazon’s response. As I said then, there may be something bigger to worry about.
Yes, some resources in one of the four Availability Zones in the US-East-1 region were down. But the storm was enormously powerful. 911 service was down. Seriously, how many of you in the region were completely unaffected?
We now rely on Amazon Web Services. Yes, all of us, unless you somehow manage to avoid doing business with companies using it. How’s life in that isolated cabin?
Meanwhile, it is important to realize that AWS is an enormous and unprecedented experiment.
That’s right, there has never been such a vast array of data centers operated largely automatically through unpredictable self-deployment by customers. Amazon is doing an amazing job of designing and operating this technology, but like any technology, it isn’t perfect.
As I mentioned last week, June’s weather-induced event involved some specific numerical values including tolerances on utility power characteristics and timing parameters for power switchover.
Those numerical values suggest why Amazon’s summary of their April 2011 outage is even more interesting. That outage was caused by EBS storage volumes getting into a “stuck” state where they were unable to service read and write requests, causing EC2 instances to also get “stuck” while blocked waiting on I/O.
It began with an inappropriate shift of traffic onto a lower-capacity network within one Availability Zone in one region. The reduced network capacity caused EBS storage clusters to search more aggressively for available storage resources for re-mirroring. That, in turn, further saturated the network, and the problems snowballed. Request queues on the control planes rapidly grew, exhausting those resources and degrading service in other Availability Zones in the same region. The situation spun out of control, and manual intervention was required to shut down the now-unstable automated processes.
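The runaway dynamic is easy to see in a toy model. This is not Amazon’s system, just a sketch under simple assumptions: a fixed-capacity network, a set of volumes each requesting one re-mirror per time step, and failed requests retried next step scaled by a retry aggressiveness factor.

```python
def simulate(capacity, volumes, retry_factor, steps=10):
    """Toy feedback-loop model of a re-mirroring storm.

    Each step, 'volumes' new re-mirror requests arrive; requests
    beyond 'capacity' fail and are retried next step, multiplied
    by retry_factor (aggressive retries amplify the backlog;
    retry_factor < 1 models backing off). Returns the pending
    request count after each step.
    """
    pending = volumes
    history = []
    for _ in range(steps):
        served = min(pending, capacity)
        failed = pending - served
        # Next step: fresh demand plus retried failures.
        pending = volumes + int(failed * retry_factor)
        history.append(pending)
    return history
```

With demand slightly above capacity, an aggressive retry factor makes the backlog grow without bound, while a backoff-style factor lets it stabilize: for example, `simulate(100, 120, 2.0)` climbs steadily, but `simulate(100, 120, 0.5)` levels off near 140.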
Amazon quickly identified the underlying causes, and their report is interesting reading, at least if you are an engineer who appreciates control theory and the careful tuning of numerical parameters in feedback loops. See the Tacoma Narrows Bridge for an example of a physical system with unfortunate control values. Amazon identified and implemented needed changes to capacity buffers, retry logic, and automated backoff and reconnect timing parameters.
Meanwhile, we have to trust that those changes were a step in the right direction. Learning Tree’s Cloud Security Essentials course discusses our reliance on cloud provider design decisions.