Amazon has a nice summary of the June event. An insurance adjuster would shrug and say “Act of God, nothing we can do”, but Amazon provides a great deal of information, including admissions of bugs and failures. This openness is intended to assure us that they do know what they’re doing, they’re working to improve things, and nothing is being hidden.
A very powerful storm system moved through Virginia on the evening of June 29th, causing electrical power fluctuations at several of Amazon’s US-East-1 data centers. One data center failed to transfer from utility to generator power. Despite extra staffing in anticipation of the storm, the staff were unable to manually transfer power to the generators before the UPS batteries were depleted and servers lost power. This took out about 7% of the EC2 servers, EBS storage, and RDS databases in the US-East-1 region, all of it within one Availability Zone.
Problems 1 and 2: Utility power quality tolerances required for switchover, and timing parameters for the automated switchover process.
AWS is power-hungry, and the batteries were depleted in seven minutes (Amazon’s report includes precise event timing). Within twenty minutes, the generators were running and all gear had power. The next task was to restart the underlying hypervisors and resynchronize storage consistently.
Problem 3: A bottleneck in the server boot process slowed the restart to about two hours for most EC2 instances and three hours for most EBS volumes.
Meanwhile, customers were trying to start resources in other Availability Zones. While those resources were not directly affected, their control planes were overloaded.
Problems 4 and 5: A bug in the Elastic Load Balancer control plane caused a flood of requests for resources. Simultaneously, a bug in the failover control for Relational Database Service triggered a fail-safe that required manual intervention.
Now, all that being said, I am impressed with Amazon’s reaction. Those of you in the area that weekend know that the storm was intense. Once those generators were started, they ran flawlessly for the next 30 hours while utility power was down. Amazon’s engineers had analyzed the sequence of events and tracked down the software bugs and process failures within three days, and already had plans for what to do. Their system isn’t perfect (nothing is), but they certainly seem responsible.
What about your organization? How resilient are you in the face of natural disaster? Do cloud operators’ size advantages make them more or less attractive to you? We discuss these issues in Learning Tree’s Cloud Security Essentials course.
There is much more to this: natural catastrophes may not be our biggest worry. More on this next week!