I’ve been thinking a lot about the great AWS fail of April 2011. Thinking about the level of stress it inflicted on IT folk around the nation, on their users and customers, and on the Amazon staff themselves: I have a lot of sympathy for all of these people. When your services are disrupted, heads start to roll, money is lost, and (sometimes) angry shouting commences. It’s not a good situation.
I’ve also been thinking about how the way a company communicates can make or break its response. I’ll start with an example.
During the Christchurch, NZ earthquake a month or so ago, mobile capacity was an issue for several days after the disaster. Instead of throwing their hands up, the NZ telecoms issued a plea to the citizens of NZ: please refrain from making voice calls and use texts instead, to lighten the load on the mobile infrastructure. The citizenry recognized that it was in their best interest to comply with the telcos’ request and fell back to texting instead of calling. It’s an example of the power of “just asking” for help in a time of crisis.
This brings me back to the AWS fail of 2011. The status updates from Amazon were confounding for techs, as evidenced by the flood of forum posts proclaiming “my instance is stuck doing X and I can’t Y!” As techs, we are by nature wired to “DO SOMETHING” when our services go down. Sitting on our hands and doing nothing while our users suffer and money-making/service opportunities for our companies wash down the drain is possibly the hardest thing we will ever have to do.
What was missing from Amazon’s response was guidance for the users (the IT folk in charge of these instances) on what would help the recovery effort and what would hinder it. Maybe it would have been better to tell people not to reboot their instances, attach or detach volumes, or stop, start, or launch instances until Amazon gave the go-ahead. As far as I could tell, no directives to that effect were issued until very late in the recovery process. If you were reading the updates carefully, you could sense the oozing desperation of the Amazon admins in charge: this would be going faster and easier if you lot would stop mucking about with servers that are clearly broken, and that only we can fix.
Nobody likes to feel like they can’t do anything, but it also helps, when dealing with an angry boss/customer/user, to be able to say: “(service provider) has told us not to do X for the time being, and we are happy to help resolve this issue quickly by complying with their request. Of course we will continue to monitor the situation and will begin recovery the moment we are allowed to.”
The same principle applies to services that run locally. Let’s say you have an outage on a local server, but your mass of users keeps incessantly hitting refresh, causing whatever problem you have to spiral further out of control. Issuing a “please desist, I will let you know when it’s ready” empowers people to say, “You know what, it’s broken and there’s nothing I can do. I’ll just wait.”

I personally like this approach very much, and I think it eases the stress for all involved: end users can say “it’s broken, I’ll do something else,” admins get a bit of room to breathe (time being always in short supply in a crisis), and the problem can be fixed in an orderly and rapid fashion. Timely updates are important, of course, but I think the most important thing is to say, “We’re still working on it, and it’s helping a lot that you’re not doing X. Here are the next steps for us (and how long they will take, if applicable). Thanks for your help, and we’ll let you know when you can start doing X again.”
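As a concrete aside, one way to put that “please desist” message in front of web users is a temporary maintenance responder that answers every request with an HTTP 503 and a Retry-After header, so both people and scripts hear “it’s broken, check back later” instead of hammering a half-recovered service. The sketch below is only illustrative; the port, wording, and retry interval are assumptions of mine, not anything from the Amazon incident:

# Minimal maintenance-page sketch: while the real service is down, a stand-in
# process answers every request with a 503, a Retry-After hint, and a short
# human-readable "please stop refreshing, we'll let you know" message.
from http.server import BaseHTTPRequestHandler, HTTPServer

MESSAGE = ("We're working on an outage right now. Please don't keep hitting refresh; "
           "it slows recovery down. We'll post an update when it's safe to retry.").encode("utf-8")

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(503)                 # Service Unavailable
        self.send_header("Retry-After", "900")  # suggest retrying in ~15 minutes
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(MESSAGE)))
        self.end_headers()
        self.wfile.write(MESSAGE)

if __name__ == "__main__":
    # Bind to the port the broken service normally answers on (8080 is illustrative).
    HTTPServer(("", 8080), MaintenanceHandler).serve_forever()

The 503-plus-Retry-After combination is simply the HTTP way of saying what the human message says: stop retrying for now, and here’s roughly when to check back.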