Tale of Fail: Juice + laptop = sad laptop

Here’s a funny story for the weekend. Six or so weeks ago one of my librarians called me in a panic: “I’ve spilled juice on my netbook and now it doesn’t work! What will I do?!” You and I both know that juice is possibly one of the worst things you can spill on an electronic gadget: all of that sticky sugar really gums up the works and shorts things out. I didn’t give her much hope.

She decided to go ahead and replace the netbook, but asked me if there was anything I could think of that would bring the sticky netbook back to life. Since at that point we didn’t have much to lose, I told her to take the battery out, stick the netbook, open and face down, in the dishwasher (no soap, please), and run it through a normal cycle (no heated dry, please). Once it’s done, take it out, put it in a ziploc bag filled with rice for at least two weeks, and see if it will boot. It can’t hurt it any more, and who knows, it might help.

So here we are, 6 weeks later. We are in possession of a netbook that has *been through the dishwasher* and yes, it still boots! The battery is fried, and the touchpad doesn’t work, but you can use a USB mouse with it and everything else works.

So there you have it: given enough drying time, sending a netbook through the dishwasher will not necessarily kill it, and in this case, we may well have saved it to compute another day.

[update] We replaced the battery, and sure enough the netbook is restored to cable-free functionality (the touchpad still doesn’t work).

Helping Users Help Themselves

I’ve been thinking a lot about the great AWS fail of April 2011: about the level of stress it inflicted on IT folks around the nation, their users and customers, and the Amazon staff themselves. I have a lot of sympathy for all of these people. When your services are disrupted, heads start to roll, money is lost, and (sometimes) angry shouting commences. It’s not a good situation.

I’ve also been thinking about how the way companies communicate can really make or break a response. I’ll start with an example.

During the Christchurch, NZ earthquake a month or so ago, mobile capacity was an issue for several days after the disaster. Instead of throwing their hands up, the NZ telecoms posted a plea asking the citizens of NZ to please refrain from using their mobile phones for calls and to use texts instead, to help lighten the load on the mobile infrastructure. The citizenry recognized that it was in their best interest to comply with the telcos’ request and fell back to texts instead of phone calls. It’s an example of the power of “just asking” for help in a time of crisis.

This brings me back to the AWS fail of 2011. The status updates from Amazon were confounding for techs, as evidenced by the myriad forum posts proclaiming “my instance is stuck doing X and I can’t Y!” As techs, we are by nature wired to “DO SOMETHING” when our services go down. Sitting on our hands and doing nothing while our users suffer and money-making/service opportunities for our companies wash down the drain is possibly the hardest thing we will ever have to do.

What was missing from Amazon’s response was guidance for the users (the IT folk in charge of these instances) on what would help the recovery effort and what would hinder it. Maybe it would have been better to tell people not to try to reboot their instances, attach or detach volumes, or stop/start/launch instances until Amazon gave the go-ahead. As far as I could tell, no directives or speculations to that end were issued until very late in the recovery process. If you were reading the updates carefully, you could just sense the oozing desperation of the Amazon admins in charge: this would be going faster and easier if you lot would stop mucking about with your servers that are clearly broken, and that only we can fix.

Nobody likes to feel like they can’t do anything, but when dealing with an angry boss/customer/user, it also helps to be able to say: “(service provider) has told us not to do X for the time being, and we are happy to help bring about a fast resolution to this issue by complying with their request. Of course we will continue to monitor the situation and will begin recovery the moment we are allowed to.”

The same principle translates to services that run locally: let’s say you have an outage on a local server, but your mass of users keeps incessantly hitting refresh, causing whatever problem you have to keep spiraling out of control. Issuing a “please desist, I will let you know when it’s ready” empowers people to say “you know what, it’s broken and there’s nothing I can do. I’ll just wait.” I personally like this approach very much, and I think it eases the stress for all involved: end users can say “it’s broken, I’ll do something else,” admins get a bit of room to breathe (something always in short supply in a crisis), and the problem can be fixed in an orderly and rapid fashion. Timely updates are important, of course, but I think the most important thing is to say “we’re still working on it, and it’s helping a lot that you’re not doing X. Here are the next steps for us (and how long they will take, if applicable). Thanks for your help, and we’ll let you know when you can start doing X again.”

Tale of Fail: The Great AWS Fail of April 2011

Yesterday morning I was greeted by the message that makes all sysadmins quake: “Three libraries called, their websites are down. Oh, and our website is down too.” Just typing it out puts my stomach in a knot. Further investigation of the problem pointed me straight back to Amazon, where the virtual server that hosts our statewide project for library websites had been happily humming for over a year in an EBS (Elastic Block Store) backed EC2 (Elastic Compute Cloud) instance. What’s this? Amazon has an outage? That’s unfortunate, but there’s not a lot we can do about it but wait.

Except… the virtual server in question hosts the website for the Koha community, which was on the verge of releasing version 3.4 the very next day. The website being down was definitely not in the release day plans. We can’t get at the site data to launch it on another server (it’s on an inaccessible EBS volume in an Amazon availability zone that can’t create new volumes from snapshots. Lovely!)… what are we going to do?

Let’s back up about a year, to when the Koha community site first moved to http://www.koha-community.org. You may be aware that the koha.org domain is owned by PTFS. The community had no access to the site at koha.org, and it was getting more and more dangerously out of date. The decision was made to pull all of the Koha community properties out of the koha.org domain and put them in a new domain, koha-community.org. NEKLS offered to host the website (EC2), Equinox offered to host the wiki (ESI-owned servers at QTS, using Xen), ByWater Solutions took the git repo (Rackspace), and Chris Cormack personally took the bug tracker (Linode). If you are curious, a complete list of who owns what Koha property can be found at http://wiki.koha-community.org/wiki/Website_Administration. It’s distributed geographically, by service provider, and in terms of ownership.

This turned out to be the very best possible solution to our problem of Amazon going down, at least in terms of the koha-community website: we’d point the DNS for koha-community.org to http://download.koha-community.org, which lives on a completely different server, put a tiny note there explaining what the deal was, and get on with our business of getting the release out, knowing that even if the website wasn’t back, people would still have a way to get our software. We had a (very) brief discussion in IRC regarding the temporary change, and then we just did it. What a powerful and flexible community!

How did the rest of the websites on that server fare? Not as well, really. They had a downtime of over twelve hours… unacceptable by most standards. That said, once the US-EAST-1 zone was stabilized, EC2 did allow me to fire off a new instance from backup and restore the sites in relatively short order. The backups worked, the process itself wasn’t too painful, and for those things I was extremely thankful. I’m also thankful to have understanding and patient librarian owners of these sites.
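
I won’t bore you with the exact commands, but as a rough, hypothetical sketch of what “fire off a new instance from backup” can look like, here’s the idea using today’s boto3 SDK (the actual 2011 recovery would have used whatever EC2 tooling we had at the time); every ID, zone, and device name below is a placeholder, not a real resource from the outage:

```python
# Hypothetical sketch only: launch a replacement instance in a healthy zone
# and rebuild its data volume from a backup snapshot. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Launch a fresh instance in an availability zone that is behaving.
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder base image
    InstanceType="t2.micro",              # placeholder size
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},
)
instance_id = reservation["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 2. Recreate the data volume from the most recent backup snapshot,
#    in the same healthy zone as the new instance.
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # placeholder backup snapshot
    AvailabilityZone="us-east-1b",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# 3. Attach the restored volume so the sites can be brought back up.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId=instance_id,
    Device="/dev/sdf",
)
```

The key constraint is that the new volume has to be created in the same (healthy) availability zone as the replacement instance before it can be attached, which is exactly why a zone that couldn’t create volumes from snapshots was such a problem.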

The plan now is to work on getting the snapshots transferred out of that single zone for purposes of quick recovery, and on creating a failover plan for koha-community.org. I hope we never need it, but just in case we do, we’ll be ready!
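
As a hedged sketch of what that snapshot-transfer piece could look like (not the actual plan, and the choice of backup region is my assumption), copying the EBS snapshots to a second region with boto3 might go something like this:

```python
# Hypothetical sketch only: copy EBS snapshots out of us-east-1 so a
# failover copy exists somewhere else entirely. IDs and regions are placeholders.
import boto3

SOURCE_REGION = "us-east-1"
BACKUP_REGION = "us-west-2"   # assumption: any second region would do

source = boto3.client("ec2", region_name=SOURCE_REGION)
backup = boto3.client("ec2", region_name=BACKUP_REGION)

# Find the completed snapshots owned by this account in the source region.
# (A real script would paginate; this sketch just takes the first page.)
snapshots = source.describe_snapshots(OwnerIds=["self"])["Snapshots"]

for snap in snapshots:
    if snap["State"] != "completed":
        continue
    # copy_snapshot is called against the *destination* region and pulls
    # the snapshot across from the source region.
    copy = backup.copy_snapshot(
        SourceRegion=SOURCE_REGION,
        SourceSnapshotId=snap["SnapshotId"],
        Description=f"Failover copy of {snap['SnapshotId']}",
    )
    print(f"Started copy {copy['SnapshotId']} <- {snap['SnapshotId']}")
```

The nice property of doing it this way is that the copies live entirely outside the region that failed, so a recovery wouldn’t depend on the broken zone being able to serve up its own snapshots.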

So, there’s my tale of the Great Amazon Fail of April 2011. Did you have services in US-EAST-1 that were affected? I’d love to hear your stories if you did, and what you did to recover.