Tale of Fail: The Great AWS Fail of April 2011

Yesterday morning I was greeted by the message that makes all sysadmins quake: “Three libraries called, their websites are down. Oh, and our website is down too.” Just typing it out puts my stomach in a knot. Further investigation pointed me straight back to Amazon, where the virtual server that hosts our statewide project for library websites had been happily humming for over a year on an EBS (Elastic Block Store) backed EC2 (Elastic Compute Cloud) instance. What’s this? Amazon has an outage? That’s unfortunate, but there’s not a lot we can do about it but wait.

Except… the virtual server in question also hosts the website for the Koha community, which was on the verge of releasing version 3.4 the very next day. The website being down was definitely not in the release day plans. We can’t get at the site data to launch it on another server: it’s on an inaccessible EBS volume in an Amazon availability zone that can’t create new volumes from snapshots. Lovely! What are we going to do?

Let’s back up about a year, to when the Koha community site first moved to http://www.koha-community.org. You may be aware that the koha.org domain is owned by PTFS. The community had no access to the site at koha.org, and it was getting dangerously out of date. The decision was made to pull all of the Koha community properties out of the koha.org domain and put them in a new domain, koha-community.org. NEKLS offered to host the website (EC2), Equinox offered to host the wiki (ESI-owned servers at QTS, using Xen), ByWater Solutions took the git repo (Rackspace), and Chris Cormack personally took the bug tracker (Linode). If you are curious, a complete list of who owns what Koha property can be found at http://wiki.koha-community.org/wiki/Website_Administration. The result is distributed geographically, by service provider, and in terms of ownership.

This turned out to be the very best possible solution to our problem of Amazon going down, at least in terms of the koha-community website: we’ll point the DNS for the koha-community.org website at the host that serves http://download.koha-community.org, which lives on a completely different server, put a tiny note there about what the deal is, and get on with our business of getting the release out, knowing that even if the website isn’t back, people will still have a way to get our software. We had a (very) brief discussion in IRC regarding the temporary change, and then we just did it. What a powerful and flexible community!

How did the rest of the websites that were on that server fare? Not as well, really. They had a downtime of over twelve hours… unacceptable by most standards. That said, once the US-EAST-1 region had stabilized, using EC2 did allow me to fire off a new instance from backup and restore the sites in relatively short order. The backups worked, the process itself wasn’t too painful, and for those things I was extremely thankful. I’m also thankful to have understanding and patient librarian owners of these sites.

The plan now is to work on having the snapshots transferred out of that single zone for purposes of quick recovery, and creating a failover plan for koha-community.org. I hope we never need it, but just in case we do, we’ll be ready!
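
For what it’s worth, here is a rough sketch of the kind of check a failover plan could start from. It is not what we actually run: the function names, the retry count, and the “anything below HTTP 500 counts as up” rule are all my own assumptions for illustration. It only decides which host the DNS should point at; making the DNS change itself is still a separate (and, for us, manual) step.

```python
from typing import Callable
import urllib.error
import urllib.request

# The two hosts named in this post; everything else here is hypothetical.
PRIMARY = "http://www.koha-community.org"
FALLBACK = "http://download.koha-community.org"


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, just with an error status (4xx or 5xx).
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return False


def choose_target(probe: Callable[[str], bool] = http_probe,
                  attempts: int = 3) -> str:
    """Return PRIMARY if any of `attempts` probes succeeds, else FALLBACK."""
    for _ in range(attempts):
        if probe(PRIMARY):
            return PRIMARY
    return FALLBACK
```

Calling `choose_target()` from a cron job and alerting a human when it returns the fallback would already be an improvement over waiting for libraries to phone in. Passing the probe in as a parameter keeps the decision logic testable without touching the network.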

So, there’s my tale of the Great Amazon Fail of April 2011. Did you have services in US-EAST-1 that were affected? I’d love to hear your stories if you did, and what you did to recover.
