A Disaster and a Recovery

We had our first operational disaster at Psonar on Wednesday night. The EC2 instance running the web service that handles the SongShifter client's connections locked up hard around 16:53 UTC. Now, I've never been a sysadmin by trade, so this kind of thing normally scares the hell out of me, but in the end Amazon Web Services made it seem easy.

First up was diagnosing the problem, which turned out to be pretty simple - the box was dead: no ssh, no ping, no pulse - dead. In theory, any SongShifter client running at the time would realise there was a problem and start to exponentially increase the time between server requests. That gave us some breathing space - as did the fact that the website was still up and running on another instance. So, let's turn to the plan we'd made for disaster recovery for this box. Ah... slight snag. It's on the wiki. The wiki that runs on... this box. Oops.
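
The "exponentially increase the time between server requests" behaviour is classic client-side exponential backoff. The SongShifter code itself isn't shown here, but a minimal Python sketch conveys the idea (the poll_server callable, the base interval and the cap are illustrative assumptions, not Psonar's real values):

    import random
    import time

    def poll_with_backoff(poll_server, base_delay=30, max_delay=3600):
        """Poll the server; after a failure, double the wait (plus jitter) up to a cap."""
        delay = base_delay
        while True:
            try:
                poll_server()                        # hypothetical request to the web service
                delay = base_delay                   # success: back to the normal interval
            except IOError:
                delay = min(delay * 2, max_delay)    # failure: back off exponentially
            time.sleep(delay + random.uniform(0, delay / 10))   # jitter avoids a thundering herd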

Never mind, let's go back to basics. I need a new instance, so I fire one up from a recent image using ElasticFox. Two minutes later I've got an identical box up and I'm logged in. Next, I need an up-to-date copy of the database. There's a mirror on the Windows box we serve the website from, but it's all InnoDB, so that means a full dump (causing some website downtime) and a reindex to transfer it to the Linux box I'm fixing. Let's call that plan B, then.
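
ElasticFox is point-and-click, but the same "new instance from the last good image" step can be scripted. Here's a rough sketch using the modern boto3 library rather than the tools of the day - the AMI ID, instance type, key pair and region are all placeholders, not our actual setup:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")    # region is an assumption

    # Launch a replacement box from the most recent backup AMI.
    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",      # placeholder: the last-known-good image
        InstanceType="m1.small",     # placeholder instance size
        KeyName="ops-keypair",       # placeholder key pair for ssh access
        MinCount=1,
        MaxCount=1,
    )
    print("replacement instance:", resp["Instances"][0]["InstanceId"])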

Plan A has to involve getting the data from the now-deceased server. Fortunately, the database files are all stored on EBS volumes, so I can pull up ElasticFox again, take a new snapshot of the volumes (even though they're attached to a wedged machine) and create some fresh volumes from that snapshot. Hook them up to the new instance and bingo - database back up and running. At this point I fire up the (slightly out of date) copy of the server from the restored image and reassign the elastic IP from the original server over to the new one. Rich's SongShifter leaps into action immediately and starts uploading more tunes - 25 minutes in and we've got a basic service restored from scratch!
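
For reference, the snapshot-and-restore dance we did in ElasticFox looks roughly like this when scripted with boto3 (every ID, the availability zone, the device name and the elastic IP below are placeholders, and the classic-style associate_address call is an assumption about how the address was attached):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")    # region is an assumption

    # 1. Snapshot the data volume, even though it's still attached to the dead box.
    snap = ec2.create_snapshot(VolumeId="vol-aaaaaaaa",
                               Description="emergency DB snapshot")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # 2. Create a fresh volume from that snapshot in the new instance's zone.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1a")
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # 3. Attach it to the replacement instance.
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-bbbbbbbb",
                      Device="/dev/sdf")

    # 4. Move the elastic IP across so clients reconnect without a DNS change.
    ec2.associate_address(InstanceId="i-bbbbbbbb", PublicIp="203.0.113.10")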

Some tidying up continues into the night - installing a gem missing from the backup image and updating the server to the latest release. Tim steps in and patches up the small data gap in the recovered MySQL database using maatkit. And it's all over, phew!

Lessons Learnt:

  • Keep your disaster recovery plans printed out at home or in your email - anywhere but on a server you might have to recover
  • Amazon Web Services rocks. Keep the databases on EBS and keep your AMIs up to date
  • Build resilience into your protocols - the SongShifters coped perfectly with the absence of a guiding server, so much so that no-one outside the dev team even noticed the incident
  • Back up your ssh keys - subversion (svn+ssh) relies on automatic logins in a lot of scripts, and it takes time to catch these and redistribute fresh keys
  • Keep calm