
How to Write Great Outage Post-Mortems

An article by Daniel Doubrovkine of Artsy on how to write great outage post-mortems.

The website is finally back up after crashing hard for 4 hours straight.

Recently AWS decided to reboot a few of our servers for a critical update. It didn’t seem like it was going to be a big deal, except that the schedule was only accommodating if you were in the Pacific Northwest. The first reboot took out a secondary replica of our MongoDB database. Unfortunately, the driver handled that poorly and spent the first 400 ms of every subsequent HTTP request trying to reconnect to the missing instance. That server came back up but failed to find its storage volumes because of a human mistake in a past migration, and the alerts were mistakenly silenced by someone monitoring the system. A few hours later the primary was stepped down and rebooted, sending the driver into a panic over another bug. The site went down.
