A really good article from Dan Luu describing what he has learned from reading postmortems. One key takeaway: hardware errors and developers can cause issues, but to really flatten your infrastructure you need a config change.
I should also acknowledge my debt to Dan for a) giving me the idea for this site and b) providing a “seed” list of outage reports that I could use to populate it.
Lessons Learned From Reading Postmortems
I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft.
I haven’t done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.