Lessons learnt from reading post mortems – Dan Luu

A really good article from Dan Luu describing the knowledge he has gained from reading post mortems. One key takeaway: hardware errors and developers can cause issues, but to really flatten your infrastructure you need a config change.

I should also acknowledge my debt to Dan for a) giving me the idea for this site and b) providing a “seed” list of outage reports that I could use to populate this site.

Lessons Learned From Reading Postmortems

I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft.

I haven’t done any kind of formal analysis on the most common causes of bad failures (yet), but there are a handful of postmortem patterns that I keep seeing over and over again.

