Category Archives: Post

How to Write Great Outage Post-Mortems–Artsy

An article from Daniel Doubrovkine at Artsy on how to write great post-mortems.

The website is finally back up after crashing hard for 4 hours straight.

Recently AWS decided to reboot a few of our servers for a critical update. It didn’t seem like it was going to be a big deal, except that the schedule was only accommodating if you were in the Pacific Northwest. The first reboot took out a secondary replica of our MongoDB database. Unfortunately the driver handled that poorly and spent the first 400ms of every subsequent HTTP request trying to reconnect to the missing instance. That server came back up, but failed to find its storage volumes because of a human mistake in a past migration and the alerts were mistakenly silenced by someone monitoring the system. A few hours later the primary was being stepped down and rebooted, sending the driver into panic over another bug. The site went down.


This is How Effective Leaders Move Beyond Blame–First Round

An interview with Dave Zwieback on how to effectively get to the root cause of incidents by removing blame from the analysis.

Contains some great quotes:

“Say there’s an incident and five minutes into the postmortem, we find out what happened and who’s responsible: Bobby and Susan screwed up. That feels good because there’s an unambiguous explanation: the so-called ‘root cause’. In this case, we’ve found our ‘bad apples,’ and can deal with them punitively so that such failures will never happen again. We may even feel better about our company culture and our colleagues if the individuals accept the blame and own up to what they did to ‘cause’ the incident,” says Zwieback.


Thoughts Evoked By CircleCI’s July 2015 Outage

An interesting article analyzing the recent post-mortem from the CircleCI team. It makes some good points about using an RDBMS as a queueing system…

After having a bit of downtime, CircleCI’s team have been very kind to post a very detailed Post Mortem. I’m a post mortem junkie, so I always appreciate when companies are honest enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex systems lens. Each one of these posts is an opportunity to learn and to reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn’t have done – hindsight is always 20/20, complex systems are difficult, and hidden interactions actually are hidden. Everyone’s infrastructures are full of traps like the one that ensnared them, and some days, you just land on the wrong square. Basically, that PM made me think of stuff, so here is that stuff. Nothing more.


How Complex Systems Fail–University of Chicago

A PDF from the Cognitive Technologies Laboratory at the University of Chicago.

(Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)

1) Complex systems are intrinsically hazardous systems.

All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature. The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. It is the presence of these hazards that drives the creation of defenses against hazard that characterize these systems.


Lessons learnt from reading post mortems–Dan Luu

A really good article from Dan Luu describing the knowledge he has gained from reading post mortems. One key takeaway: hardware errors and developers can cause issues, but to really flatten your infrastructure you need a config change.

I should also acknowledge my debt to Dan for a) giving me the idea for this site and b) providing a “seed” list of outage reports that I could use to populate this site.

Lessons Learned From Reading Postmortems

I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft.


Top Internet Outages of 2015–ThousandEyes

An interesting article and infographic from ThousandEyes on the top internet outages of 2015.

Now that we’ve rung in the new year, let’s look back at the state of Internet performance in 2015. We went through our archives to find the most impactful application and network outages and selected eight whose effects reverberated through many services, users and geographies.
