How to Write Great Outage Post-Mortems – Artsy

An article by Daniel Doubrovkine of Artsy on how to write great post-mortems.

The website is finally back up after crashing hard for 4 hours straight.

Recently AWS decided to reboot a few of our servers for a critical update. It didn’t seem like a big deal, except that the schedule was only accommodating if you were in the Pacific Northwest. The first reboot took out a secondary replica of our MongoDB database. Unfortunately the driver handled that poorly and spent the first 400ms of every subsequent HTTP request trying to reconnect to the missing instance. That server came back up, but failed to find its storage volumes because of a human mistake in a past migration, and the alerts were mistakenly silenced by someone monitoring the system. A few hours later the primary was stepped down and rebooted, sending the driver into a panic over another bug. The site went down.

Continue reading…
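The failure mode in that excerpt, a driver blocking every request while it retries a dead replica, is a classic one. As a rough illustration of the fail-fast idea (this is not the actual MongoDB driver code; `ReconnectGuard` and its parameters are invented for this sketch), you can cache a failed connection attempt so that later requests fail immediately instead of each paying the full reconnect timeout:

```python
import time

class ReconnectGuard:
    """Remember the last failed connection attempt so every request
    doesn't pay the full reconnect timeout (the 400ms-per-request
    failure mode described in the excerpt). Illustrative only."""

    def __init__(self, connect, retry_interval=5.0):
        self._connect = connect            # callable that raises ConnectionError on failure
        self._retry_interval = retry_interval
        self._next_attempt = 0.0           # earliest time a real reconnect is allowed

    def get_connection(self):
        now = time.monotonic()
        if now < self._next_attempt:
            # Fail fast: the host was recently down, don't block this request.
            raise ConnectionError("replica marked down; skipping reconnect")
        try:
            return self._connect()
        except ConnectionError:
            # Record the failure; subsequent requests short-circuit above.
            self._next_attempt = now + self._retry_interval
            raise

# Demo: 100 requests against a dead host trigger only one real attempt.
attempts = []
def flaky_connect():
    attempts.append(1)
    raise ConnectionError("host unreachable")

guard = ReconnectGuard(flaky_connect, retry_interval=60.0)
for _ in range(100):
    try:
        guard.get_connection()
    except ConnectionError:
        pass
```

Real drivers expose knobs for this sort of thing (connection and server-selection timeouts); the point of the sketch is just that the retry cost should be amortized across requests, not paid by each one.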

This is How Effective Leaders Move Beyond Blame – First Round

An interview with Dave Zwieback on how to effectively get to the root cause of incidents by removing blame from the analysis.

Contains some great quotes:

“Say there’s an incident and five minutes into the postmortem, we find out what happened and who’s responsible: Bobby and Susan screwed up. That feels good because there’s an unambiguous explanation: the so-called ‘root cause’. In this case, we’ve found our ‘bad apples,’ and can deal with them punitively so that such failures will never happen again. We may even feel better about our company culture and our colleagues if the individuals accept the blame and own up to what they did to ‘cause’ the incident,” says Zwieback.

Continue reading…

Thoughts Evoked By CircleCI’s July 2015 Outage

An interesting analysis of the recent post-mortem from the CircleCI team. It makes some good points about using an RDBMS as a queueing system…

After having a bit of downtime, CircleCI’s team have been very kind to post a very detailed Post Mortem. I’m a post mortem junkie, so I always appreciate when companies are honest enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex systems lens. Each one of these posts is an opportunity to learn and to reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn’t have done – hindsight is always 20/20, complex systems are difficult, and hidden interactions actually are hidden. Everyone’s infrastructures are full of traps like the one that ensnared them, and some days, you just land on the wrong square. Basically, that PM made me think of stuff, so here is that stuff. Nothing more.

Continue reading…
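For readers unfamiliar with the database-as-queue pattern the analysis discusses, here is a minimal sketch using sqlite3; the table and column names are invented for illustration. Jobs are rows, and a worker claims the oldest pending row inside a transaction:

```python
import sqlite3

# Hypothetical minimal job queue built on a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    state TEXT DEFAULT 'pending')""")

def enqueue(payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()

def claim():
    """Claim the oldest pending job inside a transaction."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE state = 'pending' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET state = 'running' WHERE id = ?",
                     (row[0],))
        return row

enqueue("build-1")
enqueue("build-2")
job = claim()  # oldest pending job
```

The appeal is obvious (transactions, durability, one fewer system to run); the trap, as incidents like this one show, is that under load every worker is polling and contending on the same hot rows and indexes, which is exactly where a purpose-built queue behaves differently.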

January 28th Incident Report – GitHub

Last week GitHub was unavailable for two hours and six minutes. We understand how much you rely on GitHub and consider the availability of our service one of the core features we offer. Over the last eight years we have made considerable progress towards ensuring that you can depend on GitHub to be there for you and for developers worldwide, but a week ago we failed to maintain the level of uptime you rightfully expect. We are deeply sorry for this, and would like to share with you the events that took place and the steps we’re taking to ensure you’re able to access GitHub.

The Event

Continue reading…

How Complex Systems Fail – University of Chicago

A PDF from the Cognitive Technologies Laboratory at the University of Chicago.

(Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety)

1) Complex systems are intrinsically hazardous systems.

All of the interesting systems (e.g. transportation, healthcare, power generation) are inherently and unavoidably hazardous by their own nature. The frequency of hazard exposure can sometimes be changed but the processes involved in the system are themselves intrinsically and irreducibly hazardous. It is the presence of these hazards that drives the creation of defenses against hazard that characterize these systems.

Continue reading…

Summary of the AWS Service Event in the US East Region

We’d like to share more about the service disruption which occurred last Friday night, June 29th, in one of our Availability Zones in the US East-1 Region. The event was triggered during a large scale electrical storm which swept through the Northern Virginia area. We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we’ll be taking to mitigate these issues in the future. Continue reading…
