Tagged in: Analysis

Thoughts Evoked By CircleCI’s July 2015 Outage

Interesting article and analysis of the recent post mortem from the CircleCI team. It makes some good points about using a RDBMS as a queueing system…

After having a bit of downtime, CircleCI’s team have been very kind to post a very detailed Post Mortem. I’m a post mortem junkie, so I always appreciate when companies are honest enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex systems lens. Each one of these posts is an opportunity to learn and to reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn’t have done – hindsight is always 20/20, complex systems are difficult, and hidden interactions actually are hidden. Everyone’s infrastructures are full of traps like the one that ensnared them, and some days, you just land on the wrong square. Basically, that PM made me think of stuff, so here is that stuff. Nothing more.

Continue reading…

Lessons learnt from reading post mortems–Dan luu

A really good article form Dan Luu of describing the knowledge he has gained from reading post mortems. One key takeaway – hardware errors and developers can cause issues, but to really flatten your infrastructure you need a config change.

I should also acknowledge my debt to Dan for a) giving me the idea for this site and b) providing a “seed” list of outage reports that I could used to populate this site.

Lessons Learned From Reading Postmortems

I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft.

Continue reading…