Thoughts Evoked By CircleCI’s July 2015 Outage

Interesting article and analysis of the recent post mortem from the CircleCI team. It makes some good points about using a RDBMS as a queueing system…

After having a bit of downtime, CircleCI’s team have been very kind to post a very detailed Post Mortem. I’m a post mortem junkie, so I always appreciate when companies are honest enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex systems lens. Each one of these posts is an opportunity to learn and to reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn’t have done – hindsight is always 20/20, complex systems are difficult, and hidden interactions actually are hidden. Everyone’s infrastructures are full of traps like the one that ensnared them, and some days, you just land on the wrong square. Basically, that PM made me think of stuff, so here is that stuff. Nothing more.

Database As A Queue

The post mortem states:

Our build queue is not a simple queue, but must take into account customer plans and container allocation in a complex platform. As such, it’s built on top of our main database.

As soon as I read that, I knew exactly what happened. I’d lived this exact problem before, so here’s that story:

At Flickr, we would put everything into MySQL until it didn’t work anymore. This included the Offline Tasks queue (aside: good grief, this post was written in 2008). One day, we had an issue that slowed down the processing of tasks. The queue filled up like it was supposed to, but when we finished fixing the original problem, we noticed that the queue was not draining. In fact, it was still filling up at almost the same rate as during the outage.

Read the Rest of the Post Here…