On Thursday, December 17th UTC, failures in an internal event queueing system led to the API being partially degraded for 9 minutes and then completely degraded for 44 minutes. During this time, users were unable to use the Stripe API, Checkout, or the Dashboard.
We’re continuing to investigate the underlying issues that led to this API degradation, but we want to be transparent and share what we know so far and how we have responded.
CircleCI is a platform for continuous delivery. This means (among other things) we’re building serious distributed systems: hundreds of servers managing thousands of containers, coordinating between all the moving parts, and taking care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.
Last Tuesday, we experienced a severe and lengthy outage, during which our build queue was at a complete standstill. The entire company scrambled into firefighting mode to get the queue unlocked and customer builds moving again. Here’s what happened.