Outage postmortem (2015-12-17 UTC) – Stripe

Summary

On Thursday, December 17th UTC, failures in an internal event queueing system led to the API being partially degraded for 9 minutes and then completely degraded for 44 minutes. During this time, users were unable to use the Stripe API, Checkout, or the Dashboard.

We’re continuing to investigate the underlying issues that led to this API degradation, but we want to be transparent and share what we know so far and how we have responded.

Timeline

Beginning at 02:25 UTC, we experienced an increase in API traffic which, due to an unusual configuration, generated a very high amount of write amplification in our internal event queueing system.
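To give a concrete, simplified picture of what write amplification means here: a single logical event can turn into many physical writes once it is replicated and fanned out to multiple downstream queues. The sketch below is illustrative only; the names and fan-out numbers are assumptions, not our actual configuration.

```python
# Hypothetical illustration of write amplification. The fan-out and
# replication figures below are assumptions, not Stripe's actual setup.

REPLICATION_FACTOR = 3      # copies written per enqueue (assumed)
DOWNSTREAM_CONSUMERS = 5    # separate queues fed per event (assumed)

def physical_writes(api_events: int) -> int:
    """Physical queue writes generated by a batch of API events."""
    return api_events * DOWNSTREAM_CONSUMERS * REPLICATION_FACTOR

# A modest increase in API traffic becomes a much larger increase in writes:
print(physical_writes(1_000))   # 15000 physical writes
print(physical_writes(5_000))   # 75000 physical writes
```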

Between 02:25 and 02:55, this gradually slowed the event queueing system that was receiving events from the API servers. Although the queueing system's performance had degraded, the API itself was not yet affected. Eventually, our API servers began hitting write timeouts and entering the associated retry loop. A bug in the retry logic created a feedback loop: each retry increased the load on the queueing system, which further reduced performance and triggered still more retries.
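To illustrate the shape of the problem (this is a simplified sketch, not our actual client code; enqueue_event is an assumed placeholder), an unbounded retry-on-timeout loop looks like this:

```python
# Hypothetical sketch of the problematic retry pattern; enqueue_event and
# the TimeoutError handling here are assumptions, not Stripe's actual code.

def publish_with_unbounded_retries(event, enqueue_event, timeout=0.5):
    """Buggy pattern: retry immediately and forever on timeout.

    When the queue slows down, every producer runs this loop at once, so
    each timeout adds another request on top of the original traffic:
    load rises, the queue slows further, timeouts become more likely,
    and the system spirals into a retry storm.
    """
    while True:
        try:
            return enqueue_event(event, timeout=timeout)
        except TimeoutError:
            continue  # no cap, no backoff, no jitter: the feedback loop
```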

By 02:59, the Stripe API started experiencing a small rate of errors, which paged our engineers. At 03:01 we identified the event queueing system as the source of the degradation and initiated a response plan.

By 03:08, the retry feedback loop and associated performance degradation prevented us from accepting new API requests at all. We responded by spinning up more capacity for our event queueing system, as well as firewalling the queueing system from producers and allowing it to drain its accumulated backlog.
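In effect, "firewalling" the queueing system from producers means temporarily refusing to add new writes so the accumulated backlog can drain. A rough sketch of that idea, with hypothetical helper names (queue_depth, enqueue_event, fallback_log) that are assumptions rather than our actual tooling:

```python
# Hypothetical sketch: shed new enqueues while the queue drains its backlog.
# queue_depth() and enqueue_event() are assumed helpers, not real APIs.

DRAIN_THRESHOLD = 100_000   # backlog size above which producers are cut off (assumed)

def publish(event, enqueue_event, queue_depth, fallback_log):
    if queue_depth() > DRAIN_THRESHOLD:
        # "Firewalled": don't add load to the struggling queue; record the
        # event somewhere cheap (e.g., local disk) for later replay.
        fallback_log.append(event)
        return False
    enqueue_event(event)
    return True
```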

By 03:52, our response plan succeeded in restoring the event queueing system and we were able to resume service to the API.

Remediations

We’ve identified and fixed the bug in the retry logic that led to the cascading performance degradation, increased the capacity of our event queueing system to handle higher event volumes, and fixed the particular configuration that caused the high write amplification.
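While the exact fix is internal, a common way to correct this class of retry bug is to bound the number of attempts and back off exponentially with jitter, so producers stop retrying in lockstep against a degraded queue. A minimal sketch, with all names assumed:

```python
# Hypothetical sketch of a safer retry policy: bounded attempts,
# exponential backoff, and jitter. Not Stripe's actual implementation.
import random
import time

def publish_with_backoff(event, enqueue_event, timeout=0.5,
                         max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(max_attempts):
        try:
            return enqueue_event(event, timeout=timeout)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of hammering a degraded queue
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```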

We’re pursuing further analysis and remediation.