A really good article from Dan Luu describing the lessons he has learned from reading postmortems. One key takeaway – hardware errors and developer mistakes can cause issues, but to really take down your infrastructure you need a config change.
I should also acknowledge my debt to Dan for a) giving me the idea for this site and b) providing a “seed” list of outage reports that I could use to populate it.
Lessons Learned From Reading Postmortems
I love reading postmortems. They’re educational, but unlike most educational docs, they tell an entertaining story. I’ve spent a decent chunk of time reading postmortems at both Google and Microsoft.
High API latency caused by multiple issues in AWS including SQS API errors
AWS has declared their SQS issues resolved. DynamoDB is mostly fixed – AWS is still throttling requests, but not to a level that impacts our service. We are closing this incident.
An article from David Mytton, CEO of Server Density, on how they write their postmortems.
When sufficiently elaborate systems begin to scale, it’s only a matter of time before some sort of failure happens.
An interesting article and infographic from ThousandEyes on the top internet outages of 2015.
Now that we’ve rung in the new year, let’s look back at the state of Internet performance in 2015. We went through our archives to find the most impactful application and network outages and selected eight whose effects reverberated through many services, users and geographies.
We’d like to follow up with more information regarding Steam’s troubled Christmas.
On December 25th, a configuration error resulted in some users seeing Steam Store pages generated for other users. Between 11:50 PST and 13:20 PST store page requests for about 34k users, which contained sensitive personal information, may have been returned and seen by other users.
Link to Original Report
On Thursday, December 17th UTC, failures in an internal event queueing system led to the API being partially degraded for 9 minutes and then completely degraded for 44 minutes. During this time, users were unable to use the Stripe API, Checkout, or the Dashboard.
We’re continuing to investigate the underlying issues that led to this API degradation, but we want to be transparent and share what we know so far and how we have responded.
Link to Original Report
On November 25 at 4:30 AM, our consumer apps stopped working because of an issue with our backend API. This article is a postmortem of what happened that night.
Our product search and navigation is served from Elasticsearch. Each day we build a fresh index containing products, related merchant data, and locality data under different mappings, then switch it with the existing index under a fixed alias. This works well.
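The alias switch described above is what lets the app keep querying one stable name while the underlying index is rebuilt each day. A minimal sketch of how such a swap can be expressed (the alias name `products` and the `products-YYYY-MM-DD` index naming scheme are assumptions for illustration, not taken from the original report) is to build the body for Elasticsearch’s `_aliases` endpoint, which applies the remove and add actions atomically:

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the request body for Elasticsearch's POST /_aliases endpoint.

    Both actions are applied atomically by the cluster, so readers querying
    the alias never hit a moment where no index is behind it.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical daily rollover: point the "products" alias at today's index.
body = build_alias_swap("products", "products-2015-11-24", "products-2015-11-25")
print(json.dumps(body, indent=2))
```

The point of the single atomic request (rather than a separate remove followed by an add) is that a client querying the alias between two requests would otherwise get an “index not found” error.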
Most of our requests are geolocation dependent, which is the reason why we are so heavily dependent on Elasticsearch.
The current state of our systems is that we have an API (called Consumer API) which serves product data curated using another system called Content Management System (CMS). CMS is the system which we use for curating our catalog and other content that changes frequently (on a day-to-day basis or multiple times a day) and is published to our consumer app. The responsibility of curating this data using CMS is of our Content Team.
On Monday 7 December 2015, Google Container Engine customers could not create external load balancers for their services for a duration of 21 hours and 38 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.