Category Archives: Outage

January 28th Incident Report – GitHub

Last week GitHub was unavailable for two hours and six minutes. We understand how much you rely on GitHub and consider the availability of our service one of the core features we offer. Over the last eight years we have made considerable progress towards ensuring that you can depend on GitHub to be there for you and for developers worldwide, but a week ago we failed to maintain the level of uptime you rightfully expect. We are deeply sorry for this, and would like to share with you the events that took place and the steps we’re taking to ensure you’re able to access GitHub.

The Event

Continue reading…

Summary of the AWS Service Event in the US East Region

We’d like to share more about the service disruption which occurred last Friday night, June 29th, in one of our Availability Zones in the US East-1 Region. The event was triggered during a large scale electrical storm which swept through the Northern Virginia area. We regret the problems experienced by customers affected by the disruption and, in addition to giving more detail, also wanted to provide information on actions we’ll be taking to mitigate these issues in the future. Continue reading…

503 Service Unavailable – Spreedly

Incident Report for Spreedly

A mistake was made last night configuring firewalls in our new rack. While these devices are not yet a part of the service, they are connected to the switches, so the changes prevented packets from reaching places they were intended to go. This was an oversight, and we apologize for causing a disruption.

Update on Christmas Issues – Steam

We’d like to follow up with more information regarding Steam’s troubled Christmas.

What happened

On December 25th, a configuration error resulted in some users seeing Steam Store pages generated for other users. Between 11:50 PST and 13:20 PST store page requests for about 34k users, which contained sensitive personal information, may have been returned and seen by other users.

Continue reading…

Outage postmortem (2015-12-17 UTC) – Stripe

Link to Original Report

Summary

On Thursday, December 17th UTC, failures in an internal event queueing system led to the API being partially degraded for 9 minutes and then completely degraded for 44 minutes. During this time, users were unable to use the Stripe API, Checkout, or the Dashboard.

We’re continuing to investigate the underlying issues that led to this API degradation, but we want to be transparent and share what we know so far and how we have responded.

Continue reading…

Outage due to Elasticsearch’s flexibility and our carelessness – Grofers

Link to Original Report

On November 25 at 4:30 AM, our consumer apps stopped working because of an issue with our backend API. This article is a postmortem of what happened that night.

Some Background

Our product search and navigation are served from Elasticsearch. Each day we build a new index containing products, related merchant data, and locality data, all in one index but under different mappings. Once the build finishes, the latest index replaces the existing one under a fixed alias. This works well.
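The daily swap under a fixed alias can be sketched roughly as follows. This is a minimal illustration, not Grofers' actual code: the alias name, the index prefix, and the date-based naming scheme are assumptions, since the post does not give them. The request body follows the shape accepted by Elasticsearch's `_aliases` endpoint, where a `remove` and an `add` action sent together move the alias in a single step.

```python
import json
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Return a dated index name such as 'products-2015.11.25'.

    The naming scheme is hypothetical; the original post does not state one.
    """
    return f"{prefix}-{day:%Y.%m.%d}"


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build a request body for Elasticsearch's `_aliases` endpoint.

    Both actions travel in one request, so readers of the alias never see
    a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    today = date(2015, 11, 25)
    yesterday = today - timedelta(days=1)
    body = build_alias_swap(
        alias="products",  # hypothetical fixed alias name
        old_index=daily_index_name("products", yesterday),
        new_index=daily_index_name("products", today),
    )
    # This JSON would be sent as the body of POST /_aliases.
    print(json.dumps(body, indent=2))
```

Because clients only ever query the alias, a swap like this is cheap to perform, but it also means a badly built index becomes live the moment the alias moves, so validating the new index before the swap is worth considering.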

Most of our requests are geolocation dependent, which is the reason why we are so heavily dependent on Elasticsearch.

Currently, an API (called the Consumer API) serves product data curated in another system, the Content Management System (CMS). We use the CMS to curate our catalog and other content that changes frequently (daily or several times a day) and is published to our consumer app. Our Content Team is responsible for curating this data in the CMS.

The Incident

Continue reading…

400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services – Google

SUMMARY:

On Monday 7 December 2015, Google Container Engine customers could not create external load balancers for their services for a duration of 21 hours and 38 minutes. If your service or application was affected, we apologize; this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

Continue reading…
