
Outage due to Elasticsearch’s flexibility and our carelessness – Grofers


On November 25 at 4:30 AM, our consumer apps stopped working because of an issue with our backend API. This article is a postmortem of what happened that night.

Some Background

Our product search and navigation are served from Elasticsearch. Every day we build a new index that holds products, related merchant data, and locality data, all in one index but under different mappings. Once the new index is built, it replaces the existing one under a fixed alias. This works well.
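
A rough sketch of what such an alias switch can look like, for readers unfamiliar with the pattern. Elasticsearch accepts a list of remove and add actions in a single request to its _aliases endpoint and applies them atomically. The alias and index names below are hypothetical and are not taken from our actual setup; the function only builds the request body and does not contact a cluster.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that moves `alias` to `new_index`.

    Both actions travel in one request, so searches against the alias
    never see a moment where it points at no index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: a fixed alias and two dated daily indices.
    body = build_alias_swap(
        alias="products",
        old_index="products-2015-11-24",
        new_index="products-2015-11-25",
    )
    print(json.dumps(body, indent=2))
```

Because clients only ever query the alias, anything wrong with the freshly built index becomes visible to them the moment the switch happens, which is why checking the new index before switching matters.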

Most of our requests depend on geolocation, which is why we rely so heavily on Elasticsearch.

In our current setup, an API (called Consumer API) serves product data that is curated in another system, the Content Management System (CMS). We use the CMS to curate our catalog and other content that changes frequently (daily or several times a day) and is published to our consumer app. Our Content Team is responsible for curating this data in the CMS.

The Incident

Continue reading…

400 errors when trying to create an external (L3) Load Balancer for GCE/GKE services – Google

SUMMARY:

On Monday 7 December 2015, Google Container Engine customers could not 
create external load balancers for their services for a duration of 21 
hours and 38 minutes. If your service or application was affected, we 
apologize — this is not the level of quality and reliability we strive to 
offer you, and we have taken and are taking immediate steps to improve the 
platform’s performance and availability.

Continue reading…