
Outage due to Elasticsearch’s flexibility and our carelessness – Grofers

Link to Original Report

On November 25 at 4:30 AM, our consumer apps stopped working because of an issue with our backend API. This article is a postmortem of what happened that night.

Some Background

Our product search and navigation are served from Elasticsearch. Every day we build a fresh index containing products, related merchant data and locality data, each under its own mapping, and then switch a fixed alias from the existing index to the newly built one. This works well.
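For readers unfamiliar with the pattern, a minimal sketch of this daily rebuild-and-swap is shown below, assuming the elasticsearch-py client (7.x-style calls); the index and alias names and the single localhost node are illustrative, not Grofers’ actual setup:

```python
# Minimal sketch of a daily rebuild-and-alias-swap (illustrative names only).
from datetime import date
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

ALIAS = "catalog"                              # fixed alias queried by the API
new_index = f"catalog-{date.today():%Y%m%d}"   # today's freshly built index

# 1. Create today's index (mappings for products, merchants and localities elided).
es.indices.create(index=new_index)

# ... bulk-index products, merchant data and locality data here ...

# 2. Atomically repoint the alias from the old index(es) to the new one.
actions = []
if es.indices.exists_alias(name=ALIAS):
    for old_index in es.indices.get_alias(name=ALIAS):
        actions.append({"remove": {"index": old_index, "alias": ALIAS}})
actions.append({"add": {"index": new_index, "alias": ALIAS}})
es.indices.update_aliases(body={"actions": actions})
```

Because both alias actions are submitted in a single request, queries against the alias never see an empty or half-built index during the switch.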

Most of our requests are geolocation-dependent, which is why we depend so heavily on Elasticsearch.

The current state of our systems is that we have an API (called Consumer API) which serves product data curated through another system, the Content Management System (CMS). CMS is the system we use to curate our catalog and other content that changes frequently (daily, or multiple times a day) and is published to our consumer app. Our Content Team is responsible for curating this data in CMS.

The Incident

Continue reading…

Postmortem for outage of us-east-1 – Joyent

Link to Original Report

We would like to share the details of what occurred during the outage on 5/27/2014 in our us-east-1 datacenter, what we have learned, and what actions we are taking to prevent this from happening again. On behalf of all of Joyent, we are extremely sorry for this outage and the severe inconvenience it may have caused you and your customers.

Background

To understand the event, we first need to explain a few basics about the architecture of our datacenters. All of Joyent’s datacenters run our SmartDataCenter product, which provides centralized management of all administrative services and of the compute nodes (servers) used to host customer instances. The architecture of the system is built such that the control plane, which includes both the API and boot sequences, is highly available within a single datacenter and survives any two failures. In addition to this control plane stack, every server in the datacenter has a daemon on it that responds to normal, machine-generated requests for things like provisioning, upgrades, and changes related to maintenance.

Continue reading…

In the Matter of Knight Capital Americas LLC, Respondent – Post Mortem

This is a post mortem by the SEC of the system errors at Knight Capital that ultimately caused a $440m loss. The actual report is in PDF format but I have reproduced the technical sections below.

August 1, 2012 and Related Events

Preparation for NYSE Retail Liquidity Program

To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
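As a purely illustrative aside (not Knight’s code), the parent/child split the SEC describes can be sketched in a few lines of Python; the order types, venue names and the naive liquidity model below are invented for clarity:

```python
# Toy illustration of the parent/child order pattern described above.
# Names, data structures and the liquidity model are invented; this is not SMARS.
from dataclasses import dataclass

@dataclass
class ParentOrder:
    symbol: str
    side: str          # "buy" or "sell"
    quantity: int      # total shares requested by the upstream trading platform

@dataclass
class ChildOrder:
    symbol: str
    side: str
    quantity: int
    venue: str         # external execution venue

def route(parent: ParentOrder, liquidity_by_venue: dict[str, int]) -> list[ChildOrder]:
    """Split a parent order into child orders, venue by venue, until the
    parent quantity is covered or the available liquidity runs out."""
    children: list[ChildOrder] = []
    remaining = parent.quantity
    for venue, available in liquidity_by_venue.items():
        if remaining <= 0:
            break
        qty = min(remaining, available)
        if qty > 0:
            children.append(ChildOrder(parent.symbol, parent.side, qty, venue))
            remaining -= qty
    return children

# Example: a 1,500-share parent order routed against three venues' displayed liquidity.
print(route(ParentOrder("XYZ", "buy", 1500), {"NYSE": 600, "ARCA": 500, "BATS": 700}))
```

The crucial (and, in Knight’s case, fatal) detail is the bookkeeping of how much of the parent order remains outstanding; the rest of the report describes what happened when that accounting went wrong.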

Continue reading…

Today’s Outage Post Mortem – CloudFlare

This morning at 09:47 UTC, CloudFlare effectively dropped off the Internet. The outage affected all of CloudFlare’s services, including DNS and any services that rely on our web proxy. During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error. Pings and traceroutes to CloudFlare’s network resulted in a “No Route to Host” error.
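To make those two symptoms concrete, here is a small, generic probe (not anything from CloudFlare’s report) that distinguishes a DNS resolution failure from a network-level “no route to host” error; the hostname and port are just examples:

```python
# Separate the two failure modes described above: a DNS error (name does not
# resolve) versus "No Route to Host" (name resolves but the path is gone).
import errno
import socket

def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    try:
        addr = socket.getaddrinfo(host, port)[0][4][0]   # resolve the name
    except socket.gaierror:
        return f"{host}: DNS resolution failed"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return f"{host} ({addr}): reachable"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return f"{host} ({addr}): no route to host"
        return f"{host} ({addr}): connection failed ({exc})"

print(probe("www.cloudflare.com"))
```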

Continue reading…

Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region – Amazon

Link to Original Report

Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.

Continue reading…