Link to Original Report
On November 25 at 4:30 AM, our consumer apps stopped working because of an issue with our backend API. This article is a postmortem of what happened that night.
Our product search and navigation are served from Elasticsearch. We build a single index containing products, related merchant data, and locality data under different mappings. This index is rebuilt daily, and the fresh index is then swapped with the existing one under a fixed alias. This works well.
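The daily alias swap described above can be sketched with Elasticsearch's `_aliases` API, which applies a remove-and-add pair atomically. This is a minimal sketch, not the team's actual code: the alias name `products` and the dated index names are hypothetical, and only the request body is built here.

```python
# Sketch of a daily index swap under a fixed alias, assuming the
# Elasticsearch _aliases API. Index and alias names are illustrative.

def alias_swap_actions(alias, old_index, new_index):
    """Build the _aliases request body that removes the old daily
    index from the alias and adds the newly built one. Elasticsearch
    applies both actions atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_actions("products", "products-2014-11-24", "products-2014-11-25")
# POST this body to /_aliases; because the swap is atomic, clients
# querying the "products" alias never see a half-switched state.
```

The point of the alias indirection is that consumers always query one stable name while the backing index changes underneath them once a day.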
Most of our requests are geolocation dependent, which is the reason why we are so heavily dependent on Elasticsearch.
The current state of our systems is that we have an API (called Consumer API) which serves product data curated using another system, the Content Management System (CMS). The CMS is the system we use to curate our catalog and other content that changes frequently (daily, or multiple times a day) and is published to our consumer app. Curating this data through the CMS is the responsibility of our Content Team.
Since Wednesday, we have been working to help a subset of customers take final steps to fully recover from Tuesday’s storage service interruption. The incident has now been resolved and we are seeing normal activity in the system.
Link to Original Report
We would like to share the details of what occurred during the outage on 5/27/2014 in our us-east-1 datacenter, what we have learned, and what actions we are taking to prevent this from happening again. On behalf of all of Joyent, we are extremely sorry for this outage and the severe inconvenience it may have caused to you and your customers.
To understand the event, we first need to explain a few basics about the architecture of our datacenters. All of Joyent's datacenters run our SmartDataCenter product, which provides centralized management of all administrative services and of the compute nodes (servers) used to host customer instances. The architecture of the system is built such that the control plane, which includes both the API and boot sequences, is highly available within a single datacenter and survives any two failures. In addition to this control plane stack, every server in the datacenter has a daemon on it that responds to normal, machine-generated requests for things like provisioning, upgrades, and changes related to maintenance.
This is a postmortem by the SEC of the system errors at Knight Capital that ultimately caused a $440m loss. The actual report is a PDF, but I have reproduced the technical sections below.
August 1, 2012 and Related Events
Preparation for NYSE Retail Liquidity Program
To enable its customers’ participation in the Retail Liquidity Program (“RLP”) at the New York Stock Exchange, which was scheduled to commence on August 1, 2012, Knight made a number of changes to its systems and software code related to its order handling processes. These changes included developing and deploying new software code in SMARS. SMARS is an automated, high-speed, algorithmic router that sends orders into the market for execution. A core function of SMARS is to receive orders passed from other components of Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity, send one or more representative (or “child”) orders to external venues for execution.
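The parent/child routing described above can be illustrated with a toy sketch. This is not Knight's actual logic: the venue names, sizes, and fill model are hypothetical. It only shows the core invariant that child orders should stop once the parent order's remaining quantity reaches zero.

```python
# Toy sketch of parent/child order routing, assuming a hypothetical
# fill model: split a parent order into child orders no larger than
# the liquidity available at each venue. Venues and sizes are
# illustrative, not taken from the SEC report.

def route_parent_order(parent_qty, venue_liquidity):
    """Return (children, remaining): a list of (venue, child_qty)
    child orders, plus any quantity left unrepresented. Routing
    stops as soon as the parent order is fully represented."""
    children = []
    remaining = parent_qty
    for venue, available in venue_liquidity:
        if remaining <= 0:
            break  # parent fully represented; send no more children
        child = min(remaining, available)
        children.append((venue, child))
        remaining -= child
    return children, remaining

children, unfilled = route_parent_order(1000, [("NYSE", 400), ("ARCA", 300), ("BATS", 500)])
# children == [("NYSE", 400), ("ARCA", 300), ("BATS", 300)], unfilled == 0
```

Tracking the remaining parent quantity is exactly the bookkeeping that matters here: a router that loses track of how much of the parent has been filled will keep emitting child orders indefinitely.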
This morning at 09:47 UTC CloudFlare effectively dropped off the Internet. The outage affected all of CloudFlare’s services including DNS and any services that rely on our web proxy. During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error. Pings and Traceroutes to CloudFlare’s network resulted in a “No Route to Host” error.
Link to Original Report
Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.
If you did a Google search between 6:30 a.m. PST and 7:25 a.m. PST this morning, you likely saw that the message “This site may harm your computer” accompanied each and every search result. This was clearly an error, and we are very sorry for the inconvenience caused to our users.