Tagged in: Control Plane

Postmortem for outage of us-east-1 – Joyent

 

Link to Original Report

We would like to share the details of what occurred during the outage on 5/27/2014 in our us-east-1 datacenter, what we have learned, and what actions we are taking to prevent this from happening again. On behalf of all of Joyent, we are extremely sorry for this outage and the severe inconvenience it may have caused you and your customers.

Background

To understand the event, we first need to explain a few basics about the architecture of our datacenters. All of Joyent’s datacenters run our SmartDataCenter product, which provides centralized management of all administrative services and of the compute nodes (servers) used to host customer instances. The system is architected so that the control plane, which includes both the API and boot sequences, is highly available within a single datacenter and survives any two failures. In addition to this control plane stack, every server in the datacenter runs a daemon that responds to normal, machine-generated requests for things like provisioning, upgrades, and changes related to maintenance.

Continue reading…

Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region – Amazon

Link to Original Report

Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.

Continue reading…