Early Sunday morning, September 20, we had a DynamoDB service event in the US-East Region that impacted DynamoDB customers in US-East, as well as some other services in the region. The following are some additional details on the root cause, subsequent impact to other AWS services that depend on DynamoDB, and corrective actions we’re taking.
Earlier today, most Google users who use logged-in services like Gmail, Google+, Calendar and Documents found they were unable to access those services for approximately 25 minutes. For about 10 percent of users, the problem persisted for as much as 30 minutes longer. Whether the effect was brief or lasted the better part of an hour, please accept our apologies—we strive to make all of Google’s services available and fast for you, all the time, and we missed the mark today.
(Note: this is being posted with Foursquare’s permission.)
As many of you are aware, Foursquare had a significant outage this week. The outage was caused by capacity problems on one of the machines hosting the MongoDB database used for check-ins. This is an account of what happened, why it happened, how it can be prevented, and how 10gen is working to improve MongoDB in light of this outage.
Another PDF file containing the final analysis of the 2003 power blackout in the northeastern United States. The relevant root cause information is included below.
How and Why the Blackout Began in Ohio
This chapter explains the major events—electrical, computer, and human—that occurred as the blackout evolved on August 14, 2003, and identifies the causes of the initiation of the blackout. The period covered in this chapter begins at 12:15 Eastern Daylight Time (EDT) on August 14, 2003, when inaccurate input data rendered MISO’s state estimator (a system monitoring tool) ineffective. At 13:31 EDT, FE’s Eastlake 5 generation unit tripped and shut down automatically. Shortly after 14:14 EDT, the alarm and logging system in FE’s control room failed and was not restored until after the blackout. After 15:05 EDT, some of FE’s 345-kV transmission lines began tripping out because the lines were contacting overgrown trees within the lines’ right-of-way areas.
By around 15:46 EDT, when FE, MISO, and neighboring utilities had begun to realize that the FE system was in jeopardy, the only way the blackout might have been averted would have been to drop at least 1,500 MW of load around Cleveland and Akron. No such effort was made, however, and by 15:46 EDT it may already have been too late for a large load-shed to make any difference. After 15:46 EDT, the loss of some of FE’s key 345-kV lines in northern Ohio caused its underlying network of 138-kV lines to begin to fail, leading in turn to the loss of FE’s Sammis-Star 345-kV line at 16:06 EDT. The chapter concludes with the loss of FE’s Sammis-Star line, the event that triggered the uncontrollable 345-kV cascade portion of the blackout sequence.
The loss of the Sammis-Star line triggered the cascade because it shut down the 345-kV path into northern Ohio from eastern Ohio. Although the area around Akron, Ohio was already blacked out due to earlier events, most of northern Ohio remained interconnected and electricity demand was high. This meant that the loss of the heavily overloaded Sammis-Star line instantly created major and unsustainable burdens on lines in adjacent areas, and the cascade spread rapidly as lines and generating units automatically tripped by protective relay action to avoid physical damage.
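The mechanism described above—a tripped line shedding its load onto neighbors, which then exceed their own limits and are tripped by protective relays—can be illustrated with a toy simulation. This is only a hedged sketch with hypothetical line names, loads, and capacities (real power flow redistributes according to network impedances, not evenly), but it shows how a single heavily loaded line's failure can cascade:

```python
def simulate_cascade(loads, capacities, initial_trip):
    """Toy cascade model with hypothetical numbers.

    loads/capacities: dicts of line name -> MW.
    Returns the order in which lines trip. Load from each tripped
    line is redistributed evenly across surviving lines (a gross
    simplification of real power-flow physics).
    """
    live = {name: load for name, load in loads.items() if name != initial_trip}
    to_shed = loads[initial_trip]   # MW that must move onto other lines
    tripped = [initial_trip]
    while to_shed > 0 and live:
        # Redistribute the shed load across the surviving lines
        share = to_shed / len(live)
        for name in live:
            live[name] += share
        to_shed = 0
        # Protective relays trip any line now above its rated capacity
        for name in [n for n in live if live[n] > capacities[n]]:
            to_shed += live.pop(name)
            tripped.append(name)
    return tripped

# Hypothetical network: losing heavily loaded line A overloads the rest.
lines = {"A": 900, "B": 500, "C": 400, "D": 200}
caps = {"A": 1000, "B": 600, "C": 700, "D": 800}
print(simulate_cascade(lines, caps, "A"))  # prints ['A', 'B', 'C', 'D']
```

In this made-up example, line A's 900 MW load overloads B, and B's failure in turn overloads C and D—mirroring, in miniature, how the Sammis-Star loss placed unsustainable burdens on adjacent lines.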