Steam Outage: How to Monitor Data Center Connectivity - ThousandEyes

This is not an official Valve post mortem, but rather an analysis by ThousandEyes, a company that makes network troubleshooting tools.


Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
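
The article's core argument is that ingress connectivity should be watched from outside the data center. As a minimal sketch of that idea (my own illustration, not ThousandEyes' tooling; the endpoint, interval, and timeout below are placeholders), the snippet times a TCP connect to a public service endpoint from a single external vantage point and flags failures:

```python
# Minimal sketch of an external ingress check: repeatedly time a TCP connect
# to a service endpoint from outside the data center and flag failures.
# The target host/port are placeholders, not an actual monitored probe.
import socket
import time

TARGET = ("store.steampowered.com", 443)  # hypothetical endpoint to watch
INTERVAL_S = 30
TIMEOUT_S = 5

def probe(target, timeout):
    start = time.monotonic()
    try:
        with socket.create_connection(target, timeout=timeout):
            return time.monotonic() - start  # TCP connect latency in seconds
    except OSError:
        return None  # unreachable: possible ingress problem

if __name__ == "__main__":
    while True:
        latency = probe(TARGET, TIMEOUT_S)
        if latency is None:
            print(f"{time.ctime()}  {TARGET[0]}:{TARGET[1]}  UNREACHABLE")
        else:
            print(f"{time.ctime()}  {TARGET[0]}:{TARGET[1]}  connect {latency * 1000:.1f} ms")
        time.sleep(INTERVAL_S)
```

A real monitoring setup, as the article goes on to describe, would probe from many vantage points so a single flaky path does not trigger a false alarm.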

Continue reading…

Kafkapocalypse: a postmortem on our service outage - Parse.ly

On Thursday, March 26 and Friday, March 27, 2015, Parse.ly experienced several outages of its data processing backend. The result was several hours during which no new data appeared in our analytics dashboards, both for existing customers of our production analytics product at dash.parsely.com and for select beta customers of our new product at preview.parsely.com.

Continue reading…

Network Issues within London Datacenter - Linode

Incident Report

On January 31, 2015, from approximately 2:05 AM until 3:30 AM EST, a subset of London Linodes experienced packet loss on all of their network communication. Network and operations engineers were contacted immediately, and troubleshooting began approximately 20 minutes after the start of the disruption, once Linode’s engineers had been briefed on the symptoms and the facts known at the time.
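
For context on how an event like this shows up from the outside, here is a rough sketch (my own illustration, not Linode's tooling; it assumes a Linux or macOS `ping` and uses a placeholder host) that estimates packet loss to a host by parsing the system ping summary line:

```python
# Rough sketch: estimate packet loss to a host by shelling out to the system
# `ping` (Linux/macOS output format assumed) and parsing the summary line.
# The host below is a placeholder, not one of Linode's London addresses.
import re
import subprocess

def packet_loss_percent(host, count=20):
    """Return the reported packet-loss percentage, or None if unparseable."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    )
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    loss = packet_loss_percent("example.com")
    if loss is None:
        print("ping produced no usable output")
    elif loss > 0:
        print(f"WARNING: {loss:.1f}% packet loss observed")
    else:
        print("no packet loss observed")
```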

Continue reading…

Facebook Outage Deep Dive - Post Mortem

In the late evening of January 26th, Facebook had its largest outage in more than four years. For approximately an hour both Facebook and Instagram (owned by Facebook) were completely down, along with numerous other affected sites such as Tinder and HipChat.

Facebook’s own post mortem and statements suggested the outage occurred “after we introduced a change that affected our configuration systems.”

Now, three days later, a lot has been written about the outage, much of it only partially accurate. Let’s take the Facebook post mortem as a starting point and see how the outage unfolded. Follow along with the blog post using this share link to the interactive data set for the event. You’ll want to take a look at the HTTP Server and Path Visualization views.

Continue reading…

Incident Report: DNSimple DDoS Attack - Codeship

Please accept our apologies for the recent issue affecting your service, starting at 19:09 UTC on December 1, 2014.

What happened?

The vendor that provides Codeship’s domain name service (DNS), DNSimple, experienced a major volumetric distributed denial of service (DDoS) attack that impacted their service availability. DNSimple has issued an incident report detailing their outage as a result of the DDoS attack.
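
To make the dependency concrete: when an authoritative DNS provider is knocked over, the zone stops resolving everywhere, not just from one network. The hedged sketch below (it relies on the third-party dnspython package; the domain and resolver list are placeholders, not Codeship's actual checks) asks several public resolvers for the same record and raises an alert if none of them answer:

```python
# Hedged sketch: detect a provider-wide DNS failure by asking several public
# resolvers for the same record. If none of them can answer, the problem is
# likely upstream (e.g. the authoritative DNS provider), not a local cache.
# Requires the third-party `dnspython` package; the domain is a placeholder.
import dns.exception
import dns.resolver

PUBLIC_RESOLVERS = ["8.8.8.8", "8.8.4.4", "208.67.222.222"]
DOMAIN = "codeship.com"  # placeholder zone to check

def resolves(domain, nameserver, timeout=3.0):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = timeout
    try:
        answers = resolver.resolve(domain, "A")
        return [rr.to_text() for rr in answers]
    except dns.exception.DNSException:
        return None

if __name__ == "__main__":
    results = {ns: resolves(DOMAIN, ns) for ns in PUBLIC_RESOLVERS}
    for ns, addrs in results.items():
        print(f"{ns}: {addrs if addrs else 'no answer'}")
    if not any(results.values()):
        print(f"ALERT: {DOMAIN} does not resolve anywhere -- suspect the DNS provider")
```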

Continue reading…

FCA Final Notice 2014: Royal Bank of Scotland Plc, National Westminster Bank Plc and Ulster Bank Ltd - RBS

The full report is contained in a PDF on the linked page, but I have reproduced the IT-relevant sections below.


The Root Cause of the IT Incident

The batch scheduler failure.

Banks generally update that day’s transactions in the evening. They use a software tool known as a batch scheduler to process those updates. A batch scheduler coordinates the order in which data underlying the updates is processed. The data includes information about customer withdrawals and deposits, interbank clearing, money market transactions, payroll processing, and requests to change standing orders and addresses. The processes underlying the updates are called “jobs”. Batch schedulers place the jobs into queues and ensure that each job is processed in the correct sequence. That day’s batch processing is complete when all balances are final.
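
As a toy illustration of the ordering work a batch scheduler does (the job names and dependencies below are invented; the real RBS batch involved far more jobs and a commercial scheduler), the sketch runs each job only after everything it depends on has completed:

```python
# Toy illustration of what a batch scheduler does: each overnight "job"
# declares which jobs must finish before it, and the scheduler runs them in a
# dependency-respecting order. Job names and dependencies are hypothetical.
from graphlib import TopologicalSorter

# job -> set of jobs that must complete first (invented for illustration)
BATCH = {
    "load_card_withdrawals": set(),
    "load_deposits": set(),
    "interbank_clearing": {"load_card_withdrawals", "load_deposits"},
    "apply_standing_orders": {"interbank_clearing"},
    "payroll_processing": {"interbank_clearing"},
    "finalise_balances": {"apply_standing_orders", "payroll_processing"},
}

def run_batch(jobs):
    """Run every job once, in an order that respects its dependencies."""
    for job in TopologicalSorter(jobs).static_order():
        print(f"running {job} ...")  # a real scheduler also retries, alerts, etc.
    print("batch complete: all balances final")

if __name__ == "__main__":
    run_batch(BATCH)
```

When the scheduler itself fails, as in this incident, the queued jobs cannot be sequenced or run, and the day's balances are never finalised.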

Continue reading…

The one where we accidentally DDoS'd ourselves - Keen IO

Link to Original Report

An old boss of mine had a phrase he used whenever shit hit the fan. You know, those special and sometimes amazing circumstances that inevitably happen when you’re trying to run a bunch of computers toward a common goal. He was fond of saying: “opportunities arise”.

This summer at Keen IO, we had a series of postmortem-worthy “opportunities” that threatened everything we hold near and dear: the stability of our platform, the trust of our customers, and the sanity of our team. Please allow us to tell you about them in cathartic detail.

Continue reading…