The actual post mortem of the NATS outage is a PDF file, but I have included the software-related section below. The incident closed runways at Heathrow and Gatwick airports, caused over 70 flights to be cancelled, and delayed more than 50 others.
This is not an official Valve post mortem, but rather an analysis by ThousandEyes, a company that makes network troubleshooting tools.
Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
Kafkapocalypse: a postmortem on our service outage – parse.ly
On Thursday, March 26 and Friday, March 27, 2015, Parse.ly experienced several outages of its data processing backend. The result was several hours of no new data appearing in our analytics dashboards, both for existing customers of our production analytics product at dash.parsely.com, and for select beta customers of our new product, at preview.parsely.com.
On January 31, 2015, at approximately 2:05 AM EST until 3:30 AM EST, a subset of London Linodes experienced packet loss on all of their network communication. Network and operations engineers were immediately contacted and troubleshooting began approximately 20 minutes after the start of the disruption, once Linode’s engineers were briefed on the symptoms and facts that were known at the time.
In the late evening of January 26th, Facebook had its largest outage in more than four years. For approximately an hour both Facebook and Instagram (owned by Facebook) were completely down, along with numerous other affected sites such as Tinder and HipChat.
Now, three days later, a lot has been written about the outage, much of it only partially accurate. Let’s take the Facebook post mortem as a starting point and see how the outage unfolded. You can follow along with the blog post using this share link to the interactive data set of the event. You’ll want to take a look at the HTTP Server and Path Visualization views. Continue reading…
Please accept our apologies for the recent issue affecting your service, starting at 19:09 UTC on December 1, 2014.
The vendor that provides Codeship’s domain name service (DNS), DNSimple, experienced a major volumetric distributed denial of service (DDoS) attack which impacted their service availability. DNSimple has issued an incident report detailing their outage as a result of the DDoS attack.
FCA Final Notice 2014: Royal Bank of Scotland Plc, National Westminster Bank Plc and Ulster Bank Ltd–RBS
The full report is contained in a pdf on the linked page, but I have reproduced the IT-relevant sections below.
The Root Cause of the IT Incident
The batch scheduler failure.
Banks generally update that day’s transactions in the evening. They use a software tool known as a batch scheduler to process those updates. A batch scheduler coordinates the order in which data underlying the updates is processed. The data includes information about customer withdrawals and deposits, interbank clearing, money market transactions, payroll processing, and requests to change standing orders and addresses. The processes underlying the updates are called “jobs”. Batch schedulers place the jobs into queues and ensure that each job is processed in the correct sequence. That day’s batch processing is complete when all balances are final.
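The batch-scheduling model described above (jobs placed in queues and processed in the correct sequence) can be sketched in a few lines. This is a minimal illustration, not RBS's actual scheduler; the job names and dependency structure are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly jobs mapped to the jobs they depend on: each job
# may only run once all of its predecessors have completed.
jobs = {
    "load_transactions": set(),
    "interbank_clearing": {"load_transactions"},
    "payroll_processing": {"load_transactions"},
    "update_standing_orders": {"load_transactions"},
    "finalize_balances": {"interbank_clearing", "payroll_processing",
                          "update_standing_orders"},
}

def run_batch(jobs):
    """Run jobs in a valid sequence; the batch is complete when every
    job has run and balances are final."""
    order = list(TopologicalSorter(jobs).static_order())
    for job in order:
        print(f"running {job}")  # stand-in for real job execution
    return order

order = run_batch(jobs)
```

A real batch scheduler adds queue management, restarts, and failure handling on top of this ordering; it was that layer, not the ordering itself, that failed in the RBS incident.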
Since Wednesday, we have been working to help a subset of customers take final steps to fully recover from Tuesday’s storage service interruption. The incident has now been resolved and we are seeing normal activity in the system.
An old boss of mine had a phrase he used whenever shit hit the fan. You know, those special and sometimes amazing circumstances that inevitably happen when you’re trying to run a bunch of computers toward a common goal. He was fond of saying: “opportunities arise”.
This summer at Keen IO, we had a series of postmortem-worthy “opportunities” that threatened everything we hold near and dear: the stability of our platform, the trust of our customers, and the sanity of our team. Please allow us to tell you about them in cathartic detail. Continue reading…
On August 25, 2014 there was an outage of all Stack Exchange sites (Q&A sites as well as Careers) from 7:26 pm to 7:32 pm UTC (approximately 6 minutes). The cause was an incorrect change to network firewall configuration – specifically, iptables running on our HAProxy load balancers.