Tagged in: Network

503 Service Unavailable – Spreedly

Incident Report for Spreedly

A mistake was made last night while configuring firewalls in our new rack. Although these devices are not yet part of the service, they are connected to the switches, so the changes prevented packets from reaching their intended destinations. This was an oversight, and we apologize for causing a disruption.

Cloud Platform Status – Google

SUMMARY

On Tuesday, 10 November 2015, outbound traffic going through one of our European routers from both Google Compute Engine and Google App Engine experienced high latency for a duration of 6 hours and 43 minutes. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

Continue reading…

Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region – Amazon

Link to Original Report

Early Sunday morning, September 20, we had a DynamoDB service event in the US-East Region that impacted DynamoDB customers in US-East, as well as some other services in the region. The following are some additional details on the root cause, subsequent impact to other AWS services that depend on DynamoDB, and corrective actions we’re taking.

Continue reading…

Steam Outage: How to Monitor Data Center Connectivity – ThousandEyes

This is not an official Valve post mortem, but rather an analysis by ThousandEyes, a company that makes network troubleshooting tools.


Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
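As a rough illustration of what monitoring data center ingress can look like, here is a minimal sketch of an external probe that measures TCP connect latency to a few public entry points. It is not ThousandEyes’ tooling, and the hostnames and ports are hypothetical placeholders you would swap for your own load balancers, VPN gateways, or other ingress endpoints.

# connectivity_probe.py - toy external probe for data center ingress.
# The hostnames and ports below are hypothetical placeholders; point them
# at your own ingress endpoints (load balancers, VPN gateways, etc.).
import socket
import time
from typing import Optional

ENDPOINTS = [
    ("store.example.com", 443),   # e.g. a storefront served from the data center
    ("api.example.com", 443),     # e.g. an API entry point
]

def tcp_connect_latency(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        latency = tcp_connect_latency(host, port)
        status = f"{latency:.1f} ms" if latency is not None else "UNREACHABLE"
        print(f"{host}:{port} -> {status}")

Run on a schedule from several outside vantage points, even a probe this simple will surface the kind of ingress loss described in the Steam outage.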

Continue reading…

Kafkapocalypse: A Postmortem on Our Service Outage – Parse.ly

On Thursday, March 26 and Friday, March 27, 2015, Parse.ly experienced several outages of its data processing backend. The result was several hours during which no new data appeared in our analytics dashboards, both for existing customers of our production analytics product at dash.parsely.com and for select beta customers of our new product at preview.parsely.com.

Continue reading…

Network Issues within London Datacenter – Linode

Incident Report

On January 31, 2015, from approximately 2:05 AM EST until 3:30 AM EST, a subset of London Linodes experienced packet loss on all of their network communication. Network and operations engineers were immediately contacted, and troubleshooting began approximately 20 minutes after the start of the disruption, once Linode’s engineers were briefed on the symptoms and facts known at the time.

Continue reading…

Facebook Outage Deep Dive – Post Mortem

In the late evening of January 26th, Facebook had its largest outage in more than four years. For approximately an hour both Facebook and Instagram (owned by Facebook) were completely down, along with numerous other affected sites such as Tinder and HipChat.

Facebook’s own post mortem and statements suggested the outage occurred “after we introduced a change that affected our configuration systems.”

Now, three days later, a lot has been written about the outage, much of it only partially accurate. Let’s take the Facebook post mortem as a starting point and see how the outage unfolded. Follow along with the blog post and the interactive data set using this share link of the event. You’ll want to take a look at the HTTP Server and Path Visualization views.

Continue reading…

Turbocharging Solr Index Replication with BitTorrent – Post Mortem


It’s mentioned only in passing, but by changing how they replicate indexes to their search servers, Etsy managed to take their site offline.


Many of you probably use BitTorrent to download your favorite ebooks, MP3s, and movies.  At Etsy, we use BitTorrent in our production systems for search replication.

Search at Etsy

Search at Etsy has grown significantly over the years. In January of 2009 we started using Solr for search. We used the standard master-slave configuration for our search servers with replication.
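For readers unfamiliar with the setup the excerpt refers to: in a standard master-slave Solr deployment, slaves poll the master’s ReplicationHandler over HTTP and pull index updates from it. The sketch below shows one way such a setup might be checked for replication lag using that standard HTTP API. It is only an illustration under assumptions — the hostnames and the core name ("listings") are hypothetical placeholders, not anything from Etsy’s actual configuration.

# solr_replication_check.py - rough sketch of checking master-slave Solr
# replication via the standard ReplicationHandler HTTP API.
# Hostnames and the core name ("listings") are hypothetical placeholders.
import json
from urllib.request import urlopen

MASTER = "http://solr-master.example.com:8983/solr/listings"
SLAVES = [
    "http://solr-slave1.example.com:8983/solr/listings",
    "http://solr-slave2.example.com:8983/solr/listings",
]

def index_version(base_url: str) -> int:
    """Ask a core's ReplicationHandler for its current index version."""
    url = base_url + "/replication?command=indexversion&wt=json"
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)["indexversion"]

if __name__ == "__main__":
    master_version = index_version(MASTER)
    for slave in SLAVES:
        delta = master_version - index_version(slave)
        state = "in sync" if delta == 0 else f"behind (version delta {delta})"
        print(f"{slave}: {state}")

This is only meant to show the stock pull-based model behind the "standard master-slave configuration" mentioned above; the post itself is about using BitTorrent for that replication.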

Continue reading…

Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region – Amazon

Link to Original Report

Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.

Continue reading…