Tagged in: Config

Steam Outage: How to Monitor Data Center Connectivity–THOUSAND EYES

This is not an official Valve post mortem, rather an analysis by the company Thousand Eyes who make network troubleshooting tools.


Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.

Continue reading…

Facebook Outage Deep Dive–POST MOrtem

In the late evening of January 26th, Facebook had its largest outage in more than four years. For approximately an hour both Facebook and Instagram (owned by Facebook) were completely down, along with numerous other affected sites such as Tinder and HipChat.

Facebook’s own post mortem and statements suggested the outage occurred “after we introduced a change that affected our configuration systems.”

Now, three days later a lot has been written about the outage, much of it only partially accurate. Let’s take the Facebook post mortem as a starting point and see how the outage unfolded. Follow along the blog post with the interactive data set using this share link of the event. You’ll want to take a look at the HTTP Server and Path Visualization views. Continue reading…

Today’s outage for several Google services–Google

Earlier today, most Google users who use logged-in services like Gmail, Google+, Calendar and Documents found they were unable to access those services for approximately 25 minutes. For about 10 percent of users, the problem persisted for as much as 30 minutes longer. Whether the effect was brief or lasted the better part of an hour, please accept our apologies—we strive to make all of Google’s services available and fast for you, all the time, and we missed the mark today.

Continue reading…

Turbocharging Solr Index Replication with BitTorrent–Post mortem

 

It’s mentioned in passing, but by changing how they replicate indexes to their search servers Etsy managed to take their site offline.


Many of you probably use BitTorrent to download your favorite ebooks, MP3s, and movies.  At Etsy, we use BitTorrent in our production systems for search replication.

Search at Etsy

Search at Etsy has grown significantly over the years. In January of 2009 we started using Solr for search. We used the standard master-slave configuration for our search servers with replication.

Continue reading…

Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region – Amazon

Link to Original Report

Now that we have fully restored functionality to all affected services, we would like to share more details with our customers about the events that occurred with the Amazon Elastic Compute Cloud (“EC2”) last week, our efforts to restore the services, and what we are doing to prevent this sort of issue from happening again. We are very aware that many of our customers were significantly impacted by this event, and as with any significant service issue, our intention is to share the details of what happened and how we will improve the service for our customers.

Continue reading…