This is not an official Valve post mortem, rather an analysis by the company Thousand Eyes who make network troubleshooting tools.
Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
Steam Outage on May 7th
Starting at 11:45 Pacific on May 7th and lasting for 2 hours, the popular online gaming service Steam suffered a widespread outage. Steam regularly has more than 8 million concurrent users, so it’s one of the bigger web services out there (Figure 1).
There were reports of users around the world of being unable to login or use the Steam store. Let’s break down what happened.
Figure 1: Steam players online May 7th 2015, with a 2 hour gap.
A (Nearly) Global Outage
We were monitoring the Steam API at the time of the outage. You can follow the interactive data here. The first thing we noticed was a dramatic drop in API availability, as shown in Figure 2.
Figure 2: Two hour widespread outage of the Steam API.
The Steam API was unavailable from more than half of the 15 locations we tested over the 2 hour outage (Figure 3).
Figure 3: Only 2 of 15 locations were able to reach the Steam API at 1:00pm Pacific time.
So what went wrong? Let’s dig into the diagnostics. The first clue is in the HTTP errors that we recorded. They were all TCP connection errors, typically indicating some sort of network error or congestion (Figure 4).
Figure 4: Lack of Steam API availability due to TCP connection errors.
To confirm this, we can take a look at packet loss and latency. We see that packet loss spikes at the same time that availability dips, from 11:45am-1:45pm (Figure 5).
Figure 5: Packet loss of up to 87% coincides with availability drop.
Troubles in the Data Center
So we have a situation with high levels of packet loss from many locations around the world. Let’s take a look at a Path Visualization to see where the network instability is coming from.
While Steam content delivery is handled from many distributed locations, the authentication and storefront services for the platform are served out of a Seattle-area CenturyLink data center. We can see this data center in the visualization in light green on the right hand side (Figure 6).
Figure 6: Only traffic from Qwest and Comcast (highlighted in yellow) is able to access Steam’s data center.
Steam’s data center is typically served by five primary ISPs. During the outage, traffic paths from Comcast and Qwest are successfully reaching the Steam data center, but paths from other upstream ISPs—Level 3, Telia and Abovenet/Zayo—are not (Figure 7).
Figure 7: Nodes in three ISPs where traffic is terminating just before the Steam data center.
Only Some Roads Lead to Seattle
When traffic from some networks completely terminates, while traffic from others is properly delivered, its usually a physical network failure (broken link or router in an IXP) or an issue with routing. So let’s look into the routing plane. At the BGP layer we can see massive routing instability (Figure 8).
Figure 8: AS path changes spike during the outage, as traffic is rerouted.
Looking at the AS paths themselves, we see over time that Qwest is being preferred while other routes to Steam (Valve network AS32590) are no longer advertised (Figure 9).
Figure 9: AS paths to Steam’s network AS32590 (in green) are being routed only through Qwest AS209.
Red links are active paths, dotted links were active in the previous time period.
Based on the evidence, our best guess is that a routing configuration change by Steam was the cause of the outage. We will update this if Steam releases a post-mortem.
Monitoring Data Center Ingress
Keeping the network flowing to the data center is critical. Without the network your data center, at least for most applications, is as good as useless. Understanding the ingress routes to your data center is just as critical as checking on the physical fibers that rise up through the data center vaults. You can easily monitor your own data center ingress, including Path Visualization and Route Visualization shown above, with the free version of ThousandEyes.