Tagged in: Configuration
Incident Report for Spreedly
We’d like to follow up with more information regarding Steam’s troubled Christmas.
On December 25th, a configuration error resulted in some users seeing Steam Store pages generated for other users. Between 11:50 PST and 13:20 PST, responses to store page requests for about 34,000 users, some containing sensitive personal information, may have been returned to and seen by other users.
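Valve’s fuller statement attributed the leak to web caching rules deployed during a denial-of-service mitigation, which let pages rendered for one logged-in user be served from a shared cache to others. Below is a minimal sketch of that general failure mode; the cache, page renderer, and fix are hypothetical illustrations, not Valve’s actual stack.

```python
# Hypothetical reverse-proxy cache illustrating the failure mode: responses
# are cached by URL alone, so a page rendered for one logged-in user is
# replayed to every later visitor of the same URL.
cache = {}

def render_store_page(path, user):
    # Stand-in for the origin server: the page embeds per-user data.
    return f"<html>Store page for {user} at {path}</html>"

def handle_request(path, session_user):
    if path in cache:                       # BUG: cache key ignores the session
        return cache[path]
    response = render_store_page(path, session_user)
    cache[path] = response                  # personalized page enters the shared cache
    return response

print(handle_request("/account", "alice"))  # rendered for alice
print(handle_request("/account", "bob"))    # bob is served alice's cached page

def handle_request_fixed(path, session_user):
    # One fix: key the cache on the session too (the effect of honoring
    # Vary: Cookie), or simply never cache responses marked
    # Cache-Control: private.
    key = (path, session_user)
    if key not in cache:
        cache[key] = render_store_page(path, session_user)
    return cache[key]
```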
This is not an official Valve post mortem, but rather an analysis by ThousandEyes, a company that makes network troubleshooting tools.
Data centers are the factories of the Internet, churning out computation and services that customers demand. Apple is currently converting a 1.3M sq ft factory into a data center. And like factories, data centers rely on critical services to stay running. In the case of a data center, the most important inputs are electricity, cooling and network connectivity. Typically, redundant supplies of each input are available should there be a problem. Let’s examine an outage that occurred yesterday to see the importance of monitoring data center ingress, or network connectivity into the data center.
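The post’s point is that ingress has to be watched from outside the facility: a health check running inside the data center can keep passing while every path into it is dead. A minimal sketch of such an external probe, with placeholder endpoints:

```python
# Minimal external probe for data center ingress: from a vantage point
# outside the facility, verify the network path in is alive by opening
# TCP connections to a few front-door services. Endpoints are hypothetical.
import socket
import time

ENDPOINTS = [
    ("www.example-dc.com", 443),   # front-end web tier
    ("ns1.example-dc.com", 53),    # authoritative DNS
]

def probe(host, port, timeout=3.0):
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError as exc:
        return False, exc

for host, port in ENDPOINTS:
    ok, detail = probe(host, port)
    status = f"up ({detail * 1000:.0f} ms)" if ok else f"DOWN ({detail})"
    print(f"{host}:{port} {status}")
```

Run from several external vantage points, a probe like this catches an ingress failure that internal monitoring, by construction, cannot see.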
On August 25, 2014, there was an outage of all Stack Exchange sites (Q&A sites as well as Careers) from 7:26 pm to 7:32 pm UTC (approximately 6 minutes). The cause was an incorrect change to the network firewall configuration – specifically, to iptables running on our HAProxy load balancers.
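A standard guard against exactly this failure is to make firewall changes self-reverting: snapshot the rules, apply the change, and roll back automatically unless an operator who can still reach the box confirms. The sketch below follows the spirit of iptables-apply; it is not Stack Exchange’s actual tooling, and it assumes a Linux host with root privileges.

```python
# Apply an iptables ruleset with a dead man's switch: snapshot the current
# rules, load the new ones, then restore the snapshot automatically unless
# an operator confirms within a grace period.
import select
import subprocess
import sys

GRACE_SECONDS = 30

def run(cmd, **kw):
    return subprocess.run(cmd, check=True, **kw)

# 1. Snapshot the live ruleset.
snapshot = run(["iptables-save"], capture_output=True, text=True).stdout

# 2. Load the proposed ruleset (path passed on the command line).
run(["iptables-restore", sys.argv[1]])

# 3. Wait for confirmation. If the new rules cut us off (or nobody answers),
#    stdin never delivers a line and the old rules come back on their own.
print(f"New rules loaded. Press Enter within {GRACE_SECONDS}s to keep them.")
ready, _, _ = select.select([sys.stdin], [], [], GRACE_SECONDS)
if ready and sys.stdin.readline():
    print("Change confirmed.")
else:
    subprocess.run(["iptables-restore"], input=snapshot, text=True, check=True)
    print("No confirmation; previous ruleset restored.")
```

Usage would be something like `sudo python apply_rules.py new_rules.v4`: a rule that locks out the operator also locks out the confirmation, so the bad change undoes itself.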
This morning at 09:47 UTC, CloudFlare effectively dropped off the Internet. The outage affected all of CloudFlare’s services, including DNS and any services that rely on our web proxy. During the outage, anyone accessing CloudFlare.com or any site on CloudFlare’s network would have received a DNS error. Pings and traceroutes to CloudFlare’s network resulted in a “No Route to Host” error.
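The two symptoms quoted are the same failure seen at different layers: name resolution failed because CloudFlare’s DNS service was itself unreachable, while clients that already held an IP address could resolve nothing further but got no route to it. A small probe that tells the two apart; the hostname is just an example of a site behind the affected network.

```python
# Separate the two failure signals described above: does the name resolve
# at all (DNS), and if it does, can we actually reach the address (routing)?
import errno
import socket

HOST = "www.cloudflare.com"  # example: any site on the affected network

try:
    addr = socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP)[0][4][0]
except socket.gaierror as exc:
    print(f"DNS failure: {exc}")           # name never resolves
else:
    try:
        with socket.create_connection((addr, 443), timeout=5):
            print(f"{HOST} ({addr}) reachable")
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            print(f"Routing failure: no route to {addr}")  # the traceroute symptom
        else:
            print(f"Connection failed: {exc}")
```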
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted, first of all, to apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.