Cloud Platform Status – Google

SUMMARY

On Tuesday, 10 November 2015, outbound traffic from both Google Compute Engine and Google App Engine going through one of our European routers experienced high latency for a duration of 6h43m. If your service or application was affected, we apologize. This is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

Continue reading…

Linux build queue backing up – CircleCI

Incident Report for CircleCI

CircleCI is a platform for continuous integration and continuous delivery. We take care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.

We are sincerely sorry for the outage that prevented builds from running late Wednesday and early Thursday. We know you rely on us to deploy, and that downtime is painful for you and your customers. We take our responsibility to you very seriously, and we’re sorry we let you down.

Here’s what happened, what we learned, and what actions we’re taking to prevent this from happening again:

What We Saw
Continue reading…

Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region – Amazon

Link to Original Report

Early Sunday morning, September 20, we had a DynamoDB service event in the US-East Region that impacted DynamoDB customers in US-East, as well as some other services in the region. The following are some additional details on the root cause, subsequent impact to other AWS services that depend on DynamoDB, and corrective actions we’re taking.

Continue reading…

Outage report: 5 September 2015 – PythonAnywhere

From 20:30 to 23:50 UTC on 5 September, there were a number of problems on PythonAnywhere. Our own site, and those of our customers, were generally up and running, but were experiencing intermittent failures and frequent slowdowns. We’re still investigating the underlying cause of this issue; this blog post is an interim report.

What happened?

Continue reading…

High queue times on OS X builds (.com and .org) – Travis CI

Incident Report for Travis CI

On Tuesday, August 4th, we had a significant period of instability and outages in our OS X build environment for both open source and private repositories. We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for OS X building and testing. This is not at all acceptable to us. We are very sorry that it happened, and our entire Engineering team has implemented a number of changes to help prevent similar problems in the future.

Continue reading…

DB performance issue – CircleCI

CircleCI is a platform for continuous delivery. This means (among other things) we’re building serious distributed systems: hundreds of servers managing thousands of containers, coordinating between all the moving parts, and taking care of all the low-level details so that you have the simplest, fastest continuous integration and deployment possible.

Last Tuesday, we experienced a severe and lengthy downtime, during which our build queue was at a complete standstill. The entire company scrambled into firefighting mode to get the queue unlocked and customer builds moving again. Here’s what happened…
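The post does not say how CircleCI’s build queue is actually implemented. Purely as an illustration of the kind of database-backed queue that can seize up when the database struggles, here is a minimal worker sketch; it assumes PostgreSQL, the psycopg2 driver, and a hypothetical builds(id, status, payload) table, and is not CircleCI’s code.

    # Minimal sketch of a database-backed build-queue worker. Illustrative only:
    # assumes PostgreSQL, psycopg2, and a hypothetical table
    #   builds(id serial primary key, status text, payload jsonb).
    import time
    import psycopg2

    CLAIM_SQL = """
    UPDATE builds
       SET status = 'running'
     WHERE status = 'queued'
       AND id = (SELECT id
                   FROM builds
                  WHERE status = 'queued'
                  ORDER BY id
                  LIMIT 1
                    FOR UPDATE)
    RETURNING id, payload;
    """

    def run_worker(dsn):
        conn = psycopg2.connect(dsn)
        while True:
            with conn.cursor() as cur:
                # If the database is slow or the candidate row is locked, every
                # worker blocks right here and the whole queue stands still.
                cur.execute(CLAIM_SQL)
                row = cur.fetchone()
                conn.commit()
            if row is None:
                time.sleep(1)   # nothing queued, or another worker claimed the row
                continue
            build_id, payload = row
            # ... hand the claimed build off to a container and mark it finished ...

The point of the sketch is the single choke point: every worker funnels through one claim query, so a database performance problem stalls all builds at once rather than degrading gradually.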

Continue reading…

Serious Platform Outage – Divshot

This morning around 7:20am Pacific time, several platform EC2 instances began failing and our load balancer began returning 503 errors. Ordinarily our scaling configuration would terminate and replace unhealthy instances, but for an as-yet-undetermined reason all instances became unhealthy and were not replaced. This caused a widespread outage that lasted for nearly two hours.
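The excerpt does not include Divshot’s actual scaling configuration. As a rough sketch of the mechanism it alludes to, here is how an EC2 Auto Scaling group tied to classic ELB health checks is typically set up with boto3; all resource names and settings below are hypothetical and illustrative only.

    # Generic illustration of an EC2 Auto Scaling group wired to classic-ELB
    # health checks (hypothetical names; not Divshot's actual configuration).
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="platform-asg",        # hypothetical
        LaunchConfigurationName="platform-lc",      # hypothetical
        MinSize=2,
        MaxSize=10,
        DesiredCapacity=4,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
        LoadBalancerNames=["platform-elb"],         # hypothetical classic ELB
        HealthCheckType="ELB",        # trust the load balancer's health checks
        HealthCheckGracePeriod=300,   # after a five-minute boot grace period
    )

    # With HealthCheckType="ELB", an instance that fails the load balancer's
    # checks is marked Unhealthy, terminated, and replaced by the group. The
    # incident above is the failure mode where that replacement did not happen
    # even though every instance had become unhealthy.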

Continue reading…