Please accept our apologies for the recent issue affecting your service, starting at 19:09 UTC on December 1, 2014.
The vendor that provides Codeship’s domain name service (DNS), DNSimple, experienced a major volumetric distributed denial of service (DDoS) attack which impacted their service availability. DNSimple has issued an incident report detailing their outage as a result of the DDoS attack.
At 19:09 UTC on December 1st, when DNSimple’s DNS service became unavailable, Codeship customers were unable to resolve codeship.com domain names once the records expired from cache. The TTL on Codeship’s DNS records was set to 3600 seconds, so by 20:09 UTC, most of the world could not connect to Codeship despite all systems operating normally. Our monitoring service alerted us to the failure, and our status page and Twitter accounts were updated with details. Unfortunately, the DNS failure also affected external access to our status page, preventing us from doing a better job of communicating status updates.
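The timing above follows directly from the TTL. As a minimal sketch (the function name is hypothetical and used only for illustration), the latest moment any resolver could still serve a cached record is the outage start plus the TTL, assuming the record was fetched just before the outage began:

```python
from datetime import datetime, timedelta, timezone


def latest_cache_expiry(outage_start, ttl_seconds):
    """Latest time a resolver could still hold a record fetched just before the outage."""
    return outage_start + timedelta(seconds=ttl_seconds)


# Values taken from the incident: outage start and the original 3600-second TTL.
outage_start = datetime(2014, 12, 1, 19, 9, tzinfo=timezone.utc)
print(latest_cache_expiry(outage_start, 3600).isoformat())
# prints 2014-12-01T20:09:00+00:00
```

Resolvers that had cached the record earlier would have lost it sooner, which is why failures appeared gradually over that hour.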
How did we respond and recover?
As we monitored DNSimple’s progress in recovering from the DDoS attack, it became clear that the timeframe in which they expected to restore service was unacceptable and would require us to take more drastic action. We made the decision to move temporarily to another DNS provider. This process was complicated by the fact that DNSimple’s management interface was unavailable. Fortunately, our continuous deployment process for DNS records meant we had easy access to the information we needed. At 03:14 UTC on December 2, Codeship’s DNS records were successfully being returned by the new provider, and service was restored.
Several actions were taken and planned in response to this incident to prevent future occurrences. First, the TTL on Codeship’s DNS records was increased to 24 hours; because resolvers now cache the records for longer, this allows for much greater resiliency to a failure like this one. Second, the Codeship status page was moved to a new address on a new domain name registered with a different provider. Additionally, we now have an immediate solution for moving Codeship’s DNS records from one provider to another in case of failure. Longer term, we are researching a more robust solution with secondary DNS servers on a new provider that would act as slaves to our primary servers at DNSimple, allowing for automatic failover if the primary servers fail. DNSimple is working on the necessary feature that would allow this.
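To illustrate why the longer TTL helps, here is a rough sketch comparing the old and new TTL against the length of this outage. It rests on a simplifying assumption that is ours, not a measurement: that resolver cache ages were spread uniformly over the TTL when the outage began. The function name is hypothetical.

```python
from datetime import datetime, timezone


def fraction_still_cached(outage_seconds, ttl_seconds):
    """Estimated share of resolvers still holding the record after outage_seconds.

    Assumes cache ages were uniformly distributed over the TTL at outage start.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return max(0.0, 1.0 - outage_seconds / ttl_seconds)


# Outage window from the incident: 19:09 UTC Dec 1 to 03:14 UTC Dec 2.
start = datetime(2014, 12, 1, 19, 9, tzinfo=timezone.utc)
end = datetime(2014, 12, 2, 3, 14, tzinfo=timezone.utc)
outage = (end - start).total_seconds()

for ttl in (3600, 86400):  # old TTL (1 hour) and new TTL (24 hours)
    print(ttl, round(fraction_still_cached(outage, ttl), 2))
```

Under that assumption, no resolver would still have the record at the end of the outage with the one-hour TTL, while roughly two thirds would with the 24-hour TTL. Real cache behavior varies by resolver, so treat the figures as an illustration only.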