Starting last Thursday, Heroku suffered the worst outage in the nearly four years we’ve been operating. Large production apps using our dedicated database service may have experienced up to 16 hours of operational downtime. Some smaller apps using shared databases may have experienced up to 60 hours of operational downtime. Code deploys were unavailable across some parts of the platform for almost 76 hours – over three days. In short: this was an absolute disaster.
It’s no secret that there was a huge Amazon EC2 outage exactly corresponding to the beginning of our downtime, so one can easily surmise that this was the root cause of Heroku’s downtime as well. This post will reference the AWS services that we use behind the scenes so that we can be very specific. Note that although we will be discussing various AWS service failures, we don’t blame them in any way for what our customers experienced. Heroku takes 100% of the responsibility for the downtime affecting our customers last week.
What Happened: First 12 Hours
On April 21, 2011 at 8:15 UTC (or 1AM in our timezone), alerts began coming in from our monitoring systems. We opened an incident on our status page (follow @herokustatus to get these updates via Twitter). We saw what appeared to be widespread network errors that were resulting in timeouts in our web, caching, and routing services. We began investigating and immediately opened a support ticket with AWS at the highest priority.
For the first several hours of the outage, we tried shutting down misbehaving instances and replacing them with new ones. This is our standard handling of EC2 issues of this nature. Our platform is designed for this, and it typically works very well, causing minimal disruption to our customers. In this case, however, things only got worse.
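In rough outline, the standard playbook amounts to cycling unhealthy instances out of the fleet. This is an illustrative sketch only, not Heroku's actual code: the fleet, health check, and launch function here are stand-ins for real EC2 API calls.

```python
def replace_unhealthy(fleet, is_healthy, launch_replacement):
    """Terminate instances failing health checks and launch replacements.

    `fleet` is a mutable list of instance IDs; `is_healthy` and
    `launch_replacement` stand in for monitoring and EC2 API calls.
    """
    replaced = []
    for instance in list(fleet):        # iterate over a snapshot
        if not is_healthy(instance):
            fleet.remove(instance)      # "terminate" the misbehaving node
            fleet.append(launch_replacement())  # boot a fresh one in its place
            replaced.append(instance)
    return replaced

# Hypothetical example: instances i-2 and i-4 fail their checks.
fleet = ["i-1", "i-2", "i-3", "i-4"]
bad = {"i-2", "i-4"}
new_ids = iter(["i-5", "i-6"])
replaced = replace_unhealthy(fleet, lambda i: i not in bad, lambda: next(new_ids))
```

The key property is that the loop assumes a replacement instance can actually be launched and become healthy; during this outage, that assumption broke down.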
The biggest problem was our use of EBS drives, AWS’s persistent block storage solution. We use this on instances which require state, mainly databases, but a few other types of nodes as well. Our EBS drives were becoming more and more unpredictable in their behavior, in some cases becoming completely unresponsive, even after detaching from their current instance and re-attaching to a new one.
We were in direct contact with our technical account manager at AWS the entire time, who provided us potential workarounds. Unfortunately, these workarounds were not helping, and the failures grew even more widespread.
Historically, the best move for us in these incidents is to do our best to keep things running (killing unhealthy instances, etc.) and wait for AWS to resolve things. Rarely has that taken more than an hour or two.
In this case, the EC2 outage lasted a total of about 12 hours. In the afternoon on Thursday, we were able to begin starting new instances en masse and we believed we’d be just an hour or two away from recovery. The majority of applications were back up on Thursday afternoon, but it took us much longer to recover the remaining ones.
What Happened: The Long Haul
Unfortunately, while EC2 was more or less fully operational again, the EBS system was not. As you can see from the AWS status page, the EBS outage lasted a total of 80+ hours. Heroku was able to get back online more quickly than that thanks to help from our contacts at AWS and hard work from our engineers.
While most applications were back online within 16 hours, there were still many applications on the affected shared database servers. We were also having problems with some of the servers that we use to process git pushes for deployment, which meant that the applications hosted on those servers could not have new code deployed, even if they were otherwise online.
The next 48 hours were spent with our engineers working closely with AWS to restore service as quickly as possible. We saw slow but steady progress for 36 hours, with servers continually returning to service as their underlying EBS disks started responding again.
We also worked with customers that had applications that were online but weren’t deployable because of the ongoing problems with some of our git servers. We were able to, on a case-by-case basis, create new repositories for these customers to push to, which allowed them to deploy while we worked to bring the original servers back online.
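From the customer's side, the workaround amounted to pushing to a freshly provisioned repository instead of the stuck one. The following is a minimal, self-contained sketch of that flow; the paths and names are illustrative, with a local bare repository standing in for the replacement repository Heroku would provision on a healthy git server.

```shell
set -e
workdir=$(mktemp -d)

# The replacement repository provisioned on a healthy git server (simulated
# here as a local bare repo).
git init --bare "$workdir/myapp-temp.git"

# The customer's local checkout of their application.
git init "$workdir/myapp"
cd "$workdir/myapp"
git -c user.email=dev@example.com -c user.name=dev \
    commit --allow-empty -m "app code"

# Point a new remote at the replacement repository and deploy through it.
git remote add heroku-temp "$workdir/myapp-temp.git"
git push heroku-temp HEAD
```

Once the original git servers came back, customers could simply push to their original remote again.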
How We Responded
Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem.
Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all times. Our support, data, and other engineering teams also worked around the clock.
We prioritized getting top-paying customers back online over our larger base of free users, which is why customers (particularly those with dedicated databases) were back online much more quickly than free apps. While we think this prioritization makes sense, we do strive to provide a high level of service to everyone. Even though the outage was much shorter for our top customers (less than 16 hours in most cases) than for our free users (as much as 3 days in some cases), we measure our downtime as the time it took to get 100% of apps back online.
We updated our status page throughout the incident. Some folks have complained that our updates lacked detail, or (in many cases) repeated previous updates. This is something we’ll strive to improve, but it’s actually a lot harder than it sounds. There are large swaths of time where it’s simply a matter of continuing to restore databases from backups and otherwise bring replacement systems online – one hour doesn’t look too different from the next. What matters is that we’ve got a full crew working hard at bringing everything back online, and the status updates are there to let everyone know we’re still hard at work.
Where We Go From Here
It hardly needs stating, but we never want to put our customers, our users, or our engineering team through this again.
Failures at the IaaS layer will happen. It’s Heroku’s responsibility to shield our customers from this; part of our value proposition is to abstract away these concerns. We failed at this in a big way this weekend, and our engineers are already hard at work on architectural changes that will allow us to handle infrastructure outages of this magnitude with minimal or no disruption to our customers in the future.
There are three major lessons about IaaS we’ve learned from this experience:
1) Spreading across multiple availability zones in a single region does not provide as much partitioning as we thought.
Therefore, we’ll be taking a hard look at spreading to multiple regions. We’ve explored this option many times in the past – not for availability reasons, but for customers wishing to have their infrastructure more physically nearby for latency or legal reasons. We’ve always chosen to prioritize it below other ways we could spend our time. It’s a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and find some way to propagate region information between the app and the services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region’s impact on availability, we’ll be considering it a much higher priority.
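To make the "router chosen by geographic homing" idea concrete, here is a hypothetical sketch of the resolution logic. None of these region names, endpoints, or mappings are real; they only illustrate the kind of configuration multi-region support would push out to users.

```python
# Hypothetical geographic DNS homing: answer with the router pool for the
# region closest to the client. All names below are invented for illustration.
REGION_ROUTERS = {
    "us-east": "route.us-east.example.com",
    "eu-west": "route.eu-west.example.com",
}

# Coarse geo lookup: map a client's country to its nearest region.
COUNTRY_TO_REGION = {"US": "us-east", "DE": "eu-west", "FR": "eu-west"}

def resolve_router(client_country, default="us-east"):
    """Pick the router endpoint for a client, falling back to a default region."""
    region = COUNTRY_TO_REGION.get(client_country, default)
    return REGION_ROUTERS[region]
```

Note the knock-on complexity this implies: add-on providers would need an equivalent mapping so that latency-sensitive services land in the same region as the app.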
2) Block storage is not a cloud-friendly technology.
EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can. Block storage has physical locality that can’t easily be transferred. That makes it not a cloud-friendly technology. With this information in hand, we’ll be taking a hard look at how to reduce our dependence on EBS.
3) Continuous database backups for all.
One reason why we were able to fix the dedicated databases more quickly has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we were able to take advantage of this to quickly recover the dedicated databases that were down with EBS problems.
We have been porting this continuous backup system to our shared database servers for some time and were finishing up testing at the time of the outage. We’ve previously relied on point backups of individual databases in the event of a failure rather than the continuous full server backups that the new system makes use of. We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.
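Continuous backup for PostgreSQL is typically built on write-ahead log (WAL) archiving: every WAL segment is shipped off-host as it is produced, so a server can be rebuilt from a base backup plus replayed WAL. As an illustration only (not Heroku's actual configuration), the mechanism is enabled with settings along these lines:

```
# postgresql.conf -- illustrative continuous-archiving settings
archive_mode = on
archive_command = 'cp %p /backups/wal/%f'   # ship each WAL segment off-host
```

Recovery then works by restoring a base backup and pointing a matching restore_command at the archive, replaying WAL up to the desired point in time. This is what allows a whole server to be reconstructed on fresh hardware rather than restoring each database from a point backup.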
This outage was not acceptable. We don’t want to ever put our customers through something like this again and we’re working as hard as we can on making sure that we won’t ever have to. The level of support and patience we received from customers who had every right to be frustrated was amazing. We appreciate your trust in us and we’re going to live up to it.
On the bright side, we couldn’t be more proud of the work of our Ops, Database, and Support teams, and all of our engineers during this incident. Whether or not AWS suffers an outage of this magnitude ever again, we’re glad to have the extra impetus to build Heroku into an ever-more resilient platform.