Diagnosing and Handling EC2 EBS Issues

Last Tuesday morning, October 12th, our API service and many of our backend systems went down. The timing was particularly bad: it coincided with a scheduled maintenance window during which we upgraded one of our database systems, which led us to believe for some time that the upgrade was the cause of the outage. After much investigation we determined the root cause was severe EBS volume performance degradation.

What follows is an account of how we handled the incident and the steps we are taking to better mitigate the problem should it arise again. Hopefully this information can help anyone else using EC2 and EBS.

Post-mortem Summary

On October 11th at about 11:30pm PDT, Amazon began experiencing severe issues with EBS volumes in an availability zone in the US-East-1 region, which we use extensively. Each of the three database instances we relied on contained at least one bad volume in its RAID array. Once the root cause was determined, we submitted a support request to AWS and prepared our two slaves to take over. We worked through the night, and our services were restored at around 5:30am the following morning.

Our EBS Volume Setup

To ensure adequate disk performance, our EBS volumes are structured as striped RAID arrays on our database servers. When everything is going well, performance should be roughly uniform across the volumes in an array. However, because a striped array can only go as fast as its slowest member, one poorly performing volume significantly degrades the performance of the whole array.
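
For reference, here is a minimal sketch of how such a striped (RAID 0) array might be assembled from attached EBS volumes. The device names, array size, filesystem, and mount point are hypothetical, and the commands are wrapped in Python only to keep the examples in this post in one language; this is not our actual provisioning script.

    import subprocess

    # Hypothetical EBS block devices already attached to the instance.
    ebs_devices = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

    # Assemble the devices into a single striped (RAID 0) array. Striping
    # spreads I/O across all volumes, which raises aggregate throughput,
    # but the array is only as fast as its slowest member.
    subprocess.check_call(
        ["mdadm", "--create", "/dev/md0", "--level=0",
         "--raid-devices=%d" % len(ebs_devices)] + ebs_devices)

    # Format the array and mount it as the database data directory.
    subprocess.check_call(["mkfs.xfs", "/dev/md0"])
    subprocess.check_call(["mount", "/dev/md0", "/var/lib/db"])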

Symptoms

Our problem was characterized by high I/O utilization on a subset (usually one) of the EBS volumes in an array. Watching iostat, we could see that some of the EBS volumes consistently reported higher %util readings than the others; device saturation occurs when this value approaches 100%. The failure usually manifested as a large spike in await (the average time for I/O requests issued to the device to be served), after which %util pegged at 100% and stayed there until no more operations were directed at the drive, even though the volume was returning only a few kilobytes of data per second. Unfortunately, we don’t have access to the actual root-cause information for this failure, as it’s all internal to AWS, but it appears to have something to do with high latency between our instances and these volumes.
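
To illustrate what we were watching for, here is a small sketch that runs a single extended iostat report and flags devices whose await or %util readings look pathological. The thresholds are arbitrary, and the parsing assumes the column names from the sysstat output we were looking at.

    import subprocess

    AWAIT_THRESHOLD_MS = 200.0  # arbitrary; healthy awaits were far lower
    UTIL_THRESHOLD_PCT = 95.0   # anything pinned near 100% is saturated

    def saturated_devices():
        """Run one extended iostat report and return devices that look unhealthy."""
        out = subprocess.check_output(["iostat", "-x", "1", "1"]).decode()
        header = None
        suspects = []
        for line in out.splitlines():
            fields = line.split()
            if not fields:
                continue
            if fields[0].startswith("Device"):
                header = fields  # remember the column names for this report
                continue
            if header is None or len(fields) != len(header):
                continue  # skip CPU lines and anything that isn't a device row
            row = dict(zip(header, fields))
            try:
                await_ms = float(row.get("await", 0))
                util_pct = float(row.get("%util", 0))
            except ValueError:
                continue
            if await_ms > AWAIT_THRESHOLD_MS or util_pct > UTIL_THRESHOLD_PCT:
                suspects.append((fields[0], await_ms, util_pct))
        return suspects

    if __name__ == "__main__":
        for device, await_ms, util_pct in saturated_devices():
            print("%s: await=%.1f ms, %%util=%.1f" % (device, await_ms, util_pct))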

Solution

This was a pervasive problem throughout our availability zone. Not only was our new master database server severely impaired, but so were two other hot-spares. We decided to spin up a new instance and hope we could get enough unaffected EBS volumes to bring our service back up. We frequently ran into “capacity unavailable” errors while trying to provision volumes, and when we finally had enough volumes to start another database server and had copied over a data dump, our testing indicated that its EBS volumes were similarly impacted.
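
The provisioning loop we were effectively running by hand looks roughly like the following sketch, written against the boto library of that era. The region, zone, volume size, and count are placeholders, and the retry-on-error behavior is an illustrative assumption rather than our exact tooling.

    import time
    import boto.ec2
    from boto.exception import EC2ResponseError

    REGION = "us-east-1"     # placeholder
    ZONE = "us-east-1a"      # placeholder
    VOLUME_SIZE_GB = 100     # placeholder
    VOLUMES_NEEDED = 4

    def provision_volumes(conn):
        """Keep requesting volumes until we have enough, backing off on errors."""
        volumes = []
        while len(volumes) < VOLUMES_NEEDED:
            try:
                vol = conn.create_volume(VOLUME_SIZE_GB, ZONE)
            except EC2ResponseError as err:
                # During the incident this frequently failed with capacity
                # errors; back off and retry rather than giving up.
                print("create_volume failed (%s), retrying" % err.error_code)
                time.sleep(30)
                continue
            # Wait for the volume to become available before counting it.
            while vol.update() != "available":
                time.sleep(5)
            volumes.append(vol)
        return volumes

    if __name__ == "__main__":
        conn = boto.ec2.connect_to_region(REGION)
        for vol in provision_volumes(conn):
            print("provisioned %s in %s" % (vol.id, vol.zone))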

AWS told us many times that they were ‘close’ to resolving the issue, but would not offer detailed information about the problem or a timeline for resolution, only that their team was working on it. As a safeguard, we decided to launch another server in a different availability zone and rsync the data over, knowing the copy alone would take an hour and a half. About halfway through this process, EBS latency in our original availability zone seemed to improve somewhat, and we promoted one of the slaves to master.
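
The copy itself was a plain rsync over SSH, along the lines of the sketch below; the host and path names are hypothetical.

    import subprocess

    # Hypothetical source and destination; the real hosts and paths differed.
    SOURCE_DIR = "/var/lib/db/"
    DEST = "db-standby.example.com:/var/lib/db/"

    # -a preserves permissions and timestamps, -z compresses over the wire,
    # and --partial lets an interrupted transfer resume instead of restarting.
    subprocess.check_call(["rsync", "-az", "--partial", SOURCE_DIR, DEST])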

We consistently provided information to AWS engineers, hoping it would help them diagnose and fix the issue. Unfortunately, responses from the AWS team were often delayed and the information they gave was mostly vague. Even after the problem was identified and publicly announced, we waited anywhere from 40 minutes to several hours between responses. In the end we noticed improved performance and recovered our sites, but it wasn’t until 16 hours later that we received another update from Amazon. Several times during that period they posted small public updates on their status page, but did not reply to our updates and requests for details.

Throughout this process, we were simultaneously working on bringing up new nodes, rebuilding indexes on others, and restoring from dumps. Unfortunately, because our dataset is fairly large, copying it around took hours, so we were quite limited in what we could accomplish until EBS latency improved.

Action Plan

Things we’re doing to make sure this doesn’t happen again:

Use Multiple Availability Zones For Core Services

We had been using a single availability zone primarily to keep latency down and throughput high. As we move to different database solutions and re-engineer our architecture, we’re ready to start deploying our core services across several availability zones. Naturally, this is a core requirement for surviving degraded service from a hosting provider. Redundancy, redundancy, redundancy.
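
In practice, spreading instances across zones is largely a matter of varying the placement target at launch time. The sketch below uses boto to put one database node in each zone; the AMI, instance type, and zone list are placeholders rather than our actual configuration.

    import boto.ec2

    REGION = "us-east-1"
    AMI_ID = "ami-00000000"           # placeholder AMI for our database image
    ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

    conn = boto.ec2.connect_to_region(REGION)

    # One database node per availability zone, so a single degraded zone
    # no longer takes out every replica at once.
    for zone in ZONES:
        reservation = conn.run_instances(
            AMI_ID,
            instance_type="m1.xlarge",  # placeholder size
            placement=zone,
        )
        for instance in reservation.instances:
            print("launched %s in %s" % (instance.id, zone))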

Don’t Rebuild Indexes, Store Them

Once our hot-spares had also been deemed unfit for service, our only option was to start a new instance from a database dump. Normally this isn’t so bad, but rebuilding the indexes afterward is extremely time-intensive (and we rely heavily on primary and secondary indexes).
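
One way to avoid the rebuild entirely (an illustrative approach, not necessarily the one we will settle on) is to restore from block-level EBS snapshots of the data volumes rather than from logical dumps, since snapshots carry the on-disk index files along with the data. A rough sketch with boto, with placeholder volume IDs:

    import boto.ec2

    REGION = "us-east-1"
    # Placeholder IDs for the volumes backing one database's RAID array.
    DATA_VOLUME_IDS = ["vol-aaaaaaaa", "vol-bbbbbbbb", "vol-cccccccc"]

    conn = boto.ec2.connect_to_region(REGION)

    # Snapshot every volume in the array. Volumes restored from these
    # snapshots come back with the on-disk index files intact, so there is
    # no lengthy rebuild after a restore. The database should be quiesced
    # (or the filesystem frozen) first so the snapshots are consistent
    # with each other.
    for volume_id in DATA_VOLUME_IDS:
        snapshot = conn.create_snapshot(volume_id, "index-preserving backup")
        print("started snapshot %s for %s" % (snapshot.id, volume_id))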

Work With AWS to Improve Communication

We’ll have some serious discussions with Amazon to improve communication during incidents like these. While we appreciate that engineers need to do their work to restore functionality, it’s unacceptable to go sixteen hours without updating a support ticket for such a serious issue, even if one believes it to be resolved.

Conclusion

We take service outages very seriously. We’re investing a lot of time in making sure our infrastructure is protected from edge cases like slow EBS volumes and broader outages within an availability zone. Naturally, we’re not stopping there. We’ll continue to look at new solutions and strive to provide great service to our customers.

Update:

Before we posted this, we had another EBS volume failure (this morning). The volume exhibited the same failure mode: it accepted reads and writes, but with such high latency that all access backed up and utilization sat at 100%. Luckily, this time only a single volume on a single server was affected, and we were able to promote the slave.