Tarsnap outage

At approximately 2012-06-30 03:02 UTC, the central Tarsnap server (hosted in Amazon's EC2 US-East region) went offline due to a power outage. According to Amazon's post-mortem of this incident, this was caused by two generator banks independently failing to provide stable voltages after utility (grid) power was lost during a severe electrical storm.

When the Tarsnap EC2 instance came back online about an hour later, I found that the abrupt loss of power had resulted in filesystem corruption. As I explained in a blog post in December 2008, Tarsnap stores all user data in Amazon S3, but keeps metadata cached in EC2. While I could not see any evidence that the power loss had resulted in this cached metadata being corrupted, I could not absolutely rule out the possibility -- so keeping in mind that the first responsibility of a backup system is to avoid any possibility of data loss or corruption, I decided to err on the side of caution by treating the state on EC2 as "untrustworthy".

As I described in my December 2008 blog post, losing all state on EC2 is a failure mode which Tarsnap is designed to survive, by reading data back from S3 and replaying all of the operation log entries. That said, losing local state and needing to recover completely from off-site backups (which Amazon S3 counts as, since it's replicated to multiple datacenters) is very much a worst-case scenario. It's also, quite literally, a nightmare scenario: A few days before this outage, I had a nightmare about this happening, probably provoked by the 2012-06-15 power outage which affected instances in a different EC2 availability zone.
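To give a rough idea of what "replaying all of the operation log entries" means in practice, the recovery amounts to re-applying each logged operation, in order, to an initially empty state. The sketch below is purely conceptual: the real Tarsnap server is written in C and uses its own formats, and the log-entry structure shown here is entirely hypothetical.

```python
# Conceptual sketch only: the real server is written in C, and the log-entry
# format shown here (one JSON object per line) is entirely hypothetical.
import json

def replay_operation_log(log_lines):
    """Rebuild per-machine cached metadata by re-applying logged operations."""
    machines = {}  # machine_id -> {archive_name: archive_metadata}
    for line in log_lines:
        entry = json.loads(line)
        state = machines.setdefault(entry["machine"], {})
        if entry["op"] == "write_archive":
            state[entry["archive"]] = entry["metadata"]
        elif entry["op"] == "delete_archive":
            state.pop(entry["archive"], None)
    return machines
```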

I restored Tarsnap to a fully operational state at 2012-07-01 22:50 UTC, slightly less than 44 hours after the outage started. Obviously any outage is bad and an outage of this length is unacceptable; over the course of the recovery process I learned several lessons which should make recovery from any future "complete metadata loss" incidents faster (see below).

While Tarsnap does not have any formal SLA, I have a (rather ill-defined) policy of issuing credits to Tarsnap users affected by outages or bugs in the Tarsnap code, based on my personal sense of fairness. While the original cause of this outage was out of my control, the outage should have been much shorter, and as a result I have credited Tarsnap users' accounts with 50% of a month of storage costs.

Timeline of events

(Some of the times below are approximate; I was not worrying about keeping notes as this process was underway so I've had to reconstruct the timeline based on my memory, log files, and file timestamps.)

2012-06-30 03:02 UTC: Power is lost to the Tarsnap server, and I observe it ceasing to respond to network traffic. Due to two recent outages in the EC2 US-East region -- the power outage on June 15th, after which Amazon wrote in a post-mortem that "We have also completed an audit of all our back-up power distribution circuits" -- and a network outage earlier on June 30th (note that these two earlier outages were far more limited in scope, and neither affected Tarsnap), my initial presumption was that the outage was caused by a network issue.

2012-06-30 03:21 UTC: My presumption that the outage was network-related is reinforced by Amazon posting to its status page that they were "investigating connectivity issues".

2012-06-30 03:40 UTC: Amazon posts to its status page that "a large number of instances [...] have lost power". I start attempting to launch a replacement EC2 instance in case it becomes necessary to perform a full state recovery from S3 (as eventually proved to be the case), but my attempts fail due to EC2 APIs being offline as a result of the power outage.

2012-06-30 04:03 UTC: Power is restored to the Tarsnap server, and I SSH in to find that it had suffered filesystem corruption as a result of the power outage. Since I cannot rule out the possibility that the local state was corrupted, I continue with plans to perform a full state recovery.

2012-06-30 04:37 UTC: I succeed in launching a replacement Tarsnap server and start configuring it and installing the Tarsnap server code. This process includes creating and attaching Elastic Block Store disks using the AWS Management Console, which is slowed down by repeated timeouts and errors.

2012-06-30 05:25 UTC: I finish configuring the replacement Tarsnap server and start the process of regenerating its local state from S3. The first phase of this process involves reading millions of stored S3 objects; unfortunately, these reads were performed in sequential key order, triggering a worst-case performance behaviour in S3. As a result, this phase of recovery took much longer than I had anticipated, and the design of the code meant that changing the order in which objects were read was not something I could do "on the fly".
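For readers unfamiliar with this behaviour: at the time, S3 partitioned objects by key prefix, so issuing GETs in strict lexicographic key order tends to concentrate load on a single partition and serializes the request latencies, while shuffling the key list and fetching with a handful of concurrent workers avoids both problems. The sketch below is purely illustrative -- it uses boto3 (which postdates this outage) rather than the Tarsnap server's own S3 client code, and the bucket name is a placeholder.

```python
# Purely illustrative sketch: boto3 postdates this outage, and the Tarsnap
# server uses its own S3 client code; the bucket name is a placeholder.
import random
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "example-tarsnap-data"  # placeholder

def list_all_keys(bucket):
    """Yield every object key in the bucket."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def fetch(key):
    """Fetch one object and return (key, data)."""
    return key, s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

keys = list(list_all_keys(BUCKET))
random.shuffle(keys)  # avoid reading in lexicographic (worst-case) key order
with ThreadPoolExecutor(max_workers=16) as pool:
    for key, data in pool.map(fetch, keys):
        pass  # hand each object to the next stage of recovery here
```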

2012-06-30 15:00 UTC: After inspecting the "outaged" Tarsnap server, I conclude that the state corruption is almost certainly limited to archives committed in the last few seconds before the power loss. Consequently, I bring the server back online in a read-only mode, so that anyone who needs their data urgently can retrieve it before the full recovery process is complete.

2012-07-01 01:45 UTC: The first phase of Tarsnap recovery -- retrieving bits from S3 -- completes, and the second phase -- reconstructing a "map" database which identifies the location of each block of data within S3 -- starts.
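Conceptually, this "map" lets the server turn a block identifier into the place in S3 where that block lives. The actual map database format is internal to the Tarsnap server; the sketch below is only meant to show the shape of the reconstruction, with a block's "location" reduced to an S3 object name and all field names hypothetical.

```python
# Hypothetical illustration only: the real map database format is internal to
# the Tarsnap server; here a block's "location" is simply an S3 object name.
def build_block_map(stored_objects):
    """stored_objects yields (s3_object_name, [block_id, ...]) pairs."""
    block_map = {}
    for s3_object_name, block_ids in stored_objects:
        for block_id in block_ids:
            block_map[block_id] = s3_object_name
    return block_map
```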

2012-07-01 02:30 UTC: I notice that the map database reconstruction is running anomalously slowly. This turns out to be caused by some I/O rate-limiting code I had put in place to prevent back-end processes from starving the front-end Tarsnap daemon (which has much stricter scheduling requirements) of I/O -- obviously, those limits were not necessary during the reconstruction stage, when there was no front-end daemon running.
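To make the idea concrete, here is a minimal sketch of the kind of throttle being described: a token bucket that back-end I/O must pass through, with a bypass when running in a recovery mode. The class name, the rate figure, and the recovery-mode flag are all hypothetical; the actual Tarsnap server code is written in C.

```python
# Minimal sketch only; the names, rate figure, and recovery-mode bypass are
# hypothetical, not the actual Tarsnap server implementation (which is C).
import time

class IOThrottle:
    def __init__(self, bytes_per_sec, recovery_mode=False):
        self.rate = bytes_per_sec
        self.recovery_mode = recovery_mode
        self.tokens = bytes_per_sec      # allow up to one second of burst
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until nbytes of I/O budget is available; no-op in recovery mode."""
        if self.recovery_mode:
            return
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes > self.tokens:
            time.sleep((nbytes - self.tokens) / self.rate)
            self.tokens = 0
        else:
            self.tokens -= nbytes

# e.g. throttle back-end reads to 10 MB/s during normal operation (the figure
# is made up), but let recovery run at full speed:
throttle = IOThrottle(10 * 1024 * 1024, recovery_mode=False)
```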

2012-07-01 02:58 UTC: I restart the second phase of Tarsnap recovery (without I/O rate limits in place).

2012-07-01 11:18 UTC: The third (final) phase of Tarsnap recovery -- replaying logs to reconstruct the server-side cached state for each machine -- starts.

2012-07-01 12:41 UTC: The third phase of Tarsnap recovery fails due to requests to S3 timing out. Normally the Tarsnap code retries failed S3 requests, but in this particular code path that functionality was missing. I increase the S3 timeouts by a factor of 10 and restart the third recovery phase.
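The missing functionality here is just ordinary retry logic; below is a sketch of the general pattern (bounded retries with exponential backoff and a generous timeout). The function is a stand-in -- it is not the actual Tarsnap code path, and the attempt count, timeout, and backoff factor are illustrative.

```python
# Generic retry-with-backoff sketch; the callable, attempt count, timeout, and
# backoff factor are illustrative stand-ins, not the actual Tarsnap code.
import time

def s3_request_with_retries(request_fn, attempts=5, timeout=300.0, backoff=2.0):
    """Call request_fn(timeout=...) until it succeeds or attempts are exhausted."""
    delay = 1.0
    for attempt in range(attempts):
        try:
            return request_fn(timeout=timeout)
        except Exception:
            if attempt == attempts - 1:
                raise                 # give up after the final attempt
            time.sleep(delay)         # wait before retrying
            delay *= backoff          # exponential backoff
```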

2012-07-01 21:59 UTC: The third phase of Tarsnap recovery completes. I look over the reconstructed state (and compare it against the mostly-correct state on the outaged server) to confirm that the reconstruction worked properly; then start the Tarsnap server code and run some tests.

2012-07-01 22:50 UTC: I'm satisfied that everything is working properly, and switch the Tarsnap server's "Elastic IP" address over to point at the new server instance.

Lessons learned

1. Test disaster recovery processes *at scale*. I have always tested the process for recovering from state stored on S3 every time I prepare to roll out new code -- it's part of my "make test" -- but I have always done this on a small data set. A test with six orders of magnitude less data may help to confirm that a recovery process works, but it certainly doesn't test that the process works *quickly*; and both success and speed are important.

2. Disaster recovery processes should not rely on the AWS Management Console. According to Amazon, the EC2 and EBS control plane was restored by 2012-06-30 04:10 UTC, but I was experiencing timeouts and errors (most commonly "too many requests") from the Console over an hour beyond that point. Obviously when a disaster occurs there is a large influx of users into the Management Console; evidently it needs work to improve its scalability. The obvious alternative is to script these steps against the EC2 API directly (see the first sketch after this list).

3. Disaster recovery processes should start at the first sign of an outage. In this particular case it wouldn't have made any difference, since for the first hour after the outage began it was impossible to launch a replacement EC2 instance; but that is an EC2 failure mode which Amazon says it is addressing, so in the future, starting the recovery immediately rather than waiting to find out whether the outage is a transient network issue or something more severe could save time.

4. Sequential accesses to Amazon S3 are bad. This isn't so much a "lesson learned" as a "lesson I remembered applies here" -- I was aware of it but hadn't realized that it would slow down the recovery process so much.

5. The appropriate behaviours during state recovery are not always the same as the appropriate behaviours during normal operations. Disk I/O rate limiting of back-end processes and filesystem syncing are normally useful, but not so much when it is important to recover state and get back online as quickly as possible. Code shared between "normal" and "recovery" operations -- which is, in fact, most of the Tarsnap server code -- should be aware of which mode it's running in and control those behaviours appropriately (see the second sketch after this list).

6. Tarsnap users are amazing people. (Ok, I knew this already, but still, this episode reinforced it.) I didn't see a single irate email, tweet, or comment on IRC during the entire outage; people politely asked if there was an outage and if it was related to the EC2 outage, and aside from that the only communications I had were very positive, mostly thanking me for my efforts and status reports.
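Two of the lessons above lend themselves to short illustrations. For lesson 2, this is roughly what scripting the "launch an instance, attach an EBS volume" steps from the timeline looks like when driven through the EC2 API instead of the Console. It is only a sketch: it uses boto3 (which postdates this outage), and every ID and parameter is a placeholder.

```python
# Illustrative only: boto3 postdates this outage; the AMI, instance type,
# volume ID, and device name are all placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-00000000",      # placeholder AMI
    InstanceType="m1.large",     # placeholder (era-appropriate) instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Wait for the instance to come up, then attach a data volume.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ec2.attach_volume(
    VolumeId="vol-00000000",     # placeholder EBS volume
    InstanceId=instance_id,
    Device="/dev/sdf",
)
```

For lesson 5, the point is simply that shared code should consult an explicit mode rather than always behaving as it does in normal operation. The flag, the environment variable, and the helper below are hypothetical inventions meant only to show the shape of the idea.

```python
# Hypothetical sketch: the mode flag, environment variable, and helper are
# inventions for illustration, not part of the Tarsnap server code.
import os
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    RECOVERY = "recovery"

MODE = Mode(os.environ.get("TARSNAP_SERVER_MODE", "normal"))  # hypothetical knob

def durable_write(fd, data):
    os.write(fd, data)
    if MODE is Mode.NORMAL:
        os.fsync(fd)  # per-write durability matters for live commits...
    # ...but during recovery the data can be regenerated from S3, so syncing
    # once at the end of the run is enough.
```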

Final words

After writing so much about Tarsnap, I'd like to take a moment to provide some wider context to this outage. The power outage which knocked Tarsnap offline was big -- judging by Amazon's statement of "7% of instances" and the number of IP addresses in the EC2 US-East region, somewhere around 50,000 instances went offline -- and as I'm sure everybody reading this is aware, also affected such "big names" as Netflix, Pinterest, Instagram, and Heroku. Most people stop at this point, but there's more to the story than that.

This power outage was caused by a "derecho" thunderstorm system which is believed to be one of the most severe non-hurricane storm systems in North American history. 22 people are believed to have died as a direct result of the storm, and over 3 million homes and businesses lost power. Approximately a million people are still without power now, five days later.

Even worse, this storm system was caused by a heat wave which has set record temperatures in hundreds of locations across the Eastern US, with many areas exceeding 38 C (100 F) for several days in a row. For the elderly, the young, and individuals with chronic diseases, prolonged exposure to these temperatures can be life-threatening -- and without electricity, air conditioning is not available. By the time this heat wave is over, it could easily be responsible for hundreds or even thousands of deaths.

For many of us, a datacenter losing power is the only effect we will see from this storm. For most of the people who were directly affected by the storm, it's the least of their worries.

Posted at 2012-07-04 22:00