I will not be acquihired

There's a story I notice playing out increasingly often these days: A startup company is founded and provides a service lots of people want and use; the company is acquired by one of the tech giants (most often Google or Facebook); and a few months later the startup's founders are shuffled off to work on other projects and the service they created is shut down. In corporate-speak, this is a "talent acquisition"; in the world of internet startups, it's more often called an "acquihire".

This, quite reasonably, tends to make people nervous about using services which are provided by small startup companies. After "what if you get hit by a bus", the most common question people ask me about my online backup service is some variation on "what if Google buys Tarsnap and doesn't want to keep it running?" I've answered this privately many times, but I think it's time to answer it publicly: Tarsnap will not be acquihired.

Announcements of startup acquisitions rarely provide much in the way of back story, but reading between the lines there's two common sub-plots I see. I call these the "job hunter" story and the "failure to lift-off" story.

The job hunter story runs roughly as follows: Some smart kids graduate from college and start looking for jobs. They go around to the usual list of high profile companies — Google, Facebook, Microsoft, Amazon, etc. — but either aren't offered jobs or are offered relatively low salaries. While it's obvious to their interviewers that they are smart, they've never built anything beyond the scope of a one-term college course, and employers are naturally hesitant about hiring anyone without practical work experience.

So lacking any job offers commensurate to their talents, our job hunters set out to demonstrate that they have practical skills in addition to academic knowledge. They build a company much like an artist or architect puts together a portfolio: As a way to show off their talents. It doesn't matter if the company makes any money — that's not the point, since they're trying to demonstrate their suitability for software development jobs, not their ability to run companies. A year or so later, they get the jobs they were looking for; a large software firm gets employees with demonstrated practical skills; and the only people who aren't happy are the users who now find that the service they were using no longer exists.

The failure to lift-off story is a bit uglier. Some people have a business idea, and decide to set out on their own to create it. Usually it turns out that the market isn't quite as large as they thought; in some cases a lot of people want what they're providing, but aren't willing to pay enough to make the business very profitable. Maybe the company is approaching bankruptcy and can't convince any venture capital companies to invest any more money; maybe the company is limping along barely profitably but the founders aren't getting rich and the investors are getting impatient.

Either way, some calls are made, and the company is "acquired"; most often it seems that the only connection between the acquired and acquiring companies is that they share an investor in common. (From a fiduciary perspective, these deals stink: It seems very implausible that they would all occur if it weren't for personal relationships greasing the wheels.) This story doesn't end with anyone being very happy, but at least it's a cleaner and less embarassing wrap-up than a company going bankrupt and its founders being unemployed.

Tarsnap doesn't fit either of these stories. I started Tarsnap shortly after receiving my doctorate, so it might look like a "job hunter" story; but I only started Tarsnap after being offered a well-paid job. In fact, without that job offer, I wouldn't have started Tarsnap — it was the security of having "get a job at a big internet company" as a proven backup plan which made me willing to take the risk of starting my own company. Nor does Tarsnap fit the "failure to lift-off" story: While Tarsnap is nowhere near as large as Dropbox, AirBnb, or Heroku, it doesn't need to be. As a "bootstrapped" company, Tarsnap has no investors who could get impatient; and it's sufficiently profitable that I'd be satisfied with running it indefinitely even if it never grows any further (which is, of course, a very unlikely scenario).

In short, Tarsnap won't be acquihired because I'd have nothing to gain from it. I didn't built Tarsnap in the hopes of attracting a job offer, and it's successful enough that I'd be a fool to ever abandon it. I'm in this for the long haul; your backups are safe.

Posted at 2012-07-21 05:05 | Permanent link | Comments

Tarsnap outage

At approximately 2012-06-30 03:02 UTC, the central Tarsnap server (hosted in Amazon's EC2 US-East region) went offline due to a power outage. According to Amazon's post-mortem of this incident this was caused by two generator banks independently failing to provide stable voltages after utility (grid) power was lost during a severe electrical storm.

When the Tarsnap EC2 instance came back online about an hour later, I found that the abrupt loss of power had resulted in filesystem corruption. As I explained in a blog post in December 2008, Tarsnap stores all user data in Amazon S3, but keeps metadata cached in EC2. While I could not see any evidence that the power loss had resulted in this cached metadata being corrupted, I could not absolutely rule out the possibility -- so keeping in mind that the first responsibility of a backup system is to avoid any possibility of data loss or corruption, I decided to err on the side of caution by treating the state on EC2 as "untrustworthy".

As I described in my December 2008 blog post, losing all state on EC2 is a failure mode which Tarsnap is designed to survive, by reading data back from S3 and replaying all of the operation log entries. That said, losing local state and needing to recover completely from off-site backups (which Amazon S3 counts as, since it's replicated to multiple datacenters) is very much a worst-case scenario. It's also, quite literally, a nightmare scenario: A few days before this outage, I had a nightmare about this happening, probably provoked by the 2012-06-15 power outage which affected instances in a different EC2 availability zone.

I restored Tarsnap to a fully operational state at 2012-07-01 22:50 UTC, slightly less than 44 hours after the outage started. Obviously any outage is bad and an outage of this length is unacceptable; over the course of the recovery process I learned several lessons which should make recovery from any future "complete metadata loss" incidents faster (see below).

While Tarsnap does not have any formal SLA, I have a (rather ill-defined) policy of issuing credits to Tarsnap users affected by outages or bugs in the Tarsnap code, based on my personal sense of fairness. While the original cause of this outage was out of my control, the outage should have been much shorter, and as a result I have credited Tarsnap users' accounts with 50% of a month of storage costs.

Timeline of events

(Some of the times below are approximate; I was not worrying about keeping notes as this process was underway so I've had to reconstruct the timeline based on my memory, log files, and file timestamps.)

2012-06-30 03:02 UTC: Power is lost to the Tarsnap server, and I observe it ceasing to respond to network traffic. Due to two recent outages in the EC2 US-East region -- the power outage on June 15th, after which Amazon wrote in a post-mortem that "We have also completed an audit of all our back-up power distribution circuits" -- and a network outage earlier on June 30th (note that these two earlier outages were far more limited in scope, and neither affected Tarsnap) my initial presumption was that the outage was caused by a network issue.

2012-06-30 03:21 UTC: My presumption that the outage was network-related is reinforced by Amazon posting to its status page that they were "investigating connectivity issues".

2012-06-30 03:40 UTC: Amazon posts to its status page that "a large number of instances [...] have lost power". I start attempting to launch a replacement EC2 instance in case it becomes necessary to perform a full state recovery from S3 (as eventually proved to be the case), but my attempts fail due to EC2 APIs being offline as a result of the power outage.

2012-06-30 04:03 UTC: Power is restored to the Tarsnap server, and I SSH in to find that it suffered filesystem corruption as a result of the power outage. Since I cannot rule out the possibility that the local state was corrupted, I continue with plans to perform a full state recovery.

2012-06-30 04:37 UTC: I succeed in launching a replacement Tarsnap server and start configuring it and installing the Tarsnap server code. This process includes creating and attaching Elastic Block Store disks using the AWS Management Console, which is slowed down by repeated timeouts and errors.

2012-06-30 05:25 UTC: I finish configuring the replacement Tarsnap server and start the process of regenerating its local state from S3. The first phase of this process involves reading millions of stored S3 objects; unfortunately, these reads were performed in sequential order, triggering a worst-case performance behaviour in S3. As a result, this phase of recovery took much longer than I had anticipated; unfortunately, the design of the code meant that changing the order in which objects were read was not something I could do "on the fly".

2012-06-30 15:00 UTC: After inspecting the "outaged" Tarsnap server, I concluded that the state corruption was almost certainly limited to archives committed in the last few seconds before the power loss. Consequently, I brought the server back online in a read-only mode, so that anyone who needed their data urgently could retrieve it before the full recovery process was complete.

2012-07-01 01:45 UTC: The first phase of Tarsnap recovery -- retrieving bits from S3 -- completes, and the second phase -- reconstructing a "map" database which identifies the location of each block of data within S3 -- starts.

2012-07-01 02:30 UTC: I notice that the map database reconstruction is running anomalously slowly. This turns out to be caused by some I/O rate-limiting code I had put in place to prevent back-end processes starving the front-end Tarsnap daemon (which has much stricter scheduling requirements) for I/O -- obviously, those limits were not necessary during the reconstruction stage when there was no front-end daemon running.

2012-07-01 02:58 UTC: I restart the second phase of Tarsnap recovery (without I/O rate limits in place).

2012-07-01 11:18 UTC: The third (final) phase of Tarsnap recovery -- replaying logs to reconstruct the server-side cached state for each machine -- starts.

2012-07-01 12:41 UTC: The third phase of Tarsnap recovery fails due to requests to S3 timing out. Normally the Tarsnap code retries failed S3 requests, but in this particular code path that functionality was missing. I increase the S3 timeouts by a factor of 10 and restart the third recovery phase.

2012-07-01 21:59 UTC: The third phase of Tarsnap recovery completes. I look over the reconstructed state (and compare it against the mostly-correct state on the outaged server) to confirm that the reconstruction worked properly; then start the Tarsnap server code and run some tests.

2012-07-01 22:50 UTC: I'm satisfied that everything is working properly, and switch the Tarsnap server's "Elastic IP" address over to point at the new server instance.

Lessons learned

1. Test disaster recovery processes *at scale*. I have always tested the process for recovering from state stored on S3 every time I prepare to roll out new code -- it's part of my "make test" -- but I have always done this on a small data set. A test with six orders of magnitude less data may help to confirm that a recovery process works, but it certainly doesn't test that the process works *quickly*; and both success and speed are important.

2. Disaster recovery processes should not rely on the AWS Management Console. According to Amazon the EC2 and EBS control plane was restored by 2012-06-30 04:10 UTC, but I was experiencing timeouts and errors (most commonly "too many requests") from the Console over an hour beyond that point. Obviously when a disaster occurs there is a large influx of users into the Management Console; evidently it needs work to improve its scalability.

3. Disaster recovery processes should start at the first sign of an outage. In this particular case it wouldn't have made any difference since for the first hour after the outage began it was impossible to launch a replacement EC2 instance; but that is an EC2 failure mode which Amazon says they are addressing, so in the future starting the recovery immediately rather than waiting to find out if the outage is a transient network issue or more severe could save time.

4. Sequential accesses to Amazon S3 are bad. This isn't so much a "lesson learned" as a "lesson I remembered applies here" -- I was aware of it but hadn't realized that it would slow down the recovery process so much.

5. The appropriate behaviours during state recovery are not always the same as the appropriate behaviours during normal operations. Disk I/O rate limiting of back-end processes and filesystem syncing are normally useful, but not so much when it is important to recover state and get back online as quickly as possible. Code shared between "normal" and "recovery" operations -- which is, in fact, most of the Tarsnap server code -- should be aware of which mode it's running in and control those behaviours appropriately.

6. Tarsnap users are amazing people. (Ok, I knew this already, but still, this episode reinforced it.) I didn't see a single irate email, tweet, or comment on IRC during the entire outage; people politely asked if there was an outage and if it was related to the EC2 outage, and aside from that the only communications I had were very positive, mostly thanking me for my efforts and status reports.

Final words

After writing so much about Tarsnap, I'd like to take a moment to provide some wider context to this outage. The power outage which knocked Tarsnap offline was big -- judging by Amazon's statement of "7% of instances" and the number of IP addresses in the EC2 US-East region, somewhere around 50,000 instances went offline -- and as I'm sure everybody reading this is aware, also affected such "big names" as Netflix, Pinterest, Instagram, and Heroku. Most people stop at this point, but there's more to the story than that.

This power outage was caused by a "derecho" thunderstorm system which is believed to be one of the most severe non-hurricane storm systems in North American history. 22 people are believed to have died as a direct result of the storm, and over 3 million homes and businesses lost power. Approximately a million people are still without power now, five days later.

Even worse, this storm system was caused by a heat wave which has set record temperatures in hundreds of locations across the Eastern US, with many areas exceeding 38 C (100 F) for several days in a row. For the elderly, the young, and individuals with chronic diseases, prolonged exposure to these temperatures can be life-threatening -- and without electricity, air conditioning is not available. By the time this heat wave is over, it could easily be responsible for hundreds or even thousands of deaths.

For many of us, a datacenter losing power is the only effect we will see from this storm. For most of the people who were directly affected by the storm, it's the least of their worries.

Posted at 2012-07-04 22:00 | Permanent link | Comments

Recent posts

Monthly Archives

Yearly Archives