The S3 SLA is broken (and how to fix it)

At approximately 2009-01-14 05:26, Amazon's Simple Storage Service suffered some form of internal failure, resulting in a sharp increase in the rate of request failures. According to Amazon, there were "increased error rates"; according to my logs, 100% of the PUT requests the tarsnap server made to S3 failed. For somewhat more than half an hour (I don't know the exact duration) it was impossible for the tarsnap server to store any data to S3, effectively putting it out of service as far as storing backups was concerned; and presumably other S3 users met a similar fate.

At approximately 2009-01-16 15:20, the S3 PUT error rate jumped from its usual level of less than 0.1% up to roughly 1%; and as I write this, the error rate remains at that elevated level. However, the tarsnap server, like all well-designed S3-using applications, retries failed PUTs, so aside from a very slight increase in effective request latency, this prolonged period of elevated error rates has had no effect on tarsnap whatsoever; nor, presumably, has it had any significant impact on any other well-designed S3-using applications.
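
A minimal sketch of the kind of retry loop such a well-designed application might use -- the function name and backoff parameters here are illustrative, not tarsnap's actual code:

```python
import random
import time

def put_with_retries(do_put, max_attempts=5, base_delay=0.5):
    """Attempt a PUT, retrying transient failures with exponential
    backoff plus jitter.  do_put() should return True on success and
    False on a retryable error (e.g. InternalError, ServiceUnavailable)."""
    for attempt in range(max_attempts):
        if do_put():
            return True
        if attempt < max_attempts - 1:
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a
            # little random jitter, before trying again.
            time.sleep(base_delay * (2 ** attempt) +
                       random.uniform(0, base_delay))
    return False
```

If failures are independent and the error rate is 1%, all five attempts fail with probability 10^-10 -- which is why a tenfold increase in the error rate shows up only as a slight increase in effective latency, not as an outage.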

According to the S3 Service Level Agreement, these two outages -- one which rendered applications inoperative for half an hour, and the other which had little or no impact -- are equal in severity.

This peculiar situation is caused by the overly simplistic form which the SLA takes: It provides a guarantee on the average Error Rate, completely neglecting to consider the fact that -- given that applications can retry failures -- the impact of errors is a very non-linear function of the error rate. I observed no outages in S3 during December 2008, yet even without using tricks which can be used to artificially raise the computed error rate, the occasional failures which result from S3's nature as a distributed system -- failures which occur by design -- were enough that the error rate I experienced (as computed in accordance with the SLA) was 0.098% -- just barely short of the 0.1% which would have triggered a refund. At the same time, 0.1% of a month is 40-45 minutes (depending on the number of days in the month), so if S3 failed completely for 30 minutes but every request made outside of that interval succeeded, nobody would get a refund under the SLA.
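
The arithmetic here is easy to check:

```python
# 0.1% of a month, in minutes, for the shortest and longest months:
for days in (28, 31):
    print(days, "days:", 0.001 * days * 24 * 60, "minutes")
# 28 days -> 40.32 minutes; 31 days -> 44.64 minutes

# A total 30-minute outage in a 30-day month, with every request
# outside that interval succeeding, stays well under 0.1% downtime:
outage_fraction = 30 / (30 * 24 * 60)
print(round(100 * outage_fraction, 4), "%")   # about 0.0694%
```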

Put simply, the design of the SLA results in refunds being given in response to harmless failures, yet not being given in response to harmful failures: The wrong people get refunds.

If I were in charge at Amazon, I would adjust the S3 SLA as follows:

  • "Failed Request" means: A request for which S3 returned either "InternalError" or "ServiceUnavailable" error status.
  • "Non-GET Request" means: Requests other than GET requests, e.g., PUT, COPY, POST, LIST, or DELETE requests.
  • "Severely Errored Interval" for an S3 account means: A five-minute period, starting at a multiple of 5 minutes past an hour, during which either
    • At least 5 GET requests associated with the account are Failed Requests, and the number of GET requests associated with the account which are Failed Requests is more than 0.5% of the total number of GET requests associated with the account; or
    • At least 5 Non-GET Requests associated with the account are Failed Requests, and the number of Non-GET Requests associated with the account which are Failed Requests is more than 5% of the total number of Non-GET Requests associated with the account.
  • "Monthly Uptime Percentage" means: 100% minus the number of Severely Errored Intervals divided by the total number of five-minute periods in the billing cycle (i.e., 288 times the number of days).

Three notes are in order here:

  1. The use of Severely Errored Intervals as a metric in place of simply computing the average Error Rate would distinguish the low baseline rate of errors which result from S3's design (and are mostly harmless) from the exceptional periods where S3's error rate spikes upwards (often, but not always, to 100%). In so doing, this change would make it possible to increase the guaranteed Monthly Uptime Percentage without increasing the number of SLA credits given.
  2. I distinguish between GET failures and non-GET failures for two simple reasons: First, GET failures are far less common, so it wouldn't hurt Amazon to offer a strengthened guarantee for GETs; and second, because in many situations a GET failure is more problematic than a PUT failure -- not least because web browsers downloading public files from S3 don't automatically retry failed requests.
  3. The dual requirement that at least 5% (or 0.5% for GETs) of requests fail AND that there be at least 5 failed requests makes it extremely unlikely that Error Rate increasing tricks could be used to artificially raise an interval across the threshold required to qualify as Severely Errored.
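
These definitions are mechanical enough to compute directly from an account's request logs. A sketch of how the computation might look -- the log format, function name, and parameters are my own illustrative choices, not anything Amazon provides:

```python
from collections import defaultdict

FAILURE_CODES = {"InternalError", "ServiceUnavailable"}

def monthly_uptime(requests, days_in_month):
    """requests: iterable of (timestamp_seconds, method, error_code)
    tuples, where error_code is None for a successful request.
    Returns (number of Severely Errored Intervals, Monthly Uptime %)."""
    buckets = defaultdict(lambda: {"GET": [0, 0], "OTHER": [0, 0]})
    for ts, method, err in requests:
        interval = int(ts) // 300          # five-minute bucket
        cls = "GET" if method == "GET" else "OTHER"
        buckets[interval][cls][0] += 1     # total requests
        if err in FAILURE_CODES:
            buckets[interval][cls][1] += 1 # Failed Requests
    severe = 0
    for b in buckets.values():
        get_total, get_fail = b["GET"]
        oth_total, oth_fail = b["OTHER"]
        # At least 5 failures AND more than 0.5% (GET) / 5% (non-GET).
        if (get_fail >= 5 and get_fail > 0.005 * get_total) or \
           (oth_fail >= 5 and oth_fail > 0.05 * oth_total):
            severe += 1
    uptime = 100.0 * (1 - severe / (288 * days_in_month))
    return severe, uptime
```

Note how the 5-failure floor does the work described in point 3: an interval containing 3 failed requests out of 10 has a 30% error rate, but with fewer than 5 Failed Requests it still does not qualify as Severely Errored.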

Now, I don't expect Amazon to adopt this suggestion overnight, and I suspect that even if they are inspired to fix the SLA they'll do it in such a way that the result is at most barely recognizable as being related to what I've posted here; but I hope this will at least spark some discussions about making the set of people who receive SLA credits better reflect the set of people affected by outages.

And Amazonians -- I know you're going to be reading this, since I logged hundreds of you reading my last post about the S3 SLA -- if this does open your eyes a bit, could you let me know? It's always a bit unsettling to see a deluge of traffic coming from an organization but not to hear anything directly. :-)

Posted at 2009-01-19 11:50

Tarsnap news

It has been just over two months since I opened the tarsnap beta to the public, and I've been busy making improvements to tarsnap -- a few bug fixes, but mostly new features. While some of the features I've added to tarsnap recently resulted from suggestions I've received in the past two months, the majority are things I've had planned for a long time; but even with those, input I've received from tarsnap beta testers has been useful, as many of the features I've added were much lower priorities until I started getting lots of emails asking for them.

A detailed listing of the improvements in individual versions of the tarsnap client code is available from the tarsnap beta client download page (and an even more detailed listing can be produced using diff), but in my eyes the most important changes in the past two months are:

While tarsnap is much better now than it was two months ago, I still have a long list of improvements waiting to be made -- and I'm sure there is an even longer list of improvements which I haven't thought of and nobody has asked for yet. So go get started with tarsnap and then send me your ideas for how to make it better!

Posted at 2009-01-17 13:50

New GPG key

After using the same GPG key for over five years, I decided that it was time to create a new key. I still hold the private part of my old GPG key, but I will not be using it to sign anything in the future; and when it is necessary for people to send me encrypted email, I would prefer that they use my new GPG key.

My new GPG key is signed by my old key and by the FreeBSD Security Officer key, and can be downloaded here.

I have also generated a revocation certificate for my old key, which can be downloaded here.

Posted at 2009-01-15 12:25
