## Hacking the Amazon S3 SLA

The Simple Storage Service (S3) provided by Amazon comes with a Service Level Agreement: If the Monthly Uptime Percentage is between 99% and 99.9%, you get a 10% refund; if the Monthly Uptime Percentage is below 99%, you get a 25% refund. The Monthly Uptime Percentage is computed in a fairly straightforward manner: Divide the month into 5-minute intervals and compute the Error Rate (failed requests divided by total requests, treating 0/0 as 0) for each interval; compute the average Error Rate over all the 5-minute intervals in the month; and subtract this value from 100%.
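As a minimal sketch of that computation (the function names and the sample counts are hypothetical, and the handling of values exactly at the 99% and 99.9% boundaries is an assumption, since the summary above doesn't spell it out):

```python
def monthly_uptime_percentage(intervals):
    """intervals: list of (failed, total) request counts, one pair per 5-minute interval."""
    # Error Rate per interval, treating 0/0 as 0.
    rates = [(failed / total if total else 0.0) for failed, total in intervals]
    # 100% minus the average Error Rate over all intervals.
    return 100.0 * (1.0 - sum(rates) / len(rates))


def refund_percentage(uptime):
    """Refund tier for a given Monthly Uptime Percentage (boundary handling assumed)."""
    if uptime < 99.0:
        return 25
    if uptime < 99.9:
        return 10
    return 0


# Four made-up intervals: idle, 1 failure in 2, 0 failures in 10, 5 failures in 5.
sample = [(0, 0), (1, 2), (0, 10), (5, 5)]
uptime = monthly_uptime_percentage(sample)
print(uptime, refund_percentage(uptime))
```

Note that every interval carries the same weight no matter how many requests it contains; that is the property the rest of this post exploits.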

If the probability of a request failing during the nth 5-minute interval is p(n), and the number of requests issued during the nth interval is determined solely by p(n), the expected value of the Monthly Uptime Percentage is 100% minus the average value of p(n) over all the intervals; put another way, you can't cheat by waiting for a high p(n) and then quickly running up the failure count by issuing lots and lots of requests. However, this uncheatability applies only if the number of requests issued is independent of the success or failure of individual requests; if we can see whether one request succeeded before issuing the next one, we can cheat the SLA -- quite extravagantly, in fact.

Consider a hypothetical attacker who has a very large amount of data stored on S3, but doesn't need to access it (perhaps said data is a backup of his world domination plans). Consequently, he doesn't have to issue any requests; but if issuing requests will allow him to get a refund via the SLA, said refund will be large enough that the cost of performing the requests doesn't matter in comparison. Furthermore, assume for the purpose of simplicity that S3 has a constant failure rate p for the entire month, and the attacker knows what p is (if necessary, he can figure it out quickly by sending a very large number of requests). Finally, assume that the attacker can issue as many requests as he likes within each 5-minute interval, and that he can see if each request succeeds or fails before deciding whether to issue another request. What should this attacker do to maximize the chance that he will get a refund?

Well, at the beginning of each 5-minute interval, he should issue a request. There's no point not doing this -- if he doesn't issue any requests, he'll get an Error Rate of zero for the interval. Suppose that request fails; should he issue any more requests? Absolutely not: If his first request fails, the attacker has achieved a 100% Error Rate for that 5-minute interval. Suppose the request succeeds; should he issue any more requests? Absolutely: Zero failed requests out of one request means an Error Rate of zero, and issuing more requests can't make that any worse. In fact, no matter how many requests he has made, if he hasn't seen any failures yet, he should issue more requests -- they can't decrease the Error Rate, and they might increase it. This provides a very simple strategy to the attacker: Keep issuing requests until at least one failure occurs.
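To put a number on this strategy: if the first failure arrives on request k, which happens with probability (1-p)^(k-1) p, the interval's Error Rate is 1/k. The sketch below sums that series and compares it against the closed form p ln(1/p) / (1-p). The function names and the cutoff are my own choices for illustration, and it assumes a constant p and an unlimited number of requests per interval.

```python
import math


def stop_after_first_failure(p, max_requests=1_000_000):
    """Expected Error Rate when the attacker stops at the first failed request."""
    expected = 0.0
    all_ok_so_far = 1.0  # probability that the first k-1 requests all succeeded
    for k in range(1, max_requests + 1):
        expected += all_ok_so_far * p / k  # first failure on request k -> Error Rate 1/k
        all_ok_so_far *= 1.0 - p
        if all_ok_so_far < 1e-18:  # the remaining tail is negligible
            break
    return expected


def closed_form(p):
    return p * math.log(1.0 / p) / (1.0 - p)


p = 0.01
print(stop_after_first_failure(p), closed_form(p))
```

For p = 0.01 both come out at roughly 0.0465, more than four times the underlying failure rate.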

He can do better than this. Suppose he sees his first error after making 2/p requests; if he stops at that point, he will have an Error Rate of p/2 for the interval, much less than the expected p if he issues a very large number of requests. The value i - pj for increasing values of j, where i, j are the number of failed requests and the total number of requests respectively, behaves as a random walk, and it is well known that a balanced random walk will take positive values infinitely many times with probability one; so no matter how many successful requests he has encountered, if the attacker issues enough requests he will eventually (with probability one) be able to stop with an Error Rate of more than p. This provides a better strategy to the attacker: Keep issuing requests until the Error Rate i/j is greater than p.
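A rough way to evaluate this second strategy is to track the probability of every "still going" state request by request. In the sketch below, p is written as 1/m so that the test i/j > p can be done in exact integer arithmetic, and the computation is cut off after `horizon` requests; any probability mass still unstopped at that point is valued at p, which slightly understates what the attacker would eventually get. The names `stop_when_above_p`, `m` and `horizon` are mine.

```python
def stop_when_above_p(m, horizon):
    """Expected Error Rate for p = 1/m when stopping as soon as i/j > p.

    Returns (expected_rate, leftover), where leftover is the probability of
    not having stopped within `horizon` requests (valued at p as an assumption).
    """
    p = 1.0 / m
    alive = [1.0]  # alive[i] = probability of being unstopped with i failures so far
    expected = 0.0
    for j in range(1, horizon + 1):
        nxt = [0.0] * (len(alive) + 1)
        for i, prob in enumerate(alive):
            if prob == 0.0:
                continue
            nxt[i] += prob * (1.0 - p)  # request j succeeds: still at or below p, keep going
            if (i + 1) * m > j:  # request j fails and (i+1)/j > p: stop here
                expected += prob * p * (i + 1) / j
            else:  # request j fails but the rate is still at or below p
                nxt[i + 1] += prob * p
        while len(nxt) > 1 and nxt[-1] == 0.0:  # drop unreachable states
            nxt.pop()
        alive = nxt
    leftover = sum(alive)
    return expected + leftover * p, leftover


value, leftover = stop_when_above_p(100, 5000)
print(value, leftover)
```

The leftover mass shrinks slowly as the horizon grows, which is the random walk taking its time to come back above zero.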

Can the attacker do any better than this? Yes. Suppose p = 0.01 and the attacker's 99th request fails. He now has i/j = 0.0101 > p; but suppose he decides to continue making requests. If he gets lucky and encounters another failure within the next 98 requests -- an event with probability 63% -- he will increase his Error Rate. In fact, he has a 50% chance of encountering his second failure within the next 69 requests -- which would give him an Error Rate of 0.0119 or more, which more than balances the less than 50% chance that he will end up with an Error Rate of less than 0.0101. With a bit of thought, we can see that for any p there must be a sequence x[1], x[2], x[3], ... such that the optimal strategy is to stop issuing requests if the ith failure is encountered when fewer than x[i] total requests have been made.
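The figures in that example are easy to check directly; the variable names here are just for illustration.

```python
p = 0.01

# Chance of at least one more failure within the next 98 (or 69) requests.
prob_within_98 = 1 - (1 - p) ** 98
prob_within_69 = 1 - (1 - p) ** 69

# Error Rate after the first failure on request 99, and after a second
# failure arriving exactly 69 requests later.
rate_now = 1 / 99
rate_if_second_by_69 = 2 / (99 + 69)

print(prob_within_98, prob_within_69, rate_now, rate_if_second_by_69)
```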

How can we determine these x[i]? With great difficulty. We can, however, approximate them by considering a continuous limit. Defining F(i, j) as the expected Error Rate if there are currently i failures out of j requests and the optimal strategy is followed, we can define G[i](x) for positive x as the limit, as p tends to zero, of F(i, x/p) * exp(-x) / p, and define X[i] as the limit as p tends to zero of x[i] / p. (Yes, the limits do exist.) This provides us with a series of differential equations: For each i, G[i](x) = exp(-x) * i/x for x <= X[i], and G'[i](x) = - G[i+1](x) for x > X[i].

These differential equations don't appear to have any closed-form symbolic solution, but they can be solved numerically. The first few values X[i] are approximately X[1] = 0.507, X[2] = 1.197, X[3] = 1.950, X[4] = 2.740, X[5] = 3.554, X[6] = 4.386, X[7] = 5.232, X[8] = 6.089, X[9] = 6.956, X[10] = 7.830. Asymptotically, X[i] appears to be approximately i - sqrt(i/2); this shouldn't be surprising, since with probability 1 - epsilon the attacker will encounter i + O(sqrt(i)) failures if he performs i/p requests (where the implicit constant depends on epsilon). Given these values, the attacker's limiting strategy for small p is to stop issuing requests as soon as he has seen i failures out of fewer than X[i] / p requests within the current 5-minute interval.
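For a fixed p, the thresholds x[i] can also be estimated without the continuous limit, by backward induction on F(i, j) = max(i/j, p F(i+1, j+1) + (1-p) F(i, j+1)). This is not the differential-equation approach described above, just a brute-force sketch under one loudly stated assumption: the recursion is cut off at `horizon` requests, where the attacker is credited with max(i/horizon, p). The names `optimal_stopping` and `horizon` are mine.

```python
def optimal_stopping(p, horizon):
    """Backward induction for F(i, j), truncated at j = horizon (an assumption).

    Returns (F(0, 0), thresholds), where thresholds[i] is the largest j below
    the horizon at which stopping with i failures is at least as good as continuing.
    """
    # Terminal row: stop with i/horizon, or keep going and end up near p.
    nxt = [max(i / horizon, p) for i in range(horizon + 1)]
    thresholds = {}
    for j in range(horizon - 1, 0, -1):
        cur = [0.0] * (j + 1)
        for i in range(j + 1):
            stop = i / j
            cont = p * nxt[i + 1] + (1.0 - p) * nxt[i]
            cur[i] = max(stop, cont)
            if i >= 1 and stop >= cont and i not in thresholds:
                thresholds[i] = j  # first hit while scanning downward = largest such j
        nxt = cur
    # With no requests issued the Error Rate is 0, so the attacker always continues.
    f00 = p * nxt[1] + (1.0 - p) * nxt[0]
    return f00, thresholds


p = 0.01
f00, thresholds = optimal_stopping(p, 1500)
print("expected Error Rate:", f00)
for i in range(1, 6):
    # Only small i are meaningful; large i are dominated by the truncation.
    print(i, thresholds[i], thresholds[i] * p)
```

The last column, x[i] * p, is the quantity to compare against the X[i] values listed above; at p = 0.01 it should land in the same neighbourhood, though not exactly on them, since those are limits as p tends to zero.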

How large an expected Error Rate do these strategies produce? Much larger than the p which might be expected.

| Strategy | Expected Error Rate | Minimum p to get a 25% refund | Minimum p to get a 10% refund |
| --- | --- | --- | --- |
| One request per 5-minute interval | p | 1% | 0.1% |
| Stop after 1 failure | p ln(1/p) / (1-p) | 0.154% | 0.0110% |
| Stop when Error Rate > p | p ln(1/p) + 0.191 p (approximately) | 0.149% | 0.0107% |
| Optimal strategy | p ln(1/p) + 0.292 p (approximately) | 0.147% | 0.0106% |

Put another way, in order to avoid giving a refund to our hypothetical attacker, Amazon has to maintain a failure rate almost 10 times lower than what might be naively assumed from reading the SLA.
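The "minimum p" columns follow from the Expected Error Rate formulas by solving for the p at which the expected rate reaches 1% (25% refund) or 0.1% (10% refund). A simple bisection is enough, since each formula is increasing in p over the range of interest. The helper name `min_p` and the search bounds are my own choices, and the 0.191 and 0.292 constants are the approximate ones from the table.

```python
import math


def min_p(expected_rate, target, lo=1e-9, hi=0.1):
    """Smallest p in [lo, hi] with expected_rate(p) >= target, by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_rate(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


strategies = {
    "One request per 5-minute interval": lambda p: p,
    "Stop after 1 failure": lambda p: p * math.log(1 / p) / (1 - p),
    "Stop when Error Rate > p": lambda p: p * math.log(1 / p) + 0.191 * p,
    "Optimal strategy": lambda p: p * math.log(1 / p) + 0.292 * p,
}

for name, rate in strategies.items():
    # Targets: average Error Rate of 1% (uptime 99%) and 0.1% (uptime 99.9%).
    print(f"{name}: {100 * min_p(rate, 0.01):.3f}%  {100 * min_p(rate, 0.001):.4f}%")
```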

Speaking of failure rates, what are the request failure rates for S3? The failure rates for GETs and DELETEs are very low -- on the order of one failure per 100,000 requests according to my logs. PUTs, however, are far more prone to failure: In the last six months, I've seen PUT failure rates of 0.030%, 0.449%, 0.089%, 0.085%, 0.192%, and 0.178% respectively. None of these are below the 0.0106% rate needed to avoid paying a 10% refund to our hypothetical attacker, and in three of these months, the error rate was high enough that our attacker would have received a 25% refund.

S3 is a great service... but if Amazon wants to avoid paying refunds under the SLA, they've got some work to do.