A strange S3 performance anomaly

Tarsnap, my online backup service, stores data in Amazon S3 by synthesizing a log-structured filesystem out of S3 objects. This is a simple and very effective approach, but it has an interesting effect on Tarsnap's S3 access pattern: Most of its GETs are the result of log cleaning, and so while Tarsnap doesn't make a very large number of requests relative to the amount of data it has stored in S3 — backups are the classic "write once, read maybe" storage use case — it makes requests for a large number of distinct S3 objects. This makes Tarsnap quite sensitive to any S3 glitches which affect a small proportion of S3 objects.
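
To make that access pattern concrete, here is a toy sketch (emphatically not Tarsnap's actual code) of a log-structured store built out of named objects; the put, get, and delete arguments are hypothetical callables standing in for the corresponding S3 operations. New data is only ever appended as fresh segment objects, and it is the cleaner which issues GETs, one per distinct old segment, to carry live blocks forward:

    class LogStructuredStore:
        def __init__(self, put, get, delete):
            # put/get/delete: hypothetical wrappers around S3 PUT/GET/DELETE
            self.put, self.get, self.delete = put, get, delete
            self.next_segment = 0
            self.live = {}        # block id -> name of the segment holding it

        def append_segment(self, blocks):
            """Write a batch of blocks as one new segment object (a single PUT)."""
            name = "segment-%08d" % self.next_segment
            self.next_segment += 1
            self.put(name, blocks)
            for block_id in blocks:
                self.live[block_id] = name
            return name

        def clean(self, segment_names):
            """Reclaim space: GET each old segment, rewrite its still-live
            blocks into a fresh segment, then DELETE the old objects."""
            survivors = {}
            for name in segment_names:
                for block_id, data in self.get(name).items():   # one GET per distinct object
                    if self.live.get(block_id) == name:          # block not yet superseded?
                        survivors[block_id] = data
            if survivors:
                self.append_segment(survivors)
            for name in segment_names:
                self.delete(name)

Each of those cleaning GETs targets a distinct object, which is why a problem affecting even a handful of objects shows up quickly in my logs.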

One such glitch happened earlier this week. The Tarsnap server code has a 3 second timeout for S3 GETs, and normally 99.7% of them complete within this window; the few which do not are usually due to transient network glitches or sluggish S3 endpoints, and generally succeed when they are retried. Between 2012-10-15 05:00 and 2012-10-16 10:00, however — a period of 29 hours — Tarsnap attempted to GET one particular S3 object 40 times, and every single attempt timed out. Clearly there was something unusual about that S3 object, but it wasn't anything I did: that object was no larger than millions of other S3 objects Tarsnap has stored.

I'm guessing that what I saw here was an artifact of Amazon S3's replication process. S3 is documented as being able to "sustain the concurrent loss of data in two facilities", but Amazon does not state how this is achieved. The most obvious approach would be to keep three copies of each S3 object in different datacenters, but this would result in a significant cost increase; a cheaper alternative which maintains the same "concurrent loss of two facilities" survivability criterion would be to use an erasure-correcting code: Given N+2 datacenters, take N equally-sized objects, apply an (N+2, N) code, and store each of the N+2 resulting parts in a different datacenter. Presumably they would use a systematic code, keeping the N objects intact and creating two new "parity" objects. While providing much cheaper storage, this does have one disadvantage: If the datacenter holding a specific object's intact copy went offline (or its copy of the object was corrupted), then they would need to issue N reads to different datacenters in order to reconstruct that object — making reads of that object very much slower than normal.
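
To illustrate the sort of scheme I have in mind (purely my own sketch; Amazon has published nothing about their internals), here is a toy RAID-6-style systematic (N+2, N) code over GF(2^8) in Python. N data shards yield two parity shards P and Q, so any two shards can be lost; but rebuilding a lost shard requires reading roughly N of the survivors:

    # exp/log tables for GF(2^8) with the usual Reed-Solomon polynomial 0x11d
    EXP = [0] * 512
    LOG = [0] * 256
    x = 1
    for i in range(255):
        EXP[i] = x
        LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    for i in range(255, 512):
        EXP[i] = EXP[i - 255]

    def gf_mul(a, b):
        if a == 0 or b == 0:
            return 0
        return EXP[LOG[a] + LOG[b]]

    def encode(shards):
        """Compute parity shards P and Q for N equal-length data shards."""
        p = bytearray(len(shards[0]))
        q = bytearray(len(shards[0]))
        for i, shard in enumerate(shards):
            for j, byte in enumerate(shard):
                p[j] ^= byte                  # P = d_0 + d_1 + ... + d_{N-1}
                q[j] ^= gf_mul(EXP[i], byte)  # Q = 2^0*d_0 + 2^1*d_1 + ...
        return bytes(p), bytes(q)

    def reconstruct_missing(surviving_shards, p):
        """Rebuild a single lost data shard from P plus the other N-1 data
        shards: this is the slow path, requiring reads from N locations.
        (Recovering two lost shards additionally needs Q; omitted here.)"""
        out = bytearray(p)
        for shard in surviving_shards:
            for j, byte in enumerate(shard):
                out[j] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]       # N = 4 data shards
    p, q = encode(data)
    # Pretend the datacenter holding shard 2 is unreachable:
    rebuilt = reconstruct_missing(data[:2] + data[3:], p)
    assert rebuilt == data[2]

The point of the sketch is the read path: as long as a shard's intact copy is available it can be served with a single read, but once that copy is gone, reconstruction fans out across N other locations.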

Depending on price/performance considerations, this might not be the optimal strategy; I would not be surprised if they instead stored intact copies of each S3 object in two datacenters and only resorted to remote reads and erasure-correction reconstruction if both datacenters failed to satisfy the request. Whatever the precise implementation, based on my working assumption that Amazon has people who are at least as smart as I am — and given the performance anomaly I observed — I think it's quite likely that S3 uses some form of erasure-correction and as a result has a slow "data was almost but not quite lost" read path.

This raises two interesting points. First, the existence of such "black swan" objects changes the optimal retry behaviour. If there were merely "black swan" requests — that is, if some small fraction of randomly selected requests were very slow — then a strategy of timing out quickly and re-issuing requests with the same timeout would be optimal, since in that case there's no reason to think that a request timing out once makes a retry any more likely to time out. If the time needed for a request to complete is entirely determined by the object being read, on the other hand, there's no point timing out and retrying at all, since you'll only postpone the inevitable long wait. With both slow S3 objects and slow S3 hosts, a balanced approach is optimal — retries with increasingly long timeouts.
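
As a minimal sketch of that balanced approach (assuming a hypothetical fetch(key, timeout) helper rather than Tarsnap's actual S3 client):

    import time

    def get_with_escalating_timeouts(fetch, key, timeouts=(3, 6, 12, 24)):
        # `fetch(key, timeout)` is a hypothetical helper which raises
        # TimeoutError when the deadline passes. Short early timeouts deal
        # cheaply with a sluggish host (a retry will likely land on a faster
        # endpoint); the growing deadline eventually accommodates an object
        # whose reads are inherently slow.
        last_exc = None
        for timeout in timeouts:
            try:
                return fetch(key, timeout)
            except TimeoutError as exc:
                last_exc = exc
                time.sleep(0.1)    # brief pause before trying again
        raise last_exc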

The second point is more subtle, but wider in its implications: The S3 SLA is meaningless. The S3 SLA provides for partial refunds in the event that S3 returns too many errors over a month — a 10% credit if more than 0.1% of requests fail, or a 25% credit if more than 1% of requests fail. However, no guarantee is provided concerning the time taken to return a response — and in a distributed system without response time guarantees, a crash failure is indistinguishable from (and thus must be treated the same as) a very slow response. Amazon could "cheat" the S3 SLA very easily by simply detecting outgoing InternalError and ServiceUnavailable responses, blocking them, and instead holding those TCP connections open without sending any response until the S3 client gives up.

Do I think Amazon is doing this? Not deliberately, to be sure — but I can't help noticing that whereas in 2008 I saw 0.3% of my GETs returning HTTP 500 or 503 responses and almost zero connection timeouts, I now see almost zero 500 or 503 responses but 0.3% of my GETs time out. Something happened within S3 which reduced the number of "failures" while lengthening the response-time tail; whether this was a good trade-off or not depends on how you're using S3.

Amazon is famously secretive, and I hate it. While I understand the reasoning behind not wanting to share their "secret sauce" with competitors, I think it's ultimately futile; and as a user of Amazon Web Services, I have run into plenty of situations where knowing more about their internals would have allowed me to be a better customer. Sure, Amazon provides some assistance with such questions via their premium support, and it's quite possible that if I were paying them $15,000/month they would have warned me that occasionally a few S3 objects will be persistently slow to access for a period of many hours; but to quote Jeff Bezos, even well-meaning gatekeepers slow innovation — so I firmly believe that the ecosystem of AWS users, and ultimately Amazon itself, would benefit from greater openness.

But for now, I'll keep watching my logs, trying to reverse-engineer the behaviour of Amazon's services, and adjusting my code based on my findings. And, of course, writing the occasional blog post about my observations.

Posted at 2012-10-18 07:30