How Tarsnap uses Amazon Web Services

Regular readers of these Daemonic Dispatches will no doubt have noticed that I have mentioned Amazon Web Services on many occasions, and it's no secret that my tarsnap online backup service is built on top of Amazon Web Services. Over the month since tarsnap reached public beta, a number of people have asked me questions about AWS and how tarsnap uses it, so I think now is a good time to provide some insight into how the tarsnap service works behind the scenes.

The tarsnap server provides a transactional key -> blob store to tarsnap clients. The keys are a fixed 33 bytes (a one-character type plus a 256-bit unique ID generated using SHA256), while blobs are an average of about 30 kB but can be as large as 256 kB. In order to create a new archive, the tarsnap client sends a "write transaction start" request, many "write data" requests, and a "commit transaction" request to the tarsnap server; deleting an archive is similar (except with a "delete transaction start" and "delete data" requests).
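To make the shape of that interface concrete, here is a minimal in-memory sketch. It is not the tarsnap server code: the class and method names are hypothetical, the way the 256-bit ID is derived from its input is an assumption, and "256 kB" is taken to mean 256 * 1024 bytes.

```python
import hashlib

KEY_LEN = 33             # one type character + 32-byte (256-bit) SHA256 digest
MAX_BLOB = 256 * 1024    # assumption: "256 kB" read as 256 * 1024 bytes


def make_key(kind: str, material: bytes) -> bytes:
    """Build a 33-byte key; hashing `material` directly is an assumption."""
    if len(kind) != 1:
        raise ValueError("type must be a single character")
    return kind.encode("ascii") + hashlib.sha256(material).digest()


class ToyStore:
    """Hypothetical transactional key -> blob store, held in memory."""

    def __init__(self):
        self.committed = {}   # visible data
        self.pending = None   # writes of the open transaction, if any

    def write_start(self):
        self.pending = {}

    def write_data(self, key: bytes, blob: bytes):
        if self.pending is None:
            raise RuntimeError("no write transaction in progress")
        if len(key) != KEY_LEN or len(blob) > MAX_BLOB:
            raise ValueError("bad key or oversized blob")
        self.pending[key] = blob

    def commit(self):
        if self.pending is None:
            raise RuntimeError("no write transaction in progress")
        # All writes become visible together, or not at all.
        self.committed.update(self.pending)
        self.pending = None

    def abort(self):
        # E.g. the client vanished mid-archive: nothing is left behind.
        self.pending = None
```

The point of the sketch is only that nothing written between "start" and "commit" is visible until the commit happens.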

The tarsnap server has no concept of separate tarsnap archives or of eliminating data which is duplicated between archives; instead, it is up to the tarsnap client to recognize duplicate data and avoid storing it again, and to only delete data once no remaining archives require it (this is important for both performance and security). As a result, it is essential that the tarsnap server provide a storage system which is both transactional and strongly consistent: Without this guarantee, it would be possible for partial archives to be stored but "orphaned" in the event that the tarsnap client crashed in the middle of writing an archive (in which case an unlucky user would be stuck paying to store inaccessible and useless bits), or -- even worse -- for the tarsnap client to delete some bits which were still required by an archive, rendering that archive unreadable.
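The client-side bookkeeping described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the tarsnap client's actual data structures: each archive is modelled as a plain set of block keys, and both function names are hypothetical.

```python
def keys_to_upload(new_keys, archives):
    """Return the keys of a new archive that no existing archive already stores.

    `archives` maps archive name -> set of block keys (a simplification).
    Order is preserved and repeats within the new archive are dropped.
    """
    stored = set().union(*archives.values())
    seen = set()
    result = []
    for key in new_keys:
        if key not in stored and key not in seen:
            seen.add(key)
            result.append(key)
    return result


def keys_to_delete(name, archives):
    """Return the keys that can be deleted when archive `name` is removed.

    A key is only safe to delete if no *other* archive still needs it.
    """
    still_needed = set().union(
        *(keys for other, keys in archives.items() if other != name)
    )
    return set(archives[name]) - still_needed


archives = {"monday": {"a", "b"}, "tuesday": {"b", "c"}}
print(keys_to_upload(["c", "d", "d"], archives))  # ['d']
print(keys_to_delete("monday", archives))         # {'a'}
```

Block "b" survives the deletion of "monday" because "tuesday" still refers to it; getting this wrong is exactly the failure that would make an archive unreadable.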

Of course, providing strong consistency comes at a price: Tarsnap sacrifices availability. If, at any point, the tarsnap server finds that it is unable to service a request, it simply drops the connection to the tarsnap client; the tarsnap client will then re-connect and retry. In practice, tarsnap client re-connections occur far more often due to client-server connections falling victim to network outages and packet loss; but the tarsnap client handles a server failure and a network outage identically (in fact, the client code isn't even aware of the distinction).
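In code, "reconnect and retry without caring why the connection dropped" might look something like the sketch below. The function name, its parameters, and the fixed retry delay are all assumptions made for illustration; the real client's retry policy is not described here.

```python
import time


def with_reconnect(connect, request, max_attempts=5, delay=1.0):
    """Run `request(conn)` on a fresh connection, retrying on any drop.

    `connect` and `request` are caller-supplied callables (hypothetical).
    A server-side drop and a network failure both surface as
    ConnectionError, so they take exactly the same path.
    """
    for _ in range(max_attempts):
        try:
            conn = connect()
            return request(conn)
        except ConnectionError:
            time.sleep(delay)  # wait a little, then reconnect and retry
    raise ConnectionError("giving up after %d attempts" % max_attempts)
```

Note that this only works if repeating a request is harmless, which is one more reason the transactional guarantees matter.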

Once data reaches the tarsnap server, it is stored to Amazon S3; and the tarsnap server only acknowledges the client request once the S3 PUT has successfully completed -- that is, once the data has been stored on disks in multiple geographically diverse datacenters. However, S3 by itself doesn't provide either the consistency guarantees or the transactionality required by tarsnap. To provide these, the tarsnap server implements a log-structured filesystem on S3, but holds all of the relevant metadata on an EC2 instance. Because log entry numbers are strictly increasing, each object which the tarsnap server stores on S3 only has one possible value (if it exists at all); and because metadata is stored on EC2 (which makes it easy to provide strong consistency guarantees), this makes it possible to recognize if S3 provides "stale" data (since the only possible stale response is a 404 error).
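A toy model may help show why write-once log entries plus locally held metadata are enough to detect stale reads. Everything here is an assumption made for illustration: S3 is stood in for by a dict, the object naming scheme and entry layout (key followed by blob) are invented, and the class is not the tarsnap server's design.

```python
KEY_LEN = 33  # fixed key size from the text


class StaleRead(Exception):
    """S3 has not (yet) returned an object the metadata says exists."""


class LogStore:
    def __init__(self, s3):
        self.s3 = s3          # dict-like stand-in for S3: name -> bytes
        self.next_entry = 0   # strictly increasing log entry number
        self.index = {}       # metadata: blob key -> log entry number

    @staticmethod
    def _name(n):
        return "log/%016d" % n  # hypothetical object naming scheme

    def append(self, key, blob):
        n = self.next_entry
        name = self._name(n)
        # Each object name is written exactly once, so it can only ever
        # hold one value; there is no "old version" to be served.
        assert name not in self.s3
        self.s3[name] = key + blob
        self.index[key] = n
        self.next_entry = n + 1
        return n

    def read(self, key):
        n = self.index[key]                 # authoritative metadata
        obj = self.s3.get(self._name(n))
        if obj is None:
            # The only way S3 can be stale here is "not found".
            raise StaleRead(self._name(n))
        return obj[KEY_LEN:]
```

Because an object is never overwritten, a read either returns the right bytes or visibly fails, and a visible failure can simply be retried.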

Now, holding (meta)data on an EC2 instance means that we must accept the possibility of the EC2 instance dying; but the use of a log-structured filesystem makes this an easy problem to solve: All of the metadata is implicit in the individual log entries, so to regenerate the metadata one must merely read the log entries back from S3. In fact, this provides a very good "safety net" to protect against any unforeseen glitches in the tarsnap service: If all else fails, I can do a complete reboot of the service by throwing away everything except the data stored on S3 and then reconstructing all of the transient state.
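Rebuilding metadata by replaying the log can be sketched in a few lines. The entry layout used here (one operation byte, `W` or `D`, then the 33-byte key, then the blob) and the zero-padded object names are purely illustrative assumptions, not tarsnap's on-S3 format.

```python
KEY_LEN = 33


def rebuild_index(objects):
    """Reconstruct key -> log entry number by replaying every log entry.

    `objects` maps object name (e.g. "log/0000000000000002") to bytes.
    Names are zero-padded, so sorting them replays entries in log order.
    """
    index = {}
    for name in sorted(objects):
        entry = objects[name]
        op = entry[:1]
        key = entry[1:1 + KEY_LEN]
        number = int(name.rsplit("/", 1)[1])
        if op == b"W":
            index[key] = number      # blob written by this entry
        elif op == b"D":
            index.pop(key, None)     # blob deleted by this entry
    return index
```

Since the function needs nothing but the stored objects, the index itself never has to be backed up.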

Naturally, this all comes at a cost, and this is part of why tarsnap's price for storage ($0.30 / GB / month) and bandwidth ($0.30 / GB) is higher than S3's prices ($0.15 / GB / month for storage, and $0.10 -- $0.17 / GB for bandwidth). However, the price difference isn't as large as it seems: In addition to the price of storage and bandwidth, S3 charges a per-request fee of $0.00001 (for PUTs) or $0.000001 (for GETs). While this seems small, it adds up: If the tarsnap client wrote data directly to S3 (ignoring, for the purpose of argument, the fact that S3 doesn't provide consistency and transactionality) the added cost of S3 PUTs would make it more expensive than writing data via the tarsnap server. Because the tarsnap server services requests from many clients at once, it is able to "bundle" multiple writes together, with the result that the cost of PUTs which I end up paying S3 is considerably lower on a per-GB basis.
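The arithmetic behind that claim is easy to check using the figures above. The only assumptions in this sketch are that 1 GB is 10^9 bytes, that every blob is the average 30 kB (taken as 30,000 bytes), and the 1 MB bundle size, which is a made-up number used to show the effect of bundling.

```python
PUT_FEE = 0.00001          # dollars per S3 PUT (from the text)
AVG_BLOB_BYTES = 30_000    # "about 30 kB" average blob
GB = 10**9                 # assumption: decimal gigabyte


def put_cost_per_gb(bytes_per_put):
    """Dollars spent on PUT requests to store 1 GB at a given object size."""
    return GB / bytes_per_put * PUT_FEE


direct = put_cost_per_gb(AVG_BLOB_BYTES)   # one PUT per blob
bundled = put_cost_per_gb(1_000_000)       # hypothetical 1 MB bundles

print("direct:  $%.3f per GB in PUT fees" % direct)    # $0.333
print("bundled: $%.3f per GB in PUT fees" % bundled)   # $0.010
```

One PUT per 30 kB blob costs about $0.33 per GB in request fees alone, which already exceeds tarsnap's $0.30 per GB bandwidth price before S3's own $0.10 to $0.17 per GB is added.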

That said, I do make a few cents of profit out of the $0.30; but unlike Jungle Disk, I don't charge an up-front fee for the tarsnap client code (the Jungle Disk software costs $20) or a monthly fee for the service (Jungle Disk Plus costs a flat $1/month beyond the S3 costs), and given that I spent two years working full-time on writing all the code for tarsnap, I don't think it's unreasonable for me to add a small markup to the service in order to pay for my time. :-)

Speaking of money, I would be remiss if I failed to mention SimpleDB. While I wrote last year that one should not try to use [SimpleDB] to store any sort of accounting information, this is in fact exactly what I'm doing: Tarsnap users' current and historical account balances, along with how much storage and bandwidth they have used, are all stored in SimpleDB. The fact that SimpleDB lacks any useful consistency guarantee means that it is theoretically impossible to do this without a risk that a user's usage, or worse, a payment made, will be "lost" and not reflected in a user's current balance; but I've written the code in such a way that the tarsnap accounting code will never lose anything provided that SimpleDB always reaches consistency within 24 hours. Given that I've been told that SimpleDB usually reaches consistency within a few seconds, the danger of losing some accounting data is low enough that I'm not particularly concerned. I would never be so cavalier about someone's data, but accounting data... well, it's only money, after all.

There are still improvements which I'd like to see made to Amazon Web Services: Removing "eventual consistency" or at least replacing it with eventually known consistency (the distinction being that with eventually known consistency you can query an API to ask if updates have propagated yet) and adding support for running FreeBSD on EC2 (I've signed an NDA, so I can't say much here except that we're working on it) are at the top of my list, closely followed by expanding the Flexible Payments Service to Canada. For all its quirks and limitations, though, Amazon Web Services is a great platform which I'd recommend to anyone interested in building an online service.

Posted at 2008-12-14 02:10