Thoughts on Glacier pricing

While writing my last blog post I spent a long time looking at the pricing of Amazon's new Glacier archival storage service; or more precisely, the pricing of the "Data Retrieval" component. After much thought, I have come to the conclusion that Glacier's pricing is incomprehensible, broken, and fundamentally un-Amazonian.

Let's take "incomprehensible" first. The Glacier pricing page states (in a footnote, no less!) that:

"You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. If you choose to retrieve more than this amount of data in a month, you are charged a retrieval fee starting at $0.01 per gigabyte."

OK, that's a start (although, as it turns out, not an entirely accurate one), but "starting at" is hardly precise; let's head over to the Glacier FAQ for some more details:

"Each month you can retrieve up to 5% of the data you store in Glacier for free. This allowance is pro-rated daily. For example, in a 30 day month, you can retrieve approximately 0.17% of your stored data for free daily (5% / 30 days = 0.17% per day). This means if you store 12 terabytes of data you can retrieve 20.5 gigabytes a day for free. You are charged a retrieval fee when your retrievals exceed your daily allowance."

That's good to know, although it contradicts the pricing page: assuming the FAQ is accurate, exceeding your daily allowance is enough to run up a bill, even if you were well within your allowance for the month. Moving on:

"If, during a given month, you do exceed your daily allowance, we calculate your fee based upon the peak hourly usage from the days in which you exceeded your allowance. [...] Next we subtract your free allowance from the peak hourly retrieval for the month. To determine the amount of data you get for free, we look at the amount of data retrieved during your peak day and calculate the percentage of data that was retrieved during your peak hour. We then multiply that percentage by your free daily allowance. [...] We then subtract your free allowance from your peak usage to determine your billable peak. [...] The amount you pay is your billable peak, multiplied by the number of hours in the month, multiplied by the retrieval fee."

Leaving aside the confusion of interleaving the pricing algorithm with an example (which I have elided, along with the embarrassing lapse in arithmetic which led to 20.5 / 24 being equal to 0.82 instead of 0.85), there is a glaring lack of definition: What is the "peak hourly usage"?

In most AWS services, this is straightforward: Add up the requests you issued in each hour. Glacier is unlike most AWS services, however, because its retrieval requests "typically complete within 3-5 hours". Is an hour's usage defined to be the requests issued in that hour? The requests completed in that hour? The portions completed in that hour? In a thread on the AWS forum, an Amazonian states that you can "expect the total amount of data retrieved for a single retrieval to be spread evenly across four hours for the purposes of working out the cost of the retrieval", but it's not clear what the word "expect" means here: Is this how the pricing is defined, or merely a rule of thumb? Either way, the web forum is not where this information belongs: The Glacier pricing page ought to have enough information to allow Amazon Web Services users to figure out how much they will end up paying, just like the pages for all the other AWS services do.
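For what it's worth, here is a minimal Python sketch of the fee calculation as I read the FAQ text quoted above. It is one interpretation, not Amazon's definition: it takes per-hour retrieval totals as input (so it sidesteps the question of how a multi-hour retrieval is attributed to hours), and it assumes that the "peak day" is the day containing the peak hour. The function and parameter names are my own.

```python
def daily_allowance_gb(stored_gb, days_in_month=30, free_percent=5):
    """Free retrieval per day: 5% of stored data per month, pro-rated daily."""
    return stored_gb * free_percent / 100 / days_in_month


def retrieval_fee(stored_gb, hourly_gb_by_day, days_in_month=30, fee_per_gb=0.01):
    """Monthly retrieval fee under one reading of the FAQ.

    hourly_gb_by_day maps a day number to a list of GB retrieved in each
    hour of that day. How Glacier assigns a retrieval to hours is exactly
    the undefined part, so the caller has to decide that.
    """
    allowance = daily_allowance_gb(stored_gb, days_in_month)

    # Only days on which the daily allowance was exceeded are considered.
    over_days = [hours for hours in hourly_gb_by_day.values()
                 if sum(hours) > allowance]
    if not over_days:
        return 0.0

    # Assumption: the peak day is the over-allowance day with the largest hour.
    peak_day = max(over_days, key=max)
    peak_hour = max(peak_day)

    # The share of the daily allowance credited to the peak hour is the
    # fraction of that day's retrievals which happened in the peak hour.
    free_share = allowance * peak_hour / sum(peak_day)
    billable_peak = peak_hour - free_share

    hours_in_month = 24 * days_in_month
    return billable_peak * hours_in_month * fee_per_gb


# The FAQ's own allowance example: 12 TB stored, 30-day month.
faq_allowance = daily_allowance_gb(12 * 1024)
print(f"{faq_allowance:.2f} GB free per day")
```

Note that the result depends entirely on how the retrievals are bucketed into hours, which is the part the pricing page leaves out.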

Next up: "broken". It's clear that the purpose of this retrieval charge is to bill people for their peak usage. I don't know all the details of how Amazon Glacier is implemented, but whether it's hard drives which are spun down most of the time or tape robots with a limited number of drives, Amazon has a limited amount of read throughput. Problem is, Amazon isn't always charging for peak usage. Consider someone who has 9 TB of data stored, and downloads data on two days: On day 1, he downloads 24 GB, at a rate of exactly 1 GB per hour; on day 2, he downloads 10 GB all at once. This hypothetical user has a free daily retrieval allowance of 15 GB, so his burst of 10 GB in a single hour on day 2 is ignored; instead, he is billed based on the 1 GB/hour he downloaded on day 1 (which, after the free retrieval usage is considered, ends up as 0.375 GB/hour of billed usage, for a cost of $2.70). Of course, this is a somewhat contrived example, and in many cases the "peak hours" used for computing a customer's bill will, in fact, be their peak hours; but as far as reliably capturing peak usage goes, the Glacier pricing model falls quite neatly into the "almost but not quite" category.
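The arithmetic behind that example can be checked with a few lines of Python. This is a sketch under stated assumptions: decimal units (9 TB = 9000 GB), a 30-day month (which is what the $2.70 figure implies), and downloads counted in the hour they are listed in.

```python
# Assumptions: 9 TB = 9000 GB, a 30-day month, $0.01 per GB retrieval fee.
STORED_GB = 9000
DAYS_IN_MONTH = 30
FEE_PER_GB = 0.01

allowance = STORED_GB * 5 / 100 / DAYS_IN_MONTH   # 15 GB free per day

day1 = [1.0] * 24             # 24 GB, exactly 1 GB in every hour
day2 = [10.0] + [0.0] * 23    # 10 GB, all in a single hour

# Only days that exceed the daily allowance count toward the bill,
# so day 2 (10 GB < 15 GB) drops out entirely.
billed_days = [day for day in (day1, day2) if sum(day) > allowance]

true_peak = max(max(day1), max(day2))                 # the real peak hour
billed_peak = max(max(day) for day in billed_days)    # the peak Glacier bills

peak_day = max(billed_days, key=max)
free_share = allowance * billed_peak / sum(peak_day)  # 15 * (1/24) GB
billable_peak = billed_peak - free_share
cost = billable_peak * 24 * DAYS_IN_MONTH * FEE_PER_GB

print(f"true peak {true_peak} GB/h, billed peak {billed_peak} GB/h, "
      f"billable {billable_peak} GB/h, fee ${cost:.2f}")
```

The true peak is ten times the billed peak, yet it never enters the calculation.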

There's another problem with the Glacier retrieval charges which is far more serious, albeit rather subtle: The pricing is non-concave. Much to the annoyance of economists (who like to see strictly convex pricing models), aside from "freemium" and other limited-usage "trial" plans, all the services we use are concave. The common term for this is "volume discounts", but in the context of Amazon Web Services, what it really means is this: If you have multiple accounts and sign up for Consolidated Billing, you might end up paying less (due to moving into a higher volume / lower price tier for S3 storage or outgoing bandwidth) but you should never end up paying more. Glacier violates this property.

Consider two Amazon Glacier users, Alice and Bob, who each have 600 GB of data stored. For simplicity, we'll assume that every month has 30 days, so that each of them has a daily free retrieval allowance of 1 GB. On the first day of each month, Alice downloads 1 GB of data at 1 PM, while Bob downloads 1 GB of data at 1 PM and another 1 GB of data at 2 PM; for the rest of the month, neither of them downloads any data from Glacier. Alice's bill is simple: She's within her free retrieval allowance every day, so she just pays $6.00 for the storage. Bob is also paying $6.00 each month for storage, but he has to pay for retrieval bandwidth as well, since he downloaded 2 GB on a day when his free daily retrieval allowance was only 1 GB. His peak hour was 1 GB out of the 2 GB for the day, so he gets 0.5 GB of his free retrieval allowance attributed to that hour; the remaining 0.5 GB is his billable peak hourly retrieval rate, so he ends up paying $3.60 for retrieval, yielding a total bill of $9.60 for the month.

Between them, Alice and Bob were paying a total of $15.60 each month; but now they decide that arranging illicit liaisons cryptographically is too much work, and decide to bring their relationship into the open and get married. A few days after their honeymoon, they sign up for Amazon Consolidated Billing. They have a combined 1200 GB of data stored, so they're paying $12.00 for storage — that part hasn't changed — and they get a daily free retrieval allowance of 2 GB instead of two separate allowances of 1 GB. Their peak hour is now 2 GB out of a total of 3 GB for the day, so of their free retrieval allowance, 1.33 GB is attributed to that peak hour, leaving them with a billable peak hourly retrieval rate of 0.67 GB/hour — which works out to $4.80 of retrieval bandwidth. By signing up for Consolidated Billing, they increased their combined bill from $15.60 up to $16.80. (Epilogue: They each blame the other for the increased bill they're receiving from Amazon, leading to a breakdown in their relationship, and they get divorced a few months later. It was inevitable anyway: Cryptographers never really trust anyone.)
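Here is the Alice and Bob arithmetic as a short Python sketch. It makes the same simplifications as the story: 30-day months, $0.01 per GB for both storage and retrieval, all downloads on a single day, and each download counted in the hour it was issued. The function name is mine.

```python
def monthly_bill(stored_gb, peak_day_hours, days=30,
                 storage_per_gb=0.01, fee_per_gb=0.01):
    """Storage plus retrieval for a month with retrievals on one day only."""
    # 5% written as 5/100 so the "exactly at the allowance" case stays exact.
    allowance = stored_gb * 5 / 100 / days
    storage = stored_gb * storage_per_gb

    total = sum(peak_day_hours)
    if total <= allowance:
        return storage              # within the free daily allowance

    peak = max(peak_day_hours)
    # Part of the allowance is credited to the peak hour, in proportion
    # to that hour's share of the day's retrievals.
    billable_peak = peak - allowance * peak / total
    return storage + billable_peak * 24 * days * fee_per_gb


alice = monthly_bill(600, [1])        # 1 GB at 1 PM
bob = monthly_bill(600, [1, 1])       # 1 GB at 1 PM, 1 GB at 2 PM
joint = monthly_bill(1200, [2, 1])    # combined: 2 GB at 1 PM, 1 GB at 2 PM

print(f"separate: ${alice + bob:.2f}, consolidated: ${joint:.2f}")
```

The combined account has twice the allowance, but the combined peak hour also doubles while the day's total only grows by half, so a larger share of the usage lands in the billable peak.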

Finding the worst case is an interesting optimization problem which I leave as an exercise to the reader; suffice it to say that if Alice and Bob can choose arbitrary data retrieval patterns, they can arrange that Consolidated Billing will cost them 76.8% more than being each billed separately. As a pricing model, I'm calling this broken.

Finally, "fundamentally un-Amazonian". It may seem presumptuous for me to tell Amazon what is or is not un-Amazonian, but I do have some justification: I've been using Amazon Web Services (and evangelizing it within my corner of the open source and startup worlds) since 2006, which is a fair bit longer than most Amazon Web Services employees. More importantly, however, I remember a line which might well have appeared in some form in every single Amazon Web Services sales pitch in history:

"With traditional hosting, you have to provision to meet peak demand. Amazon Web Services saves you money by letting you rapidly scale up and down to match your needs, without paying for excess unused capacity."

Glacier's data retrieval pricing throws that out the window. Unlike every other service Amazon provides, with Glacier you're not paying for your net usage throughout each month; instead, you're paying for your peak hour, and once that peak hour has come and gone, you're still paying for that peak until the end of the month, no matter how low your usage might fall.

Glacier is a fantastic service. The pricing for data retrieval sucks. You guys can do better.

Posted at 2012-09-04 16:45

Why Tarsnap doesn't use Glacier

Two weeks ago, Amazon announced its new Glacier storage service, providing "archival" storage for as little as $0.01 per GB per month. Since I run an online backup service, this is naturally of interest to me, and in the day following Amazon's announcement I had about two dozen tweets and emails asking me if Tarsnap would be using Glacier. The answer? No. Not yet. But maybe some day in the future — if people want to use what a Glacierized Tarsnap would end up being.

On the surface, Tarsnap sounds like the perfect use case for Glacier. Out of every TB of data stored on Tarsnap, in any given month approximately 100 GB of data will be deleted, while only 33 GB of data will be downloaded; in other words, Tarsnap is very much a "write once, read maybe" storage system. Tarsnap's largest operational expense is S3 storage, which is roughly ten times as expensive as Glacier (most of the rest is the EC2 instances which run the Tarsnap server code) and a large majority of Tarsnap revenue is the per-GB storage pricing. If I could switch Tarsnap over to a cheaper storage back-end, I could correspondingly reduce the prices I charge Tarsnap users, which — in spite of some of my friends advising me that Tarsnap is too cheap already — is definitely something I'd like to do. The downside to Glacier, and the reason that it is much cheaper than S3, is that retrieval is slow — you have to request data and then come back four hours later — but there are many Tarsnap users who would be willing to wait a few hours extra to retrieve their backups if it meant they could store ten times as much data for the same price.

As usual, the devil is in the details; in this case, there's one detail which makes things particularly devilish: Deduplication. Taking my personal laptop as an example: Every hour, Tarsnap generates an archive of 38 GB of files; I currently have about 1500 such archives stored. Instead of uploading the entire 38 GB — which would require a 100 Mbps uplink, far beyond what Canadian residential ISPs provide — Tarsnap splits this 38 GB into somewhere around 700,000 blocks, and for each of these blocks, Tarsnap checks if the data was uploaded as part of an earlier archive. Typically, there's around 300 new blocks which need to be uploaded; the rest are simply handled by storing pointers to the previous blocks and incrementing reference counters. (The reference counters are needed so that when an archive is deleted, Tarsnap knows which blocks are still being used by other archives.)
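To illustrate the idea (not Tarsnap's actual code), here is a toy Python sketch of block-level deduplication with reference counts. The assumptions are mine: fixed-size blocks, SHA-256 as the block identifier, and an in-memory dictionary standing in for the storage back-end.

```python
import hashlib
from collections import Counter


class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self):
        self.blocks = {}            # block id -> block contents
        self.refcount = Counter()   # block id -> references from archives

    def create_archive(self, data, block_size=4):
        """Return (list of block ids, number of blocks actually uploaded)."""
        ids, uploaded = [], 0
        for offset in range(0, len(data), block_size):
            chunk = data[offset:offset + block_size]
            block_id = hashlib.sha256(chunk).hexdigest()
            if block_id not in self.blocks:
                # New data: this is the only case that costs upload bandwidth.
                self.blocks[block_id] = chunk
                uploaded += 1
            # Known or new, the archive now holds one more reference.
            self.refcount[block_id] += 1
            ids.append(block_id)
        return ids, uploaded

    def delete_archive(self, ids):
        """Drop an archive; free blocks no other archive still references."""
        for block_id in ids:
            self.refcount[block_id] -= 1
            if self.refcount[block_id] == 0:
                del self.blocks[block_id]
                del self.refcount[block_id]


store = BlockStore()
first, new_in_first = store.create_archive(b"aaaabbbbcccc")
second, new_in_second = store.create_archive(b"aaaabbbbdddd")
store.delete_archive(first)
print(new_in_first, new_in_second, len(store.blocks))
```

The second archive uploads only the one block that changed, and deleting the first archive frees only the block that nothing else references.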

As a result, extracting an archive isn't simply a matter of downloading a single 38 GB blob; it involves making 700,000 separate block read requests. Retrieving data from Glacier isn't cheap: Like S3, Glacier has a per-request fee; but while S3's per-GET fee is $1 per million requests, Glacier's per-RETRIEVAL fee is $50 per million requests. I would pay $5.13 to extract that archive from Tarsnap right now; if it were stored on Glacier, it would cost Tarsnap $35 just for the Glacier retrievals alone, and Tarsnap's pricing for downloads would have to increase dramatically as a result.
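The request-fee comparison is simple enough to check directly. This sketch uses only the figures quoted above (700,000 blocks, $1 per million S3 GETs, $50 per million Glacier retrievals) and assumes one request per block.

```python
BLOCKS = 700_000                      # block reads needed for one archive
S3_GET_PER_MILLION = 1.00             # dollars per million S3 GET requests
GLACIER_RETRIEVAL_PER_MILLION = 50.00 # dollars per million Glacier retrievals

# Assumption: one request per block, with no batching of any kind.
s3_request_cost = BLOCKS * S3_GET_PER_MILLION / 1_000_000
glacier_request_cost = BLOCKS * GLACIER_RETRIEVAL_PER_MILLION / 1_000_000

print(f"S3: ${s3_request_cost:.2f}, Glacier: ${glacier_request_cost:.2f}")
```

That is request fees alone, before any of the peak-based retrieval charges are counted.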

It gets worse. Tarsnap doesn't merely deduplicate blocks of data; it also deduplicates blocks containing lists of blocks, and blocks containing lists of blocks of lists of blocks. This is important for reducing Tarsnap's bandwidth and storage usage — the amount of data Tarsnap uploads from my laptop each hour is less than what it would take just to list the 700,000 blocks which make up an hourly 38 GB archive — but it makes Glacier's four hour round trip from requesting data to being able to read it much worse, since you would need to read some blocks — and wait four hours — before knowing which other blocks you need to read. Clearly reading Tarsnap archives directly out of Glacier is not feasible.
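A toy model shows why the indirection hurts. In this Python sketch a block is either raw data (bytes) or a list of the blocks it points to; the nesting and the four-hour figure follow the description above, but the representation is purely illustrative and says nothing about Tarsnap's real on-disk format.

```python
HOURS_PER_ROUND = 4   # the Glacier request-to-read delay used in this post


def retrieval_rounds(block):
    """Sequential fetch rounds needed to read a block and everything under it.

    A block of bytes is data and needs one round. A list is a block that
    names other blocks: it must be fetched and read before its children
    can even be requested, so it adds one round on top of the deepest child.
    """
    if isinstance(block, bytes):
        return 1
    return 1 + max((retrieval_rounds(child) for child in block), default=0)


# A list of lists of data blocks: two levels of indirection above the data.
archive_root = [[b"block-a", b"block-b"], [b"block-c"]]

rounds = retrieval_rounds(archive_root)
print(f"{rounds} rounds, about {rounds * HOURS_PER_ROUND} hours")
```

Blocks within one level can be requested in parallel, so the delay grows with the depth of the structure rather than the number of blocks, but each extra level costs another full wait.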

But maybe reading archives which are stored in Glacier is something we can avoid. After all, when you need to restore your backups, you usually want your most recent backup. Sure, there are cases when you realize that you need an important file which you deleted two months ago, and that's why it's important to keep some older backups as well; but could Tarsnap save money by offloading those older rarely-needed backups to Glacier? Here too Tarsnap's deduplication gets in the way: Of the aforementioned 700,000 blocks comprising my latest hourly backup, only a few thousand were uploaded earlier today; the vast majority were uploaded weeks or months ago. The Tarsnap server can't simply offload "old" data to Glacier, since many of the oldest blocks of data are still included in the newest archives — and it's the most recent archives which are likely to be retrieved, not just the most recent blocks. Backup systems which work with a "full plus incrementals" approach have an advantage here: Since extracting a recent archive is never going to need data from prior to the last complete backup, older archives can be placed into "cold storage"... of course, that is counterbalanced by the fact that such a system will end up performing many "full" backups over its lifetime, dramatically increasing the amount of bandwidth and storage space used.

So it isn't feasible for the Tarsnap server to move old archives out from "fast" S3 storage to "slow" Glacier storage; but what about the Tarsnap client? Could it potentially tell the server "here's a list of blocks I don't expect to need any time soon"? It turns out that a problem arises there too — not with archive retrieval, but instead with archive creation. Consider what happens if a block of data is in Glacier, but Tarsnap's deduplication code decides that block is needed in a new archive. If the block is referenced in its location in Glacier, you would have a situation where immediately after uploading an archive, you have to wait four hours before you can download it again. Could the Tarsnap client re-upload the block so that it can be stored in S3 without waiting for it to be fetched out of Glacier? Yes, at the expense of using extra bandwidth — but if the Tarsnap server code accepted that, it would be allowing a block to be overwritten by new data, which would violate Tarsnap's security requirement that archives are immutable until they are deleted.

There is only one feasible way I can see for Tarsnap to use Glacier; in a sense, it's the simplest and most obvious one. Rather than operating at a fine-grained level with some archives being in "cold storage" and others being "warm", Tarsnap could support "glaciation" at a per-machine level. If a machine was "glaciated", it would be possible to create more archives (after all, glaciers can grow by having more snow fall on top of them!) and the storage would cost significantly less than it currently does (probably around 3-5 cents per GB per month), but you would not be able to read or delete any data. To do either of those, you would need to "deglaciate" the machine — which would take several hours and cost somewhere around 15-20 cents per GB of stored data — after which all the stored data would be back in S3 and accessible for normal random accesses. You would then be charged Tarsnap's normal storage pricing until you "glaciated" the machine again (which would most likely be free of charge).

Would this be a useful model? I'm not sure. It's not a model which is going to happen in the near future, since migrating data between S3 and Glacier in this way would involve a significant amount of careful design and coding; but if enough people are interested, it's a goal I can move towards. So tell me, dear readers: If Tarsnap allowed you to "glaciate" a machine, temporarily losing the ability to read or delete archives, but significantly reducing its monthly storage bill, would you do it?

Posted at 2012-09-04 16:30
