AWS signature version 1 is insecure

The important bit first: If you are making Query (aka REST) requests to Amazon SimpleDB, to Amazon Elastic Compute Cloud (EC2), or to Amazon Simple Queue Service (SQS) over HTTP, and there is any way for an attacker to provide you with data which you use to construct your request, switch to HTTPS or start using AWS signature version 2 now. For example, if you allow users to add arbitrary "tags" to documents, and you use SimpleDB to store those tags, this means you. (Amazon Flexible Payments Service (FPS) and Amazon DevPay also use the same insecure signature method, but they already require the use of HTTPS. Amazon S3 and other services use different signature methods.)
I've been sitting on this blog post since May 1st, when I was reading documentation in preparation for writing the accounting code for my tarsnap online backup service and I first noticed that AWS signature version 1 was insecure; but now that the cat is out of the bag thanks to Amazon announcing the new signature version, it's time to publish the details of how their signature version 1 is broken.
AWS signature version 1 signs an HTTP query string as follows:
- Split the query string based on '&' and '=' characters into a series of key-value pairs.
- Sort the pairs based on the keys.
- Append the keys and values together, in order, to construct one big string (key1 + value1 + key2 + value2 + ... ).
- Sign that string using HMAC-SHA1 and your secret access key.
When Amazon invented this signature scheme, they forgot about one of the foremost design principles relating to cryptographic signatures: Collisions are BAD! In a well-designed signature system, it should be computationally infeasible to construct two different messages which have the same signature; this prevents substitution attacks where an attacker convinces the key holder to sign a "harmless" message, and then attaches that signature to a different message. Looking at how AWS signature version 1 is computed, it's easy to see how to construct collisions: Because there are no delimiters between the keys and values, the signature for "foo=bar" is identical to the signature for "foob=ar"; moreover, the signature for "foo=bar&fooble=baz" is the same as the signature for "foo=barfooblebaz".
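As a rough illustration of the scheme and the collisions described above, here is a minimal Python sketch. It follows the four steps as listed; the plain byte-wise sort, the hex-encoded output, and the key value are assumptions made for the example, not details of Amazon's implementation.

```python
import hashlib
import hmac


def string_to_sign_v1(query: str) -> str:
    """Build the undelimited string that signature version 1 feeds to HMAC."""
    pairs = []
    for piece in query.split("&"):
        key, _, value = piece.partition("=")
        pairs.append((key, value))
    # Sort by key. A plain byte-wise sort is an assumption of this sketch.
    pairs.sort(key=lambda kv: kv[0])
    # No delimiters between keys and values: this is the flaw.
    return "".join(key + value for key, value in pairs)


def sign_v1(query: str, secret_key: bytes) -> str:
    """HMAC-SHA1 over the concatenated string (hex output chosen for readability)."""
    message = string_to_sign_v1(query).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha1).hexdigest()


if __name__ == "__main__":
    key = b"not-a-real-secret"  # hypothetical secret access key
    # Different queries, identical signatures:
    print(sign_v1("foo=bar", key) == sign_v1("foob=ar", key))
    print(sign_v1("foo=bar&fooble=baz", key) == sign_v1("foo=barfooblebaz", key))
```

Both comparisons print True, since each pair of queries collapses to the same string ("foobar" and "foobarfooblebaz" respectively) before the HMAC is computed.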
To see how this could be exploited, let's return to my earlier example of a website which allows users to add tags to documents. Suppose that each document is identified as a single item in SimpleDB, and that each document has attributes associated with it including an "owner" and one or more "tags" (SimpleDB allows multiple values to be associated with each parameter name). To add the tag X to a document, the website would normally issue the SimpleDB request
    ?Action=PutAttributes
    &Attribute.0.Name=tags&Attribute.0.Value=X
    ... other stuff

(whitespace added for clarity and to avoid page-width problems) which would be signed with the HMAC of ActionPutAttributesAttribute.0.NametagsAttribute.0.ValueX.... Now consider what happens if someone asks the website to add the tag "fooAttribute.1.NameownerAttribute.1.ReplacetrueAttribute.1.ValueDr.Evil" to a document. The website issues the SimpleDB request

    ?Action=PutAttributes
    &Attribute.0.Name=tags
    &Attribute.0.Value=fooAttribute.1.NameownerAttribute.1.ReplacetrueAttribute.1.ValueDr.Evil
    ... other stuff

which is signed with the HMAC of ActionPutAttributesAttribute.0.NametagsAttribute.0.ValuefooAttribute.1.NameownerAttribute.1.ReplacetrueAttribute.1.ValueDr.Evil... -- which would also be the signature for the request

    ?Action=PutAttributes
    &Attribute.0.Name=tags&Attribute.0.Value=foo
    &Attribute.1.Name=owner&Attribute.1.Replace=true&Attribute.1.Value=Dr.Evil
    ... other stuff

Providing that this request is sent over HTTP, Dr. Evil just has to capture this request (via network sniffing or ARP / IP / DNS / BGP attacks) and he can attach the signature from the real request to his fake request, whereupon he sets himself as the "owner" of the document in question. If the request is sent over HTTPS, in contrast, Dr. Evil won't be able to see the signature (unless he has an SSL certificate for sdb.amazonaws.com, which is unlikely) and so he won't be able to apply this attack.
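The substitution attack can be checked with a few lines of Python. This is only a sketch: the secret key is made up, the parameters are limited to the ones named in the example (the "other stuff" is omitted), and `sign_delimited` is a hypothetical delimited variant included for contrast, not Amazon's signature version 2.

```python
import hashlib
import hmac
from urllib.parse import quote

SECRET_KEY = b"not-a-real-secret"  # hypothetical secret access key


def sign_v1(params: dict) -> str:
    """Signature version 1 as described: sorted keys and values, no delimiters."""
    to_sign = "".join(k + v for k, v in sorted(params.items()))
    return hmac.new(SECRET_KEY, to_sign.encode("utf-8"), hashlib.sha1).hexdigest()


def sign_delimited(params: dict) -> str:
    """Hypothetical variant: percent-encode, then join with '=' and '&'."""
    to_sign = "&".join(
        quote(k, safe="-_.~") + "=" + quote(v, safe="-_.~")
        for k, v in sorted(params.items())
    )
    return hmac.new(SECRET_KEY, to_sign.encode("utf-8"), hashlib.sha1).hexdigest()


evil_tag = "fooAttribute.1.NameownerAttribute.1.ReplacetrueAttribute.1.ValueDr.Evil"

# What the website actually signs and sends when asked to add the tag.
honest = {
    "Action": "PutAttributes",
    "Attribute.0.Name": "tags",
    "Attribute.0.Value": evil_tag,
}

# What Dr. Evil substitutes after capturing the signature.
forged = {
    "Action": "PutAttributes",
    "Attribute.0.Name": "tags",
    "Attribute.0.Value": "foo",
    "Attribute.1.Name": "owner",
    "Attribute.1.Replace": "true",
    "Attribute.1.Value": "Dr.Evil",
}

if __name__ == "__main__":
    print("v1 signatures match:       ", sign_v1(honest) == sign_v1(forged))
    print("delimited signatures match:", sign_delimited(honest) == sign_delimited(forged))
```

The first comparison is True and the second is False: once keys and values are separated by characters which cannot appear unescaped inside them, the two requests no longer produce the same string to sign.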
I reported this issue to Amazon via an email to Jeff Barr, the "Lead Web Services Evangelist" at Amazon on May 1st, and while it took a long time -- 7.5 months -- for it to be fixed, I'm happy to say that Amazon took this issue seriously at all times, and the lengthy timeline was simply because of the large amount of work involved. Jeff forwarded my email to someone working on SimpleDB (I've been asked not to mention names), who confirmed that they agreed that this was a problem. As part of their review of my findings, Amazon's security people realized that this also affected EC2 and SQS -- in my initial investigation I had only looked at SimpleDB -- and at the beginning of July they agreed to send me their planned signature version 2 so that I could review it.
Aside from some minor clarifications to the documentation, I saw no problems with the new signature method, and at that point Amazon started the lengthy process of implementation, testing, and rolling out the new signature method. In September, they allowed me to perform some basic interoperability tests between my code (written based on the documentation) and their back-end code; this proved very useful, as it uncovered an ambiguity in the documentation. Amazon then returned to their processes -- including updating their many client libraries in order to make sure that everybody would be able to switch to signature version 2 as soon as it was announced. Now, in mid-December, they've finished updating their servers, documentation, and libraries, and the new signature is finally being announced.
I must congratulate Amazon on a highly professional response to this issue. Companies very frequently have difficulty handling externally-discovered security problems, both because of a temptation to downplay the significance of the issues, and also because of a desire to keep potentially sensitive information out of the hands of anyone outside of the company -- Intel's response to the Hyperthreading information leakage problem is a good example of both of these. The fact that Amazon not only accepted that there was a problem, but was willing to keep me informed throughout the process of fixing it -- even going to the extent of allowing me to review their intended solution, which is more than the FreeBSD Security Team usually does -- is quite exceptional.
People inevitably make mistakes from time to time. Security problems happen. But when they do happen, Amazon's response is a good example to follow.
How Tarsnap uses Amazon Web Services

Regular readers of these Daemonic Dispatches will no doubt have noticed that I have mentioned Amazon Web Services on many occasions, and it's no secret that my tarsnap online backup service is built on top of Amazon Web Services. Over the month since tarsnap reached public beta, a number of people have asked me questions about AWS and how tarsnap uses it, so I think now is a good time to provide some insight into how the tarsnap service works behind the scenes.
The tarsnap server provides a transactional key -> blob store to tarsnap clients. The keys are a fixed 33 bytes (a one-character type plus a 256-bit unique ID generated using SHA256), while blobs are an average of about 30 kB but can be as large as 256 kB. In order to create a new archive, the tarsnap client sends a "write transaction start" request, many "write data" requests, and a "commit transaction" request to the tarsnap server; deleting an archive is similar (except with a "delete transaction start" and "delete data" requests).
The tarsnap server has no concept of separate tarsnap archives or of eliminating data which is duplicated between archives; instead, it is up to the tarsnap client to recognize duplicate data and avoid storing it again, and to only delete data once no remaining archives require it (this is important for both performance and security). As a result, it is essential that the tarsnap server provide a storage system which is both transactional and strongly consistent: Without this guarantee, it would be possible for partial archives to be stored but "orphaned" in the event that the tarsnap client crashed in the middle of writing an archive (in which case an unlucky user would be stuck paying to store inaccessible and useless bits), or -- even worse -- for the tarsnap client to delete some bits which were still required by an archive, rendering that archive unreadable.
Of course, providing strong consistency comes at a price: Tarsnap sacrifices availability. If, at any point, the tarsnap server finds that it is unable to service a request, it simply drops the connection to the tarsnap client; the tarsnap client will then re-connect and retry. In practice, tarsnap client re-connections occur far more often due to client-server connections falling victim to network outages and packet loss; but the tarsnap client handles a server failure and a network outage identically (in fact, the client code isn't even aware of the distinction).
Once data reaches the tarsnap server, it is stored to Amazon S3; and the tarsnap server only acknowledges the client request once the S3 PUT has successfully completed -- that is, once the data has been stored on disks in multiple geographically diverse datacenters. However, S3 by itself doesn't provide either the consistency guarantees or the transactionality required by tarsnap. To provide these, the tarsnap server implements a log-structured filesystem on S3, but holds all of the relevant metadata on an EC2 instance. Because log entry numbers are strictly increasing, each object which the tarsnap server stores on S3 only has one possible value (if it exists at all); and because metadata is stored on EC2 (which makes it easy to provide strong consistency guarantees), this makes it possible to recognize if S3 provides "stale" data (since the only possible stale response is a 404 error).
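To make the stale-read argument concrete, here is a small Python sketch of the idea. It is not the tarsnap server code: `FlakyStore` is a hypothetical in-memory stand-in for an eventually consistent object store (not the S3 API), and the object naming scheme is invented for the example.

```python
class StaleRead(Exception):
    """The store returned 'not found' for an object the metadata says exists."""


class FlakyStore:
    """Hypothetical stand-in for an eventually consistent object store."""

    def __init__(self):
        self._objects = {}
        self._visible = set()

    def put(self, name, data):
        self._objects[name] = data  # stored, but not necessarily visible yet

    def make_visible(self, name):
        self._visible.add(name)  # models the write finishing propagation

    def get(self, name):
        # None models a 404 response.
        return self._objects[name] if name in self._visible else None


def entry_name(n):
    return f"log/{n:016d}"  # hypothetical naming scheme


class LogStore:
    """Write-once log entries in the store; key -> entry number held locally."""

    def __init__(self, store):
        self.store = store
        self.next_entry = 0
        self.index = {}  # the metadata which would live on the server

    def write(self, key, blob):
        n = self.next_entry
        self.next_entry += 1  # strictly increasing: names are never reused
        self.store.put(entry_name(n), blob)
        self.index[key] = n

    def read(self, key):
        n = self.index[key]  # raises KeyError if the key was never written
        data = self.store.get(entry_name(n))
        if data is None:
            # The object has only one possible value, so the only way the
            # store can be out of date is by claiming it does not exist.
            raise StaleRead(entry_name(n))
        return data
```

Because an object name is never written twice, a reader never has to wonder whether it received an old version; a caller which catches `StaleRead` can simply retry later.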
Now, holding (meta)data on an EC2 instance means that we must accept the possibility of the EC2 instance dying; but the use of a log-structured filesystem makes this an easy problem to solve: All of the metadata is implicit in the individual log entries, so to regenerate the metadata one must merely read the log entries back from S3. In fact, this provides a very good "safety net" to protect against any unforeseen glitches in the tarsnap service: If all else fails, I can do a complete reboot of the service by throwing away everything except the data stored on S3 and then reconstructing all of the transient state.
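Regenerating metadata from a log amounts to replaying the entries in order. The Python sketch below shows the general technique only; the JSON entry format with "op" and "key" fields is a hypothetical one chosen for the example and says nothing about how tarsnap actually encodes its log entries or transactions.

```python
import json


def rebuild_index(log_objects):
    """Replay log entries in increasing order to recover key -> entry number.

    log_objects maps entry number -> raw entry bytes, as read back from
    durable storage. The entry format here is hypothetical.
    """
    index = {}
    for n in sorted(log_objects):
        entry = json.loads(log_objects[n])
        if entry["op"] == "write":
            index[entry["key"]] = n  # later entries supersede earlier ones
        elif entry["op"] == "delete":
            index.pop(entry["key"], None)
    return index


if __name__ == "__main__":
    log = {
        0: json.dumps({"op": "write", "key": "a"}).encode(),
        1: json.dumps({"op": "write", "key": "b"}).encode(),
        2: json.dumps({"op": "delete", "key": "a"}).encode(),
        3: json.dumps({"op": "write", "key": "b"}).encode(),
    }
    print(rebuild_index(log))
```

Here the replay ends with only key "b" present, pointing at entry 3: the state is a pure function of the log, which is what makes it safe to throw the in-memory copy away.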
Naturally, this all comes at a cost, and this is part of why tarsnap's price for storage ($0.30 / GB / month) and bandwidth ($0.30 / GB) is higher than S3's prices ($0.15 / GB / month for storage, and $0.10 -- $0.17 / GB for bandwidth). However, the price difference isn't as large as it seems: In addition to the price of storage and bandwidth, S3 charges a per-request fee of $0.00001 (for PUTs) or $0.000001 (for GETs). While this seems small, it adds up: If the tarsnap client wrote data directly to S3 (ignoring, for the purpose of argument, the fact that S3 doesn't provide consistency and transactionality) the added cost of S3 PUTs would make it more expensive than writing data via the tarsnap server. Because the tarsnap server services requests from many clients at once, it is able to "bundle" multiple writes together, with the result that the cost of PUTs which I end up paying S3 is considerably lower on a per-GB basis.
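A back-of-the-envelope calculation shows why the per-request fee matters. The sketch below uses the PUT fee quoted above and the average blob size of about 30 kB; decimal units (1 kB = 1000 bytes, 1 GB = 10^9 bytes) and the 1 MB bundle size are assumptions for illustration, not figures from the tarsnap server.

```python
PUT_FEE_USD = 0.00001          # S3 fee per PUT request, as quoted above
GB = 1_000_000_000             # assuming decimal gigabytes
AVG_BLOB_BYTES = 30_000        # "about 30 kB" average blob, decimal assumed
BUNDLE_BYTES = 1_000_000       # hypothetical size of a bundled write


def put_cost_per_gb(bytes_per_put):
    """Request fees paid to upload one GB in PUTs of the given size."""
    return PUT_FEE_USD * GB / bytes_per_put


if __name__ == "__main__":
    print(f"one PUT per blob: ${put_cost_per_gb(AVG_BLOB_BYTES):.3f} / GB")
    print(f"bundled writes:   ${put_cost_per_gb(BUNDLE_BYTES):.3f} / GB")
```

Under these assumptions, one PUT per blob costs roughly $0.33 / GB in request fees alone, before S3's own bandwidth charge, while 1 MB bundles bring the request fees down to about $0.01 / GB.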
That said, I do make a few cents of profit out of the $0.30; but unlike Jungle Disk, I don't charge an up-front fee for the tarsnap client code (the Jungle Disk software costs $20) or a monthly fee for the service (Jungle Disk Plus costs a flat $1/month beyond the S3 costs), and given that I spent two years working full-time on writing all the code for tarsnap, I don't think it's unreasonable for me to add a small markup to the service in order to pay for my time. :-)
Speaking of money, I would be remiss if I failed to mention SimpleDB. While I wrote last year that one should not try to use [SimpleDB] to store any sort of accounting information, this is in fact exactly what I'm doing: Tarsnap users' current and historical account balances, along with how much storage and bandwidth they have used, are all stored in SimpleDB. The fact that SimpleDB lacks any useful consistency guarantee means that it is theoretically impossible to do this without a risk that a user's usage, or worse, a payment made, will be "lost" and not reflected in a user's current balance; but I've written the code in such a way that the tarsnap accounting code will never lose anything providing that SimpleDB always reaches consistency within 24 hours. Given that I've been told that SimpleDB usually reaches consistency within a few seconds, the danger of losing some accounting data is low enough that I'm not particularly concerned. I would never be so cavalier about someone's data, but accounting data... well, it's only money, after all.
There are still improvements which I'd like to see made to Amazon Web Services: Removing "eventual consistency" or at least replacing it with eventually known consistency (the distinction being that with eventually known consistency you can query an API to ask if updates have propagated yet) and adding support for running FreeBSD on EC2 (I've signed an NDA, so I can't say much here except that we're working on it) are at the top of my list, closely followed by expanding the Flexible Payments Service to Canada. For all its quirks and limitations, though, Amazon Web Services is a great platform which I'd recommend to anyone interested in building an online service.