FreeBSD on EC2

FreeBSD doesn't run on Amazon's Elastic Compute Cloud service right now. I want to change this. I need to talk to other wannabe FreeBSD-on-EC2 users. If this describes you, please contact me.

Posted at 2008-10-26 00:40 | Permanent link | Comments

Hacking the Amazon S3 SLA

The Simple Storage Service (S3) provided by Amazon comes with a Service Level Agreement: If the Monthly Uptime Percentage is between 99% and 99.9%, you get a 10% refund; if the Monthly Uptime Percentage is below 99%, you get a 25% refund. The Monthly Uptime Percentage is computed in a fairly straightforward manner: Divide the month into 5-minute intervals and compute the Error Rate (failed requests divided by total requests, treating 0/0 as 0) for each interval; compute the average Error Rate over all the 5-minute intervals in the month; and subtract this value from 100%.
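
In code, that calculation is simple enough; here's a minimal sketch in Python (my own illustration, not Amazon's implementation), where intervals is a hypothetical list of (failed requests, total requests) pairs, one per 5-minute interval:

    def monthly_uptime_percentage(intervals):
        """Compute the S3 SLA Monthly Uptime Percentage.

        intervals is a list of (failed_requests, total_requests) pairs,
        one per 5-minute interval in the month; 0/0 is treated as 0.
        """
        error_rates = [failed / total if total else 0.0
                       for failed, total in intervals]
        return 100.0 * (1.0 - sum(error_rates) / len(error_rates))

    # A 30-day month has 30 * 24 * 12 = 8640 five-minute intervals, e.g.:
    print(monthly_uptime_percentage([(0, 100)] * 8639 + [(5, 100)]))  # about 99.9994%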

If the probability of a request failing during the nth 5-minute interval is p(n), and the number of requests issued during the nth interval is determined solely by p(n), the expected value of the Monthly Uptime Percentage is 100% minus the average value of p(n) over all the intervals; put another way, you can't cheat by waiting for a high p(n) and then quickly running up the failure count by issuing lots and lots of requests. However, this uncheatability applies only if the number of requests issued is independent of the success or failure of individual requests; if we can see whether one request succeeded before issuing the next one, we can cheat the SLA -- quite extravagantly, in fact.

Consider a hypothetical attacker who has a very large amount of data stored on S3, but doesn't need to access it (perhaps said data is a backup of his world domination plans). Consequently, he doesn't have to issue any requests; but if issuing requests will allow him to get a refund via the SLA, said refund will be large enough that the cost of performing the requests doesn't matter in comparison. Furthermore, assume for the purpose of simplicity that S3 has a constant failure rate p for the entire month, and the attacker knows what p is (if necessary, he can figure it out quickly by sending a very large number of requests). Finally, assume that the attacker can issue as many requests as he likes within each 5-minute interval, and that he can see if each request succeeds or fails before deciding whether to issue another request. What should this attacker do to maximize the chance that he will get a refund?

Well, at the beginning of each 5-minute interval, he should issue a request. There's no point not doing this -- if he doesn't issue any requests, he'll get an Error Rate of zero for the interval. Suppose that request fails; should he issue any more requests? Absolutely not: If his first request fails, the attacker has achieved a 100% Error Rate for that 5-minute interval. Suppose the request succeeds; should he issue any more requests? Absolutely: Zero failed requests out of one request means an Error Rate of zero, and issuing more requests can't make that any worse. In fact, no matter how many requests he has made, if he hasn't seen any failures yet, he should issue more requests -- they can't decrease the Error Rate, and they might increase it. This gives the attacker a very simple strategy: Keep issuing requests until at least one failure occurs.
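
As a sanity check, here's a quick Monte Carlo sketch (my own, not part of the SLA) of that stop-at-the-first-failure strategy, assuming a constant per-request failure probability p:

    import random

    def error_rate_stop_after_first_failure(p):
        """Issue requests until the first failure; return the interval's Error Rate."""
        requests = 0
        while True:
            requests += 1
            if random.random() < p:     # this request failed
                return 1.0 / requests   # one failure out of `requests` requests

    # With p = 1%, the average over many intervals comes out near 4.6%,
    # not the naive 1%.
    trials = 20000
    print(sum(error_rate_stop_after_first_failure(0.01) for _ in range(trials)) / trials)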

He can do better than this. Suppose he sees his first error after making 2/p requests; if he stops at that point, he will have an Error Rate of p/2 for the interval, well below the p he could expect by simply issuing a very large number of requests. The value i - pj for increasing values of j, where i and j are the number of failed requests and the total number of requests respectively, behaves as a random walk, and it is well known that a balanced random walk takes positive values infinitely many times with probability one; so no matter how many successful requests the attacker has seen, if he issues enough requests he will eventually (with probability one) be able to stop with an Error Rate greater than p. This gives the attacker a better strategy: Keep issuing requests until the Error Rate i/j exceeds p.
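
A sketch of this strategy, again assuming a constant failure probability p; since the probability-one return of the random walk can take an arbitrarily long time, the sketch caps the number of requests per interval, which biases it slightly low:

    import random

    def error_rate_stop_when_above_p(p, max_requests=10**6):
        """Keep issuing requests until the running Error Rate i/j exceeds p.

        The random-walk argument says this happens eventually with probability
        one, but possibly only after a very long run, so the cap keeps the
        simulation finite (and makes the result a slight underestimate).
        """
        failures = 0
        for requests in range(1, max_requests + 1):
            if random.random() < p:
                failures += 1
            if failures > p * requests:
                return failures / requests
        return failures / max_requests

    # Averaged over many intervals with p = 1%, this lands close to the
    # "p ln(1/p) + 0.191 p" figure in the table below, i.e. about 4.8%.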

Can the attacker do any better than this? Yes. Suppose p = 0.01 and the attacker's 99th request fails. He now has i/j = 0.0101 > p; but suppose he decides to continue making requests. If he gets lucky and encounters another failure within the next 98 requests -- an event with probability 63% -- he will increase his Error Rate. In fact, he has a 50% chance of encountering his second failure within the next 69 requests -- which would give him an Error Rate of 0.0119 or more, more than balancing the less-than-50% chance that he ends up with an Error Rate below 0.0101. With a bit of thought, we can see that for any p there must be a sequence x[1], x[2], x[3], ... such that the optimal strategy is to stop issuing requests if the ith failure is encountered when fewer than x[i] total requests have been made.

How can we determine these x[i]? With great difficulty. We can, however, approximate them by considering a continuous limit. Defining F(i, j) as the expected Error Rate if there are currently i failures out of j requests and the optimal strategy is followed, we can define G[i](x) for positive x as the limit, as p tends to zero, of F(i, x/p) * exp(-x) / p, and define X[i] as the limit as p tends to zero of x[i] / p. (Yes, the limits do exist.) This provides us with a series of differential equations: For each i, G[i](x) = exp(-x) * i/x for x <= X[i], and G'[i](x) = - G[i+1](x) for x > X[i].

These differential equations don't appear to have any closed-form symbolic solution, but they can be solved numerically. The first few values of X[i] are approximately X[1] = 0.507, X[2] = 1.197, X[3] = 1.950, X[4] = 2.740, X[5] = 3.554, X[6] = 4.386, X[7] = 5.232, X[8] = 6.089, X[9] = 6.956, X[10] = 7.830. Asymptotically, X[i] appears to be approximately i - sqrt(i/2); this shouldn't be surprising, since with probability 1 - epsilon the attacker will encounter i + O(sqrt(i)) failures if he performs i/p requests (where the implicit constant depends on epsilon). Given these values, the attacker's limiting strategy for small p is to stop issuing requests as soon as he has seen i failures out of fewer than X[i] / p requests within the current 5-minute interval.
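
The resulting decision rule is easy to implement; here's a sketch (my own) using the X[i] values above and falling back on the asymptotic approximation for larger i:

    import math
    import random

    # Thresholds X[i] from the numerical solution above; beyond i = 10,
    # fall back on the asymptotic approximation X[i] ~ i - sqrt(i/2).
    X = [0.507, 1.197, 1.950, 2.740, 3.554, 4.386, 5.232, 6.089, 6.956, 7.830]

    def threshold(i):
        return X[i - 1] if i <= len(X) else i - math.sqrt(i / 2)

    def error_rate_optimal(p, max_requests=10**6):
        """Stop as soon as the i-th failure arrives within fewer than X[i]/p requests."""
        failures = 0
        for requests in range(1, max_requests + 1):
            if random.random() < p:
                failures += 1
                if requests < threshold(failures) / p:
                    return failures / requests
        return failures / max_requests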

How large an expected Error Rate do these strategies produce? Much larger than the naive value of p:
Strategy                             Expected Error Rate                   Minimum p (25% refund)   Minimum p (10% refund)
One request per 5-minute interval    p                                     1%                       0.1%
Stop after 1 failure                 p ln(1/p) / (1-p)                     0.154%                   0.0110%
Stop when Error Rate > p             p ln(1/p) + 0.191 p (approximately)   0.149%                   0.0107%
Optimal strategy                     p ln(1/p) + 0.292 p (approximately)   0.147%                   0.0106%
Put another way, in order to avoid giving a refund to our hypothetical attacker, Amazon has to maintain a failure rate almost 10 times lower than what might be naively assumed from reading the SLA.

Speaking of failure rates, what are the request failure rates for S3? The failure rates for GETs and DELETEs are very low -- on the order of one failure per 100,000 requests according to my logs. PUTs, however, are far more prone to failure: In the last six months, I've seen PUT failure rates of 0.030%, 0.449%, 0.089%, 0.085%, 0.192%, and 0.178% respectively. None of these are below the 0.0106% rate needed to avoid paying a 10% refund to our hypothetical attacker, and in three of these months, the error rate was high enough that our attacker would have received a 25% refund.
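
Checking those monthly PUT failure rates against the optimal-attacker thresholds from the table above takes only a few lines:

    # Observed monthly PUT failure rates (as fractions) versus the thresholds
    # above: 0.147% for a 25% refund, 0.0106% for a 10% refund.
    put_failure_rates = [0.00030, 0.00449, 0.00089, 0.00085, 0.00192, 0.00178]
    for rate in put_failure_rates:
        refund = "25%" if rate >= 0.00147 else "10%" if rate >= 0.000106 else "none"
        print(f"{rate:.3%} -> {refund} refund")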

S3 is a great service... but if Amazon wants to avoid paying refunds under the SLA, they've got some work to do.

Posted at 2008-10-23 02:15 | Permanent link | Comments

Canadian election results trivia

After the last Canadian federal election, I posted here with some election results trivia; since Elections Canada has again helpfully provided the (preliminary) results in CSV format, I've now done the same for the 40th Canadian federal election:

Note to media and blogs: Feel free to republish the above (in part or in whole), giving credit to Colin Percival or linking to this post.

UPDATE: Statistics concerning the fraction of the popular vote won by individual candidates added on 2008-10-16.
UPDATE: Some statistics adjusted to replace preliminary results with validated results on 2008-10-17, 2008-10-18, and 2008-10-23.

Posted at 2008-10-15 14:45 | Permanent link | Comments

Canadian Federal election polling

In December 2005, during the last Canadian federal election, I wrote about the discrepancies between polls and how they could be partly explained by different notions of what the population of Quebec was. In the 2008 election, there are even larger polling discrepancies -- on October 6th, Nanos reported 6% support for the Green party while Harris-Decima reported 13% support; and once again, certain pollsters consistently disagree with their colleagues.

If you take all of the polls conducted during the 2008 election campaign and plot them according to their completion dates, you'll get a graph which looks like this:

(Different pollsters are represented by different letters: A = Angus Reid, E = Ekos, H = Harris-Decima, I = Ipsos, N = Nanos, S = Segma, and T = Strategic Counsel. The different colours represent Conservative (blue), Liberal (red), NDP (orange), Bloc Quebecois (light blue), and Green (green) party support respectively.)

Some points on this graph stand out very clearly amidst the noise: For most of the campaign period, Nanos reports much higher Liberal support and lower Green support than other pollsters; less noticeably, Angus Reid reports higher Conservative support and lower Green support. I threw all of these numbers into a blender, and obtained the following table of pollster deviations from the consensus:
Pollster            Conservative   Liberal   NDP     Green    Bloc Quebecois
Angus Reid          +1.3%          -1.1%     +1.4%   -1.8%    +0.1%
Ekos                -1.1%          -0.6%     +0.5%   +1.2%    +0.2%
Harris-Decima       -0.7%          +0.1%     -1.1%   +0.8%    -0.4%
Ipsos               +0.8%          +0.4%     -1.9%   +0.2%    -0.1%
Nanos               -0.4%          +3.2%     +0.1%   -1.9%    n/a
Segma               +2.5%          -2.5%     n/a     -0.4%    +0.1%
Strategic Counsel   +2.4%          -1.0%     -1.0%   +0.6%    n/a

Note that these values are the difference between each individual pollster and the polling consensus, not the difference between each pollster and the correct values. It's possible that Nanos and Angus Reid were correct, and the Green party really did have far less support than all the other pollsters thought -- but until the election results come in, all we can do is compare polls to each other.
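
For anyone who wants to build this sort of table themselves, here is a rough sketch of one way to estimate house effects (my own approach, not necessarily the blender used above): compare each poll against the average of the other pollsters' polls completed within a few days, then average those deviations per pollster. The polls data structure here is hypothetical.

    from collections import defaultdict

    def house_effects(polls, window=3):
        """polls: a list of (pollster, completion_day, {party: support}) tuples.

        For each poll and party, take the deviation from the average of all
        other pollsters' polls completed within `window` days, then average
        those deviations per pollster."""
        deviations = defaultdict(lambda: defaultdict(list))
        for pollster, day, numbers in polls:
            for party, value in numbers.items():
                others = [n[party] for (other, d, n) in polls
                          if other != pollster and abs(d - day) <= window
                          and party in n]
                if others:
                    deviations[pollster][party].append(value - sum(others) / len(others))
        return {pollster: {party: sum(d) / len(d) for party, d in parties.items()}
                for pollster, parties in deviations.items()}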

Once we know how much of a house effect each pollster has compared to the average pollster, it only makes sense to look at a graph of "corrected" polls -- and to draw a trend line for each party:

After applying these adjustments, there are no longer wild discrepancies between polls conducted at the same time. There is some noise, of course, but that is unavoidable with polls which (even with perfect sampling) are only accurate to within three percentage points, 19 times out of 20.
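
That "three percent, 19 times out of 20" figure is just the standard 95% margin of error for a proportion near 50%, assuming a typical sample of roughly 1,000 respondents (an assumption on my part; sample sizes vary by pollster):

    import math

    # 95% margin of error for a proportion near 50%, with a simple random
    # sample of n respondents; n = 1000 is an assumed, typical poll size.
    n = 1000
    moe = 1.96 * math.sqrt(0.5 * 0.5 / n)
    print(f"{moe:.1%}")   # about 3.1%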

Why do different pollsters consistently get results which are a few percentage points apart? There are several reasons, but the largest factor is probably differences in how questions are asked. Harris-Decima asks people to choose between "the Liberal Party candidate, the Conservative Party candidate, etc.", while most pollsters ask people to choose which party they would be most likely to support -- a significant difference given that many Canadians vote for a local candidate they like even if he or she does not belong to their favourite party.

Nanos, however, takes a prize for unusual polling methodology: First, instead of asking for whom someone will vote, they ask people to name their top two choices -- and then arbitrarily designate the first named party as the voter's preference. Second, instead of reading out a list of parties like all the other pollsters do, Nanos uses "top of head" polls: No parties are listed, and it's up to the person being polled to remember what the parties are. It's not hard to imagine that an environmentally-conscious voter would answer "Liberal and Green" when polled by Nanos, but respond "Green" when another pollster asks for a single choice; while a voter who is only marginally aware of the election might support the Green party when prompted, but not even remember that it exists if Nanos phones.

That said, it could be argued that for all the shortcomings of Nanos' approach as polling methodology, it's actually very good as a prediction method: Top-of-head polls effectively underweight responses from people who can't remember the names of all the parties, but those people are less likely to vote anyway; and if a polling method produces results which are biased towards the major parties which people have as their second choices, it might just match the almost inevitable shift on election day when people decide to "vote strategically" instead of "wasting their vote".

What do I expect to see when the final polls close today? 35% Conservative, 28% Liberal, 19% NDP, 7% Green, and 10% Bloc. And the smallest fraction of the popular vote to ever elect a Canadian government.

Four hours from now, I'll see if I'm right -- but now, I'm off to cast my ballot.

Posted at 2008-10-14 22:35 | Permanent link | Comments

UBC Election Stock Market

On September 8th, I invested $25 into the Election Stock Market being run by the University of British Columbia for the 2008 Canadian Federal election. When the market closed, at 10PM PDT on Monday evening, I had increased that to $88.70 (I decided to end up with a cash-only position).

This was in fact my second foray into the UBC ESM -- I also participated in the market run for the 2006 Federal election, when I managed to turn $25 into $93.08. At 272.3%, that was apparently the largest relative gain, and I suspect that my 254.8% gain this year will once again be the largest relative gain (although I'm sure there are many people who will have made larger absolute gains once the final election results arrive and the market is liquidated). I had hoped to do better in this market than in the last one, but this election campaign wasn't anywhere near as interesting as the 2005/06 campaign -- and it's very hard to make money in a market if nothing ever moves.

So what's my strategy? Automatic trading controlled by a script which checks the market conditions every minute. Lots and lots of automatic trading. I don't have statistics available for the 2006 market, but during the 2008 market I (or rather, my scripts) made 2577 trades worth a total of $21077.73; my average gain per share traded was slightly less than a tenth of a cent.
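
The scripts themselves aren't reproduced here; purely as an illustration, the general shape of a check-every-minute trading loop might look like this, where fetch_quotes, place_order, and strategy are hypothetical placeholders rather than a real ESM interface:

    import time

    def run_trader(fetch_quotes, place_order, strategy, interval=60):
        """Poll the market every `interval` seconds and submit whatever orders
        the strategy decides on.  All three callables are placeholders for an
        interface not described in this post."""
        while True:
            quotes = fetch_quotes()
            for order in strategy(quotes):
                place_order(order)
            time.sleep(interval)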

Depending on the results of today's election, there may be another election -- and another election stock market -- before long. I look forward to seeing how I -- and my scripts -- do next time.

Posted at 2008-10-14 06:05 | Permanent link | Comments
