On Tuesday, I wrote about how Canadian poll aggregators suck, pointing out in particular the common ways that their methodologies fail. At the end of the post, I said that we could do better; here are the details of how.

The fundamental realization is that our goal is not to compute a polling average; rather, our goal is to use the available data to compute a best estimate of voting intentions. Using noisy data to compute an estimate of a true value? This sounds like a problem for regression analysis! So let's start writing down the things which we know are approximately true:

1. Published opinion polls are the average of the support levels observed by the pollster in question across the days when the poll was conducted, to within the level of rounding used by the pollster. For example, Abacus Data conducted a poll between October 8th and October 10th for which they reported 32% support for the Liberal Party of Canada (hereafter "LPC" for succinctness); this tells us that the average of "Abacus Data LPC 2019-10-08", "Abacus Data LPC 2019-10-09", and "Abacus Data LPC 2019-10-10" is somewhere between 31.5 and 32.5. One equation with three unknowns; not very useful by itself, but it's a start.

For ease of manipulation, I treat "somewhere between 31.5 and 32.5" as "32 with a standard deviation of 1/sqrt(12)"; this replaces a uniform likelihood with a bell curve, which is technically wrong but for practical purposes works just fine. (The value 1/sqrt(12) is the standard deviation of the uniform likelihood distribution in question.)
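
As a quick check of that figure, here is a minimal Python sketch (the variable names are mine and not taken from the actual aggregator) which computes the standard deviation of a uniform distribution one percentage point wide and compares it against a simulated value:

```python
import math
import random

# Width of the rounding interval, in percentage points (31.5 to 32.5).
width = 1.0

# Standard deviation of a uniform distribution of that width: width / sqrt(12).
rounding_sd = width / math.sqrt(12)

# Sanity check by simulation: draw values uniformly within the interval.
rng = random.Random(0)
samples = [rng.uniform(31.5, 32.5) for _ in range(200_000)]
mean = sum(samples) / len(samples)
simulated_sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))

print(f"analytic: {rounding_sd:.4f}, simulated: {simulated_sd:.4f}")
```

Both numbers should come out at roughly 0.29 percentage points, which is small next to the sampling error on a typical poll.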

For pollsters who conduct "rolling" polls (Nanos and Mainstreet), we apply the same method — but unlike with other pollsters, this provides us with multiple equations involving the same days of polling.
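
To make the "multiple equations involving the same days" point concrete, here is a rough sketch of how such averaging equations could be laid out as rows of a matrix. The function name `poll_rows`, the tuple layout, and the poll values are all hypothetical; this is an illustration rather than the code I actually run.

```python
import numpy as np

def poll_rows(polls, n_days):
    """Build one averaging equation per published poll.

    `polls` is a list of (first_day, last_day, reported_value) tuples with
    0-based, inclusive day indices. Row i of A averages the pollster's
    unknown daily values over the days poll i was in the field.
    """
    A = np.zeros((len(polls), n_days))
    b = np.zeros(len(polls))
    for i, (first, last, value) in enumerate(polls):
        A[i, first:last + 1] = 1.0 / (last - first + 1)
        b[i] = value
    return A, b

# Hypothetical 3-day rolling polls released on three consecutive days.
rolling = [(0, 2, 32.0), (1, 3, 33.0), (2, 4, 31.0)]
A, b = poll_rows(rolling, n_days=5)
print(A)
```

Each row has three entries of 1/3, and adjacent rows overlap in two columns: that overlap is what lets rolling polls say something about individual days.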

2. The observed level of support by a pollster for a particular party on a particular day is equal to the actual level of support for that party on that day, plus the pollster's house effect for that party, within a margin of error determined by the party's level of support, the number of people polled, and the pollster's non-sampling noise. I make two assumptions here: pollster house effects are constant (i.e., pollsters don't dramatically change methodologies); and multi-day polls query the same number of people per day.

For the aforementioned case of the Abacus Data October 8-10 poll of 3000 people, this tells me that "Abacus Data LPC 2019-10-08" is approximately equal to "LPC 2019-10-08" + "Abacus Data LPC", plus or minus an error based on how accurately Abacus Data can determine support for a party with ~32% support by polling 1000 people. (Why does it matter that they had approximately 32% support? Because margins of error get smaller the further you get away from 50% support. The standard deviation on a measurement of something with likelihood p is sqrt(p * (1-p) / N).)
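
The formula is easy to try out. This small sketch (the function name `sampling_sd` is mine) evaluates it for the Abacus example, assuming the 3000 respondents were split evenly across the three days, and for two other support levels to show how the error shrinks away from 50%:

```python
import math

def sampling_sd(p, n):
    """Standard deviation of an observed proportion p from a sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)

# Abacus example: 3000 respondents over 3 days, assumed to be 1000 per day.
daily_n = 3000 // 3

for support in (0.50, 0.32, 0.05):
    sd_points = 100 * sampling_sd(support, daily_n)
    print(f"{support:.0%} support: +/- {sd_points:.2f} points (one standard deviation)")
```

Note that these are single standard deviations; the "margin of error" which pollsters usually quote is about twice as large.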

3. The true level of support for a particular party on a particular day is approximately the same as the support for that party on the previous day. Obviously not absolutely constant — if nobody ever changed their opinions, politics would be very boring! But that's why we have a margin of error on this equality; and here I make an editorial judgment call based on recent Canadian political history about how much public support moves from one day to the next: For each day, I assign a daily standard deviation of a minimum of 0.05%, but sometimes considerably more depending on how much is going on at the time.

What do I mean by "how much is going on"? Canadians tend to pay more attention to politics (and are more likely to change their intended votes) during election campaigns, but there are also other times when large shifts happen. For example, in February and March of this year, when it became clear that the Prime Minister had pressured the Attorney General to negotiate a deferred prosecution agreement with SNC-Lavalin (and fired her when she did not comply), politics was likewise at the forefront of many Canadians' minds, and voting intention changed more rapidly during this period than in the surrounding months.

So how do I measure this? I rely on what I have: Polls. When news organizations think that voting intentions are likely to be changing rapidly, they commission more polls. Consequently, I count how many polling companies are "in the field" within 3 days of each date (this window is here to allow time for polls to be commissioned after news breaks) and if there are N pollsters in the field, I take a standard deviation of (N + 1) * 0.05% in daily support for each party.
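
As a sketch of that rule: the code below counts distinct pollsters whose field period overlaps a window of 3 days either side of a date (that is my reading of "within 3 days"; other readings are possible) and applies the (N + 1) * 0.05% formula. The pollster names and dates are made up for illustration.

```python
from datetime import date, timedelta

BASE_SD = 0.05   # percentage points per day, from the text
WINDOW = 3       # days either side of the date (an assumed interpretation)

def daily_drift_sd(day, polls):
    """polls: iterable of (pollster, first_day, last_day) field periods."""
    lo = day - timedelta(days=WINDOW)
    hi = day + timedelta(days=WINDOW)
    # Count distinct pollsters whose field period overlaps [lo, hi].
    in_field = {name for name, first, last in polls if first <= hi and last >= lo}
    return (len(in_field) + 1) * BASE_SD

# Hypothetical field periods, for illustration only.
polls = [
    ("Pollster A", date(2019, 10, 8), date(2019, 10, 10)),
    ("Pollster B", date(2019, 10, 9), date(2019, 10, 11)),
    ("Pollster A", date(2019, 10, 11), date(2019, 10, 13)),
    ("Pollster C", date(2019, 6, 1), date(2019, 6, 3)),
]
busy = daily_drift_sd(date(2019, 10, 10), polls)   # two pollsters in the field
quiet = daily_drift_sd(date(2019, 8, 1), polls)    # nobody in the field
print(busy, quiet)
```

A pollster with two polls in the window is only counted once here, since the rule is about how many companies are polling and not how many polls they release.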

It's a judgment call, but it seems to work well. Larger or smaller margins of error would make the graphs more or less noisy.

Using my current corpus of polls — taken from Wikipedia's list of Canadian Federal opinion polls since the 2015 election — this gives me 24208 (approximate) linear equations in 21687 unknowns. This is obviously impractical to try to solve... just kidding! Computers are fast. It takes under a minute to compute a best-fit solution to these, and it would be even faster if I spent a few hours rewriting my solver to take full advantage of the sparsity of the system.
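
For readers who want to see the shape of the computation, here is a toy version with three unknowns instead of 21687. Each "approximately true" statement becomes one row, scaled by one over its standard deviation, and the whole system goes to an ordinary least-squares solver. I use NumPy's `np.linalg.lstsq` for the sketch; the numbers are invented and my real solver is not this code.

```python
import numpy as np

# Unknowns: true support on three consecutive days, x = [s0, s1, s2].
rows, targets, sds = [], [], []

def add_equation(coeffs, target, sd):
    """Record 'coeffs . x is approximately target, give or take sd'."""
    rows.append(coeffs)
    targets.append(target)
    sds.append(sd)

# Day-to-day continuity: support barely moves overnight (sd 0.1 points).
add_equation([-1, 1, 0], 0.0, 0.1)
add_equation([0, -1, 1], 0.0, 0.1)

# Noisy daily observations (sd 1.5 points each); values are made up.
add_equation([1, 0, 0], 32.0, 1.5)
add_equation([0, 1, 0], 33.0, 1.5)
add_equation([0, 0, 1], 31.0, 1.5)

A = np.array(rows, dtype=float)
b = np.array(targets)
sd = np.array(sds)

# Weighted least squares: divide each equation by its standard deviation.
x, *_ = np.linalg.lstsq(A / sd[:, None], b / sd, rcond=None)
print(x)
```

The fitted values all land very close to 32: the tight continuity equations stop the estimate from chasing the noise in the individual observations. For a system of realistic size, a sparse solver such as `scipy.sparse.linalg.lsqr` would be one way to exploit the fact that each equation touches only a handful of unknowns.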

There are two more things we need to do. First, I mentioned above that I computed polling margins of error based on the party's level of support, the number of people polled, and the pollster's non-sampling noise. We need to compute that noise, or as I refer to it, "excess variance". To do this, I take the regression output and feed it back to compute the expected polling results (including house effects) for each poll in my database. Then I calculate how far off the polls were, and compare that against the expected margins of error from a theoretical pollster with perfect random sampling. I average these over all the polls conducted by a firm to produce a "pollster excess variance" value; taking a middle-of-the-pack pollster as an example, Leger Marketing has a +/- 2.5% error on top of the unavoidable statistical errors. These values computed from one run get fed back for use in the next run; fortunately this converges very quickly, so even starting without any advance knowledge of how accurate pollsters are I get consistent results after a few runs.
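
One plausible way to turn that description into code is sketched below: for each poll, subtract the ideal sampling variance from the squared miss, average over the firm's polls, and take a square root. The function name `excess_sd`, the clamp at zero, and the choice to ignore rounding error are my simplifications for this example and not necessarily what my production code does.

```python
import math

def excess_sd(polls):
    """Estimate a pollster's non-sampling noise, in percentage points.

    polls: list of (reported_pct, expected_pct, sample_size), where
    expected_pct is the model's prediction including the house effect.
    """
    excess = []
    for reported, expected, n in polls:
        p = expected / 100.0
        sampling_var = 100.0 ** 2 * p * (1.0 - p) / n   # in points squared
        excess.append((reported - expected) ** 2 - sampling_var)
    # Average over the firm's polls; clamp at zero so the sd is well defined.
    return math.sqrt(max(0.0, sum(excess) / len(excess)))

# Hypothetical pollster: two polls of 2500 people, missing by 1 and 3 points.
noisy = excess_sd([(51.0, 50.0, 2500), (53.0, 50.0, 2500)])
# Hypothetical pollster whose miss is within pure sampling error.
clean = excess_sd([(50.5, 50.0, 2500)])
print(noisy, clean)
```

The result would then be fed back into the margins of error for the next regression run, as described above.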

Finally, we need to decide on a polling consensus. Here we run into another judgment call: Taking the famous Shy Tory effect as an example, it's possible that every pollster is reporting biased values due to uncooperative poll respondents. The best we can do is to hope that true voting intentions fall somewhere between what the different pollsters would measure; so I compute a "consensus" such that a weighted average of pollsters' house effects is zero. Those weights? Again a judgment call, but I weight pollsters according to the inverse of their average polling margin of error (including the aforementioned "excess" variance); in short, I assume that polling firms which are more precise are also generally more accurate. (See Wikipedia for an explanation of these terms.) However, note that the determination of "consensus pollster" will shift all the polls consistently, and will not change the shape of how each party's support changes over time.
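
The recentering step can be sketched in a few lines. The function name `recenter`, the dictionary layout, and the example numbers are hypothetical; the only parts taken from the description above are the inverse-error weights and the requirement that the weighted average of house effects be zero.

```python
def recenter(house_effects, avg_error):
    """Shift house effects so their precision-weighted mean is zero.

    house_effects: {pollster: house effect in points} relative to any baseline
    avg_error: {pollster: average margin of error, including excess variance}
    Returns (shifted house effects, amount to add to the support estimates).
    """
    weights = {name: 1.0 / avg_error[name] for name in house_effects}
    total = sum(weights.values())
    offset = sum(weights[n] * h for n, h in house_effects.items()) / total
    shifted = {n: h - offset for n, h in house_effects.items()}
    return shifted, offset

# Made-up house effects for one party, measured relative to pollster A.
house = {"A": 0.0, "B": 2.0, "C": -0.5}
error = {"A": 2.0, "B": 4.0, "C": 2.0}
shifted, offset = recenter(house, error)
print(shifted, offset)
```

Because observed support is modelled as true support plus house effect, subtracting `offset` from every house effect means adding the same `offset` to every day's estimate for that party. That is why the choice of consensus moves the whole curve up or down without changing its shape.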

So where are we at right now? My latest run tells me that, as of October 15th, the Conservative Party of Canada is leading with 31.82% of the vote; the Liberal Party of Canada is in second place with 30.50%; the New Democratic Party has 18.65% support; the Green Party has 8.26%; the Bloc Quebecois is at 7.37% (nationally — they only run candidates in Quebec, so this translates to roughly 30% in Quebec); and the People's Party of Canada trails with 2.73%. This and more data is now available on a separate page, which I will keep updated between now and the election — and depending on public interest, may update further in the future.

In March 2008, a statistician using the pseudonym Poblano started a blog where he made predictions about the outcome of the 2008 US primary elections. This blog — named "FiveThirtyEight" in reference to the size of the US electoral college — quickly rose to prominence, and the author (who soon revealed himself to be Nate Silver) was widely lauded for his successful predictions of electoral outcomes.

He also attracted a large number of imitators — especially in Canada, where (since FiveThirtyEight concerned itself almost exclusively with US politics) they didn't face any competition from FiveThirtyEight. In some cases, these competitors went so far as to imitate the name; one of the first Canadian poll aggregation websites operated under the name "308" (a reference to the number of seats in the Canadian House of Commons), and a later website used the name "338" — after the House of Commons expanded by 30 seats.

Unfortunately, when political nerds try to imitate what a statistical nerd has done without having any understanding of statistics, the outcome is quite predictable: They suck.

There are a few ways that Canadian polling aggregators have failed. This is not intended to be an exhaustive list, and not every polling aggregator has failed in all of these ways; but any of these is severe enough to result in significantly misleading results.

1. Failing to account for house effects. As I wrote about in 2008, pollsters have "house effects" which skew their polling data. These come from many sources, including polling methods (some voters are more likely to talk to a human; others are more likely to press buttons in response to automated prompts) and how questions are asked (some pollsters ask about the "party X candidate"; others about "party X"; and others about "party X led by leader Y").

But whatever the source of these house effects, ignoring them produces a highly misleading view of the field: For example, the CBC Poll Tracker always shows an upwards spike in support for the Conservative Party of Canada every time a new poll from Angus Reid, DART, or Forum is released.

An effective poll aggregator must model house effects and compute a polling trendline using "adjusted" polls.

2. Considering only the most recent poll from each pollster. Polls are inherently noisy; typical reported statistical margins of error are +/- 2%, and those are theoretical ideal values which assume perfect random sampling. With pollsters releasing new polls on a weekly or even daily basis, a large shift (e.g., Nanos Research's poll ending October 3rd, which reported an overnight 3.4% jump in Liberal support) is far more likely to be due to sampling error than to an actual shift in the underlying numbers.

Discarding "earlier" polls loses important information, and effective poll aggregation should always avoid losing information.

3. Mishandling rolling polls. Speaking of Nanos Research: They and Mainstreet Research both report 3-day "rolling" polls; each day they add a new day of polling and remove the oldest day of polling. Sites are handling these polls in at least three wrong ways: Ignoring all but the latest poll; including every third poll (to avoid overlapping dates); and including all of the polls but reducing their weight by a factor of 3 to account for the reused data.

The right way for a polling aggregator to handle rolling polls is to "reverse engineer" the original daily data and use those values (which inherently have a much higher margin of error due to the small daily sample sizes).
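
A small sketch shows what the rolling numbers do and do not reveal. Consecutive 3-day averages share two days, so their difference isolates the day that was added minus the day that was dropped. The rolling averages alone do not determine the daily values outright; they fix these differences, and the remaining freedom has to be settled by other information. The function name `day_swaps` and the daily figures are invented for the example.

```python
def day_swaps(rolling, window=3):
    """For consecutive rolling averages, return (day added) - (day dropped)."""
    return [window * (b - a) for a, b in zip(rolling, rolling[1:])]

# Made-up daily results which a pollster would never publish directly...
daily = [30.0, 33.0, 36.0, 31.0, 35.0]
# ...and the 3-day rolling averages which it would publish.
rolling = [sum(daily[i:i + 3]) / 3 for i in range(len(daily) - 2)]

swaps = day_swaps(rolling)
print(swaps)   # daily[3] - daily[0], then daily[4] - daily[1]
```

Note the factor of 3: a small wobble in the published average corresponds to a swing three times as large between individual days, which is one way of seeing why the recovered daily values are so noisy.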

4. Ignoring the full date range of a poll. Many polls reported during the 2019 Canadian Federal election campaign have been conducted over a 3-day period, and then reported the following day; for example, a poll conducted between October 1st and 3rd would be reported on October 4th. This is not always the case, however: There have been a handful of polls conducted during a single day, and others over the course of an entire week; and while most polls are reported the day following the final day of polling, some are reported late on the final day, and others are reported 2-3 days later.

Most polling aggregators ignore the date range and treat polls as if they were conducted on a single day: occasionally the "midpoint" day, but usually the final day. Some aggregators do even worse, and treat polls as having been performed on the day the poll was reported. At times when party support levels are changing, it is essential for polling aggregators to use the correct date ranges; examples include the aftermath of the Justin Trudeau "blackface" scandal on September 18th (which cut Liberal party support by 1.0%) and of the Leaders' debates on October 7th and 10th (where NDP support surged by 3.2% and Bloc Quebecois support increased by 1.4%).

5. Not accounting for non-sampling polling noise. As I mentioned earlier, polling "margins of error" are theoretical ideal levels based on the assumption of perfect random sampling. Guess what? Nobody is perfect. In practice, some pollsters are far "noisier" than their sample sizes would indicate; while there are some exceptions (in both directions!) the added polling variance from non-sampling error is typically around the same size as the unavoidable error from random sampling.

Good polling aggregators should estimate the excess variance for each pollster and use this to compute "corrected" error margins for polls.

Can we do better? Yes we can — and with good aggregation methodology, the results make far more sense. Stay tuned for details about how I'm aggregating Canadian political opinion polls.