Introduction to the Polling System

By Bill James

May 6, 2019

For the last month I have been doing daily Presidential polls on Twitter. The time has come for me to report some results, and, before I can do that, I need to explain what I am doing and why I am doing this.

First of all, I am well aware that:

a) Twitter users are not perfectly representative of voters, and

b) My followers on Twitter are different from Twitter users overall.

My polls are only instructive about those people who happen to follow Bill James on Twitter and who care enough to vote; that’s all. I’m not trying to make it more than that.

So far we have collected 34,055 votes in the polls—a little more than a thousand a day. Let me explain the method of interpreting these.

In college basketball or college football there are hundreds of teams playing thousands of games. It may be that in a given year Duke has not played UCLA. Nonetheless, we can evaluate the relative strength of Duke vs. UCLA, because Duke has played somebody who has played UCLA; if it happens that Duke has not played somebody who has played UCLA, which would be unusual, still Duke has played somebody who has played somebody who has played UCLA. There is SOME network of games that connects Duke to UCLA. If there was just one pathway of games that connected Duke to UCLA, one string, that would of course mean almost nothing, but in reality there are hundreds of such pathways, hundreds of such strings. Duke will have played Oklahoma, and Oklahoma will have played UCLA, and Duke will have played UNC-Charlotte and UNC-Charlotte will have played New Mexico State, and New Mexico State will have played UCLA.

Every basketball game that is played provides some information about how each team compares to the field. If Duke beats Oklahoma and Oklahoma beats UCLA, we have some indication that Duke would be likely to beat UCLA; not certainty, of course, but some probability. In the harder case, suppose that Duke is 27-2 and Boise State is 29-0. We can still reasonably infer that Duke is probably better than Boise State, because the caliber of the schedule Duke has played would be much stronger.

The fact that Duke has a stronger resume than Boise does not mean they necessarily will win if they are pitted against one another in the NCAA tournament. The next game is a different game. The same is true of elections. I could poll Donald Trump against John Kasich tomorrow and again the next day, and Trump might win one day and Kasich the next; each poll is a different poll. But if Trump were to beat Kasich 70-30 one day, it is unlikely that Kasich would win the next. If Trump were to beat Kasich 51-49 one day, it is not nearly as unlikely that Kasich would win the next.

We can use each game that is played, in the NCAA basketball season, to triangulate the position of each team against the field, which means against each and every one of the other teams. Every game that is played, regardless of who is playing, tells us SOMETHING about the relative strength of UCLA vs. Duke—not very much, I grant you, but SOMETHING.

There are direct inferences from the results, which we could also call primary inferences, and there are indirect inferences from the results, which we could also call secondary inferences; calling them primary and secondary allows us also to talk about tertiary inferences and whatever fourth-level inferences would be called. Quaternary. I didn’t know; had to look it up.

People write me on twitter telling me that to compare Donald Trump to John Kasich, I have to have Donald Trump in the same poll as John Kasich, but this is not true. Clemson did not play Kansas University in football last season, 2018, nor the previous season. . . hell, I don’t remember if Kansas has ever played Clemson. I believe that the two teams did not have a common opponent. Nonetheless, we can reasonably conclude with something close to 100% certainty that if Clemson HAD played Kansas, Clemson would have won the game, probably by 70 points or more. How do we know this?

We know this because of tertiary and quaternary and quinary comparisons of scores. We know how Clemson compared to the field of teams; we know how Kansas compared to the field of teams. We can do the same with political candidates. By running poll after poll after poll comparing each candidate to the other candidates, we know how each one compares to the field.

That’s the overview; let’s get into the details. We assign every candidate an "initial value" or "input score" in the process.

Let me say first, and I cannot emphasize this enough: it makes absolutely no difference whatsoever what the initial values are. If you assigned every candidate an initial value of 100 or if you assigned Marianne Williamson an initial value of twelve billion and assigned everyone else an initial value of negative 7 million, you would wind up with exactly and precisely the same scores at the end of the process either way.

Let’s take the first poll that I ran—a poll which, at random, happened to have four of the heavyweight contenders. The results were:

Kamala Harris 31%

Donald Trump 27%

Joe Biden 26%

Bernie Sanders 16%

There are 6 points of comparison that come out of this poll: Harris to Trump, Harris to Biden, Harris to Sanders, Trump to Biden, Trump to Sanders, and Biden to Sanders. Each of those 6 points of comparison has two parties, so there are 12 "positions", one candidate to another, that come out of each four-person poll.

Let’s assume that we assign each of these four candidates an initial value of 100 points. That means that in each comparison, there are 200 points to be divided. When Harris is compared to Trump, Harris wins, 31-27. If there are 200 points to be divided and we divide them in a ratio of 31 to 27, that gives Harris 107 points and Trump 93. The ratio of 107 to 93 is not quite the same as 31 to 27, because the spreadsheet is saving decimals that we are not going to print out.

When Harris is compared to Biden, the ratio is 31 to 26, which, projected onto 200 points, gives Harris 109 points, and Biden 91. Harris compared to Sanders is a ratio of 31 to 16, which gives Harris 132 points and Sanders 68.

At the end of this process (well, at the end of stage one of this process) we have three estimates of Harris’ strength, which are 107, 109 and 132. We also have three estimates for Trump, which are 126, 102 and 93. We have three estimates for Biden, which are 124, 98 and 91, and we have three estimates for the Bernster, which are 76, 74 and 68.

We then take the average of those three estimates, which is 116 for Harris, 107 for Trump, 104 for Biden, and 73 for Sanders. These are what I call the output scores; actually they are what we would call the Stage 1 output scores.
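
The Stage 1 arithmetic can be sketched in a few lines of Python. This is only an illustration of the steps described above, not the spreadsheet itself; the function name `run_stage` and the dictionary layout are assumptions made for the example.

```python
from itertools import combinations

# Vote shares from the first poll, as reported above.
poll = {"Harris": 31, "Trump": 27, "Biden": 26, "Sanders": 16}


def run_stage(scores, poll):
    """One stage: split each pair's combined points in the ratio of their
    poll shares, then average each candidate's estimates."""
    estimates = {name: [] for name in poll}
    for a, b in combinations(poll, 2):
        pool = scores[a] + scores[b]              # points to be divided
        share_a = poll[a] / (poll[a] + poll[b])   # e.g. 31 / (31 + 27)
        estimates[a].append(pool * share_a)
        estimates[b].append(pool * (1 - share_a))
    return {name: sum(vals) / len(vals) for name, vals in estimates.items()}


start = {name: 100.0 for name in poll}            # 100 points per candidate
stage1 = run_stage(start, poll)
print({name: round(score) for name, score in stage1.items()})
# {'Harris': 116, 'Trump': 107, 'Biden': 104, 'Sanders': 73}
```

Feeding `stage1` back into `run_stage` gives the Stage 2 scores, and so on.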

We then go to Stage 2, treating the Stage 1 output scores as the Stage 2 inputs.

In Stage 2, we again compare Harris to Trump, but this time, rather than having initial values of 100 for each candidate, we have an initial value of 116 for Harris and 107 for Trump, or 223 points between them. We split those 223 points in the Harris/Trump ratio of 31 to 27, as we did in the first round. Harris gets 119 of them; Trump gets 104. When Harris is compared to Biden in the second stage, there are 220 points to be divided, which we divide in the ratio of 31 to 26. Harris gets 120 of them; Biden gets 100. When Harris is compared to Sanders in the second stage, there are 189 points to be divided, which we will split in the ratio of 31 to 16. That gives Harris 125 points, and Sanders 64.

We do that, of course, for Trump vs. Biden, Trump vs. Sanders, and Biden vs. Sanders. Now we have three new estimates for each candidate, 12 new estimates in total. We form an average of the three new estimates for each candidate. Those averages are 121 for Harris, 108 for Trump, 105 for Biden, and 66 for Sanders. These are the Second-Stage Output Scores.

We then repeat the process, using the second-stage output scores as the third-stage input scores. At the end of the third stage, Harris will have an average score of 123, Trump 108, Biden 104, Sanders 65. These are the third-stage output scores. We repeat the process, using the third-stage output scores as the fourth-stage input scores. We then repeat the process again, and again, and again, and again. Eventually the numbers will entirely stop moving, and the output numbers will be exactly the same as the input numbers.

When the point is reached at which the numbers entirely stop moving, Kamala Harris will have 124 points, Donald Trump will have 108, Joe Biden will have 104, and Bernie Sanders will have 64. There are 400 "points" in the system, of which Harris will have 31%, Trump will have 27%, Biden will have 26%, and Sanders will have 16%.

With only 12 calculations in the process, the data will stabilize in relatively few iterations of the process, ten or fewer. Later on, we will have hundreds of calculations in the process, with each poll interacting with all of the other polls, and then it will take dozens of iterations for the data to stabilize. I’m not sure how many it takes at the moment; I think it is around 50 iterations, 50 "stages", each one using the output data from the previous stage as the input data, or "starting point" data.

Suppose, however, that we didn’t give each candidate 100 points as a starting point. Suppose that we gave all 400 points at the starting gate to Bernie Sanders.

Biden	Harris	Sanders	Trump
0	0	400	0

When you compare Biden to Harris, Biden to Trump, or Harris to Trump, they will have first-stage scores of zero, because there are a total of zero points to be divided in those comparisons. When you compare Harris to Sanders, however, there are 400 points to be divided, which we will split in the ratio of 31 to 16, giving Harris 264 of them, and Sanders 136. Harris, then, has first-stage scores of zero, zero and 264, which is an average of 88. Bernie has first-stage scores of 152, 149 and 136, which is an average of 146. These, then, are the first-stage output scores:

Biden	Harris	Sanders	Trump
83	88	146	84

We repeat the process, using the first-stage output scores as second-stage input scores. When we compare Harris to Sanders in the second stage there are 234 points to be divided, of which Harris will claim 154, and Sanders 80. These will be the second-stage output scores:

Biden	Harris	Sanders	Trump
100	113	84	103

The second-stage output scores, of course, become the third-stage input scores. Harris and Sanders now have 197 points to be split between them, of which Harris will get 130, and Sanders 67. These will be the third-stage output scores:

Biden	Harris	Sanders	Trump
104	120	69	107

These will be the fourth-stage output scores:

Biden	Harris	Sanders	Trump
104	123	65	108

And these will be the fifth-stage output scores:

Biden	Harris	Sanders	Trump
104	124	64	108

These are, of course, the exact same numbers that we got when we initially assigned 100 points to each candidate. The initial values have no input at all into the final numbers. The conclusions of the system are based entirely on the polling data. These are the numbers you get because these are the only numbers that you CAN get from this data, using this process. It’s the same process, basically, that we use to compare Duke to UCLA in basketball, or Clemson to Kansas University in football. We know what the relative strength of all of the candidates is, based on the fact that everybody is "in" the polling data somewhere, and there is only one position for each candidate in the data which is stable, one position which is not going to move if you repeat the process a million more times, although it will move, of course, if you do another poll.
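
The claim that the starting values wash out can be checked with a short Python sketch. It repeats the stage calculation from two different starting points, the even split and the all-to-Sanders split, and compares where they end up. The helper names `run_stage` and `iterate`, and the choice of 50 stages, are assumptions for the illustration.

```python
from itertools import combinations

poll = {"Harris": 31, "Trump": 27, "Biden": 26, "Sanders": 16}


def run_stage(scores):
    """Average, for each candidate, the pairwise splits of combined points."""
    totals = {name: 0.0 for name in poll}
    for a, b in combinations(poll, 2):
        pool = scores[a] + scores[b]
        totals[a] += pool * poll[a] / (poll[a] + poll[b])
        totals[b] += pool * poll[b] / (poll[a] + poll[b])
    # In a single poll, every candidate appears in len(poll) - 1 comparisons.
    return {name: total / (len(poll) - 1) for name, total in totals.items()}


def iterate(scores, stages):
    """Feed each stage's output back in as the next stage's input."""
    for _ in range(stages):
        scores = run_stage(scores)
    return scores


even_start = {name: 100.0 for name in poll}
sanders_start = {"Harris": 0.0, "Trump": 0.0, "Biden": 0.0, "Sanders": 400.0}

final_even = iterate(even_start, 50)
final_sanders = iterate(sanders_start, 50)
print({n: round(s) for n, s in final_even.items()})
print({n: round(s) for n, s in final_sanders.items()})
# Both lines print {'Harris': 124, 'Trump': 108, 'Biden': 104, 'Sanders': 64}
```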

At this point you could say "Well, couldn’t you just have multiplied the 400 points times the 31% for Harris, the 27% for Trump, etc., and gotten the same results a lot easier?" In this case, yes, you could have, because there is only one poll involved, so all of the data is pushing the conclusion toward only one point. But, in the same way that Duke may play Carolina twice and Duke wins once and Carolina wins by 18 the other time, different polls will have conflicting information about the relative strength of the different candidates. When you have 25 polls involved rather than one, the candidate is being pushed in all different directions, so the simpler process would no longer work.

There are four or five technical points that I need to make here.

First, I said earlier that "you would wind up with exactly and precisely the same scores at the end of the process" regardless of what input numbers you use at the start of the process. Well, "exactly and precisely" is a tough standard. You would wind up with exactly and precisely the same number if you ran the system through an infinite number of loops. I don’t run the system through an infinite number of loops. What I do is, I measure the changes between the input and the output scores for each candidate in each stage. I have 27 candidates now in the polling network. I sum up the input/output changes for all 27 candidates. When the total of the changes for all 27 candidates is less than .00001, then I stop. At that point the numbers are not moving in any meaningful sense. Joe Biden is at 935.26402. If I ran the system through another 100 loops, he would still be at 935.26402. The data is not moving. Whether that justifies the use of the expression "exactly and precisely" the same, I’ll leave that to you.
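
A stopping rule of this kind might look like the following Python sketch, shown on the single four-candidate poll from earlier rather than on the full 27-candidate network. The name `run_until_stable`, the `max_stages` safety cap, and the use of absolute differences when summing the changes are assumptions; the text only says the changes are summed and compared to .00001.

```python
from itertools import combinations

poll = {"Harris": 31, "Trump": 27, "Biden": 26, "Sanders": 16}


def run_stage(scores):
    """One stage of pairwise splitting and averaging for a single poll."""
    totals = dict.fromkeys(poll, 0.0)
    for a, b in combinations(poll, 2):
        pool = scores[a] + scores[b]
        totals[a] += pool * poll[a] / (poll[a] + poll[b])
        totals[b] += pool * poll[b] / (poll[a] + poll[b])
    return {name: total / (len(poll) - 1) for name, total in totals.items()}


def run_until_stable(scores, tolerance=0.00001, max_stages=1000):
    """Repeat stages until the summed input/output change is below tolerance."""
    for stage in range(1, max_stages + 1):
        new_scores = run_stage(scores)
        # Assumption: "changes" are measured as absolute differences.
        total_change = sum(abs(new_scores[n] - scores[n]) for n in scores)
        scores = new_scores
        if total_change < tolerance:
            return scores, stage
    raise RuntimeError("scores did not stabilize within max_stages")


final, stages_used = run_until_stable({name: 100.0 for name in poll})
print(stages_used, {n: round(s, 5) for n, s in final.items()})
```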

Second, I haven’t explained how many points there are in the system. The system at the moment is set up to divide 10,000 voters. Joe Biden is at 935. What that means, exactly, is that, as best I can measure this, Joe Biden would be the #1 choice for 935 out of 10,000 poll respondents, or 9.35%. That is not 9.35% of the Democratic vote, but 9.35% of the vote for all candidates in the polls, Republican and Democrat. Marianne Williamson is at 33 out of 10,000, or 0.33%.

In the extended example that I gave you, based on the first poll, everybody had 100 points to start with, 400 total. My third point: in that example, the number of points didn’t move up or down; they merely shifted from one person to another.

That’s the intent of the system, but it doesn’t work perfectly in practice, once you have 26 different polls interacting with one another, all at one time. The reason it doesn’t is that different candidates have been polled different numbers of times. The output score from each stage is the average of the calculations based on each candidate-to-candidate comparison. Bernie Sanders has been polled six times; Jay Inslee has been polled only twice. If I were to poll Inslee and Sanders tomorrow, the results of that poll, whatever they would be, would have more impact on Inslee’s new score than on Sanders’. Inslee might move up more than Sanders moves down, or Inslee might move down more than Sanders moves up.

This doesn’t happen as long as everybody has been polled the same number of times, but since the polling groups are randomly selected, that never happens. Because of this, I go into each round with 10,000 "points"—10,000 voters who are being represented—but I will wind up the round with 9,982, or 10,009, or some similar number. Because of that, I have to put in another step to each round of calculations, re-centering the numbers to 10,000 voters.
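
The drift and the re-centering step can be illustrated with a small Python sketch. The two polls below are invented, with placeholder candidates A through E, so that three candidates are polled twice and two are polled once. Scaling every score by a single common factor is an assumption about how the re-centering might be done; the text does not spell out the exact adjustment.

```python
from itertools import combinations

# Invented results for placeholder candidates A-E; these are not real polls.
polls = [
    {"A": 40, "B": 30, "C": 20, "D": 10},
    {"A": 30, "B": 20, "C": 10, "E": 40},
]


def run_stage(scores, polls):
    """Average every pairwise estimate a candidate receives across all polls.

    Assumes every candidate in `scores` appears in at least one poll.
    """
    estimates = {name: [] for name in scores}
    for poll in polls:
        for a, b in combinations(poll, 2):
            pool = scores[a] + scores[b]
            estimates[a].append(pool * poll[a] / (poll[a] + poll[b]))
            estimates[b].append(pool * poll[b] / (poll[a] + poll[b]))
    return {name: sum(vals) / len(vals) for name, vals in estimates.items()}


def recenter(scores, total=10_000):
    """Scale all scores by one factor so that they sum to `total` again."""
    factor = total / sum(scores.values())
    return {name: score * factor for name, score in scores.items()}


start = {name: 2_000.0 for name in "ABCDE"}   # 10,000 points going in
raw = run_stage(start, polls)
centered = recenter(raw)
print(round(sum(raw.values()), 1), round(sum(centered.values()), 1))
```

With these invented numbers the raw total comes out a little under 10,000, and the re-centered total is back at 10,000.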

My fourth technical point—maybe this is obvious—is that candidates who are not included in the poll still go up and down in the polls every day based on the poll results. Let’s say that I poll Harris, Warren, Inslee and Bill Weld; Joe Biden is not included in the poll. Still, Biden will go up or down at least a little bit, based on the information that comes from the poll. He has to; it’s the same as in the basketball rankings. Let’s say that Oklahoma State has a win against Arkansas, and that Arkansas then plays against LSU. If Arkansas beats LSU by 20 points, that will move Arkansas up in the polls. When Arkansas moves up, that pushes Oklahoma State up as well, as a secondary effect. The same thing here. Joe Biden is not in THIS poll, but Biden was compared to Kamala Harris in Poll #1, and to Bill Weld in Poll #25. All of the polls interlock. If Harris and Weld move, that moves Biden. Harris and Weld will move by much MORE than Biden moves, of course, but if Weld moves down by 20 points as a result of a poll, that might move Biden down by 2 points as a result of the comparison between Weld and Biden.

Fifth, I need to give you a better explanation of how each day’s poll candidates are decided. I have said that it is random, but it isn’t PERFECTLY random. It is random, but with four "buts". The first "but" is that nobody can be included in two consecutive polls. If I poll John Kasich on Monday, he can’t be polled again on Tuesday. The second "but" is that each group of random candidates MUST include at least one of the candidates who has been polled least often. Just at random, with 27 candidates included in the poll, selecting four each day, somebody would be polled nine times while somebody else wouldn’t have been polled at all; it just happens. The rule that each group must include at least one of the people who has been polled least often is intended to prevent candidates from being left behind.

The third "but" is similar to the second. Each candidate has a random number, and the four candidates with the four highest random numbers make up the next poll. But I modify the random number very slightly by the number of times that the candidate has previously been polled. The actual formula is a random number (zero to one) minus .02 times the number of times the candidate has previously been polled. If two candidates each have a random number of .900 but one has previously been polled six times and the other one two times, I change the random numbers to .78 and .86, so that it is more likely that the candidate who has only been polled twice previously would be in the next polling group.

And the fourth "but", which works at cross purposes with the rules trying to encourage an equal number of trials for each candidate, is that each polling group must represent candidates who have a total score of at least 1,000—that is, 10% of the value. If a polling group is Julian Castro (224), John Delaney (71), Jeff Flake (192) and Tulsi Gabbard (187), that’s four relatively weak candidates, with a total score of just 674, as of now.

Maybe I should poll that group, and maybe I will, later on in the process. But until I get a really firm hold on the position of the leaders, I don’t want to run polls sorting out exactly where one weak candidate stands in relation to three other weak candidates. I do care where the weaker candidates stand; I just care more about getting the front runners sorted out correctly.

If I sort the data at random and it generates a non-qualifying list, I just re-sort until I get a qualifying list. I will make some changes to the system later on. I might get rid of the rule requiring each polling group to represent 10% of the field. I might switch from a system in which I generate a new list each day, which makes it possible that someone will be polled three times in a week, to a system in which I generate one set of random numbers representing all 27 candidates, and then go through the whole list, four candidates at a time.
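
The selection rules described above (random numbers nudged by .02 per previous appearance, no back-to-back appearances, at least one least-polled candidate, and a combined score of at least 1,000) can be sketched as a re-draw loop in Python. Everything in the example data is invented: the placeholder names C01 to C27, their scores, and their polling counts. The function name `pick_poll` and the `max_tries` cap are likewise assumptions.

```python
import random


def pick_poll(scores, times_polled, previous_poll, rng,
              group_size=4, penalty=0.02, min_total=1000, max_tries=100_000):
    """Re-draw until a group satisfies all of the rules described above."""
    fewest = min(times_polled.values())
    for _ in range(max_tries):
        # Random number (zero to one) minus .02 per previous appearance.
        keys = {name: rng.random() - penalty * times_polled[name]
                for name in scores}
        group = sorted(keys, key=keys.get, reverse=True)[:group_size]
        if set(group) & set(previous_poll):
            continue  # nobody may appear in two consecutive polls
        if all(times_polled[name] > fewest for name in group):
            continue  # must include at least one least-polled candidate
        if sum(scores[name] for name in group) < min_total:
            continue  # group must represent at least 1,000 of 10,000 points
        return group
    raise RuntimeError("no qualifying group found")


# Invented field of 27 placeholder candidates; none of these numbers are real.
names = [f"C{i:02d}" for i in range(1, 28)]
scores = {name: 10_000 * (i + 1) / 378 for i, name in enumerate(names)}
times_polled = {name: 2 + i % 5 for i, name in enumerate(names)}
previous_poll = names[1:5]

group = pick_poll(scores, times_polled, previous_poll, random.Random(2019))
print(group)
```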

One change that I will definitely make will be that, at some point, I will start dropping the oldest polls out of the system, and basing the results only on the more recent polls. I am thinking that I will wait until I have polled everyone at least seven times, and then drop the oldest poll for each candidate, then wait again until I have seven polls for everybody. We’ve got almost a year to go until the end of the primaries.

OK, that concludes the explanation of the technical issues. Transitioning now to advocacy or defense of what I am doing.

Several people have already tried to tell me, and more of you are bound to tell me, that my polling does not meet "scientific standards" for polling methodology. Well, OK.

The concept of "scientific polling", frankly, is a bullshit term. In the Ty Cobb era, the term "scientific baseball" was used constantly. "Scientific baseball" meant using the hit and run, taking some advantage of the platoon differential, and a few other things. What it meant, actually, is "best practices" baseball—not that all of those WERE the best practices, but they were believed at the time to represent the best practices. What is actually meant by the term "scientific polling" is "best practices" polling. Science does not lock up in endorsing a certain set of practices; it moves forward.

All methods of polling have drawbacks; all polls will sometimes fail because of their drawbacks. "Best practices" polling means that you make every effort to balance the polls so that they represent the people who are actually going to vote. You try to get the right balance of men to women in your polling group. You try to get the right percentage of old people versus young people. You try to get the right balance of politically passionate people versus less involved voters, the right percentage of Republicans vs. Democrats vs. Independents. You try to get people who are actually going to vote, instead of people who have opinions but can’t vote or won’t vote.

The key word is, you try; maybe you don’t succeed, but you try. The thing is, though, that it’s expensive to do this. Statisticians have calculated that in order to reach a level of "statistical significance", you have to have at least 800 votes in a poll. But, because it is so expensive to do a poll, they always just crawl over the threshold that they think they have to meet, 800 votes, and then they stop. If a "scientific" poll has 900 voters, that would be unusual, because it costs more money to poll 900 people than it does to poll 800, so why would we do that, when we’ve already met the legal limit?

My polls just represent my twitter followers; that is all that I know about them. They’re not scientifically pre-screened to make sure I have the right percentage of left-brain people and right-brain people. If a 12-year-old takes a notion to vote in my poll, I’ve got the 12-year-old included in my data. It ain’t "scientific".

But.

My polls get about 1200 people a day, and I have a new poll every day. Through the first 26 polls, I have collected 34,055 votes. That’s a significant advantage over a group of 800 voters.

In a scientific poll, everybody is fighting for position in one big scrum. In my method, we focus on a group of just four candidates at a time, thus enabling us to collect meaningful information about a minor candidate, and then place that information in context relative to the larger picture.

My methodology is designed to, and does, position us to make inferences from the votes that go beyond what the scientific polls are able to do. All that the scientific polls can tell you about Eric Swalwell or John Delaney or Marianne Williamson is that not very many people are going to vote for them. They’re a trace; they don’t really show up in the polls.

My polls, on the other hand, are sufficient to distinguish pretty accurately between Eric Swalwell (127 supporters per 10,000 voters) and Marianne Williamson (33 supporters per 10,000 voters). You can take that advantage for whatever you think it is worth.

Because of that advantage, these polls should be sufficient, over time, to track relatively small changes in a candidate’s base of support. A candidate who is at 250 and gaining strength is different from a candidate who was at 400 a month ago and has dropped to 250. I see value in this; you can take it for whatever you think it is worth.

Every day, now that I have some prior results, I can predict what each candidate is likely to do in today’s poll. This adds meaning to the poll results. Yesterday’s poll candidates were Pete Buttigieg, John Kasich, Kirsten Gillibrand and Amy Klobuchar. Buttigieg came in with a score of 1180, Kasich at 597, Gillibrand at 212, and Klobuchar at 392. That means, based on prior poll results, that Buttigieg should get 50% of the vote, Kasich 25%, Gillibrand 9%, and Klobuchar 16%.
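
The prediction is just each candidate's score as a share of the four scores combined. A short Python sketch using the scores quoted above (the variable names are assumptions for the example):

```python
# Scores going into the poll, as quoted above.
scores = {"Buttigieg": 1180, "Kasich": 597, "Gillibrand": 212, "Klobuchar": 392}

total = sum(scores.values())  # 2381 points among the four candidates
expected = {name: round(100 * score / total) for name, score in scores.items()}
print(expected)
# {'Buttigieg': 50, 'Kasich': 25, 'Gillibrand': 9, 'Klobuchar': 16}
```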

Those are generally the results I got; generally, but not precisely. Klobuchar and Gillibrand were almost exactly where the previous data suggested that they would be: 16% for Klobuchar, 10% for Gillibrand. But Kasich, expected to be at 25%, was actually at 30%, while Buttigieg, expected to be at 50%, finished with 43%.

It’s not a HUGE difference, but it is a difference. There are three obvious theories that could explain this difference:

1) Buttigieg’s support may have slipped a little bit, while Kasich may have gained a little bit.

2) My previous measures of Buttigieg may have been slightly inaccurate, which is certainly possible. I had polled him only twice previously; it is possible that the system just hasn’t placed him properly yet.

3) It may just have happened, kind of at random, that yesterday’s poll respondents were a little different than the previous groups.

But if a candidate shows weakening support in two or three consecutive polls—as Beto O’Rourke and Cory Booker already have, for example—then we can reach the conclusion that their campaigns are not on schedule. We don’t want to over-rate or over-value that information, but we don’t necessarily want to ignore it, either.

Thank you for reading. I am going to start posting results, as nearly as I can, on a daily basis.