The Perfect Voting Structure
The purpose of this research is to estimate how often a group of MVP voters will wind up getting the vote right, given different sets of voting parameters. In other words, if we vote this way with this number of voters, we’ll wind up getting the right MVP xx% of the time, whereas if we vote that way with that number of voters, we’ll wind up getting the right MVP xx% of the time.
I am studying that problem through a series of models. As I do this, I’ll have to explain exactly how the research was done, not because this is interesting but because someone else, at some point in the future, may want to follow up on the research, and it is better if he or she does not start out at zero.
This issue could of course be studied in other ways, but there are two advantages to modeling the problem. One is that it greatly expands the sample size. We have a hundred years of real-life history with MVP votes; we can create—and I have created—thousands of years of simulated votes. The other advantage is that, in a model, we absolutely KNOW who the Most Valuable Player is. In real life, you may have your thoughts about who should be the MVP, I may have mine, but there is no absolute knowledge about the subject, thus no way to say for certain whether the voting system got the answer right or wrong. Combining these advantages, no one can say with much confidence whether the MVP vote reaches the best conclusion 90% of the time, 70%, 40%. . .no one really knows. Working with a model, we can know.
I have a friend; I’ll call him Jerry because that is not his name. Jerry has some wonderful qualities and some really annoying qualities. One of the latter is that he condemns things that are not the way they were when he learned to love them. He used to like college basketball, but they ruined college basketball for him when they adopted the shot clock and the 3-point basket, so he no longer has any interest in college basketball. When I first knew him he liked movies sometimes, but the last good movie he saw was. . ..I forget what it was; there was a movie he liked in the 1990s. He holds every movie that he sees to the standard of Casablanca, and thus rejects them all. You can go to lunch with him, but there are only like four restaurants in town that he won’t find some reason not to go to, and he’s not real happy with two of those. I’m guessing he doesn’t like poetry because it doesn’t rhyme anymore, so he’s right about that one; a blind pig will get a Valentine once in a while, I’m told. I’m sure you know people like that; you feel so bad for them because they are cutting themselves off from things that they once enjoyed and should still enjoy, but there’s really nothing you can do about it.
So one time, maybe 1985, it was November and they were getting ready to announce the Cy Young Awards. I asked Jerry who he thought should win, but he responded, "Oh, who cares; it’s all bullshit now anyway."
Excuse me?
It turns out that Jerry is upset about the fact that, in voting for the Cy Young Award, each voter votes for three pitchers, not just one like they used to. Through 1969, each voter voted for just one pitcher; whoever had the most votes got the trophy. Obviously it is a matter of time until that system winds up in a tie, and time ran out in 1969, when Denny McLain and Mike Cuellar tied for the American League Cy Young Award with ten votes each out of 24 cast. Nobody remembers this, but in the National League MVP battle that same year, Willie McCovey and Tom Seaver tied for the most first-place votes, with 11 each out of 24. Nobody remembers that because the MVP used a sensible voting structure with 10 names on each ballot and points given for each name, so that the vote didn’t wind up in a tie. But, in Jerry’s world, "Who the hell cares who they think should be second or third? It’s all bullshit."
That’s another of Jerry’s less charming qualities; the man has the analytical skills of Lou Dobbs. I tried to explain to him that, when you collect more information in the vote, you get a more reliable, more valid outcome. You can probably guess how that went.
Anyway, this article deals with that issue: what is the best way to vote for the MVP? How often does the best player actually win the MVP Award, do you suppose? How much better is the voting system now than it was in years past? How could it be done better than it is?
The Basic Model
I "created" 100 players to represent the players in a league, with the value of each player created by the formula:
100 * random * random
That is, 100, times a random number, times another random number. The reason you create player values with two random numbers, rather than one, is that it creates a more realistic distribution of values. If you just create each value as 100 times a random number, then there will be as many players between 90 and 100 as there are between zero and 10. In the real world, of course, there are many more players near the bottom of the value scale than near the top of the scale. The one-random-number alternative over-populates the high end of the scale, thus making it less clear who should be the MVP.
With this system, you have 100 players to choose from but a limited number of high-impact players, and it thus becomes relatively clear who the MVP candidates are—like real life. With 100 players in a league, the best player will usually have a value around 88 to 90, but sometimes the best player will be at 75 (or even lower), and sometimes he will be at 99.5 (or even higher). The average player will have a value a little bit less than 25. Sometimes you will have three players who are almost the same, and sometimes you will have one player who is far better than everybody else—like real life.
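A minimal sketch of this player-generation step, in Python (the function name and defaults here are my own, not taken from any original program):

```python
import random

def make_league(n_players=100):
    """Create n_players actual values as 100 * random * random.

    Multiplying two uniform draws skews the distribution toward the
    bottom: the mean works out to about 25 (100 * 1/2 * 1/2), and
    values near 100 are rare -- so the league has many marginal
    players and only a handful of MVP-grade ones.
    """
    return [100 * random.random() * random.random() for _ in range(n_players)]
```

Running `make_league()` and taking `max()` of the result gives you the "true" MVP for one simulated season.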
Having created the players, we must then create each voter’s perception of each player. I set up the system so that each voter’s perception of each player was the player’s actual value, plus 15 points, minus 30 times a random number. Using "AV" for actual value and "PV" for Perceived Value:
PV = AV + 15 – (30 * random)
So the player’s perceived value—that is, his value as seen by the voter--can be as much as 15 points higher than his actual value, or 15 points lower. A player with an Actual Value of 90 can have a perceived value by the voter as high as 105, or as low as 75. An average player (25) can have a perceived value as high as 40, or as low as 10. The voter will never think that an average player is the best player in the league, and he will never think that a player with a value of 60 is better than a player with a value of 90, but he MAY think that a player with a value of 70 is better than a player with a value of 90.
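The perception step can be sketched the same way (again, the naming is mine, and I am assuming each voter draws an independent error for each player):

```python
import random

def perceived_value(actual_value):
    """One voter's view of one player: PV = AV + 15 - (30 * random).

    The error is uniform on [-15, +15], so a voter can over-rate or
    under-rate any player by up to 15 points, independently of how
    every other voter sees him.
    """
    return actual_value + 15 - 30 * random.random()
```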
Note that these parameters were established mostly to model the problem of identifying the MVP of a league. If you were modeling the Cy Young Award, for example, you could probably use 40 players rather than 100, and if you were modeling the problem of identifying the Most Valuable Player on a team, then you could probably create 10 players, rather than 100. You would have to adjust the model for what you are trying to study.
Bias and Error
Each voter is subject to both bias and error, which may be seen as different things. Bias is systematic, and applies to groups of players. Error may be individual, and apply to only one player. For example, a voter may be a fan of one team, and over-value the members of that team, or he may be prejudiced against a group of players or a type of player, or he may just not want to vote for a pitcher for the MVP, without regard to individual value. That’s bias. The voter may believe that only players from championship teams should win the Award, which is bias in favor of a set of teams. But the voter may also have happened to see games in which a player did not play well or played super-well, or he may have a false belief about the player’s baserunning skill or his defense, or he may believe that someone is a great team leader when in reality the player is just good with the press. That is error.
As bias tends to apply to groups of players, it also tends to be seen in groups of voters. If one person has a bias, probably others have the same bias—for example, the now-discredited belief in the reliability of won-lost records was a form of bias common to all the sportswriters of the 1950s/1960s.
I thought about modeling bias and error separately, but ultimately concluded that bias was simply one form of error—thus, that all of them could be accounted for at one time.
I should also note that statistical systems like WAR and Win Shares are also subject to both bias and error. These systems are built on assumptions, generalizations and estimates which, while hidden within the calculations, are nonetheless forms of bias and error.
While I ultimately decided to model bias and error as one, I bring this up because someone else might pick up this research and execute it in more detail than I have done, and that person might want to create bias and error as separate elements, with bias being shared among voters to a certain extent. We’ll return to the issue of bias later on, with our definition of bias being "shared assumptions which are not valid."
Our First Results
Our first question is, Given this set of assumptions, how often would the individual voter be "right" in selecting the MVP?
In 16,384 trials with this set of assumptions, the "voter" picked the right MVP 8,457 times, and picked a different player 7,927 times. The voter was right 51.6% of the time, and the individual voter was wrong 48.4% of the time.
Reality Check
After I calculated that figure—voter is right 52% of the time—I had the thought that "I wish there was some way to check what the number is in the real world." And then it occurred to me: There is, sort of.
Suppose that you assume that the MVP Award winner is always the right person, the person who should have been selected. If that was the case, then you could get the number we want by asking the objective question, what percentage of all first-place MVP votes go to the person who wins the Award?
From the time the BBWAA began voting on the MVP Awards in 1931 through 2019, there were 4,348 ballots cast in MVP voting. 48 of those ballots are unaccounted for, meaning that in six of the early votes we do not know how many of the voters voted for the winner. 99% of the ballots, however, are accounted for—4,300 out of 4,348.
Of those 4,300 ballots, 2,875 have listed the MVP winner as the #1 man. That’s 66.9%. It’s almost exactly two-thirds. Two-thirds of MVP votes are cast in favor of the eventual winner.
We do not know, of course, that the actual MVP Award winner is the most-deserving MVP candidate. We do not know that, but there are two possibilities. If the deserving MVP is in fact the winner, then he gets 66.9% of the first-place vote. If the deserving MVP is NOT the winner, then he gets LESS THAN 66.9% of the first-place vote.
What we know, then, is that the percentage of the ballots which go to the "right" man cannot be higher than 66.9%--and, unless the right man is ALWAYS selected, then it must overall be LESS THAN 66.9% of the vote.
OK, there is a tiny bit of space there created by the fact that the most-deserving candidate could have gotten a higher percentage of the FIRST-PLACE vote, but a lower percentage of the OVERALL vote. That could in theory happen, but as a practical matter it almost never does—and even when it does, it’s not mathematically significant. The MVP winner has gotten fewer votes than another candidate only once in the last 50 years, and then it was a margin of one vote. As a practical matter, we know that, unless the right candidate is ALWAYS the winner, then the overall percentage of votes which go to the right candidate must be less than 66.9%.
How much less? Well, that depends on how often you think the voters get it wrong. It is my opinion that, while the majority of awards do go to the most deserving candidate, there have been a substantial number of awards which went to the wrong man. In 1958 Mickey Mantle led the American League in WAR by a wide margin—but didn’t draw a single first-place MVP vote. The same year, Frank Lary had the highest WAR for pitchers by a good margin, but was not mentioned in the Cy Young voting. The Cy Young Award winner, Bob Turley, was not in the top 20 in WAR by a pitcher (combining the leagues, as the Cy Young Award was at that time a combined-league award). It is reasonable to argue that everybody got it wrong—everybody in the AL MVP vote, everybody in the Cy Young vote. The NL MVP, Ernie Banks, also did not lead the league in WAR, so one could argue that only 3 out of 24 National League MVP voters got it right. The American League Rookie of the Year, Albie Pearson, had only 0.9 WAR, while other candidates had 2.9, 2.3 and 1.4.
I don’t "know" that these voters all got it wrong; my point is that it is reasonable to argue that, in 1958 at least, the MVP voters were almost unanimous in voting for the wrong candidates. It is possible to argue that the actual percentage of voters voting for the "right" candidate is significantly lower than 66.9%. It cannot be higher than that. 51.6% actually seems to me like a pretty good estimate.
The Three-Judge Panel
Suppose that you have a three-judge panel voting on an award, with each judge listing just one player. How often is that going to arrive at the right result?
Three-judge panels are commonly used for post-season series, in which the award is going to be announced as soon as the series is over. Very often the two in-booth announcers and one other guy, the on-field interviewer or a producer, will just vote quickly on the award. If we assume that each voter is 51.6% accurate, how often will the three of them get it right?
We don’t have to model this one separately; it’s just simple math. If each voter gets it right 51.6% of the time, then:
All three voters will be right 13.75% of the time,
Two of the three will be right 38.67%,
Only one of the three will be right 36.25%, and
All three will be wrong 11.33%.
Obviously, if two or all three of the voters are right, then the right person wins the award, so 52.4% of the time the award will go to the right person outright. If all three are wrong, obviously the award goes to the wrong person. The only complication is that, if only one voter gets it right, then the result could be either a tie in the voting, or an award given to a less-deserving candidate.
If only one of the three voters gets it right, then the award will be given to the wrong man about 30% of the time, and there will be a tie about 70% of the time. Don’t ask me how I know this; it’s just an estimate, and it doesn’t make any difference anyway. It’s just how you split 36.25% when only one of the three voters sees the right answer. If two voters agree on the wrong answer 30% of the time, then 10.9 of those 36.25 will go to a lesser candidate, and the other 25.4 will wind up in three-way ties. Thus, we can estimate that, with a three-judge panel, each person voting only once:
The panel will get it right outright 52.4% of the time,
The panel will get it wrong 22.2% of the time (all three voters wrong, plus the cases in which the two wrong voters agree on the same man), and
There will be a three-way tie, with the most-deserving candidate being one of three winners, 25.4% of the time.
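The arithmetic here is just the binomial distribution with p = .516; a few lines of Python (my own sketch) reproduce the percentages:

```python
p = 0.516                      # chance a single voter picks the right MVP
q = 1 - p

all_right  = p ** 3            # about 0.137: all three voters right
two_right  = 3 * p ** 2 * q    # about 0.387: exactly two right
one_right  = 3 * p * q ** 2    # about 0.363: exactly one right
none_right = q ** 3            # about 0.113: all three wrong

# With two or three voters right, the deserving player wins outright:
print(all_right + two_right)   # about 0.524
```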
Eight-Person Panel, One Vote Each
Suppose that you have an eight-person panel voting on the MVP Award, with each man casting just one vote. I’m not aware that this system has ever been used in Major League baseball, although it is fairly common in amateur leagues. In the old Big 8, for example, the coaches would vote for the Coach of the Year, eight votes; somebody would win 3-2 or 4-2 or something. Suppose that you voted on the MVP that way. How often would that result in the right player being elected?
That system would result in the right MVP being selected about 68.6% of the time, depending on how you score the ties and how you break the ties. You’d have a lot of ties.
I created a model of the problem in the manner outlined before, and ran the process 512 times. The "right" MVP candidate:
Was a unanimous selection 15 times,
Won 7 of the 8 votes 36 times,
Won 6 of the 8 votes 62 times,
Won 5 of the 8 votes 89 times,
Won 4 of the 8 votes 104 times,
Won 3 of the 8 votes 97 times,
Won 2 of the 8 votes 71 times,
Won 1 of the 8 votes 37 times, and
Did not win any of the 8 votes one time.
Obviously, if the "true" MVP gets 5 or more votes then he will win the Award, so there’s 202 times out of the 512 that the true MVP wins the Award.
If the most-deserving MVP gets 4 of the 8 votes, he can either win the Award outright or tie for it. In the 104 times that the most-deserving player got four votes, he won the award outright 86 times, and tied for the award (one other player getting all four of the other votes) 18 times.
If the most-deserving MVP gets 3 of the 8 votes, he can (a) win the Award, (b) lose the Award outright, or (c) finish in a 3-3 tie with another candidate. There were 97 times that the "true" MVP got 3 of the 8 votes. In those 97 trials, the true MVP:
(a) Won the award outright 29 times,
(b) Lost it outright 29 times, and
(c) Tied for the award 39 times.
If the most-deserving MVP gets 2 of the 8 votes, he could, in theory, still win the award, as six other candidates could get one vote each. In 512 trials there were 71 times when the most-deserving candidate got only two votes, but it never happened that this was enough to win the Award anyway. The 71 trials led to 11 ties and 60 outright losses.
The results of the 512 trials with this model are summarized in the following chart:
First Place Votes    Occurs    Wins    Ties    Losses
                8        15      15       0         0
                7        36      36       0         0
                6        62      62       0         0
                5        89      89       0         0
                4       104      86      18         0
                3        97      29      39        29
                2        71       0      11        60
                1        37       0       0        37
                0         1       0       0         1
            Total       512     317      68       127
This voting structure will result in a tie about 13% of the time. If we split the ties and count them as half-victories, then the "right" player wins 351 MVP Awards in 512 trials, or 68.6%.
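Counting ties as half-victories, the 68.6% figure falls straight out of the chart (the numbers below are transcribed from it):

```python
# (first-place votes) -> (occurrences, wins, ties, losses), from the chart
chart = {
    8: (15, 15, 0, 0),
    7: (36, 36, 0, 0),
    6: (62, 62, 0, 0),
    5: (89, 89, 0, 0),
    4: (104, 86, 18, 0),
    3: (97, 29, 39, 29),
    2: (71, 0, 11, 60),
    1: (37, 0, 0, 37),
    0: (1, 0, 0, 1),
}
trials = sum(row[0] for row in chart.values())               # 512 trials
credit = sum(row[1] + row[2] / 2 for row in chart.values())  # ties = half wins
print(credit / trials)   # 0.6855... -> about 68.6%
```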
Sixteen-Person Panel, One Vote Each
OK, we move on now to a 16-person panel, with each voter casting just one vote. This is the system that was actually used in the Cy Young vote from 1956 to 1960, before expansion added two voters per team.
I ran 640 trial seasons with the assumptions outlined before, and a 16-person, one-vote panel. In those 640 trials the most-deserving MVP candidate won the vote outright 481 times, which is almost exactly 75%. He lost the vote outright 113 times (18%) and the vote ended in a tie the other 46 times, or 7%. This is a fuller breakdown of the results:
First Place Votes    Occurs    Wins    Ties    Losses
               16        13      13       0         0
               15         8       8       0         0
               14        21      21       0         0
               13        24      24       0         0
               12        42      42       0         0
               11        46      46       0         0
               10        77      77       0         0
                9        66      66       0         0
                8        76      71       5         0
                7        70      57       4         9
                6        68      36      16        16
                5        66      20      16        30
                4        28       0       4        24
                3        19       0       1        18
                2        10       0       0        10
                1         5       0       0         5
                0         1       0       0         1
            Total       640     481      46       113
To be sure the chart is clear: there were 70 times in those 640 trials when the most-deserving MVP got 7 first-place votes. In those 70 trials, the most-deserving MVP won the Award 57 times, lost it 9 times, and the vote ended in a tie 4 times. There was one time in the 640 trials when not a single MVP voter voted for the most-deserving candidate.
If we assume that there is a tie-breaker process in place and that the most-deserving MVP wins the tie-breaker 50% of the time, then the 16-person panel should find the right man 78.75% of the time. 79%.
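The 78.75% figure is the same half-credit arithmetic applied to the 640-trial totals:

```python
# Totals from the 16-voter chart above
wins, ties, losses, trials = 481, 46, 113, 640
assert wins + ties + losses == trials

# Assume the deserving candidate survives a tie-breaker half the time
accuracy = (wins + ties / 2) / trials
print(accuracy)   # 0.7875
```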
Observation about the 16-person Panel and the Cy Young
Our study above concludes that a 16-person panel with one vote per person should get the answer right about 79% of the time. This voting structure was actually used in determining the Cy Young Award from 1956 to 1960, and one would think that, because there are fewer candidates in the Cy Young competition, fewer serious contenders, the voting results should be MORE accurate than that. . .more accurate in a Cy Young vote than in an MVP vote, which is the basis of our model.
But in reality, if you look at those Cy Young votes, one can make a good argument that the voters got all five of them wrong. I think it is probably true that the voters got all five of them wrong. By WAR, they missed them all, and missed most of them by very wide margins. If that is true, that means that the actual results—granted, it is a sample of five—but the actual results do not seem to be consistent with the theoretical model. That forces us to ask why. What are we missing here?
I think it is bias. The voters in that era had a shared assumption that won-lost records were reliable, and thus, that the best pitchers would have the best won-lost records. They missed it because they were all voting on the same wrong assumption. It was Groupthink.
There’s a second indicator that that is what happened. Our study shows that 7% of such votes SHOULD result in a tie. This suggests that there should probably have been a tie in the Cy Young voting long before there actually was. With a 7% chance of a tie in each vote, there is a 52% chance of a tie in the first ten votes. In fact, there was no tie in the first 16 votes.
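The 52% claim checks out: with a 7% tie rate per vote, the chance of at least one tie somewhere in the first ten votes is one minus the chance of ten straight tie-free votes. (Extending the same arithmetic to 16 votes is my addition, not the author's.)

```python
p_tie = 0.07                      # roughly 46 ties in 640 simulated votes

chance_in_10 = 1 - (1 - p_tie) ** 10
print(round(chance_in_10, 3))     # 0.516 -> the "52% chance" in the text

# The real-world Cy Young went 16 votes without a tie; under the model
# the odds of at least one tie by then are close to 70%:
chance_in_16 = 1 - (1 - p_tie) ** 16
print(round(chance_in_16, 3))
```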
Why?
Groupthink. They never had a tie, because they all thought the same way, all shared SOME of the same error, which created a certain amount of consensus where logic and individual error of observation would not have yielded consensus.
The 24-Person, 10-Vote Ballot
Let us step forward now to a 24-person panel in which each voter votes for 10 players, and ranks them 10-9-8-7-6-5-4-3-2-1, with the best player getting 10 points and the 10th best player getting one. This exact voting structure has never been used to determine the BBWAA MVP Award, although it is close to the system which was used for many years. From 1931 to 1937 the BBWAA used a 10-9-8-7-6-5-4-3-2-1 voting structure, but with only eight voters, one per team. In 1938 they made two changes, changing from an eight-person panel to a 24-person voting group, and also changing from a 10-9-8-7-6-5-4-3-2-1 weighting system to a 14-9-8-7-6-5-4-3-2-1 system; in other words, the same as before except that the person listed first on the ballot gets 14 points, rather than 10. That voting system was used from 1938 to 1960, then was used again in the American League from 1969 to 1976, and in the National League from 1969 to 1992. When the first expansion happened in 1961, the BBWAA cut the number of voters from 3 per team to 2 per team, thus cutting the number of ballots from 24 to 20, but when the second expansion happened in 1969 then there were 12 teams in each league, so the number of ballots went back up to 24. It was 24 in each league, then, until that league expanded.
In my studies, this system was 86.7% accurate at determining the best-qualified candidate for the Most Valuable Player Award. I did two studies of this, and got that identical percentage both times. In the first study I simulated 128 seasons. Of the 128 seasons, the correct MVP was identified by the process in 111 seasons—86.72%. In the second round of studies I figured out a more time-efficient method to run the studies, and I was able to do 512 seasons in fewer work hours than it had required to do the first 128. In that series, the correct MVP was identified in 444 out of 512 trials—precisely the same percentage. Sticking just with the second group, the "true" MVP finished:
1st in the voting 444 times,
2nd in the voting 62 times,
3rd in the voting 5 times, and
5th in the voting once.
Turning it around the other way, the voted MVP was:
The best candidate 444 times,
The second-best candidate 55 times,
The third-best candidate 11 times, and
The fourth-best candidate 2 times.
In the 512 trials, there were only two contests which ended in a tie.
Discrepancy Noted
In my 512-season simulation there was not a single case of a unanimous MVP selection with 24 ballots. In the real world, there have been 18 unanimous selections, although I think that count includes a couple of Awards that predate the BBWAA taking over the vote in 1931. I believe that most of those 18 unanimous selections were with less than 24 votes, and, as you get more voters, you get less chance of unanimity, but still, it’s a pretty significant discrepancy between the model and the real world.
There would seem to be three possible sources for the discrepancy. First, it could result from Groupthink, from people all agreeing on something that isn’t necessarily true. I could build this into the model by creating systematic bias—that is, having all of the "voters" or most of the voters agree on some value that isn’t actually there.
Probably some of the unanimous selections did result from Groupthink bias. In 1967, for example, Orlando Cepeda was a unanimous MVP selection in the National League, albeit with only 20 voters, but still, that seems like a Groupthink selection. I’m not really certain that Cepeda was the MVP at all, now that you mention it. He was 5th in the league in WAR, but he led the league in RBI, which was a huge deal at that time, and his team won the pennant after two seasons very near .500, which may have unduly influenced some voters, and the player who perhaps should have been the MVP, Roberto Clemente, had won the Award the previous season, which probably discouraged some voters from voting for him again. There have been other unanimous MVP selections which seem to me to have perhaps been the result of collective bias, and also there are other telling details all over the study which suggest that there is some Groupthink that influences the voting.
However, there have also been cases in which players won the Award unanimously, and it would seem like they should indeed have done so. Al Rosen in 1953, or Mike Schmidt in 1980; it seems like you would have to be pretty dense to miss the fact that this was the best player in the league.
Second, it could be in some cases more obvious who is the deserving MVP than my model has allowed for. We could create this "occasional separation from the pack" by adding another random element to the value model, thus occasionally allowing one player to separate himself by a wider margin.
Third, it could be that perceptual error in real life is less than it is in my model. In my model I allowed each voter’s perception of each candidate to be 15 points better than the player’s actual ability, or 15 points worse, as a theoretical maximum. It would be a simple matter to change that to 14 points, or 10 points; in other words, to reduce the perceptual error of each voter.
I estimated that, using this voting structure, the voters would get the right result 87% of the time. The key question here is whether this discrepancy indicates that the actual vote is MORE accurate than I have estimated—that is, that the voters get the answer right more than 87% of the time—or whether it indicates that they are less accurate. If the discrepancies result from Groupthink in the voting, then the real-life voting is probably less accurate than 87%. If, on the other hand, the right MVP stands out from the group sometimes more than my model believes, or if the relative perceptual error is less than I have built into the model, then the real-life voting would probably be more than 87% accurate.
I’m not going to re-run these studies to try to resolve the issue, because (a) these studies represent more than a week’s work, and I don’t have another week to put into this project, and (b) I don’t really know which direction to go in reconstructing the model—building in Groupthink bias, reducing the Perceptual Error, or creating a feature which would occasionally allow one player to stand out from the group by a wider margin.
Also, the real goal here is not to build a perfect model; it is, rather, to understand how different variables affect the accuracy of the voting. Does increasing the number of voters meaningfully increase the accuracy of the voting? Does using a 14-9-8-7 system rather than a 10-9-8-7 system actually improve the accuracy of the selection? That’s really what I am trying to get to. The answers to those questions are probably the same regardless of what causes this discrepancy.
The 1938 Model
So let’s get to that question: does using the 14-9-8-7-6-5-4-3-2-1 voting weight, rather than a 10-9-8-7-6-5-4-3-2-1 system, actually increase the reliability of the system for identifying the most deserving MVP?
It does not.
I will call this the 1938 Model, which is not intended in any way to suggest that this is an outdated or antique model, like a 1938 Ford or something; that’s not what I am saying. From 1931 to 1937 the BBWAA used an 8-person panel and weighted votes by the 10-9-8-7-6-5-4-3-2-1 method; in 1938 they switched to a 24-person panel and to the 14-9-8-7-6-5-4-3-2-1 system. That exact model has been used essentially ever since; the number of voters has varied from as low as 20 to as high as 32, but it is essentially the same system.
If you have read my stuff over the years, you know that I have generally spoken well of this system. I have always described it as an intelligently designed system which generally does an excellent job of finding the right MVP. That’s the conclusion here, as well: this system generally works.
And one can understand why the change from 10-9-8-7 to 14-9-8-7 was made. The BBWAA members at that time were saying that, while they wanted to know who the voters thought all of the best players in the league were, there should be a special emphasis on knowing who the voters thought was the BEST player in the league, the #1 guy. It was sort of like what my friend Jerry was saying, back at the start of the article: that the only thing that should REALLY matter was who is the number one man? Intuitively, it makes sense.
But mathematically, it doesn’t make sense, and mathematically, it doesn’t really work. Mathematically, what you are doing by giving 14 points for a first-place vote, rather than 10, is arbitrarily giving additional weight to a distinction which there is no reason to believe is especially reliable. This doesn’t cause the system to work better; it actually causes it to work slightly worse.
Well, I don’t want to overstate that; overstating it will cause confusion. There actually is a mathematical reason to give extra weight to the #1 selection. Given an array of player values in a competitive environment, it is likely that the difference between the #1 player and the #2 player is greater than the difference between the #2 player and the #3 player. It is virtually certain that the difference between the #1 player and the #2 player would be larger than the difference between the #65 player and the #66 player. If you studied WAR, for example, or Win Shares, you would certainly find that the difference between the #1 player in the league and the #2 player in the league was, on average, much greater than the difference between #2 and #3. This difference would, in fact, justify a mathematical model which places extra weight on who the voters perceive as being #1 man.
But the key question is "how much"? How much extra weight?
The 4 extra points for the first-place vote are probably way too much. Even giving one extra point to the first-place vote—an 11-9-8-7 system—would probably be too much. Giving 4 extra points is CERTAINLY too much.
The thing is, there are cases, like Al Rosen in 1953 or Mookie Betts in 2018, where one player is pretty obviously better than everybody else. But when one player is far better than everybody else, the voters are going to see that, anyway. The system does not benefit from those 4 extra points for the first-place vote, because that kind of player will usually win the award without the extra help.
In the simulation study, rather than re-running the data for all 512 simulated seasons to compare the 14-point system to the 10-point system, I simply took out all of the 10-point votes, and replaced them all with 14-point votes. This saved me several hours of work, but it also seemed like the more appropriate way to do it, because it creates a more direct comparison between the systems.
When I made this change, there were 17 cases (in the 512 simulated votes) in which adding the 4 points to the first-place votes changed the award recipient from the wrong selection to the right one. The problem is, there were 20 cases in which it changed the award winner from the right player to the wrong one. The net effect was to reduce the number of awards going to the right man from 444 out of 512 to 441 out of 512. It reduced the accuracy of the voting system from 87% to 86%. The 4 extra points simply add emphasis to some random perceptual error.
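The rescoring comparison described above can be sketched with a small tally function. Everything below is hypothetical: players "A" through "E" and the ballot counts are invented for illustration, but they show how the same set of ballots can crown different winners under the two weightings.

```python
def tally(ballots, first_place_points):
    """Score ballots Borda-style: rank 1 gets first_place_points,
    ranks 2 through 10 get 9, 8, 7, ..., 1."""
    weights = [first_place_points] + list(range(9, 0, -1))
    totals = {}
    for ballot in ballots:
        for rank, player in enumerate(ballot):
            totals[player] = totals.get(player, 0) + weights[rank]
    return totals

# A hypothetical 24-ballot vote: "A" draws 10 first-place votes but sits
# lower on the other ballots, while "B" is first on only 4 ballots but
# second or better on nearly every other one.
ballots = ([["A", "B", "C", "D", "E"]] * 10 +
           [["B", "A", "C", "D", "E"]] * 4 +
           [["C", "B", "A", "D", "E"]] * 4 +
           [["D", "B", "A", "C", "E"]] * 3 +
           [["E", "B", "A", "C", "D"]] * 3)

t10 = tally(ballots, 10)   # B edges A, 220 to 216
t14 = tally(ballots, 14)   # A jumps ahead, 256 to 236
```

The 4-point bonus hands "A" 40 extra points (10 first-place votes times 4 points) while "B" picks up only 16, flipping the result even though most voters preferred "B".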
I could use one of those cases to illustrate the point, but that would involve making a lot of statements like "In Simulated Season number 378, player 51 had an actual value of 83.421, whereas Player 74 had an actual value of 83.247. However, voter number 16 had a perceptual error of. . .. " You get why that’s not helpful.
Instead, I would suggest that you look at a real-life case, which is the National League MVP vote in 1979. The Pittsburgh Pirates won the National League East in 1979 with what was then a very fun team to watch, although the story soured a couple of years later. The ’79 Pirates adopted Sister Sledge’s "We Are Family" as their theme song, wore very cool retro hats, played aggressive, exciting baseball, and won 98 games, taking the NL East by three games.
Willie Stargell at that time was very much like David Ortiz toward the end of David’s career. He was a greatly respected veteran leader in the clubhouse, and a beloved Old Hand by the public, just as David was, and also, although he was old and slow and couldn’t really play the field, he could still mash. He was a left-handed power bat, and a good one. He had a tremendously quick bat, even quicker than Ortiz, and he had formidable strength in his wrists, which enabled him to propel a heavy bat at a high rate of speed and still make contact with the ball.
He had terrible wheels, however, and, because the National League had no DH rule, and also because the National League at that time had a LOT of primitive artificial turf that was as hard as cement and played hell with Stargell’s aching feet, he played in only 126 games, 16 of those as a pinch hitter. He had only 480 plate appearances, no defensive value at all. He hit well, .281 with 32 homers, but he was not the best hitter in the league, by a pretty good margin. He had been the best hitter in the league 1973-1974, but by 1979 he wasn’t really close to that.
He was, however, the emotional center of the championship team, and a FUN championship team, at that. He was "Pops"—again parallel to Ortiz, who was "Papi". In September, he delivered a few game-breaking big hits down the stretch. When you look at the factual record, though, it wasn’t all that huge of a deal; you’re really talking about just five big hits, in games on September 1, 5, 11, 18 and 25. Not that those five hits were not important, but Stargell hit just .222 in September with only 18 RBI, hardly phenomenal numbers. The Pirates, six games ahead on September 1, wound up winning the pennant by three games. Still, a narrative started to develop that Pops was The Guy on this team; he was the guy delivering big, game-breaking hits day after day after day as his team drove to the pennant with a September surge.
It was, to be blunt, kind of a bullshit narrative. Stargell’s WAR for the season was only 2.5, while four players were over 7.5, which I will grant you is not a precisely accurate comparison, either; it may not give Stargell enough credit for his big hits in September, and it gives him no credit at all for his leadership.
Still, Stargell was really NOT the best player in the league, and the majority of the MVP voters knew that. Stargell received ten first-place votes, with the other 14 going to players who had more WAR, most of them to players with three times as much WAR, but split among those players, with no one player getting more than four first-place votes. Stargell finished well down the ballot on the other 14 ballots.
Had the votes been counted on a 10-9-8-7 basis, Stargell would have finished a distant second in the MVP voting. But given an extra 40 points by the 14-9-8-7 weighting, Stargell wound up in a tie for first place—the only tied vote in the history of the award—and wound up with a half-share of the MVP Award.
It’s one case, of course, but I think it illustrates why the 14 points for the first place ballot is not actually helpful in identifying the true Most Valuable Player. Emphasizing the first-place selection gives additional weight to a distinction in the mind, rather than to a distinction on the field. There is narrative value—that is, a story which explains why this player is the Most Valuable—and there is production value, which is imperfectly measured by mathematical tools. The 4-point bonus for a first-place vote gives weight to an excited minority of the voters who have convinced one another of a narrative which selects certain facts as the "important" facts, but gives no weight to all of the other boring facts, those boring home runs in July and those boring doubles and triples and all that boring defensive play.
The Thirty-Man Panel
Excuse Me, the Thirty-Person Panel
In modern baseball, of course, we use a 30-person voting group, two representatives from each team. This brings up the next question: Are 30 voters meaningfully more likely to get the answer right than 24 voters?
It depends on how you define "meaningfully", but yes, 30 voters are more accurate than 24. Using the 10-9-8-7 weighting system for ballots, the 24-person panel got the "right" answer 444 times in 512 trials, with the second-best candidate winning 55 of the other 68. Using the same system but with 30 voters, the voters got the right answer in 449 of 512 trials, with the second-best candidate winning 56 of the other 63. The number of awards going to players who should have finished third or lower dropped from 13 to 7, and the "correct decision" percentage increased from 87% to 88%. And the change is a little more meaningful than that one point suggests, because the net effect is not merely to transfer awards from the second-most deserving candidate to the best candidate. Some awards go from the third-most deserving candidate to the second-most, some from the second-most deserving to the most deserving, so the net effect is to move awards from the third-best candidate to the best.
That is using the 10-9-8-7 ballot. Using the 14-9-8-7 ballot which is actually used, the larger voting panel (30 as opposed to 24) increases the number of correct choices from 441 to 447 (from 86% to 87%), and also decreases the number of selections going to the third-best candidate from 13 to 9.
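The kind of comparison described above can be reproduced in miniature. The sketch below is NOT the author's model; the player count, value distribution, and error size are invented stand-ins. But the structure follows the description in the text: each voter ranks players by true value plus a private perceptual error, ballots are tallied 10-9-8-7, and we count how often the truly best player wins.

```python
import random

random.seed(3)

def simulate(n_voters, n_seasons=1500, n_players=40,
             error_sd=15.0, first_place_points=10):
    """Toy re-creation of the study: each simulated season has one truly
    best player; each voter ranks the players by true value plus an
    independent perceptual error; ballots are tallied Borda-style; we
    count how often the truly best player wins the vote."""
    weights = [first_place_points] + list(range(9, 0, -1))
    correct = 0
    for _ in range(n_seasons):
        true_values = [random.gauss(100, 10) for _ in range(n_players)]
        best = max(range(n_players), key=lambda p: true_values[p])
        totals = [0] * n_players
        for _ in range(n_voters):
            # Each voter sees every player's value through private noise
            perceived = [(true_values[p] + random.gauss(0, error_sd), p)
                         for p in range(n_players)]
            ballot = [p for _, p in sorted(perceived, reverse=True)[:10]]
            for rank, p in enumerate(ballot):
                totals[p] += weights[rank]
        if max(range(n_players), key=lambda p: totals[p]) == best:
            correct += 1
    return correct / n_seasons

accuracy_24 = simulate(24)
accuracy_45 = simulate(45)
```

With these made-up parameters the absolute percentages will not match the article's figures, but the direction, larger panels getting the right answer more often, should hold.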
If you think about it from a certain perspective, it becomes logically obvious that increasing the voter panel would increase the predictability of the outcome. If you think about MVP votes simply in terms of "preference"—I think it is this guy, you think it is that guy, one opinion is as good as another—then it does not seem that increasing the voting panel has much effect; you would still have differences of opinion if you had 100 voters.
But if you assume that there IS one player who is more valuable than any other player—an assumption which I believe is necessarily implicit in voting for a Most Valuable Player—then the votes for other players are not merely differences of opinion, but errors. If you think about it that way, it’s obvious that increasing the voting panel increases the accuracy of the outcome. There MUST be observational errors, right? Otherwise everybody would see who the Most Valuable Player actually was. If there were no observational error, you’d just need one voter to decide the thing. The reason you need a larger voting panel is to balance out the observational error. The "preference" does not reside in the ACTUAL value; it resides in the observational error.
In mathematical terms, suppose that one player has a value of 91 Whatsis and the other player has a value of 90 Whatsis. The player who has 91 Whatsis will win the vote unless the sum of the observational errors favors the lesser player by at least 1 per voter. Assuming that Observational Errors are a random variable centering at zero, then the more voters are involved, the less chance there is that the average of the Observational Errors is larger than the difference in value between the two players.
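That claim is easy to check numerically. In this sketch the 1-Whatsis gap and the 15-point error size come from the article's own numbers; the assumption that errors are independent and normally distributed around zero is mine:

```python
import random

random.seed(7)

def upset_rate(n_voters, value_gap=1.0, error_sd=15.0, trials=20000):
    """Estimate the chance that the average of the voters' observational
    errors favors the lesser player by more than the true value gap."""
    upsets = 0
    for _ in range(trials):
        mean_error = sum(random.gauss(0, error_sd)
                         for _ in range(n_voters)) / n_voters
        if mean_error > value_gap:
            upsets += 1
    return upsets / trials

# More voters shrink the spread of the averaged error, so a 1-Whatsis
# edge in true value survives the vote more often.
rates = [upset_rate(n) for n in (24, 30, 45)]
```

The standard deviation of the averaged error falls with the square root of the number of voters, which is why each additional voter helps, but with diminishing returns.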
What If You Added a Third Voter Per Team?
What if we added a third voter per team, increasing the voter panel from 30 to 45 voters? What difference would that make?
It would make a quite significant difference. Using a 10-9-8-7 ballot, a 30-voter panel got the right answer 449 times in 512 trials, or 88%, as stated before. A 45-person panel got the right answer 463 times in 512 trials, or just over 90%.
Using the 14-9-8-7 weighting system which is actually used, the number of correct decisions increased from 447 out of 512 to 462 out of 512, which, again, is 90%.
An increase in reliability from 86 or 87% up to 90% may not seem to be a big deal, but if you focus instead on the number of incorrect votes, it seems much larger. Using the actual voting structure, the number of expected incorrect voting results in 512 trials drops from 65 to 50—a 23% decrease. That’s quite significant, in my opinion.
But What About. . .
The only argument that I can think of against adding a third voter for each team would be that the third voter might be less well-informed than the previous two, and thus might have a larger range of error.
It seems to me that this is tremendously unlikely. The modern world, compared to the world of 1960, is vastly better at creating and distributing information. In the 1960s, maybe the voter didn’t know that much about the other teams unless he traveled with one team. In the modern world, many of us have the MLB-TV package that enables us to watch a very large number of games. I’ll bet I saw 50 Oakland A’s games this season. In the 1970s, Peter Gammons made himself a national institution by, among other things, getting on the telephone and sharing information with beat writers around the country. In the modern world, information of that type is shared seamlessly with people who would never have qualified for access to detailed information in the pre-internet universe. I don’t think that there is a shortage of qualified voters, frankly.
Nor do I believe that it’s a highly relevant issue. One of the things that could be done with studies of this type would be to vary the observational error, to see how the conclusions change with different levels of observational error. In other words, if the present method is 87% accurate if we assume that the observational error is potentially 15 points per player, what would the accuracy be if we assumed that the potential observational error was 10 points per player, or 20 points per player?
I have not done THOSE studies, but I’ll tell you what I think. I don’t think it would make a great deal of difference. My belief, which I must have published 5,000 times over the years, is that the external world is vastly more complicated than the human mind, billions of times more complicated, and, because this is true, everyone’s understanding of anything and everything is unreliable.
People think that getting "qualified" voters is the key to getting a good result, but is it really? I doubt it. I don’t believe it, because I don’t believe that anyone actually understands the world. We approach understanding only by working together; that is the foundation of science. For that reason, I believe that a 45-person panel would do a substantially better job of producing an accurate MVP vote than a 30-person panel.
Thank you for reading.