2017-60
MVP Followup
General and Philosophical
I enjoyed the Twitter discussion which followed the Judge and Altuve article posted a few days ago, and I appreciate all of your thoughts. One of the problems with general-participation discussions is that they tend to go everywhere. To make progress in thinking about an issue you have to stay focused on that issue. Analysis is largely a process of taking an unmanageable concept and breaking it down into smaller, more manageable issues. People have been arguing about "Who should be the MVP?" for 100 years; that’s not analysis. It becomes analysis when you break it down into smaller issues which are closer to the size that your mind can deal with, and then try to find a compelling logic to resolve the smaller issues. How many runs did each player create? How many runs did he prevent with his defense? How many wins resulted from those runs? How did the parks affect their batting stats? What weight should we give to the fact that the player’s team won the pennant? Should we give credit for leadership? Does character count?
Each question breaks down into a series of smaller questions. "How many runs did each player create?" breaks down into what is the run value of a double, what is the run value of a homer, etc. "Does character count?" breaks down into "What are the elements of character?", which becomes "What is the practical value of courage?", "What is the practical value of good work habits?", "What is the practical value of holding your teammates accountable?", "What is the practical value of sensitivity to the needs of others?", "What is the practical value of honesty?", etc. Every question works down toward smaller questions until you reach the point at which the questions actually have answers.
In public debate the opposite happens. People try relentlessly to inject different and larger questions into the debate, which prevents the debate from focusing on the smaller issues which you need to resolve. You’re trying to talk about Judge vs. Altuve, and people will form a circle around you and shout at you "What about Mike Trout? WHAT ABOUT MIKE TROUT? YOU HAVEN’T TALKED ABOUT MIKE TROUT!!! ISN’T MIKE TROUT ACTUALLY BETTER THAN EITHER ONE? WHAT ABOUT JOSE RAMIREZ? ISN’T JOSE RAMIREZ REALLY ABOUT THE SAME AS JOSE ALTUVE? SHOULDN’T YOU BE USING WIN PROBABILITY ADDED INSTEAD OF THE METHODS YOU ARE USING? HOW DOES WHAT YOU ARE SAYING DEAL WITH WPA? WHY AREN’T YOU USING WPA? WHY DIDN’T YOU TALK ABOUT THE NUMBER OF OUTS THAT JOSE ALTUVE MADE?" If you respond to all of that stuff then the discussion wanders all over the map, and you can never do any actual analysis. If you want to do actual analysis you have to learn to ignore the chatter and stay focused on what you are trying to understand. But I enjoy talking to you all, and I’ll try to comment superficially on some of the other stuff people are worrying about. My thoughts.
WPA
Win Probability is a method of measuring the state-to-state changes in each game, and attributing them to some player. You’re in the top of the 8th, one out, trailing 4-2, two men on base, your team’s chance of winning the game is 18% (.1845). The batter hits a home run, the team’s chance of winning goes to 73% (.7301). The home run improves their probability of winning the game by .5456, so we credit the batter with +.5456, and the pitcher with negative .5456.
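As a rough sketch of the bookkeeping in that example, the credit is just the difference between the two win-probability states, with the batter and pitcher getting equal and opposite shares. The function name `wpa_credit` is made up for illustration, and the two-way batter/pitcher split is the simple attribution described above, not a claim about how any published WPA system is implemented.

```python
def wpa_credit(wp_before, wp_after):
    """Return (batter_credit, pitcher_credit) for one plate appearance.

    wp_before / wp_after are the batting team's win probabilities
    before and after the play. Everything is charged to the batter
    and the pitcher; fielders and runners get nothing (an assumption
    of this simple sketch).
    """
    delta = wp_after - wp_before
    return delta, -delta


# The three-run homer from the text: .1845 before, .7301 after.
batter, pitcher = wpa_credit(0.1845, 0.7301)
print(f"batter {batter:+.4f}, pitcher {pitcher:+.4f}")
# prints "batter +0.5456, pitcher -0.5456"
```

The arithmetic is the easy part. The hard part is deciding who gets the delta when the play is not a home run.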
I remember John Dewan proposing what we would now call a Win Probability Added system about 30 years ago; not sure whether that was the first time I had heard the idea or not. I think Pete Palmer may have proposed something similar before then, but who knows; it’s been a long time. I remember John’s proposal because John was irritated that I didn’t much like the idea.
I’m not saying that that approach does not have merit or that that research should not be done. Like any other approach, it has its problems. Specifically it has two problems, or three problems, which are 1, 2a and 2b. The first problem is the problem of attribution. The example I gave you before was easy; the batter hits a three-run homer, you can’t blame anybody but the pitcher. But take a different example growing out of the same situation. The batter doesn’t hit a home run; he grounds the ball down the first base line. The first baseman should stop the ball and make a play at first, but just as the ball is delivered the runner from second fakes a move to third. The first baseman checks to see whether his runner is moving, and the ball gets down the line. The right fielder should cut the ball off and hold it to a single, most right fielders would, but it’s Larry Parrish or Gary Sheffield or Nick Markakis or Carlos Gonzalez or somebody; he has a good arm but he doesn’t move that well, so the ball scoots into the right field corner, where Markakis retrieves it and looks toward home plate. At first he is going to concede the run and throw to third, but as he starts to throw he sees that, because his team was in a shift and the shortstop has gone to cover second, there is no one in position to take the throw at third, so he throws home instead. The throw home is too late, but the throw to the plate enables the batter/runner to make it to third base. The score is tied, runner on third, one out, the visiting team has a 60.67% chance to win the game.
It is easy enough to measure the static change (the change in the win probability states)—but who is responsible for that? Is the pitcher responsible for all of the negative change of state, or is it the first baseman, or the right fielder, or the shortstop? Is the batter responsible for all of the positive change of state, or is it the runner from second, who faked the move to third, or the runner from first, who was able to score on a ball that another runner might not have scored on?
If it’s a three true outcomes play you can make a clean attribution, but if it isn’t, you can’t. If it’s a play from the distant past, Keith Hernandez hitting the ball into the corner and Larry Parrish slow to retrieve it, then all the information you would have is "double to right, 1-H, 2-H, B-3 on throw." If it’s a modern play then you have much more information; you can get some estimate of the probability that the play would be made by the first baseman, etc. In the 1980s, when John Dewan and I first debated this system, we didn’t have that.
The problem is that a system like this leaves you in a much poorer position to evaluate the responsibility of each fielder than even the final fielding stats would—and the old-style fielding stats were not good. I think the current WPA systems actually just nakedly ignore fielding and base running and attribute everything to the hitter or the pitcher, which is a really terrible way of doing it, but we have to assume that better systems will evolve over time.
Problem 2a. In general, I do not like and I do not have much faith in, systems which start everybody out in the middle and move them up or down from .500. The fielding systems that we use now basically do that; they start everybody out in the middle and move everybody up or down based on how they compare to the average. I’ve never liked that.
It doesn’t describe the real world. You don’t start out at average and move up or down. You start out at zero and build upward. If you have two shortstops in the same league, one of whom is a backup who plays 50 innings and is +1 and the other of whom plays 1,200 innings and is -20, it is very likely that the one who is -20 is in reality the better shortstop, because if he wasn’t, he wouldn’t be playing 1,200 innings at shortstop.
Problem 2b is a manifestation of Problem 2a. In the last week I don’t know how many people have told me that WPA is the ultimate measure of value and we should be using that to identify the MVP. WPA is a very poor measure of value, and not of much use as an MVP indicator. One problem is that, because it starts everybody out in the middle and measures up and down movements, it does not measure the value of being average. MOST value in baseball is in being average. 80 to 90% of value in baseball is the value of being average or less than average. This is true because the difference between two major league players is a lot less than the difference between a major league player and a man off the street.
Look, let’s suppose that the replacement level is .290 or whatever people say that it is. Let us suppose that you have a .510 player; that is, a player who is just a little bit above average. His value is .220 times his playing time, right--.510, minus .290, times his playing time. But his ABOVE AVERAGE value is .010 times his playing time. A little more than 95% of his value is in just being an average major league player. Only a tiny bit of it is measured by his margin above average.
Suppose that you have an array of 10 players—a .650 player, a .600 player, a .560 player, a .530 player, a .515 player, a .500 player, a .450 player, a .420 player, a .390 player and a .385 player. On average, they’re average.
How much of their value is simply being average, or being less than average but better than .290?
The .650 player has a margin of .360, of which .150 is created by being better than average, so that’s 42%. The .600 player has a margin of .310, of which .100 is created by being better than average. That’s 32%. For the ten batters as a whole, they have an aggregate value of 2.1, of which .355 is created by being better than average, or 17%. 83% of the value of the group of players is created by being better than the replacement level, but not better than average. WPA provides us no way to get a handle on the value of being average.
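The arithmetic for the ten-player array can be checked with a few lines. This is a minimal sketch using the .290 replacement level and .500 average assumed in the text; the names `REPLACEMENT`, `AVERAGE` and `value_split` are invented for the example, and playing time is treated as equal for every player.

```python
REPLACEMENT = 0.290  # replacement level assumed in the text
AVERAGE = 0.500      # league-average level


def value_split(levels):
    """Return (total value over replacement, portion earned above average)."""
    total = sum(p - REPLACEMENT for p in levels)
    above_average = sum(max(p - AVERAGE, 0.0) for p in levels)
    return total, above_average


players = [0.650, 0.600, 0.560, 0.530, 0.515,
           0.500, 0.450, 0.420, 0.390, 0.385]

total, above_average = value_split(players)
share_above = above_average / total

print(f"total value over replacement: {total:.3f}")    # 2.100
print(f"value from being above average: {above_average:.3f}")  # 0.355
print(f"share above average: {share_above:.0%}")       # 17%
print(f"share at or below average: {1 - share_above:.0%}")     # 83%

# The single .510 player discussed earlier: .220 total, .010 above average.
one_total, one_above = value_split([0.510])
print(f"{1 - one_above / one_total:.1%} of his value is 'average' value")
```

A system that only counts movement up or down from .500 sees the 0.355 and the matching negatives, and nothing else.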
In order to convert WPA into a passable imitation of an MVP ballot, you have to do several things. First, you have to figure some way to weight playing time so that each player gets credit for the value of being average. Second, you have to convert the WPA and the "playing time score" so that they are on the same scale, so that you can add them together. Third, you have to find some way to give each player credit for his defense. You have to figure out how much credit to give an average defensive shortstop as opposed to an average defensive first baseman. Fourth, you have to deal with base running. Then you get into the tricky stuff, the stuff that we don’t know the real answers to—does a player deserve extra credit for leadership, or for playing for a pennant-winning team.
I am not opposed to incorporating WPA into a comprehensive value system. It could be done well; it could be done poorly. But I think, in general, that. . .well, there is a steep side of the mountain, and there is a less steep slope on the other side. I think this is the steep side of the mountain. In general, you drive up the less steep slope of the mountain. This is driving up the steep side. It is easier to get where you’re trying to go if you come at it from a different angle.
Mike Trout
So why didn’t I discuss Mike Trout, when I was arguing with the world about whether Judge was on the same level as Jose Altuve?
Because he was never a real candidate for the Award. It was obvious that either Judge or Altuve was going to win the award. Why would I waste my time worrying about somebody who had no chance to win the award?
So do I believe that Mike Trout should have been an MVP candidate?
No, I do not. He missed 30% of the season. You have to discount his value for that, and it’s not a small discount; it’s a big discount. He’s a marginal candidate at best.
Home/Road Stats
Charlie Blackmon finished fifth in the National League MVP voting. I think he should have done better, maybe. Blackmon hit .391 with a 1.239 OPS in Colorado, but .276 with a .784 OPS on the road. The voters knew that, and they discounted Blackmon as an MVP candidate in part because of that.
This is the way I see it. Voters know that Blackmon’s stats need to be discounted because he played in Colorado, but they have no clear idea of how much to discount them relative to this stat or that one. That being the case, some voters discounted his stats by a grossly inappropriate percentage, which was influenced by Blackmon’s individual home/road stats.
I know that this is a confusing issue. It’s not intuitively obvious what set of numbers you should focus on. But the example that helped me to get it clear in my mind was Bill Dickey vs. Elston Howard.
Elston Howard and Bill Dickey were both Yankee catchers, of course, both tremendous players. Elston Howard was a 12-time All Star and won an MVP Award but is not in the Hall of Fame; Bill Dickey is in the Hall of Fame but did not win an MVP Award. Both played in Yankee Stadium, and the effects of Yankee Stadium were essentially the same in Howard’s era as they were in Dickey’s. The Park Factors for Yankee Stadium from 1930 to 1935 were 80, 95, 83, 81, 84, and 79. From 1959 to 1964 (Howard’s best years) they were 81, 83, 85, 84, 96 and 100. It’s a matched set.
The thing is, though, that Yankee Stadium was great for Bill Dickey, a left-handed hitter, whereas it was absolutely turabull for Elston Howard. Dickey’s home-field advantage in some seasons was bigger than Blackmon’s. In 1935 Dickey hit .303 with 11 homers at home, but .257 with 3 homers on the road. In 1937 he hit 21 homers at home, 8 on the road. In 1939 he hit .357 with 23 homers, 84 RBI in Yankee Stadium, but .274 with 4 homers, 32 RBI on the road.
Elston Howard, on the other eyeball. In 1959 he hit 68 points higher on the road, with 13 of his 18 homers on the road. 21 RBI at home, 52 on the road. In 1962 he hit 18 of his 21 homers on the road. In 1964 he hit 65 points higher on the road, with 12 of his 15 homers on the road.
If Howard had been a left-handed hitter and Dickey a right-handed pull hitter like Howard, Elston would be in the Hall of Fame and Dickey would not—but that didn’t happen. Should we, then, adjust Howard’s numbers upward, because the park hurt him, but adjust Dickey’s numbers downward?
But you can’t do that, because the rock that you stand upon when analyzing stats is wins. The thing is that real and permanent wins resulted from Dickey’s good fortune, while real and permanent losses resulted from Howard’s tough luck. Because the park helped him, Dickey created more runs than Howard did, and more runs relative to the offensive context. That made him a more successful player. It’s luck, yes, but we don’t adjust luck out of existence, because you can’t adjust luck out of existence. If you adjust Howard upward and Dickey down, what you are in essence saying is that in a neutral park, Howard would have been better and Dickey would have been worse. Analysis is not about what would have happened in a different world. It is about the value of each player in the real world. Because Dickey and Howard created runs in the same park, the park adjustments that apply to them are the same for both players.
It is the same as dollars and purchasing power. What is relevant is the ratio between the dollars you have to spend and the cost of living in the place where you have to spend it. Blackmon created about 50% more runs in Colorado than he did on the road. It may be that Johnny Blackmon, as an individual worker, can earn $100,000 a year in St. Louis but $150,000 a year in Colorado, because his individual skills are more in demand in the Colorado area than they are in the St. Louis area; insert pot farming joke here. But that doesn’t mean that Blackmon has an "individual cost of living" which is 50% higher in Colorado than it is in St. Louis. What it means is that he has more purchasing power in Colorado than he does in St. Louis.
Runs are used to purchase wins in a similar manner to how dollars are used to purchase groceries and pay rent. What is relevant is "How many runs do you have to work with?" and "What is the cost of a win, in terms of runs?" The cost of a win, in terms of runs, IS higher in Colorado than it is in an average NL park—about 33% higher, in 2017. You play half of your games at home, half on the road, so the appropriate discount for Blackmon’s stats is about 16%--not 50%.
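Here is a minimal sketch of that halving, assuming (as the paragraph does) that road games are played in an average run environment. The function name `blended_run_cost_premium` is hypothetical, and the 0.33 is the 2017 figure quoted above, not something computed here.

```python
def blended_run_cost_premium(home_premium, home_share=0.5):
    """Season-long premium on the run cost of a win.

    home_premium: how much more a win costs, in runs, in the home park
                  than in an average park (0.33 means 33% more).
    home_share:   fraction of games played at home.
    Road games are assumed to carry no premium, which is a
    simplification made for this sketch.
    """
    road_premium = 0.0
    return home_share * home_premium + (1 - home_share) * road_premium


premium = blended_run_cost_premium(0.33)
print(f"{premium:.1%}")  # prints "16.5%"
```

The park-wide premium is the only input. The individual player's own home/road split does not appear anywhere in the calculation, which is the point.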
The thing is that Blackmon hit .391 in Coors Field, but other players did not. If everybody hit .391 in Coors Field, or if everybody hit 115 points higher in Coors Field than they did on the road, then we would discount Blackmon’s performance at that rate. If everybody hit 60 points higher in Coors Field, you could mostly discount it. But in fact hitters as a whole hit only 34 points higher in Colorado than in Rockies’ road games.
The Garvey/Votto Rule
I was wrong about something, in the 1970s and 1980s; I was wrong about a lot of things. One in particular was this.
Steve Garvey in 1974 hit .312 with 21 homers, 111 RBI and 200 hits, and won the MVP Award. Then he did basically the exact same thing again every year until 1980, but he never won the MVP Award again. He finished 11th in 1975, 7th in 1976, 6th in 1977, a distant second in 1978 (no first-place votes), 15th in 1979, and 6th in 1980.
Another example is Willie Mays in 1954. Mays in 1954 hit .345 with 41 homers, 110 RBI, and won the MVP Award. It was his first great season. Actually, he was just as great as that pretty much every year after that, but he didn’t win another MVP Award until 1965.
Hank Aaron hit .322 with 44 homers in 1957, won the MVP Award. How many times did Hank Aaron have that season? He was better than that in 1959, and just as good as he had been in 1957 in 1960, 1961, 1963 and 1969, but he never won another MVP Award.
Based on these examples and others, I developed what I called the Steve Garvey rule. A player is more likely to win the MVP Award in his FIRST great season than he is in a subsequent season which is just as great. The reason for this, I thought, is that that which is moving attracts the eye. A stationary object does not attract attention. A moving object does. If a player does something every year, he becomes a stationary object. People stop noticing.
OK, I was wrong about that; not entirely wrong, but more wrong than right. There are two factors pulling in opposite directions, that one and a generalization asserting that a great player builds respect over time, and people start to feel that it is his turn to win. On balance, the larger of the two effects is the second one. Overall, players gain strength in the MVP voting when they repeat an outstanding season, more than they lose strength.
Nellie Fox in 1959 hit .306 with 2 homers, 191 hits, which really was just his normal season. He won the MVP Award in 1959 because (a) his team won the pennant and (b) people felt that it was his time. Barry Larkin wasn’t really any better in 1995 than he was in several earlier seasons, but by 1995 people had decided that it was his time, and he deserved a break.
Joey Votto was the same player in 2017 that he has been since 2010, but in 2017 he ALMOST won the MVP Award. It’s the Nellie Fox effect. People had decided that it was his turn—not quite enough people, but almost enough.
The Pennant Winners’ Bonus
The Altuve/Judge article ends with these two sentences:
What creates value for a baseball player is winning games. You cannot discard that principle, and have a valid analysis.
If I have a problem with that phrase, it is with the word "games". What creates value for a baseball player is winning. But do we truly believe that all wins are created equal? Let us say that one team, who we will call the 2014 San Francisco Giants, wins 88 games and goes on to win the World Series. Another team, who we will call the 2012 Angels, wins 89 games but does not qualify for the post-season. Must we give more win credit to the 2012 Angels than to the 2014 Giants? The purpose is not to win games; it is to win pennants. Should there not be a recognition of this, in determining a player’s value?
One COULD set up the Win Shares system in this way: that rather than crediting the team with three win shares for each win, we could credit a team with three win shares for each win, plus some number of "bonus" win shares if they make the post-season, and some larger number of bonus win shares if they win their division, thus are in a more advantageous position in the post-season.
If we were to do that, I believe that that would cause the "Win Shares MVP" to match up better with the elected MVP. To return to the examples from earlier in the article, Garvey in 1974, Mays in 1954, Aaron in 1957, and Fox in 1959 won the MVP Award not because the players individually had better seasons, but because they received extra credit for the success of the team. Is this not actually rational?
One of the things that research is about is the search for a compelling logic. We are always trying to find an argument about some small issue which is so compelling that (a) there appears to be no way to escape that conclusion, or (b) we might reasonably expect that a consensus would develop in support of that conclusion.
I might almost argue that this is a compelling line of logic:
1) Value is created not merely by winning games, but also by winning pennants or post-season position, therefore
2) In assessing each player’s value, we will include some credit for winning enough games to qualify for post-season play, rather than simply straight-line credit for winning games.
So far, so good. But here’s where we lose it. How much extra credit should we give?
I would go for one extra game (three extra win shares) for qualifying for post-season play, and two extra games (six extra win shares) for winning the division. I might go one more game than that—six win shares and nine. That would make more difference in an MVP race than you might suspect. An MVP candidate typically earns a little more than 10% of his team’s Win Shares. If you give the team nine extra Win Shares, he gets one of them. MVP races often come down to one or two Win Shares. A small bias in the direction of the player from the winning team is not unreasonable, and would make a difference.
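To see what those numbers would do, here is a small sketch of the bonus scheme. The function and parameter names are invented for the example, and the bonus sizes are the tentative ones floated above, not a settled part of the Win Shares system. The sketch treats the 2014 Giants as a post-season qualifier that did not win its division (they were a wild card team).

```python
WIN_SHARES_PER_WIN = 3


def team_win_shares(wins, made_postseason=False, won_division=False,
                    postseason_bonus_games=1, division_bonus_games=2):
    """Team win shares with a hypothetical post-season bonus.

    A division winner gets the larger bonus; any other post-season
    qualifier gets the smaller one; everyone else gets none.
    """
    if won_division:
        bonus_games = division_bonus_games
    elif made_postseason:
        bonus_games = postseason_bonus_games
    else:
        bonus_games = 0
    return WIN_SHARES_PER_WIN * (wins + bonus_games)


# 2014 Giants: 88 wins, qualified as a wild card. 2012 Angels: 89 wins, no post-season.
giants = team_win_shares(88, made_postseason=True)
angels = team_win_shares(89)
print(giants, angels)  # prints "267 267"

# The larger option: two bonus games for qualifying, three for the division.
giants_big = team_win_shares(88, made_postseason=True,
                             postseason_bonus_games=2, division_bonus_games=3)
print(giants_big)  # prints "270"

# An MVP candidate with about 10% of his team's win shares
# picks up roughly one of nine bonus win shares.
candidate_bonus = 0.10 * 9
print(round(candidate_bonus, 1))  # prints "0.9"
```

With the smaller bonus the two teams come out even; with the larger one the Giants move ahead. Which of those is right is exactly the question the sketch cannot answer.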
But if there is a compelling logic here, there has to be a compelling logic in favor of some SPECIFIC answer, some specific number, rather than just the general principle. We could give 3 Win Shares extra credit, we could give 6, we could give 9, we could give 50. Unless we know what the number is, we don’t really have a compelling logic.
As always, we’re trying to be faithful to multiple different premises. We are trying to be fair to a player who plays on any team, regardless of whether his teammates are good players or not. But at the same time, winning has to count, because that is the purpose of the effort.
Tom Tango’s comments on this post:
1. WPA in its original form was Player Win Averages from the Mills Brothers from 1970. I excerpted the relevant part of their book here:
http://tangotiger.net/PWA.html
I learned about it originally from Pete in The Hidden Game.
2. I agree about the attribution issue of WPA. It's really a limitation of the "availability of sequencing" of the data. We throw our hands in the air, and deal with the batter and pitcher, or in case of SB, CS, WP, PB, PK, BK, with the lead runner, pitcher, and catcher. More data is better, but we just deal with what we have available. And rely on sample size for things to work themselves out, which of course, kind of goes away from the idea of individual play attribution that started with the advantage of WPA.
If you look at WPA long enough, it does all work itself out. Pedro for example was +51 wins in WPA. He has a 219-100 record, which is +59 wins above a .500 record (though we really should use his team strength, but, let's say between Expos, Sox, Mets, it was .500). He gave up runs at about 2/3 the league average which is equivalent to around a .675 record, so with 2827 IP, divided by 9, is 314 "games", or 213-101, or +56 wins. No different than say a QB and Wide Receiver. We really don't know for that ONE PLAY, but look at Joe Montana long enough, and it works its way out.
But, I agree with your basic point, or at least I don't disagree with it.
3. I agree about your point of WPA and average. Studes and I have independently done a WPA above replacement some 10 years ago. Not that hard to do, but not as clean. Going back to what you said 30 years ago: "I can't do all this myself". There is a path to make a WPA as WAR like... it's just that others should step up. We have bigger fish to fry.
***
As for the other point regarding WAR and MVP: WAR really was not intended to be used in MVP discussion, just like FIP (or DIPS) was not really intended to be the end-all for pitcher evaluation. It just so happened that FIP, focusing on one-third a pitcher's plate appearances, is SO STRONGLY LINKED to a pitcher's ERA that we can get by with ignoring hits on balls in play and caught stealing or even "sequencing" (performance with men on base). That's what WAR is, that even though it is not "situationally aware", it works most of the time for MVP. And when it doesn't, we kind of forgot that it wasn't supposed to work all the time. Like with Judge.
--Tom