MVP Followup

By Bill James

November 20, 2017

2017-60

General and Philosophical

I enjoyed the Twitter discussion which followed the Judge and Altuve article posted a few days ago, and I appreciate all of your thoughts. One of the problems with general-participation discussions is that they tend to go everywhere. To make progress in thinking about an issue you have to stay focused on that issue. Analysis is largely a process of taking an unmanageable concept and breaking it down into smaller, more manageable issues. People have been arguing about "Who should be the MVP?" for 100 years; that’s not analysis. It becomes analysis when you break it down into smaller issues which are closer to the size that your mind can deal with, and then try to find a compelling logic to resolve the smaller issues. How many runs did each player create? How many runs did he prevent with his defense? How many wins resulted from those runs? How did the parks affect their batting stats? What weight should we give to the fact that the player’s team won the pennant? Should we give credit for leadership? Does character count?

Each question breaks down into a series of smaller questions. "How many runs did each player create?" breaks down into what is the run value of a double, what is the run value of a homer, etc. "Does character count?" breaks down into "What are the elements of character?", which becomes "What is the practical value of courage?", "What is the practical value of good work habits?", "What is the practical value of holding your teammates accountable?", "What is the practical value of sensitivity to the needs of others?", "What is the practical value of honesty?", etc. Every question works down toward smaller questions until you reach the point at which the questions actually have answers.

In public debate the opposite happens. People try relentlessly to inject different and larger questions into the debate, which prevents the debate from focusing on the smaller issues which you need to resolve. You’re trying to talk about Judge vs. Altuve, and people will form a circle around you and shout at you "What about Mike Trout? WHAT ABOUT MIKE TROUT? YOU HAVEN’T TALKED ABOUT MIKE TROUT!!! ISN’T MIKE TROUT ACTUALLY BETTER THAN EITHER ONE? WHAT ABOUT JOSE RAMIREZ? ISN’T JOSE RAMIREZ REALLY ABOUT THE SAME AS JOSE ALTUVE? SHOULDN’T YOU BE USING WIN PROBABILITY ADDED INSTEAD OF THE METHODS YOU ARE USING? HOW DOES WHAT YOU ARE SAYING DEAL WITH WPA? WHY AREN’T YOU USING WPA? WHY DIDN’T YOU TALK ABOUT THE NUMBER OF OUTS THAT JOSE ALTUVE MADE?" If you respond to all of that stuff then the discussion wanders all over the map, and you can never do any actual analysis. If you want to do actual analysis you have to learn to ignore the chatter and stay focused on what you are trying to understand. But I enjoy talking to you all, and I’ll try to comment superficially on some of the other stuff people are worrying about. My thoughts.

WPA

Win Probability is a method of measuring the state-to-state changes in each game, and attributing them to some player. You’re in the top of the 8th, one out, trailing 4-2, two men on base, your team’s chance of winning the game is 18% (.1845). The batter hits a home run, the team’s chance of winning goes to 73% (.7301). The home run improves their probability of winning the game by .5456, so we credit the batter with +.5456, and the pitcher with negative .5456.
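The bookkeeping in that example can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not any published WPA implementation; the function name `wpa_credit` is made up for the sketch, and the two win probabilities are the ones quoted in the paragraph.

```python
def wpa_credit(wp_before, wp_after):
    """Split one play's change in win probability between batter and pitcher.

    Hypothetical helper: the batter gets the full change and the pitcher
    gets the same amount with the opposite sign, as in the text's example.
    """
    delta = wp_after - wp_before
    return delta, -delta


# Win probabilities quoted in the text: .1845 before the homer, .7301 after.
batter, pitcher = wpa_credit(0.1845, 0.7301)
print(f"batter {batter:+.4f}, pitcher {pitcher:+.4f}")
```

Note that the sketch has only two places to put the credit, which is exactly the limitation the attribution discussion turns on.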

I remember John Dewan proposing what we would now call a Win Probability Added system about 30 years ago; not sure whether that was the first time I had heard the idea or not. I think Pete Palmer may have proposed something similar before then, but who knows; it’s been a long time. I remember John’s proposal because John was irritated that I didn’t much like the idea.

I’m not saying that that approach does not have merit or that that research should not be done. Like any other approach, it has its problems. Specifically it has two problems, or three problems, which are 1, 2a and 2b. The first problem is the problem of attribution. The example I gave you before was easy; the batter hits a three-run homer, you can’t blame anybody but the pitcher. But take a different example growing out of the same situation. The batter doesn’t hit a home run; he grounds the ball down the first base line. The first baseman should stop the ball and make a play at first, but just as the ball is delivered the runner from second fakes a move to third. The first baseman checks to see whether his runner is moving, and the ball gets down the line. The right fielder should cut the ball off and hold it to a single, most right fielders would, but it’s Larry Parrish or Gary Sheffield or Nick Markakis or Carlos Gonzalez or somebody; he has a good arm but he doesn’t move that well, so the ball scoots into the right field corner, where Markakis retrieves it and looks toward home plate. At first he is going to concede the run and throw to third, but as he starts to throw he sees that, because his team was in a shift and the shortstop has gone to cover second, there is no one in position to take the throw at third, so he throws home instead. The throw home is too late, but the throw to the plate enables the batter/runner to make it to third base. The score is tied, runner on third, one out, the visiting team has a 60.67% chance to win the game.

It is easy enough to measure the static change (the change in the win probability states)—but who is responsible for that? Is the pitcher responsible for all of the negative change of state, or is it the first baseman, or the right fielder, or the shortstop? Is the batter responsible for all of the positive change of state, or is it the runner from second, who faked the move to third, or the runner from first, who was able to score on a ball that another runner might not have scored on?

If it’s a three true outcomes play you can make a clean attribution, but if it isn’t, you can’t. If it’s a play from the distant past, Keith Hernandez hitting the ball into the corner and Larry Parrish slow to retrieve it, then all the information you would have is "double to right, 1-H, 2-H, B-3 on throw." If it’s a modern play then you have much more information; you can get some estimate of the probability that the play would be made by the first baseman, etc. In the 1980s, when John Dewan and I first debated this system, we didn’t have that.

The problem is that a system like this leaves you in a much poorer position to evaluate the responsibility of each fielder than even the final fielding stats would—and the old-style fielding stats were not good. I think the current WPA systems actually just nakedly ignore fielding and base running and attribute everything to the hitter or the pitcher, which is a really terrible way of doing it, but we have to assume that better systems will evolve over time.

Problem 2a. In general, I do not like and I do not have much faith in, systems which start everybody out in the middle and move them up or down from .500. The fielding systems that we use now basically do that; they start everybody out in the middle and move everybody up or down based on how they compare to the average. I’ve never liked that.

It doesn’t describe the real world. You don’t start out at average and move up or down. You start out at zero and build upward. If you have two shortstops in the same league, one of whom is a backup who plays 50 innings and is +1 and the other of whom plays 1,200 innings and is -20, it is very likely that the one who is -20 is in reality the better shortstop, because if he wasn’t, he wouldn’t be playing 1,200 innings at shortstop.

Problem 2b is a manifestation of Problem 2a. In the last week I don’t know how many people have told me that WPA is the ultimate measure of value and we should be using that to identify the MVP. WPA is a very poor measure of value, and not of much use as an MVP indicator. One problem is that, because it starts everybody out in the middle and measures up and down movements, it does not measure the value of being average. MOST value in baseball is in being average. 80 to 90% of value in baseball is the value of being average or less than average. This is true because the difference between two major league players is a lot less than the difference between a major league player and a man off the street.

Look, let’s suppose that the replacement level is .290 or whatever people say that it is. Let us suppose that you have a .510 player; that is, a player who is just a little bit above average. His value is .220 times his playing time, right--.510, minus .290, times his playing time. But his ABOVE AVERAGE value is .010 times his playing time. A little more than 95% of his value is in just being an average major league player. Only a tiny bit of it is measured by his margin above average.

Suppose that you have an array of 10 players—a .650 player, a .600 player, a .560 player, a .530 player, a .515 player, a .500 player, a .450 player, a .420 player, a .390 player and a .385 player. On average, they’re average.

How much of their value is simply being average, or being less than average but better than .290?

The .650 player has a margin of .360, of which .150 is created by being better than average, so that’s 42%. The .600 player has a margin of .310, of which .100 is created by being better than average. That’s 32%. For the ten batters as a whole, they have an aggregate value of 2.1, of which .355 is created by being better than average, or 17%. 83% of the value of the group of players is created by being better than the replacement level, but not better than average. WPA provides us no way to get a handle on the value of being average.
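The arithmetic for the ten-player array can be checked with a short Python sketch. The .290 replacement level and the ten winning percentages come from the text; the names `REPLACEMENT`, `AVERAGE`, and `split_value` are invented here for illustration.

```python
REPLACEMENT = 0.290  # replacement level assumed in the text ("or whatever people say that it is")
AVERAGE = 0.500

players = [0.650, 0.600, 0.560, 0.530, 0.515,
           0.500, 0.450, 0.420, 0.390, 0.385]


def split_value(level):
    """Return (value above replacement, portion of it that is above average)."""
    total = level - REPLACEMENT
    above_average = max(level - AVERAGE, 0.0)
    return total, above_average


total_value = sum(split_value(p)[0] for p in players)
above_average_value = sum(split_value(p)[1] for p in players)
above_average_share = above_average_value / total_value

print(f"aggregate value:      {total_value:.3f}")
print(f"above-average part:   {above_average_value:.3f}")
print(f"share above average:  {above_average_share:.0%}")

# The single .510 player from the earlier paragraph, per unit of playing time.
total_510, above_510 = split_value(0.510)
print(f".510 player, share from being average: {1 - above_510 / total_510:.1%}")
```

Per unit of playing time, the group totals 2.1 above replacement, of which .355 is above average, which is where the 17% and 83% figures come from.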

In order to convert WPA into a passable imitation of an MVP ballot, you have to do several things. First, you have to figure some way to weight playing time so that each player gets credit for the value of being average. Second, you have to convert the WPA and the "playing time score" so that they are on the same scale, so that you can add them together. Third, you have to find some way to give each player credit for his defense. You have to figure out how much credit to give an average defensive shortstop as opposed to an average defensive first baseman. Fourth, you have to deal with base running. Then you get into the tricky stuff, the stuff that we don’t know the real answers to—does a player deserve extra credit for leadership, or for playing for a pennant-winning team.

I am not opposed to incorporating WPA into a comprehensive value system. It could be done well; it could be done poorly. But I think, in general, that. . .well, there is a steep side of the mountain, and there is a less steep slope on the other side. I think this is the steep side of the mountain. In general, you drive up the less steep slope of the mountain. This is driving up the steep side. It is easier to get where you’re trying to go if you come at it from a different angle.

Mike Trout

So why didn’t I discuss Mike Trout, when I was arguing with the world about whether Judge was on the same level as Jose Altuve?

Because he was never a real candidate for the Award. It was obvious that either Judge or Altuve was going to win the award. Why would I waste my time worrying about somebody who had no chance to win the award?

So do I believe that Mike Trout should have been an MVP candidate?

No, I do not. He missed 30% of the season. You have to discount his value for that, and it’s not a small discount; it’s a big discount. He’s a marginal candidate at best.

Home/Road Stats

Charlie Blackmon finished fifth in the National League MVP voting. I think he should have done better, maybe. Blackmon hit .391 with a 1.239 OPS in Colorado, but .276 with a .784 OPS on the road. The voters knew that, and they discounted Blackmon as an MVP candidate in part because of that.

This is the way I see it. Voters know that Blackmon’s stats need to be discounted because he played in Colorado, but they have no clear idea of how much to discount them relative to this stat or that one. That being the case, some voters discounted his stats by a grossly inappropriate percentage, which was influenced by Blackmon’s individual home/road stats.

I know that this is a confusing issue. It’s not intuitively obvious what set of numbers you should focus on. But the example that helped me to get it clear in my mind was Bill Dickey vs. Elston Howard.

Elston Howard and Bill Dickey were both Yankee catchers, of course, both tremendous players. Elston Howard was a 12-time All Star and won an MVP Award but is not in the Hall of Fame; Bill Dickey is in the Hall of Fame but did not win an MVP Award. Both played in Yankee Stadium, and the effects of Yankee Stadium were essentially the same in Howard’s era as they were in Dickey’s. The Park Factors for Yankee Stadium from 1930 to 1935 were 80, 95, 83, 81, 84, and 79. From 1959 to 1964 (Howard’s best years) they were 81, 83, 85, 84, 96 and 100. It’s a matched set.

The thing is, though, that Yankee Stadium was great for Bill Dickey, a left-handed hitter, whereas it was absolutely turabull for Elston Howard. Dickey’s home-field advantage in some seasons was bigger than Blackmon’s. In 1935 Dickey hit .303 with 11 homers at home, but .257 with 3 homers on the road. In 1937 he hit 21 homers at home, 8 on the road. In 1939 he hit .357 with 23 homers, 84 RBI in Yankee Stadium, but .274 with 4 homers, 32 RBI on the road.

Elston Howard, on the other eyeball. In 1959 he hit 68 points higher on the road, with 13 of his 18 homers on the road. 21 RBI at home, 52 on the road. In 1962 he hit 18 of his 21 homers on the road. In 1964 he hit 65 points higher on the road, with 12 of his 15 homers on the road.

If Howard had been a left-handed hitter and Dickey a right-handed pull hitter like Howard, Elston would be in the Hall of Fame and Dickey would not—but that didn’t happen. Should we, then, adjust Howard’s numbers upward, because the park hurt him, but adjust Dickey’s numbers downward?

But you can’t do that, because the rock that you stand upon when analyzing stats is wins. The thing is that real and permanent wins resulted from Dickey’s good fortune, while real and permanent losses resulted from Howard’s tough luck. Because the park helped him, Dickey created more runs than Howard did, and more runs relative to the offensive context. That made him a more successful player. It’s luck, yes, but we don’t adjust luck out of existence, because you can’t adjust luck out of existence. If you adjust Howard upward and Dickey down, what you are in essence saying is that in a neutral park, Howard would have been better and Dickey would have been worse. Analysis is not about what would have happened in a different world. It is about the value of each player in the real world. Because Dickey and Howard created runs in the same park, the park adjustments that apply to them are the same for both players.

It is the same as dollars and purchasing power. What is relevant is the ratio between the dollars you have to spend and the cost of living in the place where you have to spend it. Blackmon created about 50% more runs in Colorado than he did on the road. It may be that Johnny Blackmon, as an individual worker, can earn $100,000 a year in St. Louis but $150,000 a year in Colorado, because his individual skills are more in demand in the Colorado area than they are in the St. Louis area; insert pot farming joke here. But that doesn’t mean that Blackmon has an "individual cost of living" which is 50% higher in Colorado than it is in St. Louis. What it means is that he has more purchasing power in Colorado than he does in St. Louis.

Runs are used to purchase wins in a similar manner to how dollars are used to purchase groceries and pay rent. What is relevant is "How many runs do you have to work with?" and "What is the cost of a win, in terms of runs?" The cost of a win, in terms of runs, IS higher in Colorado than it is in an average NL park—about 33% higher, in 2017. You play half of your games at home, half on the road, so the appropriate discount for Blackmon’s stats is about 16%--not 50%.

The thing is that Blackmon hit .391 in Coors Field, but other players did not. If everybody hit .391 in Coors Field, or if everybody hit 115 points higher in Coors Field than they did on the road, then we would discount Blackmon’s performance at that rate. If everybody hit 60 points higher in Coors Field, you could mostly discount it. But in fact hitters as a whole hit only 34 points higher in Colorado than in Rockies’ road games.
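The discount described above can be sketched as simple averaging over the schedule. This follows the text's own back-of-the-envelope reasoning (a 33% higher run cost at home, half the games at home, no premium on the road); it is not a full park-factor method, and the function name `blended_discount` is made up for the sketch.

```python
def blended_discount(home_premium, home_game_share=0.5):
    """Average a home-park premium over the whole schedule.

    Road games are counted at zero premium, which is the simplifying
    assumption the text makes.
    """
    return home_premium * home_game_share


# 33% is the 2017 Colorado run-cost premium quoted in the text.
park_based = blended_discount(0.33)

# The rate a voter would use if he discounted by Blackmon's own home/road gap.
personal_split = 0.50

print(f"park-based discount:     {park_based:.1%}")
print(f"personal-split discount: {personal_split:.0%}")
```

The park-based figure lands at roughly a sixth, in line with the "about 16%" in the text, and about a third the size of the discount implied by Blackmon's individual split.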

The Garvey/Votto Rule

I was wrong about something, in the 1970s and 1980s; I was wrong about a lot of things. One in particular was this.

Steve Garvey in 1974 hit .312 with 21 homers, 111 RBI and 200 hits, and won the MVP Award. Then he did basically the exact same thing again every year until 1980, but he never won the MVP Award again. He finished 11th in 1975, 7th in 1976, 6th in 1977, a distant second in 1978 (no first-place votes), 15th in 1979, and 6th in 1980.

Another example is Willie Mays in 1954. Mays in 1954 hit .345 with 41 homers, 110 RBI, and won the MVP Award. It was his first great season. Actually, he was just as great as that pretty much every year after that, but he didn’t win another MVP Award until 1965.

Hank Aaron hit .322 with 44 homers in 1957, won the MVP Award. How many times did Hank Aaron have that season? He was better than that in 1959, and just as good as he had been in 1957 in 1960, 1961, 1963 and 1969, but he never won another MVP Award.

Based on these examples and others, I developed what I called the Steve Garvey rule. A player is more likely to win the MVP Award in his FIRST great season than he is in a subsequent season which is just as great. The reason for this, I thought, is that that which is moving attracts the eye. A stationary object does not attract attention. A moving object does. If a player does something every year, he becomes a stationary object. People stop noticing.

OK, I was wrong about that; not entirely wrong, but more wrong than right. There are two factors pulling in opposite directions, that one and a generalization asserting that a great player builds respect over time, and people start to feel that it is his turn to win. On balance, the larger of the two effects is the second one. Overall, players gain strength in the MVP voting when they repeat an outstanding season, more than they lose strength.

Nellie Fox in 1959 hit .306 with 2 homers, 191 hits, which really was just his normal season. He won the MVP Award in 1959 because (a) his team won the pennant and (b) people felt that it was his time. Barry Larkin wasn’t really any better in 1995 than he was in several earlier seasons, but by 1995 people had decided that it was his time, and he deserved a break.

Joey Votto was the same player in 2017 that he has been since 2010, but in 2017 he ALMOST won the MVP Award. It’s the Nellie Fox effect. People had decided that it was his turn—not quite enough people, but almost enough.

The Pennant Winners’ Bonus

The Altuve/Judge article ends with these two sentences:

What creates value for a baseball player is winning games. You cannot discard that principle, and have a valid analysis.

If I have a problem with that phrase, it is with the word "games". What creates value for a baseball player is winning. But do we truly believe that all wins are created equal? Let us say that one team, who we will call the 2014 San Francisco Giants, wins 88 games and goes on to win the World Series. Another team, who we will call the 2012 Angels, wins 89 games but does not qualify for the post-season. Must we give more win credit to the 2012 Angels than to the 2014 Giants? The purpose is not to win games; it is to win pennants. Should there not be a recognition of this, in determining a player’s value?

One COULD set up the Win Shares system in this way: that rather than crediting the team with three win shares for each win, we could credit a team with three win shares for each win, plus some number of "bonus" win shares if they make the post-season, and some larger number of bonus win shares if they win their division, thus are in a more advantageous position in the post-season.

If we were to do that, I believe that that would cause the "Win Shares MVP" to match up better with the elected MVP. To return to the examples from earlier in the article, Garvey in 1974, Mays in 1954, Aaron in 1957, and Fox in 1959 won the MVP Award not because the players individually had better seasons, but because they received extra credit for the success of the team. Is this not actually rational?

One of the things that research is about is the search for a compelling logic. We are always trying to find an argument about some small issue which is so compelling that (a) there appears to be no way to escape that conclusion, or (b) we might reasonably expect that a consensus would develop in support of that conclusion.

I might almost argue that this is a compelling line of logic:

1) Value is created not merely by winning games, but also by winning pennants or post-season position, therefore

2) In assessing each player’s value, we will include some credit for winning enough games to qualify for post-season play, rather than simply straight-line credit for winning games.

So far, so good. But here’s where we lose it. How much extra credit should we give?

I would go for one extra game (three extra win shares) for qualifying for post-season play, and two extra games (six extra win shares) for winning the division. I might go one more game than that—six win shares and nine. That would make more difference in an MVP race than you might suspect. An MVP candidate typically earns a little more than 10% of his team’s Win Shares. If you give the team nine extra Win Shares, he gets one of them. MVP races often come down to one or two Win Shares. A small bias in the direction of the player from the winning team is not unreasonable, and would make a difference.
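A rough Python sketch of how such a bonus would work follows. The three-win-shares-per-win rule and the 3/6 bonus sizes come from the text. The rest is assumption for the sake of illustration: the function name `team_win_shares`, treating the division bonus as replacing rather than adding to the post-season bonus, entering the 2014 Giants as a post-season qualifier who did not win their division, and rounding a star's share of team Win Shares to a flat 10%.

```python
WIN_SHARES_PER_WIN = 3


def team_win_shares(wins, made_postseason=False, won_division=False,
                    postseason_bonus=3, division_bonus=6):
    """Team Win Shares with a hypothetical post-season bonus added.

    Assumption: a division winner gets division_bonus in place of,
    not on top of, postseason_bonus.
    """
    if won_division:
        bonus = division_bonus
    elif made_postseason:
        bonus = postseason_bonus
    else:
        bonus = 0
    return wins * WIN_SHARES_PER_WIN + bonus


# The two teams from earlier in this section.
giants_2014 = team_win_shares(88, made_postseason=True)
angels_2012 = team_win_shares(89)
print(giants_2014, angels_2012)

# An MVP candidate earning about 10% of his team's Win Shares
# picks up roughly one Win Share from a nine-share team bonus.
star_share = 0.10
extra_for_star = star_share * 9
print(f"{extra_for_star:.1f}")
```

Under the smaller 3/6 bonus, the 88-win team that reached the post-season draws level with the 89-win team that did not, which shows how modest the adjustment is at the team level even though one Win Share can swing a close MVP race.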

But if there is a compelling logic here, there has to be a compelling logic in favor of some SPECIFIC answer, some specific number, rather than just the general principle. We could give 3 Win Shares extra credit, we could give 6, we could give 9, we could give 50. Unless we know what the number is, we don’t really have a compelling logic.

As always, we’re trying to be faithful to multiple different premises. We are trying to be fair to a player who plays on any team, regardless of whether his teammates are good players or not. But at the same time, winning has to count, because that is the purpose of the effort.

Tom Tango’s comments on this post:

1. WPA in its original form was Player Win Averages from the Mills Brothers from 1970. I excerpted the relevant part of their book here:

http://tangotiger.net/PWA.html

I learned about it originally from Pete in The Hidden Game.

2. I agree about the attribution issue of WPA. It's really a limitation of the "availability of sequencing" of the data. We throw our hands in the air, and deal with the batter and pitcher, or in case of SB, CS, WP, PB, PK, BK, with the lead runner, pitcher, and catcher. More data is better, but we just deal with what we have available. And rely on sample size for things to work themselves out, which of course, kind of goes away from the idea of individual play attribution that started with the advantage of WPA.

If you look at WPA long enough, it does all work itself out. Pedro for example was +51 wins in WPA. He has a 219-100 record, which is +59 wins above a .500 record (though we really should use his team strength, but, let's say between Expos, Sox, Mets, it was .500). He gave up runs at about 2/3 the league average which is equivalent to around a .675 record, so with 2827 IP, divided by 9, is 314 "games", or 213-101, or +56 wins. No different than say a QB and Wide Receiver. We really don't know for that ONE PLAY, but look at Joe Montana long enough, and it works its way out.

But, I agree with your basic point, or at least I don't disagree with it.

3. I agree about your point of WPA and average. Studes and I have independently done a WPA above replacement some 10 years ago. Not that hard to do, but not as clean. Going back to what you said 30 years ago: "I can't do all this myself". There is a path to make a WPA as WAR like... it's just that others should step up. We have bigger fish to fry.

***

As for the other point regarding WAR and MVP: WAR really was not intended to be used in MVP discussion, just like FIP (or DIPS) was not really intended to be the end-all for pitcher evaluation. It just so happened that FIP, focusing on one-third a pitcher's plate appearances, is SO STRONGLY LINKED to a pitcher's ERA that we can get by with ignoring hits on balls in play and caught stealing or even "sequencing" (performance with men on base). That's what WAR is, that even though it is not "situationally aware", it works most of the time for MVP. And when it doesn't, we kind of forgot that it wasn't supposed to work all the time. Like with Judge.

--Tom

COMMENTS (22 Comments, most recent shown first)

PeteRidges
Agree with the comments on WPA. But that doesn't mean that it's not an informative stat.

Let's look at Batting Wins and (offensive) WPA for two Colorado players, Charlie Blackmon and Nolan Arenado, in 2017. On most measures, Blackmon is better, but on WPA, Arenado is.

Blackmon: 3.9 Batting Wins, 4.3 WPA (a difference of 0.4)
Arenado: 2.8 Batting Wins, 4.9 WPA (a difference of 2.1).

Now, correct me if I'm wrong, but these two stats purport to measure exactly the same thing, except that WPA takes into account context and Batting wins doesn't. Batting Wins doesn't know if a home run was a walkoff homer, but WPA does. Neither of them allows for defense at all. Both have a base of average, so a below average player would unfortunately get a negative score.

So a player with more WPA than Batting Wins was good in clutch situations, and vice versa.

Now, Arenado's high WPA score indicates that he was, quite probably by chance, productive in clutch situations. So was Blackmon, just a little bit, but there's a difference between 2.1 and 0.4: the difference is 1.7 wins.

So here's the idea. When we want to think about "clutchness" for these two players, we give Arenado an extra 1.7 wins over Blackmon. Up until then, use your favourite stat, that's Win Shares for me and WAR for most people, but without any allowance for clutch, or runners in scoring position, or anything. Then you put those 1.7 wins in right at the end.

Blackmon had 33 Win Shares, Arenado had 26. 1.7 wins are 5.1 Win Shares: give them to Arenado, and he's still 2 behind. Actually Win Shares include just a little for clutchness, so we shouldn't have given Arenado quite 5.1, but you get the idea: Blackmon is still ahead.

7:34 AM Dec 6th

Guy123
Brock: Nothing here is worth getting upset over. These are difficult issues, and there are various reasonable ways of slicing the apple. We've had many long discussions over at Tango's site over the years litigating these issues, and never reached a real consensus. Starting with offense and defense is also a valid approach to thinking about these issues. But it doesn't allow you to avoid the need for replacement value, or Bill's zero value, or something like it. Because the fundamental reality is that the marginal value of runs (.1 wins) is about twice the average value of runs. If you try to treat all runs the same, your system will never make sense, and you will reach silly conclusions like "replacement players are worth 60% as much as a league average player."

Anyway, let me just finish by clarifying a couple of things about WAR for you: the foundation of WAR is indeed an assumption about the difference between an average player and replacement (about 2 wins over full season). At BRef this is called "Rrep," and -- you should be sitting down for this -- it accounts for 100% of total WAR. Here are the 2017 WAR totals for MLB position players:
WAR 590
RAR 6224
RRep 6114
As you can see, RRep provides all of the total value -- everything else sums to zero (approximately).

And I'm surprised you bring up DWAR and OWAR to make your case, because here are those totals for 2017:
OWAR 590
DWAR 0
Now, OWAR has all the value only because it includes Rrep, while DWAR does not. Really, WAR measures both hitting and fielding against league average, valuing both at zero, and then Rrep provides 100% of total value.

6:22 AM Nov 30th

Brock Hanke
Guy - Oh, boy. We probably should stop this. You just put out several assumptions that I don't agree with.

1) I don't agree that the amount that players get paid is a valid estimation of their value. GMs are people just like you and me. They have access to a lot more info than I do, but you can make mistakes by being drowned in too many numbers just as much as you can by having too little, plus GMs make intuition mistakes every year, and owners force the GMs to make even more (how, in money, do you rate the Angels' acquisition of Albert Pujols, now that his ailments - bad elbow, plantar fasciitis - have reduced him to a DH?). I would never try to add up how many dollars someone gets paid and try to estimate value with that. As to why GMs would consistently make the mistake of overpaying pitchers relative to their value, the answer is that you can't play them every day, and they get hurt a lot more often than position players do, so you have to have more of them than you actually would need if no one ever got hurt. When I was a kid, in the 1950s, teams carried 9-10 pitchers. They now carry 12-14. They have to pay all of them to do the work that the 9-10 used to do. So, a team's cadre of pitchers, as a group, make more money than they used to, relative to the cadre of position players, which has dropped from 15-16 to 11-13.

2) You don't start breaking up value by dividing the pitchers from the position players. You start by dividing scoring runs (50%) from preventing runs (50%). You do that because it is the only thing that you know is true. Bill changed to 52% preventing runs because he, intuitively, decided that top starting pitchers were ranking too low - the example he uses is pitchers who deserve MVPs. I would disagree with Bill on that, but the math does work that way if you ignore very low levels of performance, as Bill does. The problem would resolve itself into 50/50 if Bill wasn't ignoring a small percentage of value so he can use a linear model. Replacement level ignores a lot more than Bill does.

3) There's no metaphysical issue if you divide up value into scoring runs and preventing runs, because you don't have to try to separate scoring runs from some unknown part of preventing runs. Which is what assigning 60% to position players and working from there does.

4) I'm not interested in playing the replacement player game, because it just presents another unknown into the math. I am aware that just what percentage is "replacement" has been a hot topic since the invention of the concept. I am aware that the two leading WAR systems worked out an agreement, and placed Replacement Level with, IIRC, .294. This means that .294 is the zero point of WAR. So WAR ignores not a very small percentage of value, it ignores almost 30%. Actually, because the replacement number is so high, it gets into negative numbers, which is always trouble. Pete Palmer's Linear Weights has its zero point at .500. It has even worse problems with negative numbers.

5) WAR does not examine the difference between replacement level and average. For each player, it examines HIS value compared to replacement value. I'm not sure if saying "average" was not a typo on your part, but it's not right.

6) WAR's creators very clearly list fielding value as well as batting value and pitching value. The only reason that total WAR for position players isn't the simple sum of O(ffensive)WAR and D(efensive)WAR is that both OWAR and DWAR contain the position adjustment, so you'd end up counting it twice. BB-Ref is very clear about this.

Given all these disagreements at the fundamental level, I don't think we're ever likely to come to an agreement. And we're all that's left of this thread. So, I'm going to suggest that we call it off before someone gets annoyed or frustrated. I have a procedure for doing this. You get the last word. I WILL read any response you write here, but I won't respond. YOU get the last word. I am the one who made the decision to call it off. Doing that and also getting in the last word is really rude, and I try not to do it. So, post up what you have in response, and don't think I'm angry or something when you don't get anything back. I WILL read what you write. I can guarantee you that.
3:54 AM Nov 30th
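
As a rough check on point 4 above, here is what a .294 zero point works out to over a season. The 162-game schedule and the 30-team league are assumptions added here, not figures from the comment, and this is only arithmetic on the quoted number, not a description of how either WAR system is actually computed.

```python
# Replacement-level arithmetic, using the .294 winning percentage quoted above.
REPLACEMENT_WIN_PCT = 0.294   # figure quoted in the comment (recalled "IIRC")
GAMES_PER_TEAM = 162          # assumption: standard MLB schedule
TEAMS = 30                    # assumption: current size of MLB


def replacement_wins(games: int = GAMES_PER_TEAM) -> float:
    """Wins a team playing at the replacement-level percentage would collect."""
    return REPLACEMENT_WIN_PCT * games


def league_war_pool() -> float:
    """Wins left over once every team's replacement-level wins are set aside."""
    total_wins = TEAMS * GAMES_PER_TEAM / 2   # each game produces one win
    return total_wins - TEAMS * replacement_wins()


print(round(replacement_wins(), 1))   # replacement-level wins per team
print(round(league_war_pool()))       # wins above replacement, league-wide
```

Under these assumptions a replacement-level team wins about 48 games, which lines up with the 48 replacement wins quoted for the Yankees elsewhere in this thread.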

Guy123
Brock: I'm not sure I understand your argument. Why should we trust your (or anyone's) intuition over the collective wisdom of the marketplace? I'm not someone who believes markets always get it right, but if you are going to assign different values to pitchers and position players you at least need to offer some kind of evidence, and/or a theory about why the market would consistently make such a mistake. Otherwise, I think we are stuck with the conclusion that position players provide about 60% of value.

Now, how you divide up that 60% between hitting and fielding -- or even whether you try to do that -- is a somewhat metaphysical issue. WAR is officially agnostic: it just says that the average position player is 2 wins better than replacement, which can consist of any possible hitting:fielding ratio. I've always thought this was somewhat of a dodge, in that replacement players are in fact almost 2 wins below league average as hitters (on average), while providing roughly league-average defense. So, if you want to define value as the difference between average and replacement, then fielding makes up *at most* 10% of the value pie, probably less. But again, that's *not* what WAR's creators say: they just measure the overall value of position players.
11:10 AM Nov 29th

Brock Hanke
Guy - There's an underlying trap in giving pitching 40% of the game's value. You have to give offense right about 50%, and if you add in 40% for pitching, you only have TEN percent left for fielding. All fielding, from circus catches to pitch framing. Ten percent. 100-50-40=10. It's gonna take a LOT of explaining to convince me that fielding has so little value left in the game. The Win Shares / WAR gap is only about 5-6%, and that doesn't sound so bad compared to 50 or 40. But if you realize that WAR has said, essentially, that fielding has dropped down to only a tenth of the game, then a 5-6% difference looks a lot more damaging. Like, 1/3 of defensive value has gone away.
3:56 PM Nov 28th
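
The subtraction in the comment above (100-50-40=10) can be set next to the other splits quoted in this thread. This is a minimal sketch: the 48% offense figure for Win Shares is inferred from the 52% defense share mentioned elsewhere in the thread, and the 50% offense figure for WAR is Brock's working assumption, not something WAR's documentation is quoted as saying.

```python
# Residual fielding share under the value splits quoted in this thread.
# All shares are percentages of total value; fielding is what is left over.

def fielding_share(offense: float, pitching: float) -> float:
    """Fielding share = 100 - offense - pitching (all in percent)."""
    return 100.0 - offense - pitching


splits = {
    # label: (offense %, pitching %)
    "rule of thumb (1/2 offense, 1/3 pitching)": (50.0, 100.0 / 3),
    "Win Shares (48 offense, 35 pitching)": (48.0, 35.0),
    "WAR as described here (50 offense, 40 pitching)": (50.0, 40.0),
}

for label, (offense, pitching) in splits.items():
    print(f"{label}: fielding = {fielding_share(offense, pitching):.1f}%")
```

The point of laying it out this way is that fielding is never set directly in any of these splits; it is simply whatever remains after offense and pitching are assigned.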

Riceman1974
Brock:

Thanks for the clarification. That is a good question. Perhaps, as others have pointed out below, because WAR gives pitchers 40-42% of the value (as opposed to Win Shares' 35%), a standout pitching performance can still lead the league overall.
2:35 PM Nov 28th

Guy123
Brock: I'm not sure there is such a disparity. In this decade (2010-2017) there have been 2.6 pitchers among the top 10 WAR players on average (26%). In the 1970s, there were 5.0 pitchers on average, or 50% of the top 10 (in 1971, the top 5 players were all pitchers). That's a pretty dramatic change. Perhaps Win Shares shows an even more dramatic decline, I don't know, but WAR is certainly not valuing starters as highly now as it did a few decades ago.

As for the right amount of value to give to pitchers, I don't think your intuition is serving you well here. About 40% of MLB payroll goes to pitchers, so the collective wisdom of the market suggests pitchers contribute about 40% of total player value, consistent with WAR's allocation. Because pitcher performance is much less predictable, one can argue that total pitcher value is likely more than 40%, perhaps closer to 50%. But it seems very unlikely that it could be as low as 35%, much less 33%.
10:19 AM Nov 28th

Brock Hanke
Riceman - I wasn't clear enough. I understand why Win Shares does what it does. Bill was VERY aware that starter IP have been dropping steadily since, literally, 1879, when Will White pitched a whole 1/3 of an inning more than Old Hoss Radbourne would do in 1884. He was very aware that this drop of IP was decreasing the value of starting pitchers compared to position players, and he was aware that the most-leveraged innings were going, even in 2000, to relievers. He agreed that the drop was something real, not a glitch in his system, and he wrote about there being no more MVPs who were pitchers, because of the IP drop.

What I don't get is why WAR isn't dropping the pitcher values anything like as fast as Win Shares is. WAR still coughs up a starting pitcher MVP (and we're talking one MVP for both leagues together) once every 3-5 seasons. Win Shares hasn't had a starting pitcher MVP since the days of Bob Gibson and Tom Seaver and Steve Carlton (Roger Clemens might have one; I'm not sure, he sometimes pitched more than 260 innings). I agree with Win Shares on this; it's WAR I don't understand.
9:55 AM Nov 28th

Riceman1974
Brock:

The reason for the gap is simple. Starting pitchers pitch fewer innings today. Win Shares will give more credit to starting pitchers who pitch more innings. When Win Shares was designed 20 years ago, the top starters pitched in the 220-260 inning range. Now only a handful even hit 200 innings, and I believe this year saw a record low in 200-inning pitchers. No matter how great you are, a pitcher is not getting 30 Win Shares in 190 innings. I guess Bill could change the formula, although then guys like Walter Johnson and Lefty Grove would have 70 Win Shares seasons. Maybe they should.
7:05 AM Nov 28th

Guy123
We can dig a little deeper into Bill’s claim that responsibility for a team under-performing (or over-performing) its pythag record is proportional to a player’s individual WAR. To see whether a hitter produced at the right (hi leverage) or wrong (lo leverage) times, we can compare his WPA to his context-neutral hitting. Judge, for example, had 5.8 batting wins (above average) but just 2.0 WPA -- a huge gap. And if we look at the entire Yankee team, we find a strong correlation between batting wins and this gap -- stronger hitters tended to underperform more in terms of increasing team win probability. So this is entirely consistent with Bill's theory.

Unfortunately, when I looked at 7 other extreme over- and under-performing teams of recent years, this pattern vanishes. When we compare hitters’ batting wins with their “WPA deficit” (or excess), the correlations are generally low and appear to be close to zero once one accounts for playing time. In some cases, the correlation is actually negative -- on overachieving offenses, weak hitters frequently contribute more than good hitters to the team's excess WPA. Overall, the data does not indicate that high-performers have a disproportionate impact on the disparity between a team's context neutral offensive production and its W-L record. Where we *do* see a correlation -- though not terribly strong -- is between playing time and the WPA deficit: the more times a player hits, the larger his impact tends to be.

Obviously, one can quarrel with WPA as an overall measure of contextualized wins. However, Bill has focused mainly on the issue of how hitters performed in high- vs. low-leverage PA, and WPA does capture that very well. If Bill were right about the relationship between productivity and pythag luck, we should see a correlation between batting wins and a player’s excess WPA. But we don't.

Perhaps Bill or someone sympathetic to his theory will do a more rigorous study that points in a different direction. But for now, I see no reason to accept the claim that a player’s luck is proportional to his productivity (and really, why would we expect it to?). So if you want to distribute a deficit/excess of team wins to players, the best approach is to use playing time (which reduces Judge's context-WAR penalty to about -0.6 wins).
9:19 AM Nov 27th
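
The test described in the comment above (compare each hitter's batting wins with his WPA gap, then look at the correlation across a roster) can be sketched as follows. Only Judge's two figures come from the comment; the five-player roster is HYPOTHETICAL, invented purely to exercise the functions, and says nothing about any real team. The sketch also does not attempt the playing-time adjustment the comment mentions.

```python
from math import sqrt


def wpa_gap(batting_wins: float, wpa: float) -> float:
    """Negative gap: the hitter added less win probability than his
    context-neutral batting line implies."""
    return wpa - batting_wins


def pearson(xs, ys) -> float:
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Real figures quoted above: Judge had 5.8 batting wins and 2.0 WPA.
print(round(wpa_gap(5.8, 2.0), 1))

# HYPOTHETICAL roster (made-up numbers, for illustration only).
batting_wins = [3.0, 1.5, 0.5, -0.5, -1.0]
wpa = [1.0, 0.5, 0.5, 0.0, -0.5]
plate_apps = [650, 600, 450, 400, 300]

gaps = [wpa_gap(b, w) for b, w in zip(batting_wins, wpa)]
print(round(pearson(batting_wins, gaps), 2))   # productivity vs. gap
print(round(pearson(plate_apps, gaps), 2))     # playing time vs. gap
```

Run on real team data, the first correlation is the one Bill's theory needs to be strong, and the second is the one the comment reports as the more consistent signal.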

Brock Hanke
Steve - It's rWAR, which I am pretty certain is BB-Ref WAR. When I check the header WAR with BB-Ref, they match up.

Guy - As far as I know, there is no updated version of Win Shares, until Bill finally gets all the work done on Loss Shares and publishes a new book. There is a site that continues to assign Win Shares to players after 2000, using the original formulas. Bill does have a day job.... And Win Shares does make an adjustment for relief pitchers, called "Save Equivalent Innings", so that's not a factor, unless WAR is assigning a much higher percentage to relievers.

However, if WAR is assigning 40% of total value to pitchers, that probably explains the whole thing, except the gap increase. I've always used, as a rule of thumb, the following: Baseball is half offense, 1/3 pitching, and 1/6 fielding. This produces good results, but it's only 33.3% pitching. Win Shares assigns 35% to pitching only because it assigns 52% to overall defense. Bill did this (short essay in Win Shares titled Why 52?) because, if he only had 50% defense, his pitcher rankings came up short, in his opinion.

I, personally, think that 40% is way too high. There are three changes in the game over time that I can think of that would affect the division of value: 1) the increase in Three True Outcomes favors pitchers at the expense of fielders, 2) the number of innings that starting pitchers pitch has been steadily going down, and 3) the decrease in starter IP is accompanied by a decrease in the leverage of starter IP as compared to reliever IP. Starters are pitching the LEAST-leveraged innings; relievers, the MOST. #1 favors all pitchers, but for this purpose, it favors starting pitchers, which contain, in both systems, all of the top pitchers and most of the innings. #2 and #3 both work against starting pitchers as opposed to relief pitchers. I don't see how #1 could be so overpowering that it would brush aside #2 and #3 and produce a 5-6% gain in assignment to starting pitching, especially compared to position players.

But that big of an assignment gap would sure be a starting point to look at. Thanks for the info. I just don't know enough about the guts of WAR to know that it is assigning 40% of value to pitching.
6:38 AM Nov 27th

steve161
Brock: my first question is Which WAR? Since BBRef uses Runs Allowed while Fangraphs uses FIP, it would be interesting to know if both systems show the same discrepancy vis-a-vis Win Shares.

Guy's guess, especially point 1, looks plausible.
12:27 PM Nov 26th

Guy123
Brock: I haven't read Win Shares in years, or kept up with any modifications Bill made over the years, so the following may not be quite right. But I think the discrepancy you note likely comes mainly from two differences in the systems:
1) Win Shares assigns 35% of value to pitching, while WAR assigns 40-41% of value to pitching;
2) Win Shares uses a single runs-allowed benchmark in determining individual pitcher value, while WAR uses different benchmarks for starters and relievers, to account for the "reliever advantage." There are many other components in WS, but this may mean it gives a smaller share of pitching value to starters, as compared to WAR.

As for why this starting pitcher gap would grow over time, I have no idea.
9:27 AM Nov 26th

Brock Hanke
Well, this is not directly on topic, but I've been waiting for months for a thread to be about the differences between WAR and Win Shares. So, here goes: On another project on another site, some of us are, essentially, revoting all the historical MVPs with combined leagues, so only one MVP per year. The guy who heads that project sets up a header with about 70 players, and lists their WAR scores and their Win Share scores with them. I've been pulling that header down into a Word table, and sorting by Win Shares and then by WAR. There are two systematic things going on that I don't understand, but someone here might.

1) WAR overrates pitchers, at least top pitchers, compared to Win Shares. WAR will have a pitcher first in the league while Win Shares has him 8th or something. The systems agree about who are the best pitchers, but do not agree as to where they fit within the context of position players.

2) This is steadily becoming worse. It's not uncommon, in looking at recent seasons, to see the best pitcher by WAR rank 15th or so in Win Shares. It didn't used to be that bad, but the gap has been steadily widening.

My question is whether anyone here knows WHY this is happening. I don't know nearly enough about the guts of WAR to compare the systems over a sample of pitchers. So, I'm stuck. Has anyone else noted this? Does anyone know why it happens? If so, please comment.

Also, Tom is right about the Markov Chain analysis coming into baseball. Pete Palmer did, in The Hidden Game, put up a chart for the seventh inning, with all base/out combinations. I don't know about the predecessor, but Pete was using the idea at the time of The Hidden Game, just like Tom says.
12:36 PM Nov 25th

DMBBHF
ScottSegrin,

Very well said. Thank you.

Dan
12:17 PM Nov 25th

ScottSegrin
I have long felt that parts of sabermetric research lean too heavily on the principles of traditional statistics. Not baseball statistics - mathematical statistics. The kind where you draw a sample of observations from a large population and analyze that sample using things like standard deviations, confidence intervals, and bell curves, to estimate the parameters of the entire population - most of which you did NOT observe.

When I was in college (many moons ago) a stats professor of mine chastised me for improperly applying statistical principles to baseball stats. He said something along the line of, “A guy’s 500 at bats aren’t a sample of anything. They are his 500 at bats. There is no larger population of at bats that you didn’t observe that you are trying to estimate. The 500 at bats *is* the population. If the guy bats .290, you don’t apply a confidence interval to that because you observed all 500 of his at bats and all 145 of his hits. You are 100% certain that he batted exactly .290.”

I believe that same thinking applies to this discussion about WAR. The Yankees' 91 wins were their 91 wins. There is no uncertainty to that - we are 100% certain that they won 91 games. So, if you want to calculate the contribution of each individual player in terms of wins, the sum of those calculations must equal 91 wins. Because that’s how many there were – that’s the population.
When we look forward and are trying to predict, I think many of the statistical principles apply. But when we are looking back, we are not trying to predict. We are trying to calculate. It’s different math.

5:39 AM Nov 25th
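
The constraint described in the comment above, that individual win credits must sum to the team's actual 91 wins, can be sketched as a simple proportional rescaling. The 102 total is the expected-wins figure quoted elsewhere in this thread; the individual credits and the names `player_a`, `player_b`, and `rest_of_roster` are HYPOTHETICAL. Proportional scaling is only one way to satisfy the constraint, and the thread disputes whether it is the right one.

```python
def rescale_to_actual(credits: dict, actual_wins: float) -> dict:
    """Scale every player's win credit by one common factor so that the
    total equals the team's actual win count."""
    factor = actual_wins / sum(credits.values())
    return {name: value * factor for name, value in credits.items()}


# HYPOTHETICAL credits that happen to sum to 102 (values are invented).
credits = {"player_a": 12.0, "player_b": 9.0, "rest_of_roster": 81.0}
scaled = rescale_to_actual(credits, actual_wins=91)

print(round(sum(scaled.values()), 6))   # total after rescaling
print(round(1 - 91 / 102, 3))           # fractional cut applied to everyone
```

The common cut comes out a little under 11%, consistent with the reduction quoted from Bill's article.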

aagcobb
Seems to me the most accurate way to give MVP candidates credit for the postseason is to add their postseason win shares to their regular season win shares. The downside being that voting would have to wait until after the postseason is over.
2:16 PM Nov 23rd

steve161
Predictably, Bill's article has inspired responses from all over the sabermetric community. SABR's weekly email pointed me to a few of them and those pointed to a few more. Some of the writers seem to lose sight of the fact that, as I understand him, Bill is talking specifically about WAR in the context of determining an MVP--as Guy says, it's the old value-ability dichotomy. In that context, I think Bill is right on the merits.

What's disappointing is that the discussion is scattered all over the internet. You can follow it if you've got eight hours a day to spend on it, but for those of us who are consumers rather than producers of sabermetrics, it's nigh impossible. Twitter, with its restriction to sound bites, is not the answer.
8:59 AM Nov 23rd

Guy123
"one way to do that is to say that the Yankee win contributions, rather than being allowed to add up to 102, must add up to 91. That’s a good way to do it, and, of course, if you do that, it reduces Judge’s win contribution by 11%. Using WAR, it reduces his win contribution by MORE THAN 11%, because the replacement level remains the same while his win contribution diminishes, so the wins ABOVE THE REPLACEMENT LEVEL are decreased by more like 16%. Judge drops from 8.1 WAR to 6.8."

There’s been lots of great discussion about Bill’s argument that WAR should be constrained by actual team wins. But I think the question of *how* the blame for a shortfall in team wins (or credit for extra wins) should be apportioned deserves more attention. Bill says the penalty is proportional to a player’s WAR (quote above). But why? If we are going to distribute blame for a team’s bad luck in the arrangement of its RS and RA (or bad timing, or whatever you prefer to call it), why should the best players receive nearly all of it while poor performers receive little or none? Bill doesn’t say, and it would be great to hear his argument. But it seems unlikely this is the right answer.

First, it’s hard to imagine that only a team’s “above replacement” runs come at inopportune times. All of NYY’s 858 runs had the potential to impact game outcomes. In actual games we can’t identify the “above replacement” runs, there are just runs. Example: Matt Holliday had 427 PA but zero WAR. Bill’s adjustment says that Holliday gets no adjustment at all (0*.16=0). So even though Holliday made outs, got hits, ran the bases, even played a little in the field, Bill is saying that Holliday—by definition—could not have played *any* role in NYY underperforming its pythag W-L. That can’t generally be true (it might be true for Holliday, I have no idea) -- it’s certainly possible that the timing of Holliday’s outcomes (both hits and outs) was unusually bad.

So even if you think the win penalty should be proportional to a player’s production, it should be tied to runs created (RC) not runs above replacement (RAR). In Judge’s case, he created 17% of the team’s runs. Using B-Ref, NYY had 52 WAR and 48 replacement wins -- for a total of 100 -- but really only won 91 -- so NYY got 9 undeserved WAR (not sure why Bill says 11). We penalize the offense for 60% of that, or -5.4 wins, so Judge’s share is .17*-5.4= -0.9 wins, or about 30% less than Bill’s estimate.

But as the Holliday example also illustrates, it’s not obvious that a player’s impact on a team’s win deficit is proportional to his RC. When a team underperformed, its positive outcomes (H, BB) were less valuable than usual. But it will also be true that its negative outcomes (outs) were *more damaging* than usual. And therefore, weak hitters can have a big impact too: they may come up more often in important situations, and/or perform even worse than usual at those times. Again, I’d like to hear Bill’s argument, but it’s hard to see why only good outcomes – and thus good players – should see their value changed when we link to team wins.

Instead, our starting assumption should be that players are debited/credited based on playing time – their “footprint” on the game -- not their productivity. For Judge: position players account for 60% of wins, and Judge had 10.7% of NYY PAs, so the penalty for Judge is 9*.6*.11 = 0.6 wins, about half of Bill’s estimate. That reduces Judge's WAR from 8.1 to 7.5 -- not a trivial change, to be sure, but also not a huge deal (and 2017 NYY is about as big a pythag departure as we ever see).

(Yes, Bill also argues that Judge had a *disproportionate* impact on NYY's underperformance. But he offers only very fragmentary evidence for that conclusion, so let’s keep that issue separate for now.)

The “wins matter” vs “context-neutral” debate is certainly interesting, and obviously elicits great passion in many quarters. But as a practical matter, it’s much ado about relatively little. Even if you agree that team wins matter, incorporating them into WAR isn’t going to change individual player assessments very much.
7:52 AM Nov 23rd
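
The three ways of charging Judge for the Yankees' shortfall that appear in the comment above can be put side by side. All inputs are the figures quoted there; the only liberty taken is applying Bill's "more like 16%" directly to Judge's 8.1 WAR, which reproduces the quoted drop to 6.8. Note that Bill's figure rests on an 11-win shortfall while the other two use the 9-win shortfall from the Baseball-Reference numbers cited in the comment, so the comparison is a sketch, not a like-for-like audit.

```python
# Figures quoted in the comment above.
TEAM_WAR = 52
REPLACEMENT_WINS = 48
ACTUAL_WINS = 91
POSITION_PLAYER_SHARE = 0.60   # share of the shortfall charged to hitters
JUDGE_WAR = 8.1
JUDGE_RC_SHARE = 0.17          # Judge's share of team runs created
JUDGE_PA_SHARE = 0.107         # Judge's share of team plate appearances
BILL_WAR_CUT = 0.16            # "more like 16%" from the quoted passage

shortfall = TEAM_WAR + REPLACEMENT_WINS - ACTUAL_WINS    # wins not realized
offense_shortfall = shortfall * POSITION_PLAYER_SHARE    # hitters' portion

penalties = {
    "proportional to WAR": JUDGE_WAR * BILL_WAR_CUT,
    "proportional to runs created": offense_shortfall * JUDGE_RC_SHARE,
    "proportional to playing time": offense_shortfall * JUDGE_PA_SHARE,
}

for method, penalty in penalties.items():
    adjusted = JUDGE_WAR - penalty
    print(f"{method}: -{penalty:.1f} wins, adjusted WAR {adjusted:.1f}")
```

The spread between the three methods is about 0.7 wins, which is the practical size of the disagreement over how to apportion the penalty.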

brian14leonard
Bill, are you advocating that the MVP award voting should be delayed until after the World Series ends? If so, would you have advocated this in the pre-playoff days as well? Any idea why the MVP voting was set up to end with the end of the regular season? (In my opinion, the flaw in this was exposed when the 1978 American League East ended up in a winner-take-all game that was technically said to be part of the regular season, but was played after the voting deadline. And Jim Rice won instead of Ron Guidry.)

Thanks,
Brian L.
1:16 PM Nov 22nd

jeffburk
The logic of scaling the value of a player's runs created to his team's won-loss record is compelling. The complication comes in doing it proportionately to each player on the team. There must be occasions when a team's underperformance or overperformance in its won-loss record occurs in spite of, not because of, a player's individual contributions.

If a team's pitching staff blows a lot of close leads, leading to a poorer won-loss record than runs scored and runs allowed indicate, pinning that shortfall on the batters seems counterintuitive. Of course, unless and until we have a better understanding of the factors that lead to the underperformance or overperformance, and a reasonably accurate means of measuring those factors, scaling proportionately is better than nothing. Or is it?
12:29 AM Nov 22nd

jeffburk
Something akin to WPA was written by Gary Skoog and published in the 1987 Baseball Abstract under the title, "Measuring Runs Created: The Value Added Approach." It was published online at the Baseball Think Factory:

http://www.baseballthinkfactory.org/btf/pages/essays/scoog_var.htm
12:14 AM Nov 22nd

COMMENTS (22 Comments, most recent shown first)
