The Biggest Problem With WAR
One time when I was maybe six years old, probably in mid-summer, 1956, I looked out the glass front of my father's little small-town business, and there was a long line of farm trucks loaded with wheat, lined up to go to the Grain Elevator. The Grain Elevator was a block away, and, while it was normal for there to be trucks waiting to deliver their wheat, it was not normal for the line to be so long. I asked my father, or someone, what was going on, and he said the scales here had been running a little loose, whereas the scales in Hoyt (a nearby town of the same size) were very tight, and the word had gotten around. Although I heard those expressions many more times growing up, I didn't really understand what was meant by them until at least 25 years later.
It’s a problem of Comparison Derivatives. Problems of Comparison Derivatives are ubiquitous. I was talking to my son yesterday; he’s an actuary, and he was trying to explain a complex problem of Comparison Derivatives to his bosses who live and work in Europe. A small change in one component of a comparison derivative had caused a few million dollars of projected profit to disappear. They’re also actuaries, of course, but they couldn’t quite understand the problem, or he couldn’t quite make them understand the problem.
The problem is that when working with Comparison Derivatives, a 1% error can manifest itself as a 20% error, a 50% error, a 90% error, or a 200% error; can, and frequently does. I keep looking for an easy way to explain this and haven't found one yet, so I'll go back to the farmers selling wheat to the grain elevator.
To sell grain to the Grain Elevator, the farmer drives onto a scale and weighs the truck, loaded with grain. Then he drives off the scale, off-loads the grain, and weighs the truck again. The difference, of course, is the weight of the grain. The farmer is paid based on the weight of the grain.
The problem is that a small percentage error can have huge consequences to the farmer. State inspectors would visit the elevators regularly to make certain the scales were accurate, but they’re like your bathroom scale. Weigh yourself, step off the scale, step back on the scale, you’re not always going to get EXACTLY the same weight the second time that you did the first.
A good-sized farm truck probably weighs 15,000 pounds, empty. It might carry 5,000 pounds of wheat; they don’t measure it by the pound, they measure it by the bushel, but we’re going to talk pounds. A pound of wheat is worth about 10 cents, I think; in 1956 it would have been about two cents. 5,000 pounds of wheat is worth about $500.
Except not quite. There will be testing equipment on site which tests the moisture component of the wheat (or corn, or whatever). If the wheat tests at 87%, then the farmer gets $435, rather than $500. You don't get paid for the water; you get paid for the dry weight. So you have three measured variables, and three numbers derived from them:
Weight of the truck, loaded:     20,000
Weight of the truck, empty:      15,000
Weight of the grain:              5,000
Value of the grain:                $500
Test weight:                        87%
Payment to farmer:                 $435
But suppose that the scales are 2% off, in the direction that harms the farmer:
Weight of the truck, loaded:     19,600   (2% low)
Weight of the truck, empty:      15,300   (2% high)
Weight of the grain:              4,300
Value of the grain:                $430
Test weight:                        85%   (2% low)
Payment to farmer:              $365.50
The 2% errors cause the farmer to be cheated by 16%. Or suppose that the errors go in favor of the farmer:
Weight of the truck, loaded:     20,400   (2% high)
Weight of the truck, empty:      14,700   (2% low)
Weight of the grain:              5,700
Value of the grain:                $570
Test weight:                        89%   (2% high)
Payment to farmer:              $507.30
The weight of the grain is derived from a comparison of two other weights—a comparison derivative. With a comparison derivative, a 2% error in the components can easily lead to a discrepancy in the calculated value of almost 40%. ($507.30 divided by $365.50 is 1.39).
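The grain arithmetic is easy to check. Here is a minimal sketch (Python, using the article's illustrative figures and the rounded error-case numbers from the tables above) showing 2% component errors turning into a swing of almost 40% in the payment:

```python
def payment(loaded, empty, price_per_lb, test_pct):
    """Payment to the farmer: (loaded - empty) pounds of grain,
    priced per pound, discounted by the moisture test percentage."""
    grain = loaded - empty
    return grain * price_per_lb * test_pct

# True values: 20,000 lb loaded, 15,000 lb empty, 10 cents/lb, 87% test.
true_pay = payment(20_000, 15_000, 0.10, 0.87)   # $435.00

# Every component 2% against the farmer (rounded figures from the table).
low_pay = payment(19_600, 15_300, 0.10, 0.85)    # $365.50

# Every component 2% in the farmer's favor.
high_pay = payment(20_400, 14_700, 0.10, 0.89)   # $507.30

print(round(high_pay / low_pay, 2))              # 1.39
```

The ratio of the two error cases comes out to about 1.39, the nearly-40% discrepancy described above.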
This happens because, by making the comparison between two other weights and then relying on the margin between them, you're shrinking the base, thus making the data unstable. The error occurs on a much larger scale than the resulting measurement. Imagine a tapered object which is 2 feet across at the base and 4 inches across at the top. It is very stable, because the base is large compared to the load it carries. But if it is 4 inches across at the base and 2 feet across at the top, then it is unstable. A comparison derivative captures all of the error in each of its components, and then states all of that error on a much smaller scale, making the error much larger relative to the value, making the calculation unstable.
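A back-of-the-envelope way to see the shrinking base (my framing, not the article's): when you subtract two large measured numbers, their worst-case errors add together, while the base against which the result is judged gets much smaller:

```python
loaded, empty = 20_000, 15_000
err = 0.02                               # 2% possible error on each weighing

# Worst case, the errors on the two weighings push in opposite directions,
# so the absolute errors add up in the difference.
abs_error = loaded * err + empty * err   # 700 pounds
grain = loaded - empty                   # 5,000 pounds

# 700 pounds of possible error on a 5,000-pound base: 14%.
print(f"{abs_error / grain:.0%}")
```

Two 2% weighing errors become a possible 14% error in the grain weight before the test-weight error is even considered.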
That was what the farmers meant by saying that the scales were "loose" or "tight." All the grain elevators paid the same amount for a bushel of wheat or corn. But if the farmers believed that they were not getting fair weights from the elevator, they would say that the scale was "tight." If they felt they were getting good payments, they would say that it was "loose." Many times this was probably just a rumor. Sometimes there were probably some Grain Elevator operators who cheated the farmers with small discrepancies on the scales—particularly the bigger Grain Elevators, which used two different scales to weigh the truck before and after. Tiny discrepancies between the two scales could cost the Elevator thousands of dollars a day—or profit them thousands of dollars a day if the discrepancy was in their favor. So what do you think: was there anybody cheating the farmers by doing that?
The REAL problem with WAR is that it is a Comparison Derivative—thus, highly sensitive to small errors. Let us suppose that a player has a "Run Value", however that is established, of 100 Runs, offense and defense combined. WAR estimates that a replacement-level player would have a value of 86 runs. There is a difference of 14 runs. WAR estimates that each run above the replacement level is worth .11 wins. The WAR of the player is 1.54, which would be rounded off to 1.5:
Run Contribution of the Player:                 100
Run Contribution of Replacement Level Player:    86
Run Contribution above Replacement:              14
Win Value of a Run:                            0.11
Wins Above Replacement (14 * .11):             1.54
WAR, rounded off:                               1.5
ALL of those things are estimates. "Weight" is a hard fact, capable of precisely accurate measurement. In the case of the grain, the sale price is a hard fact. In WAR, none of these things are hard facts. They’re all just estimates. The Run Contribution of the player is an estimate. The Run Contribution of a Replacement Level player can barely be described as an estimate; it is more like a made-up number. The Win Value of a Run is an estimate. So this problem is much, much more serious in WAR than in the Grain Elevator transaction, because we’re dealing here with estimates, rather than hard facts.
Let us suppose that there is a 3% error in each step of that, going in favor of the player:
Run Contribution of the Player:                 103   (3% high)
Run Contribution of Replacement Level Player:    83   (3% low)
Run Contribution above Replacement:              20
Win Value of a Run:                           0.113   (3% high)
Wins Above Replacement (20 * .113):            2.26
WAR, rounded off:                               2.3
With 3% errors, the player’s WAR goes up to 2.3. But suppose that there are 3% errors going against the player:
Run Contribution of the Player:                  97   (3% low)
Run Contribution of Replacement Level Player:    89   (3% high)
Run Contribution above Replacement:               8
Win Value of a Run:                           0.107   (3% low)
Wins Above Replacement (8 * .107):            0.856
WAR, rounded off:                               0.9
The player’s WAR can vary by 150%, based on 3 errors of 3% each. THAT is the basic problem with WAR. I mean, we argue about all kinds of things. We argue about whether clutch data should be included in runs created estimates; we argue about whether the fielding estimates are consistent with external evidence, etc. But the REAL problem is that:
1) Estimates are never exactly right; they are always just estimates, and
2) WAR uses an analytical system to process those estimates which has the potential to enormously magnify whatever inaccuracies are included.
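The magnification in the WAR example can be sketched the same way (the figures are the article's illustrative ones; the function name is mine):

```python
def war(player_runs, repl_runs, wins_per_run):
    """Wins Above Replacement: runs above a replacement-level
    baseline, converted to wins."""
    return (player_runs - repl_runs) * wins_per_run

base = war(100, 86, 0.11)     # 1.54, rounds to 1.5
high = war(103, 83, 0.113)    # every estimate 3% in the player's favor
low  = war(97, 89, 0.107)     # every estimate 3% against the player

print(round(high, 2), round(low, 3))   # 2.26 0.856
```

Three inputs off by 3% each, and the output ranges from 0.9 to 2.3: the 3% errors are magnified roughly fifty-fold at the low end.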
In a WAR estimate, there are dozens and dozens of internal estimates—estimates of runs created, estimates of runs saved by fielding, estimates of the run value of a single, a double, a triple or a double play, estimates of the park effect, etc.
The problem is more serious than that. First of all, as I said, the replacement level is not really an estimate. It's just a made-up number. It could be off by 20 or 25% by itself, before it is magnified.
But that understates the problem, too. WAR assumes that the Replacement Level is a constant. It is NOT a constant; it’s a variable. Some teams, an outfielder gets hurt, it doesn’t really matter because they’ve got a fourth outfielder who is about as good as the starters. Other teams, it matters a lot because their fourth outfielder is a pair of stuffed pajamas. The actual replacement level is specific to the locale. Rather than trying to estimate what the replacement level actually is in this case, WAR simply assumes that it is always the same. To return to the analogy of the wheat farmer, this is like assuming that all trucks weigh the same. It leads to large inaccuracies.
Also, as best I understand this—which is poorly—one of the WAR systems introduces another potential error for pitchers by using a number that represents how many runs the pitcher SHOULD HAVE allowed, based on his strikeouts and walks and home runs allowed, rather than how many runs he ACTUALLY allowed. The system says "this pitcher actually allowed 100 runs, but, because he had really good strikeouts and walks, we'll treat him as if he allowed only 87 runs." That is introducing yet another potential error, by substituting an estimate for a hard fact. That may be what causes them to conclude that the American League's best player in 1966 was not Frank Robinson, who won the Triple Crown and was the unanimous MVP, but Earl Wilson, a pitcher whose ERA was not much better than the league average. And the people who believe in WAR will look at that and say, "Oh, well, if that's what the numbers show, that's what they show," rather than saying what they should say, which is "You know, that's really a stupid thing to say."
Look, I am not saying that WAR has no value, or that no system of WAR could ever be developed that is somewhat reliable. What I am saying is:
1) That the systems of WAR that we have now, while of course they are generally accurate in many cases, are not at all reliable,
2) That the primary reason that they are not reliable is not because of errors in any particular component, but rather, because in the calculation of a comparison derivative, there is the potential that the sum of the errors could be greatly magnified,
3) It is insane to rely on the outcome of a comparison derivative based on estimates, unless those estimates are fantastically accurate, and
4) It will be decades before sabermetrics has accurate estimates of all of the components of performance evaluation, if we ever get there. We certainly will never get there in my lifetime.
As I mentioned early in the article, the problem of comparison derivatives is ubiquitous in our culture. Another place where you see it is in political polling. Suppose that a candidate polls at 42% in one poll and 47% in the next. The networks will assume that this has huge significance, but the "five-point gain" is a comparison derivative, based on two polls, neither of which was reliable in itself. Whether the Dow Jones is "up" 25 points or down 25 points is a comparison derivative. The news networks will make up an explanation for why it is up or why it is down, but that's pure fiction; it's just a comparison derivative of soft numbers.
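To make the polling case concrete, a quick hypothetical sketch (the ±3-point margin of error is my assumption, not the article's):

```python
# Two polls, each with a hypothetical margin of error of +/- 3 points.
poll_1, poll_2, moe = 42, 47, 3

reported_gain = poll_2 - poll_1                 # the reported "five-point gain"
worst_case = (poll_2 - moe) - (poll_1 + moe)    # both errors against the gain
best_case  = (poll_2 + moe) - (poll_1 - moe)    # both errors inflating the gain

print(reported_gain, worst_case, best_case)     # 5 -1 11
```

The "gain" could be anything from a one-point loss to an eleven-point surge; the headline number is a comparison derivative of two soft numbers.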
If based on solid numbers of a certain scale, a comparison derivative is of course meaningful. WAR is analogous to "profit", and "profit" is a comparison derivative. That's why an accountant can juggle the books to make it look like the company's profit margin is high or low. If the population of your hometown was one million ten years ago and is 1.2 million now, that's a meaningful fact. But the basic fact is still "the population of my city is 1.2 million", not "the population of my city is +200,000." The problem with WAR is that it urges people to throw away the basic fact without acknowledging it, to state value based on the comparison of the unstated number with an imaginary line. The term "WAR" should be replaced by "WAG". WAR isn't an actual measurement; it's just a wild-ass guess.
Look, I was the first person to say that a player’s real value was in how much better he was than a replacement-level player. Other people took that idea and ran wild with it, and they did that in good faith and with good intentions. But what I should have said at the time, or maybe I did say this and people just ignored it, I don’t know, is "of course, it is impossible to measure all of the elements of that with the accuracy that you would need to measure them in order to make the calculation meaningful." I didn’t say that then, and I didn’t say it for years afterward because I didn’t want to be kicking dirt into somebody else’s sandbox. But honestly, we’re never going to get to an accurate measurement of a player’s value unless people stop assuming that this kind of a process can be made to work.
I’ll open this up to comments in 24 hours. Thank you for reading.