The Biggest Problem With WAR
One time when I was maybe six years old, probably in mid-summer, 1956, I looked out the glass front of my father's little small-town business, and there was a long line of farm trucks loaded with wheat, lined up to go to the Grain Elevator. The Grain Elevator was a block away, and, while it was normal for there to be trucks waiting to deliver their wheat, it was not normal for the line to be so long. I asked my father, or someone, what was going on, and he said the scales here had been running a little loose, whereas the scales in Hoyt (a nearby town of the same size) were very tight, and the word had gotten around. Although I heard those expressions many more times growing up, I didn't really understand what was meant by them until at least 25 years later.
It’s a problem of Comparison Derivatives. Problems of Comparison Derivatives are ubiquitous. I was talking to my son yesterday; he’s an actuary, and he was trying to explain a complex problem of Comparison Derivatives to his bosses who live and work in Europe. A small change in one component of a comparison derivative had caused a few million dollars of projected profit to disappear. They’re also actuaries, of course, but they couldn’t quite understand the problem, or he couldn’t quite make them understand the problem.
The problem is that when working with Comparison Derivatives, a 1% error can manifest itself as a 20% error, a 50% error, a 90% error, or a 200% error; can, and frequently does. I keep looking for an easy way to explain this and haven't found one yet, so I'll go back to the farmers selling wheat to the grain elevator.
To sell grain to the Grain Elevator, the farmer drives onto a scale and weighs the truck, loaded with grain. Then he drives off the scale, off-loads the grain, and weighs the truck again. The difference, of course, is the weight of the grain. The farmer is paid based on the weight of the grain.
The problem is that a small percentage error can have huge consequences to the farmer. State inspectors would visit the elevators regularly to make certain the scales were accurate, but they’re like your bathroom scale. Weigh yourself, step off the scale, step back on the scale, you’re not always going to get EXACTLY the same weight the second time that you did the first.
A good-sized farm truck probably weighs 15,000 pounds, empty. It might carry 5,000 pounds of wheat; they don’t measure it by the pound, they measure it by the bushel, but we’re going to talk pounds. A pound of wheat is worth about 10 cents, I think; in 1956 it would have been about two cents. 5,000 pounds of wheat is worth about $500.
Except not quite. There will be testing equipment on site which tests the moisture component of the wheat (or corn, or whatever). If the wheat tests at 87%, then the farmer gets $435, rather than $500. You don't get paid for the water; you get paid for the dry weight. So you have three measured variables, and three numbers derived from them:
Weight of the truck, loaded:     20,000
Weight of the truck, empty:      15,000
Weight of the grain:              5,000
Value of the grain:                $500
Test weight:                        87%
Payment to farmer:                 $435
But suppose that the scales are 2% off, in the direction that harms the farmer:
Weight of the truck, loaded:     19,600   (2% low)
Weight of the truck, empty:      15,300   (2% high)
Weight of the grain:              4,300
Value of the grain:                $430
Test weight:                        85%   (2% low)
Payment to farmer:              $365.50
The 2% errors cause the farmer to be cheated by 16%. Or suppose that the errors go in favor of the farmer:
Weight of the truck, loaded:     20,400   (2% high)
Weight of the truck, empty:      14,700   (2% low)
Weight of the grain:              5,700
Value of the grain:                $570
Test weight:                        89%   (2% high)
Payment to farmer:              $507.30
The weight of the grain is derived from a comparison of two other weights—a comparison derivative. With a comparison derivative, a 2% error in the components can easily lead to a discrepancy in the calculated value of almost 40%. ($507.30 divided by $365.50 is 1.39).
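The grain arithmetic is easy to check. Here is a minimal sketch (Python, using the article's illustrative figures and the rounded error-case numbers from the tables above) showing 2% component errors turning into a swing of almost 40% in the payment:

```python
def payment(loaded, empty, price_per_lb, test_pct):
    """Payment to the farmer: (loaded - empty) pounds of grain,
    priced per pound, discounted by the moisture test percentage."""
    grain = loaded - empty
    return grain * price_per_lb * test_pct

# True values: 20,000 lb loaded, 15,000 lb empty, 10 cents/lb, 87% test.
true_pay = payment(20_000, 15_000, 0.10, 0.87)   # $435.00

# Every component 2% against the farmer (rounded figures from the table).
low_pay = payment(19_600, 15_300, 0.10, 0.85)    # $365.50

# Every component 2% in the farmer's favor.
high_pay = payment(20_400, 14_700, 0.10, 0.89)   # $507.30

print(round(high_pay / low_pay, 2))              # 1.39
```

The ratio of the two error cases comes out to about 1.39, the nearly-40% discrepancy described above.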
This happens because, by making the comparison between two other weights and then relying on the margin between them, you're shrinking the base, thus making the data unstable. The error occurs on a much larger scale than the resulting measurement. Imagine a tapered object which is 2 feet across at the base and 4 inches across at the top. It is very stable, because the base is large compared to the load it carries. But if it is 4 inches across at the base and 2 feet across at the top, then it is unstable. A comparison derivative captures all of the error in each of its components, and then states all of that error on a much smaller scale, making the error much larger relative to the value, making the calculation unstable.
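A back-of-the-envelope way to see the shrinking base (my framing, not the article's): when you subtract two large measured numbers, their worst-case errors add together, while the base against which the result is judged gets much smaller:

```python
loaded, empty = 20_000, 15_000
err = 0.02                               # 2% possible error on each weighing

# Worst case, the errors on the two weighings push in opposite directions,
# so the absolute errors add up in the difference.
abs_error = loaded * err + empty * err   # 700 pounds
grain = loaded - empty                   # 5,000 pounds

# 700 pounds of possible error on a 5,000-pound base: 14%.
print(f"{abs_error / grain:.0%}")
```

Two 2% weighing errors become a possible 14% error in the grain weight before the test-weight error is even considered.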
That was what the farmers meant by saying that the scales were "loose" or "tight." All the grain elevators paid the same amount for a bushel of wheat or corn. But if the farmers believed that they were not getting fair weights from the elevator, they would say that the scale was "tight." If they felt they were getting good payments, they would say that it was "loose." Many times this was probably just a rumor. Sometimes there were probably some Grain Elevator operators who cheated the farmers with small discrepancies on the scales—particularly the bigger Grain Elevators, which used two different scales to weigh the truck before and after. Tiny discrepancies between the two scales could cost the Elevator thousands of dollars a day—or profit them thousands of dollars a day if the discrepancy was in their favor. So what do you think: was there anybody cheating the farmers by doing that?
The REAL problem with WAR is that it is a Comparison Derivative—thus, highly sensitive to small errors. Let us suppose that a player has a "Run Value", however that is established, of 100 Runs, offense and defense combined. WAR estimates that a replacement-level player would have a value of 86 runs. There is a difference of 14 runs. WAR estimates that each run above the replacement level is worth .11 wins. The WAR of the player is 1.54, which would be rounded off to 1.5:
Run Contribution of the Player:                 100
Run Contribution of Replacement Level Player:    86
Run Contribution above Replacement:              14
Win Value of a Run:                            0.11
Wins Above Replacement (14 * .11):             1.54
WAR, rounded off:                               1.5
ALL of those things are estimates. "Weight" is a hard fact, capable of precisely accurate measurement. In the case of the grain, the sale price is a hard fact. In WAR, none of these things are hard facts. They’re all just estimates. The Run Contribution of the player is an estimate. The Run Contribution of a Replacement Level player can barely be described as an estimate; it is more like a made-up number. The Win Value of a Run is an estimate. So this problem is much, much more serious in WAR than in the Grain Elevator transaction, because we’re dealing here with estimates, rather than hard facts.
Let us suppose that there is a 3% error in each step of that, going in favor of the player:
Run Contribution of the Player:                 103   (3% high)
Run Contribution of Replacement Level Player:    83   (3% low)
Run Contribution above Replacement:              20
Win Value of a Run:                           0.113   (3% high)
Wins Above Replacement (20 * .113):            2.26
WAR, rounded off:                               2.3
With 3% errors, the player’s WAR goes up to 2.3. But suppose that there are 3% errors going against the player:
Run Contribution of the Player:                  97   (3% low)
Run Contribution of Replacement Level Player:    89   (3% high)
Run Contribution above Replacement:               8
Win Value of a Run:                           0.107   (3% low)
Wins Above Replacement (8 * .107):            0.856
WAR, rounded off:                               0.9
The player’s WAR can vary by 150%, based on 3 errors of 3% each. THAT is the basic problem with WAR. I mean, we argue about all kinds of things. We argue about whether clutch data should be included in runs created estimates; we argue about whether the fielding estimates are consistent with external evidence, etc. But the REAL problem is that:
1) Estimates are never exactly right; they are always just estimates, and
2) WAR uses an analytical system to process those estimates which has the potential to enormously magnify whatever inaccuracies are included.
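The magnification in the WAR example can be sketched the same way (the figures are the article's illustrative ones; the function name is mine):

```python
def war(player_runs, repl_runs, wins_per_run):
    """Wins Above Replacement: runs above a replacement-level
    baseline, converted to wins."""
    return (player_runs - repl_runs) * wins_per_run

base = war(100, 86, 0.11)     # 1.54, rounds to 1.5
high = war(103, 83, 0.113)    # every estimate 3% in the player's favor
low  = war(97, 89, 0.107)     # every estimate 3% against the player

print(round(high, 2), round(low, 3))   # 2.26 0.856
```

Three inputs off by 3% each, and the output ranges from 0.9 to 2.3: the 3% errors are magnified roughly fifty-fold at the low end.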
In a WAR estimate, there are dozens and dozens of internal estimates—estimates of runs created, estimates of runs saved by fielding, estimates of the run value of a single, a double, a triple or a double play, estimates of the park effect, etc.
The problem is more serious than that. First of all, as I said, the replacement level is not really an estimate. It's just a made-up number. It could be off by 20 or 25% by itself, before it is magnified.
But that understates the problem, too. WAR assumes that the Replacement Level is a constant. It is NOT a constant; it’s a variable. Some teams, an outfielder gets hurt, it doesn’t really matter because they’ve got a fourth outfielder who is about as good as the starters. Other teams, it matters a lot because their fourth outfielder is a pair of stuffed pajamas. The actual replacement level is specific to the locale. Rather than trying to estimate what the replacement level actually is in this case, WAR simply assumes that it is always the same. To return to the analogy of the wheat farmer, this is like assuming that all trucks weigh the same. It leads to large inaccuracies.
Also, as best I understand this—which is poorly—one of the WAR systems introduces another potential error for pitchers by using a number that represents how many runs the pitcher SHOULD HAVE allowed, based on his strikeouts and walks and home runs allowed, rather than how many runs he ACTUALLY allowed. The system says "this pitcher actually allowed 100 runs, but, because he had really good strikeouts and walks, we'll treat him as if he allowed only 87 runs." That is introducing yet another potential error, by substituting an estimate for a hard fact. That may be what causes them to conclude that the American League's best player in 1966 was not Frank Robinson, who won the Triple Crown and was the unanimous MVP, but Earl Wilson, a pitcher whose ERA was not much better than the league average. And the people who believe in WAR will look at that and say, "Oh, well, if that's what the numbers show, that's what they show," rather than saying what they should say, which is "You know, that's really a stupid thing to say."
Look, I am not saying that WAR has no value, or that no system of WAR could ever be developed that is somewhat reliable. What I am saying is:
1) That the systems of WAR that we have now, while of course they are generally accurate in many cases, are not at all reliable,
2) That the primary reason that they are not reliable is not because of errors in any particular component, but rather, because in the calculation of a comparison derivative, there is the potential that the sum of the errors could be greatly magnified,
3) It is insane to rely on the outcome of a comparison derivative based on estimates, unless those estimates are fantastically accurate, and
4) It will be decades before sabermetrics has accurate estimates of all of the components of performance evaluation, if we ever get there. We certainly will never get there in my lifetime.
As I mentioned early in the article, the problem of comparison derivatives is ubiquitous in our culture. Another place where you see it is in political polling. Suppose that a candidate polls at 42% in one poll and 47% in the next. The networks will assume that this has huge significance, but the "five-point gain" is a comparison derivative, based on two polls, neither of which was reliable in itself. Whether the Dow Jones is "up" 25 points or down 25 points is a comparison derivative. The news networks will make up an explanation for why it is up or why it is down, but that's pure fiction; it's just a comparison derivative of soft numbers.
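To make the polling case concrete, a quick hypothetical sketch (the ±3-point margin of error is my assumption, not the article's):

```python
# Two polls, each with a hypothetical margin of error of +/- 3 points.
poll_1, poll_2, moe = 42, 47, 3

reported_gain = poll_2 - poll_1                 # the reported "five-point gain"
worst_case = (poll_2 - moe) - (poll_1 + moe)    # both errors against the gain
best_case  = (poll_2 + moe) - (poll_1 - moe)    # both errors inflating the gain

print(reported_gain, worst_case, best_case)     # 5 -1 11
```

The "gain" could be anything from a one-point loss to an eleven-point surge; the headline number is a comparison derivative of two soft numbers.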
If based on solid numbers of a certain scale, a comparison derivative is of course meaningful. WAR is analogous to "profit", and "profit" is a comparison derivative. That's why an accountant can juggle the books to make it look like the company's profit margin is high or low. If the population of your hometown was one million ten years ago and is 1.2 million now, that's a meaningful fact. But the basic fact is still "the population of my city is 1.2 million", not "the population of my city is +200,000." The problem with WAR is that it urges people to throw away the basic fact without acknowledging it, to state value based on the comparison of the unstated number with an imaginary line. The term "WAR" should be replaced by "WAG". WAR isn't an actual measurement; it's just a wild-ass guess.
Look, I was the first person to say that a player’s real value was in how much better he was than a replacement-level player. Other people took that idea and ran wild with it, and they did that in good faith and with good intentions. But what I should have said at the time, or maybe I did say this and people just ignored it, I don’t know, is "of course, it is impossible to measure all of the elements of that with the accuracy that you would need to measure them in order to make the calculation meaningful." I didn’t say that then, and I didn’t say it for years afterward because I didn’t want to be kicking dirt into somebody else’s sandbox. But honestly, we’re never going to get to an accurate measurement of a player’s value unless people stop assuming that this kind of a process can be made to work.
I’ll open this up to comments in 24 hours. Thank you for reading.