The What If Universe
In analyzing value, we try to avoid engaging in speculation about what a player would have done or might have done in a different set of circumstances. This is one of those things that I have explained a hundred times and many people still don’t get it, but let me try again. What is relevant is not what the player might have done or would have done in some other set of circumstances, but rather, the value of what he actually did in the circumstances in which he played. The case that I have found useful to try to illustrate this point is Bill Dickey versus Elston Howard.
Bill Dickey was the Yankee catcher before Yogi Berra; Elston Howard was the Yankee catcher after Yogi Berra, having spent several years backing up Yogi and trying to find at bats in other positions. Both played in the same park, Yankee Stadium. They had careers of similar length, and both were outstanding defensive players. Dickey has twice the career WAR of Howard—55.8 to 27.0. But the thing is, much of Dickey’s advantage, or most of it, is just luck. Because he came to the Yankees with Yogi still in mid-career, Howard didn’t get to play regularly until he was 32 years old. The more relevant point, however, is that whereas Yankee Stadium was great for Bill Dickey, it was absolutely terrible for Elston Howard. Dickey hit 202 home runs in his career—135 in Yankee Stadium, 67 on the road. Howard in his career hit 167 home runs—54 at home, and 113 on the road. Many of the season-by-season home/road splits for both Dickey and Howard are absolutely astonishing—astonishingly good for Dickey, astonishingly bad for Howard.
It is the same park, but it is a tremendously good park for one player, a tremendously bad one for the other. In my mind, there is no doubt that, given the same opportunities, the same luck, Elston Howard would have had as great a career as Bill Dickey. But that is not relevant to how we evaluate them.
Look, if you pursue the question "How good or how great would this player have been, in another set of circumstances or in a neutral set of circumstances?" you can never find the answer. It is impossible to answer that question, because it requires not one, but a whole series of air-castle assumptions. What if the Yankees had needed a catcher when they signed Howard? What if he had signed with Kansas City, whose park was one where Howard always hit great, rather than with the Yankees? What if he had played in the 1930s, like Bill Dickey?
We don’t pursue those questions, because there is no way to answer them. What the player would have done in a neutral park. . .it’s got nothing to do with anything. We stick with the questions that we CAN answer:
How many runs did he create?
In the time and place where he did play, how many runs did it require to win a ballgame?
Therefore, how many games did he win for his team?
This is one of the principles upon which sabermetrics is founded, because it has to be. You can’t actually do it the other way.
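To make those three questions concrete, here is a minimal sketch with made-up numbers. The runs-per-win figure is only a stand-in for "how many runs it took to win a ballgame in that time and place"; in practice it would be derived from the player’s actual league, park, and season.

```python
# A minimal sketch of the three questions, with made-up numbers. The runs-per-win
# figure is a stand-in for "how many runs it took to win a ballgame in that time
# and place"; in practice it would be derived from the player's league and park.

runs_created = 120.0     # runs the player actually created (estimated, hypothetical)
runs_per_win = 10.5      # runs needed to buy one win in his context (hypothetical)

wins_contributed = runs_created / runs_per_win
print(round(wins_contributed, 1))   # about 11.4 wins
```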
BUT.
But the distinction between what we are doing and asking what the player would have done in some other set of circumstances is not always clear. We deal constantly with people who confuse the two sets of questions, and thus try to push the discussion into the what-if world. What would Thurman Munson have done if he hadn’t been killed in a plane accident? People want to argue that Thurman should be a Hall of Famer because he would have been a Hall of Famer. People want to argue that Nolan Arenado isn’t a great player because his hitting stats aren’t that great in road parks. It’s not relevant. What is relevant is the value of what he has done in the place where he has played—not what his value would be in some other park.
This is a bright-line distinction in my mind, but it is not an easy distinction to see in all cases. It’s confusing. We do, of course, park-adjust Nolan Arenado’s hitting stats. Isn’t that the same as adjusting for what he would have done in another park?
It isn’t, of course, because we are adjusting what actually happened. We know how many runs Arenado actually created, or anyway, we can estimate that pretty accurately. We know how many runs Colorado scored and allowed in Colorado, and in other parks. We know how many games they won. We can make calculations based on known facts.
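Here is a rough sketch, with hypothetical numbers, of what "adjusting what actually happened" looks like: a simple one-year park factor figured from the runs that were actually scored, applied to the runs a player actually created. Real park factors average several seasons and do more careful bookkeeping; this is only the shape of the calculation.

```python
# A sketch of adjusting what actually happened, with hypothetical numbers. A simple
# one-year park factor: runs per game by both teams at home, divided by runs per
# game by both teams on the road.

home_runs_both_teams = 950      # runs by both teams in the team's 81 home games (hypothetical)
road_runs_both_teams = 780      # runs by both teams in its 81 road games (hypothetical)
park_factor = (home_runs_both_teams / 81) / (road_runs_both_teams / 81)

# A hitter plays only half his games at home, so the adjustment is diluted by half:
batter_adjustment = (park_factor + 1.0) / 2.0

actual_runs_created = 130.0     # the player's estimated runs created, as they actually happened
park_adjusted_runs = actual_runs_created / batter_adjustment
print(round(park_factor, 2), round(park_adjusted_runs, 1))   # e.g. 1.22 and about 117
```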
Further confusion: there are some limited situations in which I believe it appropriate to consider what a player would have done. I would argue, for example, that Minnie Minoso should be in the Hall of Fame, because it is clear that he would have met all of the Hall of Fame markers had he been a white player and allowed to begin his major league career at age 21 or 22 like Stan Musial or Ted Williams or Joe DiMaggio or the black players five years younger than him.
Or the war-time ballplayers, World War II. I would argue that Joe DiMaggio should be evaluated and compared to other players the same as if he had played through World War II. Isn’t that evaluating him on a what-if projection? What is the difference between saying that Joe DiMaggio would have been a great player in 1944, and saying that Thurman Munson would have been a great player in 1980?
It is the same, and then again it isn’t. First, Joe DiMaggio was alive in 1944. Thurman Munson was dead in 1980. This is a significant difference.
What I am really saying is not that Joe DiMaggio would have been a great player in 1944, but rather, that Joe DiMaggio WAS a great player in 1944. There is no doubt that he was; there is clear proof that he was. No reasonable person could doubt that he was. What we are missing is merely the statistical proof of his greatness—or rather, the 1944 installment of the statistical proof of his greatness. It is much, much, much more speculative to suggest that he was not a great player in 1944 than to suggest that he was.
That’s how I see it, but is this a clear distinction? Not absolutely. It’s confusing. We’re trying to think it through as carefully as we can, but it is confusing.
I am writing this here because I need to come back to it in the next article, which is the concluding article of this long series of little articles.
A Conclusion in Regard to Baseball Reference WAR
We never arrive at a true understanding of anything; we merely progress toward it. A true understanding of anything is hidden behind a series of screens, each of which partially (but only partially) obstructs our view, but all of which, working together, will almost completely hide the truth about any subject.
I say this as a way of defending the sportswriters of 1960, who voted the Cy Young Award to Vern Law when there were several better pitchers in baseball, or the sportswriters of 1982, who voted for Pete Vuckovich over Dave Stieb, or the sportswriters of 1990, who voted for Bob Welch over Roger Clemens.
We are not any smarter than they were; we are merely in a better position to see the truth because there are several screens which have been removed since then. The first of those had to do with Park Effects. While sportswriters and athletes have known at least since the 1880s that ballparks had an effect on statistics, there is knowing something, and then there is knowing it. Pete Palmer, more than anyone else, brought Park Effects out of the dark, and changed us from a culture which knew this in a general way to one which knows it in a specific way—being able to measure it, being able to adjust for it. This removed a screen. This enabled us to understand pitching records better than we previously had. Data changes the discussion.
The sportswriters of my youth universally believed that won-lost records took precedence over ERA because some pitchers had an ability to WIN, rather than just an ability to compete effectively. They won the close games. They bore down when the game was on the line.
I would argue that this was a perfectly reasonable thing for people to believe, in that time. It was what their eyes told them; it was what their experience told them. No one should criticize those sportswriters, in my view, for believing that.
But when the park-effects screen was removed, we had a somewhat clearer view of what was really there. We understood, then, that if one pitcher went 20-12 with a 3.50 ERA and another pitcher went 12-15 with a 2.80 ERA, that might be because of the parks in which they pitched.
The next screen to go down was the belief that offensive support would even out over the course of a season. We know, mathematically, that this could not be true. If two pitchers are assigned random numbers of runs of support, between 0 and 10 runs in each start, the differences would not begin to even out in less than 200 starts; over the course of a season of 35 starts or so, there would logically be very significant random differences between them.
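A crude simulation makes the point. Here each of ten hypothetical teammates gets a random whole number of runs, 0 through 10, in each of 35 starts; the uniform draw is a stand-in for real game-by-game scoring, nothing more.

```python
# Crude simulation: random run support, 35 starts each, ten "teammates".
# The spread in their season averages is pure chance, yet it looks like a
# meaningful difference in "support".
import random

random.seed(1)

def season_support(starts=35):
    """Average run support over a season of random 0-10 run games."""
    return sum(random.randint(0, 10) for _ in range(starts)) / starts

supports = sorted(round(season_support(), 2) for _ in range(10))
print(supports)
# The ten averages typically span a full run or more between the luckiest and
# unluckiest pitcher, and nothing here has anything to do with ability.
```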
Again defending the old-time sportswriters, it would seem, intuitively, that this would even out, that the breaks would even out over the course of a long season. This assumption, wrong though it was, was what we were told was true, growing up in the 50s or 60s, and we were told that not just once, but very regularly. I would guess that the phrase "the breaks even out over the course of a season" was probably used about once per baseball broadcast. A person growing up in that time frame heard this asserted hundreds or thousands of times. It came to be universally accepted, not as an absolute rule but as a general rule.
The removal of this screen had two parts: first, the realization that mathematically this could not be true, and second, the systematic introduction and publication of data which showed that it was not true. In 1977 Larry Christenson was supported by 6.35 runs per game. His teammate, Randy Lerch, got only 4.96. But the next season, 1978, Larry Christenson was supported by 3.4 runs per start. Randy Lerch, still his teammate, was supported by 5.4.
Anyone could see that run support did NOT even out over the course of a season, that it did not come close to doing so. The systematic publication of that data, over the course of decades, re-ordered our thinking on this issue, and thus removed a screen. It took out of play a common misconception which was preventing us from seeing the truth. This enabled us to consider the possibility that there was no such thing as an ability to win, separate from an ability to prevent runs, that differences in won-lost records were not necessarily a function of ability at all, but were, in some limited number of cases, merely something that happened.
Another screen was the Voros McCracken insight: that pitchers had very little control over balls put in play against them. Growing up in the 1960s, we all assumed that a pitcher "allowed" a certain number of hits. This assumption permitted a collateral assumption: that the pitcher was responsible for his own outs. That assumption, in turn, allowed a second collateral assumption: that it didn’t matter whether a pitcher got a ground ball out or a strikeout. What difference does it make, it’s an out either way? That’s what we all believed, in 1980.
Voros proved that, given the number of balls in play behind a given pitcher, the number of outs that he got vs. the number of hits that he allowed was essentially random. A pitcher would allow a .320 batting average on balls in play one year—240 hits in 210 innings—and a .250 batting average on balls in play the next year, 170 hits in 225 innings. The tendency to get outs from balls in play did not follow a pitcher to any appreciable extent. Walter Johnson allowed the same percentage of hits on balls in play, over time, that the worst pitcher in the league did. Warren Spahn allowed the same percentage as Ken Raffensberger. Sandy Koufax allowed the same percentage as Jack Fisher. Steve Carlton allowed the same percentage as Steve Renko. It didn’t matter who the pitcher was. Over time, the percentage of balls in play which became hits and the percentage which became outs were the same—not absolutely, of course, not perfectly and 100%, but generally and most of the time, within a few points. It varied more with the park than it did with the pitcher.
(I should break in here to keep the record straight. There actually IS a "pitcher’s contribution" to the batting average on balls in play (BABIP), and almost everyone now underestimates what that is. We have been taught to think, post-Voros, that that variation is negligible, but it isn’t actually negligible; it is just smaller than we had assumed it was before Voros argued that it was negligible. Sandy Koufax actually DIDN’T allow the same BABIP that Jack Fisher did, although Steve Renko’s BABIP was generally lower than Carlton’s. We over-corrected on that one.)
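For readers who want the bookkeeping, BABIP is figured from the standard stat line by tossing out the plays the fielders never touch. A small sketch, with hypothetical stat lines:

```python
# BABIP from a standard pitching line: home runs and strikeouts are excluded
# because they never become balls in play. The stat lines below are hypothetical.

def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / balls_in_play

# The same (hypothetical) pitcher in back-to-back seasons:
print(round(babip(hits=240, home_runs=20, at_bats=850, strikeouts=130), 3))  # 0.314
print(round(babip(hits=170, home_runs=18, at_bats=790, strikeouts=140), 3))  # 0.241
```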
Anyway, once that screen was removed, we were able to see the real work of the pitcher more clearly. I have counted these as three screens, but you could count them differently. The understanding which follows in the wake of the screen’s removal is distinct from the removal of the screen itself. There are other screens which have been removed; for example, the belief that some pitchers had an ability to work out of trouble. Some pitchers, in any season, "cluster" their hits and walks so that in 200 innings 200 hits and 60 walks will become 80 runs, while, for another pitcher, it might become 60 runs, because the pitcher doesn’t allow clusters of events to form. While that may not be AS untrue as the other misconceptions that used to dominate pitcher evaluation, it is more false than true; the belief in such a skill was more fiction than fact, and it has largely dissipated as the screens have been removed.
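The clustering mechanism itself is easy to see, even if the skill mostly isn’t there. Below is a toy illustration under deliberately crude, hypothetical base-running rules (a single moves every runner up exactly one base; a walk moves only forced runners): the same four singles and three outs produce a run when they are bunched together and nothing when they are spread out.

```python
# Toy illustration of clustering: identical events, different sequence, different runs.
# The base-running rules here are deliberately crude and purely hypothetical.

def runs_for_inning(events):
    """Score one inning given events drawn from '1B', 'BB', 'OUT'."""
    first = second = third = False
    runs = outs = 0
    for ev in events:
        if outs == 3:
            break
        if ev == "OUT":
            outs += 1
        elif ev == "1B":
            # crude assumption: every runner advances exactly one base
            if third:
                runs += 1
            first, second, third = True, first, second
        elif ev == "BB":
            # a walk advances only the forced runners
            if first and second and third:
                runs += 1
            elif first and second:
                third = True
            elif first:
                second = True
            first = True
    return runs

clustered = ["1B", "1B", "1B", "1B", "OUT", "OUT", "OUT"]   # hits bunched together
scattered = ["1B", "OUT", "1B", "OUT", "1B", "OUT", "1B"]   # same events, spread out
print(runs_for_inning(clustered), runs_for_inning(scattered))   # 1 0
```

The code illustrates only the mechanism; whether a pitcher can repeat that kind of sequencing year after year is the part that turned out to be mostly fiction.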
Baseball Reference WAR takes the Park away from the pitcher, adjusting it out of existence, and takes the defensive support away from the pitcher, putting him in the position of pitching in a neutral park with an average defense behind him. That is, it attempts to do these things, but this is what I am trying to get to here. Baseball Reference Pitcher’s WAR needs to be re-worked. It’s got issues. I am saying this as a friend, not as a critic. It works pretty well about 80, 90% of the time. That’s not enough. You’ve got to do better than that. There are too many cases where it doesn’t work. Aaron Nola was pretty clearly not better than Jacob deGrom; it’s not really a reasonable thing to say. The Most Valuable Player in the American League in 1966 was Frank Robinson; it was not Earl Wilson; that’s not really a reasonable thing to say. Teddy Higuera was not better than Roger Clemens in 1986, and Don Drysdale was not the 7th-best starting pitcher in the National League in 1962. These are not reasonable things to say.
Working through this long process, I believe that I have come to an understanding of the problem, and I believe that the 1980 Oakland A’s, and Mike Norris, are the vehicle to explain it to you. The problem has to do with the "RA9def" adjustment in the evaluation. Mike Norris would appear to anyone except a 1980 Cy Young Voter to be the best pitcher in the American League in 1980; he was 22-9 with a 2.53 ERA, pitched 284 innings with 180 strikeouts, 83 walks. Baseball Reference, however, believes that the best pitcher in the league was not Norris but Britt Burns. Burns pitched 46 fewer innings with a higher ERA, 47 fewer strikeouts, 15-13 won-lost record. Norris had an ERA+ of 147, Burns of 143. Norris looks like he’s better, but. . .sure, OK, let’s see what you have to say.
Burns was a good pitcher; I think he was the second-best pitcher in the league, but certainly he was a good pitcher. But why does Baseball Reference think he was better than Norris? Quoting what I wrote earlier:
B-R says that Norris’ defensive support was outstanding (0.56 runs per nine innings, or 18 runs for the season) whereas Burns’ was poor (-.07 runs per nine innings, or negative a run and a half). They thus give Burns a 20-run push for dealing with them lousy fielders, and this puts Burns in front.
That was not something that Norris did, Baseball Reference says. That was the defense. Their defense was outstanding, and this makes Norris look better than he was. But if you adjust for that, Burns was actually better than Norris.
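The arithmetic behind those figures is simple enough to lay out. RA9def is stated per nine innings, so the seasonal effect is RA9def times innings divided by nine; the innings used below come from the article (284 for Norris, 46 fewer, or 238, for Burns).

```python
# Converting RA9def (runs per nine innings of defensive support) into seasonal runs.
# Innings pitched are taken from the article: Norris 284, Burns 284 - 46 = 238.

def seasonal_defense_runs(ra9def, innings):
    """Seasonal runs of defensive help implied by an RA9def figure."""
    return ra9def * innings / 9

norris = seasonal_defense_runs(0.56, 284)    # about +17.7 runs of help
burns = seasonal_defense_runs(-0.07, 238)    # about -1.9 runs
print(round(norris, 1), round(burns, 1), round(norris - burns, 1))
# roughly a 20-run swing in Burns' favor once both pitchers are adjusted
```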
Well, that’s a reasonable argument. It raises issues, but it could be true. But there are four problems.
The first problem is applying a variable as if it was, in essence, a constant. It’s not a true constant, but it is essentially a constant. The 1980 Oakland A’s had five pitchers pitching 200 innings—Matt Keough, Mike Norris, Rick Langford, Steve McCatty and Brian Kingman. The RA9def numbers are .60 for Keough, .56 for Norris, .63 for Langford, .61 for McCatty and .61 for Kingman.
Suppose that you implemented an OFFENSIVE support system with the same approach. Suppose that you said "We realize that the offensive support for different pitchers is not always the same, so we’re going to adjust for that"—but then you adjusted for it basically on the team level, with only very minor variations between teammates: the offensive support for these teammates, you would say, was 4.49 runs per game for Keough, 4.47 for Norris, 4.56 for Langford, 4.52 for McCatty and 4.47 for Kingman.
That’s not the real world. In the real world, offensive support fluctuates wildly between pitchers, and we know this now because we measure this now. But logically, defensive support MUST fluctuate from pitcher to pitcher as much as offensive support does, and actually more on a percentage basis, because it is operating on a smaller base. Dealing with Oakland, 1980, and Chicago, 1980, the White Sox scored only 587 runs on the season, while the A’s scored almost exactly 100 more runs, 686. OK, 99 more.
Using the "standard deviation of support" which is reflected in the "RA9def" numbers, we would thus conclude that every Oakland pitcher had to benefit from more offensive support than any Chicago pitcher. But we know, because we have ACTUAL numbers for offensive support, that this is not true. Richard Dotson, pitching for Chicago, was supported by 4.27 runs per game, while Brian Kingman, working for Oakland, was supported by only 2.87 runs per game.
Why, then, would we not suppose that the same variation MUST happen on defense? The difference between the two teams on offense (99 runs) is roughly equal to the difference between the teams which is attributed to their fielders by the RA9def numbers. Why, then, would we suppose that the variation between the pitchers’ fielding support was so tiny?
My point is that these are not real effects. They are externally derived estimates which are applied as if they were real effects. But the real issue is the SIZE of the numbers.
To repeat the RA9def numbers for the five Oakland starters: .60 for Keough, .56 for Norris, .63 for Langford, .61 for McCatty and .61 for Kingman.
.60 runs per nine innings. That’s 97 runs a season. The operating assumption which leads to the conclusion that Britt Burns was better than Mike Norris in 1980 is that the Oakland defense was 97 runs better than average over the course of the season.
But is that a reasonable number?
Well, the American League average number of runs allowed in 1980 was 729 runs per team. Oakland played in a pretty extreme pitcher’s park, with a park factor of 79 in 1979, 86 in 1980, 89 in 1981. Let’s say that 86 is the right number; it’s going to reduce the expected runs allowed in the park by 6%, which is 44 runs. Playing in that park with average pitching and defense, Oakland can be expected to allow 685 runs.
If their defense is 97 runs better than average, then the expectation for their pitchers is 588 runs. Oakland allowed 642 runs, which was the best figure in the league, per inning, but based on this logic, they were 54 runs WORSE than average, the pitchers were. But is there any evidence that their pitchers were in fact 54 runs worse than average?
Well, the league average in strikeouts was 740. Oakland pitchers struck out 769.
The league average in walks was 516. Oakland pitchers walked 521.
The league average for home runs allowed was 132. Oakland allowed 142.
Where is the evidence that the Oakland pitchers were 54 runs worse than average?
Actually, there IS a logical pathway through the numbers which will get you close to that conclusion. I wouldn’t walk down that pathway myself, but you can get there. Ignore the strikeouts and walks; Oakland pitchers pitched 1% more innings than the American League average, and issued 1% more walks, it’s nothing. It’s the home runs. Oakland had a park factor of 86, but a park factor for home runs of 70. The park should reduce home runs by 14%. The league average of home runs allowed was 132; reduce that by 14%, that’s 115. Oakland A’s pitchers should have allowed 115 home runs. They allowed 142. That’s 27 homers. Oakland pitchers were average in strikeouts and walks allowed, but they allowed 27 more home runs than an average pitching staff should have allowed. If each home run is worth, let’s say, 1.65 runs, then that’s about 45 runs. It comes pretty close to being a closed-end explanation for the Oakland pitchers’ data (the arithmetic is sketched in code after the list below):
Average pitchers in an average park would have allowed 729 runs,
The Oakland Coliseum reduced that expectation to 685 runs,
The Oakland defense was 97 runs better than average, which reduces that expectation to 588 runs,
The A’s allowed 642 runs, which was the fewest in the league per inning, but was actually 54 runs OVER expectation,
Which is largely explained by the fact that the A’s pitchers gave up 27 more home runs than they could have been expected to allow, given the park that they were playing in.
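A few lines of arithmetic, using the figures above. Team innings of roughly 1,455 are an assumption on my part; everything else comes from the figures cited in the text.

```python
# Walking the pathway above in plain arithmetic. Team innings (~1455) are assumed;
# the other inputs come straight from the figures cited in the text.

league_avg_runs_allowed = 729
expected_in_park = league_avg_runs_allowed * (1 - 0.06)     # park knocks off ~6% -> ~685

defense_runs = 0.60 * 1455 / 9                              # RA9def of .60 -> ~97 runs
expected_from_pitchers = expected_in_park - defense_runs    # ~588

actual_allowed = 642
over_expectation = actual_allowed - expected_from_pitchers  # ~54 runs over expectation

extra_homers = 142 - 115                                    # HR allowed minus park expectation
runs_from_homers = extra_homers * 1.65                      # ~45 runs, most of that gap

print(round(expected_in_park), round(defense_runs), round(expected_from_pitchers),
      round(over_expectation), extra_homers, round(runs_from_homers))
```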
OK, so there IS a pathway through the numbers that will get you to that conclusion. But is that the RIGHT way to walk through those numbers? Is that the best way?
There is a risk of confusing park effects with defense. Again quoting what I wrote in the Norris/Burns comparison in the 1980 section:
What I think has happened is that ballparks and fielders sometimes have very similar statistical signatures. The Oakland Coliseum, with large foul territory and cool, damp air, was a tough place to hit. The A’s as a team hit .251 at home in 1980, .267 on the road, with a slugging percentage 42 points higher on the road. Their opponents hit .234 at home, .254 on the road, with a slugging percentage 65 points higher on the road.
The park reduces the in-play average and the other run elements—but good defensive play also reduces the in-play average and the other run elements. This creates the possibility of confusion between the two. What I THINK has happened here is that the park’s run-suppressing characteristics are being double-counted as if they were also evidence of superior defense, thus adjusting twice for the park.
I am not saying that that is necessarily what happened; the peculiar numbers for Oakland pitching and defense may result from that, or they may result from some other cause. But that’s the second problem. The third problem is whether you should be doing this or not. Voros McCracken fundamentally changed sabermetric analysis 20 years ago, with the realization that the outcome of balls in play against a given pitcher in a given year is just random, or mostly just random.
I would be concerned that Baseball Reference may have trapped itself on the wrong side of the Voros revelation. I wonder if perhaps Baseball Reference-WAR is attributing to the fielders a deviation in performance which Voros established does not exist. What is "defense"? It is the outcome of balls in play, isn’t it?
Yes, of course, some defenses are better than other defenses. Yes, we can measure that. But what Baseball Reference is attempting to do, is to measure that and hold the pitcher responsible for it. They THINK that because they’re attributing it to the defense, rather than directly to the pitcher, it is OK, but it isn’t OK. When you give it to the defense, you take it away from the pitcher. The effect is the same, whether you attribute it directly to the pitcher, or whether you attribute it to the defense and then take the defense away from the pitcher. Either way, it is mostly just random.
There is an argument, at least, that the Baseball Reference approach is treating noise in the data as if it was valid information.
I should say, before I go any further, that separating what the pitcher has done from what the fielders behind him have done is a tremendously tricky business, and that none of us really knows how to do that. The way that it is done in Win Shares is not too whippy, either. There are serious problems with the way that it works in Win Shares.
But I am not convinced that Baseball Reference-WAR has this right. They have a closed-end system, but there are MANY closed-end, logical systems which could be chosen. They have a pathway through the statistics toward an answer, but there are thousands of possible pathways through the data. The issue is whether we should believe that it is true.
I do not believe, and I do not see how anyone reasonably could be expected to believe, that the Oakland A’s pitchers in 1980 were 54 runs worse than average, despite leading the league in ERA, or that their fielders were 97 runs better than average. I don’t believe it.
It rather seems to me that the Baseball Reference Pitcher’s WAR has been constructed in a what-if universe. Returning for a moment to Jack Kralick in 1961, I think that what Baseball Reference has concluded by the logical pathway that they have chosen is not that Jack Kralick was the best pitcher in the American League, but rather, that in another park with another defense behind him, he would have been the best pitcher in the league. And I would suggest that we don’t actually know that. It’s highly speculative. That’s putting it kindly. It’s wrong.
I estimate that the Baseball Reference Pitcher’s WAR system is, let’s say, 85% accurate. That’s not a bad number. I have propagated dozens and dozens of analytical systems, over the course of my career, that were not 85% accurate. Hell, I have put out there systems that were, in retrospect, not 50% accurate, 50% accurate meaning that you don’t know a damned thing.
But is 85% accurate good enough, given the extent to which B-R WAR is relied upon in the public discussion? Salaries are negotiated based on this number. Front offices rely on this number in structuring teams. Hall of Fame campaigns cite this number for support.
I would argue that
a) The accuracy of this system does not justify the extent to which it is relied upon in the public discourse,
b) The level of accuracy needed in a number to be relied on to this extent should be AT LEAST 99%, and preferably 99.9%,
c) We probably can’t get to 99% very soon, so we probably should be more cautious than we are in citing these kinds of numbers, but
d) It would actually be relatively easy to improve the system so that the accuracy would go to 93-95%.
What would we have to do to the system to boost it up to 93-95%? Seven things (a rough sketch of two of them follows the list):
First, get rid of the RA9def adjustment, which is a speculative number, derived externally to the pitcher,
Second, replace it with an estimate of the pitcher’s individual defensive support, derived from the pitcher’s own data such as his BABIP, his unearned runs allowed, and his stolen bases allowed/caught stealing compared to the team norms,
Third, be absolutely certain that the "defensive support number" is park-adjusted,
Fourth, split the credit for deviations from park-adjusted BABIP between the pitcher and the defense, rather than crediting it all to the defense and thus taking it away from the pitcher,
Fifth, place a realistic boundary on the defensive support number. I would suggest 0.25 runs per nine innings, positive or negative, as a realistic upper boundary,
Sixth, figure the pitcher’s level of effectiveness based on his three true outcomes, and
Seventh, make your final estimate of the pitcher’s level of effectiveness a blend of the runs-allowed based measurement and the three true outcome estimate, perhaps a 70-30 blend (weighted toward actual runs allowed) or an 80-20 blend.
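This is not anyone’s actual implementation, least of all Baseball Reference’s; it is only a sketch of what the fifth and seventh suggestions might look like in code, with hypothetical inputs and weights.

```python
# Sketch of suggestions five and seven: cap the defensive-support estimate, then
# blend a runs-allowed measure with a three-true-outcome (FIP-style) measure.
# All numbers and weights below are hypothetical.

def bounded_defensive_support(estimated_ra9_support, cap=0.25):
    """Clamp the per-nine-innings defensive support estimate to +/- cap (suggestion five)."""
    return max(-cap, min(cap, estimated_ra9_support))

def blended_effectiveness(runs_allowed_based, three_true_outcome_based, weight=0.7):
    """Weight actual runs allowed at 70%, the three-true-outcome estimate at 30% (suggestion seven)."""
    return weight * runs_allowed_based + (1 - weight) * three_true_outcome_based

# A hypothetical pitcher: a raw defensive-support estimate of +0.56 runs per nine
# gets capped at +0.25, and his runs-allowed rate is blended with a TTO estimate.
print(bounded_defensive_support(0.56))                    # 0.25
print(round(blended_effectiveness(3.10, 3.60), 2))        # 3.25
```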
If Baseball Reference were to do those things, I believe that they would very significantly improve the accuracy of their evaluations.