The What If Universe
In analyzing value, we try to avoid engaging in speculation about what a player would have done or might have done in a different set of circumstances. This is one of those things that I have explained a hundred times and many people still don’t get it, but let me try again. What is relevant is not what the player might have done or would have done in some other set of circumstances, but rather, the value of what he actually did in the circumstances in which he played. The case that I have found useful to try to illustrate this point is Bill Dickey versus Elston Howard.
Bill Dickey was the Yankee catcher before Yogi Berra; Elston Howard was the Yankee catcher after Yogi Berra, having spent several years backing up Yogi and trying to find at bats in other positions. Both played in the same park, Yankee Stadium. They had careers of similar length, and both were outstanding defensive players. Dickey has twice the career WAR of Howard—55.8 to 27.0. But the thing is, much of Dickey’s advantage, or most of it, is just luck. Because he came to the Yankees with Yogi still in mid-career, Howard didn’t get to play regularly until he was 32 years old. The more relevant point, however, is that whereas Yankee Stadium was great for Bill Dickey, it was absolutely terrible for Elston Howard. Dickey hit 202 home runs in his career—135 in Yankee Stadium, 67 on the road. Howard in his career hit 167 home runs—54 at home, and 113 on the road. Many of the season-by-season home/road splits for both Dickey and Howard are absolutely astonishing—astonishingly good for Dickey, astonishingly bad for Howard.
It is the same park, but it is a tremendously good park for one player, a tremendously bad one for the other. In my mind, there is no doubt that, given the same opportunities, the same luck, Elston Howard would have had as great a career as Bill Dickey. But that is not relevant to how we evaluate them.
Look, if you pursue the question "How good or how great would this player have been, in another set of circumstances or in a neutral set of circumstances?" you can never find the answer. It is impossible to answer that question, because it requires not one, but a whole series of air-castle assumptions. What if the Yankees had needed a catcher when they signed Howard? What if he had signed with Kansas City, whose park was one where Howard always hit great, rather than with the Yankees? What if he had played in the 1930s, like Bill Dickey?
We don’t pursue those questions, because there is no way to answer them. What the player would have done in a neutral park. . .it’s got nothing to do with anything. We stick with the questions that we CAN answer:
How many runs did he create?
In the time and place where he did play, how many runs did it require to win a ballgame?
Therefore, how many games did he win for his team?
This is one of the principles upon which sabermetrics is founded, because it has to be. You can’t actually do it the other way.
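To make those three questions concrete, here is a minimal sketch with made-up numbers. The runs-per-win figure is only a stand-in for "how many runs it took to win a ballgame in that time and place"; in practice it would be derived from the player’s actual league, park, and season.

```python
# A minimal sketch of the three questions, with made-up numbers. The runs-per-win
# figure is a stand-in for "how many runs it took to win a ballgame in that time
# and place"; in practice it would be derived from the player's league and park.

runs_created = 120.0     # runs the player actually created (estimated, hypothetical)
runs_per_win = 10.5      # runs needed to buy one win in his context (hypothetical)

wins_contributed = runs_created / runs_per_win
print(round(wins_contributed, 1))   # about 11.4 wins
```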
BUT.
But the distinction between what we are doing and asking what the player would have done in some other set of circumstances is not always clear. We deal constantly with people who confuse the two sets of questions, and thus try to push the discussion into the what-if world. What would Thurman Munson have done if he hadn’t been killed in a plane accident? People want to argue that Thurman should be a Hall of Famer because he would have been a Hall of Famer. People want to argue that Nolan Arenado isn’t a great player because his hitting stats aren’t that great in road parks. It’s not relevant. What is relevant is the value of what he has done in the place where he has played—not what his value would be in some other park.
This is a bright-line distinction in my mind, but it is not an easy distinction to see in all cases. It’s confusing. We do, of course, park-adjust Nolan Arenado’s hitting stats. Isn’t that the same as adjusting for what he would have done in another park?
It isn’t, of course, because we are adjusting what actually happened. We know how many runs Arenado actually created, or anyway, we can estimate that pretty accurately. We know how many runs Colorado scored and allowed in Colorado, and in other parks. We know how many games they won. We can make calculations based on known facts.
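Here is a rough sketch, with hypothetical numbers, of what "adjusting what actually happened" looks like: a simple one-year park factor figured from the runs that were actually scored, applied to the runs a player actually created. Real park factors average several seasons and do more careful bookkeeping; this is only the shape of the calculation.

```python
# A sketch of adjusting what actually happened, with hypothetical numbers. A simple
# one-year park factor: runs per game by both teams at home, divided by runs per
# game by both teams on the road.

home_runs_both_teams = 950      # runs by both teams in the team's 81 home games (hypothetical)
road_runs_both_teams = 780      # runs by both teams in its 81 road games (hypothetical)
park_factor = (home_runs_both_teams / 81) / (road_runs_both_teams / 81)

# A hitter plays only half his games at home, so the adjustment is diluted by half:
batter_adjustment = (park_factor + 1.0) / 2.0

actual_runs_created = 130.0     # the player's estimated runs created, as they actually happened
park_adjusted_runs = actual_runs_created / batter_adjustment
print(round(park_factor, 2), round(park_adjusted_runs, 1))   # e.g. 1.22 and about 117
```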
Further confusion: there are some limited situations in which I believe it appropriate to consider what a player would have done. I would argue, for example, that Minnie Minoso should be in the Hall of Fame, because it is clear that he would have met all of the Hall of Fame markers had he been a white player and allowed to begin his major league career at age 21 or 22 like Stan Musial or Ted Williams or Joe DiMaggio or the black players five years younger than him.
Or the war-time ballplayers, World War II. I would argue that Joe DiMaggio should be evaluated and compared to other players the same as if he had played through World War II. Isn’t that evaluating him on a what-if projection? What is the difference between saying that Joe DiMaggio would have been a great player in 1944, and saying that Thurman Munson would have been a great player in 1980?
It is the same, and then again it isn’t. First, Joe DiMaggio was alive in 1944. Thurman Munson was dead in 1980. This is a significant difference.
What I am really saying is not that Joe DiMaggio would have been a great player in 1944, but rather, that Joe DiMaggio WAS a great player in 1944. There is no doubt that he was; there is clear proof that he was. No reasonable person could doubt that he was. What we are missing is merely the statistical proof of his greatness—or rather, the 1944 installment of the statistical proof of his greatness. It is much, much, much more speculative to suggest that he was not a great player in 1944 than to suggest that he was.
That’s how I see it, but is this a clear distinction? Not absolutely. It’s confusing. We’re trying to think it through as carefully as we can, but it is confusing.
I am writing this here because I need to come back to it in the next article, which is the concluding article of this long series of little articles.
A Conclusion in Regard to Baseball Reference WAR
We never arrive at a true understanding of anything; we merely progress toward it. A true understanding of anything is hidden behind a series of screens, each of which partially (but only partially) obstructs our view, but all of which, working together, will almost completely hide the truth about any subject.
I say this as a way of defending the sportswriters of 1960, who voted the Cy Young Award to Vern Law when there were several better pitchers in baseball, or the sportswriters of 1982, who voted for Pete Vuckovich over Dave Stieb, or the sportswriters of 1990, who voted for Bob Welch over Roger Clemens.
We are not any smarter than they were; we are merely in a better position to see the truth because there are several screens which have been removed since then. The first of those had to do with Park Effects. While sportswriters and athletes have known at least since the 1880s that ballparks had an effect on statistics, there is knowing something, and then there is knowing it. Pete Palmer, more than anyone else, brought Park Effects out of the dark, and changed us from a culture which knew this in a general way to one which knows it in a specific way—being able to measure it, being able to adjust for it. This removed a screen. This enabled us to understand pitching records better than we previously had. Data changes the discussion.
The sportswriters of my youth universally believed that won-lost records took precedence over ERA because some pitchers had an ability to WIN, rather than just an ability to compete effectively. They won the close games. They bore down when the game was on the line.
I would argue that this was a perfectly reasonable thing for people to believe, in that time. It was what their eyes told them; it was what their experience told them. No one should criticize those sportswriters, in my view, for believing that.
But when the park-effects screen was removed, we had a somewhat clearer view of what was really there. We understood, then, that if one pitcher went 20-12 with a 3.50 ERA and another pitcher went 12-15 with a 2.80 ERA, that might be because of the parks in which they pitched.
The next screen to go down was the belief that offensive support would even out over the course of a season. We know, mathematically, that this could not be true. If two pitchers are assigned random numbers of runs of support, between 0 and 10 runs in each start, the differences would not begin to even out in less than 200 starts; over the course of a season of 35 starts or so, there would logically be very significant random differences between them.
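A crude simulation makes the point. Here each of ten hypothetical teammates gets a random whole number of runs, 0 through 10, in each of 35 starts; the uniform draw is a stand-in for real game-by-game scoring, nothing more.

```python
# Crude simulation: random run support, 35 starts each, ten "teammates".
# The spread in their season averages is pure chance, yet it looks like a
# meaningful difference in "support".
import random

random.seed(1)

def season_support(starts=35):
    """Average run support over a season of random 0-10 run games."""
    return sum(random.randint(0, 10) for _ in range(starts)) / starts

supports = sorted(round(season_support(), 2) for _ in range(10))
print(supports)
# The ten averages typically span a full run or more between the luckiest and
# unluckiest pitcher, and nothing here has anything to do with ability.
```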
Again defending the old-time sportswriters, it would seem, intuitively, that this would even out, that the breaks would even out over the course of a long season. This assumption, wrong though it was, was what we were told was true, growing up in the 50s or 60s, and we were told that not just once, but very regularly. I would guess that the phrase "the breaks even out over the course of a season" was probably used about once per baseball broadcast. A person growing up in that time frame heard this asserted hundreds or thousands of times. It came to be universally accepted, not as an absolute rule but as a general rule.
The removal of this screen had two parts: first, the realization that mathematically this could not be true, and second, the systematic introduction and publication of data which showed that it was not true. In 1977 Larry Christenson was supported by 6.35 runs per game. His teammate, Randy Lerch, got only 4.96. But the next season, 1978, Larry Christenson was supported by 3.4 runs per start. Randy Lerch, still his teammate, was supported by 5.4.
Anyone could see that run support did NOT even out over the course of a season, that it did not come close to doing so. The systematic publication of that data, over the course of decades, re-ordered our thinking on this issue, and thus removed a screen. It took out of play a common misconception which was preventing us from seeing the truth. This enabled us to consider the possibility that there was no such thing as an ability to win, separate from an ability to prevent runs, that differences in won-lost records were not necessarily a function of ability at all, but were, in some limited number of cases, merely something that happened.
Another screen was the Voros McCracken insight: that pitchers had very little control over balls put in play against them. Growing up in the 1960s, we all assumed that a pitcher "allowed" a certain number of hits. This assumption permitted a collateral assumption: that the pitcher was responsible for his own outs. That assumption, in turn, allowed a second collateral assumption: that it didn’t matter whether a pitcher got a ground ball out or a strikeout. What difference does it make, it’s an out either way? That’s what we all believed, in 1980.
Voros proved that, given the number of balls in play behind a given pitcher, the number of outs that he got vs. the number of hits that he allowed was essentially random. A pitcher would allow a .320 batting average on balls in play one year—240 hits in 210 innings—and a .250 batting average on balls in play the next year, 170 hits in 225 innings. The tendency to get outs from balls in play did not follow a pitcher to any appreciable extent. Walter Johnson allowed the same percentage of hits on balls in play, over time, that the worst pitcher in the league did. Warren Spahn allowed the same percentage as Ken Raffensberger. Sandy Koufax allowed the same percentage as Jack Fisher. Steve Carlton allowed the same percentage as Steve Renko. It didn’t matter who the pitcher was. Over time, the percentage of balls in play which became hits and the percentage which became outs were the same—not absolutely, of course, not perfectly and 100%, but generally and most of the time, within a few points. It varied more with the park than it did with the pitcher.
(I should break in here to keep the record straight. There actually IS a "pitcher’s contribution" to the batting average on balls in play (BABIP), and almost everyone now underestimates what that is. We have been taught to think, post-Voros, that that variation is negligible, but it isn’t actually negligible; it is just smaller than we had assumed it was before Voros argued that it was negligible. Sandy Koufax actually DIDN’T allow the same BABIP that Jack Fisher did, although Steve Renko’s BABIP was generally lower than Carlton’s. We over-corrected on that one.)
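For readers who want the bookkeeping, BABIP is figured from the standard stat line by tossing out the plays the fielders never touch. A small sketch, with hypothetical stat lines:

```python
# BABIP from a standard pitching line: home runs and strikeouts are excluded
# because they never become balls in play. The stat lines below are hypothetical.

def babip(hits, home_runs, at_bats, strikeouts, sac_flies=0):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / balls_in_play

# The same (hypothetical) pitcher in back-to-back seasons:
print(round(babip(hits=240, home_runs=20, at_bats=850, strikeouts=130), 3))  # 0.314
print(round(babip(hits=170, home_runs=18, at_bats=790, strikeouts=140), 3))  # 0.241
```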
Anyway, once that screen was removed, we were able to see the real work of the pitcher more clearly. I have counted these as three screens, but you could count them differently. The understanding which follows in the wake of the screen’s removal is distinct from the removal of the screen itself. There are other screens which have been removed; for example, the belief that some pitchers had an ability to work out of trouble. Some pitchers, in any season, "cluster" their hits and walks so that in 200 innings 200 hits and 60 walks will become 80 runs, while, for another pitcher, it might become 60 runs, because the pitcher doesn’t allow clusters of events to form. While that may not be AS untrue as the other misconceptions that used to dominate pitcher evaluation, it is more false than true; the belief in such a skill was more fiction than fact, and it has largely dissipated as the screens have been removed.
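The clustering mechanism itself is easy to see, even if the skill mostly isn’t there. Below is a toy illustration under deliberately crude, hypothetical base-running rules (a single moves every runner up exactly one base; a walk moves only forced runners): the same four singles and three outs produce a run when they are bunched together and nothing when they are spread out.

```python
# Toy illustration of clustering: identical events, different sequence, different runs.
# The base-running rules here are deliberately crude and purely hypothetical.

def runs_for_inning(events):
    """Score one inning given events drawn from '1B', 'BB', 'OUT'."""
    first = second = third = False
    runs = outs = 0
    for ev in events:
        if outs == 3:
            break
        if ev == "OUT":
            outs += 1
        elif ev == "1B":
            # crude assumption: every runner advances exactly one base
            if third:
                runs += 1
            first, second, third = True, first, second
        elif ev == "BB":
            # a walk advances only the forced runners
            if first and second and third:
                runs += 1
            elif first and second:
                third = True
            elif first:
                second = True
            first = True
    return runs

clustered = ["1B", "1B", "1B", "1B", "OUT", "OUT", "OUT"]   # hits bunched together
scattered = ["1B", "OUT", "1B", "OUT", "1B", "OUT", "1B"]   # same events, spread out
print(runs_for_inning(clustered), runs_for_inning(scattered))   # 1 0
```

The code illustrates only the mechanism; whether a pitcher can repeat that kind of sequencing year after year is the part that turned out to be mostly fiction.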
Baseball Reference WAR takes the Park away from the pitcher, adjusting it out of existence, and takes the defensive support away from the pitcher, putting him in the position of pitching in a neutral park with an average defense behind him. That is, it attempts to do these things, but this is what I am trying to get to here. Baseball Reference Pitcher’s WAR needs to be re-worked. It’s got issues. I am saying this as a friend, not as a critic. It works pretty well about 80, 90% of the time. That’s not enough. You’ve got to do better than that. There are too many cases where it doesn’t work. Aaron Nola was pretty clearly not better than Jacob deGrom; it’s not really a reasonable thing to say. The Most Valuable Player in the American League in 1966 was Frank Robinson; it was not Earl Wilson; that’s not really a reasonable thing to say. Teddy Higuera was not better than Roger Clemens in 1986, and Don Drysdale was not the 7th-best starting pitcher in the National League in 1962. These are not reasonable things to say.
Working through this long process, I believe that I have come to an understanding of the problem, and I believe that the 1980 Oakland A’s, and Mike Norris, are the vehicle to explain it to you. The problem has to do with the "RA9def" adjustment in the evaluation. Mike Norris would appear to anyone except a 1980 Cy Young Voter to be the best pitcher in the American League in 1980; he was 22-9 with a 2.53 ERA, pitched 284 innings with 180 strikeouts, 83 walks. Baseball Reference, however, believes that the best pitcher in the league was not Norris but Britt Burns. Burns pitched 46 fewer innings with a higher ERA, 47 fewer strikeouts, 15-13 won-lost record. Norris had an ERA+ of 147, Burns of 143. Norris looks like he’s better, but. . .sure, OK, let’s see what you have to say.
Burns was a good pitcher; I think he was the second-best pitcher in the league, but certainly he was a good pitcher. But why does Baseball Reference think he was better than Norris? Quoting what I wrote earlier:
B-R says that Norris’ defensive support was outstanding (0.56 runs per nine innings, or 18 runs for the season) whereas Burns’ was poor (-.07 runs per nine innings, or negative a run and a half). They thus give Burns a 20-run push for dealing with them lousy fielders, and this puts Burns in front.
That was not something that Norris did, Baseball Reference says. That was the defense. Their defense was outstanding, and this makes Norris look better than he was. But if you adjust for that, Burns was actually better than Norris.
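The arithmetic behind those figures is simple enough to lay out. RA9def is stated per nine innings, so the seasonal effect is RA9def times innings divided by nine; the innings used below come from the article (284 for Norris, 46 fewer, or 238, for Burns).

```python
# Converting RA9def (runs per nine innings of defensive support) into seasonal runs.
# Innings pitched are taken from the article: Norris 284, Burns 284 - 46 = 238.

def seasonal_defense_runs(ra9def, innings):
    """Seasonal runs of defensive help implied by an RA9def figure."""
    return ra9def * innings / 9

norris = seasonal_defense_runs(0.56, 284)    # about +17.7 runs of help
burns = seasonal_defense_runs(-0.07, 238)    # about -1.9 runs
print(round(norris, 1), round(burns, 1), round(norris - burns, 1))
# roughly a 20-run swing in Burns' favor once both pitchers are adjusted
```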
Well, that’s a reasonable argument. It raises issues, but it could be true. But there are four problems.
The first problem is applying a variable as if it was, in essence, a constant. It’s not a true constant, but it is essentially a constant. The 1980 Oakland A’s had five pitchers pitching 200 innings—Matt Keough, Mike Norris, Rick Langford, Steve McCatty and Brian Kingman. The RA9def numbers are .60 for Keough, .56 for Norris, .63 for Langford, .61 for McCatty and .61 for Kingman.
Suppose that you implemented an OFFENSIVE support system with the same approach. Suppose that you said "We realize that the offensive support for different pitchers is not always the same, so we’re going to adjust for that"—but then you adjusted for it basically on the team level, with only very minor variations between teammates: the offensive support for these teammates, you would say, was 4.49 runs per game for Keough, 4.47 for Norris, 4.56 for Langford, 4.52 for McCatty and 4.47 for Kingman.
That’s not the real world. In the real world, offensive support fluctuates wildly between pitchers, and we know this now because we measure this now. But logically, defensive support MUST fluctuate from pitcher to pitcher as much as offensive support does, and actually more on a percentage basis, because it is operating on a smaller base. Dealing with Oakland, 1980, and Chicago, 1980, the White Sox scored only 587 runs on the season, while the A’s scored almost exactly 100 more runs, 686. OK, 99 more.
Using the "standard deviation of support" which is reflected in the "RA9def" numbers, we would thus conclude that every Oakland pitcher had to benefit from more offensive support than any Chicago pitcher. But we know, because we have ACTUAL numbers for offensive support, that this is not true. Richard Dotson, pitching for Chicago, was supported by 4.27 runs per game, while Brian Kingman, working for Oakland, was supported by only 2.87 runs per game.
Why, then, would we not suppose that the same variation MUST happen on defense? The difference between the two teams on offense (99 runs) is roughly equal to the difference between the teams which is attributed to their fielders by the RA9def numbers. Why, then, would we suppose that the variation between the pitchers’ fielding support was so tiny?
My point is that these are not real effects. They are externally derived estimates which are applied as if they were real effects. But the real issue is the SIZE of the numbers.
To repeat the RA9def numbers for the five Oakland starters: .60 for Keough, .56 for Norris, .63 for Langford, .61 for McCatty and .61 for Kingman.
.60 runs per nine innings. That’s 97 runs a season. The operating assumption which leads to the conclusion that Britt Burns was better than Mike Norris in 1980 is that the Oakland defense was 97 runs better than average over the course of the season.
But is that a reasonable number?
Well, the American League average number of runs allowed in 1980 was 729 runs per team. Oakland played in a pretty extreme pitcher’s park, with a park factor of 79 in 1979, 86 in 1980, 89 in 1981. Let’s say that 86 is the right number; it’s going to reduce the expected runs allowed in the park by 6%, which is 44 runs. Playing in that park with average pitching and defense, Oakland can be expected to allow 685 runs.
If their defense is 97 runs better than average, then the expectation for their pitchers is 588 runs. Oakland allowed 642 runs, which was the best figure in the league, per inning, but based on this logic, they were 54 runs WORSE than average, the pitchers were. But is there any evidence that their pitchers were in fact 54 runs worse than average?
Well, the league average in strikeouts was 740. Oakland pitchers struck out 769.
The league average in walks was 516. Oakland pitchers walked 521.
The league average for home runs allowed was 132. Oakland allowed 142.
Where is the evidence that the Oakland pitchers were 54 runs worse than average?
Actually, there IS a logical pathway through the numbers which will get you close to that conclusion. I wouldn’t walk down that pathway myself, but you can get there. Ignore the strikeouts and walks; Oakland pitchers pitched 1% more innings than the American League average, and issued 1% more walks, it’s nothing. It’s the home runs. Oakland had a park factor of 86, but a park factor for home runs of 70. The park should reduce home runs by 14%. The league average of home runs allowed was 132; reduce that by 14%, that’s 115. Oakland A’s pitchers should have allowed 115 home runs. They allowed 142. That’s 27 homers. Oakland pitchers were average in strikeouts and walks allowed, but they allowed 27 more home runs than an average pitching staff should have allowed. If each home run is worth, let’s say, 1.65 runs, then that’s about 45 runs. It comes pretty close to being a closed-end explanation for the Oakland pitchers’ data (the arithmetic is sketched in code after the list below):
Average pitchers in an average park would have allowed 729 runs,
The Oakland Coliseum reduced that expectation to 685 runs,
The Oakland defense was 97 runs better than average, which reduces that expectation to 588 runs,
The A’s allowed 642 runs, which was the fewest in the league per inning, but was actually 54 runs OVER expectation,
Which is largely explained by the fact that the A’s pitchers gave up 27 more home runs than they could have been expected to allow, given the park that they were playing in.
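A few lines of arithmetic, using the figures above. Team innings of roughly 1,455 are an assumption on my part; everything else comes from the figures cited in the text.

```python
# Walking the pathway above in plain arithmetic. Team innings (~1455) are assumed;
# the other inputs come straight from the figures cited in the text.

league_avg_runs_allowed = 729
expected_in_park = league_avg_runs_allowed * (1 - 0.06)     # park knocks off ~6% -> ~685

defense_runs = 0.60 * 1455 / 9                              # RA9def of .60 -> ~97 runs
expected_from_pitchers = expected_in_park - defense_runs    # ~588

actual_allowed = 642
over_expectation = actual_allowed - expected_from_pitchers  # ~54 runs over expectation

extra_homers = 142 - 115                                    # HR allowed minus park expectation
runs_from_homers = extra_homers * 1.65                      # ~45 runs, most of that gap

print(round(expected_in_park), round(defense_runs), round(expected_from_pitchers),
      round(over_expectation), extra_homers, round(runs_from_homers))
```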
OK, so there IS a pathway through the numbers that will get you to that conclusion. But is that the RIGHT way to walk through those numbers? Is that the best way?
There is a risk of confusing park effects with defense. Again quoting what I wrote in the Norris/Burns comparison in the 1980 section:
What I think has happened is that ballparks and fielders sometimes have very similar statistical signatures. The Oakland Coliseum, with large foul territory and cool, damp air, was a tough place to hit. The A’s as a team hit .251 at home in 1980, .267 on the road, with a slugging percentage 42 points higher on the road. Their opponents hit .234 at home, .254 on the road, with a slugging percentage 65 points higher on the road.
The park reduces the in-play average and the other run elements—but good defensive play also reduces the in-play average and the other run elements. This creates the possibility of confusion between the two. What I THINK has happened here is that the park’s run-suppressing characteristics are being double-counted as if they were also evidence of superior defense, thus adjusting twice for the park.
I am not saying that that is necessarily what happened; the peculiar numbers for Oakland pitching and defense may result from that, or they may result from some other cause. But that’s the second problem. The third problem is whether you should be doing this or not. Voros McCracken fundamentally changed sabermetric analysis 20 years ago, with the realization that the outcome of balls in play against a given pitcher in a given year is just random, or mostly just random.
I would be concerned that Baseball Reference may have trapped itself on the wrong side of the Voros revelation. I wonder if perhaps Baseball Reference-WAR is attributing to the fielders a deviation in performance which Voros established does not exist. What is "defense"? It is the outcome of balls in play, isn’t it?
Yes, of course, some defenses are better than other defenses. Yes, we can measure that. But what Baseball Reference is attempting to do, is to measure that and hold the pitcher responsible for it. They THINK that because they’re attributing it to the defense, rather than directly to the pitcher, it is OK, but it isn’t OK. When you give it to the defense, you take it away from the pitcher. The effect is the same, whether you attribute it directly to the pitcher, or whether you attribute it to the defense and then take the defense away from the pitcher. Either way, it is mostly just random.
There is an argument, at least, that the Baseball Reference approach is treating noise in the data as if it was valid information.
I should say, before I go any further, that separating what the pitcher has done from what the fielders behind him have done is a tremendously tricky business, and that none of us really knows how to do that. The way that it is done in Win Shares is not too whippy, either. There are serious problems with the way that it works in Win Shares.
But I am not convinced that Baseball Reference-WAR has this right. They have a closed-end system, but there are MANY closed-end, logical systems which could be chosen. They have a pathway through the statistics toward an answer, but there are thousands of possible pathways through the data. The issue is whether we should believe that it is true.
I do not believe, and I do not see how anyone reasonably could be expected to believe, that the Oakland A’s pitchers in 1980 were 54 runs worse than average, despite leading the league in ERA, or that their fielders were 97 runs better than average. I don’t believe it.
It rather seems to me that the Baseball Reference Pitcher’s WAR has been constructed in a what-if universe. Returning for a moment to Jack Kralick in 1961, I think that what Baseball Reference has concluded by the logical pathway that they have chosen is not that Jack Kralick was the best pitcher in the American League, but rather, that in another park with another defense behind him, he would have been the best pitcher in the league. And I would suggest that we don’t actually know that. It’s highly speculative. That’s putting it kindly. It’s wrong.
I estimate that the Baseball Reference Pitcher’s WAR system is, let’s say, 85% accurate. That’s not a bad number. I have propagated dozens and dozens of analytical systems, over the course of my career, that were not 85% accurate. Hell, I have put out there systems that were, in retrospect, not 50% accurate, 50% accurate meaning that you don’t know a damned thing.
But is 85% accurate good enough, given the extent to which B-R WAR is relied upon in the public discussion? Salaries are negotiated based on this number. Front offices rely on this number in structuring teams. Hall of Fame campaigns cite this number for support.
I would argue that
a) The accuracy of this system does not justify the extent to which it is relied upon in the public discourse,
b) The level of accuracy needed in a number to be relied on to this extent should be AT LEAST 99%, and preferably 99.9%,
c) We probably can’t get to 99% very soon, so we probably should be more cautious than we are in citing these kinds of numbers, but
d) It would actually be relatively easy to improve the system so that the accuracy would go to 93-95%.
What would we have to do to the system to boost it up to 93-95%? Seven things (a rough sketch of two of them follows the list):
First, get rid of the RA9def adjustment, which is a speculative number, derived externally to the pitcher,
Second, replace it with an estimate of the pitcher’s individual defensive support, derived from the pitcher’s own data such as his BABIP, his unearned runs allowed, and his stolen bases allowed/caught stealing compared to the team norms,
Third, be absolutely certain that the "defensive support number" is park-adjusted,
Fourth, split the credit for deviations from park-adjusted BABIP between the pitcher and the defense, rather than crediting it all to the defense and thus taking it away from the pitcher,
Fifth, place a realistic boundary on the defensive support number. I would suggest 0.25 runs per nine innings, positive or negative, as a realistic upper boundary,
Sixth, figure the pitcher’s level of effectiveness based on his three true outcomes, and
Seventh, make your final estimate of the pitcher’s level of effectiveness a blend of the runs-allowed based measurement and the three true outcome estimate, perhaps a 70-30 blend (weighted toward actual runs allowed) or an 80-20 blend.
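This is not anyone’s actual implementation, least of all Baseball Reference’s; it is only a sketch of what the fifth and seventh suggestions might look like in code, with hypothetical inputs and weights.

```python
# Sketch of suggestions five and seven: cap the defensive-support estimate, then
# blend a runs-allowed measure with a three-true-outcome (FIP-style) measure.
# All numbers and weights below are hypothetical.

def bounded_defensive_support(estimated_ra9_support, cap=0.25):
    """Clamp the per-nine-innings defensive support estimate to +/- cap (suggestion five)."""
    return max(-cap, min(cap, estimated_ra9_support))

def blended_effectiveness(runs_allowed_based, three_true_outcome_based, weight=0.7):
    """Weight actual runs allowed at 70%, the three-true-outcome estimate at 30% (suggestion seven)."""
    return weight * runs_allowed_based + (1 - weight) * three_true_outcome_based

# A hypothetical pitcher: a raw defensive-support estimate of +0.56 runs per nine
# gets capped at +0.25, and his runs-allowed rate is blended with a TTO estimate.
print(bounded_defensive_support(0.56))                    # 0.25
print(round(blended_effectiveness(3.10, 3.60), 2))        # 3.25
```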
If Baseball Reference were to do those things, I believe that they would very significantly improve the accuracy of their evaluations.