As I threatened to do if no one came up with a head-to-head comparison of Win Shares to WAR, I ran one such comparison myself, and came up with one genuine head-scratcher, which I’ll describe in much detail below.
For convenience’s sake, I chose to compare Win Shares to WAR on the first team I remember rooting for actively, the 1961 Yankees. At age eight, I actually rooted for the Reds, due to the Yankee-hating and Pinson-loving older cousin who introduced me to baseball, but I saw nothing wrong at that age in also following the ’61 Yankees. The next year or the year after, I think, I played my first baseball board game, which was called "Beat The Yankees," or something like that, in which a virtual All-Star team of the rest of the AL (or maybe all of MLB?) struggled to defeat the Yankees. In retrospect, I think the absurd conceit behind this game became an eventual source of my contempt for the Yankees and their fans, that in their universe a fair fight would pit the best players on every other team against their Yankees, and the Yankees would still win most of those contests. They did in the board-game universe version, anyway; now I peg the Yankees-vs.-the-rest-of-MLB at about a .250 winning percentage, maybe. Anyway, the team is sufficiently legendary that most of you should be familiar with it.
The chart comparing their top 25 contributors (as Win Shares rank them) should be self-explanatory, but I’ll explain it briefly anyway: first column is how the players rank according to Win Shares, second column is the players’ names, third is Win Shares themselves, fourth is WAR, fifth is how these 25 players rank according to WAR (blanks denote that the ranking is unchanged from the Win Shares ranking) and the final column is the ratio of Win Shares to WAR for the first dozen players.
I’ve used (copied, actually) both rankings from thebaseballgauge.com, the Win Shares being their version (which differs slightly beyond just rounding differences from Bill’s 2002 Win Shares book) and the WAR being what they call bWAR, the "Baseball-Reference Wins Above Replacement" version.
Both rankings agree on the order, for the first few players at least: each one ranks the top four 1961 Yankees identically. Then the two charts begin to diverge, but they don’t get downright in-your-face disputatious until we reach WS ranking #13, Bobby Richardson, whom Win Shares seems to view as a solid if undistinguished contributor but whom WAR diagnoses as a blind, plague-carrying leper in the final stages of tertiary syphilis, the 34th best Yankee on the 1961 team, a guy who cost the Yankees close to one victory, compared to some minor league zhlub whom they could have acquired for the cost of a phone call, give or take a nickel. Since Richardson’s WS and WAR disagree by so much, that’s where I’ve focused my attention. Why does he rank 13th by WS but 34th by WAR?
My assumption is that WAR must have some rational metric that finds Richardson lacking, which WS does not weigh nearly as heavily, and that their negative overall WAR for him means that he played below replacement level. I don’t think they’re claiming that Bud Daley, for example, with a WAR of 0, contributed nothing at all to the Yankees, just that his contributions are roughly what their top AAA pitcher could have contributed. Richardson’s -0.7 WAR rating means to me that they would have done better to send Richardson to the minors and bring up the best middle-infielder they had down there. Translated into words, I take "-0.7" to mean that Richardson cost the Yankees nearly a full win in 1961, compared to whoever would have replaced him on the roster if they’d decided to send him to Richmond for the summer. That move, presumably, wouldna helped much, but it wouldna hurt.
This assumption is probably wrong, because it’s not only insulting to Richardson, it’s pretty insulting to the Yankees’ front office, the thought that they promoted a half-dozen or so guys who couldn’t play as well as the guys they chose not to promote. But what else can a negative Wins Above Replacement figure mean, if not that this player is below the level of an easily available replacement?
I don’t mean to be insulting to anyone, but this stuff puzzles me the closer I look at it. I don’t mean to insult the 1961 Yankees’ bench (I forget where, but Bill has someplace in his writings described the ’61 Yankees bench as a "wretched hive of scum and villainy"—or maybe that was some other old guy with a beard describing the happy-hour crowd at Mos Eisley. In any case, Bill has published a good deal of persuasive disparagement of the quality and depth of the 1961 Yankees’ bench, aside from their backup catching). And I don’t want to question the Yankees’ front office judgment in keeping Richardson on the team, nor the good men who’ve devoted hours and years of their lives to fine-tuning WAR. Maybe I’m clueless, but if you call something "Wins Above Replacement," a negative number means to me a player whose contributions were below replacement, and that seems harsh regarding Richardson.
I was able to compare the ratios of the Yankees’ top dozen players by WS and WAR (I gave up when the prospect of figuring out the proportion of WS to negative WAR presented itself), and they seem to linger somewhere around 5 or 6 Win Shares for every integer of WAR, though the disparity between Yogi Berra’s ratio (7.3) and Clete Boyer’s (3.8) seems disturbingly large, nearly double. Perhaps playing time accounts for the disparities, but if that’s so, shouldn’t two full-time Yankees like Mickey Mantle (646 Plate Appearances) and Bill Skowron (608 PA) have ratios closer than they are (4.7/6.9)? Both WS and WAR are attempts to quantify players’ contributions, so I was a bit surprised to see them diverging as widely as they did at times.
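For anyone who wants to check the spread, here is a minimal Python sketch. It uses only the four ratios quoted above; the helper name ws_per_war, and its choice to return nothing when WAR is zero or negative, are my own assumptions for illustration, not anything thebaseballgauge.com or Baseball-Reference publishes.

```python
# Win Shares-to-WAR ratios quoted in the text for four 1961 Yankees.
QUOTED_RATIOS = {"Berra": 7.3, "Boyer": 3.8, "Mantle": 4.7, "Skowron": 6.9}


def ws_per_war(win_shares, war):
    """Hypothetical helper: WS per WAR, or None when WAR is zero or negative."""
    if war <= 0:
        return None  # a ratio against zero or negative WAR has no useful meaning
    return win_shares / war


def spread(ratios):
    """Largest ratio divided by smallest ratio."""
    return max(ratios.values()) / min(ratios.values())


# Berra vs. Boyer: the widest gap among the four quoted ratios.
berra_vs_boyer = spread(QUOTED_RATIOS)
# Skowron vs. Mantle: two full-time players with similar plate appearances.
skowron_vs_mantle = QUOTED_RATIOS["Skowron"] / QUOTED_RATIOS["Mantle"]

print(round(berra_vs_boyer, 2))     # 1.92
print(round(skowron_vs_mantle, 2))  # 1.47
```

So Berra earns about 1.9 times as many Win Shares per unit of WAR as Boyer does, and Skowron about 1.5 times as many as Mantle. A player with negative WAR gets no ratio at all from this helper, which is where I gave up.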
When I saw Bobby Richardson literally place off the WAR chart, I thought "Ah ha! Win Shares thinks more highly of defensive players who don’t hit much!" but the very next thing I noticed was that Clete Boyer, also a light-hitting defensive specialist, ranked much higher on the WAR chart and lower on the WS chart, the opposite of Richardson, which kind of blew a crater in that theory.
So that’s where I began comparing the two methodologies: what did Clete Boyer do in 1961 that made WAR rank him so far (28 places) above Richardson, and what did Richardson do in 1961 that made Win Shares rank him so close (2 places) to Boyer? The answer to this one is clear: WAR says that Boyer fielded third base with exceptional skill (2.9 dWAR), while Richardson at second base did his best to negate Boyer’s contributions (-0.9 dWAR).
Another anomaly was Richardson’s stats compared to those of his opposite number, leftfielder Hector Lopez, who ranked thirteen places apart on Win Shares (#19) and WAR (#32). Unlike Richardson, Lopez was an inept fielder at a low-priority defensive position, yet he also did much better in Win Shares than he did in WAR. Lopez had a brutal year with the bat, which was the only reason to put him on the roster in the first place, so I’d also like to know how WAR rates him so close to Richardson: I mean, Lopez couldn’t field at all (he was known as "What a Pair of Hands!" in the same way that Hank Aaron was known as "Bad Henry") and in 1961 he batted even worse (.596 OPS) than he fielded. Both Boyer and Richardson had higher OPSes, which is saying something, so I don’t understand how WAR makes him slightly better overall than it makes Richardson in 1961 (which isn’t saying much). It’s hard to see why Lopez rates spot #32 on the WAR chart while Richardson rates spot #34: Lopez didn’t play every day (274 PA) while Richardson literally did play every day (704 PA), so where exactly did Lopez outplay him? Not in the batter’s box, either quantitatively or qualitatively, and it’s impossible to grok how a part-time leftfielder with bad hands could conceivably have a defensive edge over a middle infielder with five lifetime Gold Gloves. But that’s exactly what WAR seems to assert: it credits Lopez with a positive 0.3 dWAR and debits Richardson with that nasty -0.9 dWAR.
(Lopez’ raw stats, incidentally, juxtapose neatly with those of supersub Johnny Blanchard: each had the same exact opportunities in 1961, 93 games and 243 ABs. Blanchard had a legendary career year while Lopez had one of the most unproductive years of his career. Learn something every day. But let’s leave Hector out of this discussion for now, and try to make sense of WAR’s disparaging evaluation of Richardson’s defense in 1961.)
The mysteries here derive from the different evaluations of these players’ defense: their offensive play seems comparable, Boyer in 1961 getting 6.9 WAR, Richardson 5.3, which seems about fair. I could see how Boyer gets judged a fielding superstar: I actually did see him play for much of his career, and he was a flashy, spectacular fielder. But I’ve always thought Richardson played pretty good defense, too. WAR doesn’t see it that way, and Win Shares does. Win Shares makes Richardson a pretty smooth glove-man in 1961, crediting him with 6.4 defensive WS, more than half his total WS for the year. So where one method sees Richardson as being a defensive liability, the other sees him as a defensive star.
It is not necessarily to be expected that WS and WAR should agree on their evaluations of every player—I was actually impressed that the top twelve Yankees on the Win Shares rankings placed as close as they did to the WAR rankings—but when they diverge sharply, as with Richardson and Boyer and Lopez in 1961, that’s probably a good place to start examining how each methodology works. I’m sure I’m misunderstanding some vital aspects of each one, but it does seem curious that they would reach such different evaluations of a few players—the one player on the 1961 Yankees WAR and WS disagree about the most is Richardson, and the place they disagree the most is his defense. According to WAR, Richardson was literally nothing defensively—for his entire career, WAR has him at -0.0, which I take to mean he was ever-so-slightly below the replacement level for an AL second baseman in the 1960s, which is certainly a minority assessment, since Richardson was a regular at second base for eight seasons, 1959-1966, and he won those five Gold Gloves. I’m no fan of Gold Gloves as an accurate metric of fielding excellence, and less of a fan of Bobby Richardson, but even I have to view WAR’s assessment skeptically. My chief complaint about Gold Gloves is that they are often awarded to the best hitting player at a given position, rather than the best fielding player, on the somewhat sound reasoning that players who can’t hit much are going to be playing partial games and often benched to fit a bat into the lineup, so they might not play enough in the field to outshine their offensively more gifted brethren. But the Yankees manager, as Bill has pointed out several times, mistook Richardson’s modest ability to hit singles for true offensive prowess, hence the 700+ plate appearances in 1961, so my chief complaint about the Gold Gloves is demoted to Assistant Chief here.
A superficial examination of Richardson’s career stats at second base doesn’t really support the five Gold Gloves: according to baseball-reference (https://www.baseball-reference.com/players/r/richabo01-field.shtml ) he posted a career fielding average at 2B (.978) almost precisely matching the league average (.979), and his Range Factor per 9 innings (5.12) was just below the league’s figure of 5.21. In 1961 specifically, those figures are not far from that range: .978/.978 and 5.05/5.24. Still, he played 161 games at second base in 1961, and won his first Gold Glove.
Those aren’t great raw stats by any means (looking over the league, I’d say that Billy Moran deserved the GG in 1961) but I’d like to know how WAR makes Richardson into not only a defensive liability in 1961 but by far the 1961 Yankees’ single greatest defensive liability: of the 36 players they rank defensively, including pitchers, 21 rank at 0 defensive WAR or better, and of the other 15 Yankees with negative defensive WAR, 14 rank at -0.0 or -0.1. Only Richardson ranks below that: -0.9.
As a whole, the team ranked pretty well in defensive WAR, a team total of 5.4 Wins Above Replacement, so Richardson really stands out in that group for his defensive ineptitude. The three Yankees who score the highest in defensive eptitude are Richardson’s three regular infield mates, Boyer (with that whopping 2.9 dWAR), Kubek (1.3) and Bill Skowron (0.6), which you would think would make a butcher like Richardson stand out all the more, yet Richardson earned that infield’s only Gold Glove in 1961.
Since I haven’t tried to penetrate the shrouded secrets of how WAR is computed, and probably would mess up the math if I did, I will just leave these questions open for my betters to answer: on what basis does WAR make Richardson such a poor fielder in 1961? Why does it assess his 1961 fielding so much differently from Win Shares’ 1961 assessment? How come WAR makes Richardson into a positive fielder in 1963, his peak according to WAR, which is also his defensive peak according to Win Shares?
I’ll share some of the high points, such as I understand them, to see if anyone can come up with a plausible explanation of Richardson’s atrocious dWAR in 1961, all taken from https://www.baseball-reference.com/teams/NYY/1961-fielding.shtml#all_players_advanced_fielding_2b . These are the highlights that seem meaningful or comprehensible to me: the percentage of Richardson’s time at 2B when a right-handed batter was at the plate was 69%, exactly the league average, ruling out any sort of handedness bias, and his percentage of balls in play and of ground balls in play was also exactly at the league average figures. (Boyer’s comparable numbers are also virtually identical to the AL as a whole, the exception being that Boyer had 75% of Balls In Play compared with the AL’s 74%.) In the more mundane, non-advanced fielding stats, Richardson’s 1961 stats were just about the league standard, a hair below in both fielding percentage (.978/.979) and in Range Factor per 9 Innings (5.05/5.09). In comparison, Boyer at 3B was just a hair above league average in fielding percentage (.967/.966) and in RF/9 (3.78/3.73), which does indicate that Boyer had a better year with the glove than Richardson, perhaps, but not the night and day difference that WAR presents. Richardson also played over 200 more innings at 2B than Boyer played at 3B (though Boyer also played 75 innings at SS, with no particular statistical distinction). All in all, I’m still scratching my head over this one: what accounts for Richardson’s abysmal defensive rating in 1961 according to WAR?
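To put numbers on "a hair," here is a rough Python sketch using only the four player/league pairs quoted in the paragraph above. The pct_vs_league helper is a name I made up, and a percentage gap from league average is just one simple way to frame the comparison; it is not how WAR arrives at dWAR.

```python
# (player, league) pairs quoted in the text for 1961.
STATS = {
    "Richardson 2B": {"fielding_pct": (0.978, 0.979), "rf_per_9": (5.05, 5.09)},
    "Boyer 3B": {"fielding_pct": (0.967, 0.966), "rf_per_9": (3.78, 3.73)},
}


def pct_vs_league(player, league):
    """Hypothetical helper: percent above (+) or below (-) the league figure."""
    return (player - league) / league * 100


# Percentage gap from league average for each player and each stat.
gaps = {
    name: {stat: round(pct_vs_league(*pair), 2) for stat, pair in stats.items()}
    for name, stats in STATS.items()
}

for name, player_gaps in gaps.items():
    print(name, player_gaps)
```

Every gap comes out under 1.5 percent in either direction, with Richardson slightly negative on both measures and Boyer slightly positive on both. That is a difference, but on these raw numbers alone it is hard to see a 3.8-win gap in dWAR.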
Mind you, I’ve got no dog in this fight: I couldn’t care less if WS and WAR had some general disagreement on assessing fielders or hitters, or if they disagreed on the degree to which Richardson’s fielding stunk or excelled. It’s just that what’s going on here seems potentially instructive and, unless Richardson’s fielding is a one-off that never occurs again, this problem should be able to point to a failing in one methodology or the other.
Personally, I’ve come to think in recent years that Richardson was generally over-rated, benefiting beyond his contributions from his association with the Yankees’ dynasty, and from batting lead-off for the multi-championship teams. If WS agreed even partly with WAR’s assessment of Richardson’s fielding in 1961, I’d probably chalk that up (if I would have even noticed it) to a different emphasis in the two methodologies, but this is not a partial agreement or any kind of agreement, prompting me to want to examine this disparity more closely.
The WAR assessment is out of whack with several things: 1) Richardson’s Gold Gloves, 2) Richardson’s general reputation as an adroit fielder, which I can support beyond my own subjective impressions with his consistent "1" ratings in Strat-O-Matic, their highest rating, 3) his holding down the regular 2B job for better than five years with no one ever claiming the Yankees needed a defensive upgrade (unusual for a questionable fielder, especially in the New York press, who focus relentlessly on the hometown teams’ weaknesses), 4) his winning the job in the first place under Stengel, who insisted on excellent fielding, especially on the DP, from his middle infielders, 5) the 1961 Yankees’ W-L record, hard to feature with a truly brutal defensive middle infielder, 6) Richardson being virtually never substituted for defensively in late innings at 2B, which a butcher certainly would have been on a team that so often led late in games, 7) his raw fielding stats, and finally 8) his Win Shares rating.
That said, I’d be willing to entertain the notion that Bobby Richardson was an unacceptable fielder in 1961 if someone could explain the WAR rating in a way that makes some sense. Joe Posnanski’s column on Bill James’ antipathy to WAR (http://joeposnanski.com/more-on-war/ ) reduced WAR’s formulas to sophisticated doubletalk and mumbo-jumbo. (The nut quote from Bill is "They tend to get so far into the data, throw up their arms and make a wild guess.") There is a far more temperate follow-up piece by Posnanski (https://medium.com/joeblogs/even-more-on-war-493267e5d49 ) that I recommend strongly, taking into account a lot of the blowback his first column got from the gentlemen defending WAR, the nut quote from this column being from a fictional religious character in a Woody Allen movie who, pressed to the wall, says that he will "always prefer God to truth." I like that quote, because the religion:baseball analogy always appeals to me. I have always viewed baseball as essentially rational (and to my mind independent of any system of belief), but there is an element of beauty, of balance, of mystical justice in baseball that encourages others to find something almost spiritual there.
WAR and Win Shares exemplify the rational element to me, but at the heart of Posnanski’s followup piece is the part of WAR (and WS) that leaves open a small percentage of baseball that cannot be explained rationally. That percentage of the game, if I’m understanding Joe correctly, is 13%. That is, according to one of WAR’s defenders quoted in the piece:
WAR accounts for about 87 percent of all runs in baseball. That’s pretty darned good and not atypical for the various WAR systems. Does that leave 13% of what happens on a baseball field unaccounted for? It sure does. But again, so what? WAR doesn’t pretend this variance does not exist; it merely refuses to punish individual players for the inherent volatility we enjoy seeing in the game.
Is Richardson’s 1961 defensive WAR simply an instance of the 13% of the game that WAR refuses to account for rationally? Is the team’s 13% "variance" concentrated in one player, somehow, while the rest of the team’s WAR makes perfect sense? Are we to chalk up Richardson’s poor WAR showing in 1961 to bad luck that will presumably even out over the course of his career? Is it just some sort of weird anomaly?
I’m assuming that it’s not simply an anomaly. That is, that the explanation for this unaccountable assessment is more complex than "Hey, shit happens." I’d be disappointed to find that 13% of WAR’s findings are skewed and that WAR’s defenders are OK with that figure. The thing I came to WAR for in the first place, and to Win Shares, perhaps naively, was some sort of all-purpose metric that yielded a close approximation of each element that goes into winning baseball games. My own definition of a close approximation is much more like 99% or 98%; if we’re really talking about a mere 87% accuracy rate, then I’ve been deceived in my expectations of either metric.
This puts me squarely in the group of fans who, in Joe Posnanski’s terms, "want to believe that [WAR] uses extremely complex calculation and reasoning to give us a wonder-stat, one that answers all questions and sees all worlds. Is that fair? No. But it is real."
I may be misconstruing the meaning of "WAR accounts for about 87 percent of all runs in baseball," but if I’m not, WAR, and maybe Win Shares too, seem far less reliable to me today than they did yesterday.