The Sheff on the Shelf

By Dave Fleming

December 2, 2020

This started as some musings about Curt Schilling. Bill posted a poll yesterday that had Curt Schilling matched up with another player, and that sent me down a little rabbit hole of investigating, which led me to this article.

It’s not about Curt Schilling. Not at all.

It is about…or at least tangential to…Bill’s recent article about grain elevators and comparison derivatives and the problems inherent to the WAR statistic, but I don’t want to get into that until the end. I try to avoid thinking about the same subjects Bill is pondering for the same reasons I tend to avoid swimming laps when Michael Phelps is in the pool: I don’t want to intimidate anyone.

Let’s start with my last article, which was about Vladimir Guerrero and Bobby Abreu.

Guerrero and Abreu had careers that overlapped. They were both born in Spanish-speaking countries. They reached the majors at the same time, established themselves at the same time, and had careers of parallel lengths. They played the same position.

In that article, I went to great pains to show that by just about every metric imaginable, they were comparably valuable players.

If you look at either version of WAR, or Win Shares, if you glance at peak years or five- or seven-year stretches of peak performance, they rate as very, very even. Same value.

Same value…but different production.

Vladimir Guerrero was a burly power hitter who hit a lot of home runs. Abreu wasn’t nearly as dominant a power hitter, but he closed the gap because he walked a lot more than Guerrero. There were other differences between the two men, but that was the biggest: Guerrero hit homers. Abreu drew walks.

Requisite tables:

Name	H	2B	3B	HR
Vladimir Guerrero	2590	477	46	449
Bobby Abreu	2470	574	59	288

And:

Name	Walks	IBB
Vladimir Guerrero	737	250
Bobby Abreu	1476	115

Most outfielders have the majority of their value in what they do as hitters. You can make the Hall of Fame as an elite defensive shortstop, but it’s a tougher task to pull off as an outfielder. You gotta hit.

And both men did hit, though neither was a perfect offensive player. Guerrero was a terrific power hitter, but he didn’t get on base at an elite clip. Abreu excelled at getting on base, but in a power-heavy era, he didn’t hit too many dingers. The two tables above show each player’s limits.

Win Shares and the two versions of WAR are in general agreement about both players’ career value:

Name	FanGraphs WAR	B-R WAR	Career Win Shares
Vladimir Guerrero	54.5	59.5	324
Bobby Abreu	59.8	60.2	356

Abreu is a little ahead by FanGraphs’ WAR and Win Shares. Adjusting for games played, Vladimir is slightly more valuable by Baseball-Reference’s version of WAR. It’s all very, very close.

There is a third right fielder…same era as Guerrero and Abreu…whom I did not include in the previous article. I didn’t think about him, didn’t pay him any mind.

Let’s include him now. Starting with the nature of their hits:

Name	H	1B	2B	3B	HR
Vladimir Guerrero	2590	1618	477	46	449
Bobby Abreu	2470	1549	574	59	288
Gary Sheffield	2689	1686	467	27	509

Gary Sheffield, purely on the basis of what his hits looked like, was more closely aligned to Vlad than Abreu. He was a sluggardly slugger. He was a stone-cold clouter.

What about walks?

Name	Walks	IBB
Vladimir Guerrero	737	250
Bobby Abreu	1476	115
Gary Sheffield	1475	130

Here, Sheffield is even with Bobby Abreu. Like Abreu, Sheffield was terrific at getting on base. Both Abreu and Sheffield collected more than 1300 unintentional walks. Vlad collected fewer than 500 unintentional walks. That is a gap of some eight hundred walks.
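
The unintentional-walk figures are just total walks minus intentional walks. A minimal sketch of that subtraction, using only the totals from the table above (the function and variable names are illustrative, not from any stats library):

```python
# Career walks (BB) and intentional walks (IBB), copied from the table above.
walks = {
    "Vladimir Guerrero": (737, 250),
    "Bobby Abreu": (1476, 115),
    "Gary Sheffield": (1475, 130),
}


def unintentional_walks(bb, ibb):
    """Walks the hitter drew without being put on base intentionally."""
    return bb - ibb


ubb = {name: unintentional_walks(bb, ibb) for name, (bb, ibb) in walks.items()}

for name, value in ubb.items():
    print(f"{name}: {value} unintentional walks")

# Gap between each of the two high-walk hitters and Guerrero.
gap_abreu = ubb["Bobby Abreu"] - ubb["Vladimir Guerrero"]
gap_sheffield = ubb["Gary Sheffield"] - ubb["Vladimir Guerrero"]
print(f"Abreu minus Guerrero: {gap_abreu}")
print(f"Sheffield minus Guerrero: {gap_sheffield}")
```

Both gaps land in the mid-800s, which is the "some eight hundred walks" in the text.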

So Gary Sheffield, as a hitter, combines Vladimir Guerrero’s power hitting with Abreu’s ability to get on-base via the walk. He is the best of both players.

If I wanted to bore you to tears, I could go into the defensive evaluations of all three players, and their contributions on the base paths, but there isn’t a helluva lot of daylight between them. Abreu and Guerrero were right fielders with limited defensive reputations: Guerrero had an arm, and each won a Gold Glove, but both rate as poor defensive players. Sheffield was mostly a right fielder, but he was athletic enough to come up as a shortstop briefly, and he played third base for a few years. Sheffield stole 253 bases and was caught 103 times: he wasn’t as good a baserunner as Bobby Abreu, but he was probably a bit better than Guerrero.

The gap between them really comes down to offense. Vlad Guerrero was a terrific power hitter who didn’t walk much, and Bobby Abreu was a terrific on-base player who could occasionally pop one out. Gary Sheffield was terrific in both areas: he had Vlad’s power and Abreu’s on-base average.

And…here is the kicker…WAR just does not credit that.

WAR…both versions…considers all three men to be of parallel value:

Name	Games	B-R WAR	FanGraphs WAR
Vladimir Guerrero	2147	59.5	54.5
Bobby Abreu	2425	60.2	59.8
Gary Sheffield	2576	60.5	62.1

We can see how closely the metric rates all three players by looking at their respective WAR per 162 games:

Name	Games	B-R WAR/162	FanG WAR/162
Vladimir Guerrero	2147	4.5	4.1
Bobby Abreu	2425	4.0	4.0
Gary Sheffield	2576	3.8	3.9
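
For anyone who wants to check the table, the per-162 figures are career WAR scaled to a 162-game season. A small sketch, assuming simple proportional scaling and rounding to one decimal (the helper name `per_162` is made up for this example):

```python
# Career games and WAR totals, copied from the tables above.
careers = {
    "Vladimir Guerrero": {"games": 2147, "bref_war": 59.5, "fg_war": 54.5},
    "Bobby Abreu": {"games": 2425, "bref_war": 60.2, "fg_war": 59.8},
    "Gary Sheffield": {"games": 2576, "bref_war": 60.5, "fg_war": 62.1},
}


def per_162(total, games):
    """Scale a career total to a 162-game season, rounded to one decimal."""
    return round(total * 162 / games, 1)


war_per_162 = {
    name: (per_162(c["bref_war"], c["games"]), per_162(c["fg_war"], c["games"]))
    for name, c in careers.items()
}

for name, (bref, fg) in war_per_162.items():
    print(f"{name}: B-R {bref}, FanGraphs {fg}")
```

Swapping in any other career total (Win Shares, say) works the same way, since the helper only needs a total and a games count.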

It’s worse than comparable: both versions of WAR believe that Gary Sheffield - a better on-base player than Guerrero and a better slugger than Abreu - is actually a slightly lesser player than the other two outfielders.

I realize that I am leaving out a great deal of secondary information about these players. I have not mentioned their personalities, or their relative fame, or the arc of their careers, or whether their teams won or lost games. I don’t care about any of that. I am talking only about the math.

WAR is all about the math. It is credited for being objective, for eliminating all of the external noise to distill what a player did, and giving that a comprehensive number. It is meant to be objective. It is credited…by many people…as being the best objective measure of a player.

But it misses here. If you took Vlad Guerrero and added 50 walks a year, you’d have Gary Sheffield. If you took Bobby Abreu and gave him 12 more home runs a year, you’d have Gary Sheffield.
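
Those round numbers can be sanity-checked from the career tables. This is a rough sketch under two assumptions that are not spelled out in the text: that "a year" means 162 games, and that the walk comparison uses unintentional walks. The names are illustrative.

```python
GAMES_PER_SEASON = 162  # assumption: "a year" is read as a 162-game season

# Career games, home runs, walks, and intentional walks from the tables above.
careers = {
    "Guerrero": {"games": 2147, "hr": 449, "bb": 737, "ibb": 250},
    "Abreu": {"games": 2425, "hr": 288, "bb": 1476, "ibb": 115},
    "Sheffield": {"games": 2576, "hr": 509, "bb": 1475, "ibb": 130},
}


def rate_per_season(count, games):
    """Career count scaled to one 162-game season."""
    return count * GAMES_PER_SEASON / games


def unintentional_walk_rate(player):
    # Assumption: intentional walks are excluded from the comparison.
    return rate_per_season(player["bb"] - player["ibb"], player["games"])


def home_run_rate(player):
    return rate_per_season(player["hr"], player["games"])


walk_gap = unintentional_walk_rate(careers["Sheffield"]) - unintentional_walk_rate(
    careers["Guerrero"]
)
hr_gap = home_run_rate(careers["Sheffield"]) - home_run_rate(careers["Abreu"])

print(f"Sheffield vs. Guerrero, unintentional walks per 162: {walk_gap:+.1f}")
print(f"Sheffield vs. Abreu, home runs per 162: {hr_gap:+.1f}")
```

Under those assumptions the gaps come out a little under 50 walks and a little under 13 home runs per 162 games, in line with the round figures above.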

WAR doesn’t give any credit to that.

The system that DOES credit that difference is Win Shares. Win Shares shows a clear gap between Sheffield and the other two players.

Name	Career Win Shares
Vladimir Guerrero	324
Bobby Abreu	356
Gary Sheffield	430

Per 162 games, Sheffield is ahead of Abreu and Guerrero:

Name	Games	Win Shares/162
Vladimir Guerrero	2147	24.4
Bobby Abreu	2425	23.8
Gary Sheffield	2576	27.0

Sheffield isn’t wildly ahead of the other two players: he does not blow them away. He’s a little ahead, which is what you’d expect if you thought about the three players for more than ten minutes.

And Win Shares rates Sheffield as having had bigger seasons. Again…not much bigger. But bigger.

Best ten seasons by Win Shares:

Sheffield: 34, 34, 32, 31, 31, 30, 30, 30, 26, 24.

Abreu: 33, 29, 28, 27, 26, 26, 26, 25, 23, 23.

Guerrero: 29, 29, 29, 28, 28, 27, 27, 24, 23, 22.

Sheffield had eight seasons of 30+ Win Shares. Abreu had one. Guerrero had zero.
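
The counts are easy to verify from the three lists. A short sketch that tallies 30+ seasons and sums each player's ten best years (the 30 Win Share cutoff is the one used above; the names are illustrative):

```python
# Best ten seasons by Win Shares, copied from the lists above.
best_ten_win_shares = {
    "Sheffield": [34, 34, 32, 31, 31, 30, 30, 30, 26, 24],
    "Abreu": [33, 29, 28, 27, 26, 26, 26, 25, 23, 23],
    "Guerrero": [29, 29, 29, 28, 28, 27, 27, 24, 23, 22],
}


def seasons_at_or_above(seasons, threshold=30):
    """Count the seasons that reach the given Win Shares threshold."""
    return sum(1 for win_shares in seasons if win_shares >= threshold)


counts_30_plus = {
    name: seasons_at_or_above(seasons)
    for name, seasons in best_ten_win_shares.items()
}
top_ten_totals = {name: sum(seasons) for name, seasons in best_ten_win_shares.items()}

for name in best_ten_win_shares:
    print(f"{name}: {counts_30_plus[name]} seasons of 30+, "
          f"{top_ten_totals[name]} Win Shares over the best ten")
```

Summed over the ten best seasons, Sheffield comes to 302 Win Shares against 266 for each of the other two.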

This, to me, shows the errors in the formula. Somehow, there are little dings against Sheffield that hamper how the systems measure him. These are small percentage miscalculations, and I’d hazard that a part of the challenge is that Sheffield moved positions a great deal, while Abreu and Guerrero did not. It is possible that WAR, in trying to figure out what a run from a third baseman means in comparison to a run from a corner outfielder, ends up missing Sheffield’s big seasons. Small miscalculations…small thumbs on the scale…but the result is a big problem.

Gary Sheffield was a better player than Guerrero and Abreu. This fact is obscured by a WAR metric that has so taken over how we evaluate players that trying to argue the case is nearly pointless. But I’ll make the effort all the same.

David Fleming is a writer living in southwestern Virginia. He welcomes comments, questions, and suggestions here and at dfleming1986@yahoo.com.

COMMENTS (33 Comments, most recent shown first)

garywmaloney
Re the specific HOF case of Sheffield, it seems we have left out one of the intangibles -- in honor of the undeniably great Mr. Allen, we might call this factor "Dickitude" -- the alleged negative impact of the individual and his actions on his teams, the game, etc.

But maybe that is outside this particular discussion, as would be BALCO related allegations (which were perhaps tangential).

Just in terms of the numbers, it really comes down to defense. Though the estimable Mr. Jaffe on the Other Site states that Sheffield is something of a favorite of his, he condemns the defense in the strongest terms:

"The bad news is that Sheffield’s defensive numbers — a mix between Total Zone and (from 2003 onward) Defensive Runs Saved — are all-time awful. His -195 fielding runs rank as the second-lowest total of all-time, ahead of only Derek Jeter (-243). That whopping total suppresses Sheffield’s career and peak WARs . . ."

It's the old fielding problem, identified by Bill in his Win Shares book, where he broke with Linear Weights (WAR's ancestor) over the actual value of fielding AND with that system's grotesque errors, which I believe were also based on "Replacement Player" / Estimate issues outlined in his recent piece. Bill spent more time figuring fielding's actual correct value than on any other part of Win Shares, and I think he got it right.

WAR, like Linear Weights, gets the fielding component wrong, and it skews the entire damn thing. That said, I love the elegant simplicity of tangotiger's "solution" -- divide TotalZone by 2.

Stop WAR's misleading of untold fans and professionals -- fix it.
4:53 PM Dec 4th

pgups6
Understood, and was just using the UZR v DRS as just a small component example of the broader issue of estimations upon estimations with two reputable sources having such varying outcomes. This system is heavily relied upon now (awards, contracts, etc) and I think we just need to do a better job conveying that it is a very rough gauge.
10:45 AM Dec 4th

tangotiger
We can debate analogies all we want, but the reality is that the primary difference between the two WAR is their reliance on third-party fielding systems. Reference can EASILY swap to UZR, or Fangraphs can EASILY swap to DRS, so that both systems use the same fielding system. They don't.

The secondary difference between the two is their in-house development of park factors. But those should be for the most part very similar since they both rely on very similar data.

So, what you want to do is simply debate UZR v DRS. And Fangraphs makes this ridiculously easy since they carry both of them, side by side.

10:25 AM Dec 4th

pgups6
But the type of car is a constant (in this case Ramirez), and one system is telling me that car is using SUV tires, the other system is telling me he is using coupe tires. That's a problem.
10:08 AM Dec 4th

tangotiger
Again, it's not one WAR against the other WAR. It's like comparing the tires on an SUV to a coupe and not making reference to the manufacturers of the tires.

Both WARs rely on an independent system to include their fielding. Fangraphs uses UZR (from Lichtman) and Reference uses DRS (from Dewan).
10:00 AM Dec 4th

pgups6
But in Ramirez's case, one WAR has his defense rated above replacement while the other WAR has his defense rated below replacement (and could've cost him the MVP). So what are we supposed to do then? Go back to our subjective appraisals?
9:41 AM Dec 4th

TheRicemanCometh
Tango is right yet again. My problem has always been WAR's defensive analysis, which leads to these bizarre outcomes. If they literally cut it in half the whole thing would disappear.

8:13 PM Dec 3rd

tangotiger
This is a UZR (what Fangraphs chooses) v DRS (what Reference chooses) issue. Everything else is window dressing.

7:47 PM Dec 3rd

pgups6
First, understood that there is a ton of work involved (much of it that is way over my head, admittedly) and it is appreciated, just trying to find common ground and solutions, that's all.

A legitimate MVP candidate has a 2.2 on one site but a 3.4 on another. Can we agree this is a problem?

And if the defensive metrics are limited (not off, but limited), should they be the basis for Hall of Fame voting?
5:45 PM Dec 3rd

tangotiger
First off: that "IF" is doing a lot of work. The metrics are not off.

Secondly: if we don't provide WAR, something inferior will fill the gap. That's the reality of it. So, what we are doing is providing the best possible solution. The test is not against how close you can get to perfection, but rather, being better than the alternative.

And WAR is it.
5:00 PM Dec 3rd

pgups6
If the information is limited to the point where it causes significant differences then maybe it shouldn't be utilized. We should strive to provide something that is consistent.

A regular baseball fan should not have to take into account the source of the stats and go under the hood. We are used to our stats being hard actual numbers and unfortunately WAR is now being utilized as a hard number.

Hall of Fame candidates Andruw Jones and Scott Rolen are getting serious Hall consideration based on their defensive metrics alone. If these metrics are off/limited, then there's a problem.
4:54 PM Dec 3rd

tangotiger
There's multiple ways to estimate the inflation rate.

That's all we're doing here, trying to estimate reality with the limited information available.
4:04 PM Dec 3rd

pgups6
The different flavors of WAR are part of the problem. Something so prevalent and ingrained should be more standardized. Consider this year's AL MVP race:

Abreu- BBRef WAR 2.8 (oWAR 2.2, dWAR 0.2), Fangraphs WAR 2.6
Ramirez- BBRef WAR 2.2 (oWAR 2.8, dWAR -0.5), Fangraphs WAR 3.4

That's two different flavors, especially for Ramirez. And did Ramirez's negative dWAR cost him some votes?

Your everyday person doesn't have the wherewithal to look under the hood to determine the differences. And nor should they, if a stat is presented, it should be considered accurate.

Not pointing fingers, looking for common ground and solutions. Do we need to pare down the emphasis on total WAR? Do the defensive metrics need more work?
1:21 PM Dec 3rd

frisco
I wonder how much of Sheffield's defense WAR is dragged down by his time as a shortstop and third baseman? He was really miscast playing those positions.

My Best-Carey Miller
12:27 PM Dec 3rd

Steven Goldleaf
My preferred method of examining weaknesses in systems is to focus on single worst-case comparisons. I'd be interested in knowing, for those of you who buy Dave's thesis here, which season of Sheffield's you think is most undervalued in comparison to either Abreu's or Guerrero's (preferably at the same position, but not necessarily--just trying to simplify here) and we can start analyzing how WAR reaches the conclusions it does. I presume we'll be looking at defense, mainly, and you will be able to show how, in this worst-case scenario, Sheffield's defense is being undervalued. If you can't show that materially and strongly, as I suspect you can't, in this extreme example, then we don't need to look at your weaker examples, do we? But if you can show weird and inexplicable justifications for evaluating Sheffield unfairly then we can move on to your next strongest case--you'll need to provide quite a few of those to make your case, but the ball is in your court.
7:36 AM Dec 3rd

tangotiger
And this is the important part: WAR allows you to do that. It lets you go halfway between TotalZone and NoZone. It gives you that freedom to create your own flavor of WAR.
10:03 PM Dec 2nd

tangotiger
The entirety of the discrepancy is with fielding. This is the whole story. And it's not "WAR" per se. It's the fielding system that WAR is using.

From 2002-present, Baseball Reference relies on DRS and Fangraphs relies on UZR.

Prior to that, they rely on TotalZone.

If WAR wanted to, it could do this: TotalZone/2. And that's it. Suddenly, Sheffield would come in at 70 WAR and everyone is happy.

10:01 PM Dec 2nd

evanecurb
Dave Fleming is really an excellent writer. This was not only thought-provoking, but it was fun, too. Why do people use uber stats to take a shortcut to determining value? They're looking for a quick and easy answer instead of doing the analysis and research themselves. It's a lot easier to say "Guerrero and Sheffield have the same WAR" than it is to go into a detailed comparison of the two players, where you learn that (1) Guerrero really didn't walk very much for a power hitter, (2) Sheffield did, (3) the available defensive metrics say that Guerrero was much, much better defensively than Sheffield, and (4) what do we think about these defensive stats, anyway? Was Sheff really that bad?

If you use WAR as a gateway to this analysis, I think you're using it correctly. If you're using WAR to say the two players were of equivalent value and stop there, you're not advancing the conversation.
9:41 PM Dec 2nd

W.T.Mons10
"When I look at WAR...it says that Sheffield and Vlad and Bobby Abreu are equivalent players. When I look at all of the component parts of each of those players...when I do the work that WAR promises it is doing for us...my work says that Sheffield should be ahead of the other two."

But you aren't looking at all of the component parts. You are only looking at offense. WAR agrees that Sheffield was a better hitter than the other two. But it doesn't ignore the other half of the game.
9:12 PM Dec 2nd

sansho1
If it please the court, I would submit that the overarching dynamic issue with WAR being described is a compelling one, but if the example given can be at least in part explained by a wonky element thereof, do forgive the lumpen for piping up in the interest of lively discussion.
8:49 PM Dec 2nd

pgups6
This is fantastic. And just like Winfield-Dewey, the defensive metrics are the biggest culprit. In addition (aside from this example), WAR doesn't seem to ding enough for durability with guys like Walker and Rolen. I understand that with missed games, players lose the ability to add to their WAR totals, but they also have "the benefit" of not debiting their totals as well. And given this propensity for negative defensive metrics values, could missing games actually work in favor of a player's WAR?
8:22 PM Dec 2nd

ksclacktc
I've always thought the same thing you're bringing up, and Bill indirectly in this article. However, I've never been able to articulate the problem as well as you have. I've pretty much given up arguing with the WAR people, who usually respond the way people involved with politics and the media do on those subjects. WAR has been placed on a pedestal as a continuation of the metrics age and the conclusion of all the work done. As an amateur saberist since the early 80s I shudder at what has become of the movement that was about asking questions, researching and attempting to come up with newer and better ways to answer. Well done Sir. And, keep up the good fight...I see you have some snarky responses already.
8:18 PM Dec 2nd

sansho1
When you get the car back, do you tear up the itemized receipt and tell the mechanic to just write "car works now, $500" instead?
6:04 PM Dec 2nd

DaveFleming
Oh god, yes. Sheffield was scary to see at the plate. That bat wiggle...I always wondered how the hell he managed to time that, but it was tremendously fun to watch.

Sheffield also shared Vlad's ability to avoid the strikeout: both men averaged 74 strikeouts per 162 games, and neither ever topped 100 K's. Abreu was significantly more susceptible to the whiff.
5:18 PM Dec 2nd

DHM
Dave - great job here. Just looking at the math does make you wonder “WTF WAR?”
I always thought Sheffield was the scariest hitter at the plate, looked like he was holding a wiffleball bat and then a ferocious rock-splitting swing...just absolutely terrifying. Playing 3B against him must have taken serious brass balls. I wish we had exit velocity in his prime.
3:04 PM Dec 2nd

michaelplank
Sheffield's number 1 batting comp (although less than 900 so not "truly" comparable) is his former teammate Chipper Jones:

Sheffield 10,947 PA, .292, .393, .514
Jones 10,614 PA, .303, .401, .529

Sheffield came up just a few years earlier so wasn't *quite* as centered in the sillyball era. Other than that, and maybe some park effects, same guy; Jones slightly ahead. OWar gives Sheffield 80.8, Jones 88.3; half a WAR per year difference.

Jones was a cromulent-ish 3B, stopgap SS, good-ish LF ... gets dinged 0.9, not even a full WAR, for his career defense.

Sheffield loses 27.7, more than a WAR per season.

Final count, Sheffield 60.5, Jones 85.3. I don't know.
12:41 PM Dec 2nd

Manushfan
I remember watching Sheff playing RF in Fenway--the guy was actually fairly good, seemed to have a better arm than you would think. As I said elsewheres I'm agnostic on lovely WAR, I think it's got its uses and if you wanna use it as a shorthand go to on 'Player GOOD!' or 'Player BAD!', that's okay by me. But if it tells me player X was lousy in CF (Mantle) or third (Traynor), or just OK (Joe Morgan), I don't have to genuflect and go 'Duh yeah that's right yup yup yup'. Not how it works.
12:13 PM Dec 2nd

Steven Goldleaf
I'm getting to really hating analogies.
11:35 AM Dec 2nd

DaveFleming
One of the things that frustrates me...and please recognize that I am trying to work through my thinking, and not trying to be dismissive...is when people jump to 'breaking down' WAR when it doesn't work.

It's a car problem. I need my car to go, to move forward. If you're trying to sell me a car that doesn't start, and you tell me it isn't working because the alternator is crappy, or because it's actually the driveshaft that's crooked, or because it's really the flux capacitor that's rotten...well...I don't care. The car ain't working. That's the problem.

I KNOW that the reason Sheffield rates with Vlad and Abreu is because WAR absolutely hammers him on his defense. But...WAR is supposed to do the job of telling the whole story about a player. If I have to go in and say 'well, your defensive metrics stink,' I think it's a car that won't go.

How is WAR used? Not oWAR or dWAR...but WAR. The big tally that is the default ranked on FanGraphs. Is it treated like a 'part' of how we understand a player, or is it treated like it is THE player?

It's my opinion that the majority of stat-knowledgeable fans treat it in that second fashion. Not just fans: writers. If you look at the MVP or CY or ROY votes of the past five years, they correlate EXTREMELY CLOSELY with how players rank on WAR.

It is absolutely true that the difference between Sheffield and Vlad and Abreu is expressed in the defensive evaluations of those men. But that difference is carried over to the BIG WAR tally. If you tell me that it's MY job to work that out, you're telling me that it's not the fault of Ford or Chevy or Toyota that my car won't run. Go blame the people who make the alternator.

FanGraphs and BB-Ref have made a car that does not run. It wasn't malicious, and I'm sure that they are working hard at correcting it. But in the meantime, we've had five years where WAR has gained credit that it hasn't earned, and the responses to any criticism of the metric are 1) you just don't understand the math enough, and 2) it's just that one part that might not work.

When I look at WAR...big WAR...it says that Sheffield and Vlad and Bobby Abreu are equivalent players. When I look at all of the component parts of each of those players...when I do the work that WAR promises it is doing for us...my work says that Sheffield should be ahead of the other two.
11:25 AM Dec 2nd

sansho1
So these guys all played during the infancy of advanced outfield defensive metrics, which I (when I was paying closer attention than I am now, I should admit) often thought were lacking in face validity. Range factors in the outfield seemingly did not account for the fact that many balls hit to OF can reasonably be handled by one of two players, and who does field them could be a matter of a private prior agreement between them, rather than a reflection of the defensive value of either. The advent of physical data (velo, angle) is surely useful in building a better metric, but is there any attempt to retroactively apply these measures to old game tape? Is it even possible?
11:13 AM Dec 2nd

TheRicemanCometh
Chuck is right. Sheffield is loathed defensively by BWAR and FWAR. FWAR says Sheffield was costing his teams 2-3 wins a year for his defense alone, which I would guess would be quite the shock to the teams he played for. I thought he was a below-average right fielder, but certainly not Greg Luzinski. WAR thinks he was Barney Fife.
10:45 AM Dec 2nd

3for3
It is easy to see why BBr thinks Sheffield is similar to Abreu and Guerrero. Sheffield is rated 195 runs below average; the other 2 are +7 and -7. This is not a 'grain elevator' problem. You might disagree with the fielding ratings, but it is easy to see where they come from.
10:29 AM Dec 2nd

CharlesSaeger
If you go off the Offense+Position Adjustment WAR, here's what you get:

Vlad: 59.1
Abreu: 61.6
Sheff: 80.8

I suppose I'll have to run their defensive stats at some point. Guerrero had by far the best reputation of the three. BB-Ref hates Sheffield's fielding with the fire of 10,000 suns. How valid that is, I dunno, but Sheffield did have a good work ethic.
9:38 AM Dec 2nd
