Summarizing the comparisons of this series

By Bill James

March 5, 2019

A Philosophical Note

Contrasting a Person of Knowledge, a Scientist or a Serious Student or an Analyst, whatever you would want to call us. . . .Contrasting a Person of Knowledge with a Talk Show host or a journalist who can say whatever he/she believes to be true, a serious person should never claim to know that which he does not actually know.

We cannot always obey this rule and still participate in the public dialogue. People put you on TV, they call you on the phone; they don’t want to hear that you don’t know what the answer is. You get pushed into giving an answer.

When you run the numbers in a study like this, you find many cases in which one pitcher is at 7.3 and the other is at 7.1, and the one pitcher has an advantage in strikeouts and walks but the other pitcher has an advantage in ERA+, and the one pitcher has an advantage in innings pitched but the other in WHIP. It is always tempting to give in to the impulse to choose, the impulse to say that one is right and the other is wrong.

A serious person should never do this. When we don’t know how to sum up all of the moving parts, then we should admit that we don’t know and leave it there.

A Summary of the D-WAR/R-WAR comparison

We have reviewed in this series of articles the pitchers from 1921 to 2018, which is 98 years, and two leagues in each year, so that is 196 choices as the top pitcher in each league. We looked at each one in two ways, so that would be 392, and then we looked at Cy Young winners and sometimes FanGraphs, so. . .

Anyway, of those 196 leagues, there are:

107 leagues in which my new method (D-WAR) agrees with Baseball Reference as to who the top pitcher in the league is, and

89 leagues in which it does not agree.

55% agree, 45% do not. Of the 89 leagues in which we do not have agreement, there are 42 cases in which, after careful examination, I could not say which of the two candidates was the better nominee as the league’s best pitcher (the two candidates being the league leader in D-WAR, and the league leader in R-WAR.)

That leaves 47 other leagues in which we have a disagreement. Of those 47 leagues, there are 26 in which it would seem to me that my new method has it right (and Baseball Reference has it wrong), and there are 21 in which it would seem to me that Baseball Reference probably has the better nominee. This somewhat understates the extent to which, it seemed to me, D-WAR outperformed R-WAR in choosing the league’s best pitcher. In the early part of the study, the years up until 1956, R-WAR held an 11-5 edge, as D-WAR often had a suspect answer due to missing data. Since 1956, however, when we have full data for starting pitchers, D-WAR gets a 21-10 advantage, granted that this "advantage" is merely in my judgment; you would have to decide case by case whether you agreed with that or not.

Since the Cy Young vote started, there have been 66 cases in which the two systems agreed as to who was the best pitcher in the league, and therefore who should have been the Cy Young Award winner (unless it was a reliever). In those 66 cases there are 43 in which that player DID win the Cy Young Award, and 23 in which he did not.

From 1956 to 1984 there are 23 cases in which the two systems agreed, but the pitcher who the two systems agree was the best in the league was given the award only 10 times, or 43%. Since 1985, however, there have been 43 "agreements", and in those 43 the pitcher that we agreed upon has won the award 33 times, or 77%. This obviously reflects the growing influence of analytical systems on the voting. But going back to this:

We have a count of 196 leagues, and 26 cases in which it seems to me that Baseball Reference pretty clearly has the wrong answer. That’s 13%, but it is really more than 13%. Since 1956 it is more like 17%. It would be my judgment that Baseball Reference is not more than 85% accurate in evaluating pitchers, and could be somewhat less than that.
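The percentages in this summary can be re-derived from the counts quoted above. What follows is only a rough check, not part of the study itself. The variable names are made up for illustration, and the 17% figure is read here as 21 cases out of the 126 leagues since 1956, which is an assumption about how that number was reached.

```python
# Tallies quoted in the summary above; all names are illustrative only.
leagues = 98 * 2                      # 1921-2018, two leagues per year
agree, disagree = 107, 89
undecided = 42                        # disagreements with no clearly better nominee
dwar_better, rwar_better = 26, 21     # the author's case-by-case judgments

# The counts should partition cleanly.
assert agree + disagree == leagues
assert undecided + dwar_better + rwar_better == disagree


def pct(part, whole):
    """Share of `whole`, as a percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print("agreement rate:", pct(agree, leagues))
print("R-WAR judged wrong, all years:", pct(dwar_better, leagues))

# ASSUMPTION: "since 1956" is taken as 63 seasons x 2 leagues = 126 leagues,
# with the 21 post-1956 cases in which D-WAR was judged to have it right.
leagues_since_1956 = (2018 - 1956 + 1) * 2
print("R-WAR judged wrong, since 1956:", pct(21, leagues_since_1956))

# Cy Young results in cases where both systems name the same pitcher:
# (times that pitcher won the award, number of agreements).
cy_young = {"1956-1984": (10, 23), "1985-2018": (33, 43)}
for span, (won, agreed) in cy_young.items():
    print(span, pct(won, agreed))
```

The two assertions are the useful part: they confirm that 107 + 89 and 42 + 26 + 21 add up to the totals given in the text before any percentage is taken.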

The R-WAR Most Valuable Players

This study covers 98 years, two leagues each, 196 leagues. In those 196 leagues, there are 75 cases in which Baseball Reference identifies a starting pitcher as being the most valuable player in the league, or the player with the highest WAR. There are no cases in which Baseball Reference identifies a relief pitcher as the "true" MVP, and there are 121 cases in which Baseball Reference identifies a position player as the true MVP.

Of the 121 R-WAR Most Valuable Players who were position players, most have been outfielders. 71 of the 121 were outfielders. A full breakdown of the 196 players, by position:

	Starting Pitchers	75
	Outfielders	71
	Second Basemen	15
	First Basemen	13
	Shortstops	12
	Third Basemen	8
	Catchers	2
	Relief Pitchers	0

		196
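As a quick check on the table, the counts can be turned into shares. This is a minimal sketch using only the numbers in the table; the dictionary and variable names are illustrative, not anything from Baseball Reference.

```python
# Counts copied from the position table above.
by_position = {
    "Starting Pitchers": 75,
    "Outfielders": 71,
    "Second Basemen": 15,
    "First Basemen": 13,
    "Shortstops": 12,
    "Third Basemen": 8,
    "Catchers": 2,
    "Relief Pitchers": 0,
}

total = sum(by_position.values())
pitchers = by_position["Starting Pitchers"] + by_position["Relief Pitchers"]
position_players = total - pitchers

# Share of all R-WAR MVPs who are starting pitchers.
pitcher_share = 100 * by_position["Starting Pitchers"] / total
# Share of the position-player MVPs who are outfielders.
outfield_share = 100 * by_position["Outfielders"] / position_players
# Share of all R-WAR MVPs who are either a starting pitcher or an outfielder.
pitcher_or_outfield_share = (
    100 * (by_position["Starting Pitchers"] + by_position["Outfielders"]) / total
)

print(total, position_players)
print(round(pitcher_share, 1))
print(round(outfield_share, 1))
print(round(pitcher_or_outfield_share, 1))
```

On these counts, starting pitchers are a little over 38% of the total, outfielders are just under 59% of the position players, and the two groups together are about 74.5% of all 196.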

The only catchers to be recognized by Baseball Reference as the true Most Valuable Players in their leagues were Gary Carter in 1982 and Buster Posey in 2012.

Of the 121 position players who are seen by Baseball Reference as the Most Valuable Players, 70 are accounted for by just thirteen players:

	Babe Ruth	9
	Willie Mays	9
	Rogers Hornsby	8
	Ted Williams	6
	Barry Bonds	6
	Mickey Mantle	5
	Alex Rodriguez	5
	Albert Pujols	5
	Stan Musial	4
	Mike Trout	4
	Jimmie Foxx	3
	Carl Yastrzemski	3
	Cal Ripken	3

		70

Ruth and Hornsby may also have been Baseball Reference MVPs prior to 1921, I don’t know; I would guess that they were. The only pitchers who have been Baseball Reference MVPs three or more times are Lefty Grove (3), Bob Gibson (3), Greg Maddux (3) and Roger Clemens (4).

One way to look at it is that the player recognized by Baseball Reference as the true MVP is usually, 74% of the time, either a dominant superstar like Trout or Pujols, or a starting pitcher. Dominant superstars who "win the title" repeatedly account for 70 of the 196; starting pitchers account for 75.

The question is, is the frequency with which Baseball Reference recognizes a starting pitcher as the true Most Valuable Player reasonable, or excessive? We can break the pitcher/non pitcher split down into five eras:

1) In the pre-BBWAA MVP era (1921 to 1930), there were 20 MVPs, of whom only two were pitchers. Sixteen of the other 18 were won either by Babe Ruth or Rogers Hornsby.

2) In the era in which there was a BBWAA Most Valuable Player, but no Cy Young Award, there were 50 leagues (1931 to 1955). The BBWAA recognized a starting pitcher as the MVP nine times in those 50 leagues, or 18%. Baseball Reference recognizes a starting pitcher as the MVP 20 times, or 40%.

3) In the first quarter-century of the Cy Young Award (1956 to 1980), there were, again, 50 leagues. In those years, the BBWAA recognized a starting pitcher as the MVP 5 times, or 10% of the 50 awards. Baseball Reference recognizes a starting pitcher as the MVP 23 times, or 46%.

4) In the closing years of the 20th century (1981 to 1999) there were 38 leagues. In those years, the BBWAA recognized a starting pitcher as the MVP only one time, or 3%. Roger Clemens in 1986 was the only starting pitcher in those years to be recognized as the Most Valuable Player. Baseball Reference, on the other hand, recognizes a starting pitcher as the most valuable player 20 times in 38 leagues, or 53%.

5) In the 21st century (2000 to 2018) there have been, again, 38 leagues. In those leagues, the BBWAA has recognized a starting pitcher as the MVP two times (Verlander and Kershaw), or 5%. Baseball Reference identifies a starting pitcher as the MVP ten times, or 26%.

Let’s chart that data:

	Years	Leagues	BBWAA MVP Awards to Starting Pitchers	Baseball Reference MVPs Who Are Starting Pitchers
Era 1	1921 to 1930	20	No BBWAA Awards	2
Era 2	1931 to 1955	50	9	20
Era 3	1956 to 1980	50	5	23
Era 4	1981 to 1999	38	1	20
Era 5	2000 to 2018	38	2	10
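The chart can be converted to percentages era by era. The sketch below is only a check on the arithmetic, with the rows copied from the chart; the tuple layout and the helper name `share` are assumptions made for the example.

```python
# (label, years, leagues, BBWAA pitcher MVPs, Baseball Reference pitcher MVPs)
# None marks Era 1, when there was no BBWAA award.
eras = [
    ("Era 1", "1921 to 1930", 20, None, 2),
    ("Era 2", "1931 to 1955", 50, 9, 20),
    ("Era 3", "1956 to 1980", 50, 5, 23),
    ("Era 4", "1981 to 1999", 38, 1, 20),
    ("Era 5", "2000 to 2018", 38, 2, 10),
]


def share(count, leagues):
    """Whole-number percentage of leagues, or None when there was no award."""
    return None if count is None else round(100 * count / leagues)


rows = [
    (label, share(bbwaa, n), share(bref, n))
    for label, _years, n, bbwaa, bref in eras
]
for label, bbwaa_pct, bref_pct in rows:
    print(label, bbwaa_pct, bref_pct)

# Overall Baseball Reference rate across all five eras.
total_leagues = sum(era[2] for era in eras)
total_bref = sum(era[4] for era in eras)
print(total_bref, total_leagues, share(total_bref, total_leagues))
```

The five Baseball Reference counts sum to 75 of 196 leagues, which is where an overall rate of about 38% comes from.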

So we get back to the question: is the frequency with which Baseball Reference identifies a starting pitcher as the true MVP reasonable, or excessive?

I think it is a little bit high. The proposition that 38% of Most Valuable Players should be pitchers is not unreasonable. If you ask, "What percentage of value in winning games is contributed by pitchers?" 38% is a pretty good answer. Also, the fact that the MVP voters of the BBWAA rarely give the award to a starting pitcher is not evidence against Baseball Reference WAR; it merely means either that the percentage in Baseball Reference is too high or the percentage by the BBWAA is too low, and it seems obvious that the second half of that is true.

That’s not exactly the problem. 38% is not unreasonable, but some of the selections seem like a reach. From 1983 to 1999 Baseball Reference identifies a starting pitcher as the National League’s Most Valuable Player eleven times—John Denny in 1983, Dwight Gooden in 1985, Mike Scott in 1986, Orel Hershiser in 1988, Tom Glavine in 1991, Greg Maddux in 1992, 1994 and 1995, Jose Rijo in 1993, Kevin Brown in 1998, and Randy Johnson in 1999. Some of those selections are reasonable. Some of them, I would suggest, are not too reasonable.

From 1966 to 1980, Baseball Reference sees a starting pitcher as the National League’s true MVP ten times in 15 years: Sandy Koufax in 1966, Bob Gibson in 1968, 1969 and 1970, Ferguson Jenkins in 1971, Steve Carlton in 1972 and 1980, Tom Seaver in 1973, Rick Reuschel in 1977, and Phil Niekro in 1978. Combining those two stretches, Baseball Reference sees a starting pitcher as the NL MVP 21 times in the 34 years from 1966 to 1999.

Sometimes you can buy it, but. . . Hank Aguirre as the American League MVP in 1962? Really? Earl Wilson was more valuable than Frank Robinson in 1966? Sam McDowell in 1965? Watty Clark in 1931? Ned Garver in 1950? Between 1931 and 1942, Baseball Reference sees a starting pitcher as the National League MVP eight times in 12 years: Watty Clark (1931), Carl Hubbell (1933 and 1936), Dizzy Dean (1934), Bucky Walters (1939), Claude Passeau (1940), Whitlow Wyatt (1941) and Mort Cooper (1942).

I am not arguing that the BBWAA has it exactly right; I’m pretty sure they don’t. But let’s combine these two things: one, that starting pitchers are seen as the MVPs more often than anyone else, and two, that when it isn’t a starting pitcher, it is an outfielder about 60% of the time. It is a starting pitcher or an outfielder 74% of the time. Are catchers and infielders, you think, being cheated? If you just moved a little bit of credit away from the pitcher and to the fielders—let’s say 2% of the game—that probably would knock down not only the number of MVP-WAR designations going to pitchers, but also the number going to outfielders. You probably would wind up with an MVP Award, when it doesn’t go to a pitcher, going occasionally to Roy Campanella rather than Duke Snider, or to Yogi Berra rather than Minnie Minoso, or to Johnny Bench rather than. . . .well, no; in Bench’s era they all go to starting pitchers.

We’ll finish this up tomorrow; thanks for reading.

COMMENTS (10 Comments, most recent shown first)

tangotiger
Those three links once more:
Kimbrel:
www.fangraphs.com/statss.aspx?playerid=6655&position=P#winprobability

Career Leaders:
www.fangraphs.com/leaders.aspx?pos=all&stats=rel&lg=all&qual=y&type=3&season=2018&month=0&season1=1871&ind=0&team=0&rost=0&age=0&filter=&players=0

2018:
www.fangraphs.com/leaders.aspx?pos=all&stats=rel&lg=all&qual=y&type=3&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&sort=15,d
10:43 AM Mar 14th

tangotiger
With regard to thresholds: I think those make a lot more sense for pitchers than hitters.

To that end, I do have what you are talking about with relief pitchers, using thresholds to determine if a reliever had a "shutdown" game or a "meltdown" game.

Here is Kimbrel:
https://www.fangraphs.com/statss.aspx?playerid=6655&position=P#winprobability

In the Win Probability section, the last two columns (SD, MD) represent that. (If you click the column header, you will get the text description of the column.)

"Career" results are here, though I think that goes back to 1972 or 1974:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=rel&lg=all&qual=y&type=3&season=2018&month=0&season1=1871&ind=0&team=0&rost=0&age=0&filter=&players=0

Rivera leads with 580 Shutdowns (to go with his 121 Meltdowns).

In addition to the scale somewhat resembling Saves, it has the benefit of not discriminating against middle relievers. If they come in high leverage situations and they come out of it in good standing, they get a Shutdown.

For example, this is 2018:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=rel&lg=all&qual=y&type=3&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&sort=15,d

The top two are Treinen and Diaz. But the next 3 are non-closers. You can tell easily by the number of "Pulls", which is also on that page.

10:37 AM Mar 14th

klamb819
I'm playing catchup with this series, but I want to make this point at the risk of repeating something from earlier.

Of the many things that make this method brilliant, my favorite is this, from Bill's second installment: It "looks at the game-by-game impact of the pitcher’s performance, rather than the cumulative impact." In other words, rather than merely considering a pitcher's average for a season, it considers the percentage of starts in which a pitcher meets various incremental thresholds, both good and bad. It doesn't do that directly, but it has that effect. It gives its highest value for reaching the key threshold of 5 points above the Target Score.

I've always thought this distinction is an under-appreciated reason (among others) why Bill's metrics are superior to the Linear Weights approach that is the foundation of WAR and other comprehensive metrics. Averages are easier to work with, sure, but in order to be comprehensive (like WAR or Win Shares), a metric must consider the percentage of times a person reaches various significant thresholds. Bill has recognized the importance of threshold percentages at least since his first formula for Runs Created (which uses linear values, but applies them geometrically to generate something that isn't an average).

We know this intuitively. We know a .300 hitter who singles on 25% of his at-bats and homers on 2% is very different from a .300 hitter who singles on 18% of his at-bats and homers on 5%.

A more complex example: J.D. Martinez led MLB last year in Changes to Run Expectancy (and wouldn't CRE be a better acronym than RE24?). He increased his team's Run Expectancy by 0.113 runs per plate appearance above average. For the sake of illustration, I'm going to use arbitrary thresholds (lacking sufficient research to reveal more meaningful ones). Would you feel more enlightened to learn that Martinez increased Run Expectancy by 0.113 runs per plate appearance, or to learn what percentage of times he increased Boston's Run Expectancy by at least "the difference between a walk and an out when leading off an inning," which is about 0.60? Or how often he increased the average expectancy by the 0.24-run gain for stealing second with no outs? And also how often his plate appearances decreased Boston's Run Expectancy by significant amounts: say, the minus-0.60-run average cost of either (a) making a second out that leaves a runner on third,* or (b) getting caught stealing second base for the first out? Or maybe how many of his plate appearances decreased Run Expectancy by the cost of making the third out with a runner on second* (minus-0.32)?
.... *These are the expectancies for single base runners. I didn't take the time to calculate them with multiple runners.

I don't mean to disparage the valuable research done by Linear Weights proponents. Tom Tango is probably the hardest-working man in sabermetrics. He produced the gold-standard Run Expectancy tables, as just one example. And introduced the valuable notion that an all-inclusive metric (WAR) should also account for playing time. Also, credit where it's due: his new metrics on Statcast often reflect threshold percentages instead of averages. My complaint is not even about Linear Weights per se, but about applying so much great research to metrics that are based on averages and not on threshold percentages.

The brilliance of Bill's approach here is that he found a way to recognize threshold percentages without reciting them. They're baked into the cake of Deserved W-L instead of being served in separate containers of flour, sugar and eggs. To complete his earlier sentence that I quoted above: By looking at game-by-game impact, this approach is "giving an advantage to a pitcher who is consistently good, over a pitcher who is sometimes brilliant." In the real world, then, it gives more value to what we consider more valuable.

P.S. This approach might even be MORE valuable for relief pitching. Suppose in all but one of his outings, a reliever has a 2.13 ERA in 71-2/3 innings (having given up 17 earned runs). But there was this one game in mid-August when he got knocked around for 4 earned runs while getting only one out (and on a pickoff, to boot!) One bad game out of 68 appearances wound up hiking up his ERA by half a run to 2.63, obscuring his excellence in 67 other games.
3:38 AM Mar 6th

SteveN
Many thanks to Goldleaf for the font.
10:52 PM Mar 5th

bjames
Catchers/MVP Award list. . .looks like you missed Ernie Lombardi (1938).

S. Goldleaf gets credit for the larger font. He copied one of the articles, converted it to larger font, and re-posted it. I deleted his re-posting of it because he had left the article open to comments when I did not want comment on it, but I realized that he had a good point about needing larger font.

2:44 PM Mar 5th

doncoffin
Of course he was. I managed to have a brain fart. So, 17, not 18. (Good catch.)
1:28 PM Mar 5th

SteveN
Don, pretty sure that Peckinpaugh was a shortstop.
10:40 AM Mar 5th

doncoffin
Well for what it's worth (and assuming I can count), looks like catchers have won 18* of the actual MVP awards. So maybe the voters are getting things right more often than Baseball Reference WAR.

*Yogi 3
Campanella 3
Bench 2
Cochrane 2
Hartnett 1
Mauer 1
Posey 1
I-Rod 1
Munson 1
Howard 1
Peckinpaugh 1
Bob O'Farrell 1 (NL, 1926)
(Unless I missed someone, of course.)
10:22 AM Mar 5th

steve161
I pretty much agree that DWAR gets it right--or at least is more plausible--more often than RWAR, but I would add this: DWAR is essentially never IMplausible, whereas every now and again RWAR misses by a mile.
9:37 AM Mar 5th

SteveN
I may just be an old fart, born the same year you were, but your last few articles have had a big bold font.

Many thanks for that.

Also, enjoy the subject.
6:48 AM Mar 5th
