On the Relative Importance of Pitching Categories II

February 26, 2014
Results
 
                I am reporting today the results of my study, which I described yesterday, in which I asked "which categories of a pitcher’s record are the best indicators of his overall value?"  I’m going to start at the bottom.
 
1) Home Runs Allowed (55.4%).   Among the twelve pitching categories that I studied, Home Runs Allowed are the poorest predictor of overall value.  
 
                I figured Home Runs Allowed per 27 innings, even though the innings were essentially a constant. The pitcher allowing fewer home runs was the better pitcher 55% of the time.
 
                Two pitchers in the data, Denny Driscoll in 1882 and Bill Steen in 1914, allowed no home runs at all in their 200 innings. Both pitchers are scored at 2.7 WAR, which is above average, but barely above average, the average being 2.48. On the other end of the scale is Bronson Arroyo, 2011, who gave up 46 bombs in 199 innings.
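
The percentages in this article are pairwise agreement rates: how often the pitcher who is better in one category is also the pitcher with the higher WAR. Here is a minimal sketch of that calculation. The pairing procedure was described in the previous article, not here, so two things are assumptions: that every pair of pitchers is compared, and that pairs tied in either the category or WAR are skipped. The function name and the sample data are hypothetical.

```python
from itertools import combinations

def agreement_rate(pitchers, category, lower_is_better=True):
    """Share of pitcher pairs in which the pitcher who is better in
    `category` also has the higher WAR. Tied pairs are skipped (assumption)."""
    agree = total = 0
    for a, b in combinations(pitchers, 2):
        if a[category] == b[category] or a["war"] == b["war"]:
            continue
        a_better_cat = (a[category] < b[category]) == lower_is_better
        a_better_war = a["war"] > b["war"]
        total += 1
        agree += a_better_cat == a_better_war
    return agree / total if total else float("nan")

# Made-up illustration data, not taken from the study.
sample = [
    {"name": "A", "hr": 10, "war": 3.0},
    {"name": "B", "hr": 20, "war": 2.0},
    {"name": "C", "hr": 15, "war": 1.0},
]
print(agreement_rate(sample, "hr"))  # two of the three pairs agree
```

For a category where higher is better, such as strikeout rate, pass `lower_is_better=False`.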
   
2) ERA (57.4%).   Among the twelve elements studied here, ERA was the poorest predictor of true value, other than the rate of Home Runs Allowed.     Obviously this is a surprising conclusion, and I will discuss the implications of this below, when we are talking about Winning Percentages.  
 
3) Runs Allowed Per 9 Innings (58.9%).    Runs Allowed Per 9 innings did slightly better, as a predictor of true value, than Earned Run Average.
 
4) WHIP (60.01%).   Walks + Hits per Inning Pitched was 60% accurate as a predictor of actual value in this study—better than ERA and Runs Allowed, but not much better.      
 
5) Strikeout Rate (60.5%).  
 
6) Won-Lost Records (60.9%).   OK, the most interesting conclusion from these studies is the fact that won-lost record and its brother, winning percentage, perform better as a predictor of true value than ERA and its brothers, runs allowed per 9 innings and WHIP, so let’s deal with that here.
 
                First of all, we should make sure everyone understands that the "ERA" that performs poorly here is ERA not adjusted for the league ERA, and that this comparison is of different pitchers over a long period of time.   We are comparing here, for example, Johnny Lush in 1907 with Jamie Moyer in 2005.    Lush in 1907 had an ERA of 2.64, but the National League ERA was 2.46.    Moyer in 2005 had an ERA of 4.28, but the American League ERA in 2005 was 4.76.  Lush had a "better" ERA only if one fails to adjust the ERA for context, and it is not surprising that Moyer’s won-lost record (13-7) was better than Lush’s (10-15).   ERA is of course better as a predictor of true value than the won-lost record if one is comparing two pitchers in the same league, and ERA may be better than the won-lost record if one restricts the time frame to a limited period of years, so that there would be less variation in the league standard.    
 
                I started to think about this article in January, when I was on Brian Kenny’s show, and Brian asked me whether I would join in the effort to get rid of the "Win" stat.   I said "No", and the reason I said "No" is that I’m not generally in favor of getting rid of information.    I’m not in the business of eliminating data.
 
                Critics of the won-lost record are focused on the flaws in won-lost records, the biases and inconsistencies—but all statistics have biases and flaws of the same nature.   The won-lost record is misleading because some teams score many more runs than other teams, which makes it much easier for their pitchers to win—but it is also much easier for a pitcher to post a low Earned Run Average in some leagues than it is in others, and in some parks than it is in others.
 
                I started thinking, after that, about how I could compare the validity of a pitcher’s won-lost record to the validity of his ERA.   I am very surprised that won-lost records outperformed ERA in this study, and I certainly did not design the study so as to favor that result.    I assumed that ERA would be a better predictor of actual value than the won-lost record.   It isn’t.
 
                It isn’t over time; it might be over a shorter period of time.   The critic of the won-lost record could make the following argument: that we mentally remove the biases of the ERA when we use it, whereas people who cite won-lost records speak about them as if the pitcher actually "won" or "lost" the game.   All by himself.
                But is that true?    This is an argument about historical stats, about the fair comparison of Jack Morris to Jim Palmer to Burleigh Grimes.   I might argue that fans in general have a stronger grasp of the biases inherent in won-lost records from the teams on which people pitched than they do of the ERA bias from leagues.   Quick now, which league has a higher ERA: The National League in 1933, or in 1969?
 
                The National League ERA in 1933 was 3.33; in 1969 it was 3.60, and in 1970 it was 4.05.   Maybe you got that one right, because you guessed that I was throwing you a curve ball, or maybe you got it because you are an expert in the history of league ERAs, but I think most people would assume that the National League ERA was higher—and probably much higher—in 1933 than it was in 1969.   We organize data into batches, in our minds; we simplify to make storage room.   Comparing a 1930s season to a 1960s season, most people are going to assume that the 1960s ERA should be lower, but in fact the National League ERA was higher in 1961 than in 1931, 1932, 1933, 1935, 1936, 1937, 1938 or 1939.   The National League ERA in 1966 was higher than in 1933, and at the same level as most of the 1930s.    We don’t really have specific enough information in our heads to counter-act that bias.
 
                And that is just one of two major biases in the ERA stat (and similar and related stats, such as WHIP.)   After you adjust out the leagues, then you’ve got to worry about the parks. 
 
                The won-lost record has a tremendous advantage over almost any other stat, in that the winning percentage is always .500 in every decade, in every year, in every month—and in every park.    If you remember that the Orioles of the 1970s were a consistently outstanding team—which I think that most people do—then you can adjust for the ONLY major bias in Jim Palmer’s won-lost record.  
 
                Both won-lost record and ERA have numerous other flaws.    About the won-lost record:
 
                1) It is biased by the team on which the pitcher pitched,
 
                2) It is heavily subject to random fluctuations based on small data samples, and
 
                3) The accounting system is often irrational, and will sometimes credit a pitcher with a "win" only because he has allowed the tying run to score. 
 
                But about the ERA:
 
                1) It is subject to the bias of league norms,
 
                2) It is subject to park effects,
 
                3) It is subject to weather effects (league ERAs are always lower in September than in July, so a pitcher who pitches more in the late season has an ERA advantage),
 
                4) It is subject to the randomness of balls in play being hit at fielders or not being hit at fielders (an effect which also reaches won-lost records),
 
                5) It is subject to the vagaries of scoring decisions.    In 1900 32% of all runs were scored as un-earned, whereas in modern baseball it is less than 10%.
 
                6)  The accounting system for ERA is also irrational on numerous points, and
 
                7) The clustering of run events causes ERA also to be more subject to random fluctuations than many people assume that it is.   The fact that a pitcher pitches 200 innings in a season causes people to assume that the random effects even out, when in fact the more relevant number is not the innings pitched but the runs scored, which is normally more like 80 than 200. The process of normalization from larger numbers is slowed down by the fact that runs are normally scored in clusters.   In fact, it is quite common for a pitcher to have a 2.85 ERA one year, 4.10 the next, when in reality he has pitched as well one year as the other.   
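
The clustering point in item 7 can be illustrated with a small simulation. This is a toy model, not anything from the study: each inning starts a rally with a fixed probability, a rally keeps adding runs with another fixed probability, and every run is treated as earned. The parameter values are assumptions chosen to give a "true" ERA of 3.60.

```python
import random
import statistics

def simulate_season_era(rng, innings=200, rally_prob=0.25, extend_prob=0.375):
    """Run average for one simulated season under a toy clustering model.

    Assumptions (not from the article): an inning starts a rally with
    probability `rally_prob`; a rally scores one run and then adds another
    with probability `extend_prob`, repeatedly. Expected runs per inning
    are 0.25 / (1 - 0.375) = 0.40, i.e. a true ERA of 3.60.
    """
    runs = 0
    for _ in range(innings):
        if rng.random() < rally_prob:
            runs += 1
            while rng.random() < extend_prob:
                runs += 1
    return 9.0 * runs / innings

rng = random.Random(2014)  # fixed seed so the run is repeatable
eras = [simulate_season_era(rng) for _ in range(2000)]
print("mean ERA:", round(statistics.mean(eras), 2))
print("std dev :", round(statistics.pstdev(eras), 2))
print("range   :", round(min(eras), 2), "to", round(max(eras), 2))
```

Under these assumptions the standard deviation of a 200-inning ERA is roughly half a run, even though the pitcher's underlying ability never changes from season to season.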
 
                The critic of won-lost records could say, "OK, but we have better summary methods now.   We have Fangraphs WAR and other summary methods which fill the place of won-lost records.   We don’t need them any more."  
                But that is no more an argument to get rid of won-lost records than it is an argument to get rid of ERA, or WHIP, or any other less accurate stat.   Since won-lost records are more accurate at predicting a pitcher’s true value than ERA, why not get rid of ERA?
 
                We don’t get rid of ERA because we’re not in the business of getting rid of information.   It’s not constructive.   We need to know more information, not less. But if you can only know a pitcher’s ERA, or his won-lost record, you’re better off knowing the won-lost record.  
 
 
6) Walks Per Nine Innings (61.2%).   I will switch now from ranking the categories from the bottom to ranking them from the top. . .walks per nine innings are the 7th-worst category, which makes them the sixth-best. Walks per 9 innings outperformed strikeouts, but only by a thin margin. 
 
5) Winning Percentage (61.5%).   Winning Percentage is the 5th most accurate indicator of a pitcher’s true value, given the assumptions of this study.   I do not know why the winning percentage performed better, as an indicator of true value, than the won-lost record, but I suspect that it had to do with ties.
 
                There are a lot of ties in winning percentage, even given a constant number of innings; 10 and 10 is the same as 9-9, 11-11, 12-12 or 13-13; 8-10 is the same as 12-15, 12-9 the same as 16-12.    The won-lost record breaks those ties, considering 12-12 to be better than 10-10.   This is consistent with the assumption of WAR; if the replacement level is .300, then 12-12 is +4.8, whereas 10-10 is +4.0.    But given that all of these pitchers pitched the same number of innings, the won-lost record probably breaks those ties in a random direction, being "right" as often as it is "wrong", and thus probably enters a .500 component into what is otherwise a .615 category, which tends to drag the measured success of the category downward. 
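
The tie-breaking arithmetic above can be checked with a short sketch. The function name is hypothetical, and the .300 replacement level is the one quoted in the text.

```python
def wins_above_replacement_level(wins, losses, replacement_pct=0.300):
    """Wins minus the wins a replacement-level pitcher would be expected
    to collect in the same number of decisions."""
    decisions = wins + losses
    return wins - replacement_pct * decisions

for wins, losses in [(12, 12), (10, 10)]:
    pct = wins / (wins + losses)
    value = wins_above_replacement_level(wins, losses)
    print(f"{wins}-{losses}: winning pct {pct:.3f}, value {value:+.1f}")
```

Both records share a .500 winning percentage, but the extra decisions make 12-12 worth +4.8 against +4.0 for 10-10.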
 
4) Season Score (62.6%)  The Season Score has three components—the innings/ERA component, the won/lost/saves component, and the strikeout/walk component.     The process of figuring a season score starts by figuring a "CLI" or "Crude Leverage Index", based on the pitcher’s saves.   The purpose of the CLI is to put relievers and starters on an equal footing. Since all of these pitchers are starters, the CLI has very limited relevance to this study, and I’ll just skip the explanation of that.      
 
                I multiply the pitcher’s outs recorded (which is three times his innings pitched) by .435, subtract runs allowed, and subtract earned runs allowed.   This is, in essence, a way of saying that a pitcher with a 5.50 ERA has no value. . .parallel to the "replacement level" concept used in WAR.    Multiplying .435 by 27 outs gives 11.75 (runs plus earned runs) per 27 outs.   If a pitcher has a 5.50 ERA he might also allow another 0.75 un-earned runs per game, in general, so that’s 5.50 earned runs plus 6.25 total runs, or 11.75 per 27 outs.    A pitcher’s score is based on his runs allowed below that level, and both earned and un-earned runs count against him, but earned runs count twice as heavily against the pitcher as un-earned runs.   
 
                In addition to that I give the pitcher eight points for a win, one point for a save, take away five points for each loss.     Then I take two times strikeouts minus three times walks, divide that total by 15, and add that.   The highest Season Score in the study, by far, was 306, by Pedro Martinez in 2002, and the worst Season Score in the study was by John Harkins of the 1887 Dodgers.   Harkins finished 10-14—a decent won-lost record--but struck out 36 batters, walked 77, and had a 6.02 ERA.   
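
The Season Score steps described above can be sketched as follows. This is an approximation for starting pitchers only: it leaves out the Crude Leverage Index step, which the text skips, so it should not be read as the complete published formula. The function name and the example stat line are hypothetical.

```python
def season_score(outs, runs, earned_runs, wins, losses, saves,
                 strikeouts, walks):
    """Season Score as described in the text, without the CLI adjustment.

    Assumption: with no saves to speak of, the CLI step can be ignored,
    as the article does for this group of starters.
    """
    # Earned runs are subtracted twice: once inside `runs`, once on their own.
    run_component = 0.435 * outs - runs - earned_runs
    decision_component = 8 * wins + 1 * saves - 5 * losses
    control_component = (2 * strikeouts - 3 * walks) / 15
    return run_component + decision_component + control_component

# Hypothetical 200-inning season (600 outs); not a pitcher from the study.
print(round(season_score(600, 90, 80, 12, 10, 0, 150, 60), 1))  # 145.0
```

As a check on the zero point: over 27 outs, 6.25 runs and 5.50 earned runs give a run component of 0.435 × 27 − 11.75, which is essentially zero.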
 
                Season Scores were 63% accurate as a predictor of true value in this study.     I was disappointed that the Season Score did not do better as a predictor of true value, and I had expected it to do better.    I used to have a different way of figuring Season Scores, which relied more heavily on Wins and Losses.   I changed the method in 2009 to de-emphasize Won-Lost records.    I didn’t check this out, because I’m not going backward, but it seems obvious that the older version of Season Scores, with more reliance on won-lost records, would have performed better in this study than the new version does. John Harkins, 1887, is a good example of why this is true; he was 10-14 with a 6.02 ERA, and Fangraphs WAR puts his value (1.7) at a point entirely consistent with his won-lost record, but not at all consistent with his ERA.    On the other end of that would be Barney Pelty, 1909; Pelty was 11-11 with a 2.39 ERA.    Fangraphs gives him 1.8 WAR; like Harkins, that is consistent with his won-lost record, not at all consistent with his ERA.  
 
                But this, of course, makes again the point I made before: that the won-lost category holds value because won-lost records automatically adjust to rising and falling levels of runs scored, whereas all of the other categories studied here go up and down with changes in time and place. 
 
3)   Relative ERA.   Relative ERA was 63.6% accurate as a predictor of value, making it the third-best predictor of value among the twelve stat categories studied.    This Relative ERA was NOT park-adjusted; an ERA of 3.00 in a league with an ERA of 4.00 was 0.75, regardless of whether the park factor was 109 or 90.    Presumably, Relative ERA would have done even better, had I been able to study the park-adjusted relative ERAs.
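
As a small worked example of the Relative ERA used here (the pitcher's ERA divided by the league ERA, with no park adjustment), the sketch below applies it to the 3.00 and 4.00 case and to the Lush and Moyer seasons discussed earlier. The function name is hypothetical.

```python
def relative_era(era, league_era):
    """Pitcher ERA divided by league ERA; below 1.00 is better than league."""
    return era / league_era

print(round(relative_era(3.00, 4.00), 2))  # the example in the text: 0.75
print(round(relative_era(2.64, 2.46), 2))  # Johnny Lush, 1907
print(round(relative_era(4.28, 4.76), 2))  # Jamie Moyer, 2005
```

Lush comes out above 1.00 and Moyer below it, which reverses the order suggested by their raw ERAs.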
 
1 and 2) Strikeouts Minus Walks (65.2%) and Strikeout to Walk Ratio (65.4%) essentially tied as the best predictors of True Value, among the twelve categories studied.
 
                Of course, it is possible that Strikeouts and Walks perform best as a predictor of Wins Above Replacement because Fangraphs is relying too heavily on strikeouts and walks in their calculation of the pitcher’s value.    To this point I have assumed that WAR is a perfect and unassailable measure of a pitcher’s true value. This, of course, is not necessarily true.
 
                Going into the study, I assumed that it would not make very much difference what measure of true value I used, that one measure of overall value would be as closely tied to the elements of a pitcher’s record as another, and I had actually written a couple of sentences explaining to you why it was unlikely to matter what we used as the bottom line. I see now, at the conclusion of the study, that this assumption could be entirely wrong, and that very probably the version of WAR that we use does matter. 
 
                I would have liked to come out of this study entirely convinced that the Fangraphs WAR system does work almost perfectly, and convinced that the confidence placed in F-WAR by this study was not an issue.    I can’t say that I’m there; I’m not criticizing F-WAR, I am not saying that it is wrong in any particular case or that I have a better method, but I can’t tell you that I have unbounded confidence in it, either.    I will have to tell you, honestly, that I was surprised by how many cases there are in which F-WAR gives seemingly irrational answers in comparing two pitchers.
 
                For instance?
 
                Spud Chandler in 1942 was 16-5 with a 2.37 ERA.   Harry Howell in 1902 was 9-15 with a 4.12 ERA.     We’d tend to assume that Chandler might possibly be the better pitcher, no?
 
                Fangraphs likes Howell. The league ERAs are about the same—3.66 for Chandler, 3.57 for Howell.   Chandler’s ERA was 1.29 under the league average; Howell was .55 over.    Chandler’s strikeout to walk ratio was 74 to 74; Howell’s was 33 to 48.    Why exactly should we consider Howell to be a better pitcher?   Not saying it is wrong, you understand; I just don’t quite get it.
 
                Ah, but Harry Howell in 1902, in addition to his work on the mound, also played 26 games at second base, 15 at third base, 18 in the outfield, 11 at shortstop and one at first base.   Maybe his value comes from his work in the field?
 
                No, not really. . .at least if I understand the Fangraphs WAR charts.   Howell is +2.7 as a pitcher, +0.2 for his work as a fielder and hitter.   Chandler, on the other hand, hit a nice .211 with 8 walks, so he is +0.4 as a hitter/fielder.   The non-pitching stuff is actually helping Chandler get closer to Howell, rather than pushing Howell ahead of Chandler.   If I understand the charts correctly. ..and if I don’t understand the charts correctly, then they’re not of much use to me.  
 
                Maybe there is a good reason why Howell is better than Chandler, but I don’t know what it is, and I’m a little skeptical.
                Gil Heredia in 1999 was 13-8 but with a 4.82 ERA; Chuck Finley in 1989 was 16-9 with a 2.57 ERA.   American League ERA in 1999 was 4.87; in 1989 it was 3.89.    The Park Effect for Oakland in 1999 (Heredia) was .92; for California in 1989 (Finley) it was .93.   Finley was 26 runs better than league, park adjusted; Heredia was three runs worse than league.   DH rule; hitting is basically no factor here.   Finley gave up 13 homers; Heredia gave up 22.  
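
The article does not give the formula behind "runs better than league, park adjusted." The sketch below is one common convention, and it happens to reproduce the two figures quoted for Finley and Heredia. Two things are assumptions: half of the park effect is applied to the full season (on the theory that half the games are on the road), and innings are rounded to 200. The function name is hypothetical.

```python
def runs_vs_league(era, league_era, park_factor, innings=200.0):
    """Earned runs saved relative to a league-average pitcher in the same park.

    Assumption: the park factor describes home games only, so half of its
    effect is applied across a full season of home and road games.
    """
    season_park_factor = (1.0 + park_factor) / 2.0
    expected_era = league_era * season_park_factor
    return (expected_era - era) * innings / 9.0

print(round(runs_vs_league(2.57, 3.89, 0.93)))  # Finley: 26
print(round(runs_vs_league(4.82, 4.87, 0.92)))  # Heredia: -3
```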
 
                Fangraphs WAR says that Heredia was better than Finley (+4.0 to +3.6).   Well, OK, but I’m a little bit skeptical.
 
                Garret Stephenson in 2000 was 16-9 with a 4.49 ERA against a league norm of 4.64, strikeout to walk ratio of 123 to 63.    Joe Coleman in 1975 was 10-18 with a 5.55 ERA against 3.79 league ERA, strikeout to walk ratio of 125 to 85.    Fangraphs says that Coleman (+1.3) was better than Stephenson (+1.2).   Say what?
 
                Johnny Broaca in 1935 was 15-7, 3.58 ERA, 78 strikeouts, 79 walks.   Danny MacFayden in 1932 was 8-15, 4.39 ERA, 62 strikeouts and 70 walks.   The league ERAs are basically the same, and neither pitcher was a good hitter.   Even the park is the same for half of the year; Broaca was with the Yankees, and MacFayden was with the Yankees the second half of the year.    Fangraphs says that MacFayden was better than Broaca.   Are you sure?
 
                Hank Wyse, 1946 Cubs, and Sad Sam Jones, 1932 White Sox.    Wyse was 14-12; Jones was 10-15. Wyse’s ERA was 2.69; Jones’s was 4.48.   Wyse’s strikeout to walk ratio was 52-52; Jones’s was 64 to 75.   Wyse hit .243; Jones hit .193.    You might assume that Wyse would be a better pitcher?
 
                Nope.
 
                Mule Watson in 1922 was 8-14, 4.70 ERA; Sheldon Jones in 1948 was 16-8, 3.36 ERA. The league ERAs are almost the same, 15 points difference, strikeout to walk ratios are about the same.   Fangraphs says that Watson was better. 
 
                Mace Brown in 1939 was 9-13, 3.38 ERA, 71-52 K-W, 8 home runs allowed.   Oral Hildebrand in 1937 was 8-17, 5.15 ERA, 75-87 K-W, 18 home runs allowed.   Fangraphs says that Hildebrand was better than Brown. 
 
                Art Ditmar in 1960 was 15-9, 3.06 ERA; Rodrigo Lopez in 2010 was 7-16 with a 5.00 ERA.   The league ERAs are about the same, 15 points difference again. Each pitcher walked 56 batters.   Ditmar struck out 65 batters but gave up 25 homers; Lopez struck out 116 but gave up 37 homers.    Again, 15-9, 3.06 ERA vs. 7-16, 5.00 ERA. ..I think I would go with the guy who is 15-9 with a good ERA.   Fangraphs goes the other way.
 
                Here are a few more comparisons in which Fangraphs seems to offer a counter-intuitive answer to the question, "Which of these two pitchers had a better season?" (These WAR numbers include the batting adjustment):
 
 
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Hank      Johnson      1928  31  14   9  .609  199.0  110  104  16  4.30    4.04    0.6
Johnny    Lindell      1953  32   6  17  .261  199.0  118  139  17  4.66    4.28    2.8
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Doug      Rau          1978  30  15   9  .625  199.0   95   68  17  3.26    3.58    1.5
Gene      Conley       1961  33  11  14  .440  200.0  113   65  33  4.91    4.02    2.1
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Roger     Craig        1956  35  12  11  .522  199.0  109   87  25  3.71    3.77    1.0
Joe       Genewich     1924  34  10  19  .345  200.0   43   65   4  5.22    3.86    1.2
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Gio       Gonzalez     2010  33  15   9  .625  200.2  171   92  15  3.23    4.14    2.9
Esteban   Loaiza       2000  34  10  13  .435  199.1  137   57  29  4.56    4.92    3.3
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Bill      Sherdel      1925  32  15   6  .714  200.0   53   42   8  3.10    4.27    3.2
Frank     Foreman      1901  25  12   7  .632  199.1   42   60   3  3.88    3.66    3.5
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Fritz     Ostermueller 1934  33  10  13  .435  199.0   75   99   7  3.48    4.50    2.6
Doc       Crandall     1911  41  15   5  .750  199.0   94   51  10  2.62    3.39    1.1
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Kevin     Gross        1987  34   9  16  .360  200.2  110   87  26  4.35    4.09    1.3
Doyle     Alexander    1976  30  13   9  .591  201.0   58   63  12  3.36    3.52    1.0
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Kris      Benson       2004  31  12  12  .500  200.1  134   61  15  4.31    4.31    3.6
Shawn     Estes        1997  32  19   5  .792  201.0  181  100  12  3.18    4.21    3.4
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Tommy     Bridges      1932  34  14  12  .538  201.0  108  119  14  3.36    4.48    2.1
Johnny    Lindell      1953  32   6  17  .261  199.0  118  139  17  4.66    4.28    2.8
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Hank      Johnson      1928  31  14   9  .609  199.0  110  104  16  4.30    4.04    0.6
Ray       Benge        1929  38  11  15  .423  199.0   78   77  24  6.29    4.71    1.6
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Marcelino Lopez        1966  37   7  14  .333  199.0  132   68  20  3.93    3.44    1.5
Art       Ditmar       1960  34  15   9  .625  200.0   65   56  25  3.06    3.87    0.7
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Pat       Dobson       1973  34  12  15  .444  200.0   93   53  23  4.41    3.67    1.7
Doc       Crandall     1911  41  15   5  .750  199.0   94   51  10  2.62    3.39    1.1
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Ray       Collins      1912  27  13   8  .619  199.1   82   42   4  2.53    3.34    3.2
Jim       Kaat         1961  36   9  17  .346  200.2  122   82  12  3.90    4.02    4.1
                           
First     Last         Year   G   W   L  WPct     IP   SO   BB  HR   ERA  Lg ERA  Total
Al        Demaree      1913  31  13   4  .765  200.0   76   38   4  2.21    3.20    2.6
Andy      Ashby        1997  30   9  11  .450  200.2  144   49  17  4.13    4.21    3.0

                Again, and very sincerely, I am not saying that Fangraphs is wrong in any of these comparisons; it is quite possible that, if I knew more about the calculations, I would agree with them.    I know very well that many times what seems like an obvious conclusion from the statistics will not stand up to closer scrutiny, and the researcher will wind up arguing that what appears to be true is not actually true.   I’ve seen that myself a thousand times.
 
                But I also know very well, from years of doing it, that the process of weighing and measuring every stat so as to determine overall value is a treacherous and difficult task, and that there are thousands of ways you can get the wrong answer.     I’ve done that myself a great many times, as well.    I am less convinced than I would like to be that these evaluations are accurate.
 
 
 

COMMENTS (32 Comments, most recent shown first)

tangotiger
Mike:

It's one thing to look at the 2013 data, and see how the various components correlate to each other. So, how do the SO, BB, and HR elements correlate to runs allowed, if you only focus on the 2013 data?

But, that's kind of limiting. What is more exciting is the predictive nature. What does the 2013 data tell us about the 2014 performance? So, what forecasts the 2014 runs allowed better: the 2013 SO, BB, HR in some form, or 2013 runs allowed? What forecasts the 2014 win% of a pitcher: his 2013 win% or his 2013 runs allowed (relative to league)?

And the really exciting part is when you find that something other than the expected does the better job. It's why we reject a pitcher's win%, because we can do a better job of predicting future win% by using past runs allowed (relative to league), rather than past win%.

This indicates a certain level of random variation when a metric can't predict itself better than some other metric.

That's gotta be worth at least 3 cents.
6:42 AM Mar 1st
 
mauimike
..."what a regression analysis would have to say about the predictive nature of the metrics." For three dollars a month does someone explain this to me? Or is this like, "see the ball, hit the ball." And don't try to explain it to me, I don't want to know. Another day goes by and I still haven't used algebra.
1:01 AM Mar 1st
 
belewfripp
Bill - did you do any regression analysis to study the predictive effects, or was it simply the matching exercise that you detailed in the write-up itself? Because, as I'm sure you know, correlation does not necessarily mean that one factor is predictive of another or that one is causing the other.

There could be - as in this case there is for some of the factors - a third factor that is in fact causing both factors to change, and because both are being acted on by this third factor, they show a correlation with each other. In this case, the third factor is FIP, which has relationships with both F-WAR and several of the factors you tested.

I could also use a random number generator to produce 1000 sets of 1000 numbers, and there would undoubtedly be sets of 1000s that were very strongly correlated with other sets, but it wouldn't mean anything. It would be nice to know what a regression analysis would have to say about the predictive nature of the metrics.​
6:05 PM Feb 28th
 
tangotiger
It doesn't matter what he uses, because they each take a position. Fangraphs is based on FIP, which means the strongest correlation will be those based on SO, BB, HR. BR.com is based on Runs Allowed, so Relative ERA does the best.

As for Win Shares: that one is not necessarily so clear cut, but the end result is that whatever correlates the best with Win Shares, then that's really what the core of Win Shares is.

What Bill's process is really doing is reverse-engineering these metrics.

So, once you get past that, you can get to the fun stuff, which means of those things NOT part of the process, then what correlates best. With Fangraphs (i.e., FIP), what correlates best: ERA or win%? And the answer was win%, but only because ERA wasn't adjusted for its run environment.

Similarly, if Bill used Win Shares, we'd learn (a) what is its core and (b) of the things not part of the core, what correlates better to Win Shares.

9:42 AM Feb 28th
 
wdr1946
Is there any reason you aren't using Baseball Reference's WAR rather than Fangraphs' WAR? And how does your Win Shares data accord with your findings?
10:48 PM Feb 27th
 
David Kowalski
Looking at the top fifteen pitchers for career strikeout to walk ratio and won-lost percentage, only one name appears on both lists: Pedro Martinez, who is 6th in won-lost percentage and third in the ratio of strikeouts to walks.

The top 15 in strikeout to walk ratio are:

1. Tommy Bond
2. Curt Schilling
3. Pedro Martinez
4. Mariano Rivera
5. Dan Haren
6. Cliff Lee
7. Cole Hamels
8. Jim Whitney
9. Trevor Hoffman
10. Doug Jones
11. Jon Lieber
12. Bret Saberhagen
13. Monte Ward
14. Ben Sheets
15. Mike Mussina

Interesting names among the top 15 in won-lost percentage are:

Al Spalding (1st with .795)
Spud Chandler (2nd with .717)
Whitey Ford (3rd with .690)
Pedro Martinez (6th with .687)
Don Gullett (7th with .686)
Lefty Grove (8th with .680)
Smoky Joe Wood (10th with .676)
Babe Ruth (11th with .671)
Christy Mathewson (15th with .665)

Ricky Nolasco is 19th in career strikeout to walk ratio all time with a 3.52 to 1 ratio. That may explain his large recent contract.


1:35 PM Feb 27th
 
David Kowalski
I looked at two nineteenth century pitchers who attracted me originally because of an almost identical number of wins and a similar won-lost percentage. Jim McCormick was 265-214 with a won-lost percentage of .553. Gus Weyhing was 264-232 with a .532 won-lost percentage. McCormick, however, had a career ERA of 2.48 to Weyhing's 3.88. Part of this was due to Weyhing playing in the high scoring 1890's (all of it). McCormick's last season was 1887.

What's interesting here is that McCormick had a WAR of 75.5 and a so/bb ratio of 2.28. Weyhing was at 46.4 and 1.06. Their WHIPs, ERA, wins, won-lost percentage, and ERA+ were all closer, often far closer than the strikeout to walk ratio or the WAR ratios. If you take the difference between won-lost percentage from .500 (e.g. 53 and 32), won-lost percentage for these two pitchers correlates almost perfectly with WAR (+63% for won lost percentage vs +63% with WAR).

This is highly anecdotal and I'd need maybe ten matches of similar win totals to do this, but the anecdote confirms your finding.
1:05 PM Feb 27th
 
Arrojo
If I am remembering correctly, an old Baseball Abstract did point out that K/9 was the best (or a great stat, if the 'best' stat cannot objectively be known) for determining a pitcher's longevity. Does that still hold true 20-odd years later?
10:48 AM Feb 27th
 
KaiserD2
I have two comments.

First, as far as I can see, Bill did not incorporate park effects in any of his measurements, and I think that's very surprising, and that they could change the results quite a lot.

But my second comment is that the restriction of pitchers who pitched 200 innings is highly arbitrary. The main reason Pedro Martinez is by far the greatest pitcher with 200 innings, is that no pitcher that great would have been limited to 200 innings in any earlier era. Even for the contemporary era that's quite a low number. I can't say exactly HOW that would skew the results, but I tend to think that it would. Most great pitchers were largely excluded from the study.

David Kaiser
9:04 AM Feb 27th
 
shinsplint
Bob, thanks for the opening. I couldn't resist :-) ERC, or Component ERA, does have a fielding bias for pitchers on good fielding teams because it includes the number of base hits in the formula. FIP came about, I think, in response to that problem by not including hits at all. But FIP still has a bias for pitchers with good fielding because it divides by innings. Therefore a pitcher with better fielders, who turn more batted balls into outs, gets a lower FIP. I use a formula to get around it by dividing by batters faced. Well, not exactly--batters faced would distort the results, so I "scale" batters faced to a value that is in proportion with a number that is in the range of innings pitched. Anyway, you know how I feel about Reuschel/Morris, so I'll just say that there's no correlation between iconic 'staches and pitcher value.
8:31 AM Feb 27th
 
tangotiger
In the Value section on the Fangraphs player pages, they have their "official" WAR, but they also have the run-based WAR (RA9-WAR). And they break down the difference between the two with LOB-wins and BIP-wins.

Therefore, the reader is free to choose to create their OWN metric of WAR, by combining what they think should count.
6:18 AM Feb 27th
 
OldBackstop
g-dammit I make one intelligent comment in four years and nobody replies.
10:46 PM Feb 26th
 
wovenstrap
I'm very interested in Bill's comments about the clustering bias of ERA. I think we all know of pitchers who have had seasons that were ruined by literally three innings of work in which everything blew up on them (it occurs to me now that this would add a managerial bias to the list, which Bill didn't mention -- some managers will leave a pitcher out there to die because doing so helps the team win tomorrow).

But what I was going to say was, can't ERA be adjusted slightly with an additional metric that tries to isolate the number and frequency of very bad innings? If a pitcher has a 20% chance of giving you a 6-run inning every time you send him out there, that's very different from a guy who has a 4% chance of doing so. In the Olympics (Bill's favorite thing) they take away the best and worst scores from the tally of judges' scores. You could do something like that, institutionalize the knowledge that "this pitcher is good, he just got beat up a couple times."
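A rough sketch of what such an adjustment might look like, in Python. Everything here is an assumption for illustration: the function names, the data layout (one list of runs allowed per inning for each start, whole innings only, all runs treated as earned), and the choice to trim whole starts the way Olympic judging trims scores. It is not a method from the article.

```python
def era(starts):
    """ERA over a list of starts; each start is a list of runs allowed per inning."""
    innings = sum(len(s) for s in starts)
    runs = sum(sum(s) for s in starts)
    if innings == 0:
        raise ValueError("no innings to evaluate")
    return 9 * runs / innings


def blowup_rate(starts, threshold=6):
    """Share of starts containing at least one inning of `threshold` or more runs."""
    return sum(1 for s in starts if max(s) >= threshold) / len(starts)


def trimmed_era(starts, trim=1):
    """ERA after dropping the `trim` best and `trim` worst starts by runs allowed."""
    ranked = sorted(starts, key=sum)           # fewest runs first
    kept = ranked[trim:len(ranked) - trim]     # drop both tails
    return era(kept)


# Hypothetical five-start sample with one six-run inning in it.
sample = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 2, 0, 0],
    [0, 0, 6, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(era(sample), trimmed_era(sample), blowup_rate(sample))
```

Reporting the blow-up rate alongside the trimmed ERA keeps the "he just got beat up a couple times" information visible instead of throwing it away.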
10:44 PM Feb 26th
 
OldBackstop
Hi Bill, question: does a pitcher starting a game have a better chance at a win (or actually a decision) than a reliever? The reason I ask is that looking at the comparison above of old-timey pitchers, say Doc Crandall, he did indeed have around 200 IP in 1911, but those came in 15 GS in 41 games. Baseball Reference doesn't have game logs so I couldn't check where Doc's 15-5 record came from (that made up his .750 winning percentage). But spot checking him with Andy Ashby in 1997, Ashby's 200.2 IP all came from 30 GS.
8:44 PM Feb 26th
 
TJNawrocki
It's worth noting that BABIP isn't the only part of a pitcher's record that Fangraphs WAR ignores. Looking at Doug Rau in 1978, one of the pitchers Bill cites as under-rewarded in WAR, he got 20 double-play balls that year (second on the Dodgers only to Tommy John) while allowing just six stolen bases in 16 attempts. F-WAR doesn't think any of that is to Rau's credit.
7:34 PM Feb 26th
 
doncoffin
The other thing I can't help noting is how closely bunched these results are. The worst performing piece of data--home runs allowed--matched up with fgWAR 55% of the time. The best--the K/BB comparisons--matched up 65% of the time. Or the worst metric was about 85% as good as the best metric. I would have expected more variation than that, I guess.
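The arithmetic behind that 85% figure, as a small Python sketch using the rounded percentages quoted above. The second ratio rests on an assumption not made in the comment: that a coin flip between two pitchers would match WAR about 50% of the time, so 50% rather than 0% is the floor.

```python
worst = 0.55   # home runs allowed, share of pairs matching fgWAR
best = 0.65    # K/BB comparisons, share of pairs matching fgWAR

# Raw ratio of the two match rates.
raw_ratio = worst / best

# Ratio of each metric's edge over an assumed 50% coin-flip baseline.
baseline = 0.50
edge_ratio = (worst - baseline) / (best - baseline)

print(round(raw_ratio, 3), round(edge_ratio, 3))
```

Measured against the coin-flip floor, the worst metric has roughly a third of the best metric's edge, so the spread may be wider than the raw percentages suggest.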
7:03 PM Feb 26th
 
rgregory1956

Hey Tom, using winning percentage was a gag on my part. The ERC, not so much. Not that I think that ERC is a "good" stat, but I use it on occasion. The thing that I've noticed about ERC is that, at least for post-WWII pitchers, when a pitcher is an unpopular sabermetric HOF candidate but a strong BBWAA HOF candidate, like Morris or Hunter, the ERC is very often much lower than their ERA. ERC is a flawed metric (but then almost all metrics are), but I compute Morris having a 112 ERC+ to Reuschel's 111 ERC+. Hunter, by the way, I have at 114. As shinsplint would be quick to point out, ERC doesn't take fielding into account.

6:50 PM Feb 26th
 
bjames
Responding to

Bill, how hard would it be to run the data for "Won-Loss % Adjusted By Team W/L%"?

I couldn't do it. I don't have the data in any organized fashion to launch such a study.

5:22 PM Feb 26th
 
jemanji
Bill, how hard would it be to run the data for "Won-Loss % Adjusted By Team W/L%"?

Your study pushed W-L forward as having, perhaps, more value than some people think - I mean, some people think that it is *harmful* information that should be censored out of the discussion. ;- )

Tom points out the Felix and Verlander case … and you point out (in the main article) that the average fan is aware that Felix pitches for the lowly Mariners. 98% of fans see Felix' .500 W/L and just snap back, "Yeah, sure, for the Mariners. Stick Felix on the Red Sox and he wins 20 a year."

We normalize ERA for park; why not normalize W/L% by team W/L%? I'll bet you that stat, over a series of years, would correlate pretty strongly with a pitcher's effectiveness.
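One possible way to define such a stat, sketched in Python. This is an assumed definition, not one from the article: it compares the pitcher's winning percentage with the team's winning percentage in all decisions that did not go to him. The function name and the sample records are hypothetical.

```python
def win_pct_vs_team(p_wins, p_losses, team_wins, team_losses):
    """Pitcher's W/L% minus the team's W/L% in decisions not charged to him."""
    rest_wins = team_wins - p_wins
    rest_losses = team_losses - p_losses
    pitcher_pct = p_wins / (p_wins + p_losses)
    rest_pct = rest_wins / (rest_wins + rest_losses)
    return pitcher_pct - rest_pct


# Hypothetical records: a 13-12 pitcher on a 61-101 team
# versus a 17-8 pitcher on a 95-67 team.
weak_team_ace = win_pct_vs_team(13, 12, 61, 101)
strong_team_ace = win_pct_vs_team(17, 8, 95, 67)
print(round(weak_team_ace, 3), round(strong_team_ace, 3))
```

With these made-up numbers the 13-12 pitcher comes out ahead of the 17-8 pitcher once team quality is removed, which is the intuition behind the "stick Felix on the Red Sox" reaction.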

….

Also, the *career* ERA and W/L leaders are a pretty impressive list of names, as are the *single-year* leader boards for any league. You don't have random names at the OUTSIDE OF THE CURVE on ERA and W/L.

Like you say, the stats contain information.


4:33 PM Feb 26th
 
tangotiger
gregory:

If you want to argue a metric with the central component as a pitcher's unadjusted win%, then your head must have exploded when Felix and Verlander both signed similar extensions, even though their W/L records are not comparable.

As for a component-based metric: calculate it and see. I would NOT pre-suppose that this would be in Morris' favor.
4:07 PM Feb 26th
 
rgregory1956

Hey Tom, thanks for the WAR formula below. It's a general equation that seems to be useable for other forms of study.

Just to be a horse's patoot, using winning percentage, Morris ranks higher than Reuschel (I can hear Brian Kenny's scream all the way here in Indiana!). One that might be close to being even is ERC or CERA, relative to the league. Not that that wouldn't be a flawed metric.

3:31 PM Feb 26th
 
tangotiger
Since Baseball Reference's version of WAR (rWAR) uses runs allowed per 9IP (RA9) as its core (relative to its run environment), you should be able to guess what's going to happen: Relative ERA would lead by far.

The key point is what Bill said in this paragraph, which everyone should try to appreciate:
"Going into the study, I assumed that it would not make very much difference what measure of true value I used, that one measure of overall value would be as closely tied to the elements of a pitcher’s record as another, and I had actually written a couple of sentences explaining to you why it was unlikely to matter what we used as the bottom line. I see now, at the conclusion of the study, that this assumption could be entirely wrong, and that very probably the version of WAR that we use does matter. "

By using rWAR as the metric to correlate against, and using RA9 as one of the metrics, you are, in essence, correlating x to x. Hence, the key point is what Bill said in his last sentence.



1:03 PM Feb 26th
 
chuck
It seems as though this type of animal, a pitcher who can limit line drives and/or hits in play better than others, is seen now as non-existent, or, if such pitchers exist, that this talent or tendency is seen as being of little value. It may be that, in baseball right now, limiting hits in play is of less relative value than having a large strikeout-to-walk margin; but I would guess that in other eras, when K’s and HR’s were much less common, such a ball-in-play ability would be of greater value. The value of a great defense in those eras would usually trump that pitcher ability, but the pitchers who could also limit the frequency of hard-hit balls would be able to add another layer of success to such a defense, or make up somewhat for a defense that was merely average.
A large part of Clayton Kershaw’s success last year was limiting hits in play, which he did at a much larger rate than the other Dodger pitchers. FIP doesn’t recognize this outperformance of his team defense (or underperformance, in cases like Tommy John) as an ability, or as a value. The pitchers below all added success and value in this way, I believe:

Reulbach, Joss, Ruth, W.Johnson, P. Alexander, S.Coveleski, Walsh, Shocker, Warneke, Lyons, Garver, Spahn, Tiant, Seaver, Messersmith, Palmer, Koufax, McLain, Hunter, Jenkins, knuckleballers P.Niekro, Rommel, Wakefield, Wilhelm, and Hough, Eckersley, Stieb, Clemens, Sid Fernandez, Johan Santana, Jered Weaver, Clayton Kershaw, and Matt Cain.

It was one of the main components of Hunter’s success in the early 70’s, of Palmer’s success given an already great defense, of the success of the knuckleballers and spitballers, and certainly of Dave Stieb, who typically had not very good walk rates and not very imposing strikeout rates. From 1979 through 1991 Stieb had a better average allowed on balls in play than his teammates every year except 1986, in which he went 7-16. In seven of those years he was 19 to 34 hits better. F-WAR doesn’t see this ability.

A pitcher that can shave one hit per start off his game score improves his season average game score by at least 2 points, more if one also considers the run impact. If a hit in play is valued at around .5 runs, that becomes another 2 points added to his average game score. From Bill’s article on game scores and winning percentages, an improvement of 4 points on one’s game score means an improvement of 60 percentage points, on average, in those games the pitcher starts. A less stellar .5 hits per start (say 17 hits better than team, over a season), looks to be worth about 30 added points of winning pct to the team, in those starts.
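The arithmetic in that paragraph can be checked with a short Python sketch. The 2-point hit penalty and 4-point earned-run penalty are the standard Game Score weights; the half-run value of a hit and the 60-points-per-4-Game-Score-points conversion are the figures quoted above. Treating the saved runs as earned, and the 34-start season implied by "17 hits" at 0.5 per start, are assumptions.

```python
HIT_PENALTY = 2            # Game Score points lost per hit allowed
EARNED_RUN_PENALTY = 4     # Game Score points lost per earned run
RUNS_PER_HIT = 0.5         # run value of a hit in play, as estimated in the comment
WIN_PCT_POINTS_PER_GS_POINT = 60 / 4   # 4 Game Score points ~ 60 points of win pct


def game_score_gain(hits_saved_per_start):
    """Game Score points gained per start from hits prevented and the runs they carry."""
    per_hit = HIT_PENALTY + RUNS_PER_HIT * EARNED_RUN_PENALTY
    return hits_saved_per_start * per_hit


def win_pct_points_gain(hits_saved_per_start):
    """Points of winning percentage (thousandths) gained in the pitcher's starts."""
    return game_score_gain(hits_saved_per_start) * WIN_PCT_POINTS_PER_GS_POINT


one_hit = win_pct_points_gain(1.0)        # one hit saved per start
half_hit = win_pct_points_gain(17 / 34)   # 17 hits saved over an assumed 34 starts
print(one_hit, half_hit)
```

Under these assumptions the sketch gives 60 points for one hit per start and 30 points for half a hit per start, matching the figures in the comment.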

Either it's an ability or it's not. If it is, it has value, (and an inability to limit hits, relative to team, has a negative value).
1:01 PM Feb 26th
 
doncoffin
So what happens if we use the BBRef version of WAR instead? (For example, Spud Chandler's BBRef WAR is 2.8 to Harry Holmes' 1.6...)

(I could go through all the examples listed, and I might, but not right now...)
12:50 PM Feb 26th
 
tangotiger
WAR is a framework.

Fangraphs has its own implementation of this framework. As does Baseball Reference.

At its core, WAR for pitchers is this:
WAR = IP/9 * (1.2*League - Player) / 10 * LI

That "1.2" is different for a starting pitcher (1.28) than for a reliever (1.05). You can come up with your own.

The "League" would be adjusted based on park. And "Player" is whatever run-scaled number you want, be it FIP, ERA, RA9, Game Score, Season Score, etc. Come up with your own.

The "10" is the conversion of runs to wins. Again, adjust as to run environment as you think.

The "LI" is Leverage Index, which is 1 for starting pitchers, and 0.7 to 1.5 for relievers (though you can make it whatever you want from 0.3 to 2.5 for relievers).

So, the FRAMEWORK is there, and it's pretty solid. The question is all the particulars inside it, and it allows each person to create his own personal implementation.

And if you can create an implementation that has Jack Morris ahead of Rick Reuschel, congratulations... you'll be the first to do so.
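For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the framework as written above. The function and parameter names are made up for illustration, and the two sample pitchers are hypothetical. The defaults (1.28 for a starter, 10 runs per win, leverage of 1) are the values quoted in this comment.

```python
def pitcher_war(ip, league_rate, player_rate,
                replacement_mult=1.28, runs_per_win=10.0, leverage=1.0):
    """WAR = IP/9 * (mult * League - Player) / runs_per_win * LI.

    league_rate: park-adjusted league runs per 9 innings.
    player_rate: any run-scaled number per 9 innings (FIP, ERA, RA9, ...).
    replacement_mult: 1.28 for starters, 1.05 for relievers, or your own.
    leverage: 1 for starters, roughly 0.7 to 1.5 for relievers.
    """
    runs_above_replacement_per_9 = replacement_mult * league_rate - player_rate
    return ip / 9 * runs_above_replacement_per_9 / runs_per_win * leverage


# Hypothetical starter: 200 IP at 3.50 in a 4.00 league.
starter = pitcher_war(200, 4.00, 3.50)

# Hypothetical reliever: 70 IP at 3.00 in a 4.00 league, leverage index 1.5.
reliever = pitcher_war(70, 4.00, 3.00, replacement_mult=1.05, leverage=1.5)

print(round(starter, 2), round(reliever, 2))
```

Swapping what goes into `player_rate` is the whole argument in miniature: feed it FIP and you get something like the Fangraphs flavor, feed it RA9 and you get something closer to the Baseball Reference flavor.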



12:29 PM Feb 26th
 
Pale Hose
Does this study mean that you now question if WAR is the best "True Value" indicator, and that you will be trying to develop a better one?
11:57 AM Feb 26th
 
rgregory1956

Just curious, Bill. How much of an era bias is there in this study? If I were going to duplicate your study, say by taking pitchers with 249.0 to 251.0 Innings Pitched, I'm pretty sure I'd leave out pitchers prior to 1920, and I might even leave out pitchers prior to 1946.

10:58 AM Feb 26th
 
CharlesSaeger
Dare I ask when the inevitable follow-up study happens that looks at the relevant questions:

* What happens if we use b-rWAR instead of fgWAR?
* What happens if we use park-adjusted ERA+?
* What happens if we use Win Shares and Loss Shares instead of WAR?
* What happens if we adjust W/9, K/9 and HR/9 for league average and use those?
10:47 AM Feb 26th
 
jwilt
I think many or most of these cases Bill points out boil down to low BABIPs. ERA and W/L credit or debit a pitcher's account for the actions of his defense. In the Kevin Gross vs Doyle Alexander example Gross had a .280-something BABIP, meaning that his defense didn't have a whole lot of outsized impact on his perceived value. But Alexander had a .239 BABIP despite a microscopic strikeout rate, which I would take to mean his defenders were pretty spectacular. FIP and fWAR don't care at all about BABIP, so they don't know any of this, and don't credit Alexander for it.
10:46 AM Feb 26th
 
OldBackstop
I would like to point out that my answer in Part I nailed this list in exact order.
10:38 AM Feb 26th
 
tangotiger
With regards to Heredia (1999) v Finley (1989):

Heredia had a 4.02 FIP and 4.81 ERA. Relative to league-park, which Fangraphs shows as "FIP-" and "ERA-", Heredia is 86 and 101, respectively.

Finley was 3.31 FIP, 2.57 ERA. His FIP- and ERA-: 90 and 68, respectively.

Their FIP relative to their environment shows Heredia did better, and hence, that's how his fWAR is better.

***

No active pitcher better represents this ERA v FIP battle than Ricky Nolasco, who has a career 4.37 ERA and 3.76 FIP. His ERA- is 108 (meaning he gives up runs at 8% higher than league average), while his FIP- is 92 (meaning that if we exclude balls in play, his component numbers suggest he should give up runs at 8% lower than league average).
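A small Python sketch of how to read these "minus" numbers. It assumes a simplified definition, index = 100 * player rate / league-park baseline; the actual Fangraphs calculation applies park factors in its own way, so the backed-out baselines below are approximations only. The function names are made up for illustration.

```python
def pct_vs_average(minus_index):
    """Percent above (+) or below (-) league average implied by an ERA- or FIP-."""
    return minus_index - 100


def implied_baseline(rate, minus_index):
    """League-park baseline implied by a rate and its index, under the simplified definition."""
    return rate / (minus_index / 100)


# Nolasco's career marks quoted above.
nolasco_era_gap = pct_vs_average(108)   # runs allowed 8% worse than average
nolasco_fip_gap = pct_vs_average(92)    # components 8% better than average

# Backing out the FIP environments for Heredia (1999) and Finley (1989).
heredia_env = implied_baseline(4.02, 86)   # about 4.67
finley_env = implied_baseline(3.31, 90)    # about 3.68
print(nolasco_era_gap, nolasco_fip_gap, round(heredia_env, 2), round(finley_env, 2))
```

The backed-out baselines show why Heredia's 4.02 FIP grades out better than Finley's 3.31: his run environment was roughly a full run higher.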

His signing this year was fun to watch.

9:51 AM Feb 26th
 
tangotiger
Excellent.

Just to reiterate to those who didn't read the comments in Part I (and obviously, Bill wrote Part II before reading those comments): WAR at Fangraphs relies on FIP at its core, and FIP is a combination of K, BB, HR. So, it was a given that the K, BB metrics that Bill chose would win. Just a matter of whether K minus BB or K/BB would win.

But the real fun was in seeing those metrics that do not explicitly use K, BB, HR, and how they would do. In essence, the race was between ERA, Relative ERA, and W/L.

And Bill does make a fascinating point regarding the unadjusted ERA, and why a pitcher's W/L record enjoys the tremendous benefit of being "self-centering".

9:34 AM Feb 26th
 
 