Is Phil Hughes the Greatest Pitcher of All Time?

July 31, 2013

No.

 

Why Do You Ask?

 

                I had the thought that maybe we could rate all starting pitchers against one another in the same way we rank college football teams or college basketball teams.   Michigan did not play Robert Morris in basketball in 2012-2013, but Michigan played Binghamton and Binghamton played Monmouth and Monmouth played Robert Morris.   In this way, and considering the entire universe of games in one continuous loop, one can compare all college basketball teams to one another, because their schedules all interlock.    We use those comparisons to rate college basketball teams.

                In the same way, all starting pitchers interlock.   Johnny Antonelli did not pitch against Clayton Kershaw, but Johnny Antonelli started three games against Don Drysdale,   Drysdale started against Steve Carlton, Carlton started against Orel Hershiser (August 26, 1984), and Hershiser started several games against Livan Hernandez.   Hernandez started against Lincecum, Cole Hamels and many other pitchers who started against Kershaw.    We can compare Antonelli to Kershaw, then, by considering the entire universe of starts in one continuous loop, as we do in basketball.

                I didn’t actually believe this was going to work, you understand; I just thought it would be kind of fun to play around with.    I assigned every pitcher an initial rank of 15.0000. . .fifteen selected because, at fifteen, no pitcher in any game will score at less than zero, and sub-zero rankings are a nuisance.    Let us take, for illustration, the game of September 23, 2012, when Clayton Kershaw started against Homer Bailey.

                The Dodgers won the game, 5 to 3, which means that Kershaw was two runs better than Bailey.   (It doesn’t exactly mean that, of course, but play along with me.)    In the first cycle of the ranking process, Kershaw and Bailey are both ranked at 15.00.    There are 30 points between them going into the game, and if they have 30 points going in, they have to have 30 points coming out.    If they have 30 points between them and Kershaw was two runs better than Bailey, that puts Kershaw at 16.00 and Bailey at 14.00.

                Except that it’s a home game for Bailey, so he has an advantage that has to be filtered out.   The home team’s advantage (in runs scored) is 0.14 runs per game.   (It is that small only because the home team often doesn’t bat in the bottom of the ninth, but that isn’t relevant right now.)    When we factor that in, then, Kershaw was actually 2.14 runs better than Bailey.    That makes Kershaw’s "Game Output Score" for that contest 16.07, and Bailey’s 13.93.    Kershaw was 2.14 runs better than Bailey. 
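The arithmetic so far can be sketched in a few lines of Python. This is only an illustration: the function name `game_output_scores` and its parameters are made up here, and the even split of the margin around the two pitchers' combined points is inferred from the 16.00/14.00 and 16.07/13.93 figures above.

```python
HOME_ADVANTAGE = 0.14  # runs per game, the figure given in the text


def game_output_scores(visitor_rating, home_rating, visitor_runs, home_runs,
                       home_advantage=HOME_ADVANTAGE):
    """Return (visitor_score, home_score) for one game.

    The two scores sum to the two ratings going in, and they differ by
    the run margin after the home advantage has been filtered out.
    """
    total = visitor_rating + home_rating
    # A positive margin means the visiting starter "won" the matchup.
    margin = (visitor_runs - home_runs) + home_advantage
    visitor_score = (total + margin) / 2
    home_score = (total - margin) / 2
    return visitor_score, home_score


# September 23, 2012: Kershaw (visitor) 5, Bailey (home) 3, both rated 15.00.
kershaw, bailey = game_output_scores(15.0, 15.0, 5, 3)
print(round(kershaw, 2), round(bailey, 2))  # 16.07 13.93
```

Passing `home_advantage=0.0` gives back the unadjusted 16.00 and 14.00.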

                Actually, Kershaw’s team was 2.14 runs better than Bailey’s team, as we all know, but. . .these are the rules of the game.    We figure these "Game Output Scores" for every game in the data, 118,000+ games.   Each pitcher’s rank, then, is the average of his Game Output Scores.

                Except that it isn’t, because I know immediately that that isn’t going to work.    I know immediately that what is going to happen if we do that is that somewhere in the data there is going to be a pitcher who made only one major league start, and who was matched up in that one start against Bob Gibson or Sandy Koufax or Randy Johnson or somebody really good, and who happened to "win" that one game by a score of 17 to 6 or 22 to 8 or something, so their one and only Game Output Score will be "ten-plus runs better than Randy Johnson."   That’s not the answer I want, so to prevent that from happening, I assume that every pitcher has a minimum of 100 starts.     If the pitcher actually has 100 starts, well and good.   But if he doesn’t have 100 career starts, then we fill in the missing games with an average score of 13.00.     Let’s assume that that pitcher is Mike Crudale; Mike Crudale made only one major league start, in 2002, and the Cardinals won that start, 5 to 0, beating Bruce Chen, who isn’t Randy Johnson but who is still around and still pitching very well now, eleven years later.    That makes Crudale five runs a game better than Bruce Chen, which would make him awfully good.

                What we do, then, is to take Crudale’s Game Output Score from that game. . .let us say that it is 17.50. . .and we mix that with 99 "games" in which his Game Output Score is 13.00.   I am assuming 13.00 is a replacement-level pitcher.    That makes Crudale’s ranking, based on his one real game and the 99 replacement-level games, 13.045.
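The padding step is just a weighted average. A minimal sketch, in which the function name `padded_rating` and its parameter names are invented for illustration:

```python
def padded_rating(game_scores, min_starts=100, replacement_level=13.00):
    """Average the Game Output Scores, filling in phantom starts at the
    replacement level until the pitcher has `min_starts` games."""
    phantom = max(0, min_starts - len(game_scores))
    total = sum(game_scores) + phantom * replacement_level
    return total / (len(game_scores) + phantom)


# Crudale: one real start scored at 17.50, plus 99 phantom starts at 13.00.
print(round(padded_rating([17.50]), 3))  # 13.045
```

A pitcher with 100 or more real starts gets no phantom games, so his rating is the plain average of his scores.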

                Why 13?    Let’s say that a team scores 4.50 runs in an average game, which they do, more or less.    If you’re two runs worse than that, that’s 6.50 runs.    If you score 4.50 runs per game and give up 6.50 runs per game, you have a winning percentage of .324.     That’s about a replacement level, so. . .two runs a game below average is replacement level.  
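The article does not say how the .324 was computed. One formula that reproduces it is the Pythagorean expectation with an exponent of 2, so the sketch below assumes that is what was used; the function name is illustrative.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed.
    The exponent of 2 is an assumption, not stated in the article."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


print(f"{pythagorean_pct(4.50, 6.50):.3f}")  # 0.324
```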

                When we figure these rankings for every pitcher, the best pitcher of all time, after the first iteration of the data, is. . ..Phil Hughes?    It is.   The top ten pitchers, at the end of the first cycle through the data, are as follows:

First     Last          Rank
Phil      Hughes        15.99
Russ      Meyer         15.94
Whitey    Ford          15.89
Adam      Wainwright    15.89
Don       Gullett       15.88
David     Price         15.85
Pedro     Martinez      15.83
Tim       Hudson        15.81
Bob       Lemon         15.80
Jim       Palmer        15.80

 

                I don’t know why Phil Hughes ranks first here.    His career winning percentage is very good, and I would presume that the Yankees must have scored a lot of runs for him, giving him some lop-sided wins, and that they must have won most of the games in which he has had no decision, so that his average margin of victory must be larger than any other pitcher’s.    Russ Meyer was a Phil-Hughes type pitcher with the Dodgers in the Duke Snider era.

                Apart from Hughes and Meyer in the 1-2 spots, it’s not really a crazy list; we have four or five Hall of Famers in the top ten spots, and most of the other Hall of Fame pitchers are somewhere in the top 100.    This, however, is merely the data after the first iteration of the process.    Going back now to the Kershaw against Bailey game, after the first cycle of the data Clayton Kershaw (gesundheit) has a rating of 15.59, and Homer Bailey a rating of 15.28.    When we plug those numbers back into that game, then, Kershaw has a Game Output Score (for that game) of 16.55, and Bailey of 14.48, whereas the scores in the first cycle were 16.07 and 13.93.    The system now knows that that was a matchup of two good pitchers, so it adjusts the Game Output Scores accordingly.

                Also, I probably should explain. . .the addition to the system of all of those "phantom 13s" drops the average rankings below 15.00.   If we didn’t correct for this, then as we repeated the process through cycle after cycle, 13.00 would gradually replace 15.00 as the average ranking, so that we would be assuming that all of the "missing" starts were of average quality, rather than that they were of replacement-level quality.  We have to adjust the rankings after every cycle of the study to re-center them at 15.00.    After the second round of the study, the top ten pitchers of all time are as follows:

First     Last          Rank
Phil      Hughes        16.35
Whitey    Ford          16.29
Don       Gullett       16.27
Adam      Wainwright    16.21
Pedro     Martinez      16.21
David     Price         16.20
Tim       Hudson        16.15
Juan      Marichal      16.15
Jim       Palmer        16.13
Russ      Meyer         16.13

               

                These are the same pitchers as before, except that Juan Marichal has replaced Bob Lemon on the list and Russ Meyer has fallen from second to 10th.
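Put together, one cycle of the process (score every game from the current ratings, average with the phantom starts, then re-center) might look like the sketch below. Several things here are assumptions rather than the article's stated method: the function names, the tuple layout of `games`, and in particular the re-centering, which is done as a flat shift of the unweighted mean back to 15.00. The three pitchers "A", "B" and "C" are toy data, not real results.

```python
from collections import defaultdict


def run_cycle(games, ratings, min_starts=100, replacement=13.00,
              home_advantage=0.14, center=15.00):
    """One pass through the data.

    games:   list of (visitor, home, visitor_runs, home_runs) tuples
    ratings: dict of pitcher -> rating from the previous cycle
    Returns a new dict of ratings whose mean is `center`.
    """
    scores = defaultdict(list)
    for visitor, home, v_runs, h_runs in games:
        total = ratings[visitor] + ratings[home]
        margin = (v_runs - h_runs) + home_advantage
        scores[visitor].append((total + margin) / 2)
        scores[home].append((total - margin) / 2)

    new = {}
    for pitcher, s in scores.items():
        phantom = max(0, min_starts - len(s))
        new[pitcher] = (sum(s) + phantom * replacement) / (len(s) + phantom)

    # Re-center (assumption: shift every rating by the same constant).
    shift = center - sum(new.values()) / len(new)
    return {p: r + shift for p, r in new.items()}


def rate(games, cycles=26):
    """Start everyone at 15.00 and repeat the cycle `cycles` times."""
    pitchers = {p for g in games for p in g[:2]}
    ratings = {p: 15.00 for p in pitchers}
    for _ in range(cycles):
        ratings = run_cycle(games, ratings)
    return ratings


# Toy schedule: A beats B, B beats C, A beats C.
toy_games = [("A", "B", 5, 3), ("B", "C", 4, 1), ("C", "A", 2, 6)]
toy = rate(toy_games)
print(sorted(toy, key=toy.get, reverse=True))  # ['A', 'B', 'C']
```

Because every pitcher in the toy schedule has only two real starts, the 98 phantom games dominate and the ratings stay very close to one another, which is the same compression the real lists show.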

                After the third round of the study the top ten pitchers are these:

First     Last          Rank
Phil      Hughes        16.47
Whitey    Ford          16.47
Don       Gullett       16.46
Pedro     Martinez      16.39
David     Price         16.36
Adam      Wainwright    16.33
Juan      Marichal      16.33
Tim       Hudson        16.31
Jim       Palmer        16.28
Andy      Pettitte      16.22

 

                And after the fourth round, these:

 

First     Last          Rank
Whitey    Ford          16.55
Don       Gullett       16.55
Phil      Hughes        16.52
Pedro     Martinez      16.47
David     Price         16.43
Juan      Marichal      16.42
Tim       Hudson        16.38
Adam      Wainwright    16.37
Jim       Palmer        16.36
Andy      Pettitte      16.28

 

                One can see that the rankings are gradually moving in the direction of more reasonable results, but they are certainly taking their own sweet time about it.    I followed the system through 26 cycles of the process.   After 26 cycles, the top ten pitchers were these:

First     Last          Rank
Don       Gullett       16.77
Juan      Marichal      16.62
Whitey    Ford          16.59
Jim       Palmer        16.52
Pedro     Martinez      16.49
Phil      Hughes        16.44
Ron       Guidry        16.43
David     Price         16.40
Tim       Hudson        16.36
Sandy     Koufax        16.33

 

                It’s not a terrible list.   You take any five pitchers from that list and make them your starting rotation, you’re in good shape.   That’s an understatement; you take any five pitchers from that list and make them your starting rotation, and you win the pennant.

                Still. . .Don Gullett is not the greatest pitcher of all time.   I understand why he ranks there, sort of.    His career winning percentage was .686, and, since he pitched for the Big Red Machine, he won a lot of games by comfortable margins.   If you follow the process long enough the scores will entirely stop moving, and, after 26 cycles, we have reached the point at which the scores have, for the most part, stopped moving.   We are stuck with Phil Hughes as the sixth-greatest pitcher of all time, which is really not an answer that I can live with.

                I made some adjustments to the process and started over.    First, I changed the "100 games minimum" to "200 games minimum", which was done, of course, to get rid of Phil Hughes and David Price from the top ten list.    Second, I changed the replacement level from 13.00 to 13.25.    It was just an instinct that 13.00 was too low, unrealistically low.

                Also, to mitigate the effects of unequal offensive support for different pitchers, I replaced the team’s actual runs scored in the game by a number which was:

                80% of their actual runs scored in the game, plus

                .90. 

                Kason Gabbard once won a game by the score of 30 to 3, which makes him 27 runs better than the opposition starting pitcher in that game.    This adjustment treats the score of that game as if it was 24.9 to 3 from Gabbard’s standpoint and 30 to 3.3 from the standpoint of the opposition pitcher (Daniel Cabrera).

                This process, without this adjustment, treats runs scored for the pitcher the same as runs allowed by the pitcher; in other words, winning a game 4 to 0 is the same as winning a game 8 to 4.     This adjustment just makes the opposition score a little bit more significant than the offensive support, which of course it should be.    Winning a game 4 to 0 becomes a victory margin of 4.1; winning a game 8 to 4 becomes a victory margin of 3.3.    I thought maybe this would help us with the Don Gullett/Phil Hughes problem of pitchers being overrated because their teams scored a lot of runs for them.
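The run-support adjustment is a one-line formula, and the examples above can be checked directly. The function names below are illustrative only.

```python
def adjusted_support(runs_scored):
    """Damp a team's own scoring: 80% of its actual runs, plus 0.90."""
    return 0.80 * runs_scored + 0.90


def adjusted_margin(own_runs, opp_runs):
    """Margin from one starter's standpoint: his team's damped runs
    minus the runs actually scored against him."""
    return adjusted_support(own_runs) - opp_runs


print(round(adjusted_support(30), 1))    # 24.9  (Gabbard's 30 runs)
print(round(adjusted_support(3), 1))     # 3.3   (Cabrera's 3 runs)
print(round(adjusted_margin(4, 0), 1))   # 4.1
print(round(adjusted_margin(8, 4), 1))   # 3.3
```

The formula leaves a 4.5-run game unchanged (0.80 × 4.5 + 0.90 = 4.5), so it pulls every team's support toward the 4.50 runs per game used earlier as the average. Note also that the two starters no longer see mirror-image margins: in the Gabbard game one side is +21.9 and the other is -26.7.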

Then I re-started the process, with every pitcher starting back at 15.00.    After following that process through some unnecessarily large number of cycles, I had the following list of the top 25 pitchers of the last 60 years:

Rank  First       Last          Score
1     Juan        Marichal      16.87
2     Jim         Palmer        16.80
3     Gary        Nolan         16.67
4     Steve       Blass         16.65
5     Pedro       Martinez      16.64
6     Whitey      Ford          16.63
7     Ron         Guidry        16.61
8     John        Candelaria    16.60
9     Sandy       Koufax        16.59
10    Bob         Gibson        16.54
11    Jim         Maloney       16.53
12    Dave        McNally       16.53
13    Tom         Seaver        16.52
14    Don         Gullett       16.51
15    Steve       Carlton       16.49
16    Ferguson    Jenkins       16.48
17    Jimmy       Key           16.47
18    Nelson      Briles        16.44
19    Don         Sutton        16.38
20    Tim         Hudson        16.38
21    Larry       Dierker       16.37
22    Luis        Tiant         16.37
23    Jack        Billingham    16.37
24    Mike        Cuellar       16.36
25    Tony        Cloninger     16.36

 

                At which point I decided to call it a day.   Steve Blass and Gary Nolan have replaced Phil Hughes and David Price as the problem pitchers among the top 10, but the process still has some obvious flaws.

                The process also has some good points, and let me highlight those briefly.   It automatically removes park effects, since the two pitchers working against one another are always working in the same park; thus, there is no entry point for park effects.    It automatically removes the up-and-down swings of offensive levels over the years, since the two pitchers matched against one another are always working in the same year and on the same day.    Actually, the system is very slightly biased against a good pitcher working in a low-run park and a low-run era, because the margins of victory will be smaller in a low-run park, but this is a very small bias, and not a real problem.  

                But the system treats the performance of the team as representative of the performance of the individual pitcher, thus giving Phil Hughes the benefit of Mariano Rivera’s good innings as well as the offensive output of Mark Teixeira, A-Rod, Granderson, et al.    That’s a problem.   There is a more subtle problem, which is the fraying of the edges along the time line.   The system "replaces" games that the starting pitcher has not pitched, up to the level of 200 starts per pitcher.   In the first and last years of the study (1952 and 2012, and the surrounding years) the percentage of pitchers who have fewer than 200 starts is higher, since we are looking at partial-career data for those pitchers.   Since there are more pitchers who have fewer than 200 starts, there are more "replacement level" games added into the system.   This tends to push down the average ratings for pitchers pitching in the first and last years of the study, which tends to favor the pitchers pitching in the middle years. . . .in this case, the pitchers from the 1970s and 1980s.   I don’t know of any satisfactory way to solve that problem.

                The system is looking for the average level of a pitcher’s performance, and thus treats a pitcher pitching well in 201 games the same as a pitcher pitching well in 700 games.    This, in part, explains the strong ratings for Gary Nolan, Steve Blass and Tony Cloninger; they made 200 starts, but not a lot more.

                I don’t know of any really satisfactory way to solve that problem, either.   I think what you would have to do is use a "sliding scale" replacement level, in which we replaced missing starts below 100 at a level of 13.25, below 200 at 14, below 300 at 14.5, and below 500 at 15.0, or something like that. . .in other words, we won’t assume that the games Steve Blass didn’t pitch are at replacement level, but we will assume that they are only average.    But I’d want to think through the logic of that somewhat better before I did it.
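One possible reading of that sliding scale, sketched in Python. The band boundaries and levels are the ones floated above; everything else (the names, padding every pitcher out to 500 starts, treating each band as covering start numbers up to its upper bound) is an assumption made for illustration, not a worked-out method.

```python
# (upper bound on start number, level used for phantom starts in that band)
SLIDING_SCALE = [(100, 13.25), (200, 14.0), (300, 14.5), (500, 15.0)]


def sliding_scale_rating(game_scores, scale=SLIDING_SCALE):
    """Average real Game Output Scores with phantom starts whose level
    depends on which band of the career the missing start falls in."""
    n = len(game_scores)
    total = sum(game_scores)
    count = n
    lower = 0
    for upper, level in scale:
        # Phantom starts needed in the band (lower, upper].
        missing = max(0, upper - max(n, lower))
        total += missing * level
        count += missing
        lower = upper
    return total / count


# A hypothetical pitcher with 250 real starts, all scored at 16.00:
# 50 phantom starts at 14.5 and 200 at 15.0 are mixed in.
print(round(sliding_scale_rating([16.0] * 250), 2))  # 15.45
```

Under the flat 200-start rule that same pitcher would keep his full 16.00, which is the Gary Nolan and Steve Blass effect described above.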

                I thought that there might be a "time line bias" in the study, caused by the fact that many pitchers’ best years come when they are young, and their weaker years when they are older.   Juan Marichal pitched for 18 years in the majors.   He was 144-68 in the first nine years, and 99-74 in the second nine years.    Tom Seaver pitched for 20 years; he was 182-107 in the first ten years and 129-98 in the second ten.

                That’s pretty common, and I was concerned that this could cause pitchers in the second half of the study to rate higher than pitchers in the first half, since, when an old pitcher pitches against a young pitcher of the same career quality, the younger pitcher will tend to win and thus tend to rate higher.    That could be a problem, although I didn’t really see any evidence that it was. 

                A lot of what happens in the study, I really don’t understand.    At the time I called off the dogs, Greg Maddux was in 90th place in the study, between Bob Forsch and Jim Barr.   I really don’t understand why; the system doesn’t like Greg Maddux, but I don’t know why.    Jim Bunning ranked below Maddux.   Jim Rooker ranked ahead of Dave Stewart, Reggie Cleveland ranked ahead of Nolan Ryan, and Larry McWilliams ranked ahead of Camilo Pascual.    I really don’t understand any of these rankings or why the system doesn’t work in those cases, but I’ve put as much time into it as I have available.   And more.

                I think if I were to attempt this study again, what I might do is base the comparison not on the score of the game, but on the Game Scores of the two starting pitchers.     If one starting pitcher in a game has a Game Score of 50 and the other of 40, the team with the "50" Game Score is going to win. . .well, something more than 80% of the time, probably more than 90%.    Let me see. . ..

                Just checked.   In my data there are 1,260 games in which one starting pitcher has a Game Score of exactly 50, and the other starting pitcher has a Game Score of 39, 40 or 41.    The team with the pitcher who has a Game Score of 50 goes 1053-207 in those games, an .836 winning percentage.

                Anyway, if you based the system on Game Scores, rather than the scores of the games, it would still have the virtues of removing park effects and other external run influences from the study, but would also remove some of the other extraneous influences, such as the performance of the bullpens.  I think that might work better.   But I’m done with it for now.
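For reference, a sketch of what a Game Score version might look like. The `game_score` function follows the original Game Score formula as it is commonly published (it is not spelled out in this article). The conversion from a Game Score gap to rating points is not specified anywhere above, so `points_per_gs` is a purely hypothetical parameter, and the 0.1 used in the example is an arbitrary placeholder. No home-field adjustment is applied, since none is described for this variant.

```python
def game_score(outs, strikeouts, hits, earned_runs, unearned_runs, walks):
    """Original Game Score: start at 50, add for outs, completed innings
    after the fourth, and strikeouts; subtract for hits, runs and walks."""
    innings_completed = outs // 3
    score = 50
    score += outs
    score += 2 * max(0, innings_completed - 4)
    score += strikeouts
    score -= 2 * hits
    score -= 4 * earned_runs
    score -= 2 * unearned_runs
    score -= walks
    return score


def output_scores_from_game_scores(rating_a, rating_b, gs_a, gs_b,
                                   points_per_gs):
    """Split the two pitchers' combined rating points according to the
    gap in their Game Scores (points_per_gs is a hypothetical scale)."""
    total = rating_a + rating_b
    margin = (gs_a - gs_b) * points_per_gs
    return (total + margin) / 2, (total - margin) / 2


# A nine-inning, three-hit shutout with 10 strikeouts and one walk.
print(game_score(27, 10, 3, 0, 0, 1))  # 90
# A 50-versus-40 matchup between two 15.00 pitchers, placeholder scale.
print(output_scores_from_game_scores(15.0, 15.0, 50, 40, points_per_gs=0.1))
```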

 
 

COMMENTS (17 Comments, most recent shown first)

David Kowalski
It's kind of interesting who doesn't make the list. Go over the top win-loss percentages of all time and pick out players from the study period with high winning percentages. Among them:

Roy Halladay .659
Roger Clemens .655
Ron Guidry .651
CC Sabathia .641
Johan Santana .641
Mike Mussina .641
Dwight Gooden .634

All of the pitchers in the group had at least 200 starts, so the use of replacement games is not an issue.
3:16 PM Oct 18th
 
MarisFan61
Sunday's game didn't much help my hypothesis....
11:36 PM Aug 4th
 
bjames
"If you end up with the same or similar result as Bill's original list, it suggests that the whole meshing that Bill is doing is not adding value."


That's not true at all. It simply means that THE FIRST ROUND is not adding value--because, of course, the first round uses the initial assumption that the opposition quality is a constant. The value comes from repeating the process... .which we knew anyway.
1:52 AM Aug 3rd
 
evanecurb
So, I'm looking through the list, seeing guys who generally were good pitchers on good offensive teams with winning records, and I later read your comment about the small bias against pitchers in a low run era in a low scoring ballpark, and then I see Larry Dierker in the Top 25.

What gives, Beav?

Actually, Dierker was very good for a short period of time, and even though he pitched in a low run environment, his offensive support within that context may have been pretty good: Jim Wynn, Rusty Staub, and Joe Morgan were really, really good.

Hypothetical question: If Joe Morgan had played his entire career in Houston, would he have even come close to the Hall of Fame?

Next question: Who is the real life Joe Morgan, i.e. the guy who had HOF performance in a context that hid that performance? May have been the aforementioned Jim Wynn
11:15 PM Aug 2nd
 
tangotiger
Trail: excellent!

So, it seems that we get substantially the same answer if we use the most basic approach possible. Which really suggests that given that number of starts, the quality of opposition is not going to be biased.

The bias happens of course with your own team offense (for the most part) and your own team bullpen (to some degree). The opponent kinda "evens out".

Occam's Razor.
11:15 AM Aug 2nd
 
Trailbzr
What Tango suggested, using Retrosheet IDs for format consistency:

1.58 hughp001 1.39 fordw101 1.39 meyer101 1.38 waina001 1.37 gulld101
1.29 pricd001 1.25 martp001 1.21 hudst001 1.21 lemob101 1.19 palmj001

So the extra steps brought the pack together, with very minor shuffling of order.
10:55 AM Aug 2nd
 
tangotiger
I'd like to see a simple list like Trailblazer is suggesting: team run differential in games started by the pitcher.

If you end up with the same or similar result as Bill's original list, it suggests that the whole meshing that Bill is doing is not adding value.

If you don't get the same list, then we need to see if there's a bias, or whether we are learning something new.
10:33 AM Aug 2nd
 
Trailbzr
Phil Hughes has, by a comfortable margin, the greatest offensive support during the study period (min 100 starts):

6.16 Phil Hughes
5.72 Russ Meyer
5.60 Jaret Wright
5.57 Brian Bohanon
5.56 Rob Bell
9:02 AM Aug 2nd
 
MarisFan61
When a study like this comes up with a result like this about someone like Phil Hughes, it always makes me wonder if the guy is at least better than he has seemed -- not the greatest ever or even really in the top 10 or 20, but just better than we thought. The first guy I ever wondered this about was Jose Cruz. I think most of us now think he wasn't really quite as good as some metrics made him look, but we do feel he was much better than his reputation. I'm guessing that this does mean Hughes has been significantly better than we've thought. While it does indeed seem impressionistically to me that he's been the beneficiary of a lot of high Yankee scoring (often taking him off the hook from a loss), it seems to me that if you had to go through such multiple conniptions to get him out of the top few, there's a high chance that he has pitched disproportionately in tough circumstances: up against good teams or aces or hitter-friendly umpires or hot or humid weather. All of which of course could be researched if someone wanted to do it. I'll settle for tentatively assuming what I said, and now following him with increased interest.
12:58 AM Aug 2nd
 
oldehippy
I admire your patience Bill! Any study I did that said to me that Phil Hughes is better than Whitey and Louisiana Lightning would have ended up on my scrap heap as soon as I was done laughing. But that's why I'm me and you're Bill James!
2:36 PM Aug 1st
 
jdurkee
First, I know people with too much time on their hands when their baseball team is in first place.

Second, the real problem with this study is that we are using a metric to rank pitchers which we don't believe is one which isolates the performance of pitchers vs other effects. We need to pick a pitcher-centric metric. Everyone has their own choice here, so pick yours and do the study.
12:42 PM Aug 1st
 
Trailbzr
"Going back now to the Kershaw against Bailey game, after the first cycle of the data Clayton Kershaw (gesundheit) has a rating of 15.59, and Homer Bailey a rating of 15.28. When we plug those numbers back into that game, then, Kershaw has a Game Output Score (for that game) of 16.55, and Bailey of 14.48, whereas the scores in the first cycle were 16.07 and 13.93. "

Looks like there's a small math problem here; the difference in GO Scores has dropped from 2.14 to 2.07. Also, the two pitchers brought 30.87 points into the game, and earned 31.03.
7:12 PM Jul 31st
 
chuck
I do hope you return to this study sometime trying game scores instead; it was the thought that occurred to me as I saw pitchers I know to have had good offensive support appear in the rankings.
Some of the pitchers on the list, like Palmer, McNally, Cuellar and Ford, had not only good run support but top-ranked defenses behind them, which is a smaller factor than run support perhaps, but may be more difficult to remove. It's also something game scores would encounter.
4:45 PM Jul 31st
 
Hal10000
Maybe I'm being a bit ignorant but this strikes me as a variation on the Elo system applied to starting pitchers. Baseball Reference is running this, only using user ratings as opposed to actual scores and I suspect your effort is running into similar limitations.

Going by game score might give you a bit more dynamic range and better remove the impact of teammates. But I think it might also increase the problems by further disentangling the pitcher's performances from each other.
3:44 PM Jul 31st
 
Brian
Not that you are going back to this anytime soon, but 2 modifications I can think of are as follows:

1) Rate the pitcher by the score when he leaves the game and all runs are charged to him. Then extrapolate the score to 9 innings. If he leaves after 6 up 6 to 3, the final score becomes 9 to 4.5. The thinking is that this takes your bullpen out of the equation.

2) Average the offensive number with 4.5 per 9. The theory is in the battle of pitchers, half of what happened offensively is your teammates and half is the other pitcher. In the above example, the score becomes 6.75 to 4.5

I also like Tom Tango's suggestions
12:12 PM Jul 31st
 
3for3
How about capping the 'wins' at 4? Anything beyond 4 is usually just extra offense.
12:08 PM Jul 31st
 
tangotiger
To handle the uneven playing time, you give EVERY pitcher the same number of games. Say, every pitcher gets 100 starts at 13.25, plus whatever else he actually has. This is how Regression Toward The Mean works.

If you stick with team run differential, you can cap off the differential to something like five runs (5 runs = 0.5 wins better than the other guy, and if the other guy is a .500 pitcher...)

Anyway, nice exercise!
11:43 AM Jul 31st
 
 