No.
Why Do You Ask?
I had the thought that maybe we could rate all starting pitchers against one another in the same way we rank college football teams or college basketball teams. Michigan did not play Robert Morris in basketball in 2012-2013, but Michigan played Binghamton and Binghamton played Monmouth and Monmouth played Robert Morris. In this way, and considering the entire universe of games in one continuous loop, one can compare all college basketball teams to one another, because their schedules all interlock. We use those comparisons to rate college basketball teams.
In the same way, all starting pitchers interlock. Johnny Antonelli did not pitch against Clayton Kershaw, but Johnny Antonelli started three games against Don Drysdale, Drysdale started against Steve Carlton, Carlton started against Orel Hershiser (August 26, 1984), and Hershiser started several games against Livan Hernandez. Hernandez started against Lincecum, Cole Hamels and many other pitchers who started against Kershaw. We can compare Antonelli to Kershaw, then, by considering the entire universe of starts in one continuous loop, as we do in basketball.
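The "interlocking" idea is, in effect, a statement that the graph of pitcher-against-pitcher matchups is connected. As a minimal sketch, using only the chain of matchups named above (the function names `build_graph` and `find_chain` are made up for this illustration), a breadth-first search recovers the path from Antonelli to Kershaw:

```python
from collections import deque

# Matchups named in the text; each pair means the two pitchers
# started against each other at least once.
MATCHUPS = [
    ("Antonelli", "Drysdale"),
    ("Drysdale", "Carlton"),
    ("Carlton", "Hershiser"),
    ("Hershiser", "Hernandez"),
    ("Hernandez", "Lincecum"),
    ("Lincecum", "Kershaw"),
]


def build_graph(matchups):
    """Build an undirected adjacency map from (pitcher, pitcher) pairs."""
    graph = {}
    for a, b in matchups:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_chain(graph, start, goal):
    """Breadth-first search; returns the shortest chain of opponents, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


GRAPH = build_graph(MATCHUPS)
chain = find_chain(GRAPH, "Antonelli", "Kershaw")
print(" -> ".join(chain))
```

With the full data set the graph would have thousands of pitchers and many alternative paths; the point of the sketch is only that one connected component is what makes a common rating scale possible.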
I didn’t actually believe this was going to work, you understand; I just thought it would be kind of fun to play around with. I assigned every pitcher an initial rank of 15.0000. . .fifteen selected because, at fifteen, no pitcher in any game will score at less than zero, and sub-zero rankings are a nuisance. Let us take, for illustration, the game of September 23, 2012, when Clayton Kershaw started against Homer Bailey.
The Dodgers won the game, 5 to 3, which means that Kershaw was two runs better than Bailey. (It doesn’t exactly mean that, of course, but play along with me.) In the first cycle of the ranking process, Kershaw and Bailey are both ranked at 15.00. There are 30 points between them going into the game, and if they have 30 points going in, they have to have 30 points coming out. If they have 30 points between them and Kershaw was two runs better than Bailey, that puts Kershaw at 16.00 and Bailey at 14.00.
Except that it’s a home game for Bailey, so he has an advantage that has to be filtered out. The home team’s advantage (in runs scored) is 0.14 runs per game. (It is that small only because the home team often doesn’t bat in the bottom of the ninth, but that isn’t relevant right now.) When we factor that in, then, Kershaw was actually 2.14 runs better than Bailey. That makes Kershaw’s "Game Output Score" for that contest 16.07, and Bailey’s 13.93. Kershaw was 2.14 runs better than Bailey.
Actually, Kershaw’s team was 2.14 runs better than Bailey’s team, as we all know, but. . .these are the rules of the game. We figure these "Game Output Scores" for every game in the data, 118,000+ games. Each pitcher’s rank, then, is the average of his Game Output Scores.
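Here is a minimal sketch of the first-cycle arithmetic for the Kershaw/Bailey game. The function names are invented for the illustration, and splitting the combined points evenly around the midpoint of the two ratings is my reading of "30 points going in, 30 points coming out"; the article does not give a formula.

```python
HOME_ADVANTAGE = 0.14  # runs per game, the figure given in the text


def game_output_scores(visitor_rating, home_rating, visitor_runs, home_runs):
    """Return (visitor_score, home_score) for one game.

    Assumption: the two pitchers' combined rating points are conserved and
    split around their midpoint by half the adjusted margin each way.
    """
    # The visiting pitcher is credited with the home-field edge he overcame.
    margin = (visitor_runs - home_runs) + HOME_ADVANTAGE
    midpoint = (visitor_rating + home_rating) / 2
    return midpoint + margin / 2, midpoint - margin / 2


def naive_rank(game_scores):
    """Plain average of a pitcher's Game Output Scores."""
    return sum(game_scores) / len(game_scores)


# September 23, 2012: Dodgers 5, Reds 3, with Bailey pitching at home.
kershaw, bailey = game_output_scores(15.0, 15.0, visitor_runs=5, home_runs=3)
print(round(kershaw, 2), round(bailey, 2))  # 16.07 13.93
```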
Except that it isn’t, because I know immediately that that isn’t going to work. I know immediately that what is going to happen if we do that is that somewhere in the data there is going to be a pitcher who made only one major league start, and who was matched up in that one start against Bob Gibson or Sandy Koufax or Randy Johnson or somebody really good, and who happened to "win" that one game by a score of 17 to 6 or 22 to 8 or something, so their one and only Game Output Score will be "ten-plus runs better than Randy Johnson." That’s not the answer I want, so to prevent that from happening, I assume that every pitcher has a minimum of 100 starts. If the pitcher actually has 100 starts, well and good. But if he doesn’t have 100 career starts, then we fill in the missing games with an average score of 13.00. Let’s assume that that pitcher is Mike Crudale; Mike Crudale made only one major league start, in 2002, and the Cardinals won that start, 5 to 0, beating Bruce Chen, who isn’t Randy Johnson but who is still around and still pitching very well now, eleven years later. That makes Crudale five runs a game better than Bruce Chen, which would make him awfully good.
What we do, then, is to take Crudale’s Game Output Score from that game. . .let us say that it is 17.50. . .and we mix that with 99 "games" in which his Game Output Score is 13.00. I am assuming that 13.00 represents a replacement-level pitcher. That makes Crudale’s ranking, based on his one real game and the 99 replacement-level games, 13.045.
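The padding step is just a weighted average, and a short sketch reproduces the Crudale figure. The function name `padded_rank` and its parameter names are made up for the illustration; the 100-start minimum, the 13.00 fill value and the 17.50 example score come from the text.

```python
def padded_rank(game_scores, min_starts=100, replacement=13.00):
    """Average the real Game Output Scores with phantom replacement-level
    starts, so that every pitcher is judged on at least `min_starts` games."""
    phantom = max(0, min_starts - len(game_scores))
    total = sum(game_scores) + phantom * replacement
    return total / (len(game_scores) + phantom)


# One real start at 17.50 plus 99 phantom starts at 13.00.
print(round(padded_rank([17.50]), 3))  # 13.045
```

A pitcher with 100 or more real starts gets no phantom games, so for him the function reduces to a plain average.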
Why 13? Let’s say that a team scores 4.50 runs in an average game, which they do, more or less. If you’re two runs worse than that, that’s 6.50 runs. If you score 4.50 runs per game and give up 6.50 runs per game, you have a winning percentage of .324. That’s about a replacement level, so. . .two runs a game below average is replacement level.
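The .324 figure is what the Pythagorean expectation gives with an exponent of 2. The text does not name the formula, so treat that as an assumption of this sketch:

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed.

    Assumption: the classic exponent of 2, which matches the .324 in the text.
    """
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


# An average offense (4.50 runs) behind a pitcher two runs worse than average.
print(round(pythagorean_win_pct(4.50, 6.50), 3))  # 0.324
```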
When we figure these rankings for every pitcher, the best pitcher of all time, after the first iteration of the data, is. . .Phil Hughes? It is. The top ten pitchers, at the end of the first cycle through the data, are as follows:
First | Last | Rank
Phil | Hughes | 15.99
Russ | Meyer | 15.94
Whitey | Ford | 15.89
Adam | Wainwright | 15.89
Don | Gullett | 15.88
David | Price | 15.85
Pedro | Martinez | 15.83
Tim | Hudson | 15.81
Bob | Lemon | 15.80
Jim | Palmer | 15.80
I don’t know why Phil Hughes ranks first here. His career winning percentage is very good, and I would presume that the Yankees must have scored a lot of runs for him, giving him some lop-sided wins, and that they must have won most of the games in which he has had no decision, so that his average margin of victory must be larger than any other pitcher’s. Russ Meyer was a Phil-Hughes type pitcher with the Dodgers in the Duke Snider era.
Apart from Hughes and Meyer in the 1-2 spots, it’s not really a crazy list; we have four or five Hall of Famers in the top ten spots, and most of the other Hall of Fame pitchers are somewhere in the top 100. This, however, is merely the data after the first iteration of the process. Going back now to the Kershaw against Bailey game, after the first cycle of the data Clayton Kershaw (gesundheit) has a rating of 15.59, and Homer Bailey a rating of 15.28. When we plug those numbers back into that game, then, Kershaw has a Game Output Score (for that game) of 16.55, and Bailey of 14.48, whereas the scores in the first cycle were 16.07 and 13.93. The system now knows that that was a matchup of two good pitchers, so it adjusts the Game Output Scores accordingly.
Also, I probably should explain. . .the addition to the system of all of those "phantom 13s" drops the average rankings below 15.00. If we didn’t correct for this, then as we repeated the process through cycle after cycle, 13.00 would gradually replace 15.00 as the average ranking, so that we would be assuming that all of the "missing" starts were of average quality, rather than that they were of replacement-level quality. We have to adjust the rankings after every cycle of the study to re-center them at 15.00. After the second round of the study, the top ten pitchers of all time are as follows:
First | Last | Rank
Phil | Hughes | 16.35
Whitey | Ford | 16.29
Don | Gullett | 16.27
Adam | Wainwright | 16.21
Pedro | Martinez | 16.21
David | Price | 16.20
Tim | Hudson | 16.15
Juan | Marichal | 16.15
Jim | Palmer | 16.13
Russ | Meyer | 16.13
These are the same pitchers as before, except that Juan Marichal has replaced Bob Lemon on the list and Russ Meyer has fallen from second to 10th.
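The cycle-and-re-center procedure can be sketched as a short loop. This is an illustration under stated assumptions, not the exact procedure behind the tables: the article does not spell out the update formula, so the sketch assumes each game's points are split around the midpoint of the two current ratings, and that re-centering shifts every rating by the same amount so that the unweighted mean across pitchers returns to 15.00. The pitchers "A", "B" and "C" and their games are invented toy data, and `min_starts` is set to 3 only to keep the example small.

```python
from collections import defaultdict

HOME_ADVANTAGE = 0.14  # runs per game, from the text
AVERAGE = 15.0         # the starting rating and the re-centering target


def run_cycle(games, ratings, min_starts=100, replacement=13.0):
    """One pass through the data: score every game, pad, average, re-center."""
    scores = defaultdict(list)
    for visitor, home, visitor_runs, home_runs in games:
        margin = (visitor_runs - home_runs) + HOME_ADVANTAGE
        midpoint = (ratings[visitor] + ratings[home]) / 2  # assumed split rule
        scores[visitor].append(midpoint + margin / 2)
        scores[home].append(midpoint - margin / 2)

    padded = {}
    for pitcher, game_scores in scores.items():
        phantom = max(0, min_starts - len(game_scores))
        total = sum(game_scores) + phantom * replacement
        padded[pitcher] = total / (len(game_scores) + phantom)

    # Re-center so the phantom 13s do not drag the average down cycle by cycle.
    shift = AVERAGE - sum(padded.values()) / len(padded)
    return {pitcher: rating + shift for pitcher, rating in padded.items()}


def rate(games, cycles=26, **kwargs):
    """Start everyone at 15.00 and repeat the cycle a fixed number of times."""
    ratings = {p: AVERAGE for game in games for p in game[:2]}
    for _ in range(cycles):
        ratings = run_cycle(games, ratings, **kwargs)
    return ratings


# Toy data: (visiting starter, home starter, visitor runs, home runs).
GAMES = [("A", "B", 5, 3), ("B", "C", 4, 1), ("C", "A", 2, 6)]
ratings = rate(GAMES, cycles=26, min_starts=3, replacement=13.0)
for pitcher in sorted(ratings, key=ratings.get, reverse=True):
    print(pitcher, round(ratings[pitcher], 2))
```

On the toy data the ratings settle within a few cycles. Real data, with pitchers whose opponents vary widely in quality, would be expected to move more slowly.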
After the third round of the study the top ten pitchers are these:
First | Last | Rank
Phil | Hughes | 16.47
Whitey | Ford | 16.47
Don | Gullett | 16.46
Pedro | Martinez | 16.39
David | Price | 16.36
Adam | Wainwright | 16.33
Juan | Marichal | 16.33
Tim | Hudson | 16.31
Jim | Palmer | 16.28
Andy | Pettitte | 16.22
And after the fourth round, these:
First | Last | Rank
Whitey | Ford | 16.55
Don | Gullett | 16.55
Phil | Hughes | 16.52
Pedro | Martinez | 16.47
David | Price | 16.43
Juan | Marichal | 16.42
Tim | Hudson | 16.38
Adam | Wainwright | 16.37
Jim | Palmer | 16.36
Andy | Pettitte | 16.28
One can see that the rankings are gradually moving in the direction of more reasonable results, but they are certainly taking their own sweet time about it. I followed the system through 26 cycles of the process. After 26 cycles, the top ten pitchers were these:
First | Last | Rank
Don | Gullett | 16.77
Juan | Marichal | 16.62
Whitey | Ford | 16.59
Jim | Palmer | 16.52
Pedro | Martinez | 16.49
Phil | Hughes | 16.44
Ron | Guidry | 16.43
David | Price | 16.40
Tim | Hudson | 16.36
Sandy | Koufax | 16.33
It’s not a terrible list. You take any five pitchers from that list and make them your starting rotation, you’re in good shape. That’s an understatement; you take any five pitchers from that list and make them your starting rotation, and you win the pennant.
Still. . .Don Gullett is not the greatest pitcher of all time. I understand why he ranks there, sort of. His career winning percentage was .686, and, since he pitched for the Big Red Machine, he won a lot of games by comfortable margins. If you follow the process long enough the scores will entirely stop moving, and, after 26 cycles, we have reached the point at which the scores have, for the most part, stopped moving. We are stuck with Phil Hughes as the sixth-greatest pitcher of all time, which is really not an answer that I can live with.
I made some adjustments to the process and started over. First, I changed the "100 games minimum" to "200 games minimum", which was done, of course, to get rid of Phil Hughes and David Price from the top ten list. Second, I changed the replacement level from 13.00 to 13.25. It was just an instinct that 13.00 was too low, unrealistically low.
Also, to mitigate the effects of unequal offensive support for different pitchers, I replaced the team’s actual runs scored in the game with a number which was 80% of their actual runs scored in the game, plus 0.90.
Kason Gabbard once won a game by the score of 30 to 3, which makes him 27 runs better than the opposition starting pitcher in that game. This adjustment treats the score of that game as if it were 24.9 to 3 from Gabbard’s standpoint and 30 to 3.3 from the standpoint of the opposition pitcher (Daniel Cabrera).
This process, without this adjustment, treats runs scored for the pitcher the same as runs allowed by the pitcher; in other words, winning a game 4 to 0 is the same as winning a game 8 to 4. This adjustment just makes the opposition score a little bit more significant than the offensive support, which of course it should be. Winning a game 4 to 0 becomes a victory margin of 4.1; winning a game 8 to 4 becomes a victory margin of 3.3. I thought maybe this would help us with the Don Gullett/Phil Hughes problem of pitchers being overrated because their teams scored a lot of runs for them.
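The adjustment is easy to check by hand, and a small sketch reproduces the figures quoted above. The function name `adjusted_margin` and its parameter names are made up for the illustration; the 80% weight and the 0.90 constant are the ones given in the text.

```python
def adjusted_margin(own_runs, opp_runs, weight=0.80, constant=0.90):
    """Margin of victory from one pitcher's standpoint, with the runs scored
    by his own team damped to 80% of actual plus 0.90."""
    return (weight * own_runs + constant) - opp_runs


print(round(adjusted_margin(4, 0), 1))   # 4.1  (winning 4 to 0)
print(round(adjusted_margin(8, 4), 1))   # 3.3  (winning 8 to 4)
print(round(adjusted_margin(30, 3), 1))  # 21.9 (Gabbard: 24.9 to 3)
print(round(adjusted_margin(3, 30), 1))  # -26.7 (Cabrera: 3.3 to 30)
```

One thing the Gabbard game makes visible: after the adjustment the two pitchers' margins are no longer mirror images of one another (plus 21.9 for one, minus 26.7 for the other).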
Then I re-started the process, with every pitcher starting back at 15.00. After following that process through some unnecessarily large number of cycles, I had the following list of the top 25 pitchers of the last 60 years:
Rank | First | Last | Score
1 | Juan | Marichal | 16.87
2 | Jim | Palmer | 16.80
3 | Gary | Nolan | 16.67
4 | Steve | Blass | 16.65
5 | Pedro | Martinez | 16.64
6 | Whitey | Ford | 16.63
7 | Ron | Guidry | 16.61
8 | John | Candelaria | 16.60
9 | Sandy | Koufax | 16.59
10 | Bob | Gibson | 16.54
11 | Jim | Maloney | 16.53
12 | Dave | McNally | 16.53
13 | Tom | Seaver | 16.52
14 | Don | Gullett | 16.51
15 | Steve | Carlton | 16.49
16 | Ferguson | Jenkins | 16.48
17 | Jimmy | Key | 16.47
18 | Nelson | Briles | 16.44
19 | Don | Sutton | 16.38
20 | Tim | Hudson | 16.38
21 | Larry | Dierker | 16.37
22 | Luis | Tiant | 16.37
23 | Jack | Billingham | 16.37
24 | Mike | Cuellar | 16.36
25 | Tony | Cloninger | 16.36
At which point I decided to call it a day. Steve Blass and Gary Nolan have replaced Phil Hughes and David Price as the problem pitchers among the top 10, but the process still has some obvious flaws.
The process also has some good points, and let me highlight those briefly. It automatically removes park effects, since the two pitchers working against one another are always working in the same park; thus, there is no entry point for park effects. It automatically removes the up-and-down swings of offensive levels over the years, since the two pitchers matched against one another are always working in the same year and on the same day. Actually, the system is very slightly biased against a good pitcher working in a low-run park and a low-run era, because the margins of victory will be smaller in a low-run park, but this is a very small bias, and not a real problem.
But the system treats the performance of the team as representative of the performance of the individual pitcher, thus giving Phil Hughes the benefit of Mariano Rivera’s good innings as well as the offensive output of Mark Teixeira, A-Rod, Granderson, et al. That’s a problem. There is a more subtle problem, which is the fraying of the edges along the time line. The system "replaces" games that the starting pitcher has not pitched, up to the level of 200 starts per pitcher. In the first and last years of the study (1952 and 2012, and the surrounding years) the percentage of pitchers who have fewer than 200 starts is higher, since we are looking at partial-career data for those pitchers. Since there are more pitchers who have fewer than 200 starts, there are more "replacement level" games added into the system. This tends to push down the average ratings for pitchers pitching in the first and last years of the study, which tends to favor the pitchers pitching in the middle years. . .in this case, the pitchers from the 1970s and 1980s. I don’t know of any satisfactory way to solve that problem.
The system is looking for the average level of a pitcher’s performance, and thus treats a pitcher pitching well in 201 games the same as a pitcher pitching well in 700 games. This, in part, explains the strong ratings for Gary Nolan, Steve Blass and Tony Cloninger; they made 200 starts, but not a lot more.
I don’t know of any really satisfactory way to solve that problem, either. I think what you would have to do is use a "sliding scale" replacement level, in which we replaced missing starts below 100 at a level of 13.25, below 200 at 14, below 300 at 14.5, and below 500 at 15.0, or something like that. . .in other words, we won’t assume that the games Steve Blass didn’t pitch are at replacement level, but we will assume that they are only average. But I’d want to think through the logic of that somewhat better before I did it.
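To make the "sliding scale" idea concrete, here is one possible reading of it, sketched under assumptions: career start slots 1 to 100 that a pitcher did not fill are padded at 13.25, slots 101 to 200 at 14.0, slots 201 to 300 at 14.5, and slots 301 to 500 at 15.0. The tier values are the ones floated above; the function name, the slot-by-slot interpretation and the example pitcher are invented for the illustration.

```python
# (last start slot in the tier, fill value), using the values floated in the text.
TIERS = [(100, 13.25), (200, 14.0), (300, 14.5), (500, 15.0)]


def sliding_scale_rank(game_scores, tiers=TIERS):
    """Average real Game Output Scores with tiered phantom starts.

    Assumption: a pitcher with n real starts fills slots 1..n, and every
    unfilled slot up to the last tier is padded at that tier's value.
    """
    n = len(game_scores)
    total = sum(game_scores)
    count = n
    lower = 0
    for upper, fill in tiers:
        # Unfilled slots in this tier are those above both n and the tier floor.
        missing = max(0, upper - max(n, lower))
        total += missing * fill
        count += missing
        lower = upper
    return total / count


# Hypothetical pitcher: 231 real starts, each with a Game Output Score of 16.0.
print(round(sliding_scale_rank([16.0] * 231), 3))  # 15.393
```

Under this reading a short, brilliant career is pulled toward average rather than toward replacement level, which is the effect described above. Whether the tiers themselves are sensible is exactly the logic that would need thinking through.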
I thought that there might be a "time line bias" in the study, caused by the fact that many pitchers’ best years come when they are young, and their weaker years when they are older. Juan Marichal pitched for 16 years in the majors. He was 144-68 in the first eight years, and 99-74 in the second eight years. Tom Seaver pitched for 20 years; he was 182-107 in the first ten years and 129-98 in the second ten.
That’s pretty common, and I was concerned that this could cause pitchers in the second half of the study to rate higher than pitchers in the first half, since, when an old pitcher pitches against a young pitcher of the same career quality, the younger pitcher will tend to win and thus tend to rate higher. That could be a problem, although I didn’t really see any evidence that it was.
A lot of what happens in the study, I really don’t understand. At the time I called off the dogs, Greg Maddux was in 90th place in the study, between Bob Forsch and Jim Barr. I really don’t understand why; the system doesn’t like Greg Maddux, but I don’t know why. Jim Bunning ranked below Maddux. Jim Rooker ranked ahead of Dave Stewart, Reggie Cleveland ranked ahead of Nolan Ryan, and Larry McWilliams ranked ahead of Camilo Pascual. I really don’t understand any of these rankings or why the system doesn’t work in those cases, but I’ve put as much time into it as I have available. And more.
I think if I were to attempt this study again, what I might do is base the comparison not on the score of the game, but on the Game Scores of the two starting pitchers. If one starting pitcher in a game has a Game Score of 50 and the other of 40, the team with the "50" Game Score is going to win. . .well, something more than 80% of the time, probably more than 90%. Let me see. . .
Just checked. In my data there are 1,260 games in which one starting pitcher has a Game Score of exactly 50, and the other starting pitcher has a Game Score of 39, 40 or 41. The team with the pitcher who has a Game Score of 50 goes 1053-207 in those games, an .836 winning percentage.
Anyway, if you based the system on Game Scores, rather than the scores of the games, it would still have the virtues of removing park effects and other external run influences from the study, but would also remove some of the other extraneous influences, such as the performance of the bullpens. I think that might work better. But I’m done with it for now.