On Babe Ruth
Lost In Time
Joe Posnanski posted a poll a day or two ago asking about Babe Ru .. . well, here; I’ll just copy the poll:
What do you think is Babe Ruth’s LIKELY level if you plucked him out of 1927 and put him in today’s big leagues.
Superstar
Star (after adjustments)
Average to below average
Could not make it
The poll got 8,298 votes. 21% (including me) said "Superstar", 25% said "Star (after adjustments)", 24% said "Average to below average", and 30% said "Could not make it".
I am quite certain that most people are wrong about this, and actually more certain about that than I was at the time I voted, but that’s getting ahead of myself. Accompanying my vote, I posted the following:
What people don't get is that the difference between TEAMS is much less than the diff between PLAYERS, & the diff between LEAGUES is less than teams. So there is enormous improvement since 1927, but on the scale of LEAGUES. On the scale of PLAYERS it isn't nearly as large.
My post got 8,299 responses; take that, Joe. Just kidding. Anyway, one response to my post was:
I’ve read this tweet four times and it’s the syntax—that’s what I don’t understand.
Well, I’d help you out, dude, but honestly I have never understood what "syntax" is. Seriously. If you held a gun to my head I couldn’t define the word "syntax". It has something to do with words, I think. But I posted this just in case it helps:
I’ve been trying to explain it to baseball people for 30 years and they rarely understand. Let’s say an average team is .500, a pennant-winning team is .600, and Babe Ruth is .850. If the average TEAM moves from .500 to .600, that’s an enormous change, but it doesn’t do much to Babe Ruth.
All of that is by way of background. I’ve done a little study here or half-study or thought experiment or modelling exercise or something; whatever it is, it relates to the issue and breaks new ground for me, so I hope it is worth sharing.
The three questions I am trying to get to here are:
What is the standard deviation of performance level (or winning percentage) among players?
What is the standard deviation of performance level (or winning percentage) among teams? and
What is the standard deviation of performance level (or winning percentage) among leagues?
It is very easy to know what the standard deviation of winning percentage is for teams. You can calculate that for a season, you can calculate it for a decade, or for all of baseball history; that’s easily done, has been done thousands of times by thousands of people. That’s easy.
The standard deviation for players is not nearly as easy to measure. It is difficult to measure for at least three reasons. First, given a player’s performance over the course of a season, his stats, it is difficult to determine exactly what his effective winning percentage was. Different experts could address that problem, and they would come up with somewhat different answers, because there are unknowns in the process sufficient to cause meaningful discrepancies. Second, even given what a player’s effective winning percentage was, it is difficult to know to what extent this represented his true skill level. You could take 50 players of exactly the same skill level and let them play out a season; some of them would hit .300 and some of them would hit .250, just because that’s what happens, that’s baseball. What we need to know is the standard deviation of skill level. And third, you would get a very different answer to the question of "what is the standard deviation of the effective winning percentage of players" if you studied only players who had 600 plate appearances than you would if you studied players with 200 plate appearances, or with 50 plate appearances.
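The second problem, the gap between observed performance and true skill, can be illustrated with a quick simulation. This is a sketch of my own, not part of the study: give 50 hitters the identical true skill of .275 and 500 at-bats each, and see how far apart their observed averages land anyway.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

TRUE_AVG = 0.275   # every simulated player has exactly this skill level
AT_BATS = 500
PLAYERS = 50

# Simulate each at-bat as an independent success at the true rate,
# then record each player's observed batting average.
observed = [
    sum(random.random() < TRUE_AVG for _ in range(AT_BATS)) / AT_BATS
    for _ in range(PLAYERS)
]

print(f"lowest: {min(observed):.3f}  highest: {max(observed):.3f}")
```

Even with every player at exactly .275, the observed averages routinely spread at least as wide as the .250-versus-.300 gap described above, which is why the standard deviation of observed performance overstates the standard deviation of skill.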
The standard deviation of skill level for players is very difficult to measure, but it could theoretically be measured. But the standard deviation of skill level for leagues is almost impossible to ascertain from data. Since one player always fails when another player succeeds, since the winning percentage of each league is .500, all leagues measure as being the same, even if they are not actually the same. You can gain some insight into the problem by studying interleague play, but that data sample is too small to be remotely reliable, and doesn’t address the problem of 2008 compared to 2018, which is what we are interested in. You can gain some insight into the problem by studying players moving between leagues, but that’s highly speculative, and you can gain some insight into the problem by studying college baseball, where teams from different leagues routinely play against one another, but that’s almost entirely irrelevant to the discussion of major league teams. Basically, we don’t have a clue what the standard deviation of performance levels between different leagues is. There’s no way to calculate it from the data.
BUT.
Perhaps we could calculate it from a model? We cannot calculate what the standard deviation of performance level for different major leagues IS, but perhaps we can estimate from a model what it SHOULD BE. We can estimate what we would expect it to be, based on what we know.
Suppose that we represent the performance level of an average player at .500, and the performance level of an individual player as a random number between zero and one. That’s his winning percentage, let us say. In a spreadsheet, I thus "created" 2 million players, which is vastly larger than the actual number of players in major league history, but each player was just one cell of the spreadsheet, one random number.
Suppose, then, that each 15 players represents a team. Please don’t debate with me the number "15"; I know that a major league team has 25 players, not 15, but a regular player, healthy all year, is much more than 1/25th of his team. If I had used 25 players to represent a team, then the "diminishing improvement" implication which results from the model would be stronger than it is, and readers would correctly point out that I had over-stated the results by using too many players to represent a team.
And suppose, then, that each 15 teams represents a league. There are 15 teams in each league now; there used to be 8, but that still kind of works, because two eight-team leagues would be 16 teams, and thus our measurement reasonably approximates the random difference between 1934 and 1935, assuming nothing big happened in the winter of 1934/35.
Given a sample of 2 million "players" with an average of .500, the standard deviation of skill level was .288, actually .288573, if you need to know.
Let us assume that each fifteen players, generated at random, represent a team. What, then, is the standard deviation of skill level for teams?
It is .068.
Now that’s a really interesting figure. When I saw that figure, I realized that I would have to publish the study.
Why is that a really interesting number? Some of you already know.
Because that’s almost exactly what the standard deviation of winning percentage for teams actually is. Over the last five years (2014 to 2018) the standard deviation of winning percentage for teams is .070.
I didn’t expect that to happen. It’s just a really simple model with some obviously unrealistic assumptions; I assumed that when I calculated the output from the model we would get a set of obviously unrealistic numbers, and then we could get into a discussion of how the output of the model relates to real life. But there is only one output from the model for which we know what the real-life number ought to be, and the way it relates is: it’s the same.
That doesn’t answer ALL of the questions about how the data in the model relates to real life, not by miles and miles, but it advances the ball. It takes away certain issues of scale. There are still a lot of other problems.
Anyway, moving ahead. . . .What is the standard deviation of winning percentage for leagues?
It’s .017.
Standard deviation of Winning Percentage for players: .288
Standard deviation of Winning Percentage for teams: .068
Standard deviation of Winning Percentage for leagues: .017
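The whole exercise can be re-created in a few lines. This is a sketch of the spreadsheet model described above, not the original file; with a different random draw the team and league figures will land near, but not necessarily exactly on, the numbers quoted (averaging 15 independent random numbers divides the spread by roughly the square root of 15, so the exact second and third figures depend on the run).

```python
import random

random.seed(42)

# 2,000,025 is the multiple of 15 x 15 closest to the article's 2 million,
# so that every team has 15 players and every league has 15 teams.
N_PLAYERS = 2_000_025
PER_TEAM = 15
PER_LEAGUE = 15

def pstdev(xs):
    """Population standard deviation of a list of numbers."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Each "player" is a uniform random winning percentage between 0 and 1.
players = [random.random() for _ in range(N_PLAYERS)]

# A team's level is the average of its 15 players.
teams = [
    sum(players[i:i + PER_TEAM]) / PER_TEAM
    for i in range(0, N_PLAYERS, PER_TEAM)
]

# A league's level is the average of its 15 teams.
leagues = [
    sum(teams[i:i + PER_LEAGUE]) / PER_LEAGUE
    for i in range(0, len(teams), PER_LEAGUE)
]

sd_players = pstdev(players)
sd_teams = pstdev(teams)
sd_leagues = pstdev(leagues)

print(f"players: {sd_players:.3f}")
print(f"teams:   {sd_teams:.3f}")
print(f"leagues: {sd_leagues:.3f}")
```

The point to notice is the ratio of the scales: each step of aggregation shrinks the spread by a factor of about four, which is the whole argument in miniature.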
Now, agreeing for the sake of argument that the quality of play in major league baseball improves steadily over time and has improved quite significantly since 1927. . . agreeing for the sake of argument, and also because it is pretty obviously true. But the next question is, does the improvement in play over time take place on the scale of PLAYERS, or on the scale of LEAGUES?
It seems obvious that it takes place on the scale of differences between leagues. Since the game of baseball is even larger than a league, it could reasonably be argued that the scale of change is even smaller than is reflected in this "league" number.
Comparing 1970 to 1960, for example. . .well, bad example. Baseball expanded from 16 teams to 24 in the 1960s, which probably caused a step backward in the overall level of play, rather than a step forward. Comparing 1980 to 1970, for example, it would seem to be almost certainly true that the quality of play in the National League in 1980 was higher than it was in 1970.
But is it likely that every TEAM in 1980 was better than any TEAM in 1970? The best team in the National League in 1970 was the Cincinnati Reds, 102-60. The worst team in the National League in 1980 was the Chicago Cubs, 64-98. You don’t really think that the 1980 Chicago Cubs could beat the 1970 Cincinnati Reds, do you?
No, of course you don’t, because changes in the quality of play don’t operate on that scale. To suggest that every player in the National League in 1980 was better than any player in the National League in 1970 would be another gigantic leap. No one would argue that.
It seems apparent to me that changes in the game overall operate on the scale of leagues, but how large are they, on that scale?
Let us suppose that the improvement in play in baseball in each ten-year period is one standard deviation of league quality. That would be an ENORMOUS change. It is actually hard to believe that the change could be that large, except under unusual conditions such as comparing 1953 to 1943. If we assume that that was true, that would suggest that the improvement from 1927 to the present was about 9 standard deviations.
A difference of 9 standard deviations is. . . well, numbers don’t come that large. It never happens. It’s not a one-in-a-million difference; it is vastly beyond that. A one-in-a-million difference is like 4 standard deviations or something, I forget exactly. Five standard deviations would be many times larger than four standard deviations. If you measured every human being on earth by height, weight, hair quantity, blood pressure and 50 other measurements, I would doubt that any human being would be nine standard deviations above the norm in any category. The 600-pound woman would not be nine standard deviations above the norm in terms of weight; she might be as much as six, maybe. Somebody will probably post and tell me I am wrong; she would actually be eleven. Whatever; I am just trying to explain that it is an enormous difference.
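For the record, the tail probabilities can be computed exactly if we assume a normal curve; this is a side calculation of mine, not part of the study. A one-in-a-million difference turns out to be about 4.75 standard deviations, one-tailed.

```python
from math import erfc, sqrt

def upper_tail(z):
    """One-tailed probability of exceeding z standard deviations
    under a standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))

for z in (4, 4.75, 6, 9):
    p = upper_tail(z)
    print(f"{z} SD: about 1 in {1 / p:,.0f}")
```

At nine standard deviations the one-tailed probability is on the order of one in ten quintillion, which is the sense in which "numbers don’t come that large."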
But this is my point: that, in the model that I established, Babe Ruth would be at .999, and one standard deviation above the norm for a player would be at .788. If the norm moves forward on the scale of players, the difference is relevant to Ruth.
But on the scale of leagues, Babe Ruth is .999 compared to .500. If the league moves forward by one standard deviation, he is at .999 compared to .517. If it moves forward by two standard deviations, he is at .999 compared to .534. It makes very little difference to Babe Ruth. It is impossible to see how the norm for a league could move forward by such a large margin that it buries Babe Ruth. It just can’t happen, once the system is organized. It can happen before the system is organized—that is, before 1890—but not after the system is organized.
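The arithmetic above can be run out to any number of league-scale steps; this sketch just uses the .999 and .017 figures from the model.

```python
RUTH = 0.999       # Ruth's level in the model
LEAGUE_SD = 0.017  # standard deviation among leagues, from the model

for k in (0, 1, 2, 5, 9):
    norm = 0.500 + k * LEAGUE_SD
    print(f"norm after {k} league-SDs: {norm:.3f}   Ruth's edge: {RUTH - norm:.3f}")
```

Even after the nine-standard-deviation advance discussed above, the league norm reaches only .653, and a .999 player still towers over it.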
Years ago, in a SABR publication, somebody posted a "proof" that baseball quality is constant over time, which went something like this; I forget the exact details.
Ty Cobb hit .350 in 1907, leading the league in hitting, and hit .357 in 1927.
Eddie Collins hit .347 and .324, as a regular, in 1909 and 1910, and hit .360 and .349, as a regular, in 1923 and 1924.
Babe Adams was 18-9 in 1910, and 14-5 in 1921, with a better ERA relative to the league in 1921 than in 1910.
Pete Alexander had a .683 winning percentage and a 2.57 ERA in 1911, and had a .677 winning percentage and a 2.52 ERA in 1927.
Eppa Rixey was 19-18 in 1921, and 19-18 in 1928.
Huck Betts was 3-7 in 1921, but 17-10 in 1934.
Ted Lyons was 12-11 with a 4.87 ERA in 1924, and was 14-6 with a 2.10 ERA in 1942.
Freddie Fitzsimmons was 14-10 with a 2.88 ERA in 1926, and 16-2 with a 2.82 ERA in 1940.
Al Benton had a 4.56 ERA in 1939, 4.44 in 1940, but 2.12 in 1949 and 2.37 in 1952.
Ted Williams was the best hitter in baseball in 1941, and the best hitter in baseball in 1960.
Pee Wee Reese hit .229 and .255, as a regular, in 1941 and 1942, but hit .309 and .282 in 1954 and 1955.
Warren Spahn was 21-10 in 1947, and 21-10 in 1960.
Bob Lemon was 20-14 in 1948, and 20-14 in 1956.
Warren Spahn was 23-7 in 1953, and 23-7 in 1963.
Ted Abernathy was 5-9 with a 5.93 ERA in 1955, but was 10-3 with a 2.59 ERA in 1970, and posted a 1.72 ERA in 1972.
Bob Gibson from 1959 to 1961 had ERAs of 3.32, 5.59 and 3.24. In 1972 and 1973 he had ERAs of 2.46 and 2.73.
Steve Carlton was 14-9 with a 2.98 ERA in 1967, and 23-11 with a 3.10 ERA in 1982.
Bert Blyleven was 16-15 with a 2.81 ERA in 1971, and 17-5 with a 2.73 ERA in 1989.
George Brett hit .333 in 1976, winning the batting title, and .329 in 1990, winning the batting title. He hit .282 in 1974, and .285 in 1992.
Paul Molitor hit .273 in 1978, and .281 in 1998, with more than 500 at bats each year.
Tony Gwynn hit .309, .351 and .317 from 1982 to 1984, and .372, .321 and .338 from 1996 to 1998.
Chuck Finley had a 4.11 ERA in 1988, and a 4.11 ERA in 2000.
Andy Pettitte was 21-8 in 1996, and 21-8 in 2003.
Bartolo Colon had a 3.71 ERA in 1998, and 3.43 in 2016.
I could do thousands more, but you get the point. If baseball improved dramatically, he asked, when did it improve?
There is a lot wrong with that argument, and when he published that argument in the mid-1970s I made fun of him for it, because, well, I’m kind of a jackass sometimes. There is a lot wrong with the argument, but there is something right about it, too. IF baseball was improving on the scale of teams—that is, if it was improving by .050 in a decade or something like that—then these comparisons would not be possible. It would not be possible for a player to remain dominant in his decline phase, seasons separated by 15 years or more, except in very rare situations where a player actually improved greatly over time. The fact that this happens is not evidence that baseball is not improving over time, but it IS evidence that the league is not taking large steps forward in each decade. The only way that Babe Ruth could NOT be a star player today is if the league was taking large steps forward in each decade.
Late Night Addendum 1
Expanding a little bit on the study published a couple of hours ago—
In this study there is (A) the standard deviation of talent among players, (B) the standard deviation of talent among teams, and (C) the standard deviation of talent among leagues.
B is a known variable and reasonably near a constant,
A is a partially known variable, and
C is completely unknown
The critical question of the study is, What is the relationship of B to C? That’s REALLY what the study is about: If B is known, what is C likely to be?
In my model, it is extremely likely that the estimate of A is completely wrong, and that (because of that) the relationship of A to B in this study is mis-stated or mis-calculated.
A is mis-estimated in this model, because using a random number to represent the player’s value (A) creates a straight-line distribution, when in reality it is likely that the distribution of A represents either a bell-shaped curve or the right-hand portion of a bell-shaped curve. The standard deviation of winning percentage among players is almost certainly NOT .288, but some number significantly lower than that. I would suggest that it is probably .120 to .150, but that’s just a guess.
With regard to Babe Ruth, the model assumes that he is +.499 vs. an average player, but also that he is less than two standard deviations above the norm. In reality, he is almost certainly MORE than two standard deviations above the norm. He might be 3, he might be 3 and a half, but he’s more than 2. He’s not a .999 player; he is more like an .850 player, but against a norm of .120, that’s close to three standard deviations above the norm.
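That arithmetic is easy to check; this sketch uses the guessed .120-to-.150 range from above, which is my estimate rather than a measured figure.

```python
RUTH = 0.850  # Ruth as roughly an .850 player
NORM = 0.500

for sd in (0.120, 0.135, 0.150):
    z = (RUTH - NORM) / sd
    print(f"player SD {sd:.3f}: Ruth is {z:.2f} standard deviations above the norm")
```

At the low end of the guessed range he comes out close to three standard deviations above the norm; even at the high end he is well above two.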
Why, then, does the model hit the mark in regard to B?
Because the model assumed that the winning percentage of the team is the average of the winning percentages of the 15 players, when in reality it probably represents an exponent of the average. It is probably something like the average to the power 2 set against (1 minus the average) to the power 2, or the power 4, or something, depending on how you calculate the winning percentage of the player, which is a partially understood problem.
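A hedged sketch of what such a transform might look like, in the spirit of the Pythagorean method; the exponents here are illustrative guesses, not the study’s actual formula. Raising the team’s average talent level A to a power against (1 − A) stretches mid-range differences and so widens the spread of team winning percentages.

```python
def stretch(a, exponent=2):
    """Map a team's average talent level a to a winning percentage
    via a Pythagorean-style transform: a^x / (a^x + (1-a)^x)."""
    return a ** exponent / (a ** exponent + (1 - a) ** exponent)

for a in (0.45, 0.50, 0.55):
    print(f"talent {a:.2f} -> W% {stretch(a):.3f} (x=2), {stretch(a, 4):.3f} (x=4)")
```

A team at a .550 talent average comes out near .600 with exponent 2, which is one way a too-narrow talent spread could still produce the observed spread in winning percentage.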
Basically, it’s off-setting errors with regard to A vs. B. A is wrong, and the relationship of A to B is wrong, but they happen to off-set to hit the target, so that B is what B ought to be. This was totally unexpected. I had no idea that B would turn out to be the actual B, so I wasn’t worried about the relationship of A to B.
BUT. Big But.
What is critical to the study is not the relationship of A to B, but the relationship of B to C. Is there any reason to believe that this problem has any impact on the relationship of B to C?
I can’t see that there is. It doesn’t seem to me that this problem has anything to do with the actual relationship of B to C. That’s why the study was exciting to me—because if B is the actual B, then it seems reasonable, based on the model, to assume that C is the actual C.
The Other Problem
Another issue here is that changes in HOW the game is played are not necessarily changes in how WELL the game is played. Not all changes in the game are changes for the better. The famous evolutionary biologist Stephen Jay Gould, who was a huge baseball fan, would make this point constantly: that evolution does not produce a BETTER species, it merely produces a DIFFERENT species which is better adapted to its own survival. I didn’t understand what the hell he was talking about at the time, but I get it now. Wish he was still around so I could tell him that.
Anyway, SOME of the changes that have taken place in baseball are not forward developments, but merely variations. Sideways evolution.
Suppose that in 1927, for some reason, baseball had split into two camps, with players NOT going back and forth from one to the other. Over time, the games played by the two camps would go in different directions. 90 years later, it would almost certainly be true that a dominant player from the West Coast league would not be equally dominant in the East Coast game—and vice versa. DIFFERENCE is not the same as QUALITY. The Japanese game, for example. The Japanese are baseball-obsessed at a much, much deeper level than Americans are, but the players from Japan who come to the states are sometimes at a disadvantage because they’re playing a significantly different game.
The "heavy bat" problem is not a QUALITY problem; it’s a DIFFERENT GAME problem. If Adam Ottavino went back to 1927, he almost certainly would have a six-plus ERA because his managers would make him try to pitch 9 innings a game, and he couldn’t do it. That’s the way the game was played then.
We moved away from the 9-inning starters because there is a competitive advantage in using relievers, but that advantage rests in large part on tactics that would be unsustainable in 1927. You couldn’t call players up to the majors at the drop of a hat in 1927. You’re playing in St. Louis and your top minor league team is playing in Rochester; you can’t get him there for tonight’s game, or tomorrow night’s game. You can’t watch video of his last start to see how he looks. You can’t check the internet to see how well he has been pitching. The strategies that dominate in 2019 wouldn’t work in 1927.