A couple of weeks ago I wrote an article for Grantland about Dwight Evans, arguing that Evans was a Hall of Fame caliber player. The article contrasted Evans’ contribution to his team with that of Dave Parker (among other players), and one section of the article contrasted the defensive contributions of the two players in the 1986 season. In that section I wrote this:
Because of the difference between them in range, however, Baseball Reference estimates that Parker in 1986 was 17 runs worse in the outfield than an average right fielder, whereas Evans was 8 runs better. That’s 25 runs.
I don’t know how they calculate that, and, because defense is so hard to measure, I prefer to use more conservative measurements. The difference between an average team and a championship team, in a season, is only about 150 runs. Saying that the fielding difference between two right fielders is 25 runs is a little like saying that a 150-pound woman gave birth to a 25-pound baby. Ouch. I’m not saying it’s not possible; it’s just hard to believe. I have Evans as being only about eight runs better than Parker in the field, not because I don’t believe the 25-run difference is possible, but just because I just don’t think that we know for certain how large the difference was. Parker also had been an outstanding defensive outfielder earlier in his career. But I don’t think anyone questions that, by 1986, Dwight Evans was a lot better outfielder than was Parker.
Some of you probably realized, in reading that, that I had used a deceptive comparison to illustrate my point. I compared a very good right fielder—Evans—to a very bad right fielder—Parker in 1986—but, to illustrate my point about the scale, I compared a very good team to an average team; not a bad team, but an average team. The difference between a very good team and an average team is about 150 runs, but the difference between a very good team and a very bad team is 300 runs. I was making the legitimate point that "that’s a lot of runs", but I did it in a little bit of a deceptive way.
You can flog me later, but first I wanted to look more carefully at the issue this raises: is it reasonably believable that the defensive difference between two right fielders would be 25 runs? Or, given the scale of differences between baseball teams, is that just too many runs?
How do we figure that?
I constructed a model to represent the problem. What do we know, as a basis of the model? We know that the standard deviation of runs scored in baseball (per team) is about 80 runs now, less in the 1980s. We know that there are nine positions that feed into that difference. We know that the defensive contributions of the nine positions vary widely, and we know that the defensive responsibilities of the pitcher are vastly larger than their offensive responsibilities, but that the defensive responsibilities of the other fielders are smaller than their offensive responsibilities. We know that the relative offensive/defensive responsibilities are different at different positions; shortstops have more defensive responsibilities relative to their offense than do first basemen.
I constructed the model in this way. . .this actually was easier than it probably sounds. From concept to completion of this little study was something less than an hour; it wasn’t complicated, and it was very convincing, to me. First, I created two lines of what I call "bell random" numbers. Ordinary random numbers, if you graph how often each value comes up, make a flat line; there are as many random numbers between .800 and 1.000 as there are between .400 and .600. If you take two random numbers, add them together and divide by two, you still have random numbers, but they pile up in the middle in a roughly bell-shaped curve (strictly speaking it is a triangle peaking at .500), and you have four and a half times as many random numbers between .400 and .600 as between .800 and 1.000. For reasons that I would hope would be obvious to most of you, "bell random" numbers suit our purpose here better than true random numbers.
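For anyone who wants to check the "four and a half times" figure, here is a minimal sketch. It assumes the underlying random numbers are uniform between 0 and 1; the function names, the seed and the sample size are illustrative choices, not anything from the original study.

```python
import random

def bell_random(rng):
    """Average of two uniform draws on [0, 1); values cluster around 0.5."""
    return (rng.random() + rng.random()) / 2

def share_between(values, lo, hi):
    """Fraction of values falling in the half-open interval [lo, hi)."""
    return sum(lo <= v < hi for v in values) / len(values)

rng = random.Random(1986)  # arbitrary seed so the run is repeatable
draws = [bell_random(rng) for _ in range(200_000)]

middle = share_between(draws, 0.4, 0.6)  # band around the peak
upper = share_between(draws, 0.8, 1.0)   # band out in the tail
ratio = middle / upper

print(f"middle: {middle:.3f}  upper: {upper:.3f}  ratio: {ratio:.2f}")
```

Under the uniform assumption the exact shares are 0.36 and 0.08, which is where the ratio of 4.5 comes from; a simulated run should land close to those values but will not match them exactly.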
Then, to represent each player, I formed an "offensive value" and a "defensive value". The "offensive value" for each player was a base number times the bell random number. The "defensive value" was the base number, times another bell random number, times a number representing the defensive responsibility of the position, which we could call the positional defensive fraction.
The "base number" was 100 in the first trial, and settled at 117 because at 117 we had the proportions we needed to represent the real world. The "positional defensive fraction" was .300 for right fielders (.150 for first basemen, .800 for catchers.) For each right fielder in the model, then, his "offensive value" was 117 times a bell random number, and his "defensive value" was 117 times a bell random number, times .300.
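The construction of a single player can be sketched in a few lines. This is only an illustration under stated assumptions: "bell random" is taken to be the average of two uniform draws, the names `make_player` and `DEFENSIVE_FRACTION` are hypothetical, and only the three positional fractions quoted above are included because the others are not given.

```python
import random

BASE_NUMBER = 117  # the value the model settles on

# Only the fractions stated in the text; other positions are not specified.
DEFENSIVE_FRACTION = {"RF": 0.300, "1B": 0.150, "C": 0.800}

def bell_random(rng):
    """Average of two uniform draws on [0, 1) (assumed definition)."""
    return (rng.random() + rng.random()) / 2

def make_player(position, rng, base=BASE_NUMBER):
    """Return (offensive value, defensive value) for one model player."""
    offense = base * bell_random(rng)
    # Defense uses a second, independent bell random number,
    # scaled down by the position's defensive fraction.
    defense = base * bell_random(rng) * DEFENSIVE_FRACTION[position]
    return offense, defense

rng = random.Random(0)  # arbitrary seed
offense, defense = make_player("RF", rng)
print(f"RF offense: {offense:.1f}  RF defense: {defense:.1f}")
```

With these numbers a right fielder's offensive value can range from 0 to 117 and his defensive value from 0 to 35.1, with most players landing near the middle of each range.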
I created a large number of "model teams"—more teams in the model than in the real history of major league baseball. I then figured the standard deviation of team runs scored—a number that we know to be just short of 80 in real life. In the 1980s, not counting 1981, it was 71.8.
If the base number was 100, then the standard deviation of runs scored per team would only be 61.6, which is too small. If the base number was 120, then the standard deviation of runs scored per team would be 73.9, which is a little bit too large. The base number that works best, to make the standard deviation of runs scored what it ought to be in the 1980s, is 117.
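The calibration step can be approximated as follows. The article does not spell out how team runs were totaled, so this sketch assumes a team's runs scored is simply the sum of nine independent offensive values, each equal to the base number times a bell random number (the average of two uniform draws). The number of model teams, the seed and the function names are illustrative.

```python
import random
import statistics

def bell_random(rng):
    """Average of two uniform draws on [0, 1) (assumed definition)."""
    return (rng.random() + rng.random()) / 2

def unit_team_offense(rng, positions=9):
    """Team offense for a base number of 1: sum of nine bell random numbers."""
    return sum(bell_random(rng) for _ in range(positions))

rng = random.Random(1980)  # arbitrary seed
unit_totals = [unit_team_offense(rng) for _ in range(20_000)]
unit_sd = statistics.pstdev(unit_totals)

TARGET_SD = 71.8  # 1980s standard deviation of team runs, excluding 1981

# The base number is a pure scale factor, so the team SD scales with it.
sd_by_base = {base: base * unit_sd for base in (100, 117, 120)}
for base, sd in sd_by_base.items():
    print(f"base {base}: team SD about {sd:.1f}")

best_base = TARGET_SD / unit_sd
print(f"base number matching {TARGET_SD}: about {best_base:.0f}")
```

Under these assumptions the theoretical team standard deviation is the base number times the square root of 9/24, or roughly 61.2 at a base of 100 and 71.6 at 117. That is close to, though not identical with, the 61.6 and 71.8 figures above, which suggests the assumption is in the right neighborhood without proving it matches the original spreadsheet.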
The key question, then, is "what is the standard deviation of fielding runs by right fielders, in this model?" If the answer to that question was "3.0", then I was right in saying that 25 runs is too large to be a believable separation in defensive value between two right fielders, because that would be eight-plus standard deviations. But if the answer to that question was "10.0", then I would be clearly wrong; given realistic assumptions, it would be entirely possible that the defensive difference between two right fielders would be 25 runs, even though the sum total of all differences between the two teams would rarely be larger than 300 runs.
Answer?
I was dead wrong.
The standard deviation of runs saved by right fielders, in this model, was 7.30. Evans, at +8 runs, would be a little more than one standard deviation above the norm, on a team level (assuming one full-time right fielder for each team.) Parker, at -17 runs, would be a little more than two standard deviations below the norm. It’s not an unbelievable defensive separation between two players, at all. If it was 40 runs, it wouldn’t be hard to believe. If it was 50 runs, maybe that’s a little hard to believe, but at 25, we’re not anywhere near the margins.
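The arithmetic behind those statements is short enough to lay out. The first two figures use only numbers from the text; the last line adds one assumption of my own, that the two right fielders are independent draws from the same distribution, so the spread of the gap between them is the single-player standard deviation times the square root of two.

```python
RF_SD = 7.30  # model standard deviation of runs saved by right fielders

evans_runs = 8     # Baseball Reference estimate for Evans, 1986
parker_runs = -17  # Baseball Reference estimate for Parker, 1986

z_evans = evans_runs / RF_SD    # standard deviations above average
z_parker = parker_runs / RF_SD  # standard deviations below average

gap = evans_runs - parker_runs  # the 25-run separation
# Assumption: independent players, so the gap's SD is RF_SD * sqrt(2).
gap_sd = RF_SD * 2 ** 0.5
z_gap = gap / gap_sd

print(f"Evans: {z_evans:+.2f} SD   Parker: {z_parker:+.2f} SD")
print(f"gap of {gap} runs = {z_gap:.2f} SD of the difference")
```

Evans comes out at about +1.1 standard deviations and Parker at about -2.3. Measured against the spread of differences between two players, the 25-run gap is roughly 2.4 standard deviations: unusual, but nowhere near impossible.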
Of course, it’s a crude model, and we can’t infer too much from it. But. . .I believe it. For the model to be wrong on this issue, it would have to be wrong by a substantial margin on one of its assumptions. Suppose we changed the "positional defensive fraction" for right fielders from .300 to .200. The standard deviation of runs saved by right fielders would still be 4.86, and at 4.86 it’s still believable that there would be a 25-run separation between two players. If the positional defensive fraction was .200, then defense would be only one-sixth of a right fielder’s total responsibility (.200 out of 1.200). It’s hard to believe that the defensive responsibility of a right fielder is smaller than that.
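Both numbers in that paragraph follow from simple proportions, sketched here. The sketch assumes, as the model's construction implies, that defensive value is directly proportional to the positional defensive fraction, and that a player's total responsibility is offense (weight 1) plus defense (weight equal to the fraction). The function names are hypothetical.

```python
REPORTED_SD = 7.30         # right-field SD reported at a fraction of .300
REPORTED_FRACTION = 0.300

def scaled_sd(fraction):
    """SD of runs saved if only the defensive fraction changes."""
    return REPORTED_SD * fraction / REPORTED_FRACTION

def defensive_share(fraction):
    """Defense as a share of total (offense + defense) responsibility."""
    return fraction / (1 + fraction)

for fraction in (0.300, 0.200):
    print(
        f"fraction {fraction:.3f}: SD about {scaled_sd(fraction):.2f}, "
        f"defense = {defensive_share(fraction):.3f} of total responsibility"
    )
```

Scaling 7.30 by two-thirds gives about 4.87, in line with the 4.86 from the model run, and a fraction of .200 works out to exactly one-sixth of total responsibility.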
Yes, it’s a crude model, but the simplicity of the model is one of its strongest points. Because the model is based on very few assumptions, there are a very limited number of points about which it could be seriously wrong. To avoid the conclusion that a 25-run separation between defensive right fielders is entirely possible, the model has to be seriously wrong on some point. I doubt that it is.
Look, I still think I was right in one sense. I was trying to make an argument, and I was trying to convince people that I was right about what I was saying—as I believe that I was, and I think probably most of you believe that I was. You can’t convince people of what you are saying if you make assumptions that you can’t support; therefore, to convince people, you need to use conservative assumptions. If I had assumed that Evans was 25 runs better than Parker (defensively) in 1986, some people might very reasonably have said, "Oh, that’s too many runs for that pocket; he’s got an elephant sitting in a pool pocket there, he doesn’t know what he’s talking about." To avoid that, I scaled the number of runs back to a very conservative estimate, eight runs. I wasn’t wrong to do that, and I wasn’t trying to mislead people; I was merely trying to use conservative estimates so as to give skeptics every opportunity to buy into my argument.
But just between you and me, just between friends—25 runs is a realistic estimate, too. That’s what I know today that I didn’t know yesterday.