Re-Establishing the Values
First thing to tell you today. . .I put "minimums" in the categories. The category bottoms are set at 4 standard deviations below the norm, but in several categories there are 1 to 3 teams which are more than 4 standard deviations below the norm. In our system those teams wind up with negative numbers. Frankly, in a system like this, negative numbers are a pain in the rear. Even very SMALL numbers are a pain; they're like a pain in the elbow. So I put in category minimums; every team is credited with at least 15 Runs Prevented by not committing errors, regardless of the data. It's just tiny little numbers, affecting just a few teams (two teams in this case); it doesn't really have anything to do with the accuracy or reliability of the system.
The distribution of talent among INDIVIDUALS is of course greater than the distribution of talent among TEAMS. I set the zero-competence lines based on TEAM numbers, but there is a small but not meaningless number of INDIVIDUALS who would fall below that line, in one area or another. A team might have a very high walk rate, for example. The system is constructed such that if they have even one starting pitcher who has good control, they cannot touch that zero-competence line. But what if they have one 200-inning pitcher who has, let's say, not TOO BAD control, or one pitcher who actually has good control, but who pitches just 35 innings?
That individual pitcher has to be given credit for HIS contribution to runs prevented via control, regardless of the fact that the team’s competence level, once in a while, under rare circumstances, is measured at zero. If you have a negative number there, or if you have a very small number there, then there is no fund to draw from when giving credit to individuals. What I have done here is to create little "pocket accounts" from which the marginally competent pitchers on completely incompetent teams may be paid. I hope that makes sense.
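Just to make the bookkeeping concrete, here is a minimal sketch of what the minimum and the "pocket account" amount to. The names, the interfaces and the exact mechanics here are my own guesses, not the actual worksheet:

```python
# A minimal sketch of the bookkeeping described above.  The function names,
# the interfaces, and the details are my own assumptions, not the actual
# worksheet logic.

CATEGORY_MINIMUM = 15.0  # every team credited with at least 15 Runs Prevented per category

def team_category_credit(raw_runs_prevented):
    """Floor the team's category value so negative (or merely tiny) numbers
    never enter the system."""
    return max(raw_runs_prevented, CATEGORY_MINIMUM)

def fund_for_individuals(team_credit, earned_by_pitchers):
    """Hypothetical 'pocket account': if the team's fund is too small to cover
    the credit earned by its individual pitchers (a team measured at zero
    competence can still have one pitcher with decent control), top the fund
    up so those pitchers can still be paid their runs prevented."""
    return max(team_credit, earned_by_pitchers)
```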
To depart on a quasi-philosophical rant, this is an area of fundamental misunderstanding among baseball fans, players, and, in my experience, scouts. A major league baseball player needs to have skills in probably 50 or 100 different skill areas, and 8 to 10 major skill groups. A first baseman, for example, needs to be able to hit, hit for power, etc., but he also needs to be able to field first base. But "fielding first base" involves dozens of small skills; the first baseman has to be able to chase down a pop foul, scoop out a low throw, apply a tag when off the base, charge a bunt, throw to another base, position himself so as to hold the runner on, catch the pickoff throw and slap the tag on the runner, etc. There is actually a long list of skills that goes into the major skill group, "playing first base."
People often imagine that major league players are better than amateur players in every skill, or at least in every skill GROUP, but this clearly is not true. If you create a complete spectrum of a player’s skills, it is clear that there are many overlaps between major league skills and lower-level skills. There are many, many high school players who can run faster than 30, 40% of the players in the majors. It is very common to see a high school first baseman make plays, chasing down foul pops, that many major league first basemen would not make.
In this study we are dealing with an array of 11 skills, or 11 category measurements that represent more-or-less distinct skills. In general, major league defensive players and pitchers are competent in all of those areas, but not without exceptions. Sometimes one skill group protects the player's major league status, even though other skills are sub-competent. There are actually quite significant negative consequences to major league organizations from the failure to understand this. One of the basic differences between major league managers is to what extent they focus on their players' limitations, and to what extent they focus on their stronger areas. Some managers try to improve the team by focusing on the strengths and trying to build them up. Other managers manage by trying to eliminate weaknesses. There is no clear answer as to what is right and what is wrong.
OK, moving on. I think that I have worked on the problem of re-adjusting the category values for the 11 areas of run prevention as long as I can stand to work on the problem. My energy for the problem is exhausted.
I started with a set of "values" to represent run prevention in each of these 11 areas, understanding that this was merely a starting point. A strikeout is worth .30 runs prevented; a walk not issued is worth .32 runs, etc. In refining these numbers, we use three methods or three resources; not sure how to describe it. We serve three masters.
The basic process of refining the numbers is experimenting with the data to see what best predicts the outcomes. We start with .30 runs prevented for a strikeout, but what if we change that to .31? Does that make the predictions match the output data BETTER, or worse? What if we try .29? This is the basic method that I used to refine the values, and more on this in a moment.
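A minimal sketch of that substitution loop, in code. The function predicted_runs() is a stand-in for whatever the real spreadsheet computes from a set of category values, so everything here is my framing rather than the actual process:

```python
# Sketch of the trial-and-error refinement.  predicted_runs(team, values) is a
# stand-in for the spreadsheet's prediction from a set of category values;
# team["actual"] is the runs the team actually prevented.

def average_error(teams, values, predicted_runs):
    """Mean absolute difference between predicted and actual runs prevented."""
    return sum(abs(predicted_runs(t, values) - t["actual"]) for t in teams) / len(teams)

def best_substitution(teams, values, category, candidates, predicted_runs):
    """Try each candidate value for one category (say 0.29, 0.30, 0.31 for the
    strikeout) and keep whichever one yields the smallest average error."""
    best_value, best_err = values[category], average_error(teams, values, predicted_runs)
    for candidate in candidates:
        trial = dict(values, **{category: candidate})
        err = average_error(teams, trial, predicted_runs)
        if err < best_err:
            best_value, best_err = candidate, err
    return best_value, best_err

# e.g. best_substitution(teams, values, "strikeout", [0.29, 0.30, 0.31], predicted_runs)
```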
That’s the workhorse method that I used, but there are problems with it. It’s like a hunting dog; it is good at hunting down discrepancies, but sometimes it goes down rabbit trails and starts chasing things that aren’t really helpful. (Never been hunting in my life; this is just what I understand.) A second approach is to sort the data, and see how teams that are strong in each area are doing in the overall comparisons. What is the average error for the 500 teams which have the most strikeouts, for example? What about the average error for the 500 teams which have the FEWEST strikeouts? If the teams which have lots of strikeouts are shown as saving more runs than they actually were able to save, then you know you have over-valued strikeouts, and you need to reduce the value of strikeouts in your trial-and-error experiments.
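In the same sketch's terms, the sorting check looks something like this (again, my own framing, not the actual worksheet):

```python
def directional_bias(teams, values, stat, predicted_runs, n=500):
    """Signed prediction error for the n teams with the most of a stat versus
    the n teams with the fewest.  If the high-strikeout teams come out
    predicted to save more runs than they actually saved, strikeouts are
    over-valued, and the value should be nudged down in the next round."""
    ranked = sorted(teams, key=lambda t: t[stat])
    def bias(group):
        return sum(predicted_runs(t, values) - t["actual"] for t in group) / len(group)
    return bias(ranked[-n:]), bias(ranked[:n])  # (high-stat bias, low-stat bias)
```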
But the third thing is, it has to comport to some degree with (a) other people’s measurements of the run value of different events, and (b) common sense. If your data tries to tell you that a Single is more valuable than a Home Run, you ignore the data, and force it to go in some other direction. The process of refining the measurements is a process of trying to serve these three masters at the same time.
So, category by category. (I'll gather the final values into one place in a little sketch after the list.)
1) Strikeouts. We started with a value of .30 runs prevented for each strikeout above the zero-competence level, and wound up at .34. Actually, .337 is probably more accurate than .340, and I have no reluctance to use three digits here, but I wound up using .340 for reasons I will try to explain later.
2) Walks. Each walk not issued has a Run Prevention value of about .475 runs. I started with an estimate of .32 runs, but the data forced me to move it significantly higher. A Hit Batsman is valued the same way, at .475 runs.
This actually was an easy one. In the "test trials" part of the study, I would have several different lines of test studies going at the same time. In some areas, one line of tests would be trying to force the value of one category up, while another line of tests would be forcing it down, and they might fight each other for a long time. But all of the lines of test trials converged on .475 runs as the value of a walk, and I have no doubt that that is the right number for this. I mean, maybe it is .474 or .477 or something, but it’s right there.
3) Home Run Avoidance. I started with a value of 1.4 runs for each Home Run not surrendered, and wound up at 1.355. In this case, I'm NOT super-confident that 1.355 is the right answer; it could be as low as 1.3. But I've just worked on the problem as long as I can stand to work on it, and I'm moving on.
4) Hit Batsman Avoidance. .475 runs, same as a walk.
5) Wild Pitch Avoidance. .375 runs. I started at .16 runs for a Wild Pitch and wound up with a number more than twice that; I'll explain why later.
6) Balk Avoidance. .375 runs; discussion later.
7) Fielding Range/DER. I had originally estimated that the Run Prevention value of a play made by a fielder (as opposed to the ball becoming a hit) was .81 runs. The data makes it clear, however, that this number is way too high. My current estimate of the value of a play within DER is .633 runs. . . well, .630 to .635, but I am going with .633.
8) Fielding Consistency (Errors). Here, again, my original value was way too high. I had originally estimated (article posted on April 1) that the run cost of the average error was about .60 runs. The data makes clear that this number is way too high. The number I am using now is .425—meaning that the average error is less damaging than the average walk—and there are things in the data which suggest that the number should be even lower than that.
9) Double Plays. I had originally estimated the value of a Double Play turned at .62 runs. This turns out to be WAY too low, way too low. The current estimate is that a Double Play has a Run Prevention value of 1.100 runs.
10) Stolen Base Control. Stolen Bases and Caught Stealing were combined into one number earlier in the process, Stolen Base Value (which may have been a poor decision, I don’t know.) Anyway, stolen base value was stated on a 1-run basis; 1.00 value in Stolen Base Control was supposed to be one run—and it turned out that way. We started at 1.00; we wind up at 1.00.
11) Passed Balls. Same as Wild Pitches and Balks, .375 runs.
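For reference, here are the ending values gathered into one place. The numbers are the ones reported above; only the labels are mine:

```python
# Final run-prevention values per marginal event, as reported above.
# The dictionary keys are my own labels.
RUNS_PREVENTED = {
    "strikeout":            0.340,  # .337 is probably more accurate; .340 used for reasons explained later
    "walk_avoided":         0.475,
    "home_run_avoided":     1.355,  # could be as low as 1.3
    "hit_batsman_avoided":  0.475,  # same as a walk
    "wild_pitch_avoided":   0.375,
    "balk_avoided":         0.375,
    "play_made_DER":        0.633,  # somewhere in the range .630 to .635
    "error_avoided":        0.425,  # possibly should be lower still
    "double_play":          1.100,
    "stolen_base_control":  1.000,  # stated on a one-run basis
    "passed_ball_avoided":  0.375,  # same as wild pitches and balks
}
```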
Not sure whether this is bragging or complaining, but in the process of refining these numbers I made thousands and thousands of "test substitutions", working toward the values that best explain the data. I had 142 rounds of tests and usually 40-50 test substitutions in each round, so 6,000 would be a good guess. I actually love doing that kind of work; I think the average person would be bored shitless by it, but I enjoy it. Still, after maybe three days of doing very little else, it wears on you.
In that process of experimental substitutions, my biggest problem, by far, was that the data was trying to push the values for Wild Pitches, Balks and Passed Balls up to totally unrealistic values. I don't know why this happened. The value that I settled on for these events was .375 runs, but the data WANTS a higher value—a much higher value. I don't know how much higher, because I wouldn't experiment with numbers higher than .570, but somewhere up there. It doesn't make any sense; at .570, a Wild Pitch would be more costly than a walk or a single.
I am essentially an empiricist. I believe whatever the facts tell me to believe. (Some academic once wrote a paper, delivered at a conference, explaining that I actually am NOT an empiricist. I am flattered by the attention, and I am certain that he knows more about what an empiricist is than I do.) Anyway, I think of myself as an empiricist, meaning that I believe what the data tells me to believe. That’s who I am. In the 1970s all of the leading experts tried to tell me how important stolen bases were to an offense. I just kept saying, "Well, that’s not what the data says." That’s who I am, whether it makes me an empiricist or not.
But there has to be some way of looking at the data that makes sense, or you can't follow it there. I can't see how Wild Pitches could be more costly than Walks. The inability to settle on the "correct" value for Wild Pitches, Balks and Passed Balls is responsible for the discrepancies in my estimates in other categories; it is the reason I was not able to ascertain with confidence what the right value was for some types of events.
And it is not a "little" thing; it’s a surprisingly large problem. It’s a dislocation in the data of 2,000 runs or more; I’ll explain what that means in a moment. At the end point of the analytical process, the one-base events are still screaming "MORE! MORE! MORE! WE NEED MORE VALUE!". There is a lot of energy in the "corrections" that we can’t reasonably make. We could gain about 2,000 runs of accuracy, over time, if I just could let the one-base events go to where they want to go. Maybe more than 2,000, I don’t know.
I got the average error per team down to 43 runs, or an average error of 6.2%. What is meant by the 2,000 runs is this: the average error, with one-base events locked at .375 runs, is 43.2 runs per team. If you let the value of one-base mistakes go up to .570 runs, though, then you can whittle the error down to 42.4 runs per team, or 6.1%. There are 2,550 teams in the study, so a gain of .80 runs per team represents a gain of about 2,000 runs—and there is little reason to believe that that's the whole enchilada. I didn't experiment with numbers higher than .570, but at .570, there still appears to be a lot of energy in the push of those numbers for higher values. They might be trying to head to .65, .70. . . I just don't know.
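The arithmetic behind that figure, spelled out, using only the numbers just given:

```python
# The arithmetic behind the ~2,000-run figure, using the numbers given above.
teams_in_study  = 2550
error_locked    = 43.2   # avg error per team with one-base events held at .375 runs
error_loosened  = 42.4   # avg error per team with one-base events allowed up to .570 runs

total_runs_gained = (error_locked - error_loosened) * teams_in_study
print(round(total_runs_gained))   # 2040 -- roughly the 2,000 runs of accuracy in question
```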
But as those things gain more value, something else has to be reduced in value to balance the scales. So allowing THOSE events to be valued higher than .375 not only gives them more weight than seems reasonable; it also forces other numbers DOWNWARD to what seem like unrealistically low numbers. That’s why I don’t even look at it.
I am disappointed that I was not able to get closer than 6.2%, but not really surprised; it is an experimental method with hundreds of internal assumptions. The fact that we got the error down to 6.2% pretty much proves that some of those assumptions are right; the fact that we weren't able to get closer than 6.2% suggests that some of them were wrong. The average error, though, is much less than 6% in modern baseball. For the 21st century so far, the average error is 32.35 runs, and the percentage error is substantially less than 5%. Here is the average error by decade:
From | To   | Average Error (runs) | Percentage
1900 | 1909 | 61.0 | 8.2%
1910 | 1919 | 49.8 | 6.7%
1920 | 1929 | 35.7 | 5.6%
1930 | 1939 | 73.2 | 10.2%
1940 | 1949 | 32.1 | 4.7%
1950 | 1959 | 32.8 | 4.9%
1960 | 1969 | 53.0 | 7.6%
1970 | 1979 | 60.7 | 9.0%
1980 | 1989 | 28.5 | 4.1%
1990 | 1999 | 44.1 | 6.2%
2000 | 2009 | 32.4 | 4.9%
2010 | 2019 | 32.3 | 4.4%
Wow. In compiling that chart, I see something that I didn’t see before. Why does the average error spike in the 1930s, and again in the 1970s?
The obvious explanation is the run differential between the leagues, which was very high in the 1930s and which went up again in the 1970s, when the American League introduced the DH rule (1973). (After the DH rule was established, the league adjusted to it and reverted toward historic norms, so the statistical differences between the leagues diminished over time.) Anyway, that data pattern suggests the possibility that I might be able to substantially reduce the average error of my process by basing the analysis on, let's say, five-year LEAGUE averages, rather than ten-year MAJOR LEAGUE averages. I may have to re-do a bunch of work to see whether that is true.
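If I do re-do that work, the change would be in how the norms are computed. A rough sketch of the difference, with my own function names and data layout assumed:

```python
# Sketch of the proposed change in baselines (my own framing and data layout).
# Instead of computing each category norm from ten-year, both-leagues-pooled
# averages, compute it from five-year, single-league averages, so that the
# DH-era American League is measured against its own run environment.

from statistics import mean

def norm(team_seasons, stat, years, league=None):
    """Average of a category over the given span of years, optionally
    restricted to one league."""
    pool = [t[stat] for t in team_seasons
            if t["year"] in years and (league is None or t["league"] == league)]
    return mean(pool)

# Current approach (roughly): decade-wide, both leagues pooled
#   norm(team_seasons, "walks", range(1970, 1980))
# Proposed: five-year window, league by league
#   norm(team_seasons, "walks", range(1973, 1978), league="AL")
```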
This is where we are; we’ll move forward from here. Thank you all for reading. Keep your distance.