Re-Establishing the Values

April 10, 2020

 

            First thing to tell you today. . .I put "minimums" in the categories.  The category bottoms are set at 4 standard deviations below the norm, but in several categories there are 1 to 3 teams which are more than 4 standard deviations below the norm.  In our system those teams wind up with negative numbers.   Frankly, in a system like this, Negative Numbers are a pain in the rear. Even very SMALL numbers are a pain; they’re like a pain in the elbow.  So I put in category minimums; every team is credited with at least 15 Runs Prevented by not committing errors, regardless of the data.   It’s just tiny little numbers, affecting just a few teams (two teams in this case); it doesn’t really have anything to do with the accuracy or reliability of the system.

            The distribution of talent among INDIVIDUALS is of course greater than the distribution of talent among TEAMS.   I set the zero-competence lines based on TEAM numbers, but there is a small but not meaningless number of INDIVIDUALS who would fall below that line, in that area.   A team might have a very high walk rate, for example.  The system is constructed such that if they have even one starting pitcher who has good control, they cannot touch that zero-competence line.   But what if they have one 200-inning pitcher who has, let’s say, not TOO BAD control, or one pitcher who actually has good control, but just pitches 35 innings?

            That individual pitcher has to be given credit for HIS contribution to runs prevented via control, regardless of the fact that the team’s competence level, once in a while, under rare circumstances, is measured at zero.   If you have a negative number there, or if you have a very small number there, then there is no fund to draw from when giving credit to individuals.   What I have done here is to create little "pocket accounts" from which the marginally competent pitchers on completely incompetent teams may be paid.  I hope that makes sense.
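The category minimum described above amounts to a simple floor. Here is a minimal sketch in Python, assuming the raw category value is already expressed in runs prevented; the team names and raw numbers are made up for illustration, and only the 15-run floor comes from the article.

```python
# Floor for the error-avoidance category, in runs (the figure given above).
CATEGORY_MINIMUM = 15.0


def credited_runs(raw_runs_prevented, minimum=CATEGORY_MINIMUM):
    """Return the category credit, never less than the floor."""
    return max(raw_runs_prevented, minimum)


# Hypothetical raw values for three teams in one category.
raw = {"Team A": 112.4, "Team B": 6.1, "Team C": -9.8}

# Teams B and C fall below the floor, so each gets the 15-run pocket account.
credited = {team: credited_runs(value) for team, value in raw.items()}
```

The floor leaves every team above the minimum untouched, which is why it has almost no effect on the overall accuracy of the system.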

            To depart on a quasi-philosophical rant, this is an area of fundamental misunderstanding among baseball fans, players, and, in my experience, scouts.   A major league baseball player needs to have skills in probably 50 or 100 different skill areas, and 8 to 10 major skill groups.  A first baseman, for example, needs to be able to hit, hit for power, etc., but he also needs to be able to field first base.   But "fielding first base" involves dozens of small skills; the first baseman has to be able to chase down a pop foul, scoop out a low throw, apply a tag when off the base, charge a bunt, throw to another base, position himself so as to hold the runner on, catch the pickoff throw and slap the tag on the runner, etc.  There is actually a long list of skills that goes into the major skill group, "playing first base." 

            People often imagine that major league players are better than amateur players in every skill, or at least in every skill GROUP, but this clearly is not true.  If you create a complete spectrum of a player’s skills, it is clear that there are many overlaps between major league skills and lower-level skills.  There are many, many high school players who can run faster than 30, 40% of the players in the majors.  It is very common to see a high school first baseman make plays, chasing down foul pops, that many major league first basemen would not make. 

            In this study we are dealing with an array of 11 skills, or 11 category measurements that represent more-or-less distinct skills.  In general, major league defensive players and pitchers are competent in all of those areas, but not without exceptions.  Sometimes one skill group protects the player’s major league status, even though other skills are sub-competent.   There are actually quite significant negative consequences to major league organizations of the failure to understand this.   One of the basic differences between major league managers is to what extent they focus on their players’ limitations, and to what extent they focus on their stronger areas.  Some managers try to improve the team by focusing on the strengths and trying to build them up.   Other managers manage by trying to eliminate weaknesses.  There is no clear answer as to what is right and what is wrong.  

 

            OK, moving on.  I think that I have worked on the problem of re-adjusting the category values for the 11 areas of run prevention as long as I can stand to work on the problem.   My energy for the problem is exhausted. 

            I started with a set of "values" to represent run prevention in each of these 11 areas, understanding that this was merely a starting point.   A strikeout is worth .30 runs prevented; a walk not issued is worth .32 runs, etc.   In refining these numbers, we use three methods or three resources; not sure how to describe it.  We serve three masters.   

            The basic process of refining the numbers is experimenting with the data to see what best predicts the outcomes.   We start with .30 runs prevented for a strikeout, but what if we change that to .31?  Does that make the predictions match the output data BETTER, or worse?   What if we try .29?   This is the basic method that I used to refine the values, and more on this in a moment.
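One round of that kind of test substitution can be sketched roughly as follows. This is a toy illustration in Python, not the actual worksheet: the data layout (`counts`, `actual`), the two-team sample, and the function names are all assumptions, and "error" here means the average absolute gap between predicted and actual runs prevented.

```python
def average_error(values, teams):
    """Mean absolute gap between predicted and actual runs prevented."""
    total = 0.0
    for team in teams:
        predicted = sum(values[cat] * team["counts"][cat] for cat in values)
        total += abs(predicted - team["actual"])
    return total / len(teams)


def try_substitutions(values, teams, category, step=0.01):
    """Test value - step and value + step for one category; keep the best."""
    best = dict(values)
    best_error = average_error(best, teams)
    for delta in (-step, step):
        trial = dict(values)
        trial[category] = values[category] + delta
        error = average_error(trial, teams)
        if error < best_error:
            best, best_error = trial, error
    return best, best_error


# Hypothetical teams; counts are events above the zero-competence line.
teams = [
    {"counts": {"strikeouts": 400, "walks": 150}, "actual": 184.0},
    {"counts": {"strikeouts": 300, "walks": 200}, "actual": 166.0},
]
start = {"strikeouts": 0.30, "walks": 0.32}

start_error = average_error(start, teams)
new_values, new_error = try_substitutions(start, teams, "strikeouts")
```

In this made-up sample, moving strikeouts from .30 to .31 shrinks the average error, so the higher value is kept and the next round starts from there.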

            That’s the workhorse method that I used, but there are problems with it.  It’s like a hunting dog; it is good at hunting down discrepancies, but sometimes it goes down rabbit trails and starts chasing things that aren’t really helpful.  (Never been hunting in my life; this is just what I understand.)  A second approach is to sort the data, and see how teams that are strong in each area are doing in the overall comparisons.   What is the average error for the 500 teams which have the most strikeouts, for example? What about the average error for the 500 teams which have the FEWEST strikeouts?   If the teams which have lots of strikeouts are shown as saving more runs than they actually were able to save, then you know you have over-valued strikeouts, and you need to reduce the value of strikeouts in your trial-and-error experiments. 
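The sorting check can be sketched like this. It is a minimal Python illustration under assumed field names (`predicted`, `actual`, and a per-category count such as `strikeouts`); the four sample teams are invented, and the group size is shrunk from 500 so the toy data works.

```python
def mean_signed_error(teams):
    """Average of predicted minus actual; positive means over-credited."""
    return sum(t["predicted"] - t["actual"] for t in teams) / len(teams)


def extreme_group_bias(teams, category, group_size=500):
    """Signed error for the teams with the most and the fewest of an event."""
    ranked = sorted(teams, key=lambda t: t[category], reverse=True)
    size = min(group_size, len(ranked) // 2)
    if size == 0:
        raise ValueError("need at least two teams to compare groups")
    return mean_signed_error(ranked[:size]), mean_signed_error(ranked[-size:])


# Hypothetical sample: runs the system says were saved vs. runs actually saved.
teams = [
    {"strikeouts": 1200, "predicted": 700, "actual": 680},
    {"strikeouts": 1100, "predicted": 690, "actual": 675},
    {"strikeouts": 600, "predicted": 640, "actual": 650},
    {"strikeouts": 500, "predicted": 630, "actual": 645},
]

top_bias, bottom_bias = extreme_group_bias(teams, "strikeouts", group_size=2)
```

A positive number for the high-strikeout group together with a negative number for the low-strikeout group is the pattern that says strikeouts are over-valued.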

            But the third thing is, it has to comport to some degree with (a) other people’s measurements of the run value of different events, and (b) common sense.   If your data tries to tell you that a Single is more valuable than a Home Run, you ignore the data, and force it to go in some other direction.   The process of refining the measurements is a process of trying to serve these three masters at the same time. 

            So, category by category. 

1)      Strikeouts.   We started with a value of .30 runs prevented for each strikeout above the zero-competence level, and wound up at .34.   Actually, .337 is probably more accurate than .340, and I have no reluctance to use three digits here, but I wound up using .340 for reasons I will try to explain later.

2)     Walks.  Each walk not issued has a Run Prevention value of about .475 runs.   I started with an estimate of .32 runs, but the data forced me to move it significantly higher.  A Hit Batsman is also valued at .475 runs. 

This actually was an easy one.  In the "test trials" part of the study, I would have several different lines of test studies going at the same time.  In some areas, one line of tests would be trying to force the value of one category up, while another line of tests would be forcing it down, and they might fight each other for a long time.   But all of the lines of test trials converged on .475 runs as the value of a walk, and I have no doubt that that is the right number for this.   I mean, maybe it is .474 or .477 or something, but it’s right there.

3)      Home Runs Avoidance.   I started with a value of 1.4 runs for each Home Run not surrendered, and wound up at 1.355.   In this case, I’m NOT super-confident that 1.355 is the right answer; it could be as low as 1.3.   But I’ve just worked on the problem as long as I can stand to work on it, and I’m moving on. 

4)     Hit Batsman Avoidance.   .475 runs, same as a walk.

5)     Wild Pitch Avoidance.  .375 runs; explanation later.  I started at .16 runs for a Wild Pitch, wound up with a number more than twice that; I’ll explain later. 

6)     Balk Avoidance.   .375 runs; discussion later. 

7)     Fielding Range/DER.   I had originally estimated that the Run Prevention value of a play made by a fielder (as opposed to the ball becoming a hit) was .81 runs.   The data makes it clear, however, that this number is way too high.   My current estimate of the value of a play within DER is .633 runs. . . well, .630 to .635, but I am going with .633.

8)     Fielding Consistency (Errors).   Here, again, my original value was way too high.   I had originally estimated (article posted on April 1) that the run cost of the average error was about .60 runs.  The data makes clear that this number is way too high.   The number I am using now is .425—meaning that the average error is less damaging than the average walk—and there are things in the data which suggest that the number should be even lower than that.

9)     Double Plays.  I had originally estimated the value of a Double Play turned at .62 runs.   This turns out to be WAY too low, way too low.   The current estimate is that a Double Play has a Run Prevention value of 1.100 runs. 

10)            Stolen Base Control.   Stolen Bases and Caught Stealing were combined into one number earlier in the process, Stolen Base Value (which may have been a poor decision, I don’t know.)  Anyway, stolen base value was stated on a 1-run basis; 1.00 value in Stolen Base Control was supposed to be one run—and it turned out that way.   We started at 1.00; we wind up at 1.00.

11)            Passed Balls.   Same as Wild Pitches and Balks, .375 runs.
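For reference, the eleven values above can be collected in one place and applied to a team. This is only a sketch in Python: the key names and the sample margins are hypothetical, and it assumes each margin is the team's count above the zero-competence line in that category.

```python
# Final per-event run values from the list above.
RUN_VALUES = {
    "strikeouts": 0.340,
    "walks_avoided": 0.475,
    "home_runs_avoided": 1.355,
    "hit_batsmen_avoided": 0.475,
    "wild_pitches_avoided": 0.375,
    "balks_avoided": 0.375,
    "plays_made_der": 0.633,
    "errors_avoided": 0.425,
    "double_plays": 1.100,
    "stolen_base_control": 1.000,  # already stated on a 1-run basis
    "passed_balls_avoided": 0.375,
}


def runs_prevented(margins):
    """Sum value * margin over whichever categories are supplied."""
    return sum(RUN_VALUES[cat] * margin for cat, margin in margins.items())


# Hypothetical margins for one team in three of the categories.
example = {"strikeouts": 300, "walks_avoided": 120, "double_plays": 40}
total = runs_prevented(example)  # 0.340*300 + 0.475*120 + 1.100*40
```

Laid out this way, a few of the comparisons in the list are easy to see: an error avoided (.425) is worth less than a walk avoided (.475), and a double play is worth more than a run.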

 

Not sure whether this is bragging or complaining, but in the process of refining these numbers I made thousands and thousands of "test substitutions", working toward the values that best explain the data.  I had 142 rounds of tests and usually 40-50 test substitutions in each round, so 6,000 would be a good guess.  I actually love doing that kind of work; I think the average person would be bored shitless by it, but I enjoy it.  But maybe three days of doing very little else, it wears on you.  

In that process of experimental substitutions, my biggest problem, by far, was that the data was trying to push the values for Wild Pitches, Balks and Passed Balls up to totally unrealistic values.   I don’t know why this happened.   The value that I settled on for these events was .375 runs, but the data WANTS a higher value, a much higher value.   I don’t know how much higher, because I wouldn’t experiment with numbers higher than .570, but somewhere up there.  It doesn’t make any sense; at .570, a Wild Pitch would be more costly than a walk or a single.  

I am essentially an empiricist.  I believe whatever the facts tell me to believe.  (Some academic once wrote a paper, delivered at a conference, explaining that I actually am NOT an empiricist.   I am flattered by the attention, and I am certain that he knows more about what an empiricist is than I do.)  Anyway, I think of myself as an empiricist, meaning that I believe what the data tells me to believe.  That’s who I am.  In the 1970s all of the leading experts tried to tell me how important stolen bases were to an offense.  I just kept saying, "Well, that’s not what the data says."  That’s who I am, whether it makes me an empiricist or not. 

But there has to be some way of looking at the data that makes sense, or you can’t follow it there.  I can’t see how Wild Pitches could be more costly than Walks.   The inability to settle on the "correct" value for Wild Pitches, Balks and Passed Balls is responsible for the discrepancies in my estimates in other categories, the fact that I was not able to ascertain with confidence what the right value was for some types of events. 

And it is not a "little" thing; it’s a surprisingly large problem.   It’s a dislocation in the data of 2,000 runs or more; I’ll explain what that means in a moment.   At the end point of the analytical process, the one-base events are still screaming "MORE!  MORE!  MORE!  WE NEED MORE VALUE!".  There is a lot of energy in the "corrections" that we can’t reasonably make.  We could gain about 2,000 runs of accuracy, over time, if I just could let the one-base events go to where they want to go.     Maybe more than 2,000, I don’t know. 

 

I got the average error per team down to 43 Runs, or an average error of 6.2%.  What is meant by the 2,000 runs is this.  The average error, with one-base events locked at .375 runs, is 43.2 runs per team.  If you let the value of one-base mistakes go up to .570 runs, though, then you can whittle the error down to 42.4 runs per team, or 6.1%.  There are 2,550 teams in the study.  To gain .80 runs per team requires a gain of 2,000 runs—and there is little reason to believe that that’s the whole enchilada.  I didn’t experiment with numbers higher than .57, but at .570, there still appears to be a lot of energy in the push of those numbers for higher values.   They might be trying to head to .65, .70. . . I just don’t know.
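The arithmetic behind the 2,000-run figure, written out as a short Python check using only the numbers quoted above (the variable names are just labels):

```python
teams_in_study = 2550
error_at_375 = 43.2  # average error per team with one-base events at .375
error_at_570 = 42.4  # average error per team with one-base events at .570

gain_per_team = error_at_375 - error_at_570  # about 0.8 runs
total_gain = gain_per_team * teams_in_study  # about 2,040 runs
```

So "2,000 runs" is a round-number version of roughly 2,040.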

But as those things gain more value, something else has to be reduced in value to balance the scales.   So allowing THOSE events to be valued higher than .375 not only gives them more weight than seems reasonable; it also forces other numbers DOWNWARD to what seem like unrealistically low numbers.   That’s why I don’t even look at it.   

I am disappointed that I was not able to get closer than 6.2%, but not really surprised; it is an experimental method with hundreds of internal assumptions.  The fact that we got the error down to 6.2% pretty much proves that some of those assumptions are right; the fact that we weren’t able to get closer than 6.2% suggests that some of them were wrong.   The average error, though, is much less than 6% in modern baseball.   For the 21st century as a whole, the average error is 32.35 runs, and the percentage error is substantially less than 5%. Here is the average error by decade:

 

From    To      Average Error   Percentage
1900    1909    61.0            8.2%
1910    1919    49.8            6.7%
1920    1929    35.7            5.6%
1930    1939    73.2            10.2%
1940    1949    32.1            4.7%
1950    1959    32.8            4.9%
1960    1969    53.0            7.6%
1970    1979    60.7            9.0%
1980    1989    28.5            4.1%
1990    1999    44.1            6.2%
2000    2009    32.4            4.9%
2010    2019    32.3            4.4%

 

Wow.  In compiling that chart, I see something that I didn’t see before.   Why does the average error spike in the 1930s, and again in the 1970s?

The obvious explanation is the run differential between the leagues, which was very high in the 1930s, and which went up again in the 1970s, when the American League introduced the DH rule (1973).   (After the DH rule was established, the league adjusted to it and reverted to historic norms, so that the statistical differences between leagues got to be less over time.)   Anyway, that data pattern suggests the possibility that I might be able to substantially reduce the average error of my process by basing the analysis on, let’s say, five-year LEAGUE averages, rather than ten-year MAJOR LEAGUE averages.   I may have to re-do a bunch of work to see whether that is true. 
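To make the two baselines concrete, here is a rough Python sketch of how they differ. Everything in it is an assumption for illustration: the `runs_allowed` field, the four sample team-seasons, and the choice to line windows up on multiples of five or ten years. The actual re-work might group things differently.

```python
from collections import defaultdict


def window_start(year, width):
    """First year of the fixed window containing `year` (e.g. 1977 -> 1975)."""
    return year - (year % width)


def baselines(team_seasons, width, by_league):
    """Average runs allowed per team within each (window, league) group."""
    totals = defaultdict(lambda: [0.0, 0])
    for season in team_seasons:
        league = season["league"] if by_league else "MLB"
        key = (window_start(season["year"], width), league)
        totals[key][0] += season["runs_allowed"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical team-seasons from the DH era.
team_seasons = [
    {"year": 1973, "league": "AL", "runs_allowed": 700},
    {"year": 1973, "league": "NL", "runs_allowed": 650},
    {"year": 1977, "league": "AL", "runs_allowed": 720},
    {"year": 1977, "league": "NL", "runs_allowed": 660},
]

ten_year_mlb = baselines(team_seasons, width=10, by_league=False)
five_year_league = baselines(team_seasons, width=5, by_league=True)
```

With the ten-year major league baseline all four teams are measured against the same norm; with five-year league baselines the AL and NL teams are each measured against their own, which is where a league run differential would stop showing up as error.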

 This is where we are; we’ll move forward from here.   Thank you all for reading.  Keep your distance.  

 
 

COMMENTS (18 Comments, most recent shown first)

michaelplank
Just sort of groping in the dark here...is it maybe a problem of unlike components? Wild pitches, passed balls and balks, unlike all the other events, don't terminate a plate appearance. In a sense, they're purely a bonus for the offense, since they don't even have any opportunity cost against the plate appearance. Any given PA is more likely than not to end in an out, so even if Batter 1 avoids making an out, he's bringing Batter 2 to the plate, and he's more likely than not to make an out... it's ever diminishing, and all innings must eventually end. But if you get a free base, without using up a PA (or even a strike), maybe that has some value that escapes full detection when you try to measure it on a scale of plate appearances.
11:32 AM Apr 14th
 
KaiserD2
Willie did lead off in several all-star games. I believe he led off in 1968, reached first (probably on a walk), the NL loaded the bases, he scored on a ground out--and that was the only run of probably the dullest All-Star game in history.

DK
5:17 PM Apr 12th
 
MarisFan61
Steve: That's unrelated to those simulations by Bill, because whoever he plugged into the leadoff spot, it was "all else equal."
(presumably; seems sure, but I better add this qualifier, because I don't know)
3:53 PM Apr 12th
 
kaline09
Is a WP or PB the same as a SB with a 100 percent success rate? Sometimes a WP or PB results in the equivalent of two SB if more than one runner is on. Is it so unreasonable for that to have value of .57? I am thinking unsophisticatedly about how the RC formula multiplies SB by .52
8:24 AM Apr 12th
 
steve161
If Mays is leading off, then somebody is batting third or fourth who is less effective in that role than Willie. (Unless, of course, I can arbitrarily hire Ruth or Williams for the job.)

Didn't he lead off an All-Star Game or two where the NL also had the likes of Aaron and Robinson?
5:37 AM Apr 12th
 
MarisFan61
.....Recognizing that this is only a side issue of a thing that anyway was just a tangent.....

I did go and look "comprehensively" at the difference between Henderson's and Mays's leagues in on-base average.
(Despite it "not being my style.") :-)

Indeed it's just about 13 points a year, in terms of mean. (I estimated 12-15.)
(I looked year-by-year at each guy's league for each year. I ignored the few very-partial seasons, as well as the year that Mays didn't play at all because of military service.
Conveniently they each had exactly 22 pretty full years, so it's easy to compare the numbers side-by-side, which is how I did it.)

The "mean" is perhaps a somewhat misleading way to see it but I don't think that this changes the picture. Only a few of the side-by-side season differences were close to 13; most were much less or much more, ranging from "minus-7" (i.e. Henderson's league had a lower on-base avg than the comparable Mays league) to +34, of which there were two such season comps, in addition to a +32, a couple of +26's, etc.
18 of the 22 season comps showed Henderson's league with a higher on-base; 2 showed Mays's league higher, and 2 were even.

I'll save the data for a while in case anyone wants to see it or ask anything about it.
1:46 AM Apr 12th
 
MarisFan61
Kaiser: Take a closer look.
If you mean he WAS on base more often, meaning that it doesn't count if you hit a HR, yes.
But otherwise, he barely reached base more than Mays. The difference in their on-base averages is 17 points, and when you take into account the difference in their offensive environments -- which, as I noted, had a difference in on-base average that I estimate at roughly 12-15 points, their gap is almost nothing.
9:50 PM Apr 11th
 
KaiserD2
Don't know how I misread that about Willie, but yes, I did.
Willie was definitely a superior offensive player and he had more great seasons. The job of a lead-off hitter, however, the best thing he can do for his team, is to get on base. Rickey did that more often than Willie, and therefore, I am not in the least surprised that a team with him leading off would score more runs.

DK
8:07 PM Apr 11th
 
MarisFan61
Glenn: Brilliant minds..... :-)

P.S. to my post:
Part of Willie's advantage on yearly offensive Win Shares is due to his having more 'very very full' seasons than Rickey.
But that's hardly all of it.
Like, in the offensive Win Share data on Baseball Gauge (I'd take the data from this site but I don't think the thing is there), Willie shows far superior on "per 162 games": 29.1 vs. 25.6.
7:36 PM Apr 11th
 
Glenn
Willie Mays' lifetime on-base percentage was .384, not .344. I only checked because .344 sounded real low to me.
7:30 PM Apr 11th
 
MarisFan61
Kaiser: You made a mistake on Mays's on-base average.

Not that I knew offhand what it was, but indeed I was in fair disbelief that it could possibly have been as low as .344 -- and it wasn't.
It was .384.
(I'm surprised your intuition didn't tell you something was wrong there!!)

Anyway, Henderson still does have an advantage over Willie in on-base (BTW you gave a slightly wrong figure for him too, although no big deal -- it was a point lower), 17 points worth, but effectively it was less than that -- little more than nil -- because of the context difference. I didn't look at it comprehensively (not my style) :-) but it appears that on-base averages in general were kinda-sorta 12-15 points lower during Mays's time.

Besides, all you need to do (IMO) in order to see that Mays was a superior offensive player is to look at their yearly offensive Win Shares.

Per the data on this site, here are their best yearly totals (rounded to integer):
Willie: 35,35,34,33,33,33,31,31,30,30
Rickey: 35,32,27,27,27,26,26,26,23,23

It remains unclear (and interesting) as to why the simulations would have shown Rickey's teams scoring more runs than Willie's (I think Bill said it was far more) with both leading off.
7:30 PM Apr 11th
 
steve161
Is it conceivable that the one-base events are trying to tell us that having one (more) runner in scoring position is more valuable than the base-out matrix thinks it is?
1:55 PM Apr 11th
 
KaiserD2
Rickey Henderson had a lifetime OBP of .401, Willie Mays's was .344. I think that's a more than sufficient explanation of why a team would do better with Rickey batting leadoff.
I also remember Bill arguing convincingly that Rickey cost the As runs the year that he broke the stolen base record for a season.

DK
8:57 AM Apr 11th
 
MarisFan61
......The thing I talked about down there felt like a thing in one of Bill's old Annuals -- a piece about simulations with hundreds of possible hitters batting lead-off. If I understood right and remember right, the simulations showed that a team scored more runs with Rickey Henderson leading off than with some seemingly superior people leading off, don't remember who, except that I think Willie Mays was one of them. In that piece, Bill noted it with surprise, but (as I remember) didn't try any explanation. I wondered if it was because of some dynamic by which Rickey improved the following hitters and/or worsened the pitching and/or defense. Of course it's not hard to surmise what factors may have been involved in those things, recognizing that those are clichés that seem largely to be rejected in sabermetrics or at least severely doubted.
But, besides the doubt about the existence of such phenomena, I realized that I didn't know whether such a possible thing would have been reflected in the simulations. (It would depend on what was programmed in.)
2:41 AM Apr 11th
 
MarisFan61
A couple of questions, and a speculation (which involves a question too):

-- Does the fact of strikeouts and walks turning out to have more value than you initially posited mean you now think they probably really have more value than you had thought and/or more than your previous work seemed to show? (I realize that those two latter things may be the same.)
I can imagine that this isn't so, because maybe those initial estimates that you used were somewhat arbitrary -- but I'd tend to think they couldn't have been much arbitrary. So, this all seems to be saying that you've now found strikeouts and walks to be more valuable (and more important) than you had previously thought (and for walks, far more) -- and if so, those in themselves seem to be very major findings.

-- Similar question about double plays: Does this mean you now think they're far more valuable (and important) than you thought before?
If so, it might mean this shows them as being more important than almost anyone thought, which would have numerous implications, including (I think) that the value of SHIFTS is more questionable than has been thought. (I'm assuming they make it less likely to get double plays.)

About the wild pitches/passed balls/balks: This is related to what Fireball Wenz asked, but sort of the inverse.
He wondered if maybe their values were inflated because of what prior things were needed in order for them to happen.
I'm wondering if it's possible that their values are affected (I'm purposely not saying "inflated" because if this is true, IMO it wouldn't be an inflation but an actual) .....wondering if it's possible that their values are affected by a dynamic of what kinds of things may tend to follow them. This gets into a kind of thing that sabermetrics usually rejects. What I'm wondering (just wondering) is if those things might be particularly aggravating or unsettling to a team and therefore may tend to lead to worse pitching and fielding performance against the next few batters.
........which leads to a sub-question, somewhat separate from whether that particular idea is plausible or harebrained -- a question about how such a hypothetical thing would be reflected in this method:
IF a thing were to have such a dynamic effect on the following events, would the method want to bump up the value of that thing, or would it all just go into those following events?
I'm guessing it would make the method want to bump up the value of the 'precipitating' thing.
Again, I know that it's speculative to suggest there may be such a dynamic effect; as far as I know, nothing has been shown to have such an effect.
2:28 AM Apr 11th
 
willibphx
Maybe it is gaming the system but why would you not just use the actual runs for each season rather than a 10 or 5 year average. I understand using a multi year process for the standard deviation but it might be interesting to just use the individual season for the baseline values and see what does to the level of errors.

Not sure how you originally solved for and tested RC but did you use multiple year averages in that process?

Thanks
9:29 PM Apr 10th
 
Fireball Wenz
I know this is so pathetically obvious that you've thought about it and dismissed it - but is the high value of a WP and PB related to the circumstances required for it to happen? Is it because a wild pitch can only happen when there's a runner on already, when there's already run potential? Is that in part what the data is reflecting?
4:25 PM Apr 10th
 
shthar
6.2% is a strong beer.
4:16 PM Apr 10th
 
 
©2020 Be Jolly, Inc. All Rights Reserved.