Pause
OK, we have completed the first two phases of our project here: (1) finding the "floor", the level of zero competence, in each of the 11 specified areas, and (2) estimating how many runs are saved, historically, by each team in each of those areas, compared to the floor. Floor, base, misery level, zero-competence level... those are interchangeable terms here. We have staggered drunkenly to the end of the second phase, more or less.
I am going to make two changes to what I have done so far, or one change twice. I am going to change the zero competence level for strikeouts and double plays from three standard deviations below the norm to four standard deviations below the norm. There are two reasons for doing this, which I suspect are pretty obvious, but I’ll spell them out, anyway:
1) The standard for everything other than strikeouts and double plays is four standard deviations below the norm, so those two categories were out of line with the rest of the system. There was a REASON why they were out of line, there was a theory behind it, but still, it is easier to explain the system and easier to justify the system if you can just say that the zero-competence line is 4 standard deviations below the norm on the team level, rather than saying that it is 4 standard deviations below the norm except when it isn’t, and then getting into a long-winded explanation about zero-bounded and unbounded categories.
2) Our estimate for all 11 categories was 13% below the target. Increasing the percentage of strikeouts and double plays that we give credit for brings us closer to the target of 1.78 million runs saved for all teams.
Changing from -3SD to -4SD for strikeouts increases the ERP (Estimated Runs Prevented) by strikeouts from 248,355 to 331,131, an increase of 82,776. (Occasionally I like to make up acronyms, even though I know I won’t use them.) Changing from -3SD to -4SD for double plays increases the ERP by double plays from 62,176 to 82,877, an increase of 20,701.
Also, while doing this, I realized that I had failed to include the Runs Prevented by Balk Avoidance in my running total; that’s another 5,705 runs. These three adjustments (two changes and one correction) bring the total of estimated Runs Prevented to 1,657,784, or 93% of the target number.
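To make the floor mechanism concrete, here is a minimal sketch of crediting runs prevented against a zero-competence level set k standard deviations below the league norm. The team totals and the runs-per-strikeout value below are invented for illustration; they are not the article's actual data or run values.

```python
# Illustrative sketch: runs prevented by one category, credited relative
# to a zero-competence floor set k standard deviations below the norm.
from statistics import mean, stdev

def runs_prevented(team_values, runs_per_unit, k=4.0):
    """Credit each team for its distance above a floor of mean - k*SD.

    team_values: per-team totals in some category (e.g. strikeouts).
    runs_per_unit: assumed run value of one unit of the category.
    """
    floor = mean(team_values) - k * stdev(team_values)
    return [(v - floor) * runs_per_unit for v in team_values]

# Hypothetical strikeout totals for a five-team league:
so = [1200, 1350, 1100, 1500, 1275]
rp3 = sum(runs_prevented(so, runs_per_unit=0.1, k=3))
rp4 = sum(runs_prevented(so, runs_per_unit=0.1, k=4))
# Moving the floor from -3 SD to -4 SD lowers the floor by one SD,
# so every team gains one SD's worth of credit, and the league total
# grows by (number of teams) * SD * runs_per_unit.
```

This is why the -3SD-to--4SD change raises the category totals across the board: it moves the baseline down, not any team's performance up.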
Of course, we COULD make the calculations match the target perfectly by going to 4.07 standard deviations below the norm or some other number, but that would be silly, because our values at this point aren’t THAT good. I mean, our values must be more or less right, or it’s unlikely we would come as close as we did to hitting our target. But there are techniques that can be and will be applied to fine-tune the category values for this system, and we would have to apply them even if we were exactly on target, which would just throw us off target again; then we’d have to recalculate everything anyway.
The runs saved by each of the 11 elements of run prevention, historically, are as follows:
Category                      Runs Saved (in Thousands)
----------------------------  -------------------------
DER                           459
Strikeouts                    331
HR Avoidance                  325
Control                       209
Fielding Percentage           136
Double Plays                   83
Stolen Base Defense            49
Hit Batsmen Avoidance          33
Wild Pitch Avoidance           17
Passed Ball Avoidance          10
Balk Avoidance                  6
We’re off by 7% on the gross weight of the whole. Figured team by team, we have an average error of 123 runs, or 18%, and a standard error of 155, or 22%. (The standard error is based on the squares of the individual errors; it says, in effect, that being off by 20 runs on a team is four times worse than being off by 10 runs. The standard error cannot be less than the average error, and is usually more.)
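The two error measures described above can be sketched in a few lines. The per-team errors below are made up for illustration; the real 123-run and 155-run figures come from the author's dataset, which is not shown here.

```python
# Average error (mean absolute error) vs. standard error (root mean
# square error), on hypothetical team-level errors.
from math import sqrt

errors = [120, -90, 160, -200, 75, -140]  # hypothetical runs-prevented errors

average_error = sum(abs(e) for e in errors) / len(errors)
standard_error = sqrt(sum(e * e for e in errors) / len(errors))

# The squaring step makes a 20-run miss count four times as heavily as a
# 10-run miss, which is why the standard error can never be smaller than
# the average error.
```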
Anyway, this will be the starting point (tomorrow) for the third phase of the project, which is the reconciliation stage: trying to understand where you’ve gone wrong, and trying to make things add up better. 18 to 22% isn’t bad; I expected to be off by more than that, probably 25 to 40%. Plus I have a pretty good understanding of WHY I am wrong, or where I am wrong, so we should be able to move forward.
For whatever help it may give you in understanding what we are doing, this is the Run Prevention Chart for the 2016 Chicago Cubs:
2016 Chicago Cubs (103-58)

Runs Prevented By:
Strikeouts                    236
Control                        58
Home Run Avoidance            162
Hit Batsmen Avoidance          11
Wild Pitch Avoidance            4
Balk Avoidance                  2
Fielding Range (DER)          251
Fielding Consistency (F Pct)   34
Double Plays                   31
Stolen Base Control             9
Passed Ball Avoidance           3

Sum of the Above              801
Actual Runs Prevented         802
Error/Discrepancy               1
The 2016 Cubs, the team that broke the Billy Goat’s heart, are a team for which the system happens to get about the right answer, which at this point is just coincidence. The Cubs played in a context in which an average team, based on long-established sabermetric methods, could have been expected to score and allow 679 runs. They won big because they scored 808 runs, but also because they prevented 802 runs. They are more or less a 50/50 team; their pitching and defense were about as strong as their offense.
Among the things you can do with this information, once we really get the system working, is compare two teams in ways that we could not otherwise compare them. Take two teams competing in the World Series: you can analyze and compare their defensive performance in ways that were not possible before; well, not NOW now, but when we get this jalopy running. We’ll be able to say that one team’s defense saved 78 more runs than the other’s in these six areas, but the other team’s defense saved 106 more runs in the remaining five areas. We’ll be able to distinguish between pitching and defense in ways that we can’t now. My mind is roiling with like a million things we can do with this data, most of which, of course, won’t turn out to be half as interesting in fact as they seem like they might be in theory.
But the first purpose, of course, is to create a platform of equality between hitters and fielders. Comparing a slugging first baseman to a defender (Boog Powell to Mark Belanger, or Howie Kendrick to Yan Gomes, let’s say), we are equipping some future Yan Gomes to go to his arbitration hearing and say that yes, Howie Kendrick created 58 runs and I created only 39 in similar playing time, but Kendrick prevented only 14 runs, and I prevented 41... or whatever the data shows.
Or this. We are equipping some future GM, in contemplating a trade, to look at the record of a 28-year-old free agent shortstop, and say "OK, he saved 46 runs in 2031, 47 in 2032, 53 in 2033, 45 last year, which was 2034. Our shortstops the last four years have saved 38, 40, 36, and last year just 15, so signing this guy is going to save us 10-15 runs a year."
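The back-of-the-envelope arithmetic in that hypothetical works out as follows; all the runs-saved figures are the imagined ones from the paragraph above, not real data.

```python
# Hypothetical runs saved, 2031-2034, from the GM scenario above.
free_agent = [46, 47, 53, 45]     # the 28-year-old shortstop
our_shortstops = [38, 40, 36, 15]  # the team's own shortstops

gap = (sum(free_agent) / len(free_agent)
       - sum(our_shortstops) / len(our_shortstops))
# The four-year averages differ by about 15.5 runs per year, which is
# roughly where the "10-15 runs a year" estimate comes from; discounting
# for aging or regression would pull it down into that range.
```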
Of course, there are methods now that enable us to evaluate fielders, but here is my real point: it has always bothered me about fielder evaluations that they are so difficult to cross-check with alternative approaches.
Runs created methods are based on objective data, and can be cross-checked in many, many different ways. You can see that teams do in fact score the number of runs that our methods predict they should score. There is great internal consistency in them, and you can predict how many runs a team would lose if this player were injured or traded or out of the lineup. You can judge how well those methods work.
Or let me explain it this way. My methods of approaching Runs are one way; Pete Palmer’s were radically different—and yet they converge on shared conclusions.
Runs Saved +/- estimates are no doubt accurate and reliable to a good extent; I am not suggesting that they are not. What I am saying is that I wish there were better ways to check. SOMETIMES I believe what the Runs Saved systems say about a defender, and sometimes I don’t. When I don’t, I’m limited as to what I can say on the other side.
This, I am hoping, will be a way to construct that second look at the issue—often reaching the same conclusions, no doubt, but if not that, then constructing a way to challenge those conclusions. That is my main purpose here.