Responding to Bill James’ Misunderstanding of Defensive Runs Saved

March 7, 2015

Defensive Runs Saved (DRS) is a system of metrics that has been developed over the last 12 years (going back to 2003) by John Dewan, Ben Jedlovec and the team at Baseball Info Solutions (BIS).  Bill James himself has played a key role in the development of DRS; it was his suggestion at the outset of the research to create a plus/minus type method.  DRS has multiple components that convert numerous techniques measuring individual aspects of defense (catcher throwing, to name one) into a common currency, one that all baseball fans understand: runs.  Using a common currency was vital to ensure that the numbers produced by the system could be easily understood.

In Bill James’ massive effort called The Fielding Jones on Bill James Online, Bill takes exception to some of the methods used to develop Defensive Runs Saved.  However, many of those objections are based on a misunderstanding of the system.  Ben Jedlovec, President of Baseball Info Solutions, has been my partner in furthering the development of Defensive Runs Saved since he joined BIS in 2008.  Below are the comments regarding Bill’s misunderstanding that Ben sent to Bill in an email right after Bill published his thoughts.  I apologize in advance that they are a bit technical in nature, but we want to set the record straight.  The comments here are primarily about Bill’s concern with the way we estimate the run impact of each play, as described in Section XXXIII of The Fielding Jones, entitled "The Marginal Run Question".

 

Ben’s email to Bill James:

"If my understanding of your argument is correct, I believe you are missing one or two crucial points. Let me try two different explanations, a short one and a long(er) one.

1) Your argument would be valid if we were to simply apply .75 times the number of plays made, and/or .75 times the number of hits allowed. However, that's not what we do because we factor in the difficulty of each play, which you could call the "expectation" that the play will be made. All of our Runs Saved components factor in the "expectation" for the event, given an average fielder at the position.

2) Offensive Linear Weights similarly have an "expectation", or baseline. It's zero. In other words, when the batter steps up to the plate, he could single (+0.45 runs), homer (+1.40 runs), strike out (-0.30 runs), or any number of possible outcomes. (Obviously, the actual run expectancy change of the play will depend on the runners and outs, but we're dealing with averages here as Linear Weights does.) The batter will make an out (-0.30) far more often than he'll get a hit, but the average of every outcome's run value will be 0.00. By definition. If the frequency of events changes, the run values of each event change to keep the average at 0.00.

You could explain our Range and Positioning calculation in the following way. Back to our batter stepping up to the plate. The expectation of this play is zero, 0.00. The pitcher and batter do their thing, and sometimes the ball is put in play. The ball is in the air or on the ground, weak grounder or screaming liner, on its way to something. En route, even before the play has been resolved, we can consider the expectation of the play to have changed.

It might be a screaming liner that is a sure single 100% of the time. In this case, the value of the play is now 0.45, the run value of a single.

It might be a pop up on the infield, a sure out 100% of the time. In this case, the value of the play is now -0.30, the run value of an out.

It might be a hard grounder towards third. Let's say we estimate that it's a single 50% of the time, and it's an out 50% of the time. In this case, the expectation of the play is now 0.50*0.45+0.50*(-0.30) = 0.075. (We're using a simple binary example, but note that we can use any number of outcomes with their respective estimated frequencies and run values.) The play has jumped from 0.00 to 0.075, so that hard grounder was a positive outcome for the batter.

Now we get to evaluate the fielder. If he makes the play, it's an out. We credit the fielder 0.075 - (-0.30) = +0.375. If he fails to make the play, it's a single, and we penalize the fielder 0.075 - 0.45 = -0.375. So, while the difference between the hit and the out is 0.75, we're actually giving/penalizing the fielder a fraction of that, depending on the difficulty (expectation) of the play.

The Defensive Runs Saved system's job is to accurately estimate the difficulty, or "expectation," of each play given a league average fielder, since we use a league average baseline for each position."
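Ben's worked example can be sketched in code.  This is a sketch only, using the illustrative run values from his email (single = +0.45, out = -0.30), not the actual run values the DRS system uses:

```python
# Illustrative run values from Ben's email; real DRS values differ.
RUN_VALUE = {"single": 0.45, "out": -0.30}

def play_expectation(outcome_probs):
    """Expected run value of a ball in play, given the estimated
    frequency of each outcome with an average fielder."""
    return sum(p * RUN_VALUE[outcome] for outcome, p in outcome_probs.items())

def fielder_credit(outcome_probs, actual_outcome):
    """Runs credited (+) or debited (-) to the fielder: the play's
    expectation minus the run value of what actually happened."""
    return play_expectation(outcome_probs) - RUN_VALUE[actual_outcome]

# Hard grounder toward third: a single 50% of the time, an out 50%.
grounder = {"single": 0.50, "out": 0.50}
exp = play_expectation(grounder)             # 0.075
made = fielder_credit(grounder, "out")       # +0.375 if he makes the play
missed = fielder_credit(grounder, "single")  # -0.375 if he does not
```

Note that the credit and the debit are each a fraction of the full 0.75-run gap between a hit and an out, scaled by the difficulty of the play, which is the point of Ben's explanation.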

 

Bill’s New Technique

Bill wants to do fielding metrics a different way, as described in Section XXXVIII, "What I Would Like to See in Fielding Statistics".  He wants to re-create what we called Where Hits Landed, as published in the first Fielding Bible.  He lamented that this information should be made available to the public, or in his words, "to foster the creation of an organized universe of fielding data."  But he forgot that we published this information and provided it to Bill (and the public) back in 2006 in the first Fielding Bible.  There was minimal interest in this data at that time, from the public or from Bill himself, so we discontinued the report.

However, we will again provide that information to Bill to use as he wishes.  Baseball Info Solutions is giving Bill a spreadsheet that contains all of this data going back to 2003.  Whether and how Bill wants to make it available to the public through Bill James Online, we will leave up to him.  He may decide to provide only the most recent data because, as he puts it, "the value of data from 2006 is minimal in any case."  Or he may provide just his analysis.  It’s up to him.

There is no question in my mind that Bill can do some valuable things with this data.  He will come up with some very good information, and it will be interesting to see how it compares to the one component of DRS that he is trying to measure: the Range and Positioning component.  Having said that, I don’t think it is possible to use less data, with far less detail, and get a better result.

Is it possible that Bill will come up with a better way to present defensive data?  Certainly.  People will be developing baseball information 100 years from now and will continually come up with better and better ways to create and present defensive analytics.  One of the most important reasons for this is that there will continue to be better and better data to be analyzed.  Plus, there are certainly plenty of avenues of research that we haven’t yet pursued with the DRS system. 

The proof is in the pudding.  If a new baseball metric is measuring something well, that needs to be demonstrated.  The DRS system has been studied in this manner in many, many ways, by ourselves in several independent studies.  Research on each component of the system has shown each to be consistent and predictive.  On top of that, the results need to be consistent with what people predominantly see with the naked eye.  A high percentage of the players known for good defense should come out with good numbers; otherwise credibility is lost.  That is true of the DRS system.

Of course, not everyone has bought into defensive metrics. We’ve spent more time and energy on the subject than just about anybody (we’ve published four books on the topic, for goodness sake!), so we know the Defensive Runs Saved system and its results better than anyone. However, not everyone takes our word for it, and that’s fine. We’re hopeful Bill can find better ways to present the information. He’s been very successful over his entire career doing exactly that. We hope that it will even add to our understanding of defense.

But will it change the DRS bottom line, that Andrelton Simmons is an incredible defender and that Matt Kemp is a liability in the field? We doubt it.

 
 

COMMENTS (18 Comments, most recent shown first)

jedlovec3
All,

Not sure I agree with all of the comments here, but we'll continue to do everything we can to make our data and analytics better.

I'm sure we meant to allude to independent studies done by ourselves and others.

Chris, I'm glad you tracked down the clarification for the Good Plays/Misplays component of Runs Saved. You're right in the end: we focus on the Good Plays/Misplays (such as first baseman scoops) that aren't incorporated in other components.

Oh, and not that it matters all that much, but you're attributing comments to me that are from John. My words end at the closing quotation marks. I don't want to shortchange John's work here.

Thanks,

Ben
6:54 PM Mar 13th
 
Guy123
Season DRS UZR Scale
2010-2014 10.9 8.6 128%
2003-2009 10.5 9.5 111%


One other factor to consider is that a few years ago BIS reportedly corrected its range bias problem. That should have increased the SD, consistent with what you show for DRS. The fact that UZR's SD shrunk even as the underlying data is describing a larger spread is a bit puzzling, and perhaps reflects some change in UZR methodology in approximately 2010?
8:19 PM Mar 8th
 
tangotiger
MGL:

I think I can somewhat describe numerically your point about the kind of error you are discussing.

If we knew perfectly the chance of an out for every BIP, we'd get one SD = .010 runs per BIP, purely from random variation.

Except, as you point out, we don't know exactly the chance of an out. Whereas it may be a true 20% out rate, we may estimate it at 10% or 25% or even 40%. We have an uncertainty in our mean estimate. Maybe the uncertainty of our mean estimate is one SD = .050 to .100 outs per BIP, or around .04 to .08 runs per BIP.

Adding that to our original .010 runs per BIP, and now, one SD = .011 to .013 runs per BIP. So, let's say .012 runs per BIP, or 6 runs per 500 BIP.

Therefore, that's what I'd suggest is our uncertainty in our estimates, based on random variation and uncertainty of our true mean. We get 6 runs per 500 BIP off the bat.

Of course, some people will INSIST that we do NOT remove random variation (just like we don't in hitting). But, we NEED to at the least remove the uncertainty of the estimate of our mean.

12:07 PM Mar 8th
 
tangotiger
If you figure around 500 BIP for each fielder, then one SD PURELY from random variation is going to be .020 outs per BIP (assuming normal distribution). But, since the distribution is closer to bi-modal (near-automatic outs and near-automatic hits), one SD from random variation is going to be closer to .014 outs per BIP, or .010 runs per BIP. Which is 5 runs per 500 BIP.

If spread in talent is one SD = 9 runs per 500 BIP (just for ILLUSTRATIVE purposes), then we'd OBSERVE one SD = 10.3 runs.

Of course, our ability to estimate what we observe can't possibly be that good. Hence, we can only observe say one SD = 9.5 runs or so at best (under this illustration).

DRS is observing one SD = 10.9 runs, which therefore assumes the spread in talent in fielding is much wider than one SD = 9 runs.
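Tango's arithmetic above can be checked with a short sketch.  All numbers are illustrative, as in his comment: a 0.75-run gap between a hit and an out, 500 BIP per fielder, and a hypothetical talent spread of one SD = 9 runs:

```python
import math

BIP = 500          # balls in play per fielder (illustrative)
RUN_GAP = 0.75     # run difference between a hit and an out

# SD purely from random variation, treating each BIP as a 50/50
# coin flip (the widest case for a binomial):
sd_binomial = math.sqrt(0.5 * 0.5 / BIP)    # ~0.022 outs per BIP

# The real distribution is closer to bimodal (near-automatic outs and
# near-automatic hits), so Tango pegs random variation at ~0.014 outs
# per BIP, or roughly 0.010 runs per BIP:
sd_runs_per_bip = 0.014 * RUN_GAP           # ~0.0105 runs per BIP
sd_random = 5.0                             # ~0.010 runs/BIP x 500 BIP

# Observed spread = talent spread and random variation in quadrature:
talent_sd = 9.0                             # illustrative talent spread
observed_sd = math.sqrt(talent_sd**2 + sd_random**2)   # ~10.3 runs
```

This reproduces the "observe one SD = 10.3 runs" figure: if DRS shows a wider observed SD than that, the implied talent spread must be wider than the illustrative 9 runs.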

***

Note that that observation includes random variation, which we all seem to have decided to include for hitting and pitching and running, and so we have to keep it for fielding. But that's a separate discussion altogether.
11:19 AM Mar 8th
 
cderosa
Page 453, Fielding Bible III: an appendix I missed previously answers some of my questions regarding double counting: the "ground ball through infielder" is *not* converted into runs, nor are some of the other GFP/DM that would be obvious double-counts of range issues.

So tentatively, double-counting does not explain the rise over time of DRS's standard deviation relative to Ultimate Zone Rating's standard deviation compiled by Tango below.

Sorry for a false lead.

I still think we need to see all the Good Fielding Plays and Defensive Mistakes clearly laid out.

Chris DeRosa


10:06 AM Mar 8th
 
tangotiger
One SD in batting linear weights (relative to position, to keep it at the same scale as we have for fielding) is 16 runs for the same time period and for the same amount of playing time.

I would expect therefore that one SD in fielding will be around 10 runs, maybe one SD = 11 runs at the most. Closer to 10 though.

But seeing that our estimate for fielding is less certain than for hitting, our ESTIMATE should likely be closer to one SD = 9 runs for fielding.

So, I do think that the DRS estimates are too wide, and that the estimate of the width from UZR is pretty much spot on.


9:55 AM Mar 8th
 
tangotiger
This is all non-catchers, 2003-2014, standard deviation, min 1000 IP at a position (all data from Fangraphs):

Season DRS UZR
2014 10.5 8.4 125%
2013 11.4 8.7 130%
2012 11.0 8.8 125%
2011 9.9 8.2 120%
2010 11.9 8.6 138%
2009 10.2 8.8 117%
2008 10.4 9.8 107%
2007 11.8 11.1 107%
2006 10.1 9.1 111%
2005 12.4 10.2 122%
2004 9.6 9.1 106%
2003 9.0 8.5 105%

Season DRS UZR Scale
2010-2014 10.9 8.6 128%
2003-2009 10.5 9.5 111%

The "scale" is simply DRS/UZR. As you can see, DRS has gotten slightly wider in scale, while UZR has gotten narrower.

The issue is also as to how wide the fielding range should be. Actually, how wide we can ESTIMATE the range to be. Even if you can argue that the range should be one SD = 10 or 11 runs, can we possibly ESTIMATE it to be that wide?

9:18 AM Mar 8th
 
cderosa
MGL offers the caution that granular is better only if the methodology is sound. Another thing in DRS that deserves some more discussion is the method by which the Good Fielding Plays and Defensive Mistakes are converted into runs.

The system calls for a comparison of how many good plays or mistakes a fielder makes compared to the average player in the same number of “touches,” then it converts the difference into runs based on the average run value of the play. In an example in the Fielding Bible III, we see Robinson Cano docked 1.25 defensive runs for the year because he had only four GFP13s (“double play despite aggressive slide”), and that’s low compared to other second basemen in the same number of touches.

First of all, is that the right question? I think what we really want to know is not how often we see different second basemen make this play in general (“touches”), but how many times per aggressive slide.

Second, I wonder if we are perilously close to giving points for style here. I appreciate the painstaking work it takes to keep counts of these observations, and I think we are richer for it. But it might be that this is the sort of information (only four times did we see Cano suspended in midair over a sliding runner, completing the DP) that helps explain why the Yankees did well or poorly in completing the number of double plays we might expect them to have in, say, Bill’s expected double play estimates, rather than constitute the estimate itself.

7:33 AM Mar 8th
 
cderosa
Other potential double-counts in the Good Fielding Play/Defensive Mistake system include:

Ball stuck in glove: Are we penalizing a fielder for losing an out on a batted ball (which would be a range issue), or just for a baserunner advancing while ball is so situated?

The summary of infielders’ Good Fielding Plays on Throwing (p 103, FB III): Do you get one of those whenever you make an incredible throw, or only on throws other than the ones that make assists off batted balls (i.e. the ones that should be covered by the +/- measure)?

Outfielder failing to anticipate wall (DM38): I’d expect the +/- range measure to account for the fly ball not being caught, and for the fact that it is going to go for extra bases with runners advancing, I’d expect that to be covered by the average value of doubles or triples when you convert the +/- result into a run value in Defensive Run Saved.

Quick start on a double play (GFP11): The fielder presumably already gets credit for the first out in the range-measuring system; is the bonus for this play only in a fractional contribution to the second out? And is that fraction subtracted from the pivot man’s score? Otherwise, it is possibly a double-count.

7:30 AM Mar 8th
 
studes
Hey MGL, I (mostly) get what you're saying in your Simmons example. But...two questions:

1. Isn't it true that, while some of those balls in the 50% bucket might be 100%, another bunch might be 0%? The issue is that, given a small sample size, things don't necessarily even out?

2. Along the same lines, the Simmons example probably isn't an issue for career numbers? Isn't the particular issue more pertinent to seasonal numbers?
5:53 AM Mar 8th
 
mgl
I want to make another thing clear about fielding metric numbers in general. I have talked about this many times. The sample numbers we see, whether they are 1 year or 10 games or 3 years, are going to be large (how large depends on the data and the methodologies) for two reasons. One, they are always going to be larger (the spread, that is) than the true talent of the fielders, simply because they are measuring sample performance. That is true of all metrics: offensive, defensive, pitching, etc. The larger the sample, the more these numbers will converge on true talent.

The other reason these defensive sample numbers are large is not well understood. Not only are they larger than defensive true talent for obvious reasons, they are also larger than actual performance, where that performance includes good and bad "luck" (performance above or below true talent). The reason for that is noise in the data!

When Ben or anyone else has a bunch of batted balls in a 50% bucket (balls of that particular location and speed are fielded half the time), not all of those balls are really fielded half the time. Some are impossible to field, and some are fielded nearly 100% of the time. Why is that? Because we don't, from the data, really know how difficult a ball was to field. We don't know where the fielder started, we don't know EXACTLY where the ball was located, we don't know EXACTLY how hard it was hit, we don't know whether it took a good or bad hop, we don't know how much spin it had, whether the wind made it more difficult to catch, what the game situation was (for example, in some cases it is or is not correct for a fielder to dive for a ball and risk extra bases), etc.

So, technically, IF one were to construct a metric which in reality reflects true performance, there needs to be one of two things: one, perfect information for every batted ball, in terms of how difficult it was (i.e., how often the average fielder would make that exact play, given the exact same conditions); or two, a regression on the numbers that the metric spits out in order to better reflect actual performance.

Let me give you an example of what I mean, using made up numbers. Let's say that Simmons has a +23 for 2014 in DRS or UZR (or any other similar metric). That means that his actual performance compared to an average fielder in his place given the exact same balls in play under the exact same conditions is probably something like +14 (again, I am making that number up) AND that his estimated true talent is something like +10.

And that is why, as I have said many times, I do NOT like the idea of adding together offensive and defensive metrics in WAR. The defensive portion does NOT reflect what actually happened, whereas in some sense the offensive portion does. The more the defensive portion deviates from 0, the more problematic it is. As I said, if Simmons has a UZR in 2014 of +23, probably only 10 to 15 of that should be used for 2014 WAR. So we are probably overvaluing him by 5 or 10 runs. If another player has +5, then the error is probably only on the order of 1 or 2 runs.
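MGL's observed/performance/talent distinction amounts to regressing an observed number toward the mean, with heavier regression for true talent than for actual performance. A toy sketch, with reliability factors invented purely to reproduce his made-up Simmons numbers (+23 observed, roughly +14 performance, roughly +10 talent):

```python
def shrink(observed, reliability, mean=0.0):
    """Regress an observed figure toward the population mean.
    Estimating the reliability factor properly is the hard part of
    building a fielding metric; the values below are illustrative."""
    return mean + reliability * (observed - mean)

observed = 23.0                        # e.g., a +23 DRS/UZR season
performance = shrink(observed, 0.60)   # ~+14: estimated actual performance
talent = shrink(observed, 0.45)        # ~+10: estimated true talent
```

The farther the observed figure sits from zero, the larger the gap between it and either estimate, which is MGL's point about extreme seasonal numbers inflating WAR.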
10:22 PM Mar 7th
 
mgl
I also agree mostly with Ben, although I have not followed this discussion. As Tango said, the idea that more data, or more granular data, yields better results assumes two things, which I think is fair to assume unless addressed otherwise: One, the additional or more granular data is sound; that is, it is not overly biased or noise-filled. Two, the methodologies used to create the results are as sound as possible. Obviously, if one methodology that uses less or less granular data is better than another methodology that uses more or more granular data, then it is possible for the results of the former to be better than the results of the latter. But, as Ben says, all things being equal, more and more granular data always yields better results.

As far as his description of the methodology for coming up with the number (runs saved or earned above/below average) for each play, it is perfectly sound, although it is not clear from his description what happens if there is some shared responsibility for a batted ball, between or among fielders, which is almost always the case. For example, if another fielder makes a play or does not make a play, you cannot assign a different fielder the same value (debit, because HE did not make the play); otherwise everything will not add up correctly. It is a little tricky, and it is by no means crystal clear how to handle shared-responsibility plays. In the example Ben gives, the accounting is simple whether the fielder makes the play or not. That is not a practical example, though. In most cases, you will have a batted ball that might be a hit 50% of the time, be fielded by the 3rd baseman 10% of the time, and be fielded by the SS 40% of the time, for example. In that case, you probably want to assign a different debit value to the 3rd baseman when he does not make the play, given that A, the SS DID make the play, or B, the SS DID NOT make the play. There are several reasons for this. One, if the SS DID make the play, it is likely that the ball was closer to the SS than the 3B (so that the normal 10% does not even apply, no matter where the ball location was recorded), and two, the 3B may have deliberately deferred to the SS for whatever reason. That is especially true when a 3B makes a play and the SS did not. The 3B usually cuts in front of the SS on a ball in the hole, such that if the 3B makes a play, we have no idea whether the SS could have made the play as well.

Finally, I want to address the idea of splitting responsibility for a batted ball between the pitcher and the fielder. The pitcher may be responsible for how hard the ball was hit, whether it is on the ground or in the air, hang time, even where it is hit, but once we use the location and ball speed (or hang time), we are completely (100%) measuring the responsibility of the fielder. The pitcher is out of the equation.
10:06 PM Mar 7th
 
tangotiger
Chris: excellent point. This may explain why it looks like the DRS numbers have "expanded" in scale recently.
5:33 PM Mar 7th
 
cderosa
A different point re: Defensive Runs Saved, if you will indulge me. I think each edition of The Fielding Bible (I have the first three) has improved on the last. But one reason I'm not embracing the DRS system is that you added this Good Fielding Plays/Defensive Mistakes piece into it without showing us the whole list with descriptions of what's being counted.

There are stray references to individual GFP/DMs though, and some of these raise further questions. On page 96 of Fielding Bible III, for example, you've got "ground ball through infielder" as one of the Mistakes, but how is that not covered already, as an easy play not made, in the +/- system? Is this a kind of double-counting then?

"Taking a Bad Route to the Ball," DM12: same issue: are we talking about a route to a fly ball that falls (and hence a range issue), or just chasing a fallen ball badly and turning a single into a double?

We learn that there are 82 of these plays, but I have to know what they are before I can judge how much stock I want to put in the weight you assign them.

If these are fully described in the 4th edition, that's another improvement to be proud of, but if not, perhaps you can publish them on this site.

Thanks,
Chris DeRosa


5:17 PM Mar 7th
 
MarisFan61
Adding to what Tango said: It isn't just 'slicing the data in a granular enough manner' that can keep more-data-with-more-detail from giving a better result. More generally and I think more particularly, it's what you make of the data that you have -- how you think of it and what you do with it. I think less-data-with-less-detail very very often can give a better result, in sabermetrics or in anything. Sure, 'more' gives you a better chance, all else being equal -- but it isn't equal, mainly because it depends on who's doing it, but also because, in line with what Tango said, 'more' also gives you more opportunity to get removed from relevance.
12:48 PM Mar 7th
 
cderosa
Saying that Defensive Runs Saved (DRS) is consistent and predictive, and that it shows Simmons is good and Kemp is bad, I think only speaks to points already conceded. The marginal run objection, and most of the doubting reactions to the system that I have seen, are not about whether it is measuring stuff in the right directions, but whether it has the scale right.

Looking back on Bill's statements on the scale issue, I note the following arguments in play:

1. Intuitive Resistance: having fielders with what look like whopping totals, this guy +40, this guy -31, etc., isn’t realistic: the spread between the best defenses and the worst defenses doesn’t support individual totals like these.

2. More Credit to Pitchers Needed: On a ball in play, DRS gives all the responsibility for any deviation from the average outcome on such a ball, good or bad, to the fielder. The original win shares split responsibility for a team having above or below average Defensive Efficiency between the pitching staff and the fielders. It treats not just the routine outcome as a thing pitchers and fielders do together, but also the act of getting a tough out, or losing an easy out. Bill hasn’t defended the split decision directly, and it may no longer even be part of today’s win shares. But he has argued elsewhere that when the pitcher gets the batter to put the ball in play anywhere, he is getting the batter “mostly out,” so substantial credit is due.

3. Better to Be Conservative: In another passage (I think it concerned Carlos Gomez having a higher WAR than Mike Trout one year), he also mentioned that we aren’t sure how much chance variation is involved with fielders’ numbers on balls in play and didn’t want to go too hog wild relying on them.

4. Dave Parker Concession: In a piece Tango linked in a previous thread, Bill dropped the intuitive resistance (at least temporarily), and reasoned that although it might not seem sound at first, it was plausible that a failing right fielder could cost you 27 runs or something in a season.

5. Marginal Run Issue: I don’t want to paraphrase it because what Bill is saying here is beyond my immediate grasp. I read it a few times and I don’t quite get it. But the result, he suggests, is that we are giving way more weight to fielding plays than is warranted.

Putting aside the intuitive resistance (after all, once, people thought it was obvious that Dick McAuliffe couldn’t have had as good a year as Roberto Clemente, and lots of other things we take for granted now), these other issues seem to form the battleground if you are trying to prove that DRS is the right or wrong approach.

11:56 AM Mar 7th
 
tangotiger
I agree with Ben on most of this. A couple of exceptions:

***

"Having said that, I don’t think it is possible to use less data that has far less detail and get a better result."

It's certainly possible. You slice the data in a granular enough manner, and just wait for systematic bias to give you less reliable results.

***

I'm not sure if this is Bill's point, but if a player does/doesn't make a play, that non-play is not necessarily a hit allowed: it could be an out made by another fielder.

So, while an out (+.25) to a hit (-.50) is a .75 run difference, the out to out is 0 runs of difference. If we assume that 10% of the time, there's "overlap", that brings us down to .675 runs in this illustration. Not a big difference to be sure, but something to consider.

***

I also take exception to Ben's characterization of "independent studies" performed by.... themselves. That's not the definition of independent studies.


7:44 AM Mar 7th
 
OldBackstop
Interesting. Did Neyer ever address Bill's remarks? He left here muttering darkly...
6:54 AM Mar 7th
 
 