The Harmony of the Stitches

By Bill James

January 28, 2008

A. Stating The Question

To what extent does John Dewan’s +/- fielding rating match with other, more primitive methods to evaluate fielders? Let us suppose, for the sake of argument, that John’s system is 100% accurate in evaluating fielders. If John’s system is (or if it were) 100% accurate, then the accuracy of any other fielding metric could be found by finding the extent to which it agrees with the +/-. If the +/- is 100% accurate and another system agrees with the +/- 58% of the time, then how accurate is the other system? It’s 58% accurate, right?

In the modern world, the last five to ten years, we have very good ways of evaluating fielders. It is no longer within the range of substantive dispute whether a given player is a quality defensive player. There is a blizzard of data available, and there are many people looking at that data and reaching consonant conclusions about fielders. Essentially, we now know how good a fielder someone is.

However, most of baseball history lies outside the reach of this blizzard of data. We could extend the benefits of modern analysis to baseball history if we could develop methods that were consistent with the modern methods. We are, for the first time, in a position to answer very basic and important questions such as “How accurate is fielding percentage, as a measure of a player’s defensive ability?” As a first step toward that goal, I thought it would be helpful to ask this set of questions:

1) To what extent does the evaluation of fielders by the +/- system agree with range factors?

2) To what extent does the evaluation of fielders by the +/- system agree with the evaluation of the same fielders by fielding percentage?

3) To what extent does the evaluation of fielders by the +/- system agree with the evaluation of the same fielders by the ratio of double plays to errors?

B. Outlining the Method

This study deals only with third basemen, and only with the seasons 2005 through 2007. The study group for this report is “all major league third basemen playing 900 or more innings (cumulative) in the years 2005 through 2007”. There are 41 such players.

I began by putting together three-year fielding records for those 41 players, using the Fielding Bible Data given in the statistics section of this service. These records are given below. The categories of this record are the Player’s Name (Player), his full defensive innings (Inn) and thirds of an inning (I3), his Putouts (PO), Assists (A), Errors (E) and Double Plays (DP), his Fielding Percentage (F Pct), his Range Factor (RF; range factor is putouts plus assists per nine innings), his Expected Plays Made on ground balls by the +/- system (Ex P), his Actual Plays Made on ground balls by the +/- system (Plays), the difference between these two (Diff), his +/- on balls hit in the air (Air), the total +/- on ground balls and balls hit in the air (+/-), the Enhanced Fielding Rating (Enh, the Enhanced Fielding Rating being the +/- adjusted for extra base hits, so that cutting off a double is of more value than cutting off a single), and the Enhanced Fielding Rating per 1000 innings (Per In).

Player	Inn	I3	PO	A	E	DP	F Pct	RF	Ex P	Plays	Diff	Air	+/-	Enh	Per In
Alfonzo, Edgardo	913	2	81	177	8	11	.970	2.54	177	164	-13	0	-13	-11	-12.0
Atkins, Garrett	3862	0	260	800	50	93	.955	2.47	789	766	-23	-2	-25	-19	-4.9
Bautista, Jose	1390	2	123	319	22	23	.953	2.86	328	295	-33	0	-33	-34	-24.4
Bell, David	2512	2	188	599	40	49	.952	2.82	523	559	36	1	37	29	11.5
Beltre, Adrian	3963	0	397	881	47	81	.965	2.90	819	853	34	-2	32	44	11.1
Betemit, Wilson	1441	2	81	291	19	33	.951	2.32	273	274	1	-1	0	-3	-2.1
Blake, Casey	1249	0	103	267	15	25	.961	2.67	269	262	-7	0	-7	-4	-3.2
Blalock, Hank	2776	0	186	610	29	52	.965	2.58	606	585	-21	-1	-22	-27	-9.7
Boone, Aaron	2160	2	143	493	34	36	.949	2.65	471	455	-16	-2	-18	-16	-7.4
Braun, Ryan	945	1	61	161	26	12	.895	2.11	186	148	-38	-2	-40	-39	-41.3
Cabrera, Miguel	2882	2	236	578	42	71	.951	2.54	582	544	-38	1	-37	-38	-13.2
Castilla, Vinny	1755	1	197	324	16	31	.970	2.67	308	301	-7	-1	-8	-5	-2.8
Chavez, Eric	3288	2	292	751	26	86	.976	2.85	705	720	15	0	15	27	8.2
Crede, Joe	2768	2	245	679	24	80	.975	3.00	609	661	52	1	53	43	15.5
Encarnacion, Edwin	2577	1	240	524	51	47	.937	2.67	530	498	-32	5	-27	-30	-11.6
Ensberg, Morgan	2846	1	220	661	39	71	.958	2.79	614	632	18	1	19	15	5.3
Feliz, Pedro	3184	0	256	777	38	75	.965	2.92	664	717	53	6	59	58	18.2
Figgins, Chone	1554	2	108	310	26	26	.941	2.42	302	295	-7	-2	-9	-6	-3.9
Glaus, Troy	3367	0	271	778	47	86	.957	2.80	749	746	-3	6	3	-1	-0.3
Gordon, Alex	1135	0	99	247	14	22	.961	2.74	232	241	9	-3	6	9	7.9
Inge, Brandon	4101	1	354	1101	63	100	.958	3.19	1040	1090	50	-3	47	64	15.6
Iwamura, Akinori	1042	1	79	197	7	17	.975	2.38	200	194	-6	-4	-10	-7	-6.7
Izturis, Maicer	1430	0	86	298	24	27	.941	2.42	274	276	2	-1	1	2	1.4
Jones, Chipper	2799	1	242	572	32	56	.962	2.62	538	523	-15	-2	-17	-17	-6.1
Koskie, Corey	1277	2	106	308	14	30	.967	2.92	278	292	14	-1	13	11	8.6
Kouzmanoff, Kevin	1151	1	93	213	23	12	.930	2.39	206	202	-4	1	-3	-5	-4.3
Lowell, Mike	3749	1	355	821	27	107	.978	2.82	779	778	-1	-1	-2	6	1.6
Mora, Melvin	3664	0	275	857	45	59	.962	2.78	821	824	3	3	6	7	1.9
Mueller, Billy	1465	1	102	325	18	27	.960	2.62	318	321	3	2	5	4	2.7
Nunez, Abraham	1946	1	131	529	27	41	.961	3.05	499	501	2	1	3	7	3.6
Punto, Nick	1663	1	142	369	16	31	.970	2.76	330	352	22	1	23	27	16.2
Ramirez, Aramis	3464	2	268	730	39	50	.962	2.59	676	686	10	-2	8	11	3.2
Randa, Joe	1521	2	157	295	16	28	.966	2.67	307	289	-18	1	-17	-18	-11.8
Rodriguez, Alex	4002	1	317	800	49	80	.958	2.51	799	792	-7	2	-5	-9	-2.2
Rolen, Scott	2636	2	200	695	31	71	.967	3.05	600	653	53	-1	52	51	19.3
Sanchez, Freddy	1299	1	98	373	10	43	.979	3.26	319	344	25	-3	22	28	21.5
Teahen, Mark	1992	0	192	481	34	54	.952	3.04	475	452	-23	3	-20	-33	-16.6
Tracy, Chad	1652	0	130	333	29	33	.941	2.52	330	324	-6	1	-5	-8	-4.8
Wigginton, Ty	1226	2	92	260	22	17	.941	2.58	272	240	-32	0	-32	-40	-32.6
Wright, David	4188	0	315	949	64	77	.952	2.72	886	882	-4	-4	-8	-11	-2.6
Zimmerman, Ryan	2911	0	298	637	38	74	.961	2.89	563	584	21	3	24	21	7.2

The method used here was to look for “agreement” or “disagreement” between two methods of comparison. Let us take, for example, the two New York third basemen, Alex Rodriguez and David Wright. Rodriguez has a Range Factor of 2.51, and is rated by the +/- system (enhanced) at –9 plays in 4002.1 innings, which is –2.2 plays per 1000 innings. Wright has a Range Factor of 2.72, and is rated by the +/- system at –11 plays in 4188 innings, which is –2.6 plays per 1000 innings. So, do the two systems agree about which is the better fielder, or do they disagree?

Obviously, they disagree. Range Factor says that Wright is better than Rodriguez (by a relatively small margin), while the +/- system says that Rodriguez is better than Wright (by a very tiny margin). Our method here is simply to make every possible player-to-player comparison, and to count how many times the two systems agree. With 41 players in the study there are 820 player-to-player comparisons {(41 * 40) ÷ 2}.
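The counting procedure can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet or program actually used for the study: it uses five players copied from the table rather than all 41, the function name `count_agreement` is hypothetical, and setting tied pairs aside in their own bucket is an assumption about how ties should be handled.

```python
from itertools import combinations

# (player, Range Factor, Enhanced +/- per 1000 innings), copied from the table.
# Only five of the 41 players are included, to keep the sketch short.
players = [
    ("Rodriguez, Alex", 2.51, -2.2),
    ("Wright, David", 2.72, -2.6),
    ("Lowell, Mike", 2.82, 1.6),
    ("Crede, Joe", 3.00, 15.5),
    ("Braun, Ryan", 2.11, -41.3),
]


def count_agreement(rows):
    """Return (agreements, disagreements, ties) over every pair of players.

    Two metrics "agree" on a pair when they rank the two players in the
    same order. A pair tied on either metric is counted separately
    (an assumption of this sketch).
    """
    agree = disagree = ties = 0
    for (_, rf1, pm1), (_, rf2, pm2) in combinations(rows, 2):
        if rf1 == rf2 or pm1 == pm2:
            ties += 1
        elif (rf1 > rf2) == (pm1 > pm2):
            agree += 1
        else:
            disagree += 1
    return agree, disagree, ties


n = len(players)
total_pairs = n * (n - 1) // 2  # 10 pairs for 5 players; 820 for 41
agree, disagree, ties = count_agreement(players)
print(total_pairs, agree, disagree, ties)
```

In this small sample the only disagreement is the Rodriguez and Wright pair discussed above.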

C. Simple or Direct Results

Fielding Percentage. Let’s start with Fielding Percentage, since fielding percentage is still the most-commonly cited fielding statistic. Fielding Percentage and the +/- system agreed as to which of two third basemen was a better fielder 551 times in the 820 comparisons, and disagreed the other 269 times. Fielding Percentage thus agreed with the +/- system 67% of the time—a little more than two times in three.

Double Plays to Errors Ratio. It was actually Double Plays to Errors Ratio that started this project. I noticed, looking at the data, that Mike Lowell over the last three years has a double plays to errors ratio of 107 to 27, which, just looking at the data for a few other third basemen, seemed to be quite exceptional. Adrian Beltre, for example, was 81 to 47, Ryan Zimmerman 74 to 38, Aramis Ramirez 50 to 39, Melvin Mora 59 to 45, and several of the less experienced third basemen are actually less than even. I have always advocated Double Plays to Errors ratio as a simple way to get a line on a fielder’s ability, and when I saw that I thought “wouldn’t it be interesting if the DP/Errors ratio, over a period of years, turned out to be a highly reliable indicator of defensive quality vis a vis these new defensive metrics?”

The Double Plays to Errors Ratio may be measuring something that the +/- is actually missing; I’m not sure, and perhaps John can add a post script explaining that. My opinion is that Lowell has an exceptional double plays to errors ratio in part because he has an exceptionally accurate throwing arm. If you think about a third baseman making a throw to start a double play, every inch that the throw is off target probably has a measurable impact on the number of double plays resulting. If the throw to the second baseman covering is three feet off target (let’s say that it’s three feet high, or that it pulls the second baseman three feet out toward left field), that almost certainly is going to eliminate the chance of getting a double play, most of the time. If the throw is one foot off target, it certainly would make an easily measurable difference in the chance of turning the double play.

Lowell over the three years of the study has worked with three different primary second basemen—Luis Castillo in 2005, Mark Loretta in 2006, Dustin Pedroia in 2007—so it is not likely that his exceptional Double Play total would result from the actions of the second basemen. On the other hand, it might be dangerous (in normal cases) to evaluate a third baseman by his ability to start double plays, because it might be (in other cases) that the second baseman controls this more than or as much as the third baseman himself. Thus, it might be that this measurement is picking up something that the +/- system can’t really key on.

In any case, the Double Plays/Errors ratio agrees with the +/- system as to which of two third basemen is a better fielder 567 times in 818 comparisons, or 69% of the time, slightly higher than the degree of agreement between Fielding Percentage and Dewan +/-. The total is 818 rather than 820 because there were two sets of players who had identical Double Plays to Errors Ratios.

Range Factor. For the 820 possible comparisons there are 597 cases in which Range Factor and Dewan’s +/- agree as to which is the better fielder, and 223 cases in which they disagree. There is 73% agreement between the two—actually 72.8%.

D. Implied Results

This is a study of just 41 players at one position, and nothing can be safely inferred from this data. However, let us assume for the sake of argument that this 72.8% figure would hold up with additional research.

Based on that result, I think we could safely conclude that Range Factors are 75 to 80% accurate in evaluating fielders. I reason as follows:

Two systems can agree upon a result if they are both correct, or if they are both incorrect. Suppose that you have two systems which are both 90% accurate in a comparison of this nature. How often would they agree?

If they are randomly aligned against the truth, they would agree 82% of the time--.9 * .9 when they are both right, and .1 * .1 when they are both wrong. 81% plus 1% = 82%.

If each system is 80% accurate then they would probably agree 68% of the time—64% plus 4%. If one system was 90% accurate and the other was 60% accurate, they could be expected to agree 58% of the time--.9 * .6, plus .1 * .4. We can call this number the agreement rate.
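The agreement rate described above is simple enough to write down as a function. The sketch below assumes, as the text does, that the two systems' errors are independent of each other; the function name `agreement_rate` is just a label chosen for illustration.

```python
def agreement_rate(p, r):
    """Expected share of comparisons on which two systems agree.

    p and r are the accuracies of the two systems (between 0 and 1).
    Assumes their errors are independent: they agree when both are
    right (p * r) or when both are wrong ((1 - p) * (1 - r)).
    """
    return p * r + (1 - p) * (1 - r)


# The three cases worked in the text.
for p, r in [(0.9, 0.9), (0.8, 0.8), (0.9, 0.6)]:
    print(f"{p:.0%} and {r:.0%} accurate: agree {agreement_rate(p, r):.0%} of the time")
```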

We have been assuming for the sake of argument that Dewan’s +/- system was a perfect test of a fielder’s ability, but of course this cannot be true. There are many distinctions in the data which are razor thin—like the difference between David Wright and Alex Rodriguez—and it is unreasonable to assume that the system gets all of those distinctions right.

Using these assumptions, if a system agrees with the Dewan method 72.8% of the time, the only way that it can be less than 72.8% accurate is if the Dewan system is less than 27.2% accurate. If we assume that the Dewan +/- is 99% accurate, then to agree with that method 72.8% of the time, Range Factors would need to be .7327 accurate:

.99 * .7327 = .7254

.01 * .2673 = .0027

Agreement Rate .7281

If Dewan’s method is 95% accurate, then Range Factors would need to be 75.3% accurate in order for the two to be expected to agree 72.8% of the time. The lower accuracy is assigned to Dewan’s method, the more accuracy must be assigned to Range Factors in order to get the same degree of agreement between the two. This remains true until we assume that Dewan’s method is less than 50% accurate. Below 27.2%, you can get a theoretical agreement factor of .728 or higher by assuming that both methods are wrong a very high percentage of the time.

But that’s an absurd conclusion; no reasonable person could believe that both methods are wrong 75% of the time, nor is it possible to explain how that could happen. Therefore, we have to conclude that the only way for Range Factors to agree with Dewan’s +/- 72.8% of the time is if they are more than 72.8% accurate.
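The same relationship can be turned around to ask what accuracy one system must have, given the observed agreement and an assumed accuracy for the other. Solving agreement = p * r + (1 - p) * (1 - r) for r gives r = (agreement + p - 1) / (2p - 1). The sketch below is one way to express that under the independence assumption used throughout; `implied_accuracy` is a hypothetical name.

```python
def implied_accuracy(agreement, p):
    """Accuracy r of a second system, given the agreement rate and the
    assumed accuracy p of the first, under independent errors.

    Derived by solving agreement = p*r + (1 - p)*(1 - r) for r.
    At p = 0.5 the agreement rate carries no information about r.
    """
    if p == 0.5:
        raise ValueError("r is undetermined when p is exactly 0.5")
    return (agreement + p - 1) / (2 * p - 1)


# Observed Range Factor agreement, against several assumed accuracies for +/-.
for p in (0.99, 0.95, 0.40, 0.20):
    r = implied_accuracy(0.728, p)
    valid = 0.0 <= r <= 1.0
    print(f"+/- at {p:.0%}: implied r = {r:.4f} ({'valid' if valid else 'not a valid probability'})")
```

For assumed accuracies between 27.2% and 50% the formula returns a negative number, meaning no accuracy for the second system could produce 72.8% agreement. Below 27.2% it returns a small positive value, which is the case in which both systems are wrong most of the time.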

Let us assume, as one border, that Dewan’s +/- system cannot be more than 95% accurate in comparing player to player, since

a) many of the comparisons are close, and

b) there is a measure of randomness in all outcomes.

If we assume that Dewan’s +/- system is 95% accurate or less, this implies that Range Factors are 75% accurate or more.

For a border on the other end, suppose that we assumed that Dewan’s +/- was 80% accurate. In order to get 72.8% agreement with a system that was 80% accurate, Range Factors would have to be 88% accurate.

But it seems entirely unreasonable to believe that Range Factors, a primitive method prone to many different illusions of context, can be 88% accurate if Dewan’s +/-, a sophisticated analysis which attempts to remove every illusion of context, is only 80% accurate. It seems unreasonable to believe that Range Factors are as accurate as the +/-, let alone that they are more accurate.

If we assume that Dewan’s +/- is 88% accurate, that would imply that Range Factors are 80% accurate, in order to get an Agreement Rate of .7280:

.88 * .8000 = .7040

.12 * .2000 = .0240

Agreement Rate .7280

This would seem to be as far as it is reasonable to go in assuming that Range Factor and Dewan’s +/- are of comparable accuracy—88% for Dewan, 80% for Range Factor. Therefore, we must conclude that Range Factors are not more than 80% accurate in comparing the relative strength of two fielders.

Range Factors appear to be 75 to 80% accurate in evaluating the relative abilities of two fielders.

By the same logic, Fielding Percentages would appear to be 69 to 73% accurate in evaluating the relative abilities of two fielders, and the Double Plays to Errors Ratio would appear to be 72 to 75% accurate.
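The ranges above can be rechecked from the raw counts. The sketch below assumes independent errors and uses the two borders from the text (95% and 88% accuracy for the +/-); the helper name and dictionary layout are illustrative only.

```python
def implied_accuracy(agreement, p):
    """Solve agreement = p*r + (1 - p)*(1 - r) for r (independent errors)."""
    return (agreement + p - 1) / (2 * p - 1)


# (agreements, comparisons) as reported in section C.
counts = {
    "Range Factor": (597, 820),
    "Fielding Percentage": (551, 820),
    "Double Plays to Errors": (567, 818),
}

bounds = {}
for name, (agree, total) in counts.items():
    rate = agree / total
    low = implied_accuracy(rate, 0.95)   # +/- assumed 95% accurate
    high = implied_accuracy(rate, 0.88)  # +/- assumed 88% accurate
    bounds[name] = (low, high)
    print(f"{name}: agreement {rate:.1%}, implied accuracy {low:.1%} to {high:.1%}")
```

Run this way, the Range Factor and Fielding Percentage ranges match the figures in the text after rounding. The lower end for Double Plays to Errors comes out a little under 72%, which may reflect rounding of the agreement rate in the original calculation.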

E. Housekeeping

Again, I will acknowledge that one study of the players at one position is not a reliable test of anything. My purpose here was more to outline a pathway for future research than to stake a claim for any particular results.

Also, the reliability of the various methods here is being tested on multi-year data, which must be assumed to be substantially more accurate than single-season data. If Range Factors are 75 to 80% accurate in a three-year look, this would suggest that they are something less than 75 to 80% accurate in a single-season study.

Also, the tests here were conducted by comparing the Range Factors to the Enhanced +/- rating, not to the raw +/- rating. The reason this was done is that the agreement between Range Factors and the Enhanced +/- was slightly higher than the agreement with the raw +/-. I tested it both ways and used the higher figure, although the figure for the raw +/- was just a little bit lower. Whereas the count was 597-223 for agreement between Range Factor and the Enhanced +/-, it was 593-227 for agreement with the “un-enhanced” or “raw” +/-.

COMMENTS (8 Comments, most recent shown first)

birtelcom
I ran CORREL in Excel comparing the "Enhanced +/- Per Inning" column in Bill's table to Fielding Win Shares per inning over the same seasons (I had to drop some players in the table because they played a significant number of innings at positions other than 3rd base, and I don't have the data to break out Fielding WS by position -- I ended up with 31 players who played only, or virtually only, 3rd). The CORREL I got for Enhanced +/- Per Inning to Fielding WS per inning was 72.1.
10:45 AM Mar 1st

jrickert
With regard to Bill James' answer to robneyer's question, there's a slight correction. If PM is 97% accurate, RF is 77%, and FP is 71%, AND they are independent measures, then all three are indeed wrong about the same player with a frequency of .002001, but they are in agreement with a frequency of only .5323. Therefore, given that the three are in agreement, the probability that they are correct is about 99.624%. Further, we could verify these percentages by noting that if two of the three agree, the majority is the pair PM,RF with frequency .2215; PM,FP with frequency .1651, and RF,FP with frequency .0811 (and the corresponding probabilities that PM selects the correct fielders are approximately 97.788%, 95.942%, and 79.777%).
Comparing the actual frequencies of agreement from the list, I got all three agreeing with frequency .52927 and pairs (PM,RF)=.19878, (PM,FP)=.14146, (RF,FP)=.13049. This gives us a system of equations that we can solve to get accuracy percentages: PM at 84.9%, RF at 82.7%, and FP at 74.5%. If we throw in a fourth independent measure, then the system might not be solvable and we'd have to do some sort of estimate (for example, a least squares estimate).
There's also an added problem that since we're comparing players AtoB,AtoC,BtoC,..., the sampling error for A affects all comparisons involving A, so that even if the measures are independent, the counts for the comparisons don't represent 820 independent counts.
Unfortunately, it's not that simple if the measures are not independent. For example, suppose that PM is 90% accurate, and that when PM is correct, RF is 80% accurate, but when PM is wrong, RF is affected by the same bias and is only 5% accurate.
Overall then, PM would be 90% accurate and RF 72.5%. If they were independent, the two would agree with frequency .6800, in which case PM would be correct about 95.956% of the time, but correct only about 77.344% of the time that they disagree. But in the non-independent model that I've built here they agree with frequency .7250, in which case PM is correct about 99.3% of the time, while being correct only about 65.45% of the time that they disagree. If we assumed independence and PM=90%, then we'd infer from the agreement frequency that RF=78.1% instead of the actual 72.5%.
Alas, it's not that simple. If one player is much better than the other (e.g. Crede and Braun) then the systems are each correct about 99-100% of the time. If the players are of nearly equal ability, then each measure will have smaller success rates, and these rates will likely be changing at different rates as the skill difference changes - and their correlation might also change as the skill difference changes.
But it's not that simple. What if neither player is better? For example, one has a bit more range, but the other makes fewer errors (So that they pretty much balance out) or is better suited to the type of pitchers on his team? Or perhaps one player is better in half the parks and the other player is better in the other half. Or one of the players is a poor traveler and is weaker when he has to travel to the other coast, but stronger when staying near home... [commenter continues for several hours]...
rob: if the measures are independent and probabilities of PM,RF,FP,DP/E are p,r,f,d, respectively, then the probability that they are correct if all four are in agreement is (p*r*f*d)/(p*r*f*d+(1-p)*(1-r)*(1-f)*(1-d))
3:33 AM Mar 1st

birtelcom
It would be interesting to know how this test works for the more sophisticated historical fielding rating systems such as Fielding Win Shares and BP's Fielding Runs Above Average (or even the two combined!) that also use the basic stats that are available for much of baseball history. One would hope that the correlation is better than it is for less refined stats such as Range Factor and Fielding Pct.
12:15 AM Feb 27th

tangotiger
Focusing only on the release/footwork and accuracy (i.e., not strength), ARod and Inge drop off the list of the 3B that have better arms than Lowell. _______ Also note that Ryan Zimmerman gets several DP, recorded to him as a 3B, even though he plays in the SS position on shifts. So, he gets some DP where he makes the out at 2B. I've always believed that it's more important to mark a guy's position as to where he actually plays, than to have a strict recording of accounting for everyone at one of the 9 positions. I would introduce a "roverOF" position to handle anyone who is playing in the short-outfield, or a "roverIF" position to handle the fifth infielder.
3:28 PM Feb 5th

tangotiger
With respect to the "good" arm, I did mean it in terms of the combination of release/footwork, arm strength, and arm accuracy (not just strength). And, the Fans are notorious for overvaluing really good players who are now in their 30s. It's possible that ARod qualifies for that.
11:04 AM Feb 4th

bjames
If we assume that Dewan +/- is 97% accurate, Range Factors are 77% accurate and Fielding Percentage 71% accurate, then the chance that all three would be wrong about the same player would be .002. We would be able to make a judgment there with 99.8% confidence.
With respect to the list of fielders who are supposed to have arms as good or better than Lowell, three points:
1) What most people would understand as a "good" arm is a STRONG arm. Arm strength is not an issue here, since any third baseman can easily throw to second. I was talking about arm accuracy.
2) Rolen, Feliz, Crede and Chavez have among the highest double plays/errors ratios in the study, all in the top eight.
3) The notion that A-Rod's arm is as good as Lowell's is preposterous, and is an obvious halo effect.

2:56 PM Feb 2nd

robneyer
Question: What if all three measures -- fielding percentage, range factor, DP/E -- agree with +/- that one fielder is better than another? I'm not nearly smart enough to answer that question, but I'm assuming a smart person might easily answer it...
2:22 AM Feb 1st

tangotiger
A correlation of RF to EnhPerInn is r=0.62. Fielding% to EnhPerInn is r=0.63. The advantage here is to also include the differences, not just which number is higher than the other number.
______________
As a curiosity, I also compared Dewan's number to the Fans' Scouting Report I collect. I have 30 of those 3B in 2007. If I rerun the above two correlations to this smaller dataset, I get r=.67 and r=.71 respectively. When I run Dewan's number to the Fans' number, I get r=.75. Basically, the Fans of 2007 did better in correlating to Dewan's 2005-2007 numbers than RF and Fld% of those same years did.
_______________
For whatever it is worth, the Fans evaluated these 3B as having arms as good or better than Lowell:
Rolen
Feliz
Beltre
Inge
Crede
Chavez
Rodriguez

3:37 PM Jan 29th
