
The Harmony of the Stitches

January 28, 2008

A.  Stating The Question

 

            To what extent does John Dewan’s +/- fielding rating match with other, more primitive methods to evaluate fielders?   Let us suppose, for the sake of argument, that John’s system is 100% accurate in evaluating fielders.   If John’s system is (or if it were) 100% accurate, then the accuracy of any other fielding metric could be found by finding the extent to which it agrees with the +/-.   If the +/- is 100% accurate and another system agrees with the +/- 58% of the time, then how accurate is the other system?   It’s 58% accurate, right?

            In the modern world, the last five to ten years, we have very good ways of evaluating fielders.   It is no longer within the range of substantive dispute whether a given player is a quality defensive player.   There is a blizzard of data available, and there are many people looking at that data and reaching consonant conclusions about fielders.   Essentially, we now know how good a fielder someone is.

            However, most of baseball history lies outside the reach of this blizzard of data.  We could extend the benefits of modern analysis to baseball history if we could develop methods that were consistent with the modern methods.    We are, for the first time, in a position to answer very basic and important questions such as “How accurate is fielding percentage, as a measure of a player’s defensive ability?”  As a first step toward that goal, I thought it would be helpful to ask this set of questions:

            1)  To what extent does the evaluation of fielders by the +/- system agree with range factors?

            2)  To what extent does the evaluation of fielders by the +/- system agree with the evaluation of  the same fielders by fielding percentage?

            3)  To what extent does the evaluation of fielders by the +/- system agree with the evaluation of the same fielders by the ratio of double plays to errors?

 

B.  Outlining the Method

 

            This study deals only with third basemen, and only with the seasons 2005 through 2007.   The study group for this report is “all major league third basemen playing 900 or more innings (cumulative) in the years 2005 through 2007”.   There are 41 such players.

            I began by putting together three-year fielding records for those 41 players, using the Fielding Bible Data given in the statistics section of this service.  These records are given below.   The categories of this record are the Player’s Name (Player), his full defensive innings (Inn) and thirds of an inning (I3), his Putouts (PO), Assists (A), Errors (E) and Double Plays (DP), his Fielding Percentage (F Pct), his Range Factor (RF; range factor is putouts plus assists per nine innings), his Expected Plays Made on ground balls by the +/- system (Ex P), his Actual Plays Made on ground balls by the +/- system (Plays), the difference between these two (Diff), his +/- on balls hit in the air (Air), the total +/- on ground balls and balls hit in the air (+/-) and the Enhanced Fielding Rating (Enh, the Enhanced Fielding Rating being the +/- adjusted for extra base hits, so that cutting off a double is of more value than cutting off a single), and the Enhanced Fielding Rating per 1000 innings (Per In).
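            The derived columns can be recomputed from the raw counts. The following is a minimal Python sketch, not part of the Fielding Bible data: the function names are illustrative, and it assumes Fielding Percentage is the conventional (PO + A) / (PO + A + E). It uses Alex Rodriguez's line from the table as a check.

```python
def innings(full, thirds):
    """Convert full innings (Inn) plus thirds of an inning (I3) to a decimal."""
    return full + thirds / 3.0


def range_factor(po, a, inn):
    """Putouts plus assists per nine innings, as defined in the text."""
    return 9.0 * (po + a) / inn


def fielding_pct(po, a, e):
    """Assumed standard definition: successful chances over total chances."""
    return (po + a) / (po + a + e)


def per_1000(enh, inn):
    """Enhanced Fielding Rating scaled to 1000 innings (the Per In column)."""
    return 1000.0 * enh / inn


# Alex Rodriguez, 2005-2007: 4002 1/3 innings, 317 PO, 800 A, 49 E, Enh -9
inn = innings(4002, 1)
rf = range_factor(317, 800, inn)
fp = fielding_pct(317, 800, 49)
rate = per_1000(-9, inn)

print(round(rf, 2), round(fp, 3), round(rate, 1))  # 2.51 0.958 -2.2
```

            The three rounded values match the RF, F Pct and Per In entries in his row.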

 

Player                Inn  I3   PO     A   E   DP  F Pct    RF  Ex P  Plays  Diff  Air  +/-  Enh  Per In
Alfonzo, Edgardo      913   2   81   177   8   11   .970  2.54   177    164   -13    0  -13  -11   -12.0
Atkins, Garrett      3862   0  260   800  50   93   .955  2.47   789    766   -23   -2  -25  -19    -4.9
Bautista, Jose       1390   2  123   319  22   23   .953  2.86   328    295   -33    0  -33  -34   -24.4
Bell, David          2512   2  188   599  40   49   .952  2.82   523    559    36    1   37   29    11.5
Beltre, Adrian       3963   0  397   881  47   81   .965  2.90   819    853    34   -2   32   44    11.1
Betemit, Wilson      1441   2   81   291  19   33   .951  2.32   273    274     1   -1    0   -3    -2.1
Blake, Casey         1249   0  103   267  15   25   .961  2.67   269    262    -7    0   -7   -4    -3.2
Blalock, Hank        2776   0  186   610  29   52   .965  2.58   606    585   -21   -1  -22  -27    -9.7
Boone, Aaron         2160   2  143   493  34   36   .949  2.65   471    455   -16   -2  -18  -16    -7.4
Braun, Ryan           945   1   61   161  26   12   .895  2.11   186    148   -38   -2  -40  -39   -41.3
Cabrera, Miguel      2882   2  236   578  42   71   .951  2.54   582    544   -38    1  -37  -38   -13.2
Castilla, Vinny      1755   1  197   324  16   31   .970  2.67   308    301    -7   -1   -8   -5    -2.8
Chavez, Eric         3288   2  292   751  26   86   .976  2.85   705    720    15    0   15   27     8.2
Crede, Joe           2768   2  245   679  24   80   .975  3.00   609    661    52    1   53   43    15.5
Encarnacion, Edwin   2577   1  240   524  51   47   .937  2.67   530    498   -32    5  -27  -30   -11.6
Ensberg, Morgan      2846   1  220   661  39   71   .958  2.79   614    632    18    1   19   15     5.3
Feliz, Pedro         3184   0  256   777  38   75   .965  2.92   664    717    53    6   59   58    18.2
Figgins, Chone       1554   2  108   310  26   26   .941  2.42   302    295    -7   -2   -9   -6    -3.9
Glaus, Troy          3367   0  271   778  47   86   .957  2.80   749    746    -3    6    3   -1    -0.3
Gordon, Alex         1135   0   99   247  14   22   .961  2.74   232    241     9   -3    6    9     7.9
Inge, Brandon        4101   1  354  1101  63  100   .958  3.19  1040   1090    50   -3   47   64    15.6
Iwamura, Akinori     1042   1   79   197   7   17   .975  2.38   200    194    -6   -4  -10   -7    -6.7
Izturis, Maicer      1430   0   86   298  24   27   .941  2.42   274    276     2   -1    1    2     1.4
Jones, Chipper       2799   1  242   572  32   56   .962  2.62   538    523   -15   -2  -17  -17    -6.1
Koskie, Corey        1277   2  106   308  14   30   .967  2.92   278    292    14   -1   13   11     8.6
Kouzmanoff, Kevin    1151   1   93   213  23   12   .930  2.39   206    202    -4    1   -3   -5    -4.3
Lowell, Mike         3749   1  355   821  27  107   .978  2.82   779    778    -1   -1   -2    6     1.6
Mora, Melvin         3664   0  275   857  45   59   .962  2.78   821    824     3    3    6    7     1.9
Mueller, Billy       1465   1  102   325  18   27   .960  2.62   318    321     3    2    5    4     2.7
Nunez, Abraham       1946   1  131   529  27   41   .961  3.05   499    501     2    1    3    7     3.6
Punto, Nick          1663   1  142   369  16   31   .970  2.76   330    352    22    1   23   27    16.2
Ramirez, Aramis      3464   2  268   730  39   50   .962  2.59   676    686    10   -2    8   11     3.2
Randa, Joe           1521   2  157   295  16   28   .966  2.67   307    289   -18    1  -17  -18   -11.8
Rodriguez, Alex      4002   1  317   800  49   80   .958  2.51   799    792    -7    2   -5   -9    -2.2
Rolen, Scott         2636   2  200   695  31   71   .967  3.05   600    653    53   -1   52   51    19.3
Sanchez, Freddy      1299   1   98   373  10   43   .979  3.26   319    344    25   -3   22   28    21.5
Teahen, Mark         1992   0  192   481  34   54   .952  3.04   475    452   -23    3  -20  -33   -16.6
Tracy, Chad          1652   0  130   333  29   33   .941  2.52   330    324    -6    1   -5   -8    -4.8
Wigginton, Ty        1226   2   92   260  22   17   .941  2.58   272    240   -32    0  -32  -40   -32.6
Wright, David        4188   0  315   949  64   77   .952  2.72   886    882    -4   -4   -8  -11    -2.6
Zimmerman, Ryan      2911   0  298   637  38   74   .961  2.89   563    584    21    3   24   21     7.2

 

 

 

            The method used here was to look for “agreement” or “disagreement” between two methods of comparison.    Let us take, for example, the two New York third basemen, Alex Rodriguez and David Wright.    Rodriguez has a Range Factor of 2.51, and is rated by the +/- system (enhanced) at –9 plays in 4002.1 innings, which is –2.2 plays per 1000 innings.  Wright has a Range Factor of 2.72, and is rated by the +/- system at –11 plays in 4188 innings, which is –2.6 plays per 1000 innings.   So, do the two systems agree about which is the better fielder, or do they disagree?

            Obviously, they disagree.  Range Factor says that Wright is better than Rodriguez (by a relatively small margin), while the +/- system says that Rodriguez is better than Wright (by a very tiny margin.)  Our method here is simply to make every possible player-to-player comparison, and to count how many times the two systems agree.  With 41 players in the study there are 820 player-to-player comparisons {(41 * 40) ÷ 2}.
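            The counting procedure described above can be sketched in Python. This is an illustration rather than the code used for the study: the function name, the dictionary keys, and the choice to set tied pairs aside are all assumptions, and the three-player sample only demonstrates the mechanics.

```python
from itertools import combinations


def count_agreement(players, metric_a, metric_b):
    """Compare every pair of players on two metrics (higher = better).

    Returns (agree, disagree, ties). A pair counts as a tie when either
    metric rates the two players exactly equal; how ties were treated in
    the study is an assumption here.
    """
    agree = disagree = ties = 0
    for p, q in combinations(players, 2):
        diff_a = p[metric_a] - q[metric_a]
        diff_b = p[metric_b] - q[metric_b]
        if diff_a == 0 or diff_b == 0:
            ties += 1
        elif (diff_a > 0) == (diff_b > 0):
            agree += 1
        else:
            disagree += 1
    return agree, disagree, ties


# Three rows from the table: Range Factor and Enhanced +/- per 1000 innings
sample = [
    {"name": "Rodriguez", "rf": 2.51, "per_in": -2.2},
    {"name": "Wright", "rf": 2.72, "per_in": -2.6},
    {"name": "Lowell", "rf": 2.82, "per_in": 1.6},
]

print(count_agreement(sample, "rf", "per_in"))   # (2, 1, 0)
print(len(list(combinations(range(41), 2))))     # 820 pairs for 41 players
```

            The one disagreement in the sample is the Rodriguez and Wright pair discussed above.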

 

C.  Simple or Direct Results

 

            Fielding Percentage.  Let’s start with Fielding Percentage, since fielding percentage is still the most-commonly cited fielding statistic.   Fielding Percentage and the +/- system agreed as to which of two third basemen was a better fielder 551 times in the 820 comparisons, and disagreed the other 269 times.   Fielding Percentage thus agreed with the +/- system 67% of the time—a little more than two times in three.

            Double Plays to Errors Ratio.  It was actually Double Plays to Errors Ratio that started this project.   I noticed, looking at the data, that Mike Lowell over the last three years has a double plays to errors ratio of 107 to 27, which, just looking at the data for a few other third basemen, seemed to be quite exceptional.    Adrian Beltre, for example, was 81 to 47, Ryan Zimmerman 74 to 38, Aramis Ramirez 50 to 39, Melvin Mora 59 to 45, and several of the less experienced third basemen are actually less than even.   I have always advocated Double Plays to Errors ratio as a simple way to get a line on a fielder’s ability, and when I saw that I thought “wouldn’t it be interesting if the DP/Errors ratio, over a period of years, turned out to be a highly reliable indicator of defensive quality vis a vis these new defensive metrics?”

            The Double Plays to Errors Ratio may be measuring something that the +/- is actually missing; I’m not sure, and perhaps John can add a post script explaining that.   My opinion is that Lowell has an exceptional double plays to errors ratio in part because he has an exceptionally accurate throwing arm.   If you think about a third baseman making a throw to start a double play, every inch that the throw is off target probably has a measurable impact on the number of double plays resulting.  If the throw to the second baseman covering is three feet off target (let’s say that it’s three feet high, or that it pulls the second baseman three feet out toward left field), that almost certainly is going to eliminate the chance of getting a double play, most of the time.   If the throw is one foot off target, it certainly would make an easily measurable difference in the chance of turning the double play.

            Lowell over the three years of the study has worked with three different primary second basemen—Luis Castillo in 2005, Mark Loretta in 2006, Dustin Pedroia in 2007—so it is not likely that his exceptional Double Play total would result from the actions of the second basemen.   On the other hand, it might be dangerous (in normal cases) to evaluate a third baseman by his ability to start double plays, because it might be (in other cases) that the second baseman controls this more than or as much as the third baseman himself.   Thus, it might be that this measurement is picking up something that the +/- system can’t really key on.

            In any case, the Double Plays/Errors ratio agrees with the +/- system as to which of two third basemen is a better fielder 567 times in 818 comparisons, or 69% of the time, slightly higher than the degree of agreement between Fielding Percentage and Dewan +/-.  There are 818 comparisons here rather than 820 because two pairs of players had identical Double Plays to Errors Ratios.

            Range Factor.   For the 820 possible comparisons there are 597 cases in which Range Factor and Dewan’s +/- agree as to which is the better fielder, and 223 cases in which they disagree.   There is 73% agreement between the two—actually 72.8%.

 

D.  Implied Results

 

            This is a study of just 41 players at one position, and nothing can be safely inferred from this data.   However, let us assume for the sake of argument that this 72.8% figure would hold up with additional research.

            Based on that result, I think we could safely conclude that Range Factors are 75 to 80% accurate in evaluating fielders.   I reason as follows:

            Two systems can agree upon a result if they are both correct, or if they are both incorrect.   Suppose that you have two systems which are both 90% accurate in a comparison of this nature.   How often would they agree?

            If they are randomly aligned against the truth, they would agree 82% of the time--.9 * .9 when they are both right, and .1 * .1 when they are both wrong.   81% plus 1% = 82%.

            If each system is 80% accurate then they would probably agree 68% of the time—64% plus 4%.   If one system was 90% accurate and the other was 60% accurate, they could be expected to agree 58% of the time--.9 * .6, plus .1 * .4.   We can call this number the agreement rate.
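            The agreement rate in these examples follows one formula: both right plus both wrong. Here is a minimal Python sketch, assuming (as the text does) that the two systems err independently of each other; the function name is illustrative.

```python
def agreement_rate(p, r):
    """Expected share of comparisons on which two systems agree.

    p and r are the accuracies of the two systems (0 to 1). The systems
    are assumed to be right or wrong independently of one another.
    """
    both_right = p * r
    both_wrong = (1.0 - p) * (1.0 - r)
    return both_right + both_wrong


print(round(agreement_rate(0.9, 0.9), 4))  # 0.82
print(round(agreement_rate(0.8, 0.8), 4))  # 0.68
print(round(agreement_rate(0.9, 0.6), 4))  # 0.58
```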

            We have been assuming for the sake of argument that Dewan’s +/- system was a perfect test of a fielder’s ability, but of course this cannot be true.   There are many distinctions in the data which are razor thin—like the difference between David Wright and Alex Rodriguez—and it is unreasonable to assume that the system gets all of those distinctions right. 

            Using these assumptions, if a system agrees with the Dewan method 72.8% of the time, the only way that it can be less than 72.8% accurate is if the Dewan system is less than 27.2% accurate.   If we assume that the Dewan +/- is 99% accurate, then to agree with that method 72.8% of the time, Range Factors would need to be .7327 accurate:

 

            .99 * .7327 =                           .7254

            .01 * .2673 =                           .0027

             Agreement Rate                       .7281

 

            If Dewan’s method is 95% accurate, then Range Factors would need to be 75.3% accurate in order for the two to be expected to agree 72.8% of the time.  The lower the accuracy assigned to Dewan’s method, the more accuracy must be assigned to Range Factors in order to get the same degree of agreement between the two.   This remains true until we assume that Dewan’s method is only 72.8% accurate, at which point Range Factors would have to be 100% accurate; between 27.2% and 72.8%, no accuracy figure for Range Factors produces 72.8% agreement.   Below 27.2%, you can get a theoretical agreement factor of .728 or higher by assuming that both methods are wrong a very high percentage of the time.
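            The accuracy that Range Factors "would need" can be found by solving the agreement formula for the unknown accuracy. With agreement a = p*r + (1 - p)*(1 - r), rearranging gives r = (a + p - 1) / (2p - 1). The sketch below is an illustration under the same independence assumption; the function name is not from the article.

```python
def implied_accuracy(agreement, p):
    """Accuracy r of a second system, given its agreement rate with a
    reference system of accuracy p, assuming independent errors.

    Solves agreement = p*r + (1-p)*(1-r) for r. A result outside the
    range 0..1 means no accuracy can produce that agreement rate.
    """
    if p == 0.5:
        # At p = 0.5 agreement is always 0.5, whatever r is.
        raise ValueError("r is undetermined when p is exactly 0.5")
    return (agreement + p - 1.0) / (2.0 * p - 1.0)


print(round(implied_accuracy(0.728, 0.99), 4))  # 0.7327
print(round(implied_accuracy(0.728, 0.95), 3))  # 0.753
```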

            But that’s an absurd conclusion; no reasonable person could believe that both methods are wrong 75% of the time, nor is it possible to explain how that could happen.   Therefore, we have to conclude that the only way for Range Factors to agree with Dewan’s +/- 72.8% of the time is if they are more than 72.8% accurate.

            Let us assume, as one border, that Dewan’s +/- system cannot be more than 95% accurate in comparing player to player, since

            a)  many of the comparisons are close, and

            b)  there is a measure of randomness in all outcomes.

            If we assume that Dewan’s +/- system is 95% accurate or less, this implies that Range Factors are 75% accurate or more.

            For a border on the other end, suppose that we assumed that Dewan’s +/- was 80% accurate.   In order to get 72.8% agreement with a system that was 80% accurate, Range Factors would have to be 88% accurate.

            But it seems entirely unreasonable to believe that Range Factors, a primitive method prone to many different illusions of context, can be 88% accurate if Dewan’s +/-, a sophisticated analysis which attempts to remove every illusion of context, is only 80% accurate.   It seems unreasonable to believe that Range Factors are as accurate as the +/-, let alone that they are more accurate.

            If we assume that Dewan’s +/- is 88% accurate, that would imply that Range Factors are 80% accurate, in order to get an Agreement Rate of .7280:

 

            .88 * .8000 =                           .7040

            .12 * .2000 =                           .0240

             Agreement Rate                       .7280

 

            This would seem to be as far as it is reasonable to go in assuming that Range Factor and Dewan’s +/- are of comparable accuracy—88% for Dewan, 80% for Range Factor.  Therefore, we must conclude that Range Factors are not more than 80% accurate in comparing the relative strength of two fielders.

            Range Factors appear to be 75 to 80% accurate in evaluating the relative abilities of two fielders.

            By the same logic, Fielding Percentages would appear to be 69 to 73% accurate in evaluating the relative abilities of two fielders, and the Double Plays to Errors Ratio would appear to be 72 to 75% accurate.
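            The "same logic" can be applied to all three measures at once. The sketch below is an illustration, not the original calculation: it takes the agreement counts reported in Section C, uses 95% and 88% as the two assumed borders for the accuracy of the +/-, and assumes independent errors throughout.

```python
def implied_accuracy(agreement, p):
    """Solve agreement = p*r + (1-p)*(1-r) for r (independent errors)."""
    return (agreement + p - 1.0) / (2.0 * p - 1.0)


# Agreement with the +/- system, from the counts reported in Section C
AGREEMENT = {
    "Range Factor": 597 / 820,
    "Fielding Percentage": 551 / 820,
    "Double Plays / Errors": 567 / 818,
}

# Assumed borders for the accuracy of the +/- system itself
P_HIGH, P_LOW = 0.95, 0.88

bounds = {
    name: (implied_accuracy(a, P_HIGH), implied_accuracy(a, P_LOW))
    for name, a in AGREEMENT.items()
}

for name, (low, high) in bounds.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

            Under these assumptions the output lands within about a point of the rounded ranges quoted above.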

 

 

E.  Housekeeping

 

            Again, I will acknowledge that one study of the players at one position is not a reliable test of anything.   My purpose here was more to outline a pathway for future research than to stake a claim for any particular results.

            Also, the reliability of the various methods here is being tested on multi-year data, which must be assumed to be substantially more accurate than single-season data.  If Range Factors are 75 to 80% accurate in a three-year look, this would suggest that they are something less than 75 to 80% accurate in a single-season study.

            Also, the tests here were conducted by comparing the Range Factors to the Enhanced +/- rating, not to the raw +/- rating.   The reason this was done is that the agreement between Range Factors and the Enhanced +/- was slightly higher than the agreement with the raw +/-.    I tested it both ways and used the higher figure, although the figure for the raw +/- was just a little bit lower.   Whereas the count was 597-223 for agreement between Range Factor and the Enhanced +/-, it was 593-227 for agreement with the “un-enhanced” or “raw” +/-.

 
 

COMMENTS (8 Comments, most recent shown first)

birtelcom
I ran CORREL in Excel comparing the "Enhanced +/- Per Inning" column in Bill's table to Fielding Win Shares per inning over the same seasons (I had to drop some players in the table because they played a significant number of innings at positions other than 3rd base, and I don't have the data to break out Fielding WS by position -- I ended up with 31 players who played only, or virtually only, 3rd). The CORREL I got for Enhanced +/- Per Inning to Fielding WS per inning was 72.1.
10:45 AM Mar 1st
 
jrickert
With regard to Bill James' answer to robneyer's question, there's a slight correction. If PM is 97% accurate, RF is 77%, and FP is 71%, AND they are independent measures, then all three are indeed wrong about the same player with a frequency of .002001, but they are in agreement with a frequency of only .5323. Therefore, given that the three are in agreement, the probability that they are correct is about 99.624%. Further, we could verify these percentages by noting that if two of the three agree, the majority is the pair PM,RF with frequency .2215; PM,FP with frequency .1651, and RF,FP with frequency .0811 (and the corresponding probabilities that PM selects the correct fielders are approximately 97.788%, 95.942%, and 79.777%).
Comparing the actual frequencies of agreement from the list, I got
all three agreeing with frequency .52927 and pairs (PM,RF)=.19878, (PM,FP)=.14146, (RF,FP)=.13049. This gives us a system of equations that we can solve to get accuracy percentages: PM at 84.9%, RF at 82.7%, and FP at 74.5%. If we throw in a fourth independent measure, then the system might not be solvable and we'd have to do some sort of estimate (for example, a least squares estimate).
There's also an added problem that since we're comparing players AtoB,AtoC,BtoC,..., the sampling error for A affects all comparisons involving A, so that even if the measures are independent, the counts for the comparisons don't represent 820 independent counts.
Unfortunately, it's not that simple if the measures are not independent. For example, suppose that PM is 90% accurate, and that when PM is correct, RF is 80% accurate, but when PM is wrong, RF is affected by the same bias and is only 5% accurate.
Overall then, PM would be 90% accurate and RF 72.5%. If they were independent, the two would agree with frequency .6800, in which case PM would be correct about 95.956% of the time, but correct only about 77.344% of the time that they disagree. But in the non-independent model that I've built here they agree with frequency .7250, in which case PM is correct about 99.3% of the time, while being correct only about 65.45% of the time that they disagree. If we assumed independence and PM=90%, then we'd infer from the agreement frequency that RF=78.1% instead of the actual 72.5%.
Alas, it's not that simple. If one player is much better than the other (e.g. Crede and Braun) then the systems are each correct about 99-100% of the time. If the players are of nearly equal ability, then each measure will have smaller success rates, and these rates will likely be changing at different rates as the skill difference changes - and their correlation might also change as the skill difference changes.
But it's not that simple. What if neither player is better? For example, one has a bit more range, but the other makes fewer errors (So that they pretty much balance out) or is better suited to the type of pitchers on his team? Or perhaps one player is better in half the parks and the other player is better in the other half. Or one of the players is a poor traveler and is weaker when he has to travel to the other coast, but stronger when staying near home... [commenter continues for several hours]...
rob: if the measures are independent and probabilities of PM,RF,FP,DP/E are p,r,f,d, respectively, then the probability that they are correct if all four are in agreement is (p*r*f*d)/(p*r*f*d+(1-p)*(1-r)*(1-f)*(1-d))
3:33 AM Mar 1st
 
birtelcom
It would be interesting to know how this test works for the more sophisticated historical fielding rating systems such as Fielding Win Shares and BP's Fielding Runs Above Average (or even the two combined!) that also use the basic stats that are available for much of baseball history. One would hope that the correlation is better than it is for less refined stats such as Range Factor and Fielding Pct.
12:15 AM Feb 27th
 
tangotiger
Focusing only on the release/footwork and accuracy (i.e., not strength), ARod and Inge drop off the list of the 3B that have better arms than Lowell. _______ Also note that Ryan Zimmerman gets several DP, recorded to him as a 3B, even though he plays in the SS position on shifts. So, he gets some DP where he makes the out at 2B. I've always believed that it's more important to mark a guy's position as to where he actually plays, than to have a strict recording of accounting for everyone at one of the 9 positions. I would introduce a "roverOF" position to handle anyone who is playing in the short-outfield, or a "roverIF" position to handle the fifth infielder.
3:28 PM Feb 5th
 
tangotiger
With respect to the "good" arm, I did mean it in terms of the combination of release/footwork, arm strength, and arm accuracy (not just strength). And, the Fans are notorious for overvaluing really good players who are now in their 30s. It's possible that ARod qualifies for that.
11:04 AM Feb 4th
 
bjames
If we assume that Dewan +/- is 97% accurate, Range Factors are 77% accurate and Fielding Percentage 71% accurate, then the chance that all three would be wrong about the same player would be .002. We would be able to make a judgment there with 99.8% confidence.
With respect to the list of fielders who are supposed to have arms as good or better than Lowell, three points:
1) What most people would understand as a "good" arm is a STRONG arm. Arm strength is not an issue here, since any third baseman can easily throw to second. I was talking about arm accuracy.
2) Rolen, Feliz, Crede and Chavez have among the highest double plays/errors ratios in the study, all in the top eight.
3) The notion that A-Rod's arm is as good as Lowell's is preposterous, and is an obvious halo effect.

2:56 PM Feb 2nd
 
robneyer
Question: What if all three measures -- fielding percentage, range factor, DP/E -- agree with +/- that one fielder is better than another? I'm not nearly smart enough to answer that question, but I'm assuming a smart person might easily answer it...
2:22 AM Feb 1st
 
tangotiger
A correlation of RF to EnhPerInn is r=0.62. Fielding% to EnhPerInn is r=0.63. The advantage here is to also include the differences, not just which number is higher than the other number.
______________
As a curiosity, I also compared Dewan's number to the Fans' Scouting Report I collect. I have 30 of those 3B in 2007. If I rerun the above two correlations to this smaller dataset, I get r=.67 and r=.71 respectively. When I run Dewan's number to the Fans' number, I get r=.75. Basically, the Fans of 2007 did better in correlating to Dewan's 2005-2007 numbers than RF and Fld% of those same years did.
_______________
For whatever it is worth, the Fans evaluated these 3B as having arms as good or better than Lowell: Rolen
Feliz
Beltre
Inge
Crede
Chavez
Rodriguez

3:37 PM Jan 29th
 
 
©2024 Be Jolly, Inc. All Rights Reserved.