
Reliable and Unreliable Stats, Revisited

August 27, 2022



            I posted an article on August 22, "Reliable and Unreliable Batting Averages", which has drawn a number of responses, the number increased somewhat because our software chose to post two or five copies of some of them.  I very much appreciate the responses, and I’ll take a minute to respond to a few of them. 

            First of all—and please don’t overreact to what might seem at first to be the harsh tone of this—several people posted responses suggesting that it would have been better if I had age-adjusted the player’s true batting ability, making the .270 hitter a .260 hitter at age 21 and a .285 hitter at age 27, or whatever.  Several people; I won’t name names because of what I have to say next.

            What that would do is completely ruin the study so that you couldn’t conceivably learn anything whatsoever from doing it, but on the positive side it would deliver no benefits and destroy whatever value the study might have.  I have to take responsibility for the misunderstanding, because (a) I wrote the article, so it is on me to be clear about what I am doing, and (b) several people shared the same misunderstanding, which suggests that the basis of the misunderstanding must be in the article somewhere. 

            My bad, but the article doesn’t have anything to do with a career—any player’s career, or a theoretical player’s career.  It is about the reliability of batting average against increasing numbers of at bats.  It is not intended to model a career; it is intended to model a batting average.

            The study is about what we might call the Hank Aguirre question, from the introduction to the article, or what might be called the table game problem, or what might be called the Terry Forster problem, from a "Hey, Bill" question posted on August 27.  Even after apparently reading the article, a reader wants to take seriously the fact that Terry Forster hit .397 in 78 career at bats.  The point of the article was:  that doesn’t mean shit to a tree, because 78 at bats is not a meaningful number.  He’s not a .397 hitter; he’s probably a .240 hitter who got lucky. 

            This is a digression, but I’ll follow it for two more paragraphs.   What I SHOULD have pointed out in that response is this:   Walter Johnson one year hit .433 in 97 at bats.  .433 is a higher average than .397, and 97 at bats is more than 78, so whatever the odds were of a .240 hitter hitting .397 (Forster), they were certainly quite a lot longer against a .240 hitter hitting .433 in 97 at bats. 

            But Johnson wasn’t even a .240 hitter!   He was a career .235 hitter.   I mean, I grant you that Walter’s batting average would likely have been higher post-1920, but you see the point:  if a .235 hitter can hit .433 in 97 at bats, then a .240 hitter could certainly hit .397 in 78 at bats.
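For readers who want to put actual numbers on those "long odds": if you treat each at bat as an independent trial with a fixed hit probability (a simplification, and not part of the original study), the chance of each fluke line can be sketched with a binomial tail sum. The specific lines and career averages are from the paragraphs above.

```python
from math import comb

def tail_prob(n_ab: int, hits: int, true_avg: float) -> float:
    """P(at least `hits` hits in n_ab at bats) for a hitter whose true
    ability is true_avg, with each at bat an independent trial."""
    return sum(comb(n_ab, k) * true_avg**k * (1 - true_avg)**(n_ab - k)
               for k in range(hits, n_ab + 1))

# Forster: .397 in 78 AB is 31 hits; assume a true .240 hitter.
p_forster = tail_prob(78, 31, 0.240)
# Johnson: .433 in 97 AB is 42 hits; he was a career .235 hitter.
p_johnson = tail_prob(97, 42, 0.235)
print(p_forster, p_johnson)
```

Both probabilities come out small, and the Johnson line is by far the less likely of the two, which is the point: rare things happen when you have thousands of player-seasons to draw from.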

            Anyway.  The key missing understanding here is, a model is a SIMPLIFICATION of what happens in real life.  There are many thousands of things that we don’t understand in baseball and can’t figure out from studying real-life statistics, because

a)     Real life events are very messy, and involve hundreds or thousands of complicating factors, and

b)     Real life statistics are derived from a necessarily limited set of trials, whereas in a model we can run the test through 10 million events if we need to do that, which we frequently do. 

What we do in studying problems with models is to get rid of all the complications that we CAN get rid of, and then run the model through many more iterations than can ever occur in real life. 

Also, when you make a model more complicated, you make the results of the model much harder to interpret.  

An APBA or Strat-o-Matic or Diamond Mind or Ballpark game is a SIMPLIFIED image of the real game.  All of the probabilities that you see referenced hundreds of times a day in the sports world, such as "the Milwaukee Brewers currently have a 68% probability to make the playoffs" or "that ball would have been a home run in 26 of the 30 major league parks". . .all of those are derived from simplified models of the problem.   

In those cases, in many cases, in most cases there is a benefit to making the model like real life.   In this case, there is no benefit whatsoever to making the model more complicated, more like real life, because there is no one person or place to which the model must correlate.  The question here is simply "How many at bats does it take for a batting average to become reliable?"
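As a rough illustration of that question (mine, not from the original study): treating each at bat as an independent trial, the binomial standard deviation shows how far a true .270 hitter's observed average can wander by chance alone at various at-bat totals.

```python
import math

# Spread of observed batting average around a true .270 ability,
# from chance alone -- no aging, platoon, or park complications,
# which is exactly the simplification the article argues for.
TRUE_AVG = 0.270

for ab in (78, 100, 550, 2000):
    sd = math.sqrt(TRUE_AVG * (1 - TRUE_AVG) / ab)
    lo, hi = TRUE_AVG - 2 * sd, TRUE_AVG + 2 * sd
    print(f"{ab:5d} AB: ~95% of observed averages fall between "
          f".{lo * 1000:03.0f} and .{hi * 1000:03.0f}")
```

At 78 at bats the two-standard-deviation range is roughly a hundred points wide; even at 550 it is still about seventy-five points wide, which is why a single season's average is much less reliable than it looks.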

If I was studying what you guys apparently assumed that I was studying, then I would have had to set aside a certain percentage of at bats to represent "with the platoon advantage" and a certain percentage to represent "without".  But the problem with that is, every complication that you add into the model that doesn’t relate to the question you are trying to address makes the answer less reliable.   I could assume that the player hits .260 at age 20 and .285 at age 27, but what if that’s the wrong ratio?   It’s probably the wrong ratio for almost every hitter, since every hitter is different.  I could assume that the .270 hitter hits .280 with the platoon advantage and .254 without it, but that’s not a fact; that’s just an assumption, and it is almost certainly wrong for almost every player.   I could assume that the player will get 28% of his at bats with the platoon advantage and 72% without, but that assumption will be wrong about each individual player.   Every complication that you add to the game is a source of additional error. 

And, in this case, there is no benefit to it.  What I was REALLY trying to do here is to get people to understand how unreliable batting averages are in less than a couple of thousand at bats.  It goes back to a discussion I had on air with Bob Costas 40 years ago.  Broadcasters had just gotten access to career batter-vs-pitcher matchup data, and had instantly fallen in love with it.  I was trying to explain to people how massively uninformative that data is.  It basically has no predictive value at all.  "Well," said Bob, "If a player is 3-for-7 against a pitcher, that’s .429 and I won’t mention that, but if he is 11-for-27, that’s different."  To which I replied "No, it isn’t."  11 for 27 doesn’t mean anything; it doesn’t predict anything.  

So far I have made zero progress on getting people to understand this.   But I’ve only been working at it for about 50 years, so keep a good thought. 

Fireball Wenz

The Hank Aguirre problem is why I never used any "one-season" simulation. Dave Rader shouldn't be better than Carlton Fisk. Even at age 11, I was using three seasons worth of data in my homemade board games.


            When I was 11 years old Carlton Fisk was only 13, so his career wasn’t very impressive yet. 



Also, I wonder how this study would do with WAR.


That would require that you make a very, very large number of assumptions about unknowns.  In this study, as I did it, I made NO assumptions about unknowns.  To do THAT study, you’d have to make a vast number of such assumptions.

Well, I guess I had to create a theoretical array of possible batting averages, running from .170 to .370 but centered at .270.  I guess that would be an assumption.   



Is this not a binomial distribution question?


Since I have no idea what a binomial distribution question is, I have no idea.   My understanding is that if you use phrases like "binomial distribution", St. Peter bans you from his nightclub. 




First, I would like to address the question as to what to do about Hank Aguirre in a 1967 replay so that his hitting in a simulation is more akin to his real-life ability.


As any aficionado of baseball simulations knows, you are not the only one who has come across a "Hank Aguirre" problem. Or it might be a 1974 "Rick Auerbach" problem, where an otherwise .220 career hitter batted .342 in 73 at bats for the Dodgers.


There's only one way to "fix" this in order to have a player perform more like his real-life ability: adjust his "card" (or in this case, the inputted stat line) the simulation uses to randomly create the result on any given plate appearance. Perhaps use Aguirre's 1966 stats in your sim instead.


My preferred sim is Diamond Mind Baseball (DMB). I've played APBA for Windows and of course Strat in the pre-computer days. Extra Innings as well. With DMB it is fairly easy to "re-card" a player. Many of them might be late-season call-ups (where I believe MLEs should be used rather than weird .100 or .500 batting averages or 11.00 ERAs). Naturally, in a league with other players there needs to be some pre-season agreement as to what to do with outlier seasons for some players.


Diamond Mind was created by a good friend of mine, Tom Tippett.   I’ve lost track of Tom, don’t know where he is now.  If any of you know, you might let me know.  Tom came to Kansas to talk some of the simulation issues through with me when he was developing the game, mid-1980s.  I later worked with him at STATS Inc. in the 1990s; a variation or off-shoot of Diamond Mind was the software underlying Bill James Classic Baseball, the game in which you would pick players from the past and we would simulate results for you.  From about 2003 to about 2017 I worked with him with the Red Sox.   I arrived a little before he did, and stayed a little bit longer.  He might have made it through 2018, I don’t know.   I remember when he came out to Kansas we went to a Royals game, and spent the whole game having an animated discussion about simulation issues, which just irritated the living hell out of some woman who was sitting one row in front of us.  Tom says he doesn’t remember that woman at all.  

            Anyway, when I create user-created Ballpark cards, my process for unreliable batting stats is something like this.  

            First, I double the plate appearances for the season in question.  For Hank Aguirre in 1967 (1-for-2 with a triple) I count that as going 2-for-4 with two triples. 

            Then I assume that every player’s batting performance must be based on at least 100 plate appearances, and I figure how many more we need.  In this case, that would be 96 additional plate appearances.    

            Then, IF the player had 100 career at bats or more, I fill in the missing 96 plate appearances with plate appearances representing the career performance of the player, scaled down to 96 plate appearances.  Since Aguirre hit .085 in his career, that would be 8.16 hits in 96 PA, ignoring the walks and stuff, so Aguirre would be treated as if he had 10.16 hits in 100 at bats (except that I would work out the walks and stuff); he would hit about .102.

            Assuming a player has only at bats, suppose that a lifetime .150 hitter were to go 11-for-30 (.367).  Then I would count his season contribution as 22-for-60, and his "career" contribution as 6-for-40, which would make him a .280 hitter for the season.  That’s not "realistic" in one sense, because he is not really a .280 hitter, but it’s many times preferable to letting him hit .367. 

            But what if the player doesn’t have 100 career plate appearances?  In that case, I would fill in the missing plate appearances not with his career numbers, but with the average performance for players at that position in that season, or in that era.   Suppose that a pitcher went 2-for-3, and never had an at bat other than in that season.  Then I would double the 3 at bats to 6, and fill in the missing 94 plate appearances with the normal performance of a pitcher in that season.  Major league pitchers in 1967, for example, hit .138, so assuming this was 1967, he would get 94 plate appearances as a .138 hitter and 6 at bats as a .667 hitter, so he would come in at about .170.  
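The steps just described can be sketched in a few lines of code. This is a minimal sketch of the process as stated in the text, using hits and at bats only (walks "and stuff" ignored, as in the examples); the function name and parameters are mine, and the 100-at-bat floor is the one given above.

```python
def ballpark_average(season_h, season_ab, career_h, career_ab, position_avg):
    """Sketch of the fill-in process for unreliable batting lines.
    position_avg is the average for players at that position in that era,
    used only when the player lacks 100 career at bats."""
    # Step 1: double the season line.
    h, ab = 2 * season_h, 2 * season_ab
    # Step 2: every player's performance rests on at least 100 at bats.
    missing = max(0, 100 - ab)
    # Step 3: fill the missing at bats from the career record if it is
    # big enough, otherwise from the positional norm.
    if career_ab >= 100:
        fill_avg = career_h / career_ab
    else:
        fill_avg = position_avg
    h += missing * fill_avg
    ab += missing
    return h / ab

# Hank Aguirre 1967: 1-for-2, career .085 hitter (85-for-1000 stands in).
print(round(ballpark_average(1, 2, 85, 1000, 0.138), 3))   # about .102
# A pitcher who went 2-for-3 with no other career at bats, in 1967
# (pitchers hit .138 that season).
print(round(ballpark_average(2, 3, 0, 0, 0.138), 3))       # about .170
```

Both printed values match the worked examples in the surrounding paragraphs.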

            Details vary. . . sometimes I use a moving three-year average, but if you do that you still have to fill in the missing plate appearances for somebody like Aguirre, who only had 1 or 2 at bats per season.  And I see that I have made some error in describing this, because the way I described it, I would treat 50 plate appearances as completely reliable data, which I know that I don't do.  

            Anyway, I know that 100 plate appearances is not anything like enough to make a player’s performance "reliable", but we’ll live with it.  Jose Iglesias in 2020 hit .373 in 142 at bats, 150 plate appearances.  We all know perfectly well that Jose Iglesias is not a .373 hitter, but it’s 150 plate appearances.   We’ll live with it.  


Fireball Wenz

Even if you limit a player's plate appearances to the number he had in the season, it allows a table-game manager to save him for crucial situations - bases loaded in the ninth, now batting for George Brett, 1980 Dave Rader!


The question comes: do you want the simulation to focus on reflecting the TEAM's outcome (Rader hit what he did for the Red Sox in 1980), or the INDIVIDUAL's actual ability level (Rader never appeared in an MLB game after 1980, because everyone knew that, no matter his stats, he was no .320 hitter)?


I always focused on trying to estimate the individual's ACTUAL ability levels, not his one-season stats.


            Yeah, I think about it this way.  Are you old enough to remember primitive Xerox machines, the machines that we had until about 1980?  The copy quality was so low that if you made a copy of a copy, it was unreadable.  As far back as I can remember, copy quality was so bad that a third-generation copy of a document was nothing but a grayish/brown sheet of paper, with nothing distinguishable on it.   Even by 1990, a fifth-generation copy was usually illegible.  I remember, I think in the 1990s, that Xerox began advertising that their copiers were so good that a tenth-generation copy was still good.

            Well, a table-top game is a "copy" of a real game.   To take the Hank Aguirre issue again, if you assume that Aguirre should hit .500 on a card, your managers will sneak him into the game to get as many at bats as possible, so he would probably get 50, 60 at bats.  In that space, he might very well go 33-for-60, a .550 hitter. 

            If you then took the stats for THAT season and made a table game based on that—a copy of a copy—the Aguirre model might very well get 200 at bats and hit .527, thus being the Most Valuable player in the league.  In other words, a second-generation copy would not be a recognizable image of the original.  It would just be a gray piece of paper.

            This is a problem not just for Hank Aguirre, but for EVERY hitter, even those with 550 at bats.  A player who hit .300 in real life might very probably hit .258 in the simulation, while a player who has hit .240 might very well hit .270.  If you then took that first-generation copy and made a copy of THAT, the .300 hitter, based on his .258 season, might hit .220, while the .240 hitter who had the .270 average in the first copy might very well hit .320 (based on his .270 average) in the second copy.  
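The copy-of-a-copy effect is easy to demonstrate. This is a hypothetical illustration, not any particular game's engine: each at bat is modeled as an independent coin flip, a first-generation "copy" is simulated from a player's true .300 ability over 550 at bats, and then a second-generation copy treats that observed average as if it were his true ability.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate_season(true_avg, at_bats=550):
    """One simulated season: each at bat is a hit with probability true_avg."""
    hits = sum(random.random() < true_avg for _ in range(at_bats))
    return hits / at_bats

# First-generation copy: simulate a true .300 hitter.
gen1 = simulate_season(0.300)
# Second-generation copy: treat the OBSERVED average as true ability.
gen2 = simulate_season(gen1)
print(round(gen1, 3), round(gen2, 3))
```

Run this with different seeds and the second-generation average wanders farther from .300 than the first, because the luck in the first copy is baked into the second as if it were skill.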

            So that’s where we are; we don’t know how to make "copies" of baseball season in such a way that a second-generation copy would be a recognizable image of the original.   We haven’t figured that out yet. 

            My article was intended to be a step along that road.   It was intended to say that in 550 at bats, THIS is the reliability of the data (53%).  If you accept that as a valid estimate, perhaps you can use that information in building a better model.  


            I think Mr. Wenz wrote



That disconnect between what some replayers will accept and what other replayers expect is what DrArbiter was talking about when he wrote:


(a) have a program that simulates baseball, or

(b) have a program that outputs season statistics similar to real life.


            Well, yes, but it is actually worse than that, because NO simulation will actually output season statistics similar to real life, unless you play a million games.  


In regards to the Hank Aguirre situation... I was in a tournament once and in every game my opponent brought George Puccinelli in to pinch hit (16 AB, .563, 3 HR's in 1930). Ugh!


            Right; I remember one time about 45 years ago I made up a "George Puccinelli All Star team."   It included players like Luis Aloma, who in 1950 went 6-0 with a 1.82 ERA, and also hit .350.  I think Walt Bond was the first baseman, and maybe … was it Gil Coan who hit .500 in 11 games one year?


COMMENTS (3 Comments, most recent shown first)

As it happens, I did a little analysis of pitchers-as-hitters in 1970, wanting to see how well managers did in picking pitchers to pinch hit, and getting a sample unpolluted by the DH rule.

All pitchers that year hit .146. That's .150 for the starters, and .122 for the relievers. There's no real actual difference in hitting ability here; that gap is pretty much what you'd find when coming off the bench. Everyone hits worse off the bench. The relievers, per 500 PA, had 6 2B, 1 3B, and 3 HR; the starters were about the same, but had 9 2B. The relievers per 500 PA walked 28 times while the starters walked 21 times (which is probably meaningless), and struck out 201 times while the starters struck out 161.

For those pitchers used as pinch hitters one time, they overall hit .178, with an extra base hit spread per 500 PA of 11-3-3 and 31 BB, 116 SO. For the pitchers who were used as pinch hitters more than once (counting Jim Rooker's two plate appearances in left field in one game here), they hit .227, with an extra base hit spread of 15-2-7, 26 BB, and 123 SO per 500 PA. (These pitchers were Bob Gibson, Jim Kaat, Gary Peters, Mel Queen, Jim Rooker, and Clyde Wright.)

In short, you can use how often a pitcher batted at another position, including pinch hitter, as a proxy of hitting ability. For non-pitchers, how many plate appearances he had is a very good indicator of hitting ability. It's lost on another computer, but I took all the non-pitchers in 1985 and sorted them by number of PA, and found that hitting performance goes up pretty fast as PA goes up. I also compared that to the same batters in 1984 and 1986, and found that below about 400 PA, batters will underperform their talent level (meaning they didn't hit in a lesser sample and their managers gave up on them, but their combined totals of the surrounding seasons had better performance). At about 400 PA, the collective performance in 1985 was the same in each PA number grouping as in combined 1984+1986.

Add some extra PA to everyone, with results of those PAs based on playing time, to even things out.
4:12 PM Aug 30th
Bill addresses this point in "Hey Bill," arguing that it's more important to address the baseball issue not the statistical theory. But I do think that there’s a clear connection to the literature on how to determine a fair coin.
9:47 PM Aug 28th
It's disappointing you are dismissive of the Binomial question. That gives you the theoretical framework to answer all these questions without simulations.
9:40 AM Aug 28th