The DiMaggio Problem

August 27, 2007
            What is the chance that some player in 2008 will break Joe DiMaggio’s consecutive-game hitting record?   What is the chance that someone in your lifetime will break the record?   Is it realistically possible to do this, or is the record just completely out of range in the modern game?
            Of course, I have studied this problem a number of times before, and others have addressed it countless times. As time moves, however, our ability to study this issue, or any issue, evolves relentlessly and with stunning rapidity. The conclusions we reached five years ago are no longer near to being the best that we can do.
            Apologies in advance, I will have to lead you down a primrose path here, a long march toward a false conclusion.   I think you’ll understand; I don’t know any way to get you to see the problem other than to march you up against it.   When we hit it, we’ll back off and go down the better path.
            OK, as a starting point, what is the chance that a given player will break the record. . .let’s say David Wright, 2005.    His chance of breaking the record depends on the percentage of games in which he hits safely.   David Wright in 2005 played in 160 games and hit safely in 117 of them, which is .73125.    His chance of hitting safely in 57 straight games, then, would be .73125 to the power 57, which is .000 000 017 855, or one in 56 million.
            David Wright has not one 57-game stretch during the season, but 104 of them (in a 160-game season.)   The chance that he will hit safely in 57 straight games, then, is .000 000 017 855 times 104.
            Actually, it isn’t exactly that. The reason this doesn’t quite work can be illustrated by studying the chance that he would hit safely in ten straight games. The chance that he would hit safely in any given ten-game stretch is .0437, and he has 151 ten-game stretches during the season to take his shot. His chance of hitting safely in ten straight would then be .0437 times 151, which is 6.60. Obviously his chance of hitting in ten straight games is not 660%.
            We have to stay clear of the possibility of multiple streaks. We do that by figuring the chance that he WON’T hit safely in 57 straight games, raising that to the 104th power, and subtracting that from one. Of course, when dealing with 57-game hitting streaks the possibility of multiple streaks is negligible, so we could actually ignore that and just pretend that multiplying by 104 does it. But we won’t; we’ll play it straight. Since the chance that he WOULD hit safely in 57 straight games is .000 000 017 855, the chance that he would NOT is .999 999 982 145. That number raised to the 104th power is .999 998 1431. Subtracted from one, that is .000 001 8569, which is also, not coincidentally, .000 000 017 855 times 104, but we don’t mind the extra work because the computer does it for us. Anyway, Wright’s chance of breaking the record in 2005 was .000 001 8569, or one in 539,000, more or less.
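            The arithmetic above can be checked with a short Python sketch. The function name is only a label for illustration, and the calculation treats every 57-game window as an independent trial, which is the simplification used in the text (the windows actually overlap).

```python
def streak_chance(hit_game_rate: float, streak_len: int, games: int) -> float:
    """Chance of at least one streak, treating each window as an independent trial."""
    windows = games - streak_len + 1           # 160 - 57 + 1 = 104 for Wright
    one_window = hit_game_rate ** streak_len   # chance a given window is all hit games
    return 1.0 - (1.0 - one_window) ** windows


rate = 117 / 160  # David Wright, 2005: hit safely in 117 of 160 games

print(rate ** 57)                     # about 1.7855e-08, one in 56 million
print(streak_chance(rate, 57, 160))   # about 1.8569e-06, one in 539,000
print(rate ** 10 * 151)               # about 6.60: an expected count, not a probability
print(streak_chance(rate, 10, 160))   # a genuine probability, so it stays below 1
```

            The third line shows why simple multiplication fails for short streaks: it counts expected streaks, and that count can run well past one.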
            This gets us as far as David Wright, 2005, but in order to find that number, I had to go to Retrosheet and find out how many games David Wright had hit in during the 2005 season.    Trying to generalize from that to all players, we have a problem, which is that we don’t know how many games each player has hit safely in.   We’ll have to devise some way to estimate that.
            To estimate that, I figured actual data (actual games in which hit) for 100 players, using Retrosheet, and then looked for a formula to predict the data.   I tried a dozen different approaches and tinkered with each one.   The method that worked best was based on the players’ hits per game:
            1) Divide the player’s hits by five times his games played.
            2) Subtract that from 1.00.
            3) Raise that to the power 4.80. That is the estimated percentage of games in which he goes hitless.
            4) Subtract that from 1.00.
            That will be the percentage of games in which he hits.
            Tony Gwynn in 1994 had 165 hits in 110 games. Dividing 165 by 550 yields .300. Subtracting that from 1.00 gives us .700. Raising that to the power 4.80 gives us .180. Subtracting that from 1.00 gives us .820. Multiplying .820 times 110 gives us 90. We would expect Tony Gwynn to hit safely in 90 games in 1994, based on the fact that he had 165 hits in 110 games.
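            Here is a minimal Python sketch of that estimate, using the Gwynn numbers. Note that .700 raised to the power 4.80 comes to about .180, which is the estimated share of hitless games, so the sketch subtracts that from one to reach the .820 figure. The function and parameter names are assumptions made for illustration.

```python
def est_hit_game_rate(hits: int, games: int,
                      divisor: float = 5.0, exponent: float = 4.80) -> float:
    """Estimate the share of games with at least one hit from hits per game."""
    per_chance = hits / (divisor * games)      # 165 / 550 = .300 for Gwynn
    hitless = (1.0 - per_chance) ** exponent   # .700 ** 4.80, about .180
    return 1.0 - hitless                       # about .820


gwynn_rate = est_hit_game_rate(165, 110)
print(round(gwynn_rate, 3))
print(round(gwynn_rate * 110))   # 90 expected games with a hit
```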
            And, in fact, Tony Gwynn did hit safely in 90 of the 110 games in 1994; he was one of the 100 players that I used to study that problem. Of course, we’re not always right, nor even usually right. Our estimate for David Wright was that he would hit safely in 111 games, whereas in fact, as I said earlier, he actually hit safely in 117. But the formula generally works well, with an average error of slightly less than three games. The worst estimate for any player in my study was for Matty Alou, 1964. My method estimated that he would hit safely in 50 games, but in fact he hit safely in only 40. On the other hand, 36 of the 100 estimates were within one game of the actual number.
            So we can estimate, based on hits per game, the percentage of a player’s games in which he has hit safely. From that, we can estimate the chance that he would hit safely in 57 straight games.
            I figured these estimates for every player in baseball history who played in at least 57 games and had at least 57 hits during a season (since obviously, if you don’t get 57 hits or play 57 games, you can’t hit in 57 straight). The conclusion. . .and here I will warn you that we are near the point at which I realized that something was not right. . .the conclusion was that the most likely player ever to have hit in 57 straight games was Hugh Duffy in 1894. Duffy had 237 hits in 125 games, which suggests that he probably hit safely in approximately 90% of his games (.898 570). A player hitting safely in 90% of his games has a .002 251 chance of hitting safely in any given 57 straight games, and Duffy had 69 strings of 57 games. Thus, I estimated, Duffy’s chance of hitting safely in 57 straight games was .144, or one in seven.
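            Chaining the two steps together reproduces the Duffy figures. This is a rough Python sketch under the same assumptions as the text (the hits-per-game estimate, and independent 57-game windows); the function names are illustrative only.

```python
def est_hit_game_rate(hits: int, games: int) -> float:
    """Share of games with a hit: 1 - (1 - hits / (5 * games)) ** 4.80."""
    return 1.0 - (1.0 - hits / (5.0 * games)) ** 4.80


def streak_chance(rate: float, streak_len: int, games: int) -> float:
    """Chance of at least one streak, treating each window as independent."""
    windows = games - streak_len + 1           # 125 - 57 + 1 = 69 for Duffy
    return 1.0 - (1.0 - rate ** streak_len) ** windows


duffy_rate = est_hit_game_rate(237, 125)       # Hugh Duffy, 1894
print(round(duffy_rate, 4))                    # about .8986
print(round(duffy_rate ** 57, 6))              # about .002251 for one window
print(round(streak_chance(duffy_rate, 57, 125), 3))   # about .144, one in seven
```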
            This is by far the highest number in baseball history.   I figured the estimates for every player in baseball history, reaching the conclusion that there were an expected 1.26 57-game hitting streaks in baseball history.
            Had I stopped at this point I might never have realized that there was a problem with my analysis.   However, while I was doing this, I also figured the chance that each player would have a 40-game hitting streak, a 45-game hitting streak, etc.   This led me to the conclusion that there should have been eleven 40-game hitting streaks since 1900. In fact, there have been only four.
            Hmm. It is possible that that’s a natural discrepancy, but it’s certainly a worrisome one. Early on in this process I came to a fork in the road, and I may have picked the wrong tine. It has to do with this problem. Suppose that there are a very large number of balls in a barrel, half of them red and half of them blue. If you pick out two balls, what is the chance that both of them will be blue?
            It’s one in four: one in two, times one in two.
            Suppose, however, that there are six balls in a bag, three of them red and three of them blue.   If you pick out two balls, what is the chance that both of them will be blue?
            It’s one in five: 3/6 times 2/5, which is 6/30, or one in five. After you have chosen the first ball the odds are no longer 3 in 6, but 2 in 5. Choosing a set of events from a finite group is different from choosing from a random (unlimited) set.
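            The two cases can be worked out exactly with Python’s fractions module. In this sketch the very large barrel is modeled as drawing with replacement and the six-ball bag as drawing without replacement; 3/6 times 2/5 comes to 6/30, which is one in five. The function names are made up for illustration.

```python
from fractions import Fraction


def both_blue_unlimited(blue: int, total: int) -> Fraction:
    """Very large barrel: the second draw has the same odds as the first."""
    return Fraction(blue, total) ** 2


def both_blue_limited(blue: int, total: int) -> Fraction:
    """Small bag: one blue ball is gone before the second draw."""
    return Fraction(blue, total) * Fraction(blue - 1, total - 1)


print(both_blue_unlimited(1, 2))   # 1/4
print(both_blue_limited(3, 6))     # 1/5
```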
            In thinking this through initially, I assumed that the unlimited (or infinite) set of games more accurately represented the real situation, because a hitter’s season hit total is not fixed in advance. However, in working with real batters’ seasons, we are using data that is limited after the fact. I’m still fuzzy about the logic, but. . .a real hitter’s chance can only be figured after the fact. The after-the-fact numbers are NOT a true expression of a player’s underlying skill; they embody the randomly arranged events of the past. Tony Gwynn in 1994 hit .394, but he wasn’t really a .394 hitter. He was a .350 hitter having some good breaks. By assuming that he was truly a .394 hitter, I’m actually offering a kind of “invisible speculation” that he might get lucky from THAT base, and might hit .430. This is what causes our estimate to misfire.
            Tony Gwynn’s chance of hitting safely in 57 straight games in 1994 can also be figured as
            Game 1            90/110 times
            Game 2            89/109 times
            Game 3            88/108 times
            Game 4            87/107 times . . . and so on until
            Game 57          34/54.
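            The limited-set product can be sketched in Python as well. This assumes Gwynn’s 90 hit games are fixed and randomly arranged among his 110 games, and it figures the chance for one particular 57-game window only; the function name is an illustrative label. The result is compared with the unlimited-set figure for the same window.

```python
def limited_set_window_chance(hit_games: int, games: int, streak_len: int) -> float:
    """Multiply 90/110 * 89/109 * ... down to 34/54 (for Gwynn, 1994)."""
    chance = 1.0
    for k in range(streak_len):
        chance *= (hit_games - k) / (games - k)   # one hit game and one game used up
    return chance


limited = limited_set_window_chance(90, 110, 57)
unlimited = (90 / 110) ** 57   # same window, every game at 90/110

print(limited)
print(unlimited)
print(limited < unlimited)     # True: every later factor is smaller than 90/110
```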
            But here again we have problems.   Let’s take perhaps the most likely player since 1900 to have had a long hitting streak, who would be George Sisler in 1920 or 1922.   This method would estimate Sisler’s chance of having a 40-game hitting streak at only .000 740 in 1920 (one in 1,350), and only one in 600 in 1922.   Intuitively, I know that that isn’t right, and also I know that Sisler did have a long hitting streak in one of those seasons.   If that was the right method it is unlikely that anyone would ever have had a 40-game hitting streak.  
            There are two options here approved by statistical analysts—neither of which works.   The “limited set” option doesn’t work because it assumes that there are a fixed number of games in which a player is allowed to hit, which is untrue and causes problems.   But the “unlimited set” option doesn’t work, either, because it assumes that seasons in which a player has exceeded his real level of ability by 40 or 50 points of batting average represent his true level of ability, and thus allow him to randomly outperform THAT in a 40 or 57-game stretch, thus creating an unrealistically high estimate of the likelihood of a player hitting .450 over a stretch of 40 or 57 games.
            What we need is a method which doesn’t over-estimate a player’s ability based on his season’s record, but also doesn’t limit him unrealistically.   What about using a player’s CAREER record?
            A problem here is that a player’s career record tends to understate his ability at its best.   Stan Musial hit .331 in his career, but through 1951 he had hit .347.    Wade Boggs hit .328 in his career, but through his first 1000-plus games he hit .356.    If we use career numbers we’re going to wind up understating the player’s real ability at his best.
            OK, here’s what I finally decided to do. I figured each player’s chance of having a long streak in each season by the “unlimited set” method, but cutting the hits-to-games ratio a little bit by mixing in the player’s career hits-to-games ratio. I used a hits-to-games ratio which is a compromise between the player’s data in that particular season and in his career, a compromise weighted 2-to-1 toward the season. In other words, since Tony Gwynn in 1994 averaged 1.500 hits per game, but for his career he averaged 1.2873 hits per game, I assigned him a hits per game average of 1.4291: two-thirds of his actual season average, and one-third of his career average. I based the calculations on that number, rather than on his undiluted statistics. I also decided arbitrarily to exclude from the study anybody who had fewer than 75 hits in a season, since I don’t think it is realistic to talk about a player hitting in 57 straight games with fewer than 75 hits in a season.
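            The 2-to-1 blend is simple enough to sketch in Python. The function names are assumptions for illustration, and the last step reuses the hits-per-game estimate described earlier (one minus the quantity one minus hits per game over five, raised to the power 4.80).

```python
def blended_hits_per_game(season_hpg: float, career_hpg: float,
                          season_weight: float = 2 / 3) -> float:
    """Weight the season 2-to-1 against the career."""
    return season_weight * season_hpg + (1.0 - season_weight) * career_hpg


def hit_game_rate_from_hpg(hits_per_game: float) -> float:
    """Estimated share of games with a hit, from hits per game."""
    return 1.0 - (1.0 - hits_per_game / 5.0) ** 4.80


gwynn_hpg = blended_hits_per_game(165 / 110, 1.2873)   # Gwynn, 1994 and career
print(round(gwynn_hpg, 4))                             # 1.4291

# The blended rate is lower than the rate from the undiluted 1994 season.
print(hit_game_rate_from_hpg(gwynn_hpg) < hit_game_rate_from_hpg(1.5))   # True
```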
            Using this set of assumptions, we would expect that there would be 4.2 forty-game hitting streaks in major league baseball since 1900, and another 4.6 in the 19th century. . . .reasonably close to actual major league history.   These, by my calculations, were the ten players who were most likely to have a forty-game hitting streak:
            Player, Year                  Expectation    Which is
            1. Ross Barnes, 1876          .257           One in Four
            2. Hugh Duffy, 1894           .223           One in Four
            3. Willie Keeler, 1897        .220           One in Five
            4. Jesse Burkett, 1896        .173           One in Six
            5. Tip O’Neill, 1887          .139           One in Seven
            6. Sam Thompson, 1895         .135           One in Seven
            7. Sam Thompson, 1894         .120           One in Eight
            8. George Sisler, 1922        .117           One in Nine
            9. Nap Lajoie, 1901           .111           One in Nine
            10. Ed Delahanty, 1894        .110           One in Nine
            Most of those are 19th century players.    The most likely players to have done it since 1900 were:
            Player, Year                  Expectation    Which is
            1. George Sisler, 1922        .117           One in Nine
            2. Nap Lajoie, 1901           .111           One in Nine
            3. Ty Cobb, 1911              .101           One in Ten
            4. George Sisler, 1920        .085           One in Twelve
            5. Ichiro Suzuki, 2004        .079           One in Thirteen
            6. Al Simmons, 1925           .064           One in Sixteen
            7. Ty Cobb, 1912              .053           One in Nineteen
            8. Bill Terry, 1930           .052           One in Nineteen
            9. Rogers Hornsby, 1922       .047           One in Twenty-One
            10. Jesse Burkett, 1901       .045           One in Twenty-Two
            I think several of those players actually did have long streaks, but anyway. . .since nine of those seasons came in 1930 or earlier, the players with the best chance to have done it after 1930 are:
            Player, Year                  Expectation    Which is
            1. Ichiro Suzuki, 2004        .079           One in 13
            2. Ichiro Suzuki, 2001        .040           One in 25
            3. Al Simmons, 1931           .025           One in 41
            4. Rod Carew, 1977            .019           One in 53
            5. Joe Medwick, 1937          .017           One in 62
            6. Kirby Puckett, 1988        .014           One in 69
            7. Earl Averill, 1936         .014           One in 70
            8. Tony Gwynn, 1997           .013           One in 80
            9. Joe DiMaggio, 1936         .012           One in 81
            10. Ichiro Suzuki, 2006       .011           One in 88
            Moving now to the chance of a 57-game streak. . ..The most likely player ever to have had a 57-game streak was still Hugh Duffy in 1894, with a chance now estimated at .0168 or one in 60.   The top ten are exactly the same as the top ten on the 40-game list, and the other lists are generally the same.    Among players post-1900 the leader is Sisler, 1922, at One in 168.   Joe DiMaggio ranks 48th on the list (1936), 91st (1939), 109th (1937), 172nd (1941) and 248th (1940), so he was a player who was a very good candidate to hold this record, albeit not as good as Sisler or Cobb.
            By this method, what would we expect to be the all-time record for consecutive games with a hit? We would expect it to be 50, possibly 51.    The expected number of players with a streak this long is closest to 1.000 at 50 games. 
            Returning now to the question with which I began, which is:   What is the chance that somebody will hit in 57 straight games in 2008, or 2009, or 2010?
            This is being written during the 2007 season, so the numbers there are moving, and we can’t really study 2007 yet.   But the best answer I can give to that question is that it is likely that the chance that it will be broken in 2008 is similar to the chance that it would have been broken in 2006, or 2005, or 2004.   We can form a statement of that chance by adding together all of the percentages for all of the hitters from those seasons. Doing that, the results are:
            Year 2000        .000 432          One in 2314
            Year 2001        .001 251          One in 800
            Year 2002        .000 156          One in 6423
            Year 2003        .000 225          One in 4441
            Year 2004        .003 307          One in 302
            Year 2005        .000 146          One in 6845
            Year 2006        .000 390          One in 2561
            The chances go up and down depending on whether someone like Ichiro has a great year.   Essentially however:
            1) The chance that the record will be broken in 2008 is probably less than one in a thousand, but almost certainly greater than one in five thousand. 
            2) The chance that the record will be broken in your lifetime depends on your age and on how the game changes over the next few years, but it is not unreasonable to think that many of you may live to see the record broken. The chances of the record being broken are relatively volatile from year to year, and are in a range where a small change in the way major league baseball is played (a change toward more singles hitters and fewer strikeout-prone power hitters) could very plausibly cause the record to become endangered, although it is certainly not endangered at the present time.
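            To put rough numbers on the lifetime question, here is a small Python sketch. It assumes the annual chance stays constant and that seasons are independent, which is a simplification, since the yearly figures above are volatile and the game may change. The spans of 25, 50 and 75 years are hypothetical, chosen only for illustration.

```python
def chance_within_years(annual_chance: float, years: int) -> float:
    """Chance of at least one record-breaking season, at a constant annual rate."""
    return 1.0 - (1.0 - annual_chance) ** years


# The two bounds suggested above: one in a thousand and one in five thousand.
for annual in (1 / 1000, 1 / 5000):
    for years in (25, 50, 75):          # hypothetical remaining lifetimes
        print(round(1 / annual), years, round(chance_within_years(annual, years), 4))
```

            At a constant one in a thousand, fifty seasons come to a little under 5%, so the lifetime chance depends heavily on whether the game moves in a direction that raises the annual figure.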

COMMENTS (1 Comment)

Ok...I'm just starting to get into advanced metrics. Not because I'm good at math, but because I'm interested in baseball and intuitively understand that there are marked limitations to the standard data set typically employed by baseball fans. I signed up, here, because "Bill James" is the biggest name in the field.

And I can honestly say I've entered a world which stretches beyond the limitations imposed by my own intelligence. I think I was able to follow that, mostly, but I couldn't hope to explain it to someone else.

It's great seeing how people like you can understand the game. I think I need a nap, however. Reading that took a lot out of me..
9:53 AM Apr 28th