On the Reliability of Park Effects
The general question here is "To what extent can park effects be reliably measured, and by what approach?" This issue is triggered to some extent by the discussion about Jack Kralick, 1961. My question about Kralick is "What went wrong here?" One of the candidates for what went wrong is the park effects, so. . .well, I had this question rolling around in my head for years, and I happened to hit on an idea about how to test it, so I did the work. Some of the work; a cowboy’s work is never done.
First of all, there are two different things: Park Effects, and Park Factors. Park Effects have to do with the games played in the park, compared to the same team on the road. Park Factors have to do with how we apply that information in interpreting a player’s data. If a team scores and allows 700 runs in their home games and 800 on the road, the Park Effect will be .875 (7/8). But that doesn’t mean that their player’s overall stats have been cut by 1/8th, because the player doesn’t play ALL of his games in that park. Generally, if the Park Effect is .875, then the Park Factor is .94. There are lots of annoying details; if you bring these up in the comments section as if you were the only person who had thought of them, I promise to come to your house and trim your toenails with a chain saw.
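As a quick sketch of that arithmetic (the function names here are mine, purely illustrative, not from any standard library):

```python
def park_effect(home_runs, road_runs):
    """Runs scored plus allowed at home, divided by the same total on the road."""
    return home_runs / road_runs

def park_factor(effect):
    """A player plays only about half his games in his home park, so the
    adjustment applied to his stats sits roughly halfway between the
    park effect and 1.00."""
    return (effect + 1.0) / 2.0

effect = park_effect(700, 800)   # 0.875, the 7/8 example above
factor = park_factor(effect)     # 0.9375, which rounds to the .94 in the text
```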
Anyway, this is the idea I had to test the accuracy of park effect measurements. This is a quasi-scientific study, a scientific study as best I am capable of doing the science, so I want to warn you that you will have to read it two or three times if you want to follow the details, but I will state conclusions clearly enough that you can get those.
First, we take all teams within my organized data, and we figure park effects for them. Second, we split each team's games randomly into two groups, one of which we pretend is Home Games, the other of which we pretend is Road Games. To explain the underlying issue, I like to use the 1955 Boston Red Sox. In 1954, the Red Sox scored and allowed 741 runs in Fenway Park, as opposed to 687 on the road, a Park Effect of less than 10%. In 1956 they scored/allowed 796 in Fenway, 735 on the road, again a Park Effect of less than 10%. But in 1955 the Red Sox scored/allowed 865 runs in Fenway Park, and only 542 on the road, a Park Effect of 60%.
And do you know what caused that?
Nothing. Nothing happened that "caused" the measured park effect to explode in Fenway in 1955; it’s just a fluke. It happened that all of their big-scoring games that season were in Fenway; that’s all.
If you treat that as a real thing, as if Fenway in 1955 was REALLY increasing offensive levels by 60%, you will reach screwy conclusions. I know, because I have, in the past, signed on to some of those screwy conclusions. The one-year park effect in that case is so incongruous as to make the stat useless, for that team in that season. You’re better off not making a park adjustment, rather than adjusting the numbers based on the one-year data. But the general question here is "How common is that?" Are park effects based on one-year data GENERALLY useless, or merely useless in that specific case?
They’re generally useless. That’s the conclusion of the research; they’re generally useless. The research shows that one-year park effects have such a low level of reliability that you really should not use them. In my data I can make park effects for 1,614 teams. We will assume that, if there is no difference between how the team performs at home vs. on the road, the average park effect would be 1.000. It’s actually 1.0031; this happens because 720/700 + 700/720 is a number very slightly greater than 2.000. The difference between 1.0031 and 1.0000 is not important; for present purposes we will assume that the average is 1.00.
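That small upward bias is easy to verify; a ratio and its reciprocal always average out to slightly more than 1.00:

```python
# 720/700 + 700/720 exceeds 2.000, so averaging a park's ratio with its
# reciprocal lands slightly above 1.0000 -- hence the observed 1.0031.
avg = (720 / 700 + 700 / 720) / 2
print(avg)  # slightly above 1.0
```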
But if we split the games RANDOMLY into "home" and "road" categories and that number is not 1.00, that means that we are measuring something that is not actually there. We are measuring a park effect where none exists. If the measured effect in random data was the same as the measured effect in the "meaningful" data, then we would conclude either (a) that the effect being studied does not exist, or (b) that our manner of studying it is not working. This happens very often in sabermetrics. There are many, many studies in which the meaningful data appears to be no more meaningful than randomized, meaningless data.
This is not one of those situations. Park effects certainly exist. However, when we measure the standard deviation of park effects in one-season randomized data, we get. . .I get, in this study. . .I get .080643. When I measure the ACTUAL effect, what we assume to be the "real" effect, I get .125539. That means that the effects of randomization in creating the appearance of a park effect are on substantially the same scale as the actual effects.
If I have done the math right (which is questionable, because this is outside my area of expertise), what those two numbers show is that, in measuring park effects with a standard deviation of .125539, 41% of the variance being measured is just noise. Using Nate Silver's terms of the signal and the noise, what we have in one-year park effects is 59% signal, 41% noise.
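The arithmetic behind that 59-41 split, on my reading that the variances are additive (noise share = random variance divided by measured variance), can be sketched as:

```python
sd_measured = 0.125539  # SD of measured one-year park effects (real data)
sd_random   = 0.080643  # SD of park effects in the randomized split

# Treat the measured variance as signal variance plus noise variance.
noise_share  = sd_random ** 2 / sd_measured ** 2
signal_share = 1.0 - noise_share

print(f"signal {signal_share:.0%}, noise {noise_share:.0%}")
```

Run as written, that prints signal 59%, noise 41%.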
That’s too much noise. That’s basically useless data. I conclude that when you measure park effects based on one season of data, the measurement is so unreliable that you should never use it.
That was my first study of the issue. In my first study of the issue, I studied the "real" and "randomized" park effects for each real team for which I have organized data. In other words, I took the data for the 1936 New York Yankees, and every other team in my data, and split the games into two randomized groups, one of which I labeled "home" and one "road", but they were all the actual games from the 1936 Yankees, or whatever the team was. My second study was more theoretical.
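The random-split procedure in that first study can be sketched like this (the function and the sample season are illustrative, not the actual study code):

```python
import random

def randomized_park_effect(game_run_totals, rng):
    """Shuffle one season's games, then split them in half and pretend
    one half was played at 'home' and the other on the 'road'.  Any
    measured effect is pure noise, because the split is random."""
    games = list(game_run_totals)
    rng.shuffle(games)
    half = len(games) // 2
    return sum(games[:half]) / sum(games[half:])

# A made-up 154-game season of total runs per game
rng = random.Random(1936)
season = [rng.randint(2, 15) for _ in range(154)]
effect = randomized_park_effect(season, rng)  # near 1.0, but rarely exactly 1.0
```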
For my second study, I generated a full list of the frequency with which the total number of runs scored in a game was 1, 2, 3, 4, etc., up to 49, which is the highest number of runs scored in a game in the last 100 years. (I guess that 26-23 game was more than 100 years ago now; it was August 25, 1922.) For purposes of this study, I excluded tie games, since we don’t have tie games anymore.
In my data, 164,670 games, there were 3,487 games in which one run was scored (all of them 1-0 games), 3,251 games in which two runs were scored (2-0 games, since ties are excluded), and 10,709 games in which three runs were scored, games which ended either 2-1 or 3-0. For whatever it is worth, this is the full chart of the frequency with which each number of runs in a game has occurred.
Runs    Games    Pct
  49        1    0.000006073
  45        1    0.000006073
  36        6    0.000036437
  35        9    0.000054655
  34        9    0.000054655
  33       17    0.000103237
  32       12    0.000072873
  31       25    0.000151819
  30       33    0.000200401
  29       51    0.000309710
  28       56    0.000340074
  27      106    0.000643712
  26      134    0.000813749
  25      220    0.001336005
  24      270    0.001639643
  23      459    0.002787393
  22      536    0.003254995
  21      907    0.005507986
  20     1056    0.006412826
  19     1782    0.010821643
  18     1983    0.012042266
  17     3316    0.020137244
  16     3241    0.019681788
  15     5423    0.032932532
  14     5426    0.032950750
  13     8542    0.051873444
  12     7772    0.047197425
  11    12621    0.076644197
  10    10435    0.063369163
   9    16200    0.098378575
   8    12168    0.073893241
   7    18321    0.111258881
   6    11576    0.070298172
   5    16320    0.099107306
   4     8189    0.049729763
   3    10709    0.065033096
   2     3251    0.019742515
   1     3487    0.021175685
I then set up a system which randomly generated a number of runs for a game, with each number of runs matching the frequency with which it has occurred in real life. I then used that random generator to create 40,000 team/seasons worth of data, with 81 home games and 81 road games in each season. 40,000 team/seasons is more team/seasons than there are in the actual history of baseball.
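A minimal sketch of that generator, using a small made-up slice of the distribution rather than the full chart above:

```python
import random

# Illustrative stand-in for the empirical chart: a few runs-per-game
# values and rough frequencies (the real study uses the full table).
runs_values = [3, 5, 7, 9, 11, 13]
frequencies = [0.065, 0.099, 0.111, 0.098, 0.077, 0.052]

def simulate_season(rng):
    """81 'home' and 81 'road' games drawn from the same distribution,
    so the true park effect is exactly 1.0 by construction."""
    home = rng.choices(runs_values, weights=frequencies, k=81)
    road = rng.choices(runs_values, weights=frequencies, k=81)
    return sum(home) / sum(road)

rng = random.Random(40000)
effects = [simulate_season(rng) for _ in range(10_000)]
mean = sum(effects) / len(effects)  # very close to 1.0; the spread is the noise
```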
The ACTUAL park effect here is zero, so if we have a measured park effect, we know that it is "noise" rather than signal. But measured in one-year groups, the standard deviation of park effects was .081479. This is essentially the same number I got in the first study, which was .080643.
But with the theoretical data (Study 2), it is very easy to measure the noise in the data if the park effects are measured in one-year groups, three-year groups, five-year groups, etc. These numbers represent what the standard deviation of park effects would be, measured over a period of seasons, if there was no real park effect:
One year      .081479
Three year    .046240
Five year     .035831
Seven year    .030397
Nine year     .026939
Eleven year   .024448
In other words, the random error in the measurement (noise) becomes less significant as the scale of the data increases, which is obvious anyway, but without running the simulations, we don’t know what the expected random error would be. Even if you study park effects over an 11-year period, you’ve still got a 2.4% expected error simply due to the fact that the data sample is not large enough for everything to even out.
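For what it's worth, those simulated errors track the familiar 1/√n shrinkage of independent data quite closely; a sketch of the comparison:

```python
import math

one_year_noise = 0.081479  # simulated one-year standard deviation

# Predict the multi-year noise by dividing the one-year figure by
# sqrt(years); the simulated values above run just slightly below
# these predictions.
for years in (1, 3, 5, 7, 9, 11):
    print(years, round(one_year_noise / math.sqrt(years), 6))
```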
I then contrasted
(a) the measured park effects over multi-year periods, numbers taken from the first study, with
(b) the expected random errors, numbers taken from the second study.
The standard deviation of measured park effects [(a) above] also decreases steadily as more seasons are taken into consideration. I measure the standard deviation of park effects as:
.1255, if measured in a one-year estimate,
.1024, if measured in three-year cycles,
.0949, if measured in five-year cycles,
.0911, if measured in seven-year increments,
.0883, if measured in nine-season groups of teams, and
.0863, if measured in eleven-year cycles.
There is a hornet's nest of methodological issues that arise in a study like this. In any eleven-year period, a team may have changed something about its park, and it is nearly impossible to track all of those changes. At the extreme, if you regard every fence move as a re-start, then you would lose almost all of the data, since, with an eleven-year window, you lose five seasons on either side of a re-start. Even if ONE park doesn't change, some other park within the league will change, which changes the park-to-league relationship which is at the core of the measurement. Combining data from two different streams is a little bit problematic, although not really; we all do that every day without realizing we are doing it.
With all of that, the eleven-year measurements do appear to be very good data when and where they are available. You get solid, sensible measurements that are almost never out of line with reasonable expectations. There are a lot of places where you just can't make an eleven-year measurement; for example, you can't make an eleven-year estimate for any part of Sandy Koufax's career. For his early years in Brooklyn there's no 11-year measurement, because the Dodgers were within a few years of leaving Ebbets Field; there's no 11-year assessment for the Los Angeles Coliseum, because the Dodgers played there for only four years; and you can't make an 11-year assessment for Dodger Stadium until 1962-1972, which centers on 1967. I'm just saying, when and where you CAN make an 11-year measurement, that appears to me to be the way to go.
Anyway, here are the conclusions. Here's the math, as best I am capable of figuring it.
In a one-year measurement, the standard deviation of park effects is .1255; the expected error in the measurement is .0806. These numbers suggest that the resulting data is about 59% signal, about 41% noise. 59-41.
In a three-year measurement, the standard deviation of park effects is .1024; the expected error in the measurement is .0462. These numbers suggest that the resulting data is about 80% signal, about 20% noise. 80-20.
In a five-year measurement, the standard deviation of park effects is .0949; the expected error in the measurement is .0358. These numbers suggest that the resulting data is about 86% signal, about 14% noise. 86-14.
In a seven-year measurement, the standard deviation of park effects is .0911; the expected error in the measurement is .0304. These numbers suggest that the resulting data is about 89% signal, about 11% noise. 89-11.
In a nine-year measurement, the standard deviation of park effects is .0883; the expected error in the measurement is .0269. These numbers suggest that the resulting data is about 91% signal, about 9% noise. 91-9.
In an eleven-year measurement, the standard deviation of park effects is .0863; the expected error in the measurement is .0244. These numbers suggest that the resulting data is about 92% signal, about 8% noise. 92-8.
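All six of those splits come out of the same small calculation; this sketch reproduces them from the standard deviations listed above:

```python
# Standard deviations from the text: measured effects (study 1) and
# expected random error (study 2), keyed by years of data.
measured   = {1: .1255, 3: .1024, 5: .0949, 7: .0911, 9: .0883, 11: .0863}
random_err = {1: .0806, 3: .0462, 5: .0358, 7: .0304, 9: .0269, 11: .0244}

noise_share = {y: random_err[y] ** 2 / measured[y] ** 2 for y in measured}
for years, noise in noise_share.items():
    print(f"{years:2d}-year: {1 - noise:.0%} signal, {noise:.0%} noise")
```

Run as written, the loop reproduces the 59-41 through 92-8 splits above.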
Tom Tango has also written numerous times about issues related to this. One such post, from Tuesday, January 1, 2019, is "How much random variation is there in park factors?"
Thank you for reading.