On the Reliability of Park Effects
The general question here is "To what extent can park effects be reliably measured, and by what approach?" This issue is triggered to some extent by the discussion about Jack Kralick, 1961. My question about Kralick is "What went wrong here?" One of the candidates for what went wrong is the park effects, so. . .well, I had this question rolling around in my head for years, and I happened to hit on an idea about how to test it, so I did the work. Some of the work; a cowboy’s work is never done.
First of all, there are two different things: Park Effects, and Park Factors. Park Effects have to do with the games played in the park, compared to the same team on the road. Park Factors have to do with how we apply that information in interpreting a player’s data. If a team scores and allows 700 runs in their home games and 800 on the road, the Park Effect will be .875 (7/8). But that doesn’t mean that their player’s overall stats have been cut by 1/8th, because the player doesn’t play ALL of his games in that park. Generally, if the Park Effect is .875, then the Park Factor is .94. There are lots of annoying details; if you bring these up in the comments section as if you were the only person who had thought of them, I promise to come to your house and trim your toenails with a chain saw.
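As a quick sketch of that arithmetic (the function names here are mine, purely illustrative, not from any standard library):

```python
def park_effect(home_runs, road_runs):
    """Runs scored plus allowed at home, divided by the same total on the road."""
    return home_runs / road_runs

def park_factor(effect):
    """A player plays only about half his games in his home park, so the
    adjustment applied to his stats sits roughly halfway between the
    park effect and 1.00."""
    return (effect + 1.0) / 2.0

effect = park_effect(700, 800)   # 0.875, the 7/8 example above
factor = park_factor(effect)     # 0.9375, which rounds to the .94 in the text
```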
Anyway, this is the idea I had to test the accuracy of park effect measurements. This is a quasi-scientific study, a scientific study as best I am capable of doing the science, so I want to warn you that you will have to read it two or three times if you want to follow the details, but I will state conclusions clearly enough that you can get those.
First, we take all teams within my organized data, and we figure park effects for them. Second, we split each team's games randomly into two groups, one of which we pretend is Home Games, the other of which we pretend is Road Games. To explain the underlying issue, I like to use the 1955 Boston Red Sox. In 1954, the Red Sox scored and allowed 741 runs in Fenway Park, as opposed to 687 on the road, a Park Effect of less than 10%. In 1956 they scored/allowed 796 in Fenway, 735 on the road, again a Park Effect of less than 10%. But in 1955 the Red Sox scored/allowed 865 runs in Fenway Park, and only 542 on the road, a Park Effect of 60%.
And do you know what caused that?
Nothing. Nothing happened that "caused" the measured park effect to explode in Fenway in 1955; it’s just a fluke. It happened that all of their big-scoring games that season were in Fenway; that’s all.
If you treat that as a real thing, as if Fenway in 1955 was REALLY increasing offensive levels by 60%, you will reach screwy conclusions. I know, because I have, in the past, signed on to some of those screwy conclusions. The one-year park effect in that case is so incongruous as to make the stat useless, for that team in that season. You’re better off not making a park adjustment, rather than adjusting the numbers based on the one-year data. But the general question here is "How common is that?" Are park effects based on one-year data GENERALLY useless, or merely useless in that specific case?
They’re generally useless. That’s the conclusion of the research; they’re generally useless. The research shows that one-year park effects have such a low level of reliability that you really should not use them. In my data I can make park effects for 1,614 teams. We will assume that, if there is no difference between how the team performs at home vs. on the road, the average park effect would be 1.000. It’s actually 1.0031; this happens because 720/700 + 700/720 is a number very slightly greater than 2.000. The difference between 1.0031 and 1.0000 is not important; for present purposes we will assume that the average is 1.00.
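That small upward bias is easy to verify; a ratio and its reciprocal always average out to slightly more than 1.00:

```python
# 720/700 + 700/720 exceeds 2.000, so averaging a park's ratio with its
# reciprocal lands slightly above 1.0000 -- hence the observed 1.0031.
avg = (720 / 700 + 700 / 720) / 2
print(avg)  # slightly above 1.0
```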
But if we split the games RANDOMLY into "home" and "road" categories and that number is not 1.00, that means that we are measuring something that is not actually there. We are measuring a park effect where none exists. If the measured effect in random data was the same as the measured effect in the "meaningful" data, then we would conclude either (a) that the effect being studied does not exist, or (b) that our manner of studying it is not working. This happens very often in sabermetrics. There are many, many studies in which the meaningful data appears to be no more meaningful than randomized, meaningless data.
This is not one of those situations. Park effects certainly exist. However, when we measure the standard deviation of park effects in one-season randomized data, we get. . .I get, in this study. . .I get .080643. When I measure the ACTUAL effect, what we assume to be the "real" effect, I get .125539. That means that the effects of randomization in creating the appearance of a park effect are on substantially the same scale as the actual effects.
If I have done the math right (which is questionable, because this is outside my area of expertise), what those two numbers show is that, in measuring park effects with a standard deviation of .125539, 41% of the variance being measured is just noise. Using Nate Silver's terms of the signal and the noise, what we have in one-year park effects is 59% signal, 41% noise.
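The arithmetic behind that 59-41 split, on my reading that the variances are additive (noise share = random variance divided by measured variance), can be sketched as:

```python
sd_measured = 0.125539  # SD of measured one-year park effects (real data)
sd_random   = 0.080643  # SD of park effects in the randomized split

# Treat the measured variance as signal variance plus noise variance.
noise_share  = sd_random ** 2 / sd_measured ** 2
signal_share = 1.0 - noise_share

print(f"signal {signal_share:.0%}, noise {noise_share:.0%}")
```

Run as written, that prints signal 59%, noise 41%.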
That’s too much noise. That’s basically useless data. I conclude that when you measure park effects based on one season of data, the measurement is so unreliable that you should never use it.
That was my first study of the issue. In my first study of the issue, I studied the "real" and "randomized" park effects for each real team for which I have organized data. In other words, I took the data for the 1936 New York Yankees, and every other team in my data, and split the games into two randomized groups, one of which I labeled "home" and one "road", but they were all the actual games from the 1936 Yankees, or whatever the team was. My second study was more theoretical.
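The random-split procedure in that first study can be sketched like this (the function and the sample season are illustrative, not the actual study code):

```python
import random

def randomized_park_effect(game_run_totals, rng):
    """Shuffle one season's games, then split them in half and pretend
    one half was played at 'home' and the other on the 'road'.  Any
    measured effect is pure noise, because the split is random."""
    games = list(game_run_totals)
    rng.shuffle(games)
    half = len(games) // 2
    return sum(games[:half]) / sum(games[half:])

# A made-up 154-game season of total runs per game
rng = random.Random(1936)
season = [rng.randint(2, 15) for _ in range(154)]
effect = randomized_park_effect(season, rng)  # near 1.0, but rarely exactly 1.0
```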
For my second study, I generated a full list of the frequency with which the total number of runs scored in a game was 1, 2, 3, 4, etc., up to 49, which is the highest number of runs scored in a game in the last 100 years. (I guess that 26-23 game was more than 100 years ago now; it was August 25, 1922.) For purposes of this study, I excluded tie games, since we don’t have tie games anymore.
In my data, 164,670 games, there were 3,487 games in which one run was scored (all of them 1-0 games), 3,251 games in which two runs were scored (2-0 games, since ties are excluded), and 10,709 games in which three runs were scored, games which ended either 2-1 or 3-0. For whatever it is worth, this is the full chart of the frequency with which each number of runs in a game has occurred.
Runs    Games    Pct
  49        1    0.000006073
  45        1    0.000006073
  36        6    0.000036437
  35        9    0.000054655
  34        9    0.000054655
  33       17    0.000103237
  32       12    0.000072873
  31       25    0.000151819
  30       33    0.000200401
  29       51    0.000309710
  28       56    0.000340074
  27      106    0.000643712
  26      134    0.000813749
  25      220    0.001336005
  24      270    0.001639643
  23      459    0.002787393
  22      536    0.003254995
  21      907    0.005507986
  20     1056    0.006412826
  19     1782    0.010821643
  18     1983    0.012042266
  17     3316    0.020137244
  16     3241    0.019681788
  15     5423    0.032932532
  14     5426    0.032950750
  13     8542    0.051873444
  12     7772    0.047197425
  11    12621    0.076644197
  10    10435    0.063369163
   9    16200    0.098378575
   8    12168    0.073893241
   7    18321    0.111258881
   6    11576    0.070298172
   5    16320    0.099107306
   4     8189    0.049729763
   3    10709    0.065033096
   2     3251    0.019742515
   1     3487    0.021175685
I then set up a system which randomly generated a number of runs for a game, with each number of runs matching the frequency with which it has occurred in real life. I then used that random generator to create 40,000 team/seasons worth of data, with 81 home games and 81 road games in each season. 40,000 team/seasons is more team/seasons than there are in the actual history of baseball.
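A minimal sketch of that generator, using a small made-up slice of the distribution rather than the full chart above:

```python
import random

# Illustrative stand-in for the empirical chart: a few runs-per-game
# values and rough frequencies (the real study uses the full table).
runs_values = [3, 5, 7, 9, 11, 13]
frequencies = [0.065, 0.099, 0.111, 0.098, 0.077, 0.052]

def simulate_season(rng):
    """81 'home' and 81 'road' games drawn from the same distribution,
    so the true park effect is exactly 1.0 by construction."""
    home = rng.choices(runs_values, weights=frequencies, k=81)
    road = rng.choices(runs_values, weights=frequencies, k=81)
    return sum(home) / sum(road)

rng = random.Random(40000)
effects = [simulate_season(rng) for _ in range(10_000)]
mean = sum(effects) / len(effects)  # very close to 1.0; the spread is the noise
```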
The ACTUAL park effect here is zero, so if we have a measured park effect, we know that it is "noise" rather than signal. But measured in one-year groups, the standard deviation of park effects was .081479. This is essentially the same number I got in the first study, which was .080643.
But with the theoretical data (Study 2), it is very easy to measure the noise in the data if the park effects are measured in one-year groups, three-year groups, five-year groups, etc. These numbers represent what the standard deviation of park effects would be, measured over a period of seasons, if there was no real park effect:
One year      .081479
Three year    .046240
Five year     .035831
Seven year    .030397
Nine year     .026939
Eleven year   .024448
In other words, the random error in the measurement (noise) becomes less significant as the scale of the data increases, which is obvious anyway, but without running the simulations, we don’t know what the expected random error would be. Even if you study park effects over an 11-year period, you’ve still got a 2.4% expected error simply due to the fact that the data sample is not large enough for everything to even out.
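For what it's worth, those simulated errors track the familiar 1/√n shrinkage of independent data quite closely; a sketch of the comparison:

```python
import math

one_year_noise = 0.081479  # simulated one-year standard deviation

# Predict the multi-year noise by dividing the one-year figure by
# sqrt(years); the simulated values above run just slightly below
# these predictions.
for years in (1, 3, 5, 7, 9, 11):
    print(years, round(one_year_noise / math.sqrt(years), 6))
```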
I then contrasted
(a) the measured park effects over multi-year periods, numbers taken from the first study, with
(b) the expected random errors, numbers taken from the second study.
The standard deviation of measured park effects [(a) above] also decreases steadily as more seasons are taken into consideration. I measure the standard deviation of park effects as:
.1255, if measured in a one-year estimate,
.1024, if measured in three-year cycles,
.0949, if measured in five-year cycles,
.0911, if measured in seven-year increments,
.0883, if measured in nine-season groups of teams, and
.0863, if measured in eleven-year cycles.
There is a hornet's nest of methodological issues that arise in a study like this. In any eleven-year period, a team may have changed something about its park, and it is nearly impossible to track all of those changes. At the extreme, if you regard every fence move as a re-start, then you would lose almost all of the data, since, with an eleven-year window, you lose five seasons on either side of a re-start. Even if ONE park doesn't change, some other park within the league will change, which changes the park-to-league relationship which is at the core of the measurement. Combining data from two different streams is a little bit problematic, although not really; we all do that every day without realizing we are doing it.
With all of that, the eleven-year measurements do appear to be very good data when and where they are available. You get solid, sensible measurements that are almost never out of line with reasonable expectations. There are a lot of places where you just can't make an eleven-year measurement; for example, you can't make an eleven-year estimate for any part of Sandy Koufax's career. For his early years in Brooklyn there's no 11-year measurement, because the Dodgers were within a few years of leaving Ebbets Field; there's no 11-year assessment for the Los Angeles Coliseum, because the Dodgers played there for only four years; and you can't make an 11-year assessment for Dodger Stadium until 1962-1972, which centers on 1967. I'm just saying, when and where you CAN make an 11-year measurement, that appears to me to be the way to go.
Anyway, here are the conclusions. Here's the math, as best I am capable of figuring it.
In a one-year measurement, the standard deviation of park effects is .1255; the expected error in the measurement is .0806. These numbers suggest that the resulting data is about 59% signal, about 41% noise. 59-41.
In a three-year measurement, the standard deviation of park effects is .1024; the expected error in the measurement is .0462. These numbers suggest that the resulting data is about 80% signal, about 20% noise. 80-20.
In a five-year measurement, the standard deviation of park effects is .0949; the expected error in the measurement is .0358. These numbers suggest that the resulting data is about 86% signal, about 14% noise. 86-14.
In a seven-year measurement, the standard deviation of park effects is .0911; the expected error in the measurement is .0304. These numbers suggest that the resulting data is about 89% signal, about 11% noise. 89-11.
In a nine-year measurement, the standard deviation of park effects is .0883; the expected error in the measurement is .0269. These numbers suggest that the resulting data is about 91% signal, about 9% noise. 91-9.
In an eleven-year measurement, the standard deviation of park effects is .0863; the expected error in the measurement is .0244. These numbers suggest that the resulting data is about 92% signal, about 8% noise. 92-8.
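All six of those splits come out of the same small calculation; this sketch reproduces them from the standard deviations listed above:

```python
# Standard deviations from the text: measured effects (study 1) and
# expected random error (study 2), keyed by years of data.
measured   = {1: .1255, 3: .1024, 5: .0949, 7: .0911, 9: .0883, 11: .0863}
random_err = {1: .0806, 3: .0462, 5: .0358, 7: .0304, 9: .0269, 11: .0244}

noise_share = {y: random_err[y] ** 2 / measured[y] ** 2 for y in measured}
for years, noise in noise_share.items():
    print(f"{years:2d}-year: {1 - noise:.0%} signal, {noise:.0%} noise")
```

Run as written, the loop reproduces the 59-41 through 92-8 splits above.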
Tom Tango has also written numerous times about issues related to this. One such post, from Tuesday, January 1, 2019, is "How much random variation is there in park factors?"
Thank you for reading.