Since I developed the point-comparison ranking system that we have used to rank NFL teams, NBA teams and (in 2007) NCAA football teams, people have been asking me whether it could be used to compare baseball teams. I have finally approached this problem, and I’ll have a report for you shortly, but first I need to make sure everybody is up to speed on the method, even though I have explained it several times by now.
Suppose that Albany plays Binghamton, and beats them 70-52. The only thing we can conclude, based on this one game, is that Albany is 18 points better than Binghamton. If we assume that an average team is “100”, then we have to conclude, based on this one game, that Albany is at 109 and Binghamton at 91.
Suppose that Cheektowaga plays Deer Park, and defeats them 122 to 41. The only thing we can conclude, based on the data we have, is that Cheektowaga is 81 points better than Deer Park. Since this is the only game info we have about either team, we place Cheektowaga at 140.5, and Deer Park at 59.5.
Suppose, however, that Albany plays Deer Park and beats them 82-68 (Albany 107, Deer Park 93), and that Binghamton plays Cheektowaga and defeats them 92-89. Now we have two estimates for each team:
Team           Game 1    Game 2    Average
Albany         109       107       108.00
Binghamton      91       101.5      96.25
Cheektowaga    140.5      98.5     119.50
Deer Park       59.5      93        76.25
So the teams appear to rank in this order:
1. Cheektowaga 119.50
2. Albany 108.00
3. Binghamton 96.25
4. Deer Park 76.25
The data—like all such data—is internally inconsistent. If Cheektowaga is 81 points better than Deer Park, Binghamton is 3 points better than Cheektowaga, and Albany is 18 points better than Binghamton, then Albany must be 102 points better than Deer Park. But Albany played Deer Park, and defeated them by only 14. Our estimates, then, must not be exactly right.
Suppose that we re-run all of the games, using the averages above as the new starting point. In the first round of games, since all teams were initially assumed to be average, Albany’s “game output score” for the Binghamton game was 109, based on
(100 + 100 + 70 – 52) / 2 = 109.00
But in the second round, we assume that Albany is at 108 and Binghamton at 96.25, and score the game for Albany as:
(108 + 96.25 + 70 – 52) / 2 = 111.125
The results of the second-round calculations for all games are:
Team           Game 1     Game 2     Average
Albany         111.125     99.125    105.125
Binghamton      93.125    109.375    101.25
Cheektowaga    138.375    106.375    122.375
Deer Park       57.375     85.125     71.25
The rank is now
1. Cheektowaga 122.4
2. Albany 105.1
3. Binghamton 101.3
4. Deer Park 71.3
We then use these second-round outputs as the third-round starting figures, and re-calculate again.
After the third round, Binghamton has pulled ahead of Albany:
1. Cheektowaga 123.8
2. Binghamton 103.8
3. Albany 102.7
4. Deer Park 68.8
And, as we continue to re-calculate, they will pull further ahead. Finally, after a few dozen rounds of re-calculation, we will reach these values:
1. Cheektowaga (1-1) 125.25
2. Binghamton (1-1) 106.25
3. Albany (2-0) 102.25
4. Deer Park (0-2) 66.25
And when we reach those numbers, they stop moving. If you re-calculate 10,000 more times, the numbers will stay right there. Deer Park is supposed to be 81 points worse than Cheektowaga and 14 points worse than Albany; we can’t make that happen, so we compromise on 59 points worse than Cheektowaga (81 – 22) and 36 points worse than Albany (14 + 22). Albany is supposed to be 18 points better than Binghamton but only 14 points better than Deer Park. We can’t make that happen, so we compromise on 36 points better than Deer Park (14 + 22) but 4 points worse than Binghamton (18 – 22). Cheektowaga is supposed to be 81 points better than Deer Park but 3 points worse than Binghamton. We can’t make that happen, so we compromise on 59 points better than Deer Park (81 – 22) but 19 points better than Binghamton (-3 + 22).
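The whole procedure can be sketched in a few lines of code. This is a minimal illustration using the four example games, not the actual spreadsheet; the loop simply repeats the averaging step until the numbers stop moving.

```python
# A minimal sketch of the point-comparison iteration described above,
# using the four example games. Not the actual spreadsheet formulas.
games = [("Albany", 70, "Binghamton", 52),
         ("Cheektowaga", 122, "Deer Park", 41),
         ("Albany", 82, "Deer Park", 68),
         ("Binghamton", 92, "Cheektowaga", 89)]

# every team starts at the assumed average of 100
ratings = {t: 100.0 for t in
           ("Albany", "Binghamton", "Cheektowaga", "Deer Park")}

for _ in range(200):  # "a few dozen rounds" is plenty for four games
    outputs = {t: [] for t in ratings}
    for team_a, a_score, team_b, b_score in games:
        # game output score = (own rating + opponent rating + margin) / 2
        outputs[team_a].append(
            (ratings[team_a] + ratings[team_b] + a_score - b_score) / 2)
        outputs[team_b].append(
            (ratings[team_b] + ratings[team_a] + b_score - a_score) / 2)
    # next round's input is each team's average output
    ratings = {t: sum(o) / len(o) for t, o in outputs.items()}

for team, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{team:12s} {r:.2f}")
# Cheektowaga 125.25, Binghamton 106.25, Albany 102.25, Deer Park 66.25
```

Run from any starting values, the loop settles on the same relative values; the absolute level is set by the average of the starting values, which is the arbitrary center discussed below.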
Let me emphasize this: it makes no difference whatever what values you initially assume each team has. If you initially assume that Deer Park has a value of 400 and all of the other teams are at zero, you get first-round outputs like this:
Team           Game 1    Game 2    Average
Albany           9        207      108.00
Binghamton      -9          1.5     -3.75
Cheektowaga    240.5       -1.5    119.50
Deer Park      159.5      193      176.25
But after the second-round calculations, you get these averages:
1. Cheektowaga 122.4
2. Deer Park 121.2
3. Albany 105.1
4. Binghamton 51.2
And, after a large number of re-calculations, you wind up with exactly the same numbers we had before. The initial assumption entirely washes out as the “comparison” data is re-introduced again and again and again.
In this example there are only four games, but in real leagues there are hundreds or thousands of games, each of them trying to push the ranking for a team up or down. It takes many rounds of re-calculation for the system to resolve all the tensions as well as it can, but the process eventually reaches a stopping point, which is the point at which the output values are the same as the input values.
In basketball we assume that an average team has a value of 200, in football, an average of 100, and in baseball, an average of 10. Those values are arbitrary, and it doesn’t matter; it’s just a way of establishing a center. You could make 27.418 the center, and the relative values of the teams would be just the same.
Anyway, I have done this for various sports, and people keep asking me, “Could you do this for baseball?” I didn’t think about doing it for baseball, honestly, because
1) One doesn’t tend to assume that the outcome of a single baseball game is representative of the ability of the team, and
2) The universe of games is immense.
Once Retrosheet published the data for 2008, however, I took a look at whether the game logs for each team could be moved into a spreadsheet like the ones we use. I concluded that they could. I’m not a programmer; it took me 25 to 35 hours of actual work, but I was able to get the games into a spreadsheet hooked up with all the necessary formulas.
OK, this is the ranking, by this method, for the 30 Major League teams in 2008, not including post-season play:
Team            Lg   Rank
Boston          A    11.35
Toronto         A    11.11
Tampa Bay       A    11.10
New York        A    10.84
Chicago         A    10.81
Minnesota       A    10.81
Chicago         N    10.76
Angels          A    10.71
Cleveland       A    10.56
Philadelphia    N    10.37
Detroit         A    10.14
NY Mets         N    10.12
Oakland         A    10.07
Milwaukee       N    10.06
Texas           A    10.02
St. Louis       N    10.00
Baltimore       A    10.00
Kansas City     A     9.86
Los Angeles     N     9.85
Florida         N     9.69
Arizona         N     9.62
Houston         N     9.56
Atlanta         N     9.52
Seattle         A     9.49
Cincinnati      N     9.21
Colorado        N     9.12
Pittsburgh      N     8.90
San Francisco   N     8.88
San Diego       N     8.84
Washington      N     8.63
And this is the ranking for all 30 teams when post-season play is included:
Team            Lg   Rank
Boston          A    11.26
Tampa Bay       A    11.13
Toronto         A    11.08
New York        A    10.82
Minnesota       A    10.78
Chicago         A    10.74
Chicago         N    10.68
Angels          A    10.67
Cleveland       A    10.53
Philadelphia    N    10.50
NY Mets         N    10.15
Detroit         A    10.11
Milwaukee       N    10.04
Oakland         A    10.04
St. Louis       N    10.02
Texas           A    10.00
Baltimore       A     9.97
Los Angeles     N     9.96
Kansas City     A     9.84
Florida         N     9.72
Arizona         N     9.65
Houston         N     9.57
Atlanta         N     9.55
Seattle         A     9.46
Cincinnati      N     9.22
Colorado        N     9.15
Pittsburgh      N     8.91
San Francisco   N     8.90
San Diego       N     8.86
Washington      N     8.66
Boston comes out first, but … I hope nobody thinks this is what this is about. If you work with baseball statistics, you probably knew who would come out first before I told you. I couldn’t care less who ranks first; this isn’t how the championship is determined.
The Red Sox, by this method, were about 219 runs better than an average major league team, during the regular season. They outscored their opponents by 151 runs (845 to 694), and they played a schedule that was 68 runs better than a major league average schedule. (Actually, 66. The other two runs are created by the interaction of the two forces.) Washington is 222 runs worse than an average team—184 from being outscored, and 38 more from playing a weaker-than-average schedule.
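The conversion from the rating scale to full-season runs is just the distance from the league average (10.00) times 162 games; a quick sketch, using the ratings from the table above:

```python
# Rating-to-runs conversion: distance from the league average of 10.00,
# times a 162-game schedule.
def season_runs_vs_average(rating, games=162, league_average=10.0):
    return (rating - league_average) * games

print(round(season_runs_vs_average(11.35)))  # Boston: 219
print(round(season_runs_vs_average(8.63)))   # Washington: -222
```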
Having done this, I see more merit in the method than I would have supposed was there going in. I was reluctant to do this; I thought it would be a lot of work to show us things that
a) we already knew, and
b) didn’t really matter.
Baseball has a very good system to resolve its championship; it doesn’t need rankings.
But I see more merit there than I suspected. First, the issue of predictability … do early-season rankings predict the finish of the season?
Yes, but not really. Since there are very few inter-league games early in the season, if we rank the teams based on games played through May 31, we get rankings that are sensible within the league, but with little information on how one league compares to the other:
AMERICAN                 NATIONAL
Team           Rank      Team            Rank
Boston         10.70     Chicago         11.39
Toronto        10.64     Philadelphia    10.93
Oakland        10.64     Atlanta         10.86
Tampa Bay      10.59     Arizona         10.56
Chicago        10.58     NY Mets         10.24
Cleveland      10.28     Los Angeles     10.10
Angels         10.04     St. Louis       10.02
New York       10.00     Houston          9.97
Baltimore       9.79     Florida          9.96
Texas           9.75     Pittsburgh       9.93
Minnesota       9.70     Cincinnati       9.80
Detroit         9.68     Milwaukee        9.78
Kansas City     9.03     Washington       9.36
Seattle         8.96     San Francisco    9.01
                         San Diego        8.87
                         Colorado         8.86
Those rankings are reasonably predictive of the final finishes, but are they better than just looking at the standings?
Well, yes, probably. Florida was 31-23 on May 31, in first place in the NL East—but we see them here as a .500 team, and the fourth-best team in the division, Philadelphia being the best. They were one game under .500 for the rest of the season, and finished third in the division. Cleveland was 25-30 on May 31, but we see them here as an above-average team, and they did go 56-51 the rest of the way, although they never did get back in the race. On the other hand, Oakland and Atlanta, which appeared early in the season to be strong teams, ultimately proved not to be. My conclusion: Yes, there is probably some predictive significance to the method, but you wouldn’t want to rely on it. And you’d have to study many more seasons than one to know how reliable the early-season evaluations really were.
That’s a minor issue, to me, although an obvious one. The real virtues that I see in this system are:
1) It creates a meaningful and fairly reliable evaluation of how one league compares to the other.
Suppose that there were only one game played between the leagues, and that in that one game, San Diego beat Texas 13-2. If that were the case, this system would place the National League teams, on average, ten to twelve points ahead of the American League teams. This would happen because there would be nothing in the system resisting the input of that one game. San Diego would, of necessity, rank 11.0 runs ahead of Texas, and all of the other teams would re-orient themselves within the league based on the interlocking schedule within the league.
That’s a cautionary note; limited games, not interlocking with the rest of the schedule, can have a disproportionate impact on the rankings in this system.
But the AL/NL comparisons are not based on one game, of course; they are based on 250+ games, in which the American League, which has dominated those games for several years, went 149-103 (150-107 if we count the World Series). Our estimate is that the average American League team is 0.86 runs per game better than the average NL team. I am sure we could derive this estimate by some other, simpler method—but I doubt that we could derive a more accurate one.
On the larger point, I think that the greatest potential of this method is in the comparison of leagues, and in particular in the comparison of leagues for college baseball. Our experiment here suggests that this method works quite well for baseball. There are 900+ college baseball teams, which play a completely interlocking schedule. UC-Riverside plays somebody who plays somebody who plays somebody who plays Middlebury College in Vermont and the University of Puget Sound in Tacoma and Bowdoin College in Maine.
Major league teams genuinely need to know how one college league compares to another. This method could definitively resolve that issue. The people who should do this are us—Bill James Online. So far, because of financial and programming issues, we haven’t been able to get things like that done, but if we don’t, somebody will. There is no reason we cannot clearly and definitively rank Bowling Green and Montevallo against USC and Texas.
2) It creates a sophisticated and accurate estimate of each team’s strength of schedule.
Because we have accurate rankings for each team on a run scale, we can easily figure the average strength of the opposition for each team. These are those figures for each of the 30 major league teams:
Team            Lg   Strength of Schedule   Plus Runs
Baltimore       A    10.51                   83
Toronto         A    10.44                   72
New York        A    10.44                   71
Tampa Bay       A    10.43                   70
Boston          A    10.41                   66
Texas           A    10.41                   66
Kansas City     A    10.39                   64
Detroit         A    10.34                   55
Seattle         A    10.33                   53
Oakland         A    10.32                   52
Chicago         A    10.28                   46
Angels          A    10.28                   45
Cleveland       A    10.26                   43
Minnesota       A    10.26                   42
Pittsburgh      N     9.83                  -27
Cincinnati      N     9.81                  -30
Washington      N     9.80                  -32
Houston         N     9.76                  -38
Florida         N     9.71                  -47
Atlanta         N     9.71                  -47
Milwaukee       N     9.69                  -50
St. Louis       N     9.68                  -51
Philadelphia    N     9.65                  -56
San Diego       N     9.65                  -57
Chicago         N     9.64                  -58
San Francisco   N     9.64                  -59
NY Mets         N     9.64                  -59
Colorado        N     9.61                  -62
Arizona         N     9.57                  -70
Los Angeles     N     9.55                  -73
The teams in the AL East play the strongest schedules, because there are four strong teams in that division. Baltimore plays the toughest schedule because they are the only team that has to play all four of them.
There is a lot of talk about strength of schedule … some whining about the unbalancing effects of the inter-league matchups, some discussion about playing so many games inside the division. This method gives us solid, credible information with which to approach that discussion. I think that’s worthwhile.
Baltimore’s schedule is 156 runs tougher than Los Angeles’ schedule—one run a game, basically. Baltimore starts the season 156 runs behind the Dodgers. What do we think about that? Should we just live with it, or should we try to do something about it?
3) It is a step toward the possible evolution of methods of adjusting statistical performance for strength of schedule.
I don’t talk about what happens in the Red Sox front office, but I think I can tell you this: We worry a lot about “Can this guy come into our division and compete?” OK, here’s a pitcher who has a good record pitching in some other division, but … that ain’t the AL East. What’s going to happen to him against this level of competition?
This information is a step on the road toward a method that can adjust statistical performance for the level of competition.
Alternative Approach
What if we approached this problem not through run differential but through winning percentage?
In order to do that, we need to be able to state the outcome of each game as a winning percentage. I outlined a method to do that in another article (Winning Percentage from a Game) … a 1-0 win creates a winning percentage of .541, a 12-6 victory is .626, a 4-5 loss is .422.
If you post a .626 winning percentage against a team with a winning percentage of .482, what is your winning percentage? In other words, the .626 assumes a .500 opponent. It’s not a .500 opponent; it’s a .482 opponent. What’s the equivalent winning percentage?
It’s .609. There’s an old, established method for dealing with that … Dallas Adams and I invented it in the 1970s. I don’t want to get into that now, but:
.626 against a .400 team is equivalent to .527 against a .500 team.
.626 against a .450 team is equivalent to .578 against a .500 team.
.626 against a .500 team is .626.
.626 against a .550 team is equivalent to .672 against a .500 team.
.626 against a .600 team is equivalent to .715 against a .500 team.
If you combine the new method (Winning Percentage from a Game) with this old method, you can calculate a winning percentage for each game, adjusting for the quality of the competition.
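The article doesn’t spell out the old method’s formula, but the standard log5-style form reproduces every figure quoted here, so that is what this sketch assumes:

```python
# Log5-style restatement of a winning percentage p earned against an
# opponent of quality q, as the equivalent percentage against a .500 team.
# NOTE: the exact formula is an assumption; it matches the article's examples.
def vs_500(p, q):
    return p * q / (p * q + (1 - p) * (1 - q))

print(round(vs_500(.626, .482), 3))  # 0.609 -- the example above
print(round(vs_500(.626, .400), 3))  # 0.527
print(round(vs_500(.626, .550), 3))  # 0.672
```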
We evaluate each game of the major league season in this way. Milwaukee played at Cincinnati on April 18, April 19 and April 20, Milwaukee winning 5-2 and 5-3 and losing the third, 3-4.
A 5-2 win is a winning percentage of .719. A 5-3 win is a winning percentage of .635, and a 3-4 loss is a winning percentage of .444, so Milwaukee’s winning percentages for the three games are .719, .635 and .444, and Cincinnati’s are .281, .365 and .556—without adjusting for the quality of competition.
To adjust for the quality of competition, we go through the process outlined above. On the first round of calculations, we assume that Cincinnati is a .500 opponent. Cincinnati’s winning percentage after one round of calculations, however, is .471, so in the second round of calculations we assume that their winning percentage is .471, and we re-calculate again.
After many rounds of calculations, Cincinnati’s winning percentage locks in at .470, and then it won’t move anymore; this is the end point data. We recalculate these games based on that conclusion:
.719 against .470 is .694 (meaning that it is equivalent to .694 against a .500 team.)
.635 against .470 is .607.
.444 against .470 is .414.
So Milwaukee’s winning percentage contributions for those games are .694, .607 and .414.
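Here is a toy version of that iteration, restricted to the three Milwaukee-Cincinnati games above. In the real calculation every team’s percentage is pulled on by its whole schedule, so this two-team loop will not land on the article’s .470 for Cincinnati; it only shows the mechanics. The log5-style `vs_500` adjustment is an assumed form of the old method.

```python
# Toy two-team version of the winning-percentage iteration, using only
# the three Milwaukee-Cincinnati games quoted above. The real run uses
# every game of the season, so these values won't match the article's.
def vs_500(p, q):
    # assumed log5-style adjustment: pct p against quality q,
    # restated as the equivalent pct against a .500 team
    return p * q / (p * q + (1 - p) * (1 - q))

mil_games = [0.719, 0.635, 0.444]   # Milwaukee's single-game percentages
mil, cin = 0.500, 0.500             # round one assumes .500 opponents

for _ in range(500):
    new_mil = sum(vs_500(p, cin) for p in mil_games) / 3
    new_cin = sum(vs_500(1 - p, mil) for p in mil_games) / 3
    mil, cin = new_mil, new_cin

print(round(mil, 3), round(cin, 3))  # the two percentages sum to 1
```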
By calculating every game in this fashion and running it through many cycles, we get output winning percentages for every team as follows, including the playoff and World Series Games:
Team            Lg   Winning Percentage
Boston          A    .558
Tampa Bay       A    .542
Toronto         A    .539
Chicago         N    .535
Angels          A    .533
Philadelphia    N    .529
Minnesota       A    .528
New York        A    .528
Chicago         A    .521
NY Mets         N    .518
Cleveland       A    .517
Milwaukee       N    .516
Los Angeles     N    .509
St. Louis       N    .508
Houston         N    .500
Florida         N    .499
Texas           A    .495
Oakland         A    .495
Arizona         N    .492
Kansas City     A    .488
Detroit         A    .488
Baltimore       A    .487
Atlanta         N    .484
Colorado        N    .475
Cincinnati      N    .470
Seattle         A    .468
San Francisco   N    .464
Pittsburgh      N    .459
San Diego       N    .452
Washington      N    .438
This is essentially the same as the rankings we got by the other method—a little different, but mostly the same.
A strength of this method is that it is more focused on wins and losses, and pays little attention to the difference between a 7-1 win and a 15-1 win. There are eight runs there that should be depreciated—and are depreciated by this method, not by the other one.
A weakness of this method, which was discussed in the companion article (Winning Percentage from a Game), is that the average winning percentage from all games does not track with the team’s actual winning percentage, but with a figure halfway between that number and .500 … .600 becomes .550, .580 becomes .540, etc.
I was trying to figure out a way to work around this problem, but the best I could come up with was simply to go through the entire process, and then double the spreads (double the distance from .500) at the end of the process:
Team            Lg   Centralized Pct             De-Centralized Pct
Boston          A    .558       becomes          .615
Tampa Bay       A    .542       becomes          .585
Toronto         A    .539       becomes          .577
Chicago         N    .535       becomes          .571
Angels          A    .533       becomes          .565
Philadelphia    N    .529       becomes          .558
Minnesota       A    .528       becomes          .556
New York        A    .528       becomes          .556
Chicago         A    .521       becomes          .542
NY Mets         N    .518       becomes          .536
Cleveland       A    .517       becomes          .533
Milwaukee       N    .516       becomes          .531
Los Angeles     N    .509       becomes          .517
St. Louis       N    .508       becomes          .516
Houston         N    .500       becomes          .500
Florida         N    .499       becomes          .498
Texas           A    .495       becomes          .491
Oakland         A    .495       becomes          .490
Arizona         N    .492       becomes          .484
Kansas City     A    .488       becomes          .476
Detroit         A    .488       becomes          .475
Baltimore       A    .487       becomes          .474
Atlanta         N    .484       becomes          .467
Colorado        N    .475       becomes          .451
Cincinnati      N    .470       becomes          .441
Seattle         A    .468       becomes          .435
San Francisco   N    .464       becomes          .428
Pittsburgh      N    .459       becomes          .417
San Diego       N    .452       becomes          .404
Washington      N    .438       becomes          .375
That’s not a very good way to make that adjustment, and I’m sure somebody will suggest a better way of de-centralizing the data.
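For concreteness, the doubling transform is just this. Applied to the rounded .558 it gives .616 rather than the table’s .615, which presumably comes from the unrounded percentage:

```python
# "Double the spreads": de-centralize a winning percentage by doubling
# its distance from .500.
def decentralize(pct):
    return 0.500 + 2 * (pct - 0.500)

print(round(decentralize(0.558), 3))  # 0.616
```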
I experimented with de-centralizing the data during the calculation process—that is, de-centralizing the numbers after each round of calculations, before the next round. I thought that one of two things might happen:
1) That after being de-centralized in the opening rounds of the calculations, the data might stabilize at the de-centralized numbers, or
2) That the system might veer out of control, and start giving us irrational calculations.
But actually neither of those happens. What happens—it is in a sense re-assuring—is that the system persistently attempts to stabilize at the “centralized” numbers, and defies the efforts to de-centralize it. In other words, Boston is headed for .558 and Washington is headed for .438, no matter what you do. If you double the difference from .500 in the early rounds, the data will home in on the “centralized” numbers as soon as you stop forcing it away from .500. If you double the difference from .500 after every round, the system homes in on the de-centralized numbers. Doing the de-centralization during the process is the same as doing it after the process.
It works OK; I like the other method a little better, but I can see an argument for this one, too. No matter what we do, we are going to reach the conclusion that the Red Sox were the best team in baseball in 2008, but I’ve checked my finger a number of times, and I’m really certain that there ain’t no ring there. I’m not pursuing that claim; we’re simply trying to understand the data a little bit better. By learning to make inferences from the data, we might eventually learn to rank restaurants, high schools, political candidates or movie stars. We’re starting with baseball teams.