Winners

October 7, 2016
            Reflections on the Use of Wins and Losses in an Analytical System

 

         (This is the second of two articles taken from my longer unpublished article, outlining how to calculate Win Shares and Loss Shares, using modern defensive numbers.)

 

               I wanted to explain and defend the use of Won-Lost records in the rating of the pitcher.  I don’t believe that the other WAR systems use Wins and Losses, at all, in their determination of a pitcher’s value, and I know that many analysts will not believe that it is appropriate to use them.  My friend Brian Kenny has been on a campaign to "Kill the Win", and his new book has a chapter entitled "Kill the Save Too."

               I think that I am in substantial measure responsible for the antipathy toward the win in the sabermetric community, due to things that I wrote in the 1980s, and I think that, because of those things, I need to explain why I feel that it is appropriate to use Wins, Losses and Saves, not as a major element in seeing most clearly the contribution of an individual player to the success of his team, but as a minor element.  

               What I could say and often do say in this debate is that we are not in the business of eliminating information, but in the business of creating it.  What I could also say, and do say, is that it is the point of the game to Win.   In this effort here, what we are really doing is creating Won-Lost records for every player—carefully, systematically justified and deeply researched Wins and Losses, yes, but still Wins and Losses.  To help his team win, and avoid losing, is the exact definition of a player’s job.   A won-lost record is, I think, the best possible shorthand notation to summarize what a player has done to help his team win.  

               But those also are shorthand arguments, arguments you make when you do not have the time or patience to deal with the real issues.  The argument against the win, in essence, is that it pretends to represent an ultimate truth, but fails to do so.  That is true enough; it does fail in many cases.  Also, pitchers’ Wins and Losses are very poorly designed, carelessly designed.  The rules determining who gets the Win in a contest and who gets the Loss were never really thought through; they just happened.  Casual decisions about record-keeping have been carried through for a century and more, when it has long since become apparent that the original decisions were poorly made.

 

               But this is the real issue. . . well, no, I am not quite ready to get to the real issue.    On the other side of the issue, I should say first that the Won-Lost record has inherent virtues not found in other statistics—and important inherent virtues not found in other statistics.   The won-lost record automatically balances at .500 in every season.  

               This is a tremendous asset.   In my lifetime we have had seasons when the league ERA was under 3.00; we have had seasons when it was around 5.00.   When you look at a pitcher’s career ERA and see 3.50, that could be 400 runs better than the league average in his era; it could be 300 runs worse.   You don’t know.

               In the 1950s, a two-to-one strikeout to walk ratio was a GREAT ratio.    No starting pitcher in the American League had a two-to-one strikeout to walk ratio in 1947, in 1948, in 1949, in 1950, in 1951, in 1953 or in 1954.    Now, a two-to-one strikeout to walk ratio is not only below average, it is WELL below average.   So when you look at a pitcher’s record and see a two-to-one strikeout to walk ratio, you don’t know what that means.   It could be a great ratio; it could be below average.

               This is true of every stat, except Wins and Losses.  Complete Games, Shutouts, Saves, Hits per Inning, WHIP. . .they are ALL misleading because standards change radically over time—except Wins and Losses.  

               Yes, of course it is true that the quality of his teams impacts a pitcher’s career won-lost record, and of course it is true that random factors (and the fact that the stat is poorly defined) will cause won-lost records to be deceiving even when comparing teammates.  But this is also true:  that in looking at a starting pitcher’s CAREER record, the MOST reliable piece of evidence (in the basic record of a pitcher) is his won-lost record.  The random effects more or less disappear over the course of 300 decisions or more.  The flaws in the game-by-game assignment process are not hugely significant over time, although they are hugely annoying in isolated cases.  And most pitchers, over the course of a career, pitch for a more-or-less even balance of good teams and bad teams.

               Yes, the won-lost record, even over the course of a career, is not a PERFECT summary of the pitcher’s positive and negative contributions to his team—but it is better than anything else in the pitcher’s basic career record.    The distortions in a Won-Lost record, over a career, are clearly less than the time-and-place distortions of Earned Run Average, WHIP, Strikeouts and Walks, and every other statistical category.   

               Even in a single season, the park-effects distortions of ERA can be larger than the distortions in the Won-Lost record.  In a single season it is a fair fight between the two, and the ERA would win that fight more often than it loses; but the ERAs of pitchers in Colorado, for example, are in no way a reliable indicator of how well the pitcher has pitched.

               So Brian is trying to get rid of what is actually the BEST information in a pitcher’s record (for a starting pitcher, over a substantial career.)   That does not seem to me to be wise.   It doesn’t seem to me like it is going to work, either; I think that we COULD persuade the baseball community to reform the pitcher’s Won-Lost record, if we could somehow agree to act together on that issue, but we will never persuade them to get rid of Wins and Losses—nor should we be trying to do so. 

               But I have not yet addressed the real issue.  The real issue is, does the won-lost record contain ANY useful information which is not contained anywhere else in the pitcher’s record?  We KNOW what the park effects are.  We KNOW what the league norms are for ERA, strikeouts and walks.  If we got rid of the won-lost record, Brian might argue, we could replace the information, and better it, with information created by the things that modern analysts know.

               But can we?  I don’t know.   The best, most honest answer to the question of whether there is any useful information contained in the Won-Lost record and not found elsewhere in a pitcher’s record is that I don’t know, for certain, and (a) you don’t know, either, I don’t believe, and (b) I am more inclined to believe that there IS information in there which would otherwise be overlooked than that there is not.  

               I thought the opposite, in the 1970s and the 1980s; I believed the opposite to be true at that time, and I played a role in convincing Brian and many others that this was true.   I may be primarily responsible for the belief, in the sabermetric community, that Wins and Losses are a useless artifact of an old way of thinking about pitchers; I am not saying that I am responsible for that, but that I may be.   I will accept the blame for that mistake if you feel that I should. 

               Here is where I went wrong.  In the mid-1970s, there was an article published about Clutch Hitting, which concluded that there was no such thing as an ability to hit in the clutch.  It was one of the best sabermetric articles published up to that time, and it was decades ahead of its time in having the courage to take on directly one of the central elements of the baseball community’s understanding of why teams win and lose.

               The approach used in that article was to compare performance in clutch situations in two consecutive seasons.   In other words, the author looked at every at bat of two consecutive seasons—I believe it was 1969 and 1970, or 1970 and 1971, something like that—and isolated the "clutch" contribution of each player in one season, and then the other.   His conclusion was that there was no relationship between the lists in the two seasons.   The players who were the best clutch hitters in one season had no tendency—no tendency at all—to be the best clutch hitters in the following season, nor did the players who had been the worst clutch hitters in one season have any tendency at all to fail in the clutch in the next season.   The patterns of who was "clutch" were simply random.  

               If an ability actually exists, the author argued (implicitly or directly, I don’t remember which). . .if an ability actually exists, then it must be persistent to at least some extent.   If you look at the players who have speed in one season and those who are slow, you will find that the same players are fast and the same players are slow the next year.   If you look at the players who hit for power in one season (or who don’t), you will find that they still hit for power (or they don’t) in the next seasons.   This is true of every real ability.   If a trait has NO tendency to persist, then it isn’t a real ability; it is just luck. 
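The persistence test described above amounts to correlating a measure in one season with the same measure in the next. What follows is only a minimal sketch of that idea, not the original article's procedure; the `pearson` helper and the five pairs of "clutch" scores are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five players in two consecutive seasons
# (invented numbers, purely for illustration).
season_one = [0.030, -0.010, 0.015, -0.025, 0.000]
season_two = [0.020, -0.015, 0.010, -0.030, 0.005]

# A value near 1 means the trait persists; a value near 0 means it does not.
persistence = pearson(season_one, season_two)
print(f"year-to-year correlation: {persistence:.2f}")
```

A real ability such as speed or power would be expected to show a clearly positive correlation on a test like this; the article's finding was that clutch performance showed none.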

               I was very impressed by this article and by this analytical approach, and I used this method to study many other issues in the late 1970s and the 1980s.   One of those issues was whether there was any such thing as an ability to "pitch to the score", thus an ability to "win" the game which was separate from and distinct from an ability to prevent runs from scoring.   I concluded that there was no evidence that there was such an ability.  

               But about 2003, 2004, 2005, I had a terrible realization.   That method doesn’t work.   It SEEMS like it ought to work, and it will work if the "x factor" that you are trying to isolate is relatively large compared to the dominant patterns in the data, but it doesn’t work at all—at all, at all, at all—when the factor you are trying to isolate is hidden by randomization, and also by other, larger elements in the data.  "Randomization" means that some days the offense scores 8 runs; other days they get shut out, you just never know how many runs you will have to work with on a given day.   "A larger element in the data" refers to, for example, a pitcher’s ERA; obviously a pitcher’s ERA is a LARGER element in his won-lost record than this "x factor", this ability to allow 3 runs when the team scores 4, so obviously the X factor is not the DOMINANT element of the equation.

               I began to sense that this must be true about 2003, but it took me about two years to come to terms with the fact that it actually doesn’t work.   It was hard for me, because I had done many, many studies over the years which relied on the assumption that that method WOULD work, that it would isolate an X factor if an X factor existed.   I had to come to terms with the fact that I had misled many people on many issues—like Brian Kenny on this issue—because I had relied on a method that doesn’t work.  
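One way to see why the year-to-year method can fail is to simulate it. The sketch below assumes a very simple model that is not in the original studies: each player's observed figure is a fixed true "x factor" plus independent random noise each season. The function name, the standard deviations, the player count, and the seed are all hypothetical choices. Under this model the expected correlation is skill variance divided by the sum of skill variance and noise variance, so a small real factor produces a correlation that is hard to tell from zero.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def year_to_year_correlation(skill_sd, noise_sd, n_players=500, seed=2016):
    """Simulate two seasons for players whose true skill never changes."""
    rng = random.Random(seed)
    skills = [rng.gauss(0, skill_sd) for _ in range(n_players)]
    season_one = [s + rng.gauss(0, noise_sd) for s in skills]
    season_two = [s + rng.gauss(0, noise_sd) for s in skills]
    return pearson(season_one, season_two)


# The x factor is real in both cases; only its size relative to the noise differs.
large_factor = year_to_year_correlation(skill_sd=10.0, noise_sd=1.0)
small_factor = year_to_year_correlation(skill_sd=1.0, noise_sd=10.0)
print(f"large x factor: r = {large_factor:.2f}")
print(f"small x factor: r = {small_factor:.2f}")
```

In the second case every simulated player has a genuine, persistent ability, yet the test would report essentially no persistence.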

               Let me try to explain why it doesn’t work.  A pattern of numbers may exist, and you may be able to see the pattern clearly when there is no interference.  Like this:

[Figure: a mostly blank grid in which 9s, placed in selected cells, spell out the word CLUTCH in block letters.]

               But the pattern becomes much harder to see when it is surrounded by random data.   Like this:

[Figure: the same grid with every blank cell filled by a random digit from 5 to 9; the pattern of 9s can no longer be picked out.]

               All of the original "9s" are still there; you just can’t see the pattern anymore because of all of the random numbers.  A pattern also becomes more difficult to see when it is in competition with another, more dominant pattern.  Like this:

[Figure: the same grid of 9s overlaid with a larger, heavier pattern drawn in 8s; the 8s dominate and the 9s are hard to pick out.]

               When a data pattern is in competition BOTH with a more dominant pattern and with random effects, it becomes totally impossible to find the original pattern.   Data patterns can exist, in a mountain of data, which are so well hidden that they can resist thousands of efforts to find them.   Patterns exist in nature which are so well hidden that they have resisted thousands of years of attempts to decode them.   This is what science is, in a sense:  it is an endless effort to find the patterns in nature which have been there for millions of years, waiting for us to find them.  
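The masking effect in the illustrations can be imitated in a few lines. This is a sketch under stated assumptions, not a reconstruction of the original grids: the small X-shaped pattern, the helper names `render` and `bury_in_noise`, and the seed are all hypothetical.

```python
import random

# A tiny made-up pattern: 9s on a blank background (None marks an empty cell).
pattern = [
    [9, None, None, None, 9],
    [None, 9, None, 9, None],
    [None, None, 9, None, None],
    [None, 9, None, 9, None],
    [9, None, None, None, 9],
]


def render(grid):
    """Return the grid as text, showing empty cells as dots."""
    return "\n".join(
        " ".join("." if cell is None else str(cell) for cell in row)
        for row in grid
    )


def bury_in_noise(grid, rng):
    """Fill every empty cell with a random digit from 5 to 9."""
    return [
        [rng.randint(5, 9) if cell is None else cell for cell in row]
        for row in grid
    ]


rng = random.Random(7)
noisy = bury_in_noise(pattern, rng)

print(render(pattern))
print()
print(render(noisy))

# Every original 9 is still in place; only its surroundings have changed.
originals_intact = all(
    noisy[r][c] == 9
    for r, row in enumerate(pattern)
    for c, cell in enumerate(row)
    if cell == 9
)
print("original 9s intact:", originals_intact)
```

Nothing about the pattern is altered by the fill; what changes is how easily it can be seen.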

               Thus, the fact that we have not yet found convincing evidence of an "X factor" in the ability to Win games, or in the ability to hit in the clutch, is not proof that these things don’t exist; they may well exist, and we just haven’t looked at the data in exactly the right way yet.  To find proof of clutch hitting using the method of that seminal mid-1970s article (and it was a very good article) would require, I believe, much more data than exists in all of the history of organized baseball.  The fact that we can’t see any such pattern DOES mean, I believe, that these cannot be the dominant patterns in the data.  Clutch hitting and an ability to pitch to the score (by pitchers) cannot be dominant patterns, or I believe that we would have found them by now, but they may still be hiding in the data as meaningful but recessive elements.

               When I was young, I didn’t understand this.   I thought that if I looked for the X factor in every way that I could think of to look for it and did not find it, then it must not be there.   As a mature researcher, I have much more respect for the ability of the universe to hide its secrets.  

               So then, we face the basic question:   Is it more reasonable to believe that this "X factor" exists, or that it does not exist?

               It seems to me much more reasonable to believe that it does exist than that it does not.  An ability to Win Games which is not reflected in the number of runs allowed could exist if the pitcher has an ability to allow 3 runs when he has 4 to work with, and 5 runs when he has 8 to work with. . . .  It could exist in the ability to pitch to the score, but it could also exist in other places.  What I mean primarily is that an ability to Win Games which is not reflected in the number of runs allowed could be hidden in the variability of offensive conditions under which a pitcher must work.

               Sometimes a pitcher for the San Francisco Giants must pitch in Colorado, since the Rockies are in his division; other times he gets to pitch in San Diego.

               Sometimes he must pitch in 90 degree weather; other times he gets to pitch in 65 degree weather.   Many, many more runs are scored in 90 degree heat than are scored in cold weather.  

               Sometimes he must pitch when the wind is blowing out.  Other times he gets to pitch when the wind is his friend.

               Sometimes he must pitch when the home plate umpire is giving pitchers the edges of the plate; other times he must pitch when the home plate umpire is calling those pitches balls, and the first base umpire won’t call a checked swing a swing. 

               Sometimes the pitcher gets to pitch when the pitcher’s mound is in the sunlight, and the batter’s box is in the shadows.   Other times the light conditions are not an issue. 

               When these conditions exist, they have an impact on all of the pitcher’s stats—except the wins and losses.   But in every set of game conditions, there is one winning team and one losing team.   If these game condition variables do not even out, over the course of a season, then there MUST be information about the ability of the pitchers which is contained in the won-lost record, but which is not contained anywhere else in the record. 

               Traditional baseball analysts asserted, until about 1980, that the won-lost record was what mattered most for a pitcher because the offensive support evened out over time.   Once we actually studied that issue, we could easily demonstrate that that was false:  offensive support does NOT even out over the course of a season, nor even necessarily over the course of a career, although it is much more even over the course of a career than over the course of a season.  

               But if we assume that offensive condition variables even out over the course of a season, does this not put us in very much the same position that we have already demonstrated to be false?  It seems to me that it does.  Suppose that two pitchers on the same team have the same ERA and the same innings pitched, but very different won-lost records.  We could call these two pitchers Chris Sale and Jose Quintana, 2015, or we could call them Kyle Lohse and Yovani Gallardo, 2014, or we could call them Gio Gonzalez and Jordan Zimmermann, 2012, or we could call them Miguel Batista and Jarrod Washburn, 2007, or we could call them Kenny Rogers and Nate Robertson, 2006.  (Jose Quintana in 2015 had a 9-10 won-lost record despite a 3.36 ERA in 32 starts.  His teammate Chris Sale posted a 3.41 ERA in 31 starts, and also allowed more UN-earned runs than Quintana, but still finished 13-11.)  Is it likely that ALL of the difference between them is in luck, and that none of it is due to undocumented variation in offensive conditions, or is it more likely that SOME of the difference is due to undocumented variation in offensive conditions?

               To explain that Chris Sale received the support of 3.81 runs per start and Jose Quintana only 3.59 is no help, first because the difference is not large enough to explain the divergence in won-lost records, but also because an undocumented variation in offensive conditions would CAUSE such a difference, just the same as luck would cause it.  
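A rough check of the claim that the run-support gap is too small to explain the records can be made with a Pythagorean-style estimate. This is a sketch under loud assumptions, not a calculation from the article: it uses the classic exponent of 2, treats run support per start as runs scored, and uses ERA as a stand-in for runs allowed, which ignores unearned runs and differences in innings. The function name is hypothetical.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Figures quoted in the article for the 2015 teammates.
# "support" is run support per start; "era" is used as runs allowed,
# which is an assumption of this sketch.
pitchers = {
    "Sale": {"support": 3.81, "era": 3.41, "wins": 13, "losses": 11},
    "Quintana": {"support": 3.59, "era": 3.36, "wins": 9, "losses": 10},
}

for name, p in pitchers.items():
    expected = pythagorean_win_pct(p["support"], p["era"])
    actual = p["wins"] / (p["wins"] + p["losses"])
    print(f"{name}: expected {expected:.3f}, actual {actual:.3f}")
```

Under these assumptions the expected winning percentages come out near .555 for Sale and .533 for Quintana, against actual marks of about .542 and .474, so the documented difference in support covers only a small part of the gap between the two records.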

               Unresolved scientific questions ultimately come down to a test of "What do you believe?"   It seems to be more consistent with the nature of the universe to believe that the won-lost records contain SOME useful information about the pitchers, rather than that they contain none, and so I am going to use the records as a small element in these Win Share calculations. 

 
 

COMMENTS (17 Comments, most recent shown first)

jeffburk
As an historical matter, is it the case that the rules for awarding wins and losses weren't thought through? Or is this an instance like the one Bill described in the latest volume of the Fielding Bible, about traditional fielding statistics making more sense at the time they were adopted because of the higher prevalence of errors at the time? In an era when (1) complete games were the norm rather than the exception, (2) a starting pitcher was considered not to have done his job if he failed to complete a game and (3) if a reliever entered the game, he was likely to finish it rather than be replaced by another reliever, the rules for awarding wins and losses made more sense than they do now.

That sounds pedantic, but it is relevant because it points to the bigger distortion in won-loss records in modern times. Run support will always play a role no matter what rules you adopt. But the lacuna in won-loss records in contemporary circumstances is the number of no decisions. When it is common for two, three or even more relievers to follow the starting pitcher, awarding a win can become something of a crapshoot based on when the offense takes the lead that wins the game.

Fortunately, this is something that can be addressed. One way would be to look at the team's record in a pitcher's starts, regardless of whether he was awarded a win or loss. Another way would be to tweak the rules for awarding wins and losses, perhaps to favor the starting pitcher more heavily or to award wins and losses on a fractional basis among the starting pitcher and the relievers, and apply the tweaked rules to the data set. Obviously, Major League Baseball is unlikely to make changes of that nature to the statistical record, but there is no reason it cannot be done for analytical purposes. I think this would be useful in studying concepts like pitching to the score as well as more precisely allocating credit for wins and losses among starting pitchers and relievers in overall valuation systems such as win shares.
1:43 AM Nov 5th
 
bjames
But if it is a 10% factor, that IS the value that I assigned to it. I assigned it a 10% weight. So are you not, in effect, arguing that the value I assigned to this factor is exactly accurate?
7:02 PM Oct 12th
 
110phil
I agree that there could be some information in W-L records ... but, the question is, how much? Specifically, how much compared to the variation caused by noise?

I don't know, but I think we can calculate the second number, the variation caused by noise.

If you don't want to read further, the number I came up with is an SD of 2 starter wins per season, which is an SD of 3 for the difference in wins between two identical starters. Two SDs is 6 wins, so for 5% of identical pairs of starters, their records will vary by 6 or more wins (as in, 15-9 versus 9-15), just because of luck.

So, however big the signal is, it's competing with noise of 3 wins.

Here's my calculation:

-----

Seems to me that the biggest three kinds of luck affecting that record are (1) run support, (2) bullpen performance, and (3) lucky random "pitching to the score" effects.

Suppose a pitcher starts 30 games, going 7 innings each game. And suppose 70% of the decisions are his. What's the SD of the W in his W-L record, just from the effects of luck (as if you were playing Strat-O-Matic)?

(1) The SD of runs per inning is (conveniently) about 1. So, the SD of runs over 9 innings is 3. Divide 3 by the square root of 30 [games] and you get about .55. So, that's the SD of run support during those games.

(2) The bullpen accounts for 2 innings per game, for an SD of 1.4 runs in those innings. Divide by the square root of 30, to get 0.26.

(3) I did a "lucky teams" study in 2005 that found that the SD of pythagorean luck for a team is about 4 games over a season. That's about 1.7 games out of 30, which is 17 runs out of 30, which is the equivalent of 0.57 runs per game.

So, the total luck from those is the square root of (.55 squared + .26 squared + .57 squared), which rounds to .70.

That means: the SD of runs-that-lead-to-wins for which the starter has no control is .70 per game, for a 30-game sample. That's 21 runs, or 2.1 wins.

Say, 1.5 of those 2.1 wins go to the starter's W-L record.

If you're comparing two pitchers, you multiply the 1.5 wins by root 2, which gives 2.12. So, the SD of the difference in wins between two starters is 2.12 wins. In other words, if you have two teammates who would otherwise both go 12-12 ... 5 percent of the time, one of them will go 8-16, while the other will go 16-8, for no other reason than good/bad luck.

[And that's not all! There's also the timing of runs. I was assuming that both pitchers get 70% of the decisions, but, just randomly, one might get 15 decisions out of 30 starts, while the other might get 25.]
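The arithmetic in steps (1) through (3) can be checked with a short script. This is a sketch that follows the comment's own figures; the ten-runs-per-win conversion and the square-root scaling from 162 games to 30 are inferred from the numbers quoted rather than stated outright. Evaluated this way, the squared terms sum to about .70 and the square root of that sum is about .83, so the .70 carried forward above appears to correspond to the sum of squares.

```python
import math

starts = 30

# (1) Run support: SD of 3 runs per nine innings, averaged over 30 starts.
run_support_sd = 3 / math.sqrt(starts)

# (2) Bullpen: two innings per start, SD of sqrt(2) runs, averaged over 30 starts.
bullpen_sd = math.sqrt(2) / math.sqrt(starts)

# (3) Pythagorean luck: SD of 4 games per 162, scaled to 30 games,
#     converted at ten runs per win (assumed), then expressed per game.
pythag_sd_games = 4 * math.sqrt(starts / 162)
pythag_sd_runs_per_game = pythag_sd_games * 10 / starts

sum_of_squares = (
    run_support_sd ** 2 + bullpen_sd ** 2 + pythag_sd_runs_per_game ** 2
)
combined_sd = math.sqrt(sum_of_squares)

print(f"run support SD:        {run_support_sd:.2f}")
print(f"bullpen SD:            {bullpen_sd:.2f}")
print(f"pythagorean luck SD:   {pythag_sd_runs_per_game:.2f}")
print(f"sum of squares:        {sum_of_squares:.2f}")
print(f"root of sum of squares: {combined_sd:.2f}")
print(f"runs over {starts} starts:   {combined_sd * starts:.0f}")
```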

That's the noise. To see a signal in that, you'd need a fairly big signal, I think. But it's an empirical question: what's the variation in "pitching to the score" ability, plus differences in parks, plus differences in opponents, plus differences in wind conditions?

Between starting pitchers on the same team, it would have to be 30 runs per season -- that is, 30 runs in difference between pitcher A's 30 starts and pitcher B's 30 starts -- just so you can say that when one pitcher goes 17-7 and another with identical stats goes 9-14, that HALF of the difference is due to actual environmental effects.

Tango suggests that it's not half -- that it's maybe 1/10. Even 1/10 seems high to me, but I trust Tango's judgment better than mine.



5:31 PM Oct 11th
 
tangotiger
Extending the thought of Game Score, in my version of Game Score (which is a tweak to Bill's version), Lee and Hamels both averaged 58 in 2012:
www.fangraphs.com/statsd.aspx?playerid=4972&position=P&type=&gds=&gde=&season=2012
www.fangraphs.com/statsd.aspx?playerid=1636&position=P&type=&gds=&gde=&season=2012

In Bill's original version, Hamels is at 60 and Lee is at 59.
www.baseball-reference.com/teams/PHI/2012-pitching.shtml#players_starter_pitching::none

Since neither version uses the W/L record, they both rely on some combination of the other three components that Bill has highlighted in his Win Shares metric. And we saw that Bill has them both at 3.42 runs per game.

Basically, we've just come full circle here that, at its heart, Win Shares is Game Score, with the noted adjustment for the W/L record.

2:01 PM Oct 10th
 
tangotiger
Bill: beautiful, just beautiful. Not only in how you balanced the W/L, but also in your explanation in the various components.

***

One little tidbit: if you limit the weighting to just the first three components, at the weighting of 4-3-2 like you have it, Hamels and Lee both come in as.... exactly 3.42! So, we do in fact get effectively identical pitchers, except for the W/L. And we've picked out the most extreme difference in W/L that we can imagine. This therefore becomes the best test case to describe the system's use of W/L.

The difference we see, 3.34 for Hamels and 3.58 for Lee is entirely due to the huge difference in win%, and so 0.24 runs per game.

You are showing 18-9 and 16-10, which is a Win Shares gap of 1.5 win shares or 0.5 wins.

I had suggested: "That's 1.3 wins, or 4 win share gap. And I think a more reasonable figure is half that."

Given that what you are giving is just about as reasonable as I could hope for, I think you've done a reasonably perfect job of balancing the W/L record with the rest of his pitching line as one can expect.

***

By the way, the "three true outcomes" is essentially FIP, which is the basis of Fangraphs WAR.

The "natural run average" is halfway between ERA and RA9, and Baseball Reference uses RA9 as its basis for WAR.

Component ERA is what Baseball Prospectus uses for its WAR.

In effect, you've taken the three poles of the three WARs out there and balanced them into one. Which is very close to how my version of Game Score was developed. So, what Bill is doing aligns very much with the direction I'm going in, which means that any disagreement I may have would be on the periphery.

***

On a related note: Bill is fond of saying that WAR is an estimate, and it's good when several estimates all point to similar answers. This is a good example of that.


1:10 PM Oct 10th
 
bjames
Responding to Tango’s interest in Cliff Lee and Cole Hamels, 2012, I figured the Win Shares and Loss Shares for the Phillies’ pitchers in that season.

First of all, let me note that there are some differences between the two which are perhaps a little bigger than Tom has suggested. Hamels had a very good year with the bat, hitting over .200 with a home run, although we can easily avoid mixing up THAT value with pitching value.

Thornier than that is the issue of the pitcher’s responsibility for what happens when he is on the mound. Two teammates can pitch the same number of innings with the same number of runs scored, and yet not have equal impact on the team, because of strikeouts, walks, and home runs allowed.

This has to be true, when you think about it in the extreme cases. Suppose that there was a pitcher who pitched 200 innings, but never struck anyone out, never walked anyone, and never allowed a home run—a ridiculous extreme, of course, but then there is Sherry Smith. In 1926 he pitched 188 innings with only 25 strikeouts, 36 walks, and 8 home runs allowed. He pretty much just put the ball in play and took his chances. The DEFENSE, in that case, is completely responsible for the outcome of those innings.

Suppose, on the other hand, there was a pitcher who pitched 200 innings, struck out 600 batters, but never allowed a hit other than a home run, but who nonetheless allowed (let’s say) 80 runs because he walked 150 batters and gave up 30 home runs. The PITCHER, in that case, would be completely responsible for the outcome of those innings.

Each pitcher might give up 80 runs, but you certainly could not say that what they had done was the same, or that their value was the same. The Sherry Smith-type pitcher might give up 60 runs with one team, which was strong defensively, but 100 runs if he was pitching for a bad defensive team. The Nolan Ryan-type pitcher is going to give up a certain number of runs no matter what, and you can have an entire team of DHs supporting him. That type of pitcher has to be held much more responsible for the innings that he pitches than does the other.

Of course no pitchers are at the extremes of this scale—but this does not mean that the differences between them as to the extent to which they should be assigned responsibility are not relevant. Hamels walked almost twice as many hitters as Lee and also struck out a few more, hit a few more with pitches, while Lee gave up two more home runs. It’s not a BIG deal, but. . .they’re not identical on this scale, either. We hold Hamels responsible for 26.96 Win Shares and Loss Shares, or one every 7.99 innings pitched, while Lee is responsible for 25.68 Win Shares and Loss Shares, or one every 8.22 innings pitched. There are other things that go into that assignment process, like Saves and Holds and the number of decisions that the pitcher has.

Anyway, this is the way the system works. We assign each pitcher a “Run Average”, which is actually composed of four different Run Averages:

A “natural” run average,

A “three true outcomes” run average,

A “component ERA” run average, and

A “wins and losses” run average.

The last of these uses Wins and Losses for starting pitchers. For relievers, it also considers Saves, Blown Saves, Holds, Inherited Runners, and Games Finished/No Save, but for starters, it’s just based on the won-lost record.

The “natural” run average—that is, the pitcher’s actual ERA, but with un-earned runs also weighted at one-half each—counts for 40% of the Run Average.
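A minimal sketch of that definition follows. The function name and the sample pitching line are hypothetical, not figures from the system. Counting each unearned run as half a run puts the result exactly midway between ERA and RA9.

```python
def natural_run_average(innings, earned_runs, unearned_runs):
    """Runs per nine innings, with each unearned run counted as half a run."""
    weighted_runs = earned_runs + 0.5 * unearned_runs
    return 9.0 * weighted_runs / innings


# Hypothetical pitcher (not a real stat line): 210 innings,
# 70 earned runs, 8 unearned runs.
innings, earned, unearned = 210.0, 70, 8

era = 9.0 * earned / innings
ra9 = 9.0 * (earned + unearned) / innings
natural = natural_run_average(innings, earned, unearned)

print(f"ERA {era:.2f}, natural {natural:.2f}, RA9 {ra9:.2f}")
```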

The “three true outcomes” run average is 30% of the total, and the “component ERA” type run average is 20% of the total. The won-lost record (stated as a run average) is 10% of the total.

Their “natural” run averages are 3.20 (Hamels) and 3.26 (Lee)—almost the same.

Their “Three True Outcomes” run averages are 3.64 for Hamels and 3.47 for Lee, so Lee gets a little bit of an advantage there.

Their “Component ERA” run averages are 3.54 for Hamels and 3.67 for Lee.

But the “run equivalent averages” based on their won-lost records are 2.63 for Hamels, since he was 17-6, and 5.05 for Lee, since he was 6-9.

Combining these in a 40-30-20-10 weighting, then, Hamels is assigned a “Functional Run Average” of 3.34, and Lee an average of 3.59. This results in Hamels being assigned Wins and Losses (Win Shares and Loss Shares, purely as a pitcher) of 18 and 9, while Lee is 16 and 10. Their winning percentages are .673 and .618.
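The 40-30-20-10 weighting can be sketched as a weighted average of the four figures quoted above. The dictionary keys and function names are illustrative only. The second function, which drops the won-lost component and rescales the remaining weights, is an assumption added for comparison and is not part of the system as described.

```python
# Weights as stated: natural 40%, three true outcomes 30%,
# component ERA 20%, won-lost record 10%.
WEIGHTS = {"natural": 0.40, "tto": 0.30, "component": 0.20, "won_lost": 0.10}


def functional_run_average(components):
    """Weighted average of the four run averages."""
    return sum(WEIGHTS[key] * components[key] for key in WEIGHTS)


def without_won_lost(components):
    """Illustrative only: drop the won-lost piece and rescale the other weights."""
    kept = {key: w for key, w in WEIGHTS.items() if key != "won_lost"}
    return sum(w * components[key] for key, w in kept.items()) / sum(kept.values())


# Rounded component figures quoted in the comment above.
hamels = {"natural": 3.20, "tto": 3.64, "component": 3.54, "won_lost": 2.63}
lee = {"natural": 3.26, "tto": 3.47, "component": 3.67, "won_lost": 5.05}

hamels_fra = functional_run_average(hamels)
lee_fra = functional_run_average(lee)

print(f"With W-L:    Hamels {hamels_fra:.2f}, Lee {lee_fra:.2f}")
print(f"Without W-L: Hamels {without_won_lost(hamels):.2f}, Lee {without_won_lost(lee):.2f}")
```

From the rounded inputs, Hamels comes out at about 3.34 and Lee at about 3.58. The 3.59 quoted for Lee presumably reflects unrounded components. With the won-lost piece removed, the two land within a hundredth of a run of each other, so in this sketch nearly all of the gap between them comes from the 10% won-lost component.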

When we add in their hitting performance, Hamels goes to 20-11 (actually 10.53 losses), and Lee goes to 17 and 12 (actually 12.498 losses). This results in Hamels being credited with a BIS WAR of 3.3, and Lee of 2.3.

These numbers are much lower than the WAR credited to these pitchers in other sources, but they are consistent with my understanding of their actual value. The Phillies were 81-81 in 2012; they’re an interesting test case because both their pitching and their hitting were almost exactly average. If we assume that the replacement level is .333 (or .332), that makes the Phillies 27 Wins Above Replacement, as a team. Their pitchers account for 11 of that, or about 40% of the value on the team, a little more than 40%, and Lee and Hamels account for a little more than half of the value of their pitchers.
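The team-level arithmetic in that paragraph can be checked with a short sketch. The variable names are illustrative; the inputs are the figures given above.

```python
team_wins, team_games = 81, 162
replacement_pct = 0.333  # replacement level assumed in the comment above

# Wins a replacement-level team would be expected to have over 162 games.
replacement_wins = replacement_pct * team_games
team_war = team_wins - replacement_wins

pitcher_war = 11                  # pitchers' share of team WAR, as stated above
lee_plus_hamels_war = 3.3 + 2.3   # the two BIS WAR figures quoted above

pitcher_share = pitcher_war / team_war
top_two_share = lee_plus_hamels_war / pitcher_war

print(f"Team WAR: {team_war:.1f}")
print(f"Pitchers' share of team WAR: {pitcher_share:.1%}")
print(f"Lee + Hamels share of pitching WAR: {top_two_share:.1%}")
```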




5:30 AM Oct 10th
 
MarisFan61
P.S. about "pitching to the scoreboard" (and more generally, as Frank talked about, playing to the scoreboard): On behalf of 'the other side,' i.e. the idea that it doesn't much exist, we need this clarification:
When sabermetrics says it doesn't exist, the intended meaning isn't really that it doesn't exist, but that there is very little differential among pitchers in their ability to do it. So, feeling that we know it 'exists' (which seems like absolute common sense) and citing examples of it don't at all address the traditional sabermetric claim. Of course it exists; I don't think anyone in any camp would deny it. The question is whether different players (in this case, pitchers) differ enough on it for us to think it's any issue in evaluating or comparing major league players.
11:31 AM Oct 8th
 
tangotiger
One last test case. This time I looked for a gap in ERA of .25 to .50, but with a win% .250 lower. Only three such pairs, with the most recent:

www.baseball-reference.com/teams/NYY/2001-pitching.shtml

Clemens 20-3, 3.51, and Mussina 17-11, 3.15. In that year, Clemens won the Cy Young, and Mussina got either 1 or 2 votes in all. Mussina also led the AL in FIP.

Their stat lines are extremely close, but in almost every case, Mussina has the edge. Mussina has 8 more innings with 7 fewer runs allowed (6 fewer earned). Mussina has 3 fewer hits allowed (though 1 more HR). Mussina has 1 more K, and 30 fewer walks and 1 fewer hit batter. He even had 8 fewer WP. He made 1 more start, completed 4 (Clemens 0), and shutout 3 (Clemens 0).

So, I'd like to see the results here. Does Win Shares - Loss Shares still keep Mussina with a slight edge? Or does Clemens's big lead in W-L let him overtake Mussina?

Looking forward to seeing the results!


11:12 AM Oct 8th
 
tangotiger
I looked for all pairs of teammates with at least 170 IP, where the gap in win% was at least .300 and the gap in ERA was at most 0.25. So, more guys like Hamels/Lee. Since 1982 there were 8 such pairs, one of them this year.

www.baseball-reference.com/teams/KCR/2016-pitching.shtml

Duffy was 12-3, 3.51, while Ian Kennedy was 11-11, 3.68. There were some relief games for Duffy, so that might mess us up.

Before that it was Lee/Hamels, which I think is a model test case. Before that:
www.baseball-reference.com/teams/CIN/2008-pitching.shtml

Arroyo was 15-11, 4.77, while Harang was 6-17, 4.78. I think those two match up remarkably well. So, I'd like to see what we're talking about with these two guys.








10:42 AM Oct 8th
 
tangotiger
This is another good test case:
www.baseball-reference.com/teams/PHI/2012-pitching.shtml

That's Cliff Lee 6-9 v Cole Hamels 17-6, with as identical a stat line as you'll find for any pair of teammates. That's a difference of .339 wins per decision. Is it possible that Cliff Lee happened to pitch in much easier run environments than Cole Hamels? They must have pitched in the same series several times, so same opponents, similar kind of weather, etc. We can look at their bullpens and see how they did on those days.

In any case, if that .339 wins per decision gap is going to be treated as a real .034 wins per decision gap (i.e., about one-tenth of the observation is real), I think that's the limit. That is, if you want to bump up based on their ERA Hamels by +0.8 wins and bring down Cliff Lee by -0.5 wins, that's I think the absolute limit. That's 1.3 wins, or 4 win share gap. And I think a more reasonable figure is half that.
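One reading of that arithmetic, as a sketch: applying the one-tenth gap to each pitcher's number of decisions is an assumption, though it does reproduce the +0.8 and -0.5 quoted above. The conversion of three Win Shares per win is the standard Win Shares scale.

```python
def win_pct(wins, losses):
    return wins / (wins + losses)


hamels_w, hamels_l = 17, 6
lee_w, lee_l = 6, 9

observed_gap = win_pct(hamels_w, hamels_l) - win_pct(lee_w, lee_l)
real_gap = observed_gap / 10  # "about one-tenth of the observation is real"

# Assumption: the real gap is applied across each pitcher's decisions.
hamels_bump = real_gap * (hamels_w + hamels_l)
lee_drop = real_gap * (lee_w + lee_l)

total_wins = hamels_bump + lee_drop
win_share_gap = 3 * total_wins  # three Win Shares per win

print(f"Observed gap: {observed_gap:.3f}, treated as real: {real_gap:.3f}")
print(f"Hamels +{hamels_bump:.1f} wins, Lee -{lee_drop:.1f} wins")
print(f"Total {total_wins:.1f} wins, about {win_share_gap:.0f} Win Shares")
```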

So, I'd like to see the Win Shares for the 2012 Phillies, so we can see exactly what we're talking about here. If the point is that instead of giving both of them 23 win shares, you give Hamels 24 and Lee 22, then, fine, there's not really much to argue there. We used the W/L in a very small manner, but we get to say we used it.


10:07 AM Oct 8th
 
ksclacktc
I remember reading in the 1982 BJ Abstract Starting Pitcher Rankings that the system was something like 70/30 ERA/WL record. Bill used the pitcher's run support and the record of his team in those games to form a Pythagorean %. The problem at the time was coming up with data. It is now easily accessible on B-R.
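For reference, a minimal sketch of a Pythagorean winning percentage built from run support and runs allowed. It uses the classic exponent of 2 and hypothetical inputs. It is not a reconstruction of the 1982 Abstract method, whose details are only loosely recalled here.

```python
def pythagorean_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


# Hypothetical pitcher: 5.0 runs of support and 4.0 runs allowed per nine innings.
expected = pythagorean_pct(5.0, 4.0)
print(f"Expected winning percentage: {expected:.3f}")
```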
9:31 AM Oct 8th
 
FrankD
More .... playing to the score is prevalent in NFL, Soccer, NHL and even NBA, of course the clock drives this. But in the old days see "Pitching in a Pinch" by Mathewson - classic pitching to the scoreboard: don't use yer best stuff till the game is on the line. But, even this was allowing a hit or two (HRs not being common till HR Baker hit 'em off Matty) but then bearing down to stop a run from scoring ........
11:50 PM Oct 7th
 
FrankD
Absence of evidence is not evidence of absence. Pitchers could pitch to the score. This seems evident to average experience. Work hard when you have to, slough off when it's OK. But you can't screw off too long, you'll get caught (lose the game) ...... But I don't think it's a very large effect: top athletes hardly ever let anybody 'win' for any length of time .........
11:39 PM Oct 7th
 
MarisFan61
I think the most important thing in this article, bigger even than the very important things about Win-Losses, is the loud recognition of how things that seem to have been demonstrated or ruled out by analysis haven't really been. I hope this 'sub-aspect' of the article won't be submerged.

It so happens that the two things that Bill mentions as side-examples of that -- clutch hitting and pitching to the scoreboard -- are two of the things that I've mentioned here a few times as prime examples of things that sabermetrics hasn't debunked to the extent that it seems to think it has, particularly the latter (i.e. "pitching to the scoreboard"), which I think is in the top 3 of such things. The other 2: importance of batting order, and importance of the hitter who hits behind whomever.

Another thing even more important than the very important Wins-Losses thing: Bill's willingness and eagerness to keep questioning and refining things that might have seemed to be settled.
10:39 PM Oct 7th
 
DanaKing
I remember when I first read Bill writing about the relative values of W-L records in the commercially published Abstracts in the 80s. I also remember a phrase--I wish I could remember the exact year and wording--where he said something to the effect of, "If you study something and can find no relationship, that either means 1.) there is no relationship, or 2.) the relationship is a hell of a lot more complicated than you thought."

Seems to me this is exactly the kind of situation he was referring to.

I also appreciate how his methods--and his descriptions of them--peel back the curtain of how science works. The things scientists "discover" have been sitting there in front of them since the dawn of time. The trick was for men to devise the means to interpret them.

Thanks.
2:21 PM Oct 7th
 
tangotiger
By the way, I agree with your point that the year-to-year test isn't actually that telling. This is because even if there is a REAL thing, like for example HR/PA, if the number of trials in each sample is low, you won't get a high correlation anyway. The best way to get a low correlation: make sure the number of trials in your sample is low. The best way to get a high correlation: make sure the number of trials in your sample is high.

Compare W/L records year to year? Low correlation. Compare W/L records for a career, lumping all W/L records of even years in one bucket and all W/L records of odd years in another bucket? High correlation.

Compare OBP month to month? Low correlation. Compare OBP year to year? High correlation.
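A small simulation sketches the point. Everything here is an assumption for illustration: 400 hypothetical hitters with fixed "true" on-base rates between .280 and .380, two independent samples per hitter, and 100 versus 600 plate appearances standing in roughly for a month and a season. The skill is equally real in both runs; only the number of trials changes.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_sample_correlation(trials, players=400, seed=2016):
    """Correlate two independent samples of `trials` each for the same players."""
    rng = random.Random(seed)
    # Hypothetical true talent levels; the same seed gives the same talents each call.
    talents = [rng.uniform(0.28, 0.38) for _ in range(players)]

    def observed_rate(talent):
        successes = sum(rng.random() < talent for _ in range(trials))
        return successes / trials

    first = [observed_rate(t) for t in talents]
    second = [observed_rate(t) for t in talents]
    return pearson(first, second)


small_sample = split_sample_correlation(trials=100)
large_sample = split_sample_correlation(trials=600)

print(f"100 trials per sample: r = {small_sample:.2f}")
print(f"600 trials per sample: r = {large_sample:.2f}")
```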
1:53 PM Oct 7th
 
tangotiger
First, I agree with the 2D representation using W/L. I do it for baseball, and I've done it for hockey and I've dabbled in it for basketball. It's clear, it's concise, it keeps the system "in check" because of a verifiable point: the sum of the individuals should add up to the whole, the team W/L record.

***

I think it's fair enough to say that if Drew Hutchison was on the mound when the Jays scored 7 (!) runs per start in 2015, while his mate RA Dickey was on the mound for 4 or 4.5 runs per start, that we shouldn't assume that they should both have received the Jays average of 5.6 (!) that year. Maybe the conditions Hutch pitched in were more like 5.8 and Dickey's were more like 5.4 or something.

www.baseball-reference.com/teams/TOR/2015-pitching.shtml#players_starter_pitching::none

I'm totally on board with that possibility. But, if we totally ignore run support, this is akin to totally ignoring the W/L record. Information is information. And Hutch got 2.5 to 3 more runs than Dickey. Maybe it should be 2 to 2.5 because Hutch played in tougher conditions. That should still knock out some .200 to .250 win% from Hutch's record.

Hutch had a 13-5 record with a 5.57 ERA, while Dickey was 11-11, 3.91. If the net effect is to suggest that Dickey's 11-11, 3.91 can be represented as 11-11, and that Hutch's 13-5, 5.57 can be represented as 9-9, then I think that is still too much deference paid to W/L records, and still not enough to run support.

Therefore, I would like to see what kind of impact the use of W/L records is having. I can accept "some" and "small", but I'd like to see its impact specifically on the Jays pitchers in 2015.

1:00 PM Oct 7th
 
 
©2019 Be Jolly, Inc. All Rights Reserved.