More Data I Must Have More Data

February 14, 2019
 

 

              OK, my son was home over the Christmas season, and while he was at home he updated some files that he created for me eight or ten years ago.   The files analyze game data from Retrosheet (retrosheet.org) and create game logs for starting pitchers.  The last time these files were updated—2014—I had game logs for the years 1952 to 2014, not every game but most every game.   The updated system gives me logs for the years 1921 to 2018, and also more data about each game—how many doubles and triples the pitcher allowed in each game, for example, and how many stolen bases he allowed and caught stealing.   I had about 240,000 pitcher starts in the log before, maybe 250,000, I don’t remember, but anyway now I have 329,988.   I spent about a month converting his files into the files I need, and since then I have been studying things, mostly based on Game Scores.  

              I have 82 pages of articles, essays, comments, explanations, etc. that I have written based on this data, with the intention of sharing that with you.   I don’t want to dump 82 pages on you today, because I don’t know that anybody would read it.   I’ll start publishing it at a rate of seven pages a day, five days a week; should last until the end of the month, more or less.    

              I know that when I do a long series of articles people kind of stop reading them after a while, but I do hope that you’ll come back and read the last article of the series, "A Conclusion in Regard to Baseball Reference WAR", which should run on February 28, I think.  Thanks. 

 

Coverage

I have 79% coverage for 1921, 79% for 1922, 81% for 1923, 82% for 1924, 81% for 1925, 83% for 1926, 85% for 1927, etc.   All of this is courtesy of Retrosheet.org, which is the third-greatest American Institution, after the Smithsonian and the Kansas University basketball team.

              Anyway, the percentage drops as low as 69% in 1944, and goes to basically 100% in 1958.   After 1958 there’s an occasional game missing.     We can’t accurately evaluate all pitchers from 1922 or 1935 or something, because we are missing some games for some pitchers.

 

Average Game Scores by Pitcher

              We have a Game Score for each pitcher in each game.   The simplest way to proceed toward a performance summary for each pitcher in each season is simply to look at the average Game Scores.   I’ve probably done this for you before, with less data, but the highest Average Game Score for any pitcher in the data is 76.1, for Bob Gibson in 1968:

Year  First    Last       St#  Year Avg
1968  Bob      Gibson      34      76.1
1968  Luis     Tiant       32      72.4
1965  Sandy    Koufax      41      71.8
1971  Tom      Seaver      35      70.9
1997  Pedro    Martinez    31      70.8

 

              These are the highest averages for pitchers with 30 or more starts in a season, within my data.  Let’s do the top 10:

Year  First    Last       St#  Year Avg
1968  Bob      Gibson      34      76.1
1968  Luis     Tiant       32      72.4
1965  Sandy    Koufax      41      71.8
1971  Tom      Seaver      35      70.9
1997  Pedro    Martinez    31      70.8
1985  Dwight   Gooden      35      70.4
1971  Vida     Blue        39      69.9
1963  Sandy    Koufax      40      69.8
1946  Hal      Newhouser   30      69.5
1972  Steve    Carlton     41      69.3

 

              Oh, hell, let’s do the top 20:

Year  First    Last       St#  Year Avg
1968  Bob      Gibson      34      76.1
1968  Luis     Tiant       32      72.4
1965  Sandy    Koufax      41      71.8
1971  Tom      Seaver      35      70.9
1997  Pedro    Martinez    31      70.8
1985  Dwight   Gooden      35      70.4
1971  Vida     Blue        39      69.9
1963  Sandy    Koufax      40      69.8
1946  Hal      Newhouser   30      69.5
1972  Steve    Carlton     41      69.3
1924  Dazzy    Vance       32      69.3
1966  Sandy    Koufax      41      69.1
1978  Ron      Guidry      35      68.9
1946  Bob      Feller      32      68.8
1968  Denny    McLain      41      68.8
1969  Bob      Gibson      35      68.3
1972  Gaylord  Perry       40      68.2
1999  Randy    Johnson     35      68.1
1968  Dave     McNally     35      68.1
1966  Juan     Marichal    36      68.0

 

              You’re probably wondering why I do things like that, aren’t you?   It’s because I want you to actually read the list.   If I just give you a list of 20 pitchers, you’ll just look at the top five.   But if I give you the top five, let you look at them, then give you five more, then you’ll look at the next five.   I’m relying on my understanding of how you process information to try to get you to take in a little more information.

              Of course, this is a very simple approach to the problem, and there are 50 different things "wrong" with it.   There are 50 different reasons why this is not a reliable list of the best pitcher/seasons in my data.   We’re going to attack that list of problems one by one, making the list more reliable and more reliable by making adjustments for biases in the data.

              But let me point out before I do that:  This is not a bad list.  Gibson in 1968, Koufax in 1963, 1965, and 1966, Ron Guidry in 1978, Vida Blue in 1971, Doc Gooden in 1985, Pedro Martinez in 1997, Steve Carlton in 1972. . . these are the greatest pitching seasons in baseball history, or among them.   We’re going to make the list better, but we’re not starting from zero.   The process, even at this naïve level, will find the Cy Young Award winner most of the time or half of the time or something.   It’s pretty good. 
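If you want to try the averaging step at home, here is a minimal sketch of it. It assumes the game log has already been boiled down to (year, pitcher, Game Score) records; the function name, the record layout and the toy numbers are all made up for illustration and are not my actual files.

```python
from collections import defaultdict


def season_averages(starts, min_starts=30):
    """Average Game Score per pitcher/season, best first.

    starts: iterable of (year, pitcher, game_score) tuples (hypothetical layout).
    Only pitcher/seasons with at least min_starts starts are kept.
    """
    totals = defaultdict(lambda: [0, 0])  # (year, pitcher) -> [starts, sum of scores]
    for year, pitcher, score in starts:
        entry = totals[(year, pitcher)]
        entry[0] += 1
        entry[1] += score

    rows = [
        (year, pitcher, n, round(total / n, 1))
        for (year, pitcher), (n, total) in totals.items()
        if n >= min_starts
    ]
    # Highest average first
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy log with invented scores, just to show the shape of the result
toy_log = [
    (1968, "A", 80), (1968, "A", 70),
    (1968, "B", 60), (1968, "B", 50),
    (1968, "C", 90),  # only one start, so dropped when min_starts=2
]
ranked = season_averages(toy_log, min_starts=2)
print(ranked)
```

The 30-start cutoff is the one used for the lists above; the toy example lowers it to 2 only so that three made-up pitchers are enough to show the filter working.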

 

Margins Above 50

 

              Of course, to average a Game Score of 56 in 40 starts is different from a Game Score average of 56 in 30 starts.    It has a different impact on the won-lost record of the team.  

              We can adjust for this by measuring instead the pitcher’s margin above or below average, assuming the average to be a Game Score of 50.  If a pitcher has a Game Score of 86, we record that as +36, meaning that, in that game, he is 36 points above average.  His value for the season is his total above average.   That moves Sandy Koufax, 1965, ahead of Gibson and Tiant, into the number one spot: 

Year  First    Last       Margin
1965  Sandy    Koufax        892
1968  Bob      Gibson        886
1963  Sandy    Koufax        793
1972  Steve    Carlton       792
1966  Sandy    Koufax        783
1971  Vida     Blue          775
1968  Denny    McLain        771
1971  Tom      Seaver        732
1972  Gaylord  Perry         728
1968  Luis     Tiant         717
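
The margin arithmetic is simple enough to sketch in a few lines. This is only an illustration of the idea described above (sum each start's distance from 50); the function name is made up, and the 56-average seasons are the hypothetical ones from the start of this section, not real pitchers.

```python
def margin_above(game_scores, baseline=50):
    """Season margin: total of (Game Score - baseline) over all starts."""
    return sum(score - baseline for score in game_scores)


# One start with a Game Score of 86 is +36
single_start = margin_above([86])

# Averaging 56 over 40 starts is worth more than averaging 56 over 30
forty_starts = margin_above([56] * 40)
thirty_starts = margin_above([56] * 30)

print(single_start, forty_starts, thirty_starts)
```

Same average, but the 40-start season comes out 60 points ahead of the 30-start season, which is the whole point of switching from averages to margins.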

 

Same pitchers, just a little different order.    By the way, the WORST pitcher with 30 or more starts was Jose Lima in 2005.   Lima was 5-16 with a 6.99 ERA.   His average Game Score was 38.2, and he was 377 points below average (below 50) for the season.    The -377 isn’t the worst season in the data; it’s the second-worst.   Claude Willoughby made 24 starts for the 1930 Philadelphia Phillies, finishing 4-17 with a 7.59 ERA.   We only have 18 of those 24 starts in our data, but his total for those 18 starts was -406.

 

Margins Above a Truer Average

This method (above) assumes that the average Game Score is 50.   Of course, the actual average Game Score is (a) higher in some seasons than in others, and (b) influenced by the park.   To move forward from this point, we need to remove those biases from the data.  Our list is dominated by 1963-1972 pitchers because those were pitching-dominated years. 

To remove those biases, we begin by figuring the average Game Score at home and on the road for every team in the data.   For example, the 1923 Philadelphia Phillies pitchers had an average Game Score, in their home games within my data, of 33.6.    They had the worst pitching staff in the league, park-adjusted, and they also had a park factor of 144.   The combination made the average Game Score of their starting pitchers, for the season, 39.0 (33.6 at home, and 44.4 on the road).   The 1968 Cleveland Indians had an average starting pitcher Game Score of 61.6, the highest in the data (61.5 at home, and 61.8 on the road).   Their starting rotation was Luis Tiant (21-9, 1.60 ERA), Sam McDowell (15-14, 1.81), Sonny Siebert (12-10, 2.97), Stan Williams (13-11, 2.50), and Steve Hargan (8-15, 4.15).

I figured the Average Game Score for every team’s pitchers, home and road, and also, equally important, the Average Game Score for every team’s opposing pitchers, home and road.   Based on that data, we can calculate a Park Effect for each park in each season.   These park effects work backwards from traditional park effects; that is, a hitter’s park leads to low Game Scores for the pitchers, thus a park effect below 100, whereas a pitcher’s park leads to high Game Scores, thus a park effect above 100.   Also, of course, park effects derived from Game Scores are not a perfect linear match with Park Effects derived from runs scored.
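
The exact park-effect formula isn't spelled out here, so take the following as one plausible sketch and not the actual calculation. It assumes the park effect is simply 100 times the ratio of the average starter Game Score (both teams) in a club's home games to the same average in its road games, and it weights the team and its opponents equally. The function name and every number in the example are invented for illustration.

```python
def game_score_park_effect(home_team_avg, home_opp_avg, road_team_avg, road_opp_avg):
    """Hypothetical Game Score park effect on a 100 = neutral scale.

    Assumption (not from the article): average the team's and the opponents'
    starter Game Scores in home games, do the same for road games, and take
    the home/road ratio. Above 100 means a pitcher's park, below 100 a
    hitter's park, matching the "backwards" convention described above.
    """
    home_avg = (home_team_avg + home_opp_avg) / 2
    road_avg = (road_team_avg + road_opp_avg) / 2
    return 100 * home_avg / road_avg


# Invented hitter's park: everybody's Game Scores sag at home
hitters_park = game_score_park_effect(45, 47, 50, 52)

# Invented pitcher's park: everybody's Game Scores rise at home
pitchers_park = game_score_park_effect(55, 53, 50, 50)

print(round(hitters_park, 1), round(pitchers_park, 1))
```

Using the opposing pitchers as well as the home staff is what keeps a bad pitching staff from being mistaken for a hitter's park, which is why the opponents' averages are called equally important above.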

With the park effects and the team averages, we can calculate the "actual" or "true" average for every game; that is, the expected Game Score for the start, based on the team and the park.   This also adjusts for year-to-year differences, of course, since each team is within a season.   The highest expected Game Score for any game(s) in the data is 63.48, which is the Expected Game Score against the 1965 New York Mets for their games played in Dodger Stadium.   The Mets had a terrible offense, scoring just 495 runs that season despite playing in a hitter’s park.  Dodger Stadium had a Park Effect of 76.  The combination would be expected to yield very high average Game Scores.   The second-highest average for any situation, 63.22, would be the 1965 Mets playing in the Astrodome.

On the other end of the spectrum, the lowest expected Game Score would be for the pitcher facing the 1936 New York Yankees in Sportsman’s Park in St. Louis.  The 1936 Yankees had a famously intimidating offense, which scored 1,065 runs, and Sportsman’s Park was the best hitter’s park in the league, with a Park Effect of 120.  (That is the park run effect; the park effect on Game Scores was .882, or 88.2 on a scale where 100 is neutral.)   The combination yields an expected Game Score, facing the 1936 Yankees in Sportsman’s Park, of 35.22.

OK, the expected Game Score can go as high as 63.48 or as low as 35.22, but that’s very unusual.  Of the 329,988 games in my data, the expected Game Score was higher than 60 for only 1,134, and lower than 40 for only 454.  More than 99.5% of the time, the expected Game Score was between 40 and 60.  For 84% of games, the expected Game Score is between 45 and 55.  It hangs around 50.
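
For anyone who wants to check the "more than 99.5%" figure, the arithmetic uses only the three counts given above. The variable names are just for illustration.

```python
total_games = 329_988
expected_above_60 = 1_134
expected_below_40 = 454

# Games whose expected Game Score falls outside the 40-to-60 band
outside_band = expected_above_60 + expected_below_40

# Share of games inside the band, as a percentage
share_inside = 100 * (total_games - outside_band) / total_games

print(outside_band, round(share_inside, 2))
```

That is 1,588 games outside the band, less than half of one percent of the data.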

 

Applying this to Pitchers

The New York Mets played a doubleheader against the Los Angeles Dodgers in Dodger Stadium on June 20, 1965, facing Sandy Koufax in the first game and Don Drysdale in the second.   Koufax pitched a 1-hitter with 12 strikeouts.  The one hit was a home run, but he won the game, 2-1.  In the second game Drysdale pitched a complete game as well, but gave up 9 hits and 3 runs, all earned, and the Mets beat him, 3-2.

Koufax had a Game Score of 91, which would be +41 if compared to 50, but when you adjust for the fact that he is facing the 1965 Mets in Dodger Stadium, it isn’t +41; it is merely +27.52.   Drysdale had a Game Score of 56, which would be a good game in a neutral situation (+6), but is actually a poor game under the circumstances, now scoring at negative 7.48.

On the other hand, Ivy Andrews of the Browns faced Joe DiMaggio, Lou Gehrig and company at Sportsman’s Park in St. Louis on July 22, 1936.   He pitched a complete game but gave up 10 hits and 5 walks, had no strikeouts but managed to beat the Yankees 6-5.   Giving up 10 hits and 5 runs in a game, no strikeouts, would not ordinarily be considered a strong performance, and yields a Game Score of only 42.   Compared to a neutral average he would be -8, but considering that he was facing one of the greatest offenses of all time in a bandbox ballpark, it’s actually pretty good.   Compared to expectations, it is +6.78. 

We will treat Drysdale’s contribution to the team, then, at -7.48—he did lose the game, after all—and Ivy Andrews at +6.78.  
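
The three examples above can be checked with a short sketch. Two things in it are assumptions and not statements from this article: it uses the standard published Game Score formula (start at 50, add 1 per out, 2 per inning completed after the fourth, and 1 per strikeout; subtract 2 per hit, 4 per earned run, 2 per unearned run, and 1 per walk), and it treats all five of the runs Andrews allowed as earned. The function names are made up for illustration.

```python
def game_score(outs, hits, earned_runs, unearned_runs, walks, strikeouts):
    """Standard Game Score for one start (assumed to be the version used here)."""
    innings_completed = outs // 3
    return (
        50
        + outs                                # +1 per out recorded
        + 2 * max(0, innings_completed - 4)   # +2 per inning completed after the 4th
        + strikeouts                          # +1 per strikeout
        - 2 * hits
        - 4 * earned_runs
        - 2 * unearned_runs
        - walks
    )


def margin_vs_expected(score, expected):
    """Game Score minus the expected Game Score for that team and park."""
    return round(score - expected, 2)


# Ivy Andrews, July 22, 1936: 9 innings, 10 hits, 5 walks, 0 strikeouts,
# 5 runs (assumed all earned)
andrews_score = game_score(outs=27, hits=10, earned_runs=5,
                           unearned_runs=0, walks=5, strikeouts=0)

# Expected Game Scores quoted above: 63.48 (1965 Mets in Dodger Stadium),
# 35.22 (1936 Yankees in Sportsman's Park)
koufax = margin_vs_expected(91, 63.48)
drysdale = margin_vs_expected(56, 63.48)
andrews = margin_vs_expected(andrews_score, 35.22)

print(andrews_score, koufax, drysdale, andrews)
```

Under those assumptions the Andrews line comes out to the same 42 quoted above, and the three margins match the +27.52, -7.48 and +6.78 in the text.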

OK, now we can recalculate each pitcher’s contributions to the success of his team, game by game.  The #1 season in our data is no longer Koufax or Gibson; it is now Pedro Martinez in the year 2000.  Pedro made just 29 starts; he was 18-6 with a 1.74 ERA, struck out 284 batters in 217 innings, and walked only 32.   Compared to expectations game by game, he was +796 points.   These are the top three seasons in my data:

Year  First    Last       Margin
2000  Pedro    Martinez    795.7
1999  Randy    Johnson     699.9
1999  Pedro    Martinez    681.4

 

              And these are the top six seasons in my data:

Year  First    Last       Margin
2000  Pedro    Martinez    795.7
1999  Randy    Johnson     699.9
1999  Pedro    Martinez    681.4
1965  Sandy    Koufax      679.7
1997  Roger    Clemens     659.9
1997  Pedro    Martinez    648.4

 

              Sandy Koufax in 1965 still does very well; his is still the fourth-best season in the data, among 16,000+  pitcher/seasons.  But the most dominant pitcher is no longer Koufax or Gibson, from the 1960s; it is now Pedro Martinez.  Martinez had amazing numbers, pitching in Fenway Park in seasons in which the league ERA was close to 5.00.  

              These are the top 51 seasons in my data, arranged chronologically by pitcher:

Year  First    Last       Margin
1924  Dazzy    Vance         646
1928  Dazzy    Vance         495
1931  Lefty    Grove         511
1937  Lefty    Gomez         529
1939  Bob      Feller        553
1940  Bob      Feller        583
1939  Bucky    Walters       500
1953  Robin    Roberts       496
1963  Sandy    Koufax        577
1965  Sandy    Koufax        680
1966  Sandy    Koufax        613
1965  Juan     Marichal      508
1966  Juan     Marichal      505
1968  Bob      Gibson        600
1969  Bob      Gibson        505
1968  Luis     Tiant         511
1971  Vida     Blue          606
1971  Tom      Seaver        584
1973  Tom      Seaver        551
1972  Steve    Carlton       614
1980  Steve    Carlton       562
1972  Gaylord  Perry         534
1974  Gaylord  Perry         518
1973  Nolan    Ryan          569
1977  Nolan    Ryan          527
1978  Ron      Guidry        585
1985  Dwight   Gooden        612
1986  Mike     Scott         594
1986  Roger    Clemens       516
1997  Roger    Clemens       660
1998  Roger    Clemens       518
1993  Randy    Johnson       497
1995  Randy    Johnson       582
1997  Randy    Johnson       552
1999  Randy    Johnson       700
2000  Randy    Johnson       594
2001  Randy    Johnson       647
2002  Randy    Johnson       616
2004  Randy    Johnson       568
1995  Greg     Maddux        537
1997  Pedro    Martinez      648
1999  Pedro    Martinez      681
2000  Pedro    Martinez      796
2001  Curt     Schilling     497
2002  Curt     Schilling     516
2004  Johan    Santana       517
2006  Johan    Santana       503
2009  Zack     Greinke       497
2011  Justin   Verlander     521
2015  Clayton  Kershaw       530
2017  Corey    Kluber        514

 

              I made it 51 so that I could sneak a second Dazzy Vance season onto the list.  He makes the list twice although we are missing 2 starts for him in 1924 and one in 1928, so I figured he should catch a break. 

              Now we have an opposite problem.   Whereas before the bias of the list favored a pitcher working in a pitcher’s park in a pitcher’s era, it now favors a pitcher working in a hitter’s park in a hitter’s era.   Why?  Because the average Game Scores are lower, which creates more "space" for the superior pitcher to work in, a larger canvas for him to paint on.   When you make the scores larger, you make the differences between pitchers larger. 

              That’s a smaller problem than the other one that we had.   This is LESS of a bias than the other bias, but it is still some bias.   We’re making progress.   We’re working on it. 

              The worst pitcher/season in the data?   Still Jose Lima in 2005, at -330.8.  

 

 
 
 

 
 