Flawed Logic

April 12, 2017


              Do you ever have situations in which you know there must be a hole in your logic somewhere, but you can’t figure out where it is?     Here’s one of mine.

              Let us suppose that there is a citizen of ancient Rome who had, let us say, six children who survived into adulthood, so that his having at least two generations of descendants is not in question.   Let’s call this citizen Barry.  Then it would seem to me that it must be true, in that case, that every human being on the planet now would be a direct descendant of that person, Barry, including those in Africa and China, although perhaps not someone in South America or Hawaii or Borneo or someplace, but that’s a separate issue and I don’t want to get distracted by that.  

              I sort of KNOW that there is something wrong with my logic here, but here it is anyway.   Let us say that the population of the earth in the year 100 BC was 300 million, which is a high-end estimate; most estimates are a little lower than that.   But assuming that there are 300,000,000 people on the earth and that only six of them are descendants of Barry in the first generation, that would mean that in the first generation .999 999 98 of the people on earth are NOT direct descendants of Barry.  

              But, unless two descendants of Barry marry (or produce children without marrying, gasp). . .unless two descendants of Barry marry, then in the SECOND generation this number would go to .999 999 98 squared, assuming that the descendants of Barry produce an average number of next-generation descendants.   .999 999 98 squared is .999 999 96.    In the third generation this number would go to .999 999 92.  

              Note that we are not assuming here that people are banned from mating with their siblings and cousins; we are merely assuming that it is statistically improbable.   If we assume that people are actually BANNED from mating with their siblings, then the number goes down slightly, although not by enough that it makes any difference.   It would make a difference later down the chain, but then, it’s a false assumption later down the chain, so there doesn’t seem to be any reason to worry about that.

              So in the third generation .999 999 92 of the world population is NOT descended from Barry, in the fourth generation .999 999 84, and in the fifth generation .999 999 68.    It takes about 20 generations for this number to move appreciably.    It actually takes 20 generations, using these assumptions, to reach the point at which 1% of the world’s population is descended from Barry.   But after 20 generations, things change very rapidly.   The percentage of the world which IS descended from Barry goes from 1% to 2% in the next generation, the 21st generation.    It vaults past 10%--actually well past 10%--in the 24th generation.   It reaches 50%, basically, in the 26th generation.    By the 29th generation only one-half of one percent of the world’s population is NOT descended from Barry.   In the 30th generation, a child is NOT descended from Barry only if someone in that one-half of one percent of the world’s population mates with someone ELSE in that same tiny sliver of the population.   Basically, by the 30th generation, the entire population of the world is descended from Barry.   In the 31st generation, it is statistically improbable that there would be a single person on earth who is NOT descended from Barry.
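For anyone who wants to check the arithmetic, here is a minimal sketch of the model as described: 300 million people, six first-generation descendants, and fully random mating, so that the share of people NOT descended from Barry is squared in each generation. The function name and its parameters are made up for illustration; the constant population is an assumption of the sketch, as it is of the argument.

```python
def not_descended_by_generation(population=300_000_000, first_gen=6, generations=31):
    """Return a list whose entry g-1 is the share NOT descended in generation g."""
    share = 1 - first_gen / population
    shares = []
    for _ in range(generations):
        shares.append(share)
        # Under random mating, a child is outside Barry's line only if
        # BOTH parents are outside it, so the share is squared.
        share = share * share
    return shares


shares = not_descended_by_generation()
for gen in (1, 2, 3, 20, 21, 24, 26, 29, 30, 31):
    descended = 1 - shares[gen - 1]
    print(f"generation {gen:2d}: {descended:.6%} descended from Barry")
```

Multiplying the generation-31 share by the population gives the expected number of people left outside the line, which comes out to less than one person.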

              Of course, due to racial sub-groups and isolation, after a few generations the matches are not random vis a vis Barry’s line yes/no.    Let us suppose there are two primogenitors, Barry and Umfumu, and that Umfumu was in Nigeria; my apologies if Umfumu is not a Nigerian name.   After the fourth generation those who ARE descended from Barry—let us call them Romans—are much more likely to mate with other descendants of Barry than with descendants of Umfumu.  

              But this doesn’t seem to make any difference in the key issue of whether, by the current time, every citizen of the world is descended from Barry (and from Umfumu).     It only takes 30 generations for Barry’s DNA to be included in the DNA of every person.   We have. . .well, probably about 84 generations to work with.     So if we assume that after the 30th generation there is ONE descendant of Barry who enters the breeding population of the descendants of Umfumu—just one--then in another 30 generations Barry’s line will have invaded the population of descendants of Umfumu, as well.    This doesn’t have to happen immediately; you actually have hundreds of years in there for one descendant of Barry to cross over.

              And, in fact, there have always been some small number of people who did move across barriers.    The Romans liked to capture lions and tigers and elephants and hippopotamuses and such and display them in Rome, and there were Africans who came to Rome with the animals as caretakers or ringmasters.   There were Africans who were raised in Rome and became prominent citizens of Rome.   There were Romans who were delegated to relatively remote parts of Africa ("remote" from the Romans’ point of view).   It just takes one in every several hundred years.

              Also, geographical isolation is more gradual than definitive, although there are some definitive barriers.   Town A is 25 miles from Town B, Town B is 25 miles from Town C, Town AY is 25 miles from Town AZ, but Town A is a thousand miles from Town AZ.   When you have 84 generations to work with, DNA can GRADUALLY work its way around the globe, moving just a few miles in each generation.

              Perhaps, if one area has an unusual degree of geographic separation from the rest of the globe, such as a Samoan island or a tribe in the Amazon jungle, then that tribe might be an exception to the rule.   But generally, it seems to me (logically) that it must be true that IF Cicero had living descendants, then you and I both HAVE to be descendants of Cicero.  

              But intuitively, it seems to me that there must be some flaw in this logic that I just have never been able to spot.   So. . .whaddaya think?



Problem 2

              OK, here is a vaguely similar problem which has to do with the 1955 Boston Red Sox.    The 1955 Red Sox scored 470 runs at home, 285 on the road, while allowing 395 runs at home, 257 on the road.   They played 78 home games, 76 on the road, but still, it creates a Park Factor for the season of 156, which is remarkable even for Fenway, where the Park Run Index was 108 in 1954 and 108 in 1956.  
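The 156 can be reproduced from the numbers above. The sketch below assumes the usual definition of a Park Run Index, which the text does not spell out: runs per game by both teams at home, divided by runs per game by both teams on the road, times 100. The function name is made up for illustration.

```python
def park_run_index(home_scored, home_allowed, home_games,
                   road_scored, road_allowed, road_games):
    """Runs per game (both teams) at home, relative to the road, times 100."""
    home_rpg = (home_scored + home_allowed) / home_games
    road_rpg = (road_scored + road_allowed) / road_games
    return 100 * home_rpg / road_rpg


# 1955 Red Sox figures quoted in the text.
index_1955 = park_run_index(470, 395, 78, 285, 257, 76)
print(round(index_1955, 1))
```

Dividing by games played is what handles the 78-versus-76 imbalance between home and road games.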

              Because of this extreme Park Run Index for 1955, if you ask "Who had the greatest pitcher’s season of the 1950s" and you rely on the Park Run Index, you will reach the conclusion that the greatest pitcher’s season of the 1950s was by Frank Sullivan of the Red Sox in 1955.   Maybe you won’t reach this conclusion if you use strikeouts and walks rather than runs allowed, but this is a dodge.    Jackie Jensen, who drove in 116 runs and had a .369 on base percentage, will be "normalized" to mediocrity because he is creating runs in an environment where runs are believed to be very abundant, more abundant than they actually were.  I have certainly, while using otherwise reasonable methods, concluded that Frank Sullivan in 1955 had the greatest pitcher’s season of the 1950s, and I am not the only person who has done this; other analysts have found themselves stuck with the same conclusion.

              Intuitively, we all know that this is not true; Frank Sullivan in 1955 was 18-13 with a 2.91 ERA in 260 innings, which is a very good season, but not better than Robin Roberts or Bobby Shantz in 1952, and probably not better than your average Warren Spahn season.   It merely LOOKS like an incredible season if you combine the 2.91 ERA and the 156 Park Run Index.  

              And, intuitively, we all know what the problem is.    A team CAN play a double-header in which one game is 1-0 and the other one is 16-13, and it isn’t necessarily the pitchers or the sun or anything; it just happens.   It CAN happen that all of your slugfests in a season (or 15 out of the 20) happen in your home park, while 15 out of your 20 pitcher’s duels are on the road, and this can happen without regard to the park effect.   It doesn’t happen OFTEN; it’s a one-in-a-thousand type fluke—but it happened here.  

              You can deal with this problem, as a statistical analyst, by using multi-season park effects; you can do that and I have, but there are logical problems (and practical problems) with that approach, as well.   What I am really asking is, when there is a fluke of this nature, how do we recognize that it is a fluke, without relying on external data such as data from other seasons?

              We are working on Win Shares and Loss Shares now, and we’re back to this problem.   Basically, I would rather use one-year Park Factors than three-year Park Factors, because I think generally this causes fewer problems and leads to more accurate measurements.   Also, while I will use strikeouts and walks and home runs allowed, I will rely heavily on runs allowed to evaluate pitchers, because it is more accurate to do that than to trust the thrie troo outkummes.  

              So I’m back to the problem:   How do I (1) use single-season Park Factors, and (2) use pitcher’s runs allowed, without (3) getting a stupid number for Frank Sullivan (and Jackie Jensen) in 1955? 

              In a way, it is parallel to the United Airlines problem. . .I realize that none of you will understand what the hell I am talking about, but I have been thinking about writing this article for several years, and decided to do so because the United Airlines problem reminded me of it.     All of the individual policies which led to the United Airlines Public Relations nightmare might be perfectly defensible.   You COULD have policies like "take care of any and all overbooking issues BEFORE you put people on the plane, not after" or "pay whatever you have to pay to get customers to accept being taken off a plane"; you COULD have those policies, but probably the policies they did have are individually defensible.   The only thing is that if you have that series of policies which CAN result in this kind of an outcome, then you need to have a fail-safe policy which says "No matter what happens, you don’t drag a 70-year-old man down the aisle of the airplane with blood running down his face."   

              The same here.   You COULD have a policy of using three-year or five-year park factors; that would be OK.   You COULD have a policy of using strikeouts, walks and homers allowed instead of runs allowed; that would be OK, although I’m not sure it would actually help in this case.   Those options would be OK, but the other options would be OK, too.  

              The only thing is that if we’re going to use single-season Park Factors and runs allowed rates, then we need to have some sort of fail-safe policy which says "No matter what happens, you can’t publish ratings that you know are wrong.    Frank Sullivan, 1955, was NOT the best pitcher/season of the 1950s, so you can’t publish ratings that show that he was."

              So what I am really asking is, "What is the fail-safe policy that protects us in a case like this?"



The Roomba and the Corner

              Our house has a lot of dust.    It’s a great old house, but it is twenty years older than Fenway Park, so that’s an issue, but also, neither my wife nor I is inclined to grab a dust rag.   I have allergies which the dust dus not help, and also, sometimes I am embarrassed by how much dust we have in the house.   A couple of years ago, I told Susie that what I wanted for Christmas was one of them machines that pulls dust out of the air.   We are reasonably well off and don’t need more stuff; we need less dust.  

              Well, you know how that goes; what you get for Christmas is what your wife thinks that you need, so she said "What about a Roomba instead?"   A Roomba is a circular vacuum cleaner that moves randomly around the room for a while.    It is not cheap, so we decided to get a Roomba for Christmas; this is what we got each other, half a Roomba each.   

              Well, the Roomba helps SOME with the dust in the house, although actually not very much; we still have dust covering everything in the house that isn’t moved every couple of weeks, but maybe it is 20% better than it used to be.   The Roomba doesn’t do a great job, honestly; it will run over a dust bunny in the carpet five times and just push it further down into the carpet on each pass.

              We’re not really unhappy with the Roomba.   It’s not like we don’t use the Roomba; we use it every week, and sometimes we’ll use it every day for a couple of weeks.    It’s not like it doesn’t pick up dirt and dust; it just doesn’t pick up ALL of the dirt and dust.   Maybe it gets 60% of what it should get, so you run it a second time, and then you’ve got 84%, so you run it a third time, and then you’ve got 94%, so you run it a fourth time, and then you’ve got 97-98%.    Having the Roomba do a room four times is a lot easier than vacuuming it yourself once, so it keeps the house a little bit cleaner, maybe.   Also, the Roomba goes automatically under beds, chairs, sofas, cedar chests, desks, chests of drawers, bathroom cabinets, etc.    In our house dust would pile up under some of those places for months, others for years, so there would be a lot of dust under there, contributing to the house’s general dust level.   Roomba gets in there and cleans that dust out, so that makes a difference.       
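The 60, 84, 94, 97-98 progression is just compounding. A small sketch, assuming each pass picks up the same 60% of whatever dust is still on the floor (the function name and the fixed per-pass rate are illustrative assumptions):

```python
def cumulative_pickup(per_pass=0.60, passes=4):
    """Share of the original dust collected after each successive pass."""
    remaining = 1.0
    totals = []
    for _ in range(passes):
        remaining *= 1 - per_pass  # each pass leaves 40% of what was left
        totals.append(1 - remaining)
    return totals


pickup = cumulative_pickup()
for n, share in enumerate(pickup, start=1):
    print(f"after pass {n}: {share:.1%}")
```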

              Anyway, one of the Roomba’s problems is that it loves corners.   It doesn’t actually GET to the dust in a corner; it’s round, so it can’t get into the corner of the corner.   I mean like a corner of the room.   If you have like a 2-foot by 3-foot area in a room, more or less blocked off from the room by furniture, Roomba will find that corner and get stuck there, cleaning that little corner of the room relentlessly until it shuts off.   Very often, if you pick it up and put it out in the center of the floor, it will immediately drive back into the corner of the room and get stuck there again.

              I don’t quite understand the mathematics of this; I sort of intuitively understand them, but I don’t really understand them.   On one level, it seems that if the Roomba can find its way through a 12-inch opening to get INTO a corner, it should be able to find its way through the 12-inch opening to get OUT of the corner.    It will, once in a while, but mostly not; mostly it just gets into the corner and stays there.    You have to learn to block off the corners before you start the Roomba, which is still easier than vacuuming the room yourself. 

              But it seems to me that this is a remarkably good symbol of what happens in life.   It seems to me that this is one of life’s great lessons:  that it is much easier to get INTO a corner than it is to get OUT of a corner.    I have known lots of people who got stuck in a corner, and just never got out, or got stuck in a corner and stayed there for decades before they could get out.   Drug use is a corner, drug addiction; it is a hell of a lot easier to get into this habit than it is to get out.    Alcohol use, tobacco use, sure, but there are lots of corners like that.    A 17-year-old boy gets his 15-year-old girlfriend pregnant; he’s in a corner.   They’re both in a corner.    They don’t really like each other; they don’t belong together, and they’re not financially able to provide for a child.   They get married, get divorced in three years.    They’re stuck in a corner.   It was a hell of a lot easier to get into that corner than it is to get out.    You make a couple of bad bets, you owe a bookie $10,000, you borrow $10,000 from a mob guy to pay the bookie, you’ll be paying the mob $100 a week for the rest of your life.   It was a lot easier to get into that corner than it is to get out.   A young guy takes a dead-end job, just trying to make a living; then he buys a car that is a little more expensive than he can really afford, so he can’t afford to quit the job and look for another one.   He’s in a corner.   I have known people who got stuck in a corner like that for years.    It was a hell of a lot easier to get in there than it is to get out.

              Mathematically, there is probably something to do with the relationship between floor space and perimeter space which predicts the difficulty the Roomba has in escaping a corner—but does the same math apply to the human problems, or does some parallel math that we are unable to see apply to those problems?   Just wondering.   Also, I still want one of them machines that pulls the dust out of the air.  


COMMENTS (60 Comments, most recent shown first)

I have to be honest...when I started looking into this, I really thought I was going to explain away Frank Sullivan's 1955 season. But the data didn't show me a way to do that.

In 1963, Koufax had an ERA on the road of 2.31; Sullivan's road ERA in 1955 was 2.35. It's a small sample, only 18 games for Sullivan, but it is very close.

Now, I'm not suggesting that we just ignore home games. Sullivan's home ERA was 3.53 (4.51 including unearned runs, which isn't so good): that home/road split for ERA is surprisingly close to what the park factor for the whole team suggests.

OK, it would be one of the all-time great flukes if Frank Sullivan had the best season of the 1950s. But is it out of line with the other great flukes? Hack Wilson? Davey Johnson? (Best not to mention Brady Anderson).
1:40 PM Apr 25th
Well, if, for the Red Sox in 1955, you use a factor of 153.2, then you get measurements which are not correct, are not reasonable. If you use 153.2 and base your analysis on runs, Frank Sullivan in 1955 is going to be the best pitcher of the 1950s. Frank Sullivan in 1955 is going to be as good as Koufax in 1963, or better. That's what I am trying to avoid.

Your comments are interesting, and there may be something useful in there. The observation that a model in which every park was actually the same but the data was random would create MEASURED park variances where none exist in fact is very interesting. We could use that approach to work on the problem I was trying to get to, I think.
3:20 AM Apr 21st
This week I did the most important baseball research that I have ever done.

The conclusion: when analysing past performance, we should ignore moderate park factors, meaning those between about 92% and 108%. Just use 100% instead. But as for the extreme park factors, say Coors or Dodger Stadium: they are almost, but not quite, as significant as they look.

A few examples, and then I'll explain.

Red Sox 1955: basic park factor (runs) is 155.5%, should use 153.2%.
Yankees 2013: basic park factor is 108.7%, should use 100.9%.
Detroit 2012: basic park factor is 107.1%, should use 100% exactly.
Washington 2016: basic park factor is 95.6%, should use 100% exactly.

You see the pattern there: if the basic factor is a long way from 100, we take just a small step towards 100; but if the basic factor is close to 100, then we take a big step towards 100.

Here's the explanation. Suppose you simulate a season for a whole league, where all the park factors are exactly 100%. At the end of the season, you calculate the park factors in the usual way for each ballpark. You will probably get a lot of numbers that are around 92% to 108%. But you know that these are all wrong, because you know that the correct answers are all 100%. All that you have done is measure the noise.
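A minimal sketch of that thought experiment, for anyone who wants to try it. Everything specific here is an assumption made for illustration: 16 teams, 77 home and 77 road games each, and total runs per game drawn independently from a Poisson distribution averaging 9. Real run scoring is more spread out than Poisson, and a real schedule links one team's home games to another's road games, so this probably understates the noise.

```python
import math
import random


def poisson(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulated_park_indexes(teams=16, games_each=77, runs_per_game=9.0, seed=1955):
    """Measured park indexes for a league in which every true index is 100."""
    rng = random.Random(seed)
    indexes = []
    for _ in range(teams):
        # Home and road games use the SAME run distribution: no park effect.
        home = sum(poisson(rng, runs_per_game) for _ in range(games_each))
        road = sum(poisson(rng, runs_per_game) for _ in range(games_each))
        indexes.append(100 * home / road)
    return indexes


indexes = simulated_park_indexes()
print(sorted(round(x, 1) for x in indexes))
```

Any spread in the printed values is pure noise, since no park in the simulation differs from any other.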

The question is, how do we separate the signal from the noise? My answer is to use an actuarial technique called Empirical Bayes Credibility Theory.

This works by looking at the normal variance in runs scored between one game and the next, and seeing how much of the difference between home games and road games can be explained by that. The really good thing is that we don't make any assumption about the distribution of the number of runs. A lot of statistical techniques assume the normal distribution...not here.

The worst thing about Empirical Bayes Credibility Theory is that the theory is quite complex; it's at about the level of a final year undergraduate mathematics programme. But it has at least been reviewed and accepted.

One good thing is that I'm not asking anyone to blindly accept it, or even to ask a friendly actuary to accept it for you. If you know how to simulate a season, then you can check that it gives good results. By "good", I mean that, while the revised park factors are certainly not perfect, they are closer to the truth than the basic way of calculating park factors. This is especially true if the true park factor is around 95-105 (as many appear to be); if the true park factor is out around 150, then my method is close to 50/50.

The actual calculations are not at all difficult, although it will be difficult to see why they have been done. I say that they are not at all difficult, but you could simplify them massively by just replicating that sentence about small steps and big steps.

I have sent an Excel file with my calculations to Bill's Stats Depository.
9:54 AM Apr 20th
(eyes opening), I always thought ole Dylan was saying that Genghis Khan could not keep all his kings supplied with "SHEEP". It's SLEEP!! Thanks, Bill, you're a lifesaver.
9:21 AM Apr 17th
A comedian we saw recently talked about how he took one of those DNA tests. He was surprised to see Mongolian show up in his analysis. Thinking there was something wrong, he called the service to make sure there wasn't a mix up with another customer. Before he could fully ask his question, the customer service rep broke in and said "Genghis Khan. Pretty much everyone is directly related to Genghis Khan."

Now, Khan had an extremely global reach for his day but he lived only 800 years ago in the 13th century. I think that this supports your overall conclusion that if you start with an individual in BC, then by now it is likely that most everyone in the world has a line back to that person.
8:53 AM Apr 15th
I travelled plenty. Why are you picking on me?
8:22 PM Apr 14th
@OldBackstop Sorry, I missed your comment. There is also the fact that until railroads came along, your chances of ever traveling more than 100 miles from your birthplace were pretty slim.
8:11 PM Apr 14th
I just said that, what am I, planted pot?
5:41 PM Apr 14th
One obstacle to the Cicero/Charlemagne lineage is that the plague, dysentery, or some other disease swept through populations regularly, wiping out entire families, which would do away with at least branches of the family tree. That's aside from warfare, a girl dying trying to give birth to her first child, and all the other calamities of pre-modern life that disrupted lineages.
4:34 PM Apr 14th
It would seem that somewhere in the explanation of the flawed logic needs to be the fact that as the percentage of the population that *is* a descendant of Barry grows, the probability that a random person would mate with someone who is *already* a descendant of Barry increases as well.

This was quite explicitly stated in my article. It still seems unlikely, to me, that THIS kind of effect would be meaningful, since it only requires 31 generations in the model, and we have 84 to work with.
10:37 AM Apr 14th
Here's a nice FAQ I found about some research on this, including some based on actual genetic data, rather than just on a statistical model:

It would seem to me, intuitively, that at this point in history a statistical model would be much more likely to have it right than a genetic analysis.
10:34 AM Apr 14th
I've heard versions of statements like "all Europeans are descended from Charlemagne" and suchlike for years: it's a claim you'll find in lots of popular accounts of human population genetics. I've always figured the argument for it is exactly what you described, Bill.

Here's a nice FAQ I found about some research on this, including some based on actual genetic data, rather than just on a statistical model:

10:22 AM Apr 14th
Thank you evanecurb,

I found them on the Italian Amazon site, including the HEPA version also made by Philips.

2:59 AM Apr 14th

I am probably not being clear about this, but I think we are mostly in agreement. You mention that the logic feels like it is flawed somewhere. I am just trying to say that I believe the logic is sound and that it is only an assumption made too strongly that is creating the illusion that there is a flaw.

Thanks. I'm still trying to figure out whether I SHOULD believe that we are all descendants of Cicero or not. But I still don't know.
9:50 PM Apr 13th
It would seem that somewhere in the explanation of the flawed logic needs to be the fact that as the percentage of the population that *is* a descendant of Barry grows, the probability that a random person would mate with someone who is *already* a descendant of Barry increases as well. (I'm not talking just cousins and siblings, but eighth time removed cousins, which would not be seen as odd and probably happens all the time). These mating pairs do not add to the percentage nearly as much as does a person who mates with someone who is not a descendant of Barry. Look at it this way - if a mating pair has three kids, you either have Barry/Not Barry-->BBB or B/B-->BBB. In the first case, one descendant of Barry produces three new descendants in the next generation. But in the second case, which would occur in greater proportions as time passes, two descendants of Barry produce three new descendants; meaning that the rate at which the percentage of descendants grows gradually reduces by half. Now factor in geographic boundaries. This would greatly increase the probability that a random descendant of Barry would mate another person who is already a descendant of Barry since you are far more likely than not to mate with someone who lives near you and only a relatively small percentage of the population ever migrates to a different part of the world. There's probably more to it than that, but that's my $0.02 worth.

9:26 PM Apr 13th
Genghis Khan had three wives and many consorts, and was a little "rape-y" at the office. An English publication speculated that he had thousands of children. Bet you he would have loved a Roomba around the hut.
7:37 PM Apr 13th

I am probably not being clear about this, but I think we are mostly in agreement. You mention that the logic feels like it is flawed somewhere. I am just trying to say that I believe the logic is sound and that it is only an assumption made too strongly that is creating the illusion that there is a flaw. If weakening that assumption to a more realistic level simply slows the rate of acceleration, then that is all it does.

I did a little more research into the methodology of the Genghis Khan study, and I need to backtrack a few steps. The first source that I read said something like, “DNA analysis shows that 0.5% of the world's population is descended from Genghis Khan.” However, this may not be an accurate representation of the study. Another source suggests that it is not the case that 0.5% of the world has any of Genghis Khan's chromosomes, but that 0.5% of males in the world have a particular chromosome from him, namely the Y-chromosome. If this is what the study found, then there are a lot more descendants than just 0.5% of the world's population. Also, these may not necessarily all be descendants of Genghis Khan but of a recent male ancestor of his (i.e., within a few generations).

The people with this chromosome are still concentrated in a large chunk of Asia. I suspect that they will be much less concentrated in a couple of generations.
4:37 PM Apr 13th
Honeywell refers to its machines that pull dust from the air as either air purifiers or HEPA air purifiers. I've also seen the terms air cleaner, ionic air purifier, allergy air purifier, and electrostatic air purifier. They come in all sizes and price ranges and they work very well. They don't eliminate dust, but they do cut down the amount.
3:00 PM Apr 13th
Quick and Dirty Park Factors (1 year).

My method is built on the old team-context approach, using the team's runs per game. I take 1 part (.25) the team's runs per game and 3 parts (.75) the league average to form my context. For example, if Team A scores and allows 10.0 runs/game in a league that averages 8.0 r/g, weighting 1:3 yields a context of 8.5. If you want the park factor, just divide the 8.5 by the league 8.0 and you get 1.06. This method regresses heavily because it uses only one year, but it still tends in the right direction.

BTW, this method will correlate quite well with the published PFs that you see and will give you an R of around .9 if I remember correctly. I used Sean Lahman's Baseball1 factors when designing it.
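
A short sketch of the weighting described above, using the 10.0 and 8.0 example. The function name and the `team_weight` parameter are made up for illustration; only the 1:3 weighting comes from the description.

```python
def quick_park_factor(team_runs_per_game, league_runs_per_game, team_weight=0.25):
    """Blend team and league run levels 1:3, then scale by the league level."""
    context = (team_weight * team_runs_per_game
               + (1 - team_weight) * league_runs_per_game)
    return context, context / league_runs_per_game


context, factor = quick_park_factor(10.0, 8.0)
print(context, round(factor, 2))
```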
1:10 PM Apr 13th
My in-laws gave us Roomba for our birthdays last month. I love the machine. We don't have carpet and run it once a day. We have 2,700 square feet on one floor and the Roomba covers that pretty well.

Yeah, the Roomba does really well on wood floors.
12:02 PM Apr 13th


If one conservatively assumes three generations per century, then we should be on the 24th generation of Genghis Khan descendants, which would indicate that well over 10% of the population would be descended from him if he were average. Someone with a lot more than six offspring should be higher. Instead, DNA analysis shows him at 0.5% of the world's population. This does not mean that the logic or the math is wrong. It only means that the assumption is wrong.

Genghis Khan/
could not keep/
All his Kings/
Supplied with sleep.

Never did understand that line. Anyway, this does not seem to advance the discussion. This indicates a SMALL deviation between my assumptions and the real world, not a large one. Given the assumptions of my model, a progenitor would reach 0.5% of the world's population in the 29th generation, and 10% in the 34th generation. A SMALL difference, not a large one.

But we knew anyway that this was a general model and not a specific model, and we would assume that there are some differences between the general model and the complexities of the real world. This is not news.

My model shows that, absent barriers and unknowns, DNA from a progenitor would sweep the world in 31 generations, while it has 84 generations to work with. Your Genghis Khan "exception" says, in essence, that it's not 31 generations; it's more like 40. Well, shit, we knew that, anyway. That doesn't address the central issue.

12:00 PM Apr 13th
My in-laws gave us Roomba for our birthdays last month. I love the machine. We don't have carpet and run it once a day. We have 2,700 square feet on one floor and the Roomba covers that pretty well.
11:54 AM Apr 13th
I checked the weather conditions in Boston in 1955, and I'm starting to wonder whether all that offense was because of weather. I'm seeing an awful lot of 90-degree days and even a 100 during that summer. Alas, I've not been able to track down averages for comparison.
11:35 AM Apr 13th

I should have said “a lot of first-cousin marriages” rather than “some first-cousin marriages.” Given your assumptions, your logic is sound. I did not personally check your math, but it looks plausible, again given the assumptions.

If one conservatively assumes three generations per century, then we should be on the 24th generation of Genghis Khan descendants, which would indicate that well over 10% of the population would be descended from him if he were average. Someone with a lot more than six offspring should be higher. Instead, DNA analysis shows him at 0.5% of the world's population. This does not mean that the logic or the math is wrong. It only means that the assumption is wrong.

What your logic suggests is that, rather than being statistically implausible, pairings between cousins (or between other reasonably close relatives) were quite common. Nearly all cultures have an incest taboo. The nature of this taboo varies by culture. Most cultures ban brother-sister marriages, but they vary on cousin marriages (perhaps even on half-sibling marriages). Even those that ban or strongly discourage first-cousin marriages are often OK with second-cousin marriages. Some cultures were OK with uncle-niece marriages (see Cleopatra's ancestors for an extreme example). Due to infidelities, there have been pairings between people who did not know that they were related. Furthermore, there have always been people drawn to the allure of forbidden fruit, despite or even because of existing taboos.

The next step may be to go backward. Given that the high end of the range after 24 generations is having 0.5% of the world's population as descendants, how common was cousin marriage or marriage between reasonably close relatives? This may be complicated by all the variations (e.g., first-cousin marriages, first-cousin-once-removed marriages, second-cousin marriages, half-sibling marriages). My guess is very common.
11:04 AM Apr 13th
Dear Bill James,

Could you please let us know the name of the other kind of machine, the one you wanted that actually pulls dust out of the air?

Dust is a continual problem for our house (Italy's cities are incredibly polluted, by the way), and everything is covered with dust.

This is the first I have heard of a machine that pulls dust from the air in your house. My wife and I would very much like to find one.

10:40 AM Apr 13th
A Park Run Index of 156 and a Park Adjustment Factor of 123 are the same thing. If a park increases offense by 56%, you only adjust the runs by 23% because only half of the games are in the home park and Sullivan doesn't have Fenway in his road parks. Actually 156 is equal to 124. . . there's a little tweak in there somewhere.
9:36 AM Apr 13th
Park Factors: use multi-year, maybe weighted (say, year 0 is full, years -1 and +1 are half, years -2 and +2 are quarter, and so on). Regress to estimated park factor based on park characteristics; see You won't have all those characteristics for 1955, but you'll have many of them.
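A minimal sketch of the weighting suggested above, assuming the weight halves with each year of distance from the season in question. The function name and the sample factors are hypothetical, made up only for illustration, and the regression toward a characteristics-based estimate is not shown.

```python
def weighted_park_factor(factors_by_year, year, max_offset=2):
    """Weighted multi-year park factor: weight 1 for `year`, 1/2 for +/-1, 1/4 for +/-2."""
    total = 0.0
    weight_sum = 0.0
    for offset in range(-max_offset, max_offset + 1):
        factor = factors_by_year.get(year + offset)
        if factor is None:
            continue  # skip seasons with no data
        weight = 0.5 ** abs(offset)
        total += weight * factor
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no park factors available near the requested year")
    return total / weight_sum


# Hypothetical single-year factors, for illustration only.
sample = {1953: 100, 1954: 100, 1955: 156, 1956: 100, 1957: 100}
print(round(weighted_park_factor(sample, 1955), 1))
```

With these made-up inputs, a one-year spike of 156 surrounded by 100s blends down to roughly 122, which shows how much the neighboring seasons pull on an outlier year.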
9:07 AM Apr 13th
Problem 1.

This sounds a bit like a branching process, like pages 45-49.

Conclusion: say that everyone's number of children has the Poisson distribution.

If the average number of children is anything bigger than 1, the expected number of descendants tends to infinity as the number of generations increases.

If the average is 2, corresponding to a stable population, then Cicero has a 14% chance of having zero children, but only a 6% chance of dying out at all the later generations combined.

If the average is 3 children, then Cicero has a 5% chance of having no children, but only a 1% chance of dying out later.

Average 1.1 children: 33% chance of no children, 49% chance of dying out later.

Average 1 child exactly, or less: extinction is certain.
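The figures above can be checked with a short sketch, assuming a textbook branching process in which every person independently has a Poisson-distributed number of children. The function names are mine, not from the book; the extinction probability is the smallest solution of q = exp(mean * (q - 1)), found here by simple iteration from zero.

```python
import math


def extinction_probability(mean_children, iterations=10_000):
    """Probability that a Poisson branching process eventually dies out.

    Iterates q <- exp(mean * (q - 1)) starting from q = 0, which converges
    to the smallest solution of q = exp(mean * (q - 1)).
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(mean_children * (q - 1.0))
    return q


def summarize(mean_children):
    """Return (P(no children), P(line dies out in some later generation))."""
    no_children = math.exp(-mean_children)  # Poisson P(X = 0)
    eventual = extinction_probability(mean_children)
    return no_children, eventual - no_children


for mean in (2.0, 3.0, 1.1):
    none, later = summarize(mean)
    print(f"mean {mean}: no children {none:.1%}, dies out later {later:.1%}")
```

Under these assumptions the results land close to the percentages quoted; for a mean of 2 the "dies out later" figure comes out a little under 7% rather than 6%, which looks like a rounding difference.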

It doesn't really matter that the earth's population is finite: you've noticed that by the time that we get to about 10% of the planet, the number increases rapidly. Cousin-marrying and immobility would slow things down: they make the model too complex for me to handle.

8:04 AM Apr 13th
Since weather can affect park effects, and weather is not constant, one year factors should be a consideration. However, how about a blend of multiyear and single year park effects? You could try weighting the year in question more heavily or some average of the two, but it seems to me that kind of approach might be the best answer: rather than one or the other, a compromise of both.
8:02 AM Apr 13th
Naturally as soon as I saw the Frank Sullivan book, I checked my own data for my new book. (By the way, it is formally entitled Baseball Greatness: The Best Players and Teams According to Wins Above Average, 1901-2016, and you can easily find it on Amazon, but unfortunately, publication is going very slowly and I am not sure when it will be available.)

I was pretty certain I would not find that Frank Sullivan had had the best season of the 1950s and in fact I found that based on my data--which comes from but substitutes DRA for the team fielding stats they use--Sullivan had 3.6 WAA in 1955 compared to 4.2 for Early Wynn and 4.9 for Billy Pierce. So obviously something was wrong somewhere.

The calculations rely on the park factor data from and they are indeed very different from Bill's, showing a one-year park factor of -123 for pitching and a multiyear of -109. I honestly don't know which one they are using in the calculations.

In either case the difference from Bill's figure of more than 150 would go a long way to explaining where Sullivan came out. I don't know if Bill can tell us why his figure differs so much from their one-year figure.

David Kaiser
7:46 AM Apr 13th
My ancestors are all from Appalachia. The leaves of the family trees there got all tangled up in the roots, and didn't spread very far.
7:44 AM Apr 13th
I don't have a good head for this stuff, but are you sure the logic *is* flawed? I know that when people argue about our most recent common ancestor, the recency is amazing to me, as little as 3,000 years some argue. I can't really wrap my head around it from a historical/movements-of-peoples perspective.
6:26 AM Apr 13th
According to the Seamheads Baseball Gauge site (an excellent site that probably too few know about), in terms of Win Shares (Bill James's measurement of baseball performance, the equivalent of WARs), Frank Sullivan's 1955 season was the 68th best (not a misprint) of any pitcher in the decade 1950-59, with 21.4 Win Shares. It was also his second best season in the decade, behind his 1957 season (23.0 Win Shares). The best season in the 1950s by any pitcher was by Robin Roberts in 1953 (not 1952), with 34.7 Win Shares.
If two cousins marry and produce offspring, that eliminates one-half of the total number of their ancestors, as they had the same two grandparents. Since in pre-modern times most people lived in small villages where their families had lived for generations, there would have been innumerable marriages of first, second, and third cousins who had many ancestors in common.​
12:51 AM Apr 13th
With Problem One, the flaw is not with the logic but with an assumption. Cousin marriages were common, not statistically improbable. If we go back 10-20 generations, there probably will be some first-cousin marriages.

But even assuming this is true, it doesn't materially change the math.
10:52 PM Apr 12th
Bill, wouldn't regional wars and plagues and genetic weaknesses cock up the nice tree chart, and have the effect of creating various Samoas?
10:50 PM Apr 12th

It's a tall order. On the one hand, you have a park that has a 150 park factor, surrounded by 105s. On the other hand, you can have a park, say old Coors, with the same 150, but surrounded by 130s.

The first was probably a true 115, that had a ton of good luck.

The second was probably a true 140, that had a bit of good luck.

But it sounds like you are saying you want to ignore the surrounding years, in which case you can't distinguish Coors from the other parks that had matching park factors. If you do that, then you have to treat all 150s as if they were 125, regardless of their surrounding years. Purposefully losing information to try to get a clean process doesn't, I think, work here.

10:48 PM Apr 12th
I'm not sure of the math, but I've read that everyone on Earth is descended from Queen Nefertiti of Egypt, who lived in the 14th century BCE, so quite a few generations before Ancient Rome. But everyone of European descent is related to Charlemagne, who died in 814 CE.
9:14 PM Apr 12th
Baseball Reference definitely uses multi-year park factors, based only on runs (unless they've changed something recently). Personally, I'm in favor of multi-year factors regressed to the mean, re-starting when there are physical changes to the park. Single-year factors would be regressed to the mean even more.

And that would be my suggestion for Bill: regress all single-year park factors to the mean.
9:01 PM Apr 12th
With Problem One, the flaw is not with the logic but with an assumption. Cousin marriages were common, not statistically improbable. If we go back 10-20 generations, there probably will be some first-cousin marriages. There will likely be plenty of second-cousin marriages or marriages between first cousins once removed and so on. If I go back to Ancient Roman times, I will probably find many individuals who take up 20+ slots in my family tree (and probably more if my eight great-grandparents were not born in six different countries, all of which are on the other side of the Atlantic Ocean).

Also, geography was not the only barrier. Social class probably played a role in separating ancestors, although this line would also be crossed on occasion.

To take a real-life example, there is a suggestion that approximately 0.5% of the world's population is descended from Genghis Khan (c. 1162-1227). Actually, they are descended from a male Mongol who traveled the route that Genghis Khan took, but it is more interesting to say that they were descended from Genghis Khan than from his servant or one of his lieutenants. I have also seen suggestions that 2% of people of European descent are descended from Charlemagne (742-814) or that there are tens of millions of descendants of Muhammad (c. 570-632) alive today. However, these are all unusually large numbers. In other words, we are all a lot more inbred than we like to think.
8:58 PM Apr 12th
Seventy-seven games is a small sample size. When faced with small sample sizes, you regress to the mean. You could use league average, and regress to 154 games. Or you could use the league average for the year and regress to that.
8:01 PM Apr 12th
Nice article....but I got all misty-eyed when I got to the Xmas gifts. You crazy kids...

How about I send you my English Crème Retriever pup for a few weeks, shedding her first winter coat, and I guarantee you dust will not be on your mind.
6:47 PM Apr 12th
If you go back in time, you have 4 grandparents if you go back 2 generations, 16 Gr.-Gr.-Grandparents if you go back 4 generations..... how many generations do you have to go back to pass 50 billion ancestors? Not that many, I'd bet. Thus it can be mathematically proven that every person has more ancestors than people who ever lived.
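The doubling above is easy to count out. This is a minimal sketch, assuming every slot in the family tree is filled by a different person, which is exactly the assumption that cousin marriages break; the function name is hypothetical.

```python
def generations_to_exceed(target):
    """Smallest number of generations back at which 2**n ancestor slots exceed target."""
    generations = 0
    slots = 1  # the person themselves, zero generations back
    while slots <= target:
        generations += 1
        slots *= 2
    return generations


print(generations_to_exceed(50_000_000_000))  # prints 36
```

At three generations per century, 36 generations is about 1,200 years, so the tree has to fold back on itself well before Roman times.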
6:41 PM Apr 12th
shthar "every human being on earth is related to every other human being on earth."

You mean... to tell me...that my cousin? *looks for balcony to jump off of*

6:37 PM Apr 12th
From what I can tell, BBRef uses multi-year park factors, weighted for each pitcher by the number of batters faced in each park. There's no indication they're using anything other than runs for park factors or for player valuation. The factor they use for 1955 Sullivan is 108.

BBRef's single year park factor for pitching in 1955 Fenway is only 123. I'd love to know how to reconcile that with the 156 Run Index above.
6:34 PM Apr 12th
Sounds like the roomba would fall prey to a fish trap....
6:09 PM Apr 12th
On Ballpark Run Index:
Would it be stupid/invalid to simply drop the game with the most runs allowed along with one shutout, sort of dropping the outliers? Or the top and bottom 1% or 10 games on either end or whatever?
6:06 PM Apr 12th
I suggest that our lives would all be immeasurably improved if we - readers here of BJOL - began referring to any Roomba-in-the-corner situation as "the Roomba in the jungle." I definitely wouldn't know how to measure the improvement.
5:21 PM Apr 12th
My Roomba goes into the laundry room and somehow manages to close the door on itself. Happens every time we forget to close the door to the laundry room.
5:10 PM Apr 12th
My advice is to get several of those machines that take the dust out of the air; you need one for each room, basically. You have to change the filters every 3 months or so, but they really work. And change your furnace filters, for goodness sakes. The combination of air handlers, furnace filters, and a Roomba will make a huge difference. One of the three, by itself, doesn't do much.

I don't know the answer to the 1955 park factor problem, but I had an idea. I don't know if this will help or not. In golf, when you figure the handicap of an average player (say, someone who shoots in the low 90s), you don't count any scores over double bogey. So, for example, if a guy shoots a 91 with a triple bogey and a quadruple bogey, it's an 88 for handicapping purposes. Could you build the same type of logic into park factors? In other words, no runs after the tenth run (or the 11th, or the 8th) would count when figuring park factors.
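A rough sketch of how the golf-style cap might look, assuming the cap applies to each team's runs in each game and that the run index is home runs per game divided by road runs per game, times 100. The function name and the game scores are hypothetical, made up only to show the mechanics.

```python
def capped_run_index(home_games, road_games, cap=10):
    """Park run index with each team's runs in a game capped at `cap`.

    Each game is a (runs_scored, runs_allowed) pair.
    """
    def runs_per_game(games):
        if not games:
            raise ValueError("need at least one game")
        total = sum(min(scored, cap) + min(allowed, cap) for scored, allowed in games)
        return total / len(games)

    return 100.0 * runs_per_game(home_games) / runs_per_game(road_games)


# Hypothetical scores, for illustration only.
home = [(12, 3), (5, 4), (20, 6)]
road = [(3, 2), (4, 5), (6, 1)]
print(round(capped_run_index(home, road)))           # with the cap at 10
print(round(capped_run_index(home, road, cap=100)))  # effectively uncapped
```

In this toy data the two blowouts account for most of the gap between the capped and uncapped index, which is the effect the handicap analogy is after.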
5:09 PM Apr 12th
I think the Roomba was invented by the same people who invented that electronic vibrating football game.
4:36 PM Apr 12th
Fireball Wenz
OK, I think I understand you to be saying it is impossible for Frank Sullivan to be that much better than every other AL pitcher because by 1955 we would all have the same DNA. But I don't see how the Roomba is relevant, because it wasn't invented in 1955, and even if it was, there is no way that Joe Mooney was going to allow that on the field.
4:32 PM Apr 12th
Also, unless that guy in Rome landed in a spaceship, his DNA was ALREADY in everyone on earth, because every human being on earth is related to every other human being on earth.
4:30 PM Apr 12th
I checked Sullivan's home/road splits and his own park factor was 1.50 (3.53/2.35 home/road ERA). Whatever was going around, some of it was stowed away in Sullivan's travel bag, too. Peace, my brother.
4:13 PM Apr 12th
1. It wasn't an overbooking problem, it was a SCHEDULING problem. United had 4 employees they had to get to another airport and somehow screwed up how they usually get em there.

2. No United employee even touched the guy. The COPS beat the hell out of him. But they usually get a pass.
3:47 PM Apr 12th
Re: Barry and his descendants, are you assuming that each person in each generation has kids or are you adjusting for lines ending?
3:46 PM Apr 12th
By "each other" I am not talking about siblings or first cousins, but relatives who can be very distant cousins.
3:27 PM Apr 12th
I think I am in over my head in problem 1, but I won't let that stop me. Rather than "taking off", doesn't the rate of the increase in (Barry descendants % of population) drastically decrease over time? As they become a larger part of the population, they much more frequently mate with each other, thereby slowing the rate at which they spread his DNA to non-Barry descendants.
3:25 PM Apr 12th
Rob--they're probably using strikeouts and walks and homers, ignoring or putting little value on actual runs. Which works SOMETIMES. Sometimes that doesn't work, either.
3:24 PM Apr 12th
Bill, I sorta doubt this helps much ... but I'll still mention in passing that doesn't seem to get the same nonsensical results with Sullivan, as his 1955 season, in terms of both ERA+ and WAR, look not terribly dissimilar to his 1954 and '56 seasons. And so they do not show his '55 season as being among the very best of the 1950s.

Now, maybe they're using multi-year park factors to figure those things. But I kinda doubt it.
3:15 PM Apr 12th
Re: Ballparks - The only thing I can think of is to collapse any significant changes from one year to the next that are not explained by changes in dimensions. So if in 1954 the park effect was 120, maybe you adjust the 1955 park effect to 138 (halfway between 120 and the measured 156). The adjustment acknowledges intuitively that a change that drastic must be part real change and part dumb luck.

This would only be done for the most extreme of changes from one year to the next, say the top ten percent. Some of those changes might be explained by changes in the park and thus no adjustments would be made. So it would only change the most extreme unexplained results, and leave the overwhelming majority of one year park effects unchanged.
3:09 PM Apr 12th
©2017 Be Jolly, Inc. All Rights Reserved.