Statistics can be used to trick or deceive. Statistics can also “prove” things that are not true at all. One of the reasons this can happen is related to an assumption referred to as iid. Iid is shorthand for independent and identically distributed. It is often a necessary assumption for statistical inference. Wrongly assuming that events are iid contributed to trillions of dollars in losses during the US housing crisis.
This is the third article in a series called “Statistics Lie” about how improper analysis can lead you to wrong or dangerous conclusions. As Mark Twain famously put it, “There are three kinds of lies: lies, damned lies, and statistics.” If you want to avoid mistakes in your own analysis, or identify flaws in the analysis of others, make sure to keep reading.
The first article in the series can be found here: The Flaw of Averages…or why it takes slightly longer to get your Uber
The second article in the series can be found here: Correlation and Causation…or why vaccines don’t really cause autism
The fourth article in the series can be found here: Normal Distribution…or it’s totally normal if your data is not normal
The fifth article in the series can be found here: Sampling on the dependent variable…or why waking up at 4am won’t make you successful
What does that mean?
If you have never taken an advanced statistics course you have probably never heard of iid, but when you break things down the concept is actually rather simple.
Independent in this sense means that the probability of the first event occurring does not change given that the second event did, or did not, occur. This is not the same thing as correlation or causation; though if events are independent, they will neither be correlated nor cause each other.
The mathematical way to show that two events A and B are independent is with the equations below. The first says that “the probability of A and B both occurring is the probability of A occurring multiplied by the probability of B occurring.”
$$P(A \cap B) = P(A)\,P(B)$$
The second is a bit trickier. It says that “the probability of A happening given that B already happened is the same as the probability of A happening.” Basically, it does not matter to A if B did, or did not, happen.
$$P(A \mid B) = P(A)$$
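Both conditions can be checked exactly on a small example. The sketch below is only an illustration: the two dice, the events A, B, and C, and the helper `prob` are my own assumptions, not something from the original analysis. It enumerates every outcome of two fair dice and tests each condition, then repeats the first test on a pair of events that are clearly not independent.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice (illustrative setup).
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate on one outcome."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def is_a(o):  # A: the first die shows a 6
    return o[0] == 6

def is_b(o):  # B: the second die shows a 6
    return o[1] == 6

def is_c(o):  # C: the two dice sum to 12 (this depends on the first die)
    return o[0] + o[1] == 12

p_a, p_b, p_c = prob(is_a), prob(is_b), prob(is_c)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
p_a_and_c = prob(lambda o: is_a(o) and is_c(o))
p_a_given_b = p_a_and_b / p_b  # definition of conditional probability

print(p_a_and_b == p_a * p_b)  # True: A and B are independent
print(p_a_given_b == p_a)      # True: knowing B tells us nothing about A
print(p_a_and_c == p_a * p_c)  # False: A and C are not independent
```

Using exact fractions instead of floating point keeps the equality checks honest. With real data you would only ever see approximate agreement.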
Identically distributed means that the first event and the second event are described by the same distribution. This assumption makes it more difficult to perform statistical inference between something that is normally distributed, like annual rainfall totals, and something that follows a power law distribution, like earthquakes.
A thought exercise
I live in Florida and I have a cousin who lives around 500 miles away in Louisiana. What is the probability of us both getting struck by lightning?
The traditional way to solve this, as we mentioned above, is to multiply the probability of me getting struck by lightning by the probability of him getting struck by lightning.
I found a stat that said the odds of getting struck by lightning in one year are 1 in 960,000. So the odds of the two of us both getting struck in the same year would be (1/960,000)*(1/960,000). No need to do the math to see it is incredibly small.
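For anyone who does want to do the math, here is a minimal sketch. It uses only the 1 in 960,000 figure quoted above and assumes, as this step of the argument does, that the two strikes are independent. The variable names are mine.

```python
from fractions import Fraction

# Yearly odds of one person being struck, as quoted in the article.
p_strike = Fraction(1, 960_000)

# Multiplying the two probabilities is only valid if the events are independent.
p_both = p_strike * p_strike

print(f"1 in {p_both.denominator:,}")  # 1 in 921,600,000,000
print(f"{float(p_both):.3e}")          # 1.085e-12
```

That is roughly one in 922 billion.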
But, in order for that statistical inference to hold, the probabilities of us each getting struck must be iid. Are they?
Independent – If my cousin is struck by lightning, does that change the probability of me getting struck by lightning? I would say that they do seem independent. Even Hurricane Katrina did not cause major storms where I live in Florida while it was striking Louisiana. So yes, it would seem to be independent.
Hurricane Katrina via satellite
Identically distributed – Are lightning strikes identically distributed between Florida and Louisiana? They should be as close to identical as natural phenomena get. Both are hot and humid states that get numerous thunderstorms from that heat and humidity as well as storms from the Gulf of Mexico. So yes, we can probably assume it would be identically distributed. (Or we can look at data, but hey, this is a thought exercise…)
If I am home in Florida, and he is home in Louisiana, we can assume that the events are independent and identically distributed.
Now let’s assume that we meet up in Pensacola and go out fishing in the Gulf of Mexico together. The events are no longer independent. If we are on the same boat out on the water and he gets struck by lightning, the odds that I also get struck are much higher than normal. That means the odds of us both getting struck go from something impossibly small to something that is actually quite realistic.
In other words, the probability of A happening given B already happened is much higher than the probability of A happening on its own.
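To put rough numbers on the boat scenario, the sketch below uses the general multiplication rule, P(A and B) = P(A | B) × P(B), which holds whether or not the events are independent. The 1 in 10 conditional probability is a made-up figure for illustration only, and for simplicity I keep my cousin's own odds at the quoted 1 in 960,000 even though being out on the water would change those too.

```python
from fractions import Fraction

# Yearly lightning odds quoted in the article.
p_a = Fraction(1, 960_000)  # P(A): I get struck
p_b = Fraction(1, 960_000)  # P(B): my cousin gets struck (held fixed here)

# HYPOTHETICAL: chance I am struck given he was struck while we share a boat.
p_a_given_b_boat = Fraction(1, 10)

# General multiplication rule: P(A and B) = P(A | B) * P(B).
p_both_independent = p_a * p_b             # valid only when P(A | B) == P(A)
p_both_same_boat = p_a_given_b_boat * p_b  # dependence enters through P(A | B)

ratio = p_both_same_boat / p_both_independent

print(f"independent: 1 in {p_both_independent.denominator:,}")
print(f"same boat:   1 in {p_both_same_boat.denominator:,}")
print(f"the joint event is {int(ratio):,} times more likely")
```

The exact ratio depends entirely on the invented conditional probability. The point is only that the independence shortcut can be off by many orders of magnitude.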
While my lightning example is a hypothetical, there are real world examples where one rare event makes other events much more likely.
Nassim Taleb, in his book “The Black Swan,” gives the example of 9/11. The odds of one plane hitting the World Trade Center are very small. But once AA11 intentionally hit the North Tower, a second plane hitting the South Tower was a virtual certainty.
Real world implications
The most obvious area where there are real world implications for this phenomenon is in insurance. Insurance works by spreading the risk of damages happening across the pool of insured. Pricing typically assumes that loss events are independent of each other.
Independence does not hold in the case of natural disasters. If a house on my street is destroyed by a wildfire or flood, the odds of my house being destroyed by a wildfire or flood are much higher than normal. Because of this, there are things like disaster relief funds and federal flood insurance.
However, a very similar effect played out in the housing crash in the US. The lack of independence was one of the many reasons for the severity of the downturn following the price crash.
In the securities markets, mortgages are bundled together and sold like a bond. When those bonds were rated for safety, the raters assumed that a default on one mortgage would be independent of a default on any other. This assumption mostly holds in good times.
However, in bad times it turns out that defaults are not independent. If your neighbor is foreclosed on, it increases the chances that your house will be foreclosed on. There are two primary reasons for this; one is a correlation and the other a causation!
Let’s start with the easy one, the correlation. If there is a recession or other economic downturn, it will harm many people. It is reasonable and accurate to assume that in most cases we will both have our odds of foreclosure increased by the economic downturn. In other words, our fates are correlated because they are both affected by an outside force.
The outside force is the “easier” effect to measure and predict. There are all sorts of models that forecast economic conditions and the bonds can be stress tested against these. Whether or not those models are worth anything is a separate question entirely that I am not addressing here.
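To see how much a shared outside force matters, here is a rough sketch with exact binomial arithmetic. Every number in it is hypothetical: a pool of 100 loans, a 10% chance of a recession, and default probabilities chosen so that each loan still defaults 5% of the time on average. It is not a reconstruction of any rating agency's model. It compares the chance of a very bad year (at least 20 defaults) when loans are treated as fully independent against the chance when they are independent only once you know the state of the economy.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), i.e. n independent loans."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# HYPOTHETICAL pool and economy; none of these figures come from real data.
N_LOANS = 100               # loans in the pool
THRESHOLD = 20              # a "very bad year": at least 20 defaults
P_RECESSION = 0.10          # chance of a recession
P_DEFAULT_RECESSION = 0.32  # per-loan default probability in a recession
P_DEFAULT_NORMAL = 0.02     # per-loan default probability otherwise

# Average (marginal) default probability of a single loan: 5%.
p_marginal = (P_RECESSION * P_DEFAULT_RECESSION
              + (1 - P_RECESSION) * P_DEFAULT_NORMAL)

# Model 1: every loan defaults independently at the average rate.
tail_independent = binom_tail(N_LOANS, p_marginal, THRESHOLD)

# Model 2: loans share the economy, so they are independent only
# conditional on whether a recession happened.
tail_shared_shock = (
    P_RECESSION * binom_tail(N_LOANS, P_DEFAULT_RECESSION, THRESHOLD)
    + (1 - P_RECESSION) * binom_tail(N_LOANS, P_DEFAULT_NORMAL, THRESHOLD)
)

print(f"average default rate per loan:     {p_marginal:.2%}")
print(f"P(>= 20 defaults), independent:    {tail_independent:.2e}")
print(f"P(>= 20 defaults), shared economy: {tail_shared_shock:.2e}")
```

Both models agree on the risk of any single loan. They disagree enormously about how often many loans fail at once, which is exactly the risk a bondholder cares about.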
The much harder effect to predict and quantify is how my neighbors’ decisions cause my odds of foreclosure to change.
Let’s assume that I purchased a house I can afford. I bought the house with little money down, and the payments are at the top end of the range for what I would be approved to pay.
My neighbors, on the other hand, purchased their house with an interest-only loan. Their payments ballooned when the interest rates reset and they are no longer able to afford the house. They are foreclosed upon and the house sits vacant for months or years as it goes through legal proceedings.
While the house next door is going through this process the value of my house falls. First it falls because a similar house to mine can be bought for a lower price in foreclosure. Later, it falls because having a house that is not properly maintained next door hurts my value.
While this process is going on, things may have changed in my life. Perhaps I got married and started a family and need more room. Perhaps I changed jobs and need to move to a new city. Perhaps a relative fell ill and I need to move in with them to be a caregiver.
None of those life changes would be problems if I could sell the house at a profit. But if there are foreclosures in my neighborhood I may be forced to sell for less than what I owe. In that case, especially if I do not have many assets for the bank to come after, it may be easier to be foreclosed on.
If I then choose a foreclosure it further hurts values in my area which further increases the chances of other foreclosures. It is this destructive feedback cycle which made the housing crash of ’08 particularly painful.
This homeowner pain was also felt by bondholders when the “isolated, rare, and independent,” events of foreclosures turned out to be none of the three. One of the many causes for that was failing to recognize that foreclosures are not at all iid.
How wrong could assuming independence make the models? S&P’s forecast five-year default rate was 0.12%, but the actual rate turned out to be 28%, more than two hundred times what S&P predicted!
As I mentioned earlier in this section, mortgages were bundled together and sold as bonds. Those bonds were sold all over the world and were held by all sorts of people who may not have known it. They were in money market funds, pension funds, and mutual funds. Some people bought them with borrowed money, and there were even insurance policies written that would pay off if they defaulted. That is how the contagion spread from a major portion of the US economy to the entire worldwide economy.
Assuming events are independent when they are not can lead to some of the most massive errors possible with statistics. If you are not highly confident that events are independent, tread lightly.
Nota Bene: There were other causes for the financial crisis of course. Anyone who claims there is only one reason for a major event is almost certainly a fraud.
Update: Convicted for Murder by Independence errors
In the late 1990s a woman named Sally Clark was wrongly convicted of murdering her two children due to a statistical reasoning error. She had the extreme misfortune of having two children die from Sudden Infant Death Syndrome (SIDS).
A key piece of evidence in her trial was the extreme unlikelihood that two children would both die from SIDS in the same household. The prosecutor’s expert presented evidence that the risk of SIDS was 1 in 8,543 live births for households where none of three identified risk factors are present. That would imply that the risk for two children to both die from SIDS is (1/8,543) * (1/8,543), which is roughly 1 in 73 million. That is so unlikely she must be guilty!
Do you see the problems with that approach? If you said, “but that assumes that SIDS deaths are independent. What evidence do you have that these are independent?” then you spotted the flaw.
SIDS is not well understood at all. There are some well known methods for helping prevent SIDS, e.g., “face up to wake up,” but we still do not know why it happens. The experts have identified three different risk factors that significantly increase the likelihood of SIDS occurring. What makes you think those are the only risk factors?
There are almost certainly other risk factors, including genetic, that increase the predisposition to SIDS. Assuming they do not exist, and therefore these two deaths were “independent,” is a tremendous miscarriage of justice via bad math.
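A hidden risk factor changes the arithmetic a great deal. The sketch below uses deliberately round, hypothetical numbers, not the figures from the trial: an overall risk of 1 in 10,000 per child, with 1 in 100 families carrying an unobserved factor that raises the per-child risk to 1 in 200. It also assumes that, within a family of a given type, the two deaths are independent. The naive calculation squares the overall risk. The mixture calculation squares the risk within each family type and then averages.

```python
from fractions import Fraction

# HYPOTHETICAL numbers chosen for illustration; not the trial's figures.
p_marginal = Fraction(1, 10_000)  # overall risk for any one child
share_high = Fraction(1, 100)     # share of families with a hidden risk factor
p_high = Fraction(1, 200)         # per-child risk in those families

# Per-child risk in the remaining families, chosen so that the overall
# risk across all families still equals p_marginal.
p_low = (p_marginal - share_high * p_high) / (1 - share_high)

# Naive approach: treat the two deaths as independent across the population.
p_two_naive = p_marginal**2

# Mixture approach: deaths are independent only within each family type.
p_two_mixture = share_high * p_high**2 + (1 - share_high) * p_low**2

# Risk to a second child given that a first child already died.
p_second_given_first = p_two_mixture / p_marginal

print(f"naive:   1 in {1 / float(p_two_naive):,.0f}")
print(f"mixture: 1 in {1 / float(p_two_mixture):,.0f}")
print(f"second child's risk after a first death: 1 in {1 / float(p_second_given_first):,.0f}")
```

Even though every single-child statistic matches, a first death is evidence that the family is in the high-risk group, so the second death is far less surprising than the squared figure suggests.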
There is another way statistics can lie that is present here, an idea called “The Prosecutor’s Fallacy.” This article is already too long, so I’ll save that for a future edition of Statistics Lie. In the meantime I recommend this video from Peter Donnelly where he discusses this case and some other items.