Correlation is not Causation
Everyone has heard the maxim that correlation does not necessarily imply causation. But how do we tell when there is genuine causation, when two things are merely associated because they share a common cause, and when it is just a coincidence?
This is the second article in a series called “Statistics Lie” about how improper analysis can lead you to wrong or dangerous conclusions. As the saying popularized by Mark Twain goes, “There are three kinds of lies: lies, damned lies, and statistics.” If you want to avoid mistakes in your own analysis, or identify flaws in the analysis of others, make sure to keep reading.
The first article in the series can be found here: The Flaw of Averages…or why it takes slightly longer to get your Uber
The third article in the series can be found here: Independent Events…or why it was iid and not id that fueled the financial crisis
The fourth article in the series can be found here: Normal Distribution…or its totally normal if your data is not normal
The fifth article in the series can be found here: Sampling on the dependent variable…or why waking up at 4am won’t make you successful
Sometimes random things are perfectly correlated
At first it seems impossible: two random variables which obviously have nothing to do with each other are almost perfectly correlated. When you think about it probabilistically, though, it is almost certain to happen. With an endless supply of potential variables you will eventually find unrelated ones that are highly correlated.
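To make the "endless supply of variables" point concrete, here is a minimal sketch using only the Python standard library. The series are pure random noise, and the sizes (5 vs. 200 series, 10 points each) and the helper names are my own assumptions for illustration, not anything from Tyler Vigen's data.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_spurious_pair(n_series, n_points, seed=0):
    """Generate unrelated noise series and return the most correlated pair."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    best = max(
        combinations(range(n_series), 2),
        key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])),
    )
    return best, pearson(series[best[0]], series[best[1]])

# Hypothetical sizes: a handful of variables vs. a large pile of them.
few_pair, few_r = strongest_spurious_pair(n_series=5, n_points=10)
many_pair, many_r = strongest_spurious_pair(n_series=200, n_points=10)

print(f"best |r| among 5 noise series:   {abs(few_r):.2f}")
print(f"best |r| among 200 noise series: {abs(many_r):.2f}")
```

None of these series has anything to do with any other, yet the more of them you search through, the stronger the best "relationship" you will find.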
If you do not believe me, here is an example from Tyler Vigen’s site dedicated to spurious correlations. He has so many of these that he made a book of them.
This one has an impossibly high correlation coefficient of 99.79%. Once you realize that both series are simply going up, it becomes a tad less impressive, but there are other correlations on his site where the lines go both up and down. See the spelling bee vs. spiders example; that one is really odd.
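The "they are both going up" effect is easy to reproduce. The sketch below builds two independent series that share nothing except an upward trend (the slopes and noise levels are made-up numbers), then compares the correlation of the raw levels with the correlation of the year-over-year changes.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def changes(values):
    """Year-over-year differences, which strip out a linear trend."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

rng = random.Random(7)
years = range(30)
# Two unrelated quantities that both happen to grow over time (hypothetical).
series_a = [2.0 * t + rng.gauss(0, 3) for t in years]
series_b = [5.0 * t + rng.gauss(0, 8) for t in years]

level_r = pearson(series_a, series_b)
change_r = pearson(changes(series_a), changes(series_b))

print(f"correlation of levels:  {level_r:.2f}")
print(f"correlation of changes: {change_r:.2f}")
```

Comparing changes instead of levels is one common sanity check for trending data; it will not catch every spurious correlation, but it removes the cheapest ones.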
That is obviously not causal, what about ones that could be?
This is the part where it gets tricky…and sometimes controversial.
Shark attacks are correlated with ice cream consumption. The reason for this is rather simple. Ice cream consumption peaks in the summer. Swimming in the ocean also peaks during the summer months. Swimming in the ocean is a requirement to be attacked by a shark. It doesn’t matter how unlucky you are, you will not be attacked by a shark while walking down the street.
In this case neither ice cream nor shark attacks cause each other. Having an ice cream does not increase the chances of you being attacked by a shark. The probability of both occurring is driven by the same thing: summer. You can think of their probabilities as being associated, i.e., they are linked to each other by a common cause.
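Here is a small simulation of that common-cause structure. All the numbers are invented for illustration: summer raises both ice cream sales and a made-up "shark risk" score, and neither one affects the other. If summer really is the only link, the correlation should be strong overall and roughly vanish once you look within a single season.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
is_summer, ice_cream, shark_risk = [], [], []
for _ in range(2000):
    summer = rng.random() < 0.25          # roughly a quarter of days are summer
    is_summer.append(summer)
    # Both quantities depend on the season only, never on each other.
    ice_cream.append(100 + 200 * summer + rng.gauss(0, 30))
    shark_risk.append(1 + 5 * summer + rng.gauss(0, 1))

def within_season(season):
    """Correlation using only the days that fall in the given season."""
    xs = [x for x, s in zip(ice_cream, is_summer) if s == season]
    ys = [y for y, s in zip(shark_risk, is_summer) if s == season]
    return pearson(xs, ys)

overall_r = pearson(ice_cream, shark_risk)
summer_r = within_season(True)
off_season_r = within_season(False)

print(f"all days:        r = {overall_r:.2f}")
print(f"summer only:     r = {summer_r:.2f}")
print(f"off-season only: r = {off_season_r:.2f}")
```

Splitting the data by the suspected common cause like this only works when you can actually measure that cause, which is often the hard part in practice.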
Now for the part where it gets controversial. Lots of studies in the social sciences, and even some in the hard sciences, have trouble with this issue.
Unfortunately, sometimes it is malevolent. Andrew Wakefield is a world-class piece of trash. He led a study that faked data to show that the MMR vaccine may predispose children to develop autism.
Note: If you believe vaccines cause autism, stop reading here. You have no desire to actually learn from data so please stop reading my content; you are wasting time you could spend chasing contrails or something.
Note 2: Also, the Wakefield study had a sample size of n=12. Yes, you read that right, a study with implications for systemic risk was based on a sample smaller than a baker’s dozen.
But what does fake data have to do with correlation and causation? Even though the study was completely fabricated and in no way accurate, lots of people believe the myth that vaccines cause autism. Sure, some of it is because people want to believe a conspiracy, but it is somewhat believable because there is a strong correlation between vaccination rates and autism diagnoses.
This is not fake data, this is not made up, AND this does not imply causation. There are tons of other reasons why autism could be increasing. It could be that we are getting better at diagnosing it, it could be a reaction to massive amounts of technology, it could be due to older parents, and it could be due to organic food consumption.
Wait, what? If you guessed that organic food consumption is even more highly correlated than vaccine growth, you win a prize (this graph).
Scientific American has a pretty good piece on the increase in autism if you care about the data. It seems the most likely explanation is that autism diagnoses and vaccination rates are associated only in that something in modern times drives an increase in both.
Does that mean there is no way to move from correlation to causation?
No, of course not. There are ways to increase your confidence that a correlation does in fact imply causation.
Note: Remember that in a complex system, it may be impossible to establish causation. With a limitless number of possible variables, you simply cannot study all of them, or all potential combinations of them.
The first, and most reliable, method of establishing causation is through a controlled experiment. Proper experimental design is key both to traditional sciences like biology and to data science. Both study complex systems, and both need to eliminate as many sources of randomness or uncertainty as feasible.
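To see why controlling who gets the treatment matters, consider this rough sketch. It assumes a made-up hidden trait that influences both who chooses the treatment and the outcome, while the treatment itself does nothing at all. The function name and every number in it are hypothetical.

```python
import random
from statistics import mean

def treated_minus_control(randomized, n=5000, seed=1):
    """Difference in mean outcome between treated and untreated subjects.

    The true treatment effect is zero by construction; any gap comes from
    how subjects end up in each group.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        hidden_trait = rng.gauss(0, 1)            # unobserved confounder
        if randomized:
            takes_treatment = rng.random() < 0.5  # coin flip, ignores the trait
        else:
            # Self-selection: people high on the trait opt in more often.
            takes_treatment = hidden_trait + rng.gauss(0, 1) > 0
        outcome = 2.0 * hidden_trait + rng.gauss(0, 1)  # treatment has no effect
        (treated if takes_treatment else control).append(outcome)
    return mean(treated) - mean(control)

observational_gap = treated_minus_control(randomized=False)
experimental_gap = treated_minus_control(randomized=True)

print(f"observational gap: {observational_gap:.2f}")
print(f"randomized gap:    {experimental_gap:.2f}")
```

Random assignment breaks the link between the hidden trait and the treatment, so the gap between groups should shrink toward the true effect (here, zero). This is only a toy; real experimental design has far more to worry about.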
Design of experiments is a complicated subject, and there are multiple university-level classes on it. I don’t feel comfortable advising on how to do this, and I don’t think I am particularly good at it, so I’ll refer you to my old textbook.
If a controlled experiment is impossible due to circumstances, there are ways to establish causality through observation. The best methods that I can think of come from the realm of biology. The US Surgeon General has published standards for the use of observational studies in epidemiology. They are as follows [1]:
- Strong relationship: Is it a large predictor or minimal?
- Strong research design: well, yeah.
- Temporal relationship: It is important that the cause must precede the effect; otherwise an associative relationship is much more likely.
- Dose-response relationship: This needs to be altered slightly for data science, but does the effect on the label increase as you increase the feature?
- Reversible association: Also important for eliminating association; does removal of the cause reduce the incidence of the effect?
- Consistency: When other people have looked at this, do they find the same thing?
- Plausibility: Is there a story that makes sense about why these two are related? Sure, there are causes and effects that you may not be able to explain, but if you can’t figure out why they are related, it’s far more likely to be coincidental.
- Coherence with known facts: If you’ve got a great model except that you need to disregard certain known facts or common assumptions…it’s probably not a great model.
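Some of these checks can be roughed out in code. As one example, here is a minimal sketch of the dose-response check in data science terms: sort rows by the feature, split them into bins, and see whether the average label climbs from bin to bin. The function names and the toy numbers are my own assumptions, and passing this check is evidence, not proof.

```python
from statistics import mean

def dose_response(feature, label, n_bins=4):
    """Mean label within each feature bin, from lowest dose to highest."""
    pairs = sorted(zip(feature, label))
    size = len(pairs) // n_bins
    means = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any leftover rows.
        end = (b + 1) * size if b < n_bins - 1 else len(pairs)
        means.append(mean(y for _, y in pairs[start:end]))
    return means

def is_increasing(values):
    """True when every bin's mean is strictly higher than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

# Toy data (hypothetical): the response climbs with the dose...
dose = [1, 1, 2, 2, 3, 3, 4, 4]
response = [0.9, 1.1, 2.0, 2.2, 2.8, 3.1, 4.2, 3.9]
# ...versus a label that ignores the dose entirely.
unrelated = [2.0, 3.0, 3.0, 2.0, 2.5, 2.5, 2.0, 3.0]

print(is_increasing(dose_response(dose, response)))
print(is_increasing(dose_response(dose, unrelated)))
```

With real, noisy data a strict bin-by-bin increase is a harsh test, so you may prefer fewer bins or a rank correlation instead.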
Do controlled experiments and/or observational standards remove the risk of falsely determining causality? No, of course not. But they do significantly reduce the risk of incorrectly assigning causality.
Improperly assigning causality, for instance convincing people to not vaccinate their children, can carry huge risks. Measles kills roughly 90,000 people per year. At least some of those deaths could have been prevented by proper vaccination.
Assigning causality is a large step and should not be taken lightly. There are methods, such as controlled experiments and standards for observational studies, that minimize the risk of incorrectly identifying a cause.
You should always be skeptical when someone claims causality. Withhold judgment until you understand the methodology and the reasoning.
[1] Slightly edited to remove biological terms