Sampling on the dependent variable is something you see all the time if you read clickbait articles like the crap in Business Insider. These articles typically start with something like, “things all successful people do…” and then make claims about waking up early, or drinking 3 cups of coffee, etc.
If you are the least bit skeptical or cynical like I am, you probably think these articles sound like BS. That’s because they are. There is actually a mathematical reason why; it is called sampling on the dependent variable.
These articles are also pervasive. I decided to get on Twitter while editing this, and what did I find from JD Long?
Independent and dependent variables
In any analysis you have one or more independent variables and a dependent variable. The dependent variable is the thing that you are trying to study. You make changes to the independent variable(s) and see what changes result in the dependent variable.
In machine learning you’ll hear this same concept referred to as features (independent variables) and labels (dependent variables).
OK, so what does sampling on the dependent variable mean?
When we say they are sampling on the dependent variable we are saying that they are working backwards in terms of experimentation.
What they are doing is saying, “let’s look at the people who have succeeded and see what they have in common. That must be the reason they are successful.” Obviously there are lots of problems that can arise from this line of thinking.
This is the fifth article in a series called “Statistics Lie” about how improper analysis can lead you to wrong or dangerous conclusions. As Mark Twain put it, “There are three kinds of lies: lies, damned lies, and statistics.” If you want to avoid mistakes in your own analysis, or identify flaws in the analysis of others, make sure to keep reading.
The first article in the series can be found here: The Flaw of Averages…or why it takes slightly longer to get your Uber
The second article in the series can be found here: Correlation and Causation…or why vaccines don’t really cause autism
The third article in the series can be found here: Independent events…or how it was iid and not id that helped fuel the financial crisis
The fourth article in the series can be found here: Normal distribution…or how it is totally normal if your data is not normal
What problems are there?
The first problem, which should be obvious if you read my correlation vs. causation article, is that this is not how you run an experiment. Experiments are the only real-world way to determine causality. You cannot run the experiment backwards by looking only at the times it worked.
Let’s take an example of a headline I saw, “the most successful people wake up at 0430!” Are there successful people who wake up at 0430? Of course there are.
I am a huge Jocko fan btw. He is talking about what works for him, not a made-up meta-analysis like some articles.
Plenty of people start out their days by working (or working out) when everyone else is sleeping. I personally do this and think it’s great for me. I suppose I am reasonably successful, but no one is writing articles about me.
But who else wakes up early? People with super long commutes, sanitation workers, soldiers in boot camp, insomniacs, etc. Are they all super successful? No, of course not. By definition only a few of them can stand above the rest, even though they are all doing the same thing.
The idea that all successful people do anything in common, much less waking up early, is absurd. But you probably knew that already.
Is waking up early a predictor of success? No, of course it isn’t. Even if it turns out that “all” successful people do it, it still is not a useful predictor, because there are far more unsuccessful people who also wake up early. That does not mean waking up early is a bad idea, just that the rationale makes no sense.
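To put rough numbers on that, here is a minimal sketch. Every figure in it is made up for illustration (a population of one million, 1,000 of whom count as “successful,” and two guesses for how many unsuccessful people wake up early), and the function name is my own. It just counts heads to get the chance that a given early riser is successful.

```python
def p_success_given_early(n_success, n_unsuccess,
                          p_early_if_success, p_early_if_unsuccess):
    """P(success | wakes early), computed by counting people (Bayes' rule)."""
    early_and_successful = n_success * p_early_if_success
    early_and_unsuccessful = n_unsuccess * p_early_if_unsuccess
    return early_and_successful / (early_and_successful + early_and_unsuccessful)


# Hypothetical population: these counts are assumptions, not real data.
N_SUCCESS = 1_000
N_UNSUCCESS = 999_000
base_rate = N_SUCCESS / (N_SUCCESS + N_UNSUCCESS)

# Scenario A: every successful person wakes early, 20% of everyone else does.
some_early = p_success_given_early(N_SUCCESS, N_UNSUCCESS, 1.0, 0.20)

# Scenario B: every successful person wakes early, and so does everyone else.
all_early = p_success_given_early(N_SUCCESS, N_UNSUCCESS, 1.0, 1.0)

print(f"Base rate of success:          {base_rate:.4%}")
print(f"P(success | early), scenario A: {some_early:.4%}")
print(f"P(success | early), scenario B: {all_early:.4%}")
```

Both scenarios would produce the exact same headline, because in both of them 100% of successful people wake up early. The answer depends entirely on the early-rising rate among the unsuccessful, which is the one number an article that only looks at successes never collects. And even in the friendlier scenario A, fewer than 1 in 100 early risers is successful.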
There are also problems due to the concept of silent evidence. We will cover that in a future article.
Why can’t I just study successes?
You can absolutely study features that successes have in common. In fact, you probably should. However, you should not take action on what you find without further testing. Finding a set of features that successes seem to have in common is a great reason to run an experiment.
It is only human to look for evidence that supports what you already believe. Affirmation feels good. This creates the risk that when you see something that predicts success you want to find further evidence to support it. The philosopher Karl Popper would refer to this as “sampling on theory affirmation.”
Instead what you should be doing is practicing what Karl Popper called the approach of “falsifiable hypothesis.” State what you think is predicting success and develop a hypothesis as to why. Then set up tests, methods, and experiments around proving your hypothesis wrong. That is the way to true knowledge.
Nassim Taleb titled his book The Black Swan after an example that Karl Popper gave. If you had the hypothesis that all swans are white, you could not prove it true no matter how many white swans you saw. But you could prove it false by observing a single black swan.
Nope, definitely not this black swan. The book is about risk, not ballet.
If you have features that seem to predict success, you should set up an experiment to verify. It does not matter if you are trying to determine causation or just test correlation; you need to see what happens when the independent variable goes through its range of possibilities. You need to see what happens when it is present AND what happens when it is not.
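Here is a small simulation of that idea. It is only a sketch under assumptions I picked for illustration: a made-up population where 30% of people wake up early, 1% are successful, and the two are generated completely independently, so waking early truly has no effect. The function names and parameters are mine.

```python
import random


def simulate(n=100_000, p_early=0.3, p_success=0.01, seed=0):
    """Return a list of (wakes_early, successful) pairs, drawn independently."""
    rng = random.Random(seed)
    return [(rng.random() < p_early, rng.random() < p_success) for _ in range(n)]


def success_rate(people, wakes_early):
    """Share of successful people among those with the given early-riser value."""
    group = [successful for early, successful in people if early == wakes_early]
    return sum(group) / len(group)


people = simulate()

# Sampling on the dependent variable: look only at the successes.
early_flags_of_successes = [early for early, successful in people if successful]
share_early_among_successes = sum(early_flags_of_successes) / len(early_flags_of_successes)

# Looking at the feature when it is present AND when it is not.
rate_present = success_rate(people, wakes_early=True)
rate_absent = success_rate(people, wakes_early=False)

print(f"Share of successes who wake early: {share_early_among_successes:.1%}")
print(f"Success rate, early risers:        {rate_present:.2%}")
print(f"Success rate, everyone else:       {rate_absent:.2%}")
```

The first number on its own invites a story (“a big chunk of successful people wake up early!”). It is only when you put the two success rates side by side that you can see the feature makes no difference in this simulated world. Keep in mind that this compares groups in observational data; a real experiment would also assign who wakes up early.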
Correlation and causation can both be important things to know. If you are not sure about the difference, or how to determine causality, please check out the earlier article on correlation and causation.
Ok, I get it, never sample on the dependent variable
No, there are times when it is ok, or even necessary to sample on the dependent variable. The reason is there are lots of statistical techniques to reduce or eliminate outliers, but not a ton to study them.
Note: This is hard to do correctly, and constrains what you can do and the usefulness of the results, so it is not that common.
So why would we sample on the dependent variable? Sometimes you have no choice. Rare things need to be studied as well.
Let’s be clear: humanity doesn’t NEED someone to study what made Michael Jordan who he is, regardless of how well it sells. I would argue many of those pieces provide at least as much bad advice as good advice. It is similar to any piece about business success that does not acknowledge that luck was a significant factor.
But you can be the greatest too if you just do these three things…or not.
Humanity does NEED someone studying rare diseases, genetic conditions, and seemingly chance occurrences. Those are the realm of serious science and analysis, not Business Insider articles.
Rather than re-print it, I found a decent article on a way that you can perform analysis while still sampling on the dependent variable. Check it out here: https://journals.sagepub.com/doi/pdf/10.1177/1476127012452820
If you are even remotely skeptical, you probably realized that articles that start with “the morning routine all successful people follow” are full of crap. It is crappy methodology even if the conclusion is correct.
It is tempting to fall for these things; I almost did, and that is the impetus for this article! Wouldn’t it be great if there were a morning routine that meant you would be successful, or if you could “pick up girls/guys with this one weird trick”?
But the issues of sampling on the dependent variable go way deeper than crappy self-help. How many business examples have you seen where they only want to study “what makes for the best clients or customers?” Keep in mind that you need to carefully consider how you select your inputs to study.