The flaw of averages
Most people have heard of the law of averages, but what about the flaw of averages? I would argue that understanding the flaw of averages is at least as important, especially if you don’t want to be fooled by bad analysis.
This is the first article in a series called “Statistics Lie” about how improper analysis can lead you to wrong or dangerous conclusions. As Mark Twain famously put it, “There are three kinds of lies: lies, damned lies, and statistics.” If you want to avoid mistakes in your own analysis, or identify flaws in the analysis of others, make sure to keep reading.
The second article in the series can be found here: Correlation and Causation…or why vaccines don’t cause autism
The third article in the series can be found here: Independent Events…or why it was iid and not id that fueled the financial crisis
The fourth article in the series can be found here: Normal Distribution…or it’s totally normal if your data is not normal
The fifth article in the series can be found here: Sampling on the dependent variable…or why waking up at 4am won’t make you successful
Why averages are often useless…or worse, dangerous
In his book “The Flaw of Averages,” Sam Savage illustrates the idea with a hypothetical example. Imagine you wanted to cross a river but did not know how to swim. Someone tells you that on average the river is 3 feet deep. It is safe to cross, right?
If you answered, “no, it’s a trap,” congratulations.
If the average depth of the river is 3 feet, that means that some of it may be less than 3 feet, and some of it deeper. How much deeper is an important question, isn’t it? If there is a large section that is 10 feet deep our river crosser will drown even though the average is 3 feet.
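To make the river example concrete, here is a minimal Python sketch. The depth readings and the 5-foot wading limit are invented for illustration; they are not from Savage's book.

```python
from statistics import mean

# Hypothetical depth soundings (feet), taken at equal intervals from bank to bank.
# These numbers are made up so that the average works out to exactly 3 feet.
depths_ft = [0.5, 1, 1, 1.5, 2, 10, 10, 2, 1, 1]

# Assumed limit: the deepest water a non-swimmer can safely wade through.
wading_limit_ft = 5.0

average_depth = mean(depths_ft)
deepest_point = max(depths_ft)
too_deep = [d for d in depths_ft if d > wading_limit_ft]

print(f"average depth: {average_depth} ft")
print(f"deepest point: {deepest_point} ft")
print(f"soundings deeper than {wading_limit_ft} ft: {len(too_deep)}")
```

The average alone says "3 feet," while the maximum and the count of too-deep soundings are the numbers the river crosser actually cares about.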
The distribution is far more important
The river crossing example does a good job of showing why the shape, or distribution, of the data is important. The average is only one data point and tells you essentially nothing about the shape of the data.
What about standard deviation? Isn’t that another data point?
The average and standard deviation, often referred to by the Greek letters µ and σ, are typically how summary statistics are represented. That is better than the average alone. For instance, if we knew that the river’s average depth was 3 feet, that its depth followed a normal (Gaussian) distribution, and that the standard deviation was 0.001 feet, we’d feel highly confident that the river was crossable.
That example makes sense to us because we all know the “story” of a river. That river sounds to us like it is a shallow river with gently sloping banks. We can intuitively picture that in our heads.
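A rough way to see how much the standard deviation adds is to ask how likely a too-deep spot is under the normal assumption. The sketch below uses Python's standard-library NormalDist. The 5-foot threshold and the second, larger standard deviation are assumptions of mine, and a normal model of depth is itself a simplification (it allows negative depths).

```python
from statistics import NormalDist

MEAN_DEPTH_FT = 3.0
TOO_DEEP_FT = 5.0  # assumed threshold: water over the head of a non-swimmer


def chance_too_deep(sigma_ft):
    """P(depth > TOO_DEEP_FT) if depth ~ Normal(MEAN_DEPTH_FT, sigma_ft)."""
    return 1.0 - NormalDist(mu=MEAN_DEPTH_FT, sigma=sigma_ft).cdf(TOO_DEEP_FT)


p_calm = chance_too_deep(0.001)  # the standard deviation from the example above
p_rough = chance_too_deep(1.5)   # a made-up, much larger standard deviation

print(f"sigma = 0.001 ft -> P(too deep) = {p_calm:.3g}")
print(f"sigma = 1.5 ft   -> P(too deep) = {p_rough:.3g}")
```

Both rivers have the same 3-foot average. Only the second number tells you whether to step in.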
However, summary statistics can get you into trouble. There are two famous examples showing how there can be huge differences in data sets that have the same summary statistics. The first example is Anscombe’s quartet, shown here.
You may have seen Anscombe’s quartet in a statistics textbook; it is a famous example. You can tell that those four graphs would mean very different things if you were trying to do something with the data, yet the means and variances of X and Y, the correlation between them, and the fitted regression line are all nearly identical.
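The summary statistics usually quoted for the quartet can be checked directly. The sketch below uses Anscombe's published data values and only the Python standard library. The Pearson correlation is computed by hand so the snippet does not depend on a recent Python version.

```python
from statistics import mean, variance

# Anscombe's quartet: sets I-III share the same x values, set IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "r": pearson(xs, ys),
        "y_range": (min(ys), max(ys)),
    }
    s = summary[name]
    print(f"{name:>3}: mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"r={s['r']:.3f} y_range={s['y_range']}")
```

Printing the rows side by side is a quick habit worth keeping: compute the summary, then plot the data anyway.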
While Anscombe’s quartet is famous, there is another example I prefer. The Datasaurus dozen doesn’t even make the pretense that it could look like real data. Imagine you were working with data you thought fit a bell curve and got a T-Rex instead!
Most uses of the standard deviation require assuming that your data is normally (Gaussian) distributed. This can lead to all sorts of issues.
Nassim Taleb rails against many of these in The Black Swan (he does this in pretty much all of his books, actually) and shows rather convincingly how assuming normal distributions in stock price movements dramatically underestimates the risk of large price movements. This central premise made him rich when others went bust in the 1987 market crash.
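To get a feel for how much a normal assumption can understate large moves, here is a small simulation sketch. It is not Taleb's model and uses no market data. As a stand-in for a fat-tailed distribution I have assumed a Student-t with 3 degrees of freedom, rescaled so that it has the same standard deviation as the normal it is compared against.

```python
import math
import random


def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))


def sample_scaled_t3(rng):
    """One draw from a Student-t with 3 degrees of freedom, rescaled to unit variance."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(3))
    t = z / math.sqrt(chi2 / 3)
    return t / math.sqrt(3)  # a t with 3 d.o.f. has variance 3 / (3 - 2) = 3


rng = random.Random(0)  # fixed seed so the run is repeatable
n_draws = 100_000
k = 5  # how many standard deviations counts as a "large move" here

normal_tail = normal_two_sided_tail(k)
heavy_hits = sum(abs(sample_scaled_t3(rng)) > k for _ in range(n_draws))
heavy_tail = heavy_hits / n_draws

print(f"normal model:     P(|move| > {k} sigma) = {normal_tail:.2e}")
print(f"fat-tailed model: P(|move| > {k} sigma) = {heavy_tail:.2e} (simulated)")
```

Both models have the same mean and the same standard deviation, so the summary statistics cannot tell them apart. Only the tails do.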
The flaw of averages shows up in business well beyond the kind of financial engineering Taleb practiced. I have personally seen numerous bad decisions made due to this issue. Uber recently published an article on how they avoided making a bad decision based on averages.
Uber Quantile Regression
Uber, like most tech companies, constantly experiments with updates to its products and algorithms. But how do they judge whether or not an algorithm change is an improvement over the status quo?
The beauty, and tragedy, of statistics in the real world is that there often is no “correct” answer. One way to evaluate a change against a control is the average treatment effect (ATE). In Uber’s case this would mean the difference in the average estimated time of arrival (ETA) of rides between the treatment group and the control group.
What could possibly go wrong with measuring that way? If you said the flaw of averages…
The data Uber collected showed that on average (ATE), the treatment would get customers their rides faster. Many customers would get rides even faster than currently possible. All good, right?
This graph shows that the ETAs under the new algorithm were much more widely distributed. The change, shown in the blue line, had an average estimated time of arrival (ETA) that was faster, but there were also many more slow ETAs.
This brings us to the part where data science and statistics cannot always answer the question. Does it matter that there are more slow ETAs?
Uber’s answer was yes, it does matter. Uber leaned upon its own experience and the psychological principle that people remember bad experiences more strongly than good ones. While the new algorithm would make a lot of people marginally happier, it would also significantly increase the number of very unhappy people.
The way Uber solved this problem was to use quantile regression. That is a fancy term for estimating how the treatment affects customers across the full distribution of outcomes, not just on average.
The PDF graph above made it fairly obvious how the treatment differed from the control. Another way to look at it, or to better quantify the differences, is to plot the quantile treatment effect (QTE) against the quantiles themselves.
This shows us that up until about the 60th percentile the treatment gets our customers rides faster (the QTE is negative, meaning ETAs are lower, e.g. 2 minutes until arrival instead of 3). Above the 60th percentile the treatment gets our customers rides more slowly, and around the 90th percentile that difference grows sharply.
The big problem here is that steep increase from the 90th percentile onward. That means that for a small group of customers the new treatment will make things much worse. Uber takes that information and (correctly) judges that making things somewhat better for the majority of people is not worth making things much worse for a smaller group.
If you are used to getting a ride in 3 minutes, but instead get it in 2 minutes, you might be pleasantly surprised but you also might not notice. If you are expecting a ride in 12 minutes, but instead it takes 20, you will absolutely notice, and you will be upset.
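The contrast between the ATE and the quantile treatment effects can be sketched on synthetic data. This is not Uber's data or Uber's method. I have assumed lognormal ETAs with made-up parameters, and I compare raw empirical percentiles of the two groups, which is a simpler stand-in for the quantile regression the article describes. The numbers are not tuned to reproduce Uber's chart.

```python
import math
import random
from statistics import mean, quantiles

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rides = 20_000

# Synthetic ETAs in minutes (assumed shapes, for illustration only):
# control has a 5-minute median and a narrow spread,
# treatment has a 4-minute median but a much wider spread.
control = [rng.lognormvariate(math.log(5.0), 0.3) for _ in range(n_rides)]
treatment = [rng.lognormvariate(math.log(4.0), 0.6) for _ in range(n_rides)]

# Average treatment effect: difference in mean ETA (negative means faster).
ate = mean(treatment) - mean(control)

# quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
control_pct = quantiles(control, n=100)
treatment_pct = quantiles(treatment, n=100)

# Quantile treatment effect: difference between matching percentiles.
qte = {p: treatment_pct[p - 1] - control_pct[p - 1] for p in (10, 25, 50, 75, 90, 99)}

print(f"ATE: {ate:+.2f} min")
for p, effect in qte.items():
    print(f"QTE at {p:>2}th percentile: {effect:+.2f} min")
```

In a run like this the ATE comes out negative, which looks like a win, while the upper percentiles show the treatment making the slowest rides slower still. That is the pattern an average hides.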
I recommend reading the entire Uber article. They do a good job of laying out what they did, with enough math to make it a legitimate example and not marketing trash, but not so much that non-nerds can’t read it. You’ll actually need to know a good bit of linear algebra to perform the techniques they discuss, however.