Statistics Lie (part 4): Normal Distribution…or it’s totally normal if your data is not normal


Statistics Lie, it’s Not Normal: Flaws of Assuming a Normal Distribution

 

The normal distribution is so common it is often taken for granted by non-statisticians.  However, real-world problems often follow when someone assumes their data is “normal” when it is not.  How do we recognize and avoid these mistakes?

 

What is the Normal or Gaussian distribution?

 

The normal distribution, commonly known as the bell curve (see end note 1), was described in the early 1800s by several people, including Gauss and Laplace.  It is especially useful for describing natural phenomena, because many of them are approximately normally distributed.  If you were to measure the weight of a certain species of bear, the size of a type of leaf, etc., the measurements would very likely be close to normally distributed.

 

The fact that many natural phenomena are approximately normally distributed made the normal distribution very popular in science.  Another reason for its popularity is the central limit theorem, which essentially says that the sum or average of a large number of independent observations is approximately normally distributed, even when the individual observations are not (provided their variance is finite).

 

If you aren’t sure of a variable’s distribution, all you need to do is collect a ton of data and the averages you compute from it will converge on normal.  How great is that?

 

It’s a bit more complicated than that, but that is the general idea.
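Here is a small sketch of what "a bit more complicated" means, using only the Python standard library. The exponential distribution, the sample size of 50, and the helper names `skewness` and `sample_means` are assumptions picked for illustration. Skewness is about 0 for a symmetric bell shape, so it works as a rough yardstick here.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: roughly 0 for symmetric data such as a normal distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def sample_means(sample_size, n_means=5000, seed=0):
    """Means of n_means independent samples from a (skewed) exponential distribution."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_means):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

raw_skew = skewness(sample_means(1))    # sample_size=1 is just the raw observations
mean_skew = skewness(sample_means(50))  # each value is the average of 50 observations

print(f"skewness of raw observations: {raw_skew:.2f}")
print(f"skewness of averages of 50:   {mean_skew:.2f}")
```

The averages look far more symmetric than the raw observations, which is the central limit theorem at work. The raw observations themselves stay skewed no matter how many you collect; it is the averages that drift toward normal.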

 

The normal distribution is described by µ, which is the mean (average), and σ, which is the standard deviation.  With just these two numbers you can draw the distribution: both the probability density function and the cumulative distribution function.  More importantly, you can also predict how likely, or unlikely, an observation is.

 

Suppose a bald eagle’s adult wingspan could be described by a normal distribution (see end note 2).  With the mean and standard deviation we could say what percentage of bald eagles have a wingspan greater than 7 feet.  Or less than 6 feet.  Or between 6 and 7 feet.
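As a rough sketch of how those three questions get answered, here is a short Python example using the standard library's `NormalDist`. The mean of 6.5 feet and standard deviation of 0.4 feet are made-up numbers for illustration only, not real eagle measurements.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration (not real eagle data).
wingspan = NormalDist(mu=6.5, sigma=0.4)  # feet

p_over_7 = 1 - wingspan.cdf(7.0)                   # wingspan greater than 7 feet
p_under_6 = wingspan.cdf(6.0)                      # wingspan less than 6 feet
p_between = wingspan.cdf(7.0) - wingspan.cdf(6.0)  # between 6 and 7 feet

print(f"greater than 7 ft:  {p_over_7:.1%}")
print(f"less than 6 ft:     {p_under_6:.1%}")
print(f"between 6 and 7 ft: {p_between:.1%}")
```

Everything comes from the cumulative distribution function, which gives the share of the population at or below a given value. Swap in a different mean or standard deviation and all three answers change, which is why getting the distribution right matters so much.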

 

The normal distribution is not just present in nature.  Six Sigma and other statistical process control methodologies are based on the normal distribution.  This works well in manufacturing because many physical processes are normal or approximately normal.

 

Sounds useful, right?  Part of the reason it is so common is that it is so useful and among the easiest statistical distributions to work with.

 

This is the fourth article in a series called “Statistics Lie” about how improper analysis can lead you to wrong or dangerous conclusions.  Because, as the saying popularized by Mark Twain goes, “There are three kinds of lies: lies, damned lies, and statistics.”  If you want to avoid mistakes in your own analysis, or identify flaws in the analysis of others, make sure to keep reading.

The first article in the series can be found here: The Flaw of Averages…or why it takes slightly longer to get your Uber

The second article in the series can be found here: Correlation and Causation…or why vaccines don’t really cause autism

The third article in the series can be found here:  Independent events…or how it was iid and not id that helped fuel the financial crisis

The fifth article in the series can be found here: Sampling on the dependent variable…or why waking up at 4am won’t make you successful

 

What happens when distributions are not normal?

 

There are lots of statistical distributions.  If a variable is not normal, that doesn’t mean it is bad or difficult to predict; it is just from a different distribution.  What is bad is when you think something is normal and make decisions on that assumption, but it turns out that it isn’t.

 

If you read any of Nassim Taleb’s books, especially The Black Swan and Fooled by Randomness (both of which I highly recommend), you will see he has lots of examples of terrible consequences.  In fact, he made a fortune betting on rare events.  Even though these events are rare, predicting them with a normal distribution significantly underestimated their probability.

 

This phenomenon is often referred to as tail risk or the problem of fat tails.  Lots of people have written good material on this (especially Taleb) so I will share a personal example instead.

 

“Proving” untruths with the normal distribution

 

One job I had in the US Navy was working as an analyst for a large staff.  There was a process where we had significant issues for the first time anyone could remember.  This particular process needed to complete something within 30 days.  The faster the better, but past 30 days was really bad.  There was a massive difference between 30 and 31; we could classify anything 31 days and greater as a failure.

 

After years of zero failures we had three failures within 6 months.  The Navy hired a consultant to study the issue and they came to a conclusion that was precisely wrong.  Do you want to guess why?

 

If you said, “because they assumed the process followed a normal distribution but it actually didn’t,” you are right.

 

 

The consultant’s report looked at several thousand observations and calculated µ as ~6 days and σ as ~4.5 days.  Under a normal distribution, 30 days would sit more than five standard deviations above the mean, so the probability of a failure would be negligibly small.  The consultant therefore concluded that these were clearly outliers that wouldn’t happen again.  If you want to calculate it for yourself you can learn how here.
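For the curious, here is a minimal sketch of that calculation in Python, standard library only. The two inputs are the approximate µ and σ quoted above. Treating "past 30 days" as a point on a continuous scale is a simplification, and the helper name `normal_tail` is just for this example.

```python
import math
from statistics import NormalDist

MEAN_DAYS = 6.0    # approximate mean quoted from the report
SD_DAYS = 4.5      # approximate standard deviation quoted from the report
LIMIT_DAYS = 30.0  # anything past this counts as a failure

def normal_tail(limit, mean, sd):
    """P(X > limit) for a normal distribution, computed with the complementary error function."""
    z = (limit - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

z_score = (LIMIT_DAYS - MEAN_DAYS) / SD_DAYS
p_failure = normal_tail(LIMIT_DAYS, MEAN_DAYS, SD_DAYS)

# Sanity check: the same normal model also assigns probability to negative durations.
p_negative = NormalDist(MEAN_DAYS, SD_DAYS).cdf(0.0)

print(f"z-score of the 30-day limit: {z_score:.2f}")
print(f"P(more than 30 days):        {p_failure:.2e}")
print(f"P(less than 0 days):         {p_negative:.1%}")
```

The last line is worth a second look. A model that says a meaningful share of jobs finish in negative time is a hint that a normal distribution does not fit a duration with this mean and spread.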

 

No problem, nothing to see here.

 

 

How did I fix it?

 

Luckily for the Navy there was a senior officer in that meeting who had studied industrial engineering years prior and the details didn’t sit right with him.  He knew my background and immediately brought the report to me.

 

It took me all of ten minutes to identify the problem.  When I graphed it, the data looked like this. I’d need some time to verify this followed a gamma distribution, but it was very easy to see right off the bat that this was not a normal distribution.

 

 

 

If you plug the mean and standard deviation into normal-distribution formulas to predict observations from a gamma distribution, your results are not just wrong, they are meaningless.  You aren’t saying 2 + 2 = 5, you are saying 2 + 2 = orange.  The gamma distribution has a couple of common parameterizations (shape and scale, or shape and rate), and neither uses the mean and standard deviation directly.
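To get a feel for how much the choice of distribution matters in the tail, here is a hedged sketch that compares the two models at the 30-day limit. It assumes a gamma distribution whose shape and scale are matched to the report's approximate mean and standard deviation (a method-of-moments shortcut chosen for illustration, not the fit actually done at the time). The gamma tail is computed by simple numerical integration so that only the standard library is needed.

```python
import math

MEAN_DAYS = 6.0    # approximate mean from the report
SD_DAYS = 4.5      # approximate standard deviation from the report
LIMIT_DAYS = 30.0

# Method-of-moments match (an assumption for illustration):
# mean = shape * scale and variance = shape * scale**2
scale = SD_DAYS ** 2 / MEAN_DAYS
shape = MEAN_DAYS / scale

def gamma_pdf(x, shape, scale):
    """Probability density of a gamma distribution with the given shape and scale."""
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def gamma_tail(limit, shape, scale, upper=400.0, steps=40_000):
    """P(X > limit) via Simpson's rule on [limit, upper]; the mass past upper is negligible here."""
    h = (upper - limit) / steps
    total = gamma_pdf(limit, shape, scale) + gamma_pdf(upper, shape, scale)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * gamma_pdf(limit + i * h, shape, scale)
    return total * h / 3

def normal_tail(limit, mean, sd):
    """P(X > limit) for a normal distribution."""
    return 0.5 * math.erfc((limit - mean) / (sd * math.sqrt(2)))

p_gamma = gamma_tail(LIMIT_DAYS, shape, scale)
p_normal = normal_tail(LIMIT_DAYS, MEAN_DAYS, SD_DAYS)

print(f"gamma shape={shape:.2f}, scale={scale:.3f}")
print(f"P(more than 30 days), gamma model:  {p_gamma:.2e}")
print(f"P(more than 30 days), normal model: {p_normal:.2e}")
print(f"ratio gamma / normal: {p_gamma / p_normal:,.0f}")
```

Under these assumptions the skewed model puts the chance of a failure thousands of times higher than the normal model does. Same mean, same standard deviation, very different answer about the tail.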

 

Without going into the details, the failures were due to something new being asked of the process.  It was supplying a different output than it had in the past which led to these failures.  That was fairly obvious by the way; this consultant “proved” that what everyone saw happening was not really happening.

 

When someone tells you something can’t happen, and then it does, this is probably an example of someone not understanding their statistical model or not understanding the system.  That is not the same as something unlikely happening, e.g., Trump’s election in 2016.  The difference is akin to flipping a coin and getting heads twice in a row vs. flipping a coin and getting heads two hundred times in a row.  One of those is unlikely, the other is essentially impossible (with a fair coin).
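The coin comparison is quick to put numbers on. This is plain arithmetic for a fair coin with independent flips:

```python
# Probability of n heads in a row with a fair coin: 0.5 multiplied by itself n times.
p_two_heads = 0.5 ** 2
p_two_hundred_heads = 0.5 ** 200

print(f"2 heads in a row:   {p_two_heads}")
print(f"200 heads in a row: {p_two_hundred_heads:.1e}")
```

One in four is unremarkable. The second number has about sixty zeros after the decimal point, so if you ever see it happen, doubt the coin before you doubt your eyes.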

 

How can we tell if it is a normal distribution?

 

It is always a good idea to graph your data.  In this example it would have made all the difference.  But a histogram on its own is not enough to verify normality.  There are other graphing methods in any statistical software (see end note 3) that should do the trick.  The QQ plot is my personal favorite.
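A QQ plot pairs each sorted observation with the value a normal distribution would put at the same rank. If the data are normal, the points fall close to a straight line. The sketch below builds those pairs with the standard library and summarizes straightness with a correlation. The simulated data sets, the plotting position `(i + 0.5) / n`, and the helper names are assumptions for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev

def qq_points(data):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common choice of plotting position.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def qq_correlation(data):
    """Pearson correlation of the QQ points; close to 1 when the data look normal."""
    points = qq_points(data)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
normal_data = [rng.gauss(6.0, 4.5) for _ in range(2000)]             # simulated, truly normal
skewed_data = [rng.gammavariate(1.78, 3.375) for _ in range(2000)]   # simulated, right-skewed

r_normal = qq_correlation(normal_data)
r_skewed = qq_correlation(skewed_data)

print(f"QQ correlation, normal data: {r_normal:.4f}")
print(f"QQ correlation, skewed data: {r_skewed:.4f}")
```

To see the actual plot, pass the output of `qq_points` to whatever charting tool you like. The skewed data bends away from the line at both ends, and that bend is what drags its correlation down.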

 

There are also several tests for normality, such as Pearson’s chi-squared test and the Kolmogorov-Smirnov (KS) test.
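As a rough sketch of the idea behind the KS test, the code below computes the KS statistic: the largest gap between the data's empirical cumulative distribution and a normal distribution fitted to the same data. The simulated data sets and the helper name `ks_statistic` are assumptions for illustration. One caveat: because the mean and standard deviation are estimated from the data, the textbook KS critical values are too lenient, so treat this as a comparison of two numbers rather than a formal test.

```python
import random
from statistics import NormalDist, fmean, stdev

def ks_statistic(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    gap = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, cdf - i / n, (i + 1) / n - cdf)
    return gap

rng = random.Random(2)
normal_data = [rng.gauss(6.0, 4.5) for _ in range(2000)]             # simulated, truly normal
skewed_data = [rng.gammavariate(1.78, 3.375) for _ in range(2000)]   # simulated, right-skewed

d_normal = ks_statistic(normal_data)
d_skewed = ks_statistic(skewed_data)

print(f"KS statistic, normal data: {d_normal:.3f}")
print(f"KS statistic, skewed data: {d_skewed:.3f}")
```

A small statistic means the fitted normal curve tracks the data closely. For the skewed data the gap is several times larger, mostly because the fitted normal expects values below zero that never show up.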

 

Conclusion

 

You should always check your distribution before assuming normality.  If you are required to make an assumption, note it explicitly and watch for any failure modes that appear if that assumption does not hold.

 

The normal distribution is a powerful and often useful tool.  However, you can get yourself into trouble if you assume normality when the data in fact follow a different distribution.  Don’t be like the consultant who looked like a fool and could have caused serious problems if his error had gone undetected.

 

 End Notes:

  1. There are several distributions that follow a bell shape, e.g., Student’s t, Poisson (for large means), Cauchy, etc.
  2. The wingspan length probably follows a normal distribution, but I have no idea if it actually does. I am not a biologist or naturalist; I just think eagles are cool. 
  3. Excel can actually do some of these including the QQ plot. For me, I try to avoid doing any real statistical work in Excel because it is slow and confusing for that type of work IMO. 

NB: Sorry for not giving any details about what the process was or how it failed, but I am not sure that would be wise.  I’ve been out of the Navy for 6 years now so I have no idea if any of this is, or is not, important anymore so I will default to saying nothing identifiable.  The data in this article is completely fabricated. 

NB 2: There are details of the history of the normal distribution that I did not cover which may cause some of the misunderstanding around it.  For many years it was considered a natural law that phenomena followed it, which is simply not true.