Which data skills do you actually need? This 2×2 matrix will horribly mislead you…

Chris Littleton’s Harvard Business Review (HBR) article does not identify the skills you need to get into data science.  He lays out this 2×2 matrix of skills according to whether they are useful on one axis, and whether they are time consuming to acquire on the other.

This type of matrix is interesting and potentially a good way to prioritize learning objectives…if the data is correct. That is where this article fails spectacularly. Many core competencies fall into the “not useful” brackets, while the higher-level skills that depend on them are in the “useful” ones.

Here are some of his “not useful” skills that are absolutely mandatory for data science:

Data cleaning – Real world data is messy, plain and simple. I’ve heard that some digitally native companies like Facebook and Amazon typically have clean data, but everywhere else has significant issues. Dirty, messy, and disorganized data cannot be properly analyzed. Errors, potentially massive ones, will occur early and often, assuming a functional model can be written at all.

Some people, including me, lump much of data preparation into data cleaning.  In many cases data needs to be scaled, transformed, categorized, etc., in order to be useful.  I am willing to wager that even the tidiest digital data often needs this.  Data preparation obviously continues in model development and tuning as well.
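To make "scaled, transformed, categorized" concrete, here is a minimal sketch in plain Python. The records, field names, band thresholds, and helper functions are all made up for illustration, and dropping unparseable rows is only one of several reasonable choices (imputing is another).

```python
import math
from statistics import mean, pstdev

# Hypothetical raw records: stray whitespace, inconsistent casing, a missing value.
raw = [
    {"city": " Boston", "income": "52000"},
    {"city": "boston ", "income": "61000"},
    {"city": "CHICAGO", "income": ""},
    {"city": "Chicago", "income": "250000"},
]

def clean(records):
    """Normalize text fields and drop rows whose income cannot be parsed."""
    cleaned = []
    for row in records:
        city = row["city"].strip().title()
        try:
            income = float(row["income"])
        except ValueError:
            continue  # assumption: drop the row rather than impute a value
        cleaned.append({"city": city, "income": income})
    return cleaned

def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def income_band(income):
    """Categorize a continuous value into coarse, hypothetical bands."""
    if income < 60000:
        return "low"
    if income < 100000:
        return "mid"
    return "high"

rows = clean(raw)
log_income = [math.log(r["income"]) for r in rows]  # transform: tame the skew
scaled = standardize(log_income)                    # scale
bands = [income_band(r["income"]) for r in rows]    # categorize

for row, z, band in zip(rows, scaled, bands):
    print(row["city"], round(z, 2), band)
```

Even in this toy case, every step encodes a judgment call (what counts as the same city, what to do with a blank income), which is why this work cannot be skipped.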

Mathematics – Data science and machine learning are applied math with lots of data and computing power to work with. Perhaps the author assumes a base level of competency, beyond which further study is time consuming and not that valuable?

Here is the thing: the math for data science is not easy. I am good at math (A’s in calculus and differential equations), and linear algebra still gave me a really tough time. Linear programming and optimization don’t look hard at first, but the difficulty of real world problems escalates very quickly.

Statistics – Statistics is incredibly important for data science. It is the core of our methods for determining whether something is signal or noise. Is it an important change or random variation? We use qualitative methods to answer that question too, but we also need quantitative ones. T-tests, chi-square tests, Kolmogorov-Smirnov tests, etc., are all statistical tests for deciding whether an observed difference is significant.
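As a rough illustration of the signal-versus-noise question, the sketch below computes Welch's t statistic for two samples and estimates a p-value with a permutation test rather than a lookup against the t distribution, so that it runs on the standard library alone. The conversion-rate numbers and function names are invented for the example. In practice you would more likely reach for a library such as SciPy, which provides `scipy.stats.ttest_ind`, `chi2_contingency`, and `ks_2samp`.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided p-value: how often does shuffling the group labels
    produce a |t| at least as large as the one observed?"""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)

# Made-up daily conversion rates (%) before and after a hypothetical change.
before = [2.1, 2.4, 2.2, 2.5, 2.3, 2.2, 2.4, 2.3]
after = [2.6, 2.8, 2.5, 2.9, 2.7, 2.6, 2.8, 2.7]

t_stat = welch_t(after, before)
p_value = permutation_p_value(after, before)
print(f"t = {t_stat:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value says the shift is unlikely to be random variation alone. Deciding whether the shift matters to the business is still a judgment call, and making that call well requires understanding what the test does and does not tell you.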

Data science community criticism

There has been a lot of good that has come out of this article.  The data science community has quickly and uniformly condemned these ideas.  While there is not always consensus on what skills are important to learn, there is an instant consensus that this list is dangerously wrong.

You do not have to take my word for it. Check out some of the criticism from industry leaders who know much more about data science than I do.

I have seen a seemingly endless stream of posts on this and could go on, but I think you get the point.

Harvard Business Review needs to return to its normal standards

HBR is generally known for well-thought-out content and high editorial standards. There is a tremendous amount of interest in the skills needed to become a data scientist, and no real authoritative list, so I understand why they wanted to publish this. However, this article does not merely fail to provide good advice; it gives aspiring data scientists terrible guidance.

HBR has published a lot of good content from data science leaders. In fact, the term “data scientist” was popularized by an HBR article co-authored by DJ Patil (link). My friend Jordan Levine’s article on the need for analytics translators is excellent. Avi Goldfarb’s interview on “Prediction Machines” is thought provoking. My point is that they have tons of good ideas, and they should be able to move on from this quickly.

Update:

I found some comments from the author where he explained his points.  This matrix was intended to represent the needs of his current team.  His current team is very strong in mathematics and statistics so further improvement in those skills would provide minimal value and be time consuming to achieve.  It is essentially classic diminishing returns.

This update means his article isn’t wrong, but it is still highly misleading. The context about his team was not made clear in the text. I read it a couple of times to make sure.

The idea of using a framework to determine what you should and should not invest your time in is sound. There is nothing wrong with a useful vs. time consuming matrix. The issues are the exact placement of the subjects and the failure to note that the matrix assumes a high level of initial skill.

In short, this means that HBR failed in its editorial review.