I am often asked about a specific data science degree program or masters in data science in general. One recent example made me think it was time to weigh in on the subject and help others make good decisions about their future.
Data science is hot, and has been hot for a couple of years. When there is a hot new thing that everyone wants to learn there will be some places that are essentially stealing your tuition. I wrote this to help you avoid that fate.
Most data scientists do not have data science degrees
The first thing to note is that most working data scientists do not have degrees in data science. A few academic fields supply the overwhelming majority of them: computer science, mathematics, engineering, and the hard sciences (particularly physics). There are also data scientists who come from other quantitative backgrounds like economics.
And there are people who come from non-quantitative backgrounds and learn the techniques on their own.
The point is this: do not pursue a data science degree because you think it is the only way to work in data science. There are many other paths, and it will stay that way for the foreseeable future.
So why talk about masters degrees in data science?
A friend was looking to hire someone for a data science role. Recruiting found him a candidate who was three classes from completing her masters.
The candidate interviewed well, and had some industry relevant work experience before her masters. The interview hit a snag when my friend asked her what languages she had used in her coursework. “Tableau and SQL,” she responded, “but we are going to use R next semester.”
This is a massive red flag about that program. How exactly are students learning and performing analysis if they are not using an analytical platform or language? Programming is one of the two most fundamental building blocks of data science and analytics.
This example caused me to re-think staying out of the debate. It feels like it is becoming a similar situation to MBA programs; some MBAs bring outsized value while others bring next to none.
Here is my take on what a data science program must include. I’ll also add some red flags to look out for.
What classes must a data science masters include?
Reputable programs should teach programming and statistics first. All other classes should use, and build off of, the fundamentals taught in those two classes.
The overwhelming majority of academic data science utilizes Python or R. Python and R are gaining popularity in business. Python is the most popular language in data science at the moment. SAS is still present in most large companies and could be worth learning.
SQL is a must for data science, but I don’t think it needs to be a university course. I learned the basics via a free online tutorial over a weekend. The rest is mostly on-the-job training for the specific schemas and connection methods used at your employer.
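To give a sense of how small "the basics" are, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the values are hypothetical and exist only for illustration. A real employer's schema and connection method will differ, which is the on-the-job part.

```python
import sqlite3

# In-memory database so the example needs no files or servers.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, made up for this illustration.
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "east", 120.0),
        ("bob", "west", 80.0),
        ("cat", "east", 200.0),
        ("dan", "west", 40.0),
    ],
)

# Filter, group, aggregate, and sort: most day-to-day queries are
# variations on these four steps.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    WHERE amount >= 50
    GROUP BY region
    ORDER BY revenue DESC
"""
rows = conn.execute(query).fetchall()
for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)

conn.close()
```

Joins and window functions come next, but SELECT, WHERE, GROUP BY, and ORDER BY cover a large share of everyday analysis queries.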
Statistics are vital to data science. I cannot stress this enough. Do not believe people who say it is possible to perform data science without statistics. There are no shortcuts. A professor at Harvard wrote a great answer on how data scientists use statistics here. I cannot top that, so I will not try.
Statistics classes may or may not require programming. The first semester of my masters in Industrial and Systems Engineering included a difficult probability and statistics course. Until some modeling at the very end, it was doable with calculators or Excel. While the course was called “Applied Probability Methods in Engineering,” it always felt more theoretical.
The theoretical nature of that course was actually a great thing. Professor Momcilovic taught us to think about what the data really meant, and how it would interact with other data. It brought to life concepts like independence, Bayesian thinking, and Markov chains.
Design of experiments
Data science is referred to as a science because when done properly the practitioner sets up experiments the same way a chemist would. Proper research methods and experiments are vital to measuring signal instead of noise.
Statistical hypothesis testing answers the question “are these things different?” That is what rejecting, or failing to reject, the null hypothesis means. However, it takes proper experiment design to answer the more relevant questions: “is this important?” and “is this question relevant?”
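As a rough illustration of the “are these things different?” question, here is a minimal sketch of a two-sided permutation test on the difference in means, written with only the Python standard library. The function name, the checkout-time data, and the number of permutations are assumptions chosen for the example. It is one of several valid tests, not the required one.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical checkout times in seconds for two page designs.
old_page = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
new_page = [11.2, 11.0, 11.5, 11.1, 10.9, 11.4, 11.3, 10.8]

p_value = permutation_test(old_page, new_page)
print(f"p-value: {p_value:.4f}")
```

A small p-value here only says the two groups differ. Whether a difference of about one second matters to the business is a separate question that the test cannot answer.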
Correctly accepting or rejecting the wrong null hypothesis is a type three error. It can be a much bigger problem than a type one or type two error. Listen to Google Cloud’s Chief Decision Scientist Cassie Kozyrkov’s interview on the Data Framed podcast for more on this.
Another crucial part of design of experiments is identifying the problem before you begin. Data scientists must work with the business to identify the correct problem and a practical solution.
Sampling is also something that is important to master for real world use.
NB: Design of experiments is harder than it sounds in the real world.
Data engineering fundamentals
While I do not believe that all data scientists need to be experts in data engineering, a baseline of knowledge is important. Also, there is a need for data scientists with strong data engineering skills.
Many organizations’ data architecture is a mix of local, remote, and cloud-based servers, and even spreadsheets. Poor data engineering often makes good solutions impractical.
It was recently made clear to me in a meetup with Ravi Nair that my data engineering knowledge and skills are not where they need to be.
Unless you are working in a very small company you will have partners in IT and outside vendors to help with the engineering aspect. It is helpful to understand the background to ease working with them.
If you need to make progress quickly, the data science team may need to build applications on their own. Getting them to work requires engineering knowledge.
Data visualization and presentation skills
After the project is finished, data scientists need to drive others to take action on their findings. This is where data visualization and presentation skills come in handy.
Note 1: I do not think many programs incorporate this…yet.
Note 2: Data visualization is also quite useful in the problem solving process.
Note 3: Very few data scientists are good at presenting their findings and driving others to take action. I wrote a course on how to do this, which I offer for free to anyone who subscribes to my mailing list.
Ethics
I am not sure how many programs have ethics of data science as a mandatory (or even optional) course yet. I believe that it is necessary and extremely important. Ethics will become more important over time as algorithms make more decisions. Algorithms often make decisions without any transparency. For more info, I strongly recommend reading Cathy O’Neil’s book, Weapons of Math Destruction.
Algorithm / machine learning fundamentals
This is where the rubber meets the road. Students need to be able to determine which model fits their data, problem, and potential solution best. There are real world limitations to consider. A student cannot only know how to use XGBoost because it works well on Kaggle.
What other classes should a data science masters include?
The other, more specialized, coursework should be up to the student in my opinion. There are dozens of other topics including:
· Computer vision
· Text analytics
· Natural language processing (voice)
· Data visualization
· Machine learning (classification, regression, time series algorithms)
· Deep learning
· Linear programming / non-linear programming / optimization / operations research
· User interface / experience design
…and so on. There is a common misconception that to be a data scientist you need to know all of them, but that is not true. Very few people know most of them and next to no one is proficient in all of them.
“If you pick one of these programming languages, R, SAS, Python, C++, or Java, you can spend a lifetime and never reach the end; not to mention Machine Learning, Deep Learning and AI. As a matter of fact, when I see candidates put all of these programming languages and all of the predictive buzzwords on their resumes, I know they are bullsh!tting.” – Data science hiring manager I asked to review this article
Red flags in a data science masters program
It goes without saying that not all data science programs are equal. There are some that you should probably stay away from. Here are some red flags to help you recognize them.
Lack of pre-requisites
A master’s degree is a graduate degree, so it should build upon subjects learned in an undergraduate (bachelor’s) program. A lack of pre-requisites for people with non-technical undergraduate degrees should be considered a red flag.
When I told the University of Florida that I wanted to do a masters in engineering despite having an undergraduate degree in history, they had some reservations. When they stopped laughing and looked at my transcript they saw it was possible. The United States Naval Academy made us take a heavy technical course load; I took Calculus I, II, III, Physics I & II, Chemistry I & II, Electrical Engineering I, Thermodynamics, Systems Engineering I & II and so on. To complete the Industrial and Systems Engineering masters program I needed three pre-requisites. I needed to take a programming class, differential equations, and linear algebra.
If there is a data science program that tells you that you need no knowledge or background in programming, math, statistics, and linear algebra…run. Masters programs need a plan to bring non-technical students up to speed.
UC Berkeley does a good job elaborating on the pre-requisites for their program. They also offer suggestions for how to meet them with some free resources on their admissions requirements page.
Lack of programming in coursework
It seems too obvious to mention, yet my friend’s hiring example shows that I must. The classes in a masters program should be taught in a programming language. If you are going to perform analyses and write models in a certain language, you should learn about those models in that language.
Imagine you were going to school to be a carpenter. There are classes on how to use hammers and saws. There are classes on how to build things like decks and bookshelves. If you never actually build them with a hammer and nails, how good of a carpenter would you be? The same thing applies for data science.
Modern libraries and packages for data science in R and Python make coding models significantly less difficult than in the past. Many models can be written with literally 2 or 3 lines of code. More than 90% of the work is still experiment design, data preparation, and results interpretation.
One last consideration
Where is the data science degree located inside the University? Is it part of the business school or is it inside the school of engineering (computer science is typically inside engineering)?
Neither type is better than the other, but each will be better for different things. If you want to focus your education on learning to apply data science in business, you may be better off going to a program hosted by the business school. If you want to work for a tech company, you may be better off going to a program hosted by the computer science or engineering departments.
The best way to determine what a school is good for is by asking for a list of where recent graduates were hired.
I recommend students pick a school that fits into one of two categories. They are best off with an academic powerhouse or a top tier state engineering school.
Academic powerhouses:
This category includes schools like MIT, Stanford, Northwestern, and Columbia. These schools prize their reputation and are unlikely to put together a substandard program. Their big reputations attract top tier students and company recruiters.
Top tier state engineering schools:
This category includes flagship state schools that are known for their engineering programs. Georgia Tech, UC Berkeley, Purdue, and Texas A&M are examples. Top tier state schools, especially in technical fields, draw very strong students and faculty. They also have large alumni networks which create other benefits for graduates.
Repeated for emphasis: Data science is hot, and has been hot for a couple of years. When there is a hot new thing that everyone wants to learn there will be some places that are essentially stealing your tuition. I wrote this to help you avoid that fate.
I have watched this happen over and over again to people who pay tens of thousands of dollars to get MBAs from for-profit schools. They only realize later on that it was a waste (or live in denial). I do not know if my friend is going to hire that candidate or not. But I do know she would have been better off if she had picked a program that was better at preparing her for the work she wanted to do.