We are living in a golden age of analysis programs. There are more of them than I can possibly describe in this article, but I will try to give you an overview of the most popular ones I have seen.
Minitab – Minitab is a statistical software package that is primarily used for six sigma and other process improvement efforts. It is a bit old school, but there is solid information available on how to use it to solve specific problems.
JMP – JMP is similar to Minitab but is made by the same people who make SAS, so it integrates well with SAS. It has a graphical user interface (GUI), so no coding is required, and its graphics can be presentation ready.
Note: A graphical user interface is a method of interacting with a computer via icons and menus (e.g., clicking with a mouse) rather than typing commands. Windows is a GUI; DOS is a command line interface, where you have to type commands from memory.
SQL – SQL is not an analysis program per se; it is the method analysts often use to get their data. SQL allows users to query relational databases and combine data from multiple locations. This is useful for things like combining customer data with transactional data if you want to analyze purchase patterns. Many analysis programs have a built-in SQL querying tool where you can use SQL code to pull or organize data.
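As a small illustration of that customer-plus-transactions idea, here is a sketch using Python's built-in sqlite3 module. The table and column names are made up for the example; the point is the JOIN that combines the two data sources and the GROUP BY that summarizes purchase patterns.

```python
import sqlite3

# In-memory database with two hypothetical tables for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)")
cur.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Ann", "retail"), (2, "Bob", "wholesale")])
cur.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, 25.0), (1, 40.0), (2, 300.0)])

# Combine customer data with transactional data to study purchase patterns
rows = cur.execute("""
    SELECT c.segment, COUNT(*) AS n_purchases, SUM(t.amount) AS total_spend
    FROM customers c
    JOIN transactions t ON t.customer_id = c.id
    GROUP BY c.segment
""").fetchall()
for row in rows:
    print(row)
```

The same query would run largely unchanged against a real database server; only the connection line changes.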
Tableau – Tableau is primarily used for data visualization and building interactive reporting dashboards, but it is capable of much more than that. Tableau has functionality for using R and is capable of performing complex analysis, but it is most commonly used to visualize the outputs of analysis done in R. Many large companies have Tableau licenses, which allow dashboards to be viewed in a web browser by anyone on a company machine.
Alteryx – Alteryx is a user-friendly analysis tool with visual workflows that make multi-step analysis easy to follow. Alteryx is essentially a GUI built on top of R, so it retains most of the power of R but is far more user friendly. Alteryx is a great tool for consultants because it is extremely flexible in terms of input/output data. It is capable of taking input in almost any format and then providing output in just as many formats.
Teradata – Teradata markets itself as a full service analytics suite, but I have never seen it used that way. The most common uses for Teradata that I have seen involve exploring available data in SQL servers. Teradata has some highly useful tools that can save tons of time when you are exploring a database and unsure of the quality, quantity, and usefulness of the available data.
SAS – SAS is a full service analytics suite capable of performing many types of analytics as well as what we call “production” functions. SAS has add-on software such as Enterprise Guide and Enterprise Miner, which add a GUI or machine-learning functionality. SAS is dominant in banking and also heavily used in many larger companies.
SPSS – SPSS is similar to SAS but with some different strengths. SPSS is most common in marketing and customer research areas because it makes it easy to perform conjoint analysis. SPSS is also common in academia because student licenses are easier and cheaper to get than SAS licenses.
So what are R and Python then?
R – R is a statistical programming language that is incredibly powerful and also free. R is capable of just about every type of analysis from simple regressions to “black box” machine learning algorithms. It is hugely popular in both business and academia in the US. There is tons of material on how to learn R available on the web, but the quality and usability of it varies significantly.
R is not beginner friendly because it requires you to learn to actually code to use it. The outputs can be somewhat confusing for someone whose knowledge of statistics is not strong. If you would like to be a data scientist, knowledge of R (or Python) is almost a requirement, but I highly recommend you learn the basics in other programs like Excel first.
Python – Python has much of the same statistical usage as R, but it is also a general purpose language, e.g., you can write web programs in it, whereas R is a purely statistical language. Python is what programmers call an object-oriented language, so if you have prior experience in an object-oriented language such as C++ or Java, the learning curve will be easier.
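To give a flavor of what “statistical usage” looks like in Python, here is a minimal sketch that fits an ordinary least squares line y = a + b·x from first principles, using only the standard library. The data and function name are invented for the example; in practice analysts would typically reach for a statistics library rather than writing this by hand.

```python
# Ordinary least squares fit of y = a + b*x, computed from scratch
def ols_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
a, b = ols_fit(xs, ys)
print(a, b)  # intercept near 0, slope near 2
```

The same few lines of idea exist in R as a one-liner (`lm(y ~ x)`), which is a good example of how the two languages overlap on statistics while Python stretches further into general programming.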
Which to choose, R or Python?
Neither R nor Python is objectively better than the other; they are just different. To say one is better than the other is equivalent to saying a Corvette is better than an F-150. They each have uses that they are better for. The people at DataCamp made a great infographic to highlight the differences here. The short story is that if your data is high quality and/or you need the most intense analysis possible, R is probably the better choice. If your data is coming from multiple sources and is messy, such as from a web scrape, Python is typically superior.
But what about Hadoop, isn’t that the newest and coolest thing out there?
Hadoop is often what people really mean when they are talking about “big data.” To be honest, I think 90% of the people who use that term don’t know what it means. One of the reasons for that is that Hadoop is not the easiest thing to explain; here is my best effort and I am so sorry if I fall short.
Hadoop is a system of distributed computing that can quickly and cheaply process huge amounts of data stored in multiple places. It uses the Hadoop Distributed File System (HDFS) to store data in multiple places without prior structuring and then utilizes MapReduce to break processes down into smaller subsets and distribute them.
Have you ever seen computer processors that are dual or quad core? Hadoop is the same general concept, but with multiple computers instead of multiple processor cores. Instead of a computer program telling individual processors what they should be working on, Hadoop lets you write the program that splits up the tasks.
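The split-and-distribute idea can be sketched in a few lines of Python. This is a single-process toy word count, not real Hadoop code: the "map" step turns each chunk of input into (key, value) pairs, the framework groups pairs by key (the "shuffle"), and the "reduce" step combines each group. On a real cluster the chunks and keys are spread across many machines.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit (word, 1) for every word in this chunk of text
    return [(word, 1) for word in chunk.split()]

def reduce_phase(word, counts):
    # Combine all values emitted for one key
    return word, sum(counts)

# Each string stands in for a block of data on a different machine
chunks = ["big data is big", "data about data"]

# Shuffle: group mapped pairs by key, as Hadoop does between phases
grouped = defaultdict(list)
for chunk in chunks:
    for word, count in map_phase(chunk):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each chunk is mapped independently and each key is reduced independently, both phases can run in parallel on separate machines, which is where the speed comes from.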
Hadoop has many advantages, but these are probably the three biggest:
- It can handle massive amounts of data. An Excel spreadsheet above 5 MB is almost unworkable, and most analysis programs will have trouble with data sets above 1 GB. Hadoop, on the other hand, can crunch through terabytes without breaking a sweat.
- Hadoop is able to handle unstructured data such as images, video, or text. Facial recognition features on the web are run on some form of Hadoop.
- Hadoop also allows cheap and flexible data storage. Typical data warehousing requires structuring the data and preprocessing the data so it is indexed and usable. Hadoop does not require that and thus can keep data in multiple locations in its exact original format.
Hadoop also has some drawbacks and limitations including these:
- Hadoop is not efficient for all types of analysis, especially interactive or iterative ones.
- It can be difficult to find people to work with Hadoop because it requires several additional skills besides programming.
- Data governance is not at the level that will make established companies comfortable relying on it. For normal data warehouses, there is metadata and management or governance documentation that can help a user sort out what is on the system. In a normal data warehouse environment, if the data administrator got hit by a bus, an experienced admin should be able to sort the system out and keep it running. In a Hadoop environment, that is a much dicier situation.
The list above is just a sampling of the most popular options for businesses to analyze their data. Looking at tons of different options can be intimidating. I understand why many executives do not actively participate in the analytics processes at their firms, but I do not think that trend will continue.
For right now, you don’t have to worry; you have me to help walk you through what you need to know in order to get started! If you are just getting started in analysis keep checking back to the Start Here page for further content.
If you are interested in all of the options out there, Gartner produces an annual report called the Gartner Magic Quadrant. They are the foremost authority in weighing the capabilities and usability of programs and a great place to look for more information.
Plus as a BCGer I love the fact that they use a 2×2 matrix. I feel almost obligated to share it: