The ability to use SQL to query data is table stakes for analysts. You need to learn SQL if you want to work as an analyst, plain and simple. Thankfully it is easy to learn.
In order to perform analysis you need data. Not just any data, you need the right data (and only the right data), you need it together in the same file, and you need it in a format you can work with. That is where SQL comes in. SQL is by far the most popular method of pulling and joining data.
What is SQL?
SQL stands for Structured Query Language. You can pronounce it either ess-cue-el or sequel, I have heard both used in industry. SQL is the primary way that a user communicates with a database.
But I have a degree, why do I need to learn SQL?
In school when you are performing analysis you are typically given the data set. The real world does not work that way, you will need to pull your own data most of the time. In order to pull data for an analysis, you will write commands in SQL that tell the database what it is you are looking for and how to arrange it. If you currently work with a system that allows you to pull data, such as Microsoft Access, the program is using SQL to communicate with the database for you.
The beauty of SQL is that it is used everywhere. There are countless programs that use SQL, and they are almost identical across industry and function. If you learn how to use SQL for your role in banking, you can use that knowledge in a supply chain role, at an industrial goods manufacturer, in a hospital, etc. It is ubiquitous; you risk nothing by learning it.
SQL is also embedded into almost all full service analytics suites. SAS queries databases using Proc SQL, R uses SQLDF to write queries, Azure and Aletryx have a SQL editor, etc. Regardless of how you will process the data, you will almost certainly pull it with SQL.
SQL is necessary because data storage depends on relational databases. Here is how techtarget.com describes a relational database:
Relational database – a collection of data items organized as a set of formally-described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. The relational database was invented by E. F. Codd at IBM in 1970.
In plain English that means it is a way of collecting data in multiple tables that allows you to select specific data and combine it with data from other tables. If that still does not make sense, how about an example?
Suppose I own a national business and want to evaluate a direct mail marketing campaign in Florida. I have information about my customers stored in a customer table. I have data about my sales in a transactions table, and data about the mail sent in a marketing campaigns table. If I look at individual tables I can look at only one at a time, there is no way to tell if the marketing is working. If I wanted to analyze the effects of the direct mail on sales in Florida, I need to combine all three. That is where SQL comes in.
This infographic shows the data available in each table and a way we could combine them to get the data that we need (in brown):
In the above example we only need the data in brown, one item per table. We could use more of course, but let’s keep this simple for the time being. We make sure that we are getting the correct by joining specific variables, sometimes called unique or primary keys.
First we want to make sure we are analyzing every piece of mail we sent. Each piece of mail will have a unique value; here I have it based on sales leads so I call it lead_ID. Not all sales leads will be customers, but we need to find out which ones purchased our product. To do this we join the marketing_campaign table to the customers table based on address, city, and state. We need to use all three to make sure we are getting a one-to-one match.
Matching is an important concept. We want this to be a one-to-one match meaning that there is one lead_ID per customer_ID. If we do not use all three rows, (city, address, and state) we could have a one-to-many match. When you are expecting a one-to-one match and you do not do your join correctly and get a one-to-many match, your results will be crazy. Every analyst has had times when they expected a query to return 5000 rows (results) but it returned 30 million instead. Oops, better look at those joins again.
Second we will join the data we already have with the transaction table by the customer_ID we got in the customer table. This may be a one-to-many join. If a customer has multiple transactions it would be a one-to-many.
Note: If one-to-one and one-to-many matching is confusing you, don’t worry. It becomes obvious with practice. Writing any sort of code requires practice, and the more you practice the easier it gets.
We now have data that is usable for analysis. We can improve its usability by doing things such as adding all the transactions for a particular customer together, or grouping by people who did not buy vs. customers who did. All of this, and more, is possible with SQL.
Why do we need to do this, why can’t the business just keep all the data in one table? In this simple example they probably could, but as the amount of data grows that becomes impossible. The larger a data set is, the longer it takes to work with, and the more problems it may cause. There could also be legal reasons for keeping the tables separate. Suppose the customer table had social security numbers and you did not want to compromise your customer’s personal information? You could have it in a separate table and only certain people able to query the social security number field.
It sounds powerful, is it hard?
SQL is easy to learn. When you know it but others do not it makes you look like a rock star. Why not spend the effort to learn something so easy and so powerful? You can learn it in a day or two, and it is totally free.
When I left the Navy and started applying for jobs I did not know SQL. A recruiter told me I was a great fit for a job but I needed to know SQL. I told her, “Just put down that I know it and I will learn it this weekend.” I went home and googled “free online SQL training.”
I learned how to use SQL from the following online course http://www.sqlcourse.com/intro.html . It is totally free, walks you through the basic functionality, and has exercises where it requires you to write the code and tells you if it is correct. The entire curriculum, regular and advanced, took me about 3 hours and taught me ~75% of what an analyst needs to know. The rest is stuff you can learn on the job, typically by modifying experienced analysts’ code.
Note: Copying code is totally legitimate; coding is not like schoolwork. Plagiarism is encouraged when you are learning, everyone starts by copying. The first time someone asks to see your code so they can learn how you did something is a great feeling. Imitation really is the sincerest form of flattery.
There is a much more to SQL in terms of being a database architect. I suppose it would be useful for an analyst to know that, but I would put it far down the list. If you are trying to get a job in analytics or build analytical skills in your current role stick to the standard queries. You should build your skills pulling and analyzing data before you start working on structuring data bases (unless of course you want to work as a data base administrator).
Edit: A senior analytics professional sent me this link for how he learned SQL http://www.auwalgene.com/@mystudents/lecturenotes/teach_urself_sql.pdf
 It would actually be a left join, but I am keeping things super simple for now