KNIME tutorial: Dummy variables with one hot encoding


Dummy variables are needed when you want to run a regression on categorical data.  The easy way to create them is a technique called one-hot encoding, which KNIME performs with the One to Many node.

What is a dummy variable?

 

Dummy variables are how we convert categorical variables into columns of 0s and 1s.  When we perform a regression, we multiply the columns of our data by coefficients that the model calculates.  That works well when the data is numerical, particularly when it is continuous.

 

But what do we do when the data is “New York” or “Red”?  You will have categories, like NY or Red, that you will want to represent.  If you are trying to predict rent, your prediction will be terrible if you do not know if the apartment is in NY or Toledo.  If you want to predict a Ferrari’s value you probably want to know if the car is red or if the eccentric previous owner had it painted to look like the South African flag.

 

How do we use those dummy variables?

 

Linear regression makes predictions by multiplying the data by coefficients that the model calculates and adding up the results.  If the dummy variable for NY is 1, the coefficient for NY is multiplied by 1 and counts in full toward the prediction.  All other location dummy variables are 0.  Anything multiplied by 0 is 0, so it does not matter how large the other coefficients are or how many zeros there are: only the coefficient for the single variable set to 1 contributes.
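
To make the arithmetic concrete, here is a minimal Python sketch.  The coefficients, the base rent, and the Chicago category are made-up numbers and names for illustration only; a real regression would estimate the values from data.

```python
# Hypothetical coefficients, invented for this example.
# A real model would calculate them from the data.
coefficients = {"NY": 2000.0, "Toledo": 300.0, "Chicago": 1200.0}
base_rent = 500.0  # hypothetical intercept


def predict_rent(dummies):
    """Add coefficient * dummy value for every location column."""
    return base_rent + sum(
        coefficients[city] * dummies[city] for city in coefficients
    )


# An apartment in NY: exactly one dummy variable is 1, the rest are 0.
ny_apartment = {"NY": 1, "Toledo": 0, "Chicago": 0}
prediction = predict_rent(ny_apartment)
print(prediction)  # 2500.0 = 500 + 2000*1 + 300*0 + 1200*0
```

The Toledo and Chicago coefficients are multiplied by 0, so they drop out of the prediction no matter how large they are.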

 

How does one hot encoding work?

 

We could create dummy variables manually if we wanted to.  But why on earth would we want to?  You could do it with heaps of “if – then – else” statements but that would be a lot of work.

 

One-hot encoding is simply a function that takes a column of categorical data, creates one new column for each distinct category in it, and sets each new column to 1 or 0 for every row depending on whether the row belongs to that category.  It is that simple, easy, and powerful.
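
For readers working outside KNIME, the idea can be sketched in a few lines of plain Python.  This is an illustration of the technique, not what the One to Many node does internally; the function name `one_hot_encode` and the sample data are assumptions made for this example.

```python
def one_hot_encode(values):
    """Return one 0/1 column per distinct category found in values."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical input column with three distinct categories.
cities = ["NY", "Toledo", "NY", "Chicago"]
encoded = one_hot_encode(cities)

for name, column in encoded.items():
    print(name, column)
# Chicago [0, 0, 0, 1]
# NY [1, 0, 1, 0]
# Toledo [0, 1, 0, 0]
```

Each row has a 1 in exactly one of the new columns, which is where the name "one hot" comes from.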

 

Please watch the above video for how to do this in KNIME.  You should be able to do the same thing in any analytics platform or analytical language.

 

Note: I don’t like the term dummy variables.  To me it makes them feel like they are placeholders, not something that actually affects the results.  But they do, and you cannot ignore them.