KNIME tutorial: Feature engineering to improve Kaggle Titanic random forest performance (part 3)



Feature Engineering

Feature engineering is an important step in taking models from just OK to great.  If you have not heard the term before, a feature is essentially a piece of data that goes into the model.  You can think of it the same way you would think of an independent variable in statistics.


What is Feature Engineering?


Wikipedia defines it as “the process of using domain knowledge of the data to create features that make machine learning algorithms work.”  In plain terms, that means transforming or adding to the data so that the model performs better.


This could mean normalizing the data, or putting it on a logarithmic scale instead of an absolute one.  It could also mean creating new features that represent the way two other features interact with each other.
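To make those two ideas concrete, here is a minimal Python sketch.  It is not part of the KNIME workflow from this series; the passenger rows, the column names, and the helper `add_engineered_features` are all invented for illustration.  It puts a fare on a logarithmic scale and builds one interaction feature by multiplying age and passenger class.

```python
import math

# Hypothetical passenger rows; names and values are invented for illustration.
passengers = [
    {"fare": 7.25, "age": 22.0, "passenger_class": 3},
    {"fare": 71.28, "age": 38.0, "passenger_class": 1},
    {"fare": 0.0, "age": 40.0, "passenger_class": 2},
]


def add_engineered_features(row):
    """Return a copy of the row with two derived features added."""
    enriched = dict(row)
    # log1p(x) computes log(1 + x), so a fare of 0 maps to 0 instead of failing.
    enriched["log_fare"] = math.log1p(row["fare"])
    # Interaction feature: the product of two existing features.
    enriched["age_times_class"] = row["age"] * row["passenger_class"]
    return enriched


engineered = [add_engineered_features(p) for p in passengers]
for row in engineered:
    print(row)
```

Whether either feature actually helps depends on the data and the model, so treat them as candidates to test rather than guaranteed improvements.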


An Example of Real-World Feature Engineering I Worked On

I helped build a model for a retailer to predict when a product was not available on the shelves.  One of the features that turned out to be important was whether the inventory was a multiple of the pack size.  So let’s say product A came in packs of 6.  If the inventory was 15 yesterday but 12 today, the likelihood that it was not on the shelf increased.  Why?  What often happens is that there were three items on the shelf and two full packs somewhere else.  Those three items on the shelf sold, but no one unpacked the two full packs.


Obviously that is not the case every time, but adding that feature into the model made it perform better than simply looking at sales velocity and inventory alone.
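The pack size feature described above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions, not the retailer's actual implementation: the function name `is_pack_multiple` and the inventory numbers are made up, and I treat zero inventory as a separate case rather than as a multiple of the pack size.

```python
def is_pack_multiple(inventory, pack_size):
    """True when on-hand inventory is a positive whole number of full packs."""
    if pack_size <= 0:
        raise ValueError("pack_size must be positive")
    # Zero inventory is excluded: that is a plain stock-out, not unopened packs.
    return inventory > 0 and inventory % pack_size == 0


# Hypothetical daily inventory counts for a product sold in packs of 6.
daily_inventory = [15, 12, 12, 9, 6]
pack_size = 6

flags = [is_pack_multiple(units, pack_size) for units in daily_inventory]
print(flags)  # [False, True, True, False, True]
```

The resulting true/false column would then go into the model as one more feature alongside sales velocity and inventory.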


Why is Machine Learning so hard in practice?

Stanford professor and machine learning guru Andrew Ng is quoted as saying, “Coming up with features is difficult, time-consuming, and requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”


So, if you were wondering why the data scientists you work with cannot produce a model as quickly as we did in parts one and two…there you go.  You can probably imagine it took quite a bit of work to discover the pack size example.


New automated feature engineering tools

There is something new and exciting in feature engineering called automated feature engineering.  Tools such as DataRobot can automate much of the process.  This can save data scientists a ton of time and manual effort and allow them to try out more options much more quickly.  I do not think of those tools as a replacement for data scientists, but rather as a way for data scientists to work faster.  I see it as kind of like giving a chainsaw to a lumberjack who was previously using just a big axe.