KNIME tutorial: Random forest machine learning model to predict Kaggle Titanic (part 2)

Random Forest Models

The random forest model is easy to execute in KNIME.  It is popular because it is simple to implement, adaptable, and robust to overfitting, which makes it a common starting point for people new to machine learning.

 

How do they work?

 

A random forest works by building a large number of decision trees.  Each tree is trained on a random sample drawn from the training data, and the final prediction is made by letting all of the trees vote.  Aggregating many trees in this way produces a more accurate and stable result than any single tree would.

 

The model draws those samples with replacement (bootstrap sampling), so each tree effectively sees its own resampled version of the data set, over and over again.  Technically, this means we do not need to partition our data into training and test sets.  However, it is still good practice to do so.
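The bootstrap-and-vote idea can be sketched in a few lines of Python.  This is a toy illustration, not how KNIME implements it: `bootstrap_sample` and `majority_vote` are hypothetical helper names, and the list of "tree" votes is made up for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement -- the bootstrap step."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Each tree casts one vote; the most common class wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
passengers = ["row1", "row2", "row3", "row4", "row5"]

# Each tree would be trained on its own resampled data set.
sample_for_tree = bootstrap_sample(passengers, rng)

# Pretend five trained trees each predicted an outcome:
votes = ["survived", "died", "survived", "survived", "died"]
print(majority_vote(votes))  # prints "survived"
```

Because each sample is drawn with replacement, some rows appear more than once and others not at all, which is why every tree ends up seeing a slightly different data set.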

 

Why Random Forest?

 

There are a lot of strengths to random forests.

 

First, random forests are among the most adaptable and easiest-to-use algorithms.  The steps to set one up really are as simple as I’ll show.  Of course, the data prep isn’t easy, but that is another story.

 

Second, it is a highly accurate method.  You’d think being easy and adaptable would force a sacrifice in capability, but random forests are often competitive with far more complex models.

 

Third, because each tree is trained on a different random sample and the results are aggregated, random forests are robust against overfitting.  Overfitting is one of the things I fear most in modeling.  Anyone who has worked with me has heard me say that I would rather be generally right than precisely wrong.
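The intuition behind that robustness is variance reduction: averaging many noisy estimators gives a much steadier answer than trusting any one of them.  Here is a minimal sketch of that statistical idea (not a real forest; each "tree" is just a noisy guess at a known true value):

```python
import random
import statistics

rng = random.Random(0)
true_value = 1.0

single_tree_errors = []
forest_errors = []
for trial in range(200):
    # 100 "trees", each a noisy estimate of the true value.
    trees = [true_value + rng.gauss(0, 1) for _ in range(100)]
    # Error of trusting one tree vs. averaging all of them.
    single_tree_errors.append(abs(trees[0] - true_value))
    forest_errors.append(abs(statistics.mean(trees) - true_value))

print(statistics.mean(forest_errors) < statistics.mean(single_tree_errors))  # prints True
```

The ensemble’s average error is far smaller than any individual tree’s, which is the same effect that keeps a forest from memorizing the quirks of one training sample.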

 

And fourth, it is somewhat interpretable, because you can extract relative feature importance.  While you don’t know exactly how the model arrived at a given decision, you do know, generally speaking, which variables contributed more than others to the outcome.
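One simple way to picture relative feature importance is to count how often each feature is chosen for splits across the trees.  This is a toy sketch with a made-up split record; real libraries compute importance from the trained trees themselves (for example, weighting splits by how much they reduce impurity):

```python
from collections import Counter

# Hypothetical record of the features each of four trees split on.
# In a real forest this would come from the trained trees.
splits_used = [
    ["sex", "fare", "age"],
    ["sex", "pclass", "age"],
    ["sex", "fare", "pclass"],
    ["pclass", "sex", "age"],
]

counts = Counter(f for tree in splits_used for f in tree)
total = sum(counts.values())
importance = {feature: n / total for feature, n in counts.items()}

# Rank features by their share of splits.
for feature, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {score:.2f}")
```

Even this crude tally tells you which variables the forest leans on most, which matches how you would read an importance chart for the Titanic data without knowing the exact path any single prediction took.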

 

Weaknesses

 

The only big weakness is that it is slow to run.  With large data sets, or huge numbers of trees, it can be impractical, because every tree must be built and must vote on every prediction.  Even when the outcome is obvious from the start, the model still has to complete the entire process.

 

A smaller weakness, as I said earlier, is that it is only somewhat interpretable.  Interpretability is a hot topic in data science these days.  The question “is a model good if we don’t know why it says what it says?” is hotly debated.  I stand firmly on the side of needing somewhat interpretable models.  I’ve seen enough ludicrous results from black-box models.

 

And so now that you know about the algorithm, watch the video to see how to create the model in KNIME.

 

Other resources

 

There is a lot of information out there on decision trees and random forests.  Here are some resources that I think are good and worth your time if you want more:

Article covering random forests and decision trees with examples in Python

Decision tree theory video from KNIME data scientist Rosaria Silipo

Decision tree practice video from KNIME data scientist Rosaria Silipo

Random forest usage and theory from KNIME data scientist Rosaria Silipo