KNIME tutorial: GroupBy node to group your data



The KNIME GroupBy node is how the platform handles grouping.  Grouping data is a powerful technique that I use often in the data exploration phase of machine learning and modeling.


Almost every data set has variables that you can think of as categories.  Grouping on those categories and then seeing how the numerical variables change from group to group gives you a good view of which factors may be driving your results.


KNIME GroupBy node


The GroupBy node is a powerful node that can apply dozens of aggregations to the grouped data.  But first, you need to group your data.


Selecting the columns to group on is easy: click to move them from the left list (excluded) to the right list (included).  The tricky part is that there is no way to re-order the grouping columns inside the node.  The column that appears earliest in your data becomes the first column it groups by.  To get a different order, you need to re-order the columns beforehand with a Column Resorter node so they are in the order you want to group by.
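If you think in code, the same idea can be sketched outside KNIME.  This is only a pandas analogy, not how the GroupBy node is implemented; the table and the column names (`region`, `product`, `sales`) are made up for illustration.  It re-orders the columns first and then groups on them in that order.

```python
import pandas as pd

# Hypothetical data: two category columns and one numeric column.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "sales": [10, 20, 30, 40],
})

# Re-order the columns first, similar in spirit to re-ordering
# the columns before they reach the grouping step.
reordered = df[["product", "region", "sales"]]

# Group on the category columns in their new order and sum the numeric column.
group_cols = ["product", "region"]
grouped = reordered.groupby(group_cols, as_index=False)["sales"].sum()

print(grouped)
```

The result has one row per product and region pair, with product as the first grouping level.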


Getting the sum, percentage, mean, median, etc.


The easiest way to use the mathematical operations is through the Manual Aggregation tab.  Once you have picked the columns you want to group on (the categories), go to the Manual Aggregation tab and pick the columns you want to aggregate and the operation to apply to each.
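As a rough code equivalent of picking a column and its operations, here is a small pandas sketch.  It is an analogy only, and the `category` and `amount` columns are invented for the example.

```python
import pandas as pd

# Hypothetical data: one category column and one numeric column.
df = pd.DataFrame({
    "category": ["x", "x", "y", "y", "y"],
    "amount": [1.0, 3.0, 2.0, 4.0, 9.0],
})

# One row per category, one column per aggregation.
summary = df.groupby("category")["amount"].agg(["sum", "mean", "median"])

print(summary)
```

Each category ends up with its own sum, mean, and median of `amount`, which is the kind of table I look at when exploring.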


Some of the most common and useful math operations are:




Percentage – % of the total for that column inside that category
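To make the percentage idea concrete, here is a minimal pandas sketch of one way to read it: each category's share of the column total.  This is my assumption about the calculation for illustration, not a statement of how the node computes it, and the data is made up.

```python
import pandas as pd

# Hypothetical data: one category column and one numeric column.
df = pd.DataFrame({
    "category": ["x", "x", "y", "y"],
    "amount": [10, 15, 30, 45],
})

# Sum the column inside each category, then divide by the overall total.
totals = df.groupby("category")["amount"].sum()
percent = totals / totals.sum() * 100

print(percent)
```

Here category `x` holds 25 of the 100 total and category `y` holds the other 75.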




I typically look at those to explore my data and understand what it looks like.  Sometimes that inspires me to dive further, and that is the essence of data exploration.




KNIME’s GroupBy is a powerful node that is easy to use.  It would be even easier if KNIME added the ability to re-order the columns inside the GroupBy node, but we can work around that for now.