Data Categorization

In this guide, we're going to do a quick overview of data categorization. Outside the data science world, categorization and classification are used interchangeably, but for machine learning developers, there's a big difference. I know we haven't talked about classification yet, but for us, it's one of the primary methods that we'll be using for predictive modeling. For now, you just need to know that classification works by sorting data into classes or subgroups based on similarities.

Categorization, on the other hand, is more of an organizational structure that we set before we do any modeling. I'd say the biggest benefit that I get from categorizing data is that it almost ensures that I use the right scaling and processing method for each dataset.

The way I like explaining the difference between categorization and classification is the same way that we talk about the animal kingdom. If we have a mole-rat, a turtle and an assassin bug, they all belong to their own class. A mole-rat's a mammal, a turtle's a reptile and the assassin bug is an insect. Then each class belongs to a phylum: mammals and reptiles both fall under the chordates, which is the phylum that contains the vertebrates, while insects fall under the arthropods. So we consider the phylum to be the category. And similar to how we break down the animal kingdom, each phylum or category needs to have some set of shared attributes, and then once you progress into the class, it gets even more specific.

There are a bunch of different ways that we can categorize data, but usually, I like to start by deciding if I want to categorize by input or by output. If I choose to categorize by input, I'll look to see if the data is labeled. If it is, I know I can apply supervised learning. But if it's unlabeled, it'll be an unsupervised approach. If I end up categorizing by output, and the output's a number, then it's a regression problem. If the output is a class, then it's a classification problem.
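To make that decision a little more concrete, here's a minimal sketch of the two checks in plain Python. The function names `categorize_by_input` and `categorize_by_output` are made up for this example, and the rule "numeric target means regression" is a simplification: if your classes happen to be stored as integers, you'd have to handle that yourself.

```python
from numbers import Number


def categorize_by_input(labels):
    """Hypothetical helper: labeled data -> supervised, otherwise unsupervised."""
    has_labels = labels is not None and len(labels) > 0
    return "supervised" if has_labels else "unsupervised"


def categorize_by_output(targets):
    """Hypothetical helper: numeric output -> regression, class output -> classification."""
    if not targets:
        raise ValueError("targets must contain at least one value")
    # bool is a subclass of int in Python, so treat True/False as class labels.
    is_numeric = all(
        isinstance(t, Number) and not isinstance(t, bool) for t in targets
    )
    return "regression" if is_numeric else "classification"
```

So calling `categorize_by_input(None)` gives back the unsupervised answer, and `categorize_by_output(["A", "B", "C"])` gives back classification, because letter grades are classes and not numbers.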

And you can see what I mean by using this DataFrame. We can see that every entry is a numerical output. What I mean by that is that we can find out what the a_students have in common with each other, what the b_students have in common with each other and so on. What came out of that is that students who worked between 10 and 20 hours a week and got higher than a B in their prerequisite course ended up with either an A or a B in this course. Students who worked more than 20 hours and got a C or higher in the prereq ended up with an even spread of Bs, Cs and Ds. Then finally, the students who didn't work at all and got a C in the prereq usually ended up with either a C, D or E.

Knowing that, we can actually categorize incoming students, based on their chance for success. So depending on their group, we can apply different algorithms to help predict what areas they should actually focus on to give them their best overall result.
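Here's a rough sketch of what that grouping could look like in code. It only encodes the three patterns described above, so anything that falls outside of them comes back as "uncategorized". The names `GRADE_RANK` and `student_group`, the group labels, and the assumption that grades run from A down to E are all mine, just for illustration.

```python
# Assumed grade scale, ranked so that grades can be compared (A is highest).
GRADE_RANK = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}


def student_group(hours_worked, prereq_grade):
    """Hypothetical helper: place an incoming student into one of the observed groups."""
    rank = GRADE_RANK[prereq_grade]
    # Worked 10 to 20 hours a week and got higher than a B in the prereq.
    if 10 <= hours_worked <= 20 and rank > GRADE_RANK["B"]:
        return "likely A or B"
    # Worked more than 20 hours a week and got a C or higher in the prereq.
    if hours_worked > 20 and rank >= GRADE_RANK["C"]:
        return "likely B, C or D"
    # Didn't work at all and got a C in the prereq.
    if hours_worked == 0 and rank == GRADE_RANK["C"]:
        return "likely C, D or E"
    # The observed patterns don't cover this combination.
    return "uncategorized"
```

For example, `student_group(15, "A")` lands in the first group, while `student_group(5, "B")` comes back uncategorized, which is a good reminder that these three rules don't cover every student.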

There are really countless ways to categorize data. So having a good understanding of the data and what result you're trying to achieve will have the biggest effect on how you decide to categorize everything.

This isn't necessarily a fact, but to me, it seems like data categorization is kind of an overlooked area. I'm also not going to lie and say that it can make or break a company. The reality is that properly categorizing data can end up saving you a lot of time and help give you a much better long-term result.