Setting Up Test Validation and Training Sets of Data

In this guide, we're going to dig into how supervised learning algorithms are trained and evaluated through the use of training, validation, and testing data sets. When you start working on a supervised learning build, at a minimum you'll be breaking your data frame down into two sets: training and testing. At other times you'll use three sets: training, validation, and testing.

In both cases, the general rule is that the majority of the data is dedicated to the training set, and the remainder is allocated based on sample size and the model itself. With a reasonably large data frame, about 80% of the data typically goes to the training set. Then, depending on the model and the number of hyperparameters, the validation and testing sets split the remainder 10%/10%. Keep in mind that if there aren't many hyperparameters to tune, you can shift some of that data into the test set, or skip the validation set entirely.
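To make that split concrete, here's a minimal sketch of an 80/10/10 split in plain Python. In practice you'd usually reach for a library helper such as scikit-learn's train_test_split; the function name and toy data below are just for illustration:

```python
import random

def train_val_test_split(rows, train=0.8, val=0.1, seed=42):
    """Shuffle rows and split them into train / validation / test subsets."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # seeded shuffle for reproducibility
    n = len(rows)
    n_train = int(n * train)
    n_val = int(n * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

data = list(range(100))
train_set, val_set, test_set = train_val_test_split(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Note that the shuffle happens before the split, so each subset is a random sample rather than a contiguous chunk of the original data.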

Now let's talk about why any of this is important to you as a machine learning developer. By now, we should all know that supervised learning requires prior knowledge. And what I mean is that in order for supervised learning to work, we have to know what the output should be based on an input. The reason this has to be the case is because the goal of supervised learning is to approximate a relationship between the input and output of the data.

To better understand this, the analogy I like to use is a parent telling their child not to touch a stove top. Since the parent has prior knowledge from using the stove, they're able to approximate how hot it might be. After the parent passes that information on, the child can decide what course of action to take based on what they've been taught. If they decide to touch the stove, they'll quickly realize, with a high level of certainty, that their parents were right.

In machine learning, we use the training set to feed a supervised learning algorithm information. To be more specific, the training set is the sample of data used to fit the model. If you haven't already, you're going to hear the terms model fitting and best-fit model a lot. The reality is that it's a pretty simple concept that people like to overcomplicate: when a model fits the data well, it makes good predictions.
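As a minimal illustration of what "fitting" means, here's a least-squares line fit on a toy training set. The helper name and the numbers are hypothetical; the point is just that the model's parameters are estimated entirely from the training examples:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b on a training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Training set: inputs with known outputs (here y = 2x + 1, no noise)
train_x = [0, 1, 2, 3, 4]
train_y = [1, 3, 5, 7, 9]
a, b = fit_line(train_x, train_y)
print(a, b)  # 2.0 1.0 -- the fitted line recovers the underlying relationship
```

Because the training data here is noise-free, the fit is exact; with real data, the fitted parameters only approximate the input-output relationship.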

But to get back to our analogy, the training set is made up of all the experiences the parent has had with their stove. They know every time they turn the stove on it gets hot, thus turning the stove on is the input and heat is the output. The parent also knows that the stove isn't limited to blazing hot or room temperature, instead there's a range of temperatures based on different factors.

And that's where the validation set comes in. Validation sets are used to provide an unbiased evaluation of how well a model fit on the training data generalizes, while its hyperparameters are being tuned. To produce a more precise fit, you reduce variance by fine-tuning those hyperparameters. So if the parent in our example decided to ignore the laws of thermodynamics and conclude that stoves must be hot when the burners are on and cool when the burners are off, we'd probably end up with a poorly fit model, and a family with really sore hands.

But thankfully for us, the parents considered heat dissipation and tuned their internal hyperparameters, allowing their model to more accurately predict the output. Because of these modifications, the uncertainty in the model was reduced, which presumably decreases the likelihood of the child burning their hand. With a well-fit model in place, the algorithm can be tested, and that's obviously where the testing data comes into play.
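To see hyperparameter tuning with a validation set in code, here's a small sketch using a 1-D k-nearest-neighbors regressor, where k is the hyperparameter we choose by validation error. The data and helper names are made up for illustration:

```python
def knn_predict(train, x, k):
    """Average the outputs of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(k, train, points):
    """Mean squared error of a k-NN model over a set of (x, y) points."""
    errs = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return sum(errs) / len(errs)

# Training set: y = 2x plus alternating +1/-1 noise
train = [(x, 2 * x + (1 if x % 2 == 0 else -1)) for x in range(20)]
# Validation set: clean points halfway between the training inputs
val = [(x + 0.5, 2 * x + 1) for x in range(4, 15)]

# Tune k: keep the candidate with the lowest validation error
candidates = [1, 2, 4, 8]
best_k = min(candidates, key=lambda k: mse(k, train, val))
print(best_k)  # 2 -- averaging pairs of neighbors cancels the noise
```

The model's "memory" comes only from the training set; the validation set is used purely to compare candidate hyperparameter values, which is exactly the division of labor described above.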

There's really only one purpose for the testing set, and that's to evaluate the final model. Applying this back to our analogy, the test set is made up of every moment the child decides to touch the stove top. Let's say the burner has been off for about 20 minutes and the child finally wants to test their parents' advice. They touch the stove top and it's only lukewarm. Based on the information they've been given, that kind of makes sense, but how do they know for sure their parents weren't just wrong, and that's actually the hottest the stove gets? Being the curious child they are, they turn the burner back on and let it heat up.

Anyway, I think you get the point. Every time the child decides to touch the burner, they're acting as the test set by evaluating the information their parents have given them. I hope all of this made sense, because it's definitely something you'll be dealing with as you move into the workplace. When handling real-world data, things aren't always this cut and dried; at some point you'll probably see coworkers mixing things up, like using the validation set as their test set, which is definitely not a good practice, since a set that's already been used for tuning can no longer provide an unbiased estimate of performance.
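In code, the test set's role boils down to one final evaluation of the chosen model. A minimal sketch, where the already-trained model and the numbers are hypothetical:

```python
def mse(predict, points):
    """Mean squared error of a model over a held-out set of (x, y) points."""
    errs = [(predict(x) - y) ** 2 for x, y in points]
    return sum(errs) / len(errs)

# Hypothetical final model, already trained and tuned: y = 2x + 1
def predict(x):
    return 2 * x + 1

# The test set is touched exactly once, for the final performance report
test_set = [(10, 21), (11, 23.5), (12, 25)]
test_error = mse(predict, test_set)
print(test_error)  # ~0.0833
```

The key discipline is that nothing about the model changes after this point; if you went back and re-tuned based on the test score, the test set would quietly become a second validation set.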

And that about brings us to the end. So that'll do it for now, and I'll see you in the next guide.