Overview of Feature Scaling

In this guide, we'll be discussing feature scaling. On the surface, this may seem like an intimidating topic, especially if you aren't familiar with some of the terms we'll be talking about. But the reality is, it's a pretty straightforward concept once you're familiar with it.

So without further ado, let's go over some exciting definitions. According to Wikipedia, feature scaling is a method used to normalize the range of independent variables or features of data. That, of course, leads us to what normalization is. For that, Wikipedia says that normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging. In more complicated cases, normalization may refer to more sophisticated adjustments, where the intention is to bring the entire probability distributions of adjusted values into alignment.

Now personally, I think they made it sound a little more complicated than it really needs to be. To me, normalization gives us a way of comparing apples to oranges. An easy example of this is comparing temperatures from cities around the world. In Detroit, Michigan, temperatures are typically measured on the Fahrenheit scale. But in Moneygall, Ireland, temperatures are measured in Celsius.

To normalize the data from each city, we'd convert the temperatures to a neutral or standard scale, something like the Kelvin scale. And that's pretty much why scientists around the world use SI units. They're essentially a built-in way of establishing a standard unit of measure.

And that leads us to the first method that we'll be covering, and that's standardization. This method works by creating a neutral scale that expresses each feature value's distance from the mean in terms of the standard deviation. When we standardize data, the mean is always set to zero, and all the feature values are replaced with how many standard deviations they are from the mean. A lot of you might have seen this method used before, because it's also referred to as the Z-score in statistics. And oftentimes, you'll choose to use this method when you want to limit the effect that outliers might have on the overall result.
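In other words, each value is replaced by (value minus mean) divided by the standard deviation. Here's a minimal sketch of that idea, using a few made-up grades just for illustration:

```python
import numpy as np

# Three made-up grades, just to illustrate the idea.
grades = np.array([70.0, 80.0, 90.0])

# Each value's distance from the mean, in units of standard deviation.
z_scores = (grades - grades.mean()) / grades.std()
print(z_scores)  # roughly [-1.22, 0.0, 1.22]
```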

To show you how this works, we're going to use Scikit-learn to create a standardized grading scale to mimic a couple of my old physics courses. And to save us a little bit of time, I already created a couple of randomly generated arrays that we'll be using, along with the histogram that we'll use as our visual. If you're curious about the NumPy functions I used to create the arrays, I'll run through them really quickly.

So the first thing I did was use NumPy's random generator, and I chose a normal distribution. The location is essentially where the mean is going to be located. For the first course, it's at 85, and for the second course it's at 55. Scale represents the spread of the data, which is the standard deviation. Then the last attribute is size, which is just the number of samples in our array.

I also used NumPy's sort function, which arranges the data from low to high. In there, I also added np.newaxis, which turns each array into a single column, the shape we're going to need anyway. Oh, and in the sort function, if you notice where it says axis equals zero, that's just telling the sort function which axis to sort along.

Then if you look down a little bit for the histograms, I used something new called subplot, which lets us graph multiple histograms in the same figure. The first input is how many rows we're going to be using, the second is the number of columns, and the third is the position of the histogram.
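Here's a minimal sketch of that setup. The array names, spread, and sample count are placeholders of my own, reconstructed from the description above, since the original code isn't shown here:

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated grades for two physics courses.
# loc is the mean, scale is the standard deviation, size is the sample count.
phy_150 = np.random.normal(loc=85, scale=5, size=250)
phy_314 = np.random.normal(loc=55, scale=5, size=250)

# Sort low to high (axis=0) and add a new axis so each array becomes a
# single-column 2D array, which is the shape Scikit-learn expects.
phy_150 = np.sort(phy_150[:, np.newaxis], axis=0)
phy_314 = np.sort(phy_314[:, np.newaxis], axis=0)

# subplot(rows, columns, position) lets us draw both histograms in one figure.
plt.subplot(1, 2, 1)
plt.hist(phy_150)
plt.title('Physics 150')
plt.subplot(1, 2, 2)
plt.hist(phy_314)
plt.title('Physics 314')
plt.show()
```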

Now, to get back to the standardization process, it's a pretty easy implementation. The first thing we're going to do is import the StandardScaler from Scikit-learn's preprocessing library. Then, similar to what we've done in the past, we'll create our scaler and set it equal to StandardScaler. Then we can go ahead and create our two variables. The first is going to be standardized phy 150, and that's going to equal scaler.fit_transform, passing in the array we want to fit and transform, which is phy 150. Then for the second variable, standardized phy 314, we'll call scaler.fit_transform again, and this time we'll pass in phy 314.
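Assuming the array names from the setup sketch above, that step looks roughly like this:

```python
from sklearn.preprocessing import StandardScaler

# Create the scaler, then fit and transform each course's grades.
# Each grade is replaced by its distance from the mean in standard
# deviations, i.e. (x - mean) / std.
scaler = StandardScaler()
standardized_phy_150 = scaler.fit_transform(phy_150)
standardized_phy_314 = scaler.fit_transform(phy_314)
```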

Now, before we do anything else, let's move down and pass this into the data frame that we're going to be creating. When we run this, you'll notice that all of our original data is sorted from low to high, and the standardized data ranges from roughly negative two to two. That's going to be different for all of us, because the random generator gives us a new array every time. And the last little piece is that if you look for the zero in your standardized data, that's going to be equivalent to the mean of the original data. So like I said, it's a pretty easy implementation. It's just more or less getting used to the process and knowing exactly what the data is telling you.
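A rough sketch of that comparison DataFrame, with column labels of my own choosing:

```python
import pandas as pd

# Put the original and standardized grades side by side for comparison.
# flatten() turns each single-column array back into 1-D for display.
df = pd.DataFrame({
    'phy_150': phy_150.flatten(),
    'standardized_phy_150': standardized_phy_150.flatten(),
    'phy_314': phy_314.flatten(),
    'standardized_phy_314': standardized_phy_314.flatten(),
})
print(df)
```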

The next few methods that we're going to be going over are a little different from the first, because none of them use the standard deviation. Instead of standardization, which does use the standard deviation, these will all be considered normalization methods.

The first normalization method that we'll talk about is called mean normalization. The name is pretty self explanatory, but here, the distance between each feature vector and the mean is in terms of the range. By doing it this way, we automatically set the mean at zero. Then by putting it in terms of the range, we make sure the normalized data has distribution limits of negative one and one. And to the best of my knowledge, Scikit-learn still doesn't have a function for this method, so we won't actually be going through this in the course.
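Since Scikit-learn doesn't provide a transformer for it, here's a quick NumPy sketch of the formula, just for reference, reusing one of the arrays from earlier:

```python
import numpy as np

def mean_normalize(x):
    # (x - mean) / (max - min): centers the data at zero and keeps
    # every value between -1 and 1.
    return (x - x.mean()) / (x.max() - x.min())

mean_normalized_phy_150 = mean_normalize(phy_150)
```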

Next up is probably the most commonly used and simplest method, which is min-max normalization. It's pretty similar to our last example, but instead of using the average, we're going to be using the minimum. So the distance of each feature vector from the minimum value is in terms of the range. When we normalize data using this method, the interval is bounded by zero and one.

The implementation of min-max normalization is pretty easy as well. We're going to follow the same basic steps that we did for standardization. So we'll start by importing the MinMaxScaler from Scikit-learn's preprocessing library. Then, just like before, we'll create our scaler and assign it to MinMaxScaler. And then we'll create our two variables. The first will be normalized phy 150, which gets the scaler's fit_transform of phy 150, and the second will be normalized phy 314, which gets the scaler's fit_transform of phy 314. Then we're going to pass this into our pandas data frame again. With that all set, we can go ahead and run it, and you'll notice that everything ranges from zero to one, just like we expected.
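A minimal sketch of that step, reusing the array names and DataFrame from earlier:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each course's grades so the minimum maps to 0 and the maximum to 1:
# (x - min) / (max - min).
min_max_scaler = MinMaxScaler()
normalized_phy_150 = min_max_scaler.fit_transform(phy_150)
normalized_phy_314 = min_max_scaler.fit_transform(phy_314)

# Add the normalized columns to the comparison DataFrame from before.
df['normalized_phy_150'] = normalized_phy_150.flatten()
df['normalized_phy_314'] = normalized_phy_314.flatten()
```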

We won't be doing an example, but the same rescaling method can be used without zero and one as the boundary. Instead, rescaling can be set to any interval by using any arbitrary value for a and b.
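For what it's worth, Scikit-learn's MinMaxScaler takes a feature_range argument that lets you pick those boundaries, so a sketch would look something like this:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale to an arbitrary interval [a, b]; here [-1, 1] as an example.
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
rescaled_phy_150 = custom_scaler.fit_transform(phy_150)
```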

The last method that we'll cover is scaling to unit length. Some of you might already be familiar with this method, because it's the same thing as a unit vector in physics and vector calculus. And this is an important method to understand for machine learning, because it's going to come up when we get to gradient descent and start to optimize algorithms. You don't need to know how to do the actual math, but having a solid understanding of gradients and partial derivatives will go a really long way when we start to build our own neural networks. The equation for unit length is pretty straightforward. It's just the feature vector divided by its magnitude.

The purpose of this method is to scale the components of the feature vector so the complete vector has a length of one. And this is probably a little bit easier to visualize. If we have a feature vector X that contains the values two and three, we can calculate the length, or magnitude, using the Pythagorean theorem: the square root of two squared plus three squared, which is the square root of 13. Then, to create a unit vector with a total length of one, we just divide each component by the magnitude we just calculated. It's really easy to verify too, because we just take the square root of the sum of each squared component and get one back.
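Here's a quick numeric check of that example:

```python
import numpy as np

x = np.array([2.0, 3.0])
magnitude = np.linalg.norm(x)   # sqrt(2**2 + 3**2) = sqrt(13), about 3.606
unit_x = x / magnitude          # [2/sqrt(13), 3/sqrt(13)]
print(np.linalg.norm(unit_x))   # 1.0 -> the scaled vector has unit length
```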

The reason I saved scaling to unit length until the end is because, compared to our other methods, the result is a little bit wonky. The implementation is still the same, so we'll start by importing Normalizer from Scikit-learn's preprocessing library. Then we'll assign the Normalizer to our scaler. The first variable is unit length phy 150, which is assigned to unit_length_scaler.fit_transform with phy 150 passed in. And as you might expect, the next variable is unit length phy 314, which is assigned to unit_length_scaler.fit_transform with phy 314. Then we'll add that to our pandas data frame.
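A minimal sketch of that step, again reusing the earlier names:

```python
from sklearn.preprocessing import Normalizer

# Normalizer rescales each sample (row) to unit length.
unit_length_scaler = Normalizer()
unit_length_phy_150 = unit_length_scaler.fit_transform(phy_150)
unit_length_phy_314 = unit_length_scaler.fit_transform(phy_314)

# Because each row here holds a single grade, every row is a one-component
# vector, so each normalized value comes out as 1.
df['unit_length_phy_150'] = unit_length_phy_150.flatten()
df['unit_length_phy_314'] = unit_length_phy_314.flatten()
```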

And now when we run this, you'll notice that all of the elements in the unit length column are just one. This might make sense to some of you already, because what we did was scale everything to one. But if you don't know how vectors work, this can be a little bit confusing.

So one thing you need to know is that vectors have both a magnitude and a direction, and what we're seeing in that column is the magnitude, which is now one for every vector. Going back to the example that I used, if the components are two and three and we scale them to unit length, we end up with two over root 13 and three over root 13. Those components are actually the direction the vector is pointing in.

This doesn't really mean much to us right now, but again, when we get to optimization and start using gradient descent, this comes into play a little bit more. We're not going to have to do the actual math for any of this, but you should know that the gradient vector points in the direction of steepest ascent, which is why gradient descent steps in the opposite direction.

Feature scaling is a key component of data preprocessing. Not only does it give us a standard scale to work with, but it also saves us a lot of computational time. Instead of your program iterating through values in the thousands or sometimes millions, we're able to scale our data down to a range from zero to one or negative one to one. In contrast, not scaling, or doing it incorrectly, can produce some pretty awful mistakes.

So I'm going to finish this guide with a quick story, and this isn't really machine learning related. Back in 1999, NASA worked with an aerospace company named Lockheed Martin, and their mission was to launch a space probe intended to orbit Mars and take climate measurements. Unfortunately, the ground software that Lockheed created provided all of its impulse measurements in pound-seconds, while the software that NASA built expected that information in newton-seconds. Ultimately, NASA lost communication with the probe, along with $125 million worth of development.

That's crazy to me, because even the best engineers in the world can sometimes forget something as simple as feature scaling and having a standard scale.

Anyway, that's going to wrap it up for now and I will see you in the next guide.