Replacing NaN in Machine Learning Projects

In this guide, we'll be discussing a few different ways of handling incomplete numerical data in a data frame. Like we've talked about before, whenever you have a data frame with missing numerical data, NaN will act as the placeholder for the missing element. There are a bunch of different ways to deal with NaN values, but in this guide, I'll be showing you how to replace them with either the mean, median, or mode. The example that we'll be going through uses raw data given to us by the university's biology department, and the file name should be "Biology Raw Data."

To begin, you can go ahead and import NumPy, Pandas, and the CSV file that we'll be using. Once you're done with that, drop down into the console and pass in df.info(). What this tells us is that the data frame has 20 index rows and six columns, along with the label name of each of those columns. It also looks like we could be missing four numerical values in the average grade column.
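
Here's a rough sketch of what that first step could look like. The CSV itself isn't reproduced in this guide, so the column names and values below are made-up stand-ins; only the overall shape (20 rows, six columns, four missing grades) follows the description above.

```python
import numpy as np
import pandas as pd

# In the guide the data is read from the CSV file, along the lines of
#   df = pd.read_csv("Biology Raw Data.csv")   # exact file name/path assumed
# The frame below is a made-up stand-in with the same shape (20 rows, 6 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Student ID": np.arange(1, 21),
    "Attendance": rng.integers(60, 101, size=20),
    "Quiz Average": rng.integers(50, 101, size=20),
    "Lab Average": rng.integers(50, 101, size=20),
    "Exam Average": rng.integers(50, 101, size=20),
    "Average Grade": rng.integers(50, 101, size=20).astype(float),
})

# Blank out four grades so the column has gaps like the one described above.
df.loc[[2, 7, 11, 16], "Average Grade"] = np.nan

df.info()                                  # row count, column labels, non-null counts
print(df["Average Grade"].isna().sum())    # 4
```

In the df.info() output, a column whose non-null count is lower than the number of rows is the one with missing values.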

Now that it looks like we have some cleaning to do, let's go ahead and segment the data frame and convert the variables into NumPy arrays. Just like the other examples we've gone through, average grade will be the dependent variable. The other five columns will make up the features variable.
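
One way the segmenting step might look is sketched below. It assumes the average grade is the last column of the data frame, and the tiny data frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in: five feature columns plus the grade column last.
df = pd.DataFrame({
    "Student ID": [1, 2, 3, 4],
    "Attendance": [95, 80, 88, 72],
    "Quiz Average": [90, 65, 78, 70],
    "Lab Average": [85, 70, 92, 60],
    "Exam Average": [88, 74, 81, 66],
    "Average Grade": [89.0, np.nan, 84.0, np.nan],
})

# Assumption: the dependent variable is the last column.
features = df.iloc[:, :-1].to_numpy()           # every column except the last
dependent_variable = df.iloc[:, -1].to_numpy()  # only the last column

print(features.shape)            # (4, 5)
print(dependent_variable.shape)  # (4,)
```

Notice that the dependent variable comes out as a one dimensional array, which matters once we hand it to the imputer.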

To replace the NaN values, we're going to use a new library called scikit-learn. This is another incredibly powerful tool that we can use for data mining and analysis. For now, we'll only be using a single class from the library, so we'll pass in from sklearn.impute import SimpleImputer. If you haven't heard of imputation, it's a statistical process that replaces missing data with a substituted value.

To get this going, let's create a variable named imp_mean and assign it to SimpleImputer. After we pass in the parentheses, let's take a look at the parameters that we'll need. The default for missing_values is NumPy's NaN, and the default strategy is mean. So, for right now, we won't have to pass in anything else.
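
As a quick check on those defaults, something like the following should work. The second variable name, imp_mean_explicit, is just for illustration and isn't part of the walkthrough.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# No arguments needed: the defaults already target NaN and use the mean.
imp_mean = SimpleImputer()

print(imp_mean.strategy)        # mean
print(imp_mean.missing_values)  # nan

# Spelling the defaults out gives an equivalent imputer (illustrative name).
imp_mean_explicit = SimpleImputer(missing_values=np.nan, strategy="mean")
```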

Now, let's move down the line and create another variable, named mean dependent variable. The next step is to fit the imputer to the data that we want to work with, then transform the NaN values into numerical values. To do that, we'll pass in imp_mean.fit_transform. Then, inside the parentheses, we'll pass in dependent variable. Then we can go ahead and run it to see whether we get any errors.

Okay. So when we run this, you should be getting an error. Basically, what this means is that the imputer expects a two dimensional array, so our one dimensional array needs to be converted. This part is a little confusing at first, but I'll show you what this means. The way we have this structured is a one dimensional, 20 element array. If we pass in np.reshape, followed by dependent variable, and finish with a tuple containing negative one and one, we get a two dimensional array with one column. The only noticeable difference in the shape is that now there's a one instead of a blank space: (20, 1) rather than (20,). That doesn't really seem like much of a difference. But when we return both of the arrays, you'll notice how the reshaped array actually runs vertically. That indicates to us that the array is being handled as a two dimensional array, with columns and rows.
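
To see the difference for yourself, here's a small sketch that only needs NumPy. The 20 values are made up; the shapes are the part to pay attention to.

```python
import numpy as np

# Made-up 20-element stand-in for the dependent variable.
dependent_variable = np.linspace(60.0, 98.0, 20)
dependent_variable[[2, 7, 11, 16]] = np.nan

# -1 lets NumPy work out the number of rows; 1 asks for a single column.
reshaped = np.reshape(dependent_variable, (-1, 1))

print(dependent_variable.shape)  # (20,)
print(reshaped.shape)            # (20, 1)

print(dependent_variable[:3])    # prints on one line: a flat, 1-D array
print(reshaped[:3])              # prints one value per row: a 2-D column
```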

So, going back to the code, we can either pass in the reshape function that we were just using, or we can index the array with np.newaxis. Now when we run this, we can see that all the NaN values have been replaced by the mean of the numbers in the column.
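
Putting the mean replacement together might look something like this. The five grades are made up, the underscore variable names are my assumed spelling of the names used in the walkthrough, and same_result is only there to compare the two approaches.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up grades with two missing entries.
dependent_variable = np.array([90.0, np.nan, 70.0, 80.0, np.nan])

imp_mean = SimpleImputer()  # defaults: missing_values=np.nan, strategy="mean"

# Passing the 1-D array directly is rejected, as described above.
try:
    imp_mean.fit_transform(dependent_variable)
except ValueError as err:
    print("1-D input rejected:", type(err).__name__)

# Option 1: reshape into a single column.
mean_dependent_variable = imp_mean.fit_transform(dependent_variable.reshape(-1, 1))

# Option 2: add an axis with np.newaxis; the resulting shape is the same.
same_result = imp_mean.fit_transform(dependent_variable[:, np.newaxis])

# The mean of 90, 70 and 80 is 80, so both NaN entries become 80.
print(mean_dependent_variable.ravel())
```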

Now, to replace NaN with the median of the dependent variable column, we're going to run through the exact same steps with one small change. We'll start with a variable named imp_median and assign it to SimpleImputer. We're still replacing NaN, so we don't need to do anything with the missing_values parameter. But this time, we're going to change the strategy by passing in median. Then we'll create a median dependent variable and add the fit transform function. We'll finish by converting the one dimensional array into a two dimensional array.
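
A minimal sketch of the median version, again with made-up grades and my assumed underscore spelling for the variable names:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up grades with two missing entries.
dependent_variable = np.array([90.0, np.nan, 70.0, 85.0, np.nan])

# Same steps as the mean, but with strategy="median".
imp_median = SimpleImputer(strategy="median")
median_dependent_variable = imp_median.fit_transform(dependent_variable.reshape(-1, 1))

# The median of 70, 85 and 90 is 85, so both NaN entries become 85.
print(median_dependent_variable.ravel())
```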

We'll be doing this one more time, but this time we'll use the mode. So, imp_mode = SimpleImputer, and we'll pass in strategy='most_frequent'. Then to finish, we'll create the mode dependent variable and pass in the fit transform function, followed by the conversion of the one dimensional array into a two dimensional array. When we run this, there should be no errors. So far, so good.
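
And the mode version could be sketched the same way. The grades are made up so that one value (70) clearly shows up more often than the others.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up grades: 70 appears twice, every other value once.
dependent_variable = np.array([90.0, np.nan, 70.0, 85.0, 70.0, np.nan])

# "most_frequent" is the strategy name scikit-learn uses for the mode.
imp_mode = SimpleImputer(strategy="most_frequent")
mode_dependent_variable = imp_mode.fit_transform(dependent_variable.reshape(-1, 1))

# Both NaN entries become 70, the most common value.
print(mode_dependent_variable.ravel())
```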

Before we finish up, let's make all of this a little bit easier to look at. We'll do that by joining all three arrays and rounding all the new values to the closest integer. To join the arrays, we're going to create a replacing NaN variable. Then we're going to use the horizontal stack, or hstack, function to join the arrays. Inside the parentheses, we have to pass in a tuple containing the array names.

While we're doing that, we're also going to include the around function from NumPy. This allows us to round the float values to whatever decimal place we want. The default number of decimals is zero. So, for this example, we don't need to pass anything in other than the array that we want to round. When we run it again, it looks much cleaner already.
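
Here's roughly how the joining and rounding could fit together. The three single-column arrays are hand-written stand-ins for the imputed results, and the variable names use my assumed underscore spelling.

```python
import numpy as np

# Hand-written stand-ins for the three imputed columns (shape (3, 1) each).
mean_dependent_variable = np.array([[90.0], [78.75], [70.0]])
median_dependent_variable = np.array([[90.0], [81.0], [70.0]])
mode_dependent_variable = np.array([[90.0], [70.0], [70.0]])

# hstack takes a tuple of arrays and joins them side by side;
# around rounds to zero decimals by default.
replacing_nan = np.around(
    np.hstack((mean_dependent_variable,
               median_dependent_variable,
               mode_dependent_variable))
)

print(replacing_nan.shape)  # (3, 3)
print(replacing_nan)        # middle row becomes 79, 81, 70
```

One thing to keep in mind is that np.around rounds values that sit exactly halfway to the nearest even number, so 0.5 becomes 0 and 1.5 becomes 2.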

Since we're done working with the array, we can go ahead and convert it back to a Pandas data frame. So let's name our last variable clean_df. Then we'll pass in pd.DataFrame, followed by parentheses. Inside the parentheses, we'll create a dictionary where the column names are the keys and the values are slices of the array. We'll pass in Mean first, and that's mapped to all of the rows of the first column. Next up is Median, and that's going to be mapped to all of the rows of the second column. Then Mode is mapped to all of the rows of the third column. We'll run this one last time, and everything looks to be clean and organized.
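
That last conversion might look something like the sketch below, with a small made-up array standing in for the joined and rounded result.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the joined, rounded array (columns: mean, median, mode).
replacing_nan = np.array([
    [90.0, 90.0, 90.0],
    [79.0, 81.0, 70.0],
    [70.0, 70.0, 70.0],
])

# Keys are the column names; each value is every row of one array column.
clean_df = pd.DataFrame({
    "Mean": replacing_nan[:, 0],
    "Median": replacing_nan[:, 1],
    "Mode": replacing_nan[:, 2],
})

print(clean_df)
```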

These were just a few of the different ways that you can clean your data. But try to keep in mind, there's not a one-size-fits-all approach to any of this. So keep trying techniques until you find something that works best for you. Then, once you do, stick with that routine and stay consistent in your methodology.