In this guide, we're going to briefly rehash NaN values and then expand upon a couple of the concepts that we talked about earlier. A couple of guides ago, I mentioned that the majority of the data that we'll be working with is going to be given to us. So there's no real guarantee it's going to be accurate or complete. And in the case of missing data, we know it's automatically replaced by an NaN value which acts as the primary placeholder.
Now, to save us a little bit of time, I already imported the libraries and segmented the data frame that we used in the earlier guide. One of the underlying problems with forgetting to handle the NaN values is that they override everything else in a calculation: any arithmetic that touches an NaN value gives us an NaN value back. To show you what I mean, let's perform a couple of basic operations.
First, let's just try to sum all the elements in the array. What we get back is an NaN value. Next let's try to calculate the mean. And just like before, we get another NaN value. As you can imagine, this is gonna be the case for just about any arithmetic operation we use. And while that's going to be a problem, the real issue arises when the NaN value eventually starts to propagate throughout the data.
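Here is a minimal sketch of those two operations. The array `values` is a hypothetical stand-in for the segment of the data frame used in the guide, which isn't reproduced here.

```python
import numpy as np

# Hypothetical stand-in for the array segmented from the data frame.
values = np.array([4.0, np.nan, 7.5, 2.0, np.nan, 9.0])

# A single NaN is enough to make both aggregates NaN.
total = np.sum(values)
average = np.mean(values)

print(total)    # nan
print(average)  # nan
```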
So try to think about it this way. If we have an NaN value in our original function, we know that that's going to produce an NaN value for the output. But what happens when we use that function in another function and that function feeds into three or four other functions? That's when things start to get screwed up really, really quickly. And you'll notice how this can be such a big problem when we start to build up more complicated systems, because there's going to be a lot of crossover between different algorithms.
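To make that chain a little more concrete, here is a small sketch. The three functions (`normalize`, `weighted_score`, `decision`) and the sample numbers are made up for illustration and are not part of the guide's code; they only show how one NaN at the start ends up as a quiet, wrong answer at the end.

```python
import numpy as np

def normalize(x):
    # Divide by the mean; one NaN makes the mean NaN,
    # so every element of the result becomes NaN.
    return x / np.mean(x)

def weighted_score(x, weights):
    # Feeds the output of normalize into a dot product.
    return np.dot(normalize(x), weights)

def decision(x, weights, threshold=1.0):
    # Any comparison against NaN evaluates to False.
    return weighted_score(x, weights) > threshold

# Hypothetical data with a single missing reading.
readings = np.array([2.0, np.nan, 4.0])
weights = np.array([0.5, 0.3, 0.2])

normalized = normalize(readings)
score = weighted_score(readings, weights)
flag = decision(readings, weights)

print(normalized)  # every element is nan
print(score)       # nan
print(flag)        # False
```

Notice that the last function doesn't raise an error. It just returns False, which is why this kind of problem can be hard to trace back to the original missing value.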
But thankfully there are a ton of ways that we can avoid this annoyance. We already went through how to search and replace an NaN value by using the argwhere function followed by the SimpleImputer. So now let's go through another really cool option that you can use: the masked array. While the masked array is technically a subclass of an ndarray, it lives in a separate NumPy module, numpy.ma.
So, in order for us to use it, we'll actually need to import it first. What makes a masked array so great is that it works almost identically to a normal array, but it gives us a really, really easy way of ignoring or masking elements so we don't have to include what we don't want. Take for example the array that we were just working with. If we want to access that array with all of the NaN values hidden, there's an incredibly easy way of doing that.
Let's start by creating a masked array variable, then pass in ma.masked_invalid followed by the name of the original array. Now let's go ahead and see what we have returned. And you'll notice really quickly that there are two parts to this: the data and the mask. The data is just a numerical array that shows dashes in place of all of the NaN values. And the mask array is a bunch of booleans. So, whenever an element is masked, you'll see True, and for any element that's unmasked, you'll see False.
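As a rough sketch of that step, using the same hypothetical `values` array as a stand-in for the guide's data:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-in for the array used in the guide.
values = np.array([4.0, np.nan, 7.5, 2.0, np.nan, 9.0])

# masked_invalid masks every NaN (and infinite) entry.
masked_values = ma.masked_invalid(values)

print(masked_values)       # masked entries are displayed as --
print(masked_values.mask)  # True at the two NaN positions, False elsewhere

# A masked array is still an ndarray subclass.
print(isinstance(masked_values, np.ndarray))  # True
```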
Now, using the masked array, let's go ahead and try to calculate the sum and the mean of all the elements again. And as you probably expected, it worked both times. We're not going to go into any more detail about the masked array, but the reason I wanted to show it to you is that I find it really helpful whenever I want to do just a little bit more exploration. And that's because unlike the SimpleImputer, where we have to decide right away how we want to replace the NaN values, the masked array doesn't care. It just lets us bypass that step altogether.
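Here is a small sketch of the same sum and mean on a masked array, again with a hypothetical `values` array rather than the guide's actual data. Only the four unmasked elements take part in the calculation.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical stand-in for the array used in the guide.
values = np.array([4.0, np.nan, 7.5, 2.0, np.nan, 9.0])
masked_values = ma.masked_invalid(values)

# The aggregates skip the masked (NaN) entries.
total = masked_values.sum()     # 4.0 + 7.5 + 2.0 + 9.0
average = masked_values.mean()  # total divided by the 4 unmasked elements
unmasked = masked_values.count()

print(total)     # 22.5
print(average)   # 5.625
print(unmasked)  # 4
```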
Like everything, the masked array has its downsides, mainly when it comes to broadcasting. But that's a topic for a future guide.
The last little piece that I'd like to add has to do with placeholders other than NaN values. Now, NumPy offers a feature that allows us to create an array or a matrix that's made up of ones or zeros. And this doesn't mean much to us right now, but further into the course, we're going to be using this concept again to assign weights or bias to an array which will then be modified by an algorithm while it's being trained. But you don't need to worry about any of that for right now.
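For reference, a minimal sketch of that feature using `np.ones` and `np.zeros`. The names `weights` and `bias` and the shapes are just illustrative assumptions, not values from the course.

```python
import numpy as np

# A 3x2 matrix filled with ones, e.g. as starting weights (illustrative shape).
weights = np.ones((3, 2))

# A length-2 array filled with zeros, e.g. as a starting bias (illustrative shape).
bias = np.zeros(2)

print(weights.shape)  # (3, 2)
print(bias)           # [0. 0.]
```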
And that about wraps everything up for now, so I will see you in the next guide.