Missing Value Management

In this guide we're going to continue our discussion on how missing values can be managed. In the last guide we replaced every not-a-number (NaN) element with either the mean, median, or mode. Now we're going to expand on that and introduce a couple of new methods, such as global constants and data mining algorithms, and then finish with when it's appropriate to just ignore missing data.

If you remember back to the last guide, when we calculated the mean we used every available sample in the average grade column. In that instance it worked just fine, but think about how the data frame is going to change over time as more and more data comes in. Eventually there will be hundreds if not thousands of different courses and professors, so using every sample might not make as much sense. Instead, calculating the mean based on relevant or similar courses might suit us a little better.

Let's say we're missing data for a couple of different biometry courses. It probably wouldn't make much sense for us to include university biology in our calculation of the mean. Not only is there a massive discrepancy in difficulty, there's also a huge difference in sample size. Because university biology is offered way more frequently than biometry, its grades would dominate the average and could skew our results, leading to misinformation.

So instead the mean should be calculated by only using samples that are relevant. In our example, we may only use rows containing other biometry courses or possibly expand that to all the rows containing the advanced 500 level courses. Ultimately there's no universal solution to this. It all depends on what will represent your data most accurately.
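To make the difference concrete, here is a minimal sketch in pandas that compares filling with the overall mean against filling with the mean of the same course. The column names (`course`, `avg_grade`) and all of the values are made up for illustration; they are not the data frame from the earlier guide.

```python
import numpy as np
import pandas as pd

# Hypothetical data: names and grades are invented for this example.
df = pd.DataFrame({
    "course": ["Biometry", "Biometry", "Biometry",
               "University Biology", "University Biology", "University Biology"],
    "avg_grade": [70.0, np.nan, 80.0, 90.0, 92.0, np.nan],
})

# Option 1: the overall mean uses every available sample, whatever the course.
global_filled = df["avg_grade"].fillna(df["avg_grade"].mean())

# Option 2: the group-wise mean uses only rows from the same course.
# transform("mean") returns one value per row, aligned with the original index.
group_means = df.groupby("course")["avg_grade"].transform("mean")
group_filled = df["avg_grade"].fillna(group_means)

print(pd.DataFrame({"global": global_filled, "by_course": group_filled}))
```

Here the missing biometry grade becomes 75.0 with the course-level mean, but 83.0 with the overall mean, because the university biology grades pull the overall average up. Grouping by something broader, like course level, would follow the same pattern with a different grouping column.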

Another option for managing missing data is the use of a global constant: a single placeholder value, something like "unknown" or "N/A", that stands in for every missing entry. This technique is often used because sometimes it's just easier and makes more sense than trying to predict a missing value.

Going back to our example, if there was a row where fall or spring semester was missing, our time would probably be better spent adding a global constant, instead of trying to predict what semester it was, especially if the semester doesn't have a big influence on the overall result of the algorithm.
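As a rough sketch of what that could look like in pandas, the example below fills a missing semester with the constant "Unknown". The data frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row is missing its semester.
df = pd.DataFrame({
    "course": ["Biometry", "Genetics", "Biometry"],
    "semester": ["Fall", np.nan, "Spring"],
})

# Replace every missing semester with the same placeholder value.
df["semester"] = df["semester"].fillna("Unknown")

print(df)
```

The placeholder then behaves like any other category, so it's worth picking a value that can't be confused with a real semester.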

The next option that we could choose from is a data mining algorithm. Here missing values are predicted by using regression, decision trees, or clustering algorithms. So unlike using the mean, median, or mode to determine an average, data mining algorithms allow us to make predictions with a level of certainty. This is an option that I love because it uses the most available information in the data set to actually predict a missing value. Instead of just taking inputs from the grade column, we can include the course, professor, semester, and dynamic learning to actually predict grades and generate a less biased result. But a big downside is that time and processing power are valuable resources. Running a mining algorithm can be a lengthy process, especially when we compare it to replacing NaN values with something like the mean.
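To illustrate the regression flavor of this idea, here is a minimal sketch that fits an ordinary least-squares model on the rows where the grade is known and uses it to predict the rows where it is missing. Everything here is an assumption made for the example: the column names, the tiny data set, and the choice of a plain linear model built with NumPy. A real project would more likely reach for a dedicated machine learning library and a larger set of features.

```python
import numpy as np
import pandas as pd

# Hypothetical data: grades depend on the course and the professor.
df = pd.DataFrame({
    "course":    ["A", "A", "B", "B", "A", "B"],
    "professor": ["X", "Y", "X", "Y", "X", "Y"],
    "grade":     [80.0, 84.0, 70.0, 74.0, np.nan, np.nan],
})

# One-hot encode the categorical inputs; drop_first avoids redundant columns.
features = pd.get_dummies(df[["course", "professor"]], drop_first=True).astype(float)

# Design matrix with an intercept column in front of the encoded features.
X = np.column_stack([np.ones(len(df)), features.to_numpy()])
y = df["grade"].to_numpy()

# Fit only on rows where the grade is known.
known = df["grade"].notna().to_numpy()
coef, *_ = np.linalg.lstsq(X[known], y[known], rcond=None)

# Predict the missing grades from the other columns and fill them in.
df.loc[~known, "grade"] = X[~known] @ coef

print(df)
```

In this toy data the known grades follow an exact pattern, so the model fills the missing (A, X) row with roughly 80 and the missing (B, Y) row with roughly 74. Real data won't fit this cleanly, and the extra fitting step is where the added time and processing cost comes from.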

Now the last option that we could choose from is simply ignoring missing data. And while this isn't generally the preferred method, sometimes it really does make sense to just cut your losses. This method is usually used when a class label is missing, or when a bunch of the attributes are missing from a row. The downside to this method is that you reduce your sample size every time you ignore a row. If you're already working with a massive data frame, it might not be that big of a deal. But if the data frame is already on the small side, shrinking it more might allow outliers to have a larger influence, which will increase the margin of error in your results.
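Both cases can be sketched with pandas' `dropna`. The data frame below is hypothetical, with `passed` standing in for the class label, and the threshold of three non-missing values is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "passed" plays the role of the class label.
df = pd.DataFrame({
    "course":    ["Biometry", "Genetics", np.nan, "Biometry"],
    "professor": ["Lee", "Kim", np.nan, "Lee"],
    "avg_grade": [78.0, 85.0, np.nan, 81.0],
    "passed":    ["yes", np.nan, "no", "yes"],
})

# Case 1: drop rows where the class label is missing.
labeled = df.dropna(subset=["passed"])

# Case 2: drop rows that have fewer than 3 non-missing values.
mostly_complete = df.dropna(thresh=3)

# Applying both rules, one after the other.
cleaned = df.dropna(subset=["passed"]).dropna(thresh=3)

print(len(df), len(labeled), len(mostly_complete), len(cleaned))
```

Starting from four rows, each rule on its own keeps three, and applying both keeps only two. That's the sample-size cost in miniature, and on a small data frame it adds up quickly.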

Well, that brings us to the end of this guide. As you can see, there are plenty of available options when you want to manage missing values in your data frame. And as I've said, and I'll probably say again, there is no clear-cut way of handling it. It comes down to which method works best for your data and gives you the most effective result.