Over the next few guides, we're going to spend some time talking about script development for data cleansing, as well as a couple of ways to easily find the tools we can use to complete our projects.
Overall, data cleansing is a fairly straightforward concept that can be viewed as a two-step process. The first step is to detect any missing, incorrect or irrelevant data. The second step is to simply replace or remove the bad data. Even though the concept is straightforward, like most things in machine learning, the actual process is a bit more convoluted. That's because datasets are rarely the same, which also means there's no cookie-cutter solution. So the complexity of the cleansing process will most likely depend on the field you're working in or the project you're working on.
Now, when I started working in the private sector, it was a pretty eye-opening experience. I'd say most companies did a pretty good job with their data collection, but what was really surprising to me was all the different ways people came up with to actually mess up their data. I expected mistakes like missing data or decimal errors because those things happen. What I didn't expect was data manipulation. I don't want it to sound like it happened all the time, but I was always a little taken aback whenever I found out it was happening. I'll also add that not a single person did it maliciously or to be deceitful. Most of the time, they either thought the data was irrelevant, so they deleted it, or the data was missing or looked like it was recorded incorrectly, so they changed it.
So right now, consider this my official warning: never, under any circumstance, manipulate raw data, even if you know for certain it's garbage. It's one thing if you don't know any better, like in the cases with my clients, but in my eyes, one of the most unethical things a person can do is knowingly alter data to align with a desired result. A result should never dictate what data is collected or how you collect it. I know this is a lot easier said than done, but to the best of your ability, the conclusion should always be derived from the data, even if that means your hypothesis is completely wrong and you have to start from scratch.
Just to be clear, I'm specifically referring to the raw or original data. Rarely, if ever, will the data be so perfect and precise that you don't have to remove, ignore or do some sort of data manipulation when you're actually building your model or training an algorithm.
Now, before we move on to the next guide, let's take a gander at what some missing data looks like and some of the steps you can take when you get a new DataFrame. The first thing we're going to do is import NumPy, Pandas and the CSV file. Once you're done with that, go over to the console and pass in df.size. This DataFrame isn't very big, so if we wanted to, we could easily call it directly and look at all of the actual data. And you might have noticed already, but in the last column, there are a few elements with NaN values. NaN is short for Not a Number, and it's used to represent missing data in Pandas and NumPy. For right now, I'm just going to show you a couple of ways to search for NaN values. But in an upcoming guide, we'll talk about different ways of actually replacing NaN values. Then further down the line, we're going to talk about the effect NaN values can have on your results if you forget to take care of them during the pre-processing stage.
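If you want to follow along, here's a rough sketch of that first inspection step. The file path is just a placeholder, and the column layout is an assumption, so adjust both to match your own CSV.

```python
import numpy as np
import pandas as pd

# Placeholder path -- point this at your own CSV file
df = pd.read_csv('YOUR FILE PATH')

# Total number of elements (rows x columns)
print(df.size)

# The DataFrame is small, so we can print the whole thing and spot the NaN values
print(df)
```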
The first way I'll show you is by passing in df.isna, followed by parentheses. What we get back is a DataFrame where False is returned in place of every non-NaN value and True for every NaN value. A method almost identical to isna is notna. This returns a DataFrame where True is in place of every non-NaN value and False for every NaN value. Then, using either of those methods, you could iterate through the data and return the index locations of all of the NaN values.
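As a rough sketch, that iteration could look something like the snippet below; exactly which columns hold NaNs will depend on your own DataFrame.

```python
# True wherever a value is missing, False everywhere else
mask = df.isna()
print(mask)

# The inverse: True for every non-NaN value
print(df.notna())

# Collect the (row, column) position of every NaN value in the DataFrame
nan_positions = [(row, col) for col in mask.columns
                 for row in mask.index[mask[col]]]
print(nan_positions)
```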
This last method is the one I prefer, and it comes in handy when you're working with larger DataFrames where it's impractical to manually scroll through everything. For this, we're going to start by converting the Pandas DataFrame into a NumPy array. We can either convert the whole thing into a matrix or just use the average grade column. Personally, I always segment the data before making any modifications. And you're going to hear me say this a lot, but it's incredibly important to create your own repeatable routine. By sticking to the same routine, I can almost guarantee you'll mitigate most possible mistakes. But again, this will be different for everyone. So find what works best for you and stick with that.
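Just as a sketch, and assuming the dependent column is named something like average_grade, those two conversion options look like this:

```python
# Convert the entire DataFrame into one 2-D NumPy matrix
full_matrix = df.to_numpy()

# Or pull out just the dependent column -- 'average_grade' is an assumed name,
# so match it to whatever the column is actually called in your DataFrame
y = df['average_grade'].to_numpy()
```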
When I'm working on my own project, I like to start by establishing all of my objectives. And I do that before I even look at any of the data. I want to know all of the specific goals of the project and what needs to get done. Then I'll go through all of the relevant data and organize it based on a bunch of different factors that we'll talk about later. Then I'll segment the features from the dependent variables. Once I have everything organized, I'll go in and actually start to clean it all up.
Now that I have the data segmented, we're going to use a NumPy function called argwhere. This returns an array with the index location of every nonzero element, but we're going to leverage it to only return the locations of the Not a Number elements. To do that, we're going to create a nan_location variable and set it equal to np.argwhere. Inside that, we'll pass in np.isnan, followed by parentheses, and inside those we'll pass the array we want to look through, which in this case is going to be lowercase y to represent the dependent variable array.
Let's go ahead and run this and then move back over into the console. And let's take another look at the dependent variable DataFrame. According to this, if we did everything correctly, the argwhere function should return an array containing the index locations 1, 7, 14, and 17. So when we pass that in, it looks like everything turned out the way we expected.
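Put together, that check looks something like the snippet below; the printed index locations assume the same sample DataFrame used throughout this guide.

```python
# Only the positions where np.isnan returns True come back from argwhere
nan_location = np.argwhere(np.isnan(y))
print(nan_location)

# With this guide's sample data, the output should look like:
# [[ 1]
#  [ 7]
#  [14]
#  [17]]
```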
Now that's going to wrap things up for this guide but I do want to reiterate one last time the importance of establishing a system that works best for you. When you first start, you'll probably have an overly methodical approach. And that's because this is definitely a skill that evolves over time. And I promise, if you stay patient and follow your steps, success is almost guaranteed.
Code
```python
import numpy as np
import pandas as pd

df = pd.read_csv('YOUR FILE PATH')

# Split the feature columns from the dependent variable column
x = df.iloc[:, :4].to_numpy()
y = df.iloc[:, 5].to_numpy()

# Index locations of every NaN element in the dependent variable array
nan_location = np.argwhere(np.isnan(y))
```