Upcasting in Machine Learning Projects

In this guide, we'll be discussing a concept known as upcasting and how it works in NumPy. If you're unfamiliar with casting, it's really just converting a value from one data type to another. There are two types of casting: upcasting and downcasting. The difference between the two is that upcasting converts to a data type with a higher level of precision, while downcasting converts to a smaller data type and can cause you to lose precision.

In both NumPy and Python, upcasting is done implicitly. So let's take a quick look at an example. We'll start with an array containing five integers and a second array with five float elements. Now, let's run this and check the Variable Explorer pane. As expected, we have one float64 array and one int64 array.
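
Since the screenshot isn't reproduced here, the following is a minimal sketch of what those two arrays might look like. The variable names and values are assumptions, not the ones used in the recording, and the default integer type can be int32 instead of int64 on some platforms.

```python
import numpy as np

# Hypothetical names and values for illustration.
int_array = np.array([1, 2, 3, 4, 5])
float_array = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

# The integer array gets the platform's default integer type (commonly int64).
print(int_array.dtype)
# The float array defaults to float64.
print(float_array.dtype)
```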

But now let's take a look at what happens if we join the two together. To do that, we're going to use the concatenate function. And when that's ready to go, we'll go ahead and run it. The first two arrays that we created kept their original data types, but the new array has a data type of float64. And if you look closely, all the elements that were once integers have been upcast to floats. And again, that's because NumPy upcasts implicitly. So unless instructed otherwise, the result will always take on the higher-level data type of the two.
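
Here's a rough sketch of that concatenation step, again with made-up names and values. It also calls np.result_type, which isn't used in the guide, just to show which data type NumPy would pick before actually joining anything.

```python
import numpy as np

# Hypothetical arrays standing in for the ones in the guide.
int_array = np.array([1, 2, 3, 4, 5])
float_array = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

# Joining an integer array with a float array upcasts the integers.
combined = np.concatenate((int_array, float_array))
print(combined.dtype)  # float64

# The original arrays are left untouched.
print(int_array.dtype, float_array.dtype)

# result_type reports the data type NumPy would promote to.
print(np.result_type(int_array, float_array))  # float64
```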

As you might've expected, upcasting can also be done explicitly, and I'll show you a couple of different ways of doing that. We'll start by creating another array, and this time the data type will be an unsigned integer. To change the data type, or explicitly upcast, we can use NumPy's astype method. But this time, instead of creating a new variable, let's just do a reassignment. After the variable name, we'll call astype and pass in a string with the data type that we want to convert to. When we run it again, you'll see the array is now a float64.
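
A minimal sketch of the explicit version might look like this. The guide only says "unsigned integer", so uint8 and the variable name here are assumptions.

```python
import numpy as np

# Assumed starting type: an 8-bit unsigned integer.
unsigned_array = np.array([1, 2, 3, 4, 5], dtype=np.uint8)
print(unsigned_array.dtype)  # uint8

# astype returns a new array, so we reassign it to the same name.
unsigned_array = unsigned_array.astype('float64')
print(unsigned_array.dtype)  # float64
```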

Now, this last method comes with a bit of a caveat, and as we move through the example, I'll explain what I mean. In one of our very first guides, we discussed how to create a matrix by using nested arrays. Here, we're going to be doing something very similar, but with a small change.

So let's start by creating a new array and we'll call it structured_array. Then we'll call the array function from NumPy and pass in our parens and square brackets like we would normally do for an array. In the past when we nested an array, we would pass in another set of square brackets, but this time we're going to use regular parentheses in place of the brackets.

And this might not seem like it's a big deal, but there actually is a subtle difference. By passing in the parentheses in place of the square brackets, we are actually nesting a tuple, which NumPy recognizes as a sequence. And so by doing it this way, we're going to be creating a matrix with nested tuples instead of nested lists. And each tuple is going to have three fields: the name of the professor, how many students were in their class, and the average class grade.
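
As a rough sketch of this step, with invented professors and numbers, the array could be built like this. Because no data type is given yet, NumPy stores every element as a string, and the exact string width it reports can differ from the one in the guide depending on the values and the NumPy version.

```python
import numpy as np

# Hypothetical rows: (professor, class size, average grade).
structured_array = np.array([
    ('Smith', 30, 85.5),
    ('Jones', 25, 90.25),
    ('Lee', 40, 78.0),
])

# Without a dtype, the mixed values are all converted to Unicode strings.
print(structured_array.dtype.kind)  # 'U'
print(structured_array.shape)       # (3, 3)
```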

After all of that is passed in, we can run it and take a look at the data type. So U5 stands for a Unicode string with a length of five. This basically means that the data type is a string and the longest string is five characters long. Oh, and I forgot to mention this in the other guide, but do you see that little less-than sign that looks like a sideways caret? Well, that references something called little-endian, which indicates the byte order of the data type object.
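
To see where the U5 comes from, here's a small sketch using an array that contains only strings. The names are made up, and the leading character of the dtype string depends on the machine's byte order.

```python
import sys
import numpy as np

# Hypothetical names; the longest one is five characters.
names = np.array(['Smith', 'Jones', 'Lee'])

# dtype.str combines byte order, kind, and length, e.g. '<U5'.
print(names.dtype.str)
print(names.dtype.kind)      # 'U' for Unicode
print(names.dtype.itemsize)  # 20 bytes: 5 characters at 4 bytes each

# The byte order of the machine running the code.
print(sys.byteorder)
```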

I grabbed this image off Wikipedia to help with the explanation. So if a dtype's byte order is little-endian, that means it stores the least significant byte at the smallest address and the most significant byte at the largest. In contrast, a big-endian system stores the most significant byte at the smallest memory address and the least significant byte at the largest.
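
One way to see the two byte orders side by side is to store the same number with each one and look at the raw bytes. This is only a sketch; the value 0x0A0B0C0D is an arbitrary choice for illustration.

```python
import numpy as np

value = 0x0A0B0C0D  # four distinct bytes: 0A, 0B, 0C, 0D

# '<u4' is a little-endian 32-bit unsigned integer, '>u4' is big-endian.
little = np.array([value], dtype='<u4').tobytes()
big = np.array([value], dtype='>u4').tobytes()

# Little-endian puts the least significant byte (0D) first.
print(little.hex())  # 0d0c0b0a
# Big-endian puts the most significant byte (0A) first.
print(big.hex())     # 0a0b0c0d
```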

Now, if you remember, when we talked about row major order, the example that I showed you was big-endian where the most significant byte was stored in the smallest memory address. If it was little-endian, it would look something like this with the least significant byte at the smallest memory address.

But anyway, to get back on track, we obviously don't want to keep all of the elements as strings. So what we're going to do is assign a title and data type to each field, and we'll do that by creating our own custom data type object. Let's name the data type object structured_dtype. Then inside the list, we're going to nest three sequences, which are essentially tuples. And each sequence will contain its own field title and data type.

So for the first sequence, we're going to pass in professor and U10, which is a Unicode string of up to ten characters. The title for the second sequence will be class size and the data type is going to be an unsigned integer. Then in the last sequence, for the field title we'll pass in average grade, and the data type is going to be a float. Now, let's go back to the original array and we're going to pass in a comma after the closing bracket, just before the closing parenthesis. Then we'll finish that up with dtype equals structured_dtype.
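
Put together, the custom data type and the updated array might look something like the sketch below. The rows are invented, the field names use underscores for convenience, and the 'u2' and 'f8' widths are assumptions: they were picked because, together with U10, they add up to 50 bytes (400 bits), which lines up with the void400 reported in the next step.

```python
import numpy as np

# Each tuple is (field title, data type). Widths are assumptions.
structured_dtype = [
    ('professor', 'U10'),     # Unicode string, up to 10 characters
    ('class_size', 'u2'),     # 16-bit unsigned integer
    ('average_grade', 'f8'),  # 64-bit float
]

# Hypothetical rows, one tuple per professor.
structured_array = np.array([
    ('Smith', 30, 85.5),
    ('Jones', 25, 90.25),
    ('Lee', 40, 78.0),
], dtype=structured_dtype)

print(structured_array.dtype.names)     # the three field titles
print(structured_array.dtype.itemsize)  # bytes per row
print(structured_array.shape)           # one dimension: one entry per row
```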

Now, run this one last time and take a look at the variable pane. The data type returned is a void400, which in NumPy is essentially a flexible data type that allows us to handle fields with different data types. One of the really big benefits of a structured array is that it gives us a little bit more flexibility to extract information, so we can either do it by index or by field. So we can pass in index zero to return a record, which looks like a tuple and contains all the elements of the first row. Or we can pass in the professor field, and what that's going to do is return an array of all of the professors.
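
Here's a short sketch of both ways of pulling data out, using the same made-up rows and assumed field names and widths as before.

```python
import numpy as np

# Assumed field names and widths; hypothetical rows.
structured_dtype = [
    ('professor', 'U10'),
    ('class_size', 'u2'),
    ('average_grade', 'f8'),
]
structured_array = np.array([
    ('Smith', 30, 85.5),
    ('Jones', 25, 90.25),
    ('Lee', 40, 78.0),
], dtype=structured_dtype)

# By index: the whole first row, as a record that prints like a tuple.
first_row = structured_array[0]
print(first_row)

# By field: an array holding that field's value from every row.
professors = structured_array['professor']
print(professors)

# The two can be combined to reach a single value.
print(structured_array[0]['average_grade'])  # 85.5
```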

There's a pretty decent chance that most of you won't be working extensively with structured arrays. And that's because they're really meant for interfacing with C code and for low-level manipulation. When you're working with tabular data, like a CSV file, Pandas is still the recommended tool because of its high-level interface.

And I think that about wraps things up for now. So I will see you in the next guide.