In this guide, we'll be discussing our final linear regression topic: polynomial regression. Unlike simple and multi-variable linear regression, polynomial regression fits a nonlinear relationship between the independent and dependent variables. Even with this nonlinear relationship, it's still classified as a special case of multi-variable regression. Some of the more common examples include population growth, half-life or decay rate, stock pricing, and even force fields like gravity.
Some of you might be thinking back to the guide where we covered the five assumptions of linear regression, because one of those assumptions was that there has to be a linear relationship between the input and the output. And just by looking at an example of polynomial regression, it's pretty obvious that the regression line isn't straight. It might be a little surprising, but what makes polynomial regression linear doesn't have anything to do with the feature variable. What makes it linear are the parameter coefficients. Because all of the parameters remain linear, we consider the function to be a special case of multi-variable regression. And since it is a special case, the equation of a polynomial regression line is very similar to that of a multi-variable regression line.
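To make that concrete, here is the standard textbook form of a degree-n polynomial model (general notation, not anything specific to this course's slides):

```latex
% Degree-n polynomial regression model:
y = b_0 + b_1 x + b_2 x^2 + \dots + b_n x^n
% Substituting x_1 = x, x_2 = x^2, ..., x_n = x^n gives the familiar
% multi-variable form, which is still linear in the coefficients b_0, ..., b_n:
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
```

The curve bends because of the powers of x, but the model stays linear in the b coefficients, which is the whole reason it counts as a special case of multi-variable regression.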
In the previous guide, we discovered that exam one had the most significant impact on the model, when trying to predict a final exam score. So, in this guide I thought it would be a good idea to look at the relationship between exam one and homework. To get started, you'll need to import NumPy, pandas, matplotlib, the train and test split function, the linear regression function, and the same data frame as before. We're also going to introduce a new class named polynomial features, which gives us the ability to convert a linear function into a polynomial. Then, once you have everything imported, you can move forward with segmenting and assigning the featured and target variables, then finish by converting the two variables into a matrix, by utilizing the new axis or reshape function.
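Here's a minimal sketch of that setup. The CSV file name and the column names ('Homework Avg' and 'Exam 1') are placeholders, since the actual data frame comes from the earlier guides; swap in whatever your file and columns are actually called.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical file name -- use the same data frame as the previous guides.
df = pd.read_csv('student_exam_data.csv')

# Segment the feature and target variables, then reshape them into
# 2-D column matrices so scikit-learn will accept them.
feature = df['Homework Avg'].values[:, np.newaxis]   # or .reshape(-1, 1)
target = df['Exam 1'].values[:, np.newaxis]
```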
And as I said earlier, the feature variable will be homework average and the target variable is going to be exam one. Relating this back to simple and multiple variable regression, we're going to follow very similar steps. We'll start by creating and defining our training and testing variables, then we'll allocate how much of the data we want to split between the training and testing sets. Normally at this point, we'd create the regressor and then fit the model. But instead, what we're going to do is create an object named poly one and assign it to PolynomialFeatures. Then inside the parentheses, we're going to pass in degree equals one, which means we're going to be converting the linear function into a first degree polynomial.
Next, we're going to create two more variables, poly one features train and poly one features test. Both of those will be assigned to the fit_transform function of PolynomialFeatures. By passing in the original training and testing sets, we're transforming both sets into polynomial form and then using them to create a model in polynomial form.
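Continuing from the setup sketch above, that sequence looks something like this. The 80/20 split and the variable names are my own assumptions, since the exact values aren't spelled out here.

```python
# Split the data, then convert the feature matrices into degree-1 polynomial form.
feature_train, feature_test, target_train, target_test = train_test_split(
    feature, target, test_size=0.2)          # test size is an assumption

poly_one = PolynomialFeatures(degree=1)
poly_one_features_train = poly_one.fit_transform(feature_train)
poly_one_features_test = poly_one.fit_transform(feature_test)
```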
Now that we're done transforming and fitting the polynomial function, we're going to continue as if this were a normal linear regression model. So, we're going to build our regressor and then fit the model. As a side note, if you're wondering why we didn't have to transform the target variable, it's because unlike the inputs, which all belong to some continuous function, the outputs are discrete values that depend on that function. So, there isn't really anything to transform.
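Continuing the sketch, the regressor is built and fit on the transformed training data, just like an ordinary linear regression:

```python
# Build the regressor and fit it on the degree-1 polynomial training features.
regressor = LinearRegression()
regressor.fit(poly_one_features_train, target_train)
```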
Anyway, to continue: in this example we're actually going to be graphing the result, so we'll need to build the regression line. To do that, we'll create an array object by assigning it to the regressor's predict function and passing in the polynomial test set. The last thing we'll do here is get some quantitative information by making an r squared variable, which we'll assign to the score function, passing in the polynomial test set and the target test set. Now, there is one thing that I would like to do before we graph this. I'd like to go through what is actually going on with the regression line.
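In code, those two steps look roughly like this (continuing from the snippets above):

```python
# Build the regression line from the test set and grab an R-squared value.
prediction = regressor.predict(poly_one_features_test)
r_squared = regressor.score(poly_one_features_test, target_test)
print(r_squared)
```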
So, let's go ahead and run this and jump over to the variable pane. I'd like you to open the feature test object along with the polynomial feature test object. When we compare the two, they're almost identical with the exception of the column of ones in the polynomial set. And the reason they're so closely related is because first degree polynomials are still linear.
Now, let's look at this by using an example. When the regression line is in linear form, b-naught represents the y intercept, b sub one is the first parameter coefficient, and x sub one is the first input. In polynomial form, it's very similar. That extra column of ones is basically acting as a placeholder that doesn't change b-naught when they're multiplied together. b sub one is still the first parameter coefficient, and x sub one is raised to the first degree, which is basically the same as before. When we put all of this together, it really means that the linear function will produce the same result as a first degree polynomial. This might not seem important now, but it's going to make a lot more sense as we move on.
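You can see that placeholder column of ones for yourself with a standalone toy example (the numbers below are made up, not the course data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[82.0], [74.0], [91.0]])        # made-up homework averages
print(PolynomialFeatures(degree=1).fit_transform(x))
# [[ 1. 82.]
#  [ 1. 74.]
#  [ 1. 91.]]
```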
So, going back to our code, I've already added in everything that we're going to need to graph our result. When we run this, you'll see that the regression line is straight, but it's a pretty poor fit. And if I run it a couple more times, you'll notice the r squared value ranges anywhere from point six to point seven. The reason the coefficient of determination is so low is that we've underfit the model.
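The plotting code I added is roughly the following; the labels and colors are my own choices rather than the exact ones on screen.

```python
# Scatter the raw test data and draw the degree-1 regression line over it.
plt.scatter(feature_test, target_test, label='Test data')
plt.plot(feature_test, prediction, color='red', label='Degree 1')
plt.xlabel('Homework Avg')
plt.ylabel('Exam 1')
plt.legend()
plt.show()
```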
To avoid underfitting, we need to create a higher order equation, which adds complexity to the model. And to do that, we're going to use polynomial features again, to transform feature train and test into a cubic polynomial. To make this a little bit quicker, I'm just going to copy and paste everything and then replace every poly one with poly three. And most importantly, inside polynomial features, we're going to pass in degree equals three.
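If you're following along, the copy-pasted version ends up looking roughly like this, continuing from the earlier snippets (the regressor name is my own suggestion):

```python
# Same pattern as degree 1, with every poly_one renamed to poly_three
# and the degree bumped up to 3.
poly_three = PolynomialFeatures(degree=3)
poly_three_features_train = poly_three.fit_transform(feature_train)
poly_three_features_test = poly_three.fit_transform(feature_test)

regressor_three = LinearRegression()
regressor_three.fit(poly_three_features_train, target_train)
prediction_three = regressor_three.predict(poly_three_features_test)
r_squared_three = regressor_three.score(poly_three_features_test, target_test)
```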
When we're done with that, let's run it again, and then open the poly three feature train object. And by looking at it, it appears as though everything worked, and our feature array was successfully transformed into a cubic polynomial. When we compare the three forms together, it is a little bit easier to see what is actually happening when we apply polynomial features. As we expand the function to higher degrees to produce a better fit, we maintain linearity between parameter coefficients.
Now, when we go back to our example and try to graph the cubic polynomial, we run into a bit of a problem. Instead of a nice smooth fit line, we get a mess. If you're having this issue, it's due to the feature array. When we use the plot function, the array is plotted sequentially, which means we need to sort the array in ascending order to get a nice uniform line. The quickest workaround is turning the regression line into a scatterplot. To me, that's an acceptable solution, because in the greater scheme of things, plotting the regression line isn't really that important. But I will say that it adds a nice visual, especially when you're presenting your work to an audience.
So, if you'd like to maintain the actual line, my recommended method involves NumPy's argsort function. Some of you might have already considered using the sort function, but the problem you'll run into is that you'll also need to map it to the corresponding target variable, or else your results will get all kinds of messed up. The argsort function gives us a quick and easy way of making sure the feature and target variables maintain their proper relationship.
Let's say this is our original feature and target array. If we were to simply sort the feature array, the original mapping wouldn't hold true. Instead, we can use argsort, which returns an array of index elements used to sort the original feature array. Notice how the argsort indices array gives the index locations that put the original array in ascending order.
Now, the last step is to use the indices array to sort the original feature and target arrays without losing their original mapping. To do that, we're just going to create a new variable and assign it to the original array, followed by square brackets and the argsort indices array. If we were to return both arrays, they'd look something like this: the feature array ends up sorted in ascending order, and the target array is sorted by index location in order to maintain the original mapping.
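Here's a small standalone example with made-up numbers that shows the same idea:

```python
import numpy as np

feature = np.array([88.0, 72.0, 95.0, 81.0])
target = np.array([91.0, 70.0, 98.0, 85.0])

indices = np.argsort(feature)       # -> [1, 3, 0, 2]
print(feature[indices])             # [72. 81. 88. 95.]
print(target[indices])              # [70. 85. 91. 98.]  -- pairs stay matched
```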
Applying this to our data, let's go ahead and take a look at what exactly we'll need to do. Since we're only using the feature test set to make the regression line, we won't really have to do anything to the training data. So, let's first create the argsort indices object and assign it to the argsort function. Then inside the parentheses we'll pass in the feature test array, followed by the axis we want to sort along. For that, we can use the variable explorer to check, and in this case it's zero. The next step is to either create a new object or do a reassignment. To make this a little bit easier, we'll just do a reassignment by passing in feature test, followed by square brackets and the argsort indices array. Then we'll do the exact same thing for the target test object.
Once that's all done, we can go ahead and run it, and it looks like we messed something up. Oh, I know what it is that we need to do. If you check the variable explorer and take a look at the feature and target test sets, you'll see that when we used the argsort function, it tacked on an extra dimension. So, we'll need to go back and use NumPy's squeeze function, which is kind of the inverse of the newaxis function. It allows us to eliminate every dimension that's not being used, or we can choose which dimensions we want to get rid of. For this, let's just get rid of the last dimension that was added on. Then we can go ahead and do the exact same thing for the target object. We'll run it again, and you can see in the variable explorer pane that both array objects had the third dimension removed. Now, let's move down and check the plot pane, and it looks like everything worked perfectly.
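Put together, the sorting and squeezing step looks something like this, continuing from the earlier sketches (feature_test and target_test are (n, 1) column matrices at this point):

```python
# Sort the test arrays along axis 0 so the regression line plots smoothly.
argsort_indices = np.argsort(feature_test, axis=0)

feature_test = feature_test[argsort_indices]
target_test = target_test[argsort_indices]

# Indexing an (n, 1) array with an (n, 1) index array tacks on an extra
# dimension, leaving (n, 1, 1) arrays -- squeeze the last axis back out.
feature_test = np.squeeze(feature_test, axis=2)
target_test = np.squeeze(target_test, axis=2)
```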
Now, I know this has already been a fairly long guide, but to finish off, we're going to make one more regression line. But this time, we're going to be making a quintic polynomial. So, go ahead and copy all of this again, then paste it below, and just like before, we're going to change the degree from one to five, along with every other one. To plot it, I'm going to copy everything there, and then change the threes to fives, then update the label and then change the color to firebrick. We'll run it one last time, and there you have it.
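As a sketch, the quintic version follows the exact same pattern as before; everything except the degree and the firebrick color is my own naming.

```python
# Degree-5 (quintic) polynomial regression, same steps as degrees 1 and 3.
poly_five = PolynomialFeatures(degree=5)
poly_five_features_train = poly_five.fit_transform(feature_train)
poly_five_features_test = poly_five.fit_transform(feature_test)

regressor_five = LinearRegression()
regressor_five.fit(poly_five_features_train, target_train)
prediction_five = regressor_five.predict(poly_five_features_test)

plt.scatter(feature_test, target_test, label='Test data')
plt.plot(feature_test, prediction_five, color='firebrick', label='Degree 5')
plt.legend()
plt.show()
```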
Now, in theory, we could keep increasing the degree of the polynomial to account for higher levels of variation in the data. But there does come a point where we begin to overfit the model. We're going to be talking about overfitting a lot more later on in this course, but the main problem with overfitting is that it potentially gives too much significance to the outliers, which will eventually decrease the predictive accuracy of the model. All in all, polynomial regression is a great tool because it accounts for a wide range of curves and provides us with one of the best approximations for the relationship between dependent and feature variables.
And with that, we've finally reached the end of this guide, so I will see you in the next one.