In this guide, we're going to expand on simple linear regression and get into multiple variable regression. As you'd probably expect from the name, multiple regression attempts to explain a dependent variable by using more than one feature variable. In the last guide, we looked at the linear relationship between final exam scores and course grades. So in this guide, I thought we should expand on that a little bit and analyze the effect that different feature variables have on final exam scores.
We didn't talk about it much in the last guide, but the equation we used for the regression line was in slope-intercept form, which generally isn't the format used in machine learning. There isn't a single established standard, but you'll typically see people use y for the output, x for the inputs, and either beta, theta, or omega for the parameters.
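Using that convention, a model with three inputs (like the one we're about to build) can be written out as follows, here with beta for the parameters:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
```

where the intercept is the beta with subscript zero and the remaining betas are the parameter coefficients attached to each input.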
In terms of the model, we already know what the inputs and output are, but the parameters are just as important, because they're what get adjusted during training to give the regression line its best overall fit. Say we have some random data set and the regression line has a parameter coefficient of one; to fit the line a little better, we might need to adjust that parameter to something closer to two. Another important characteristic of a parameter coefficient is that it provides insight into the relationship between the feature variable it's associated with and the output. I'll expand on that as we work through the example.
So to kick this off, we're going to need to import numpy, pandas, matplotlib, the train-test splitter, the linear regression function, and the same data set that we used in our last example. Once the data frame is imported, we can move forward with segmenting the data. The feature variables in this example will be exams one, two, and three, and the target variable is the final exam.
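A minimal sketch of the imports and segmentation step might look like this. Since the guide's actual CSV isn't reproduced here, the sketch synthesizes a comparable data set, and the column names (`exam_1`, `exam_2`, `exam_3`, `final_exam`) are assumptions rather than the guide's real headers:

```python
import numpy as np
import pandas as pd

# Stand-in for the guide's exam-scores CSV: synthesize 200 students whose
# final grade is roughly 0.4/0.3/0.2 weighted on the three exams plus noise.
rng = np.random.default_rng(42)
exams = rng.uniform(50, 100, size=(200, 3))
final = exams @ np.array([0.4, 0.3, 0.2]) + rng.normal(0, 2, size=200)

df = pd.DataFrame(exams, columns=["exam_1", "exam_2", "exam_3"])
df["final_exam"] = final

# Segment the data frame: three feature columns, one target column.
features = df[["exam_1", "exam_2", "exam_3"]]
target = df["final_exam"]
```

With a real CSV you would replace the synthesis with `pd.read_csv(...)` and keep the segmentation the same.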
If you remember back to the last guide, the first thing we need to do is create feature train and feature test subsets, then target train and target test subsets. We'll do that by assigning the result of the train-test splitter function, passing in the feature array, then the target array, and finally how much data we want allocated to each subset. Next we'll assign the regressor to the linear regression function, then use the fit function to create the regression model. The last thing we'll do is set up a few variables to help analyze the model: an r-squared variable to check how well the model actually represents the data, an intercept variable, and finally a parameter coefficient variable.
Let's go ahead and move over to the console and check out what's returned. The first thing we can check is the coefficient of determination; you should get something around 0.91 or 0.92, though this depends on how your system actually split the data. Going back to the last guide, we talked about how the r-squared value can range anywhere from zero to one and is used to measure how well the regression line approximates the real data. Because our value is pretty close to one, it suggests we have a fairly well-fit model.
And real quick, I'll give you another option you can use to determine the r-squared value, and this one comes from scikit-learn's metrics module. Inside this module, you'll see options for many different types of algorithms. I already imported the regression score function, so all we need to do is create a new variable, which we can call r2_score. From there, we'll call the r2_score function, followed by parentheses, and this part is a little different: instead of passing in the feature and target test sets, we pass in the target test set and compare it to the prediction set. Intuitively, what we're passing to this method actually makes a bit more sense to me, because we're comparing the actual results to the predicted results. Either way, when we run it, you should get something around 0.91 or 0.92.
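Here's a sketch of that alternative, again on synthetic stand-in data (an assumption in place of the guide's CSV). The key point is that `r2_score` takes the actual test targets and the predictions, and agrees with `regressor.score`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same kind of synthetic exam data as before.
rng = np.random.default_rng(42)
X = rng.uniform(50, 100, size=(200, 3))
y = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(0, 2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)

# r2_score compares actual test targets against predictions, which is
# equivalent to calling regressor.score(X_test, y_test).
predictions = regressor.predict(X_test)
score = r2_score(y_test, predictions)
```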
So moving on, we can print the intercept, which should be right around zero, and the parameter coefficients, which should be around 0.4, 0.3, and 0.2. Looking back at the simple regression example, we used the slope to describe the relationship between the input and the output. There we had a slope of one, meaning a one-to-one relationship between input and output: if the input is 80, the output should also be about 80. If the slope were something different, like two, there'd be a two-to-one relationship, meaning the output would be twice as big as the input, so an input of 80 would give an output of 160.
Carrying that concept over to multiple regression, the parameter coefficients indicate which variable has the largest influence on, or closest relationship to, the output. Basically, the variable with the largest parameter coefficient has the largest effect on the output (a comparison that's fair here because all three exams are graded on the same scale). Applying that to our data, exam one has the largest coefficient, exam two the second largest, and exam three the third. Knowing that, we can assume the most important predictor of the final grade is the first exam, and the least influential is exam three.
The last and most important thing we can do with the model is use the predict function to forecast a final grade. We could easily do this manually by plugging in the exam grades and calculating by hand, but when we have a lot of data, the quicker way is to use the predict function. All we need to do is pass in a numpy array with three exam grades.
And if we run it now, I'm pretty sure we're going to get an error. Yep, and that's because we need to make sure the array is set up so there's only one row with three columns. So let's go back and fix that really quick. There are a couple of different ways to do this: the first is using the reshape function and passing in (1, -1) or (1, 3); the second is using the transpose function. Now when we run it, we get an estimated final grade.
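A sketch of that fix, with the model fit on the same synthetic stand-in data (the grades 85, 90, 80 are made-up inputs, not from the guide):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on synthetic stand-in exam data.
rng = np.random.default_rng(42)
X = rng.uniform(50, 100, size=(200, 3))
y = X @ np.array([0.4, 0.3, 0.2]) + rng.normal(0, 2, size=200)
regressor = LinearRegression().fit(X, y)

grades = np.array([85, 90, 80])

# Passing the 1-D array directly raises an error, because predict
# expects a 2-D array with one row per sample. Reshape it to one row
# of three columns; reshape(1, -1) and reshape(1, 3) are equivalent here.
prediction = regressor.predict(grades.reshape(1, -1))
```

With the 0.4/0.3/0.2 weights baked into the synthetic data, the estimate lands in the high 70s.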
And that about wraps the multiple regression guide up, so I will see you in the next guide.