Simple Linear Regression

In this guide we're going to be covering the basics of simple linear regression and then use Scikit-Learn to build a regression model. And before we start I do have a warning. There's a lot to cover in this guide so it's going to be a little bit longer than what we're used to.

Simple regression is the most basic form of linear regression, and I've even heard some people refer to it as the hello world of machine learning. The reason we consider this method simple is that it only analyzes the influence of a single feature variable on the output. I know we've already talked about a few of the common examples of simple linear regression, but those include height and weight, job experience and annual salary, or square footage and house price, and obviously a whole bunch more.

Now before we get into the actual code, let's do a quick rehash of how algorithms and modeling works. We're going to be starting with our training data. Then we'll feed the training data into an algorithm. And then from that we have a model that we can use to make predictions.

Now that we have that out of the way, let's take a look at an actual data frame. What we're going to do is use the dataset to see if there's a linear relationship between students' course grade and their final exam score. And to do that, we're going to need to import NumPy, pandas, matplotlib and the data frame. Then we'll need to segment the data frame between feature and dependent variables. In this example the only feature is final exam grade and the target variable is course grade. Then once that's done we're going to finish by using matplotlib to create a scatter plot.

If you want to go ahead and give this a try on your own, this would be a great time to pause the video.

To start this off I'm going to import the libraries and the data frame. Next I'll move over into the console and check the shape. And it looks like it's a pretty decent size. And then depending on the function that we decide to segment with, either loc or iloc, I'll get the column title or index location that we'll need.
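
Here's a rough sketch of what that setup might look like. The CSV file name is just a placeholder, since the transcript doesn't show the actual file:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the data frame -- swap in the actual file the course provides
    df = pd.read_csv('student_grades.csv')

    # Quick check in the console to see how many rows and columns we have
    print(df.shape)
    print(df.head())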

Now moving back over, let's make the feature variable first. I could be wrong, but I don't think I've used the regular loc function in any of our examples yet, so this time I'll pass in a colon to access all the rows, followed by the column label. Then we can move down and make our target variable, and this time we'll use the course grade column.

Before we do any plotting, let's run this just to make sure everything was done correctly. I'm adding this step in just to make sure they both have the right label name and length. So according to this, it looks like we are good to go.
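
Something along these lines should work for the segmentation step. The column labels below are assumptions, so use whatever labels the actual data frame has:

    # Feature: final exam score, accessed with loc (all rows, one label)
    final_exam = df.loc[:, 'final_exam']

    # Target: course grade
    course_grade = df.loc[:, 'course_grade']

    # Confirm both Series have the expected name and length
    print(final_exam.name, len(final_exam))
    print(course_grade.name, len(course_grade))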

I don't think we've built a scatter plot yet, but these are really simple. The first thing we're going to do is call plt.scatter, then pass in the feature variable followed by the target variable. And matplotlib gives us a whole bunch of colors to choose from, so I'll go ahead and choose royal blue this time. To finish up I'm going to add the title and the axis labels. Oh, and actually I'm going to add one more piece, but this is really just a personal preference of mine. I really dislike the way it looks with the top and right-hand spines, so I'm going to hide those.
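
As a sketch, the plotting code could look like this, with the spine removal done through the current axes:

    # Scatter plot of the feature against the target
    plt.scatter(final_exam, course_grade, color='royalblue')
    plt.title('Course Grade vs. Final Exam Score')
    plt.xlabel('Final Exam Score')
    plt.ylabel('Course Grade')

    # Personal preference: hide the top and right-hand spines
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.show()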

Just by looking at the data, it's pretty obvious there's a linear relationship between course grades and final exam grades, but without a quantitative measurement, we can't really say with any level of certainty how strong the relationship really is. So that's where we can utilize Scikit-Learn and its linear regression estimator.

To do that we're going to have to do another import. The first is for a splitter function called train_test_split, which lets us split the feature and target arrays into train and test subsets. Then the second import is LinearRegression, the actual algorithm that we're going to use to create our model.

The first step to all of this is splitting the data into a testing and training subset so we'll make our four variables. And as a heads up, order does matter when you're setting all of this up because when we're assigning this to our splitter function, the first array that we pass in is mapped to the first two variables and the second array is mapped to the last two.

The next thing that we need to do is declare how much data is distributed to each group. If you remember back, typically we want 80% of the data dedicated to training and the other 20% for testing. It doesn't really matter which way you choose, but you can either pass in train_size=0.8 or test_size=0.2. Then by default the split function applies a random shuffle to the data to make sure you're not training with the same data every time, but there is also a random_state parameter that seeds the shuffle so the split is reproducible.
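
Putting those pieces together, the split might look like this. The random_state value here is arbitrary; any fixed number just makes the split reproducible:

    from sklearn.model_selection import train_test_split

    # First array maps to the first two variables, second array to the last two
    final_exam_train, final_exam_test, course_grade_train, course_grade_test = train_test_split(
        final_exam, course_grade, test_size=0.2, random_state=42)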

And now that we have the data properly split, we can move forward with actually building the linear regression model. For this we're going to create a regressor object by instantiating the LinearRegression class. Next we're going to use the regressor object to call the fit function, then pass in the training feature and target arrays we're going to use to train the linear regression algorithm.
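
Assuming the variable names from the earlier snippets, fitting the model looks something like this. The reshape is there because Scikit-Learn expects a two-dimensional feature array:

    from sklearn.linear_model import LinearRegression

    # Create the regressor and train it on the training subset
    regressor = LinearRegression()
    regressor.fit(final_exam_train.values.reshape(-1, 1), course_grade_train)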

The last thing that we need to do before we can plot this again is to create an object to test the model. We can say course grade prediction is equal to the regressor's predict function. And because we're testing to see how accurate the regression line is, we want to use inputs we already have the outputs for. That way we can see how far off the predictions are from the actual data. So here we'll pass in the final exam test array. Unless I forgot something, we should have everything we need to graph the result.
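
Continuing with the same names, the prediction step is a single call:

    # Predict course grades for the held-out final exam scores
    course_grade_prediction = regressor.predict(final_exam_test.values.reshape(-1, 1))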

Now let's keep going with this and build out the regression line. To make the regression line, we just need to call the plot function from matplotlib, then pass in the final exam test array, and the y values are going to be the course grade prediction object that we just created. Then when we run it, we have all of our data plotted along with the regression line.
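
A sketch of that combined plot, reusing the scatter code from before:

    # Raw data plus the fitted regression line
    plt.scatter(final_exam, course_grade, color='royalblue')
    plt.plot(final_exam_test, course_grade_prediction, color='black')
    plt.title('Course Grade vs. Final Exam Score')
    plt.xlabel('Final Exam Score')
    plt.ylabel('Course Grade')
    plt.show()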

Now all of this is really great, but we still don't know how accurate the model actually is. So let's run through how we can get some quantitative information about the regression line. We'll be talking about this a lot more in some of the upcoming guides, but for now let's just run through some of the basics.

The first thing that we're going to need to do is retrieve the coefficient of determination, or R squared value. Using the regressor again, we're going to call the score function. Then inside the parentheses we'll pass in the final exam test array followed by the course grade test array. Let's really quickly move into the console and print the R squared value. And we're all going to get a different number back, but it should be something around 0.94 or 0.95.
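
In code, that looks something like this:

    # Coefficient of determination (R squared) on the test subset
    r_squared = regressor.score(final_exam_test.values.reshape(-1, 1), course_grade_test)
    print(r_squared)  # the guide reports a value around 0.94 or 0.95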

I don't really want to spend too much time on this because the interpretation and what an acceptable R squared value should be is really open for debate depending on what you're modeling. But for a model like this, R squared generally ranges from zero to one, and in theory, the closer the value gets to one, the more accurately the model represents the data. So if you were ever to get an R squared value of one, the model would predict the exact output every single time.

The two other pieces of information that may have value for us in the future are the y-intercept and the slope of the regression line. Starting with the slope, we're going to use the regressor again and access coef_, which is short for coefficient, followed by a trailing underscore. Then to get the y-intercept, we access intercept_, again with a trailing underscore. Then we can go ahead and check those in the console. So the slope is right around one and the regression line crosses the y-axis at about zero.
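
Both values come straight off the fitted regressor:

    # Slope and y-intercept of the fitted regression line
    print(regressor.coef_)       # slope -- right around 1 for this data
    print(regressor.intercept_)  # y-intercept -- close to 0 for this data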

The last thing I want to cover in the guide is how to actually generate new predictions using the linear model. And like the others this is pretty easy. All we have to do is call the regressor's predict function, and based on the documentation, we need to pass in a two-dimensional array. So let's go ahead and do that, and we'll use the final exam scores of 70, 80 and 90. Whoops, I forgot that we need to reshape the array into a column first. Okay, there we go. And those are going to be our three course grade predictions based on the model.
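
A minimal sketch of that last step, using the same three scores:

    # predict expects a 2D array, so reshape the new inputs into a column
    new_scores = np.array([70, 80, 90]).reshape(-1, 1)
    print(regressor.predict(new_scores))  # three predicted course grades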

All righty, that finally wraps this guide up so I will see you in the next one.