The Five Assumptions of Linear Regression

In the last guide, I listed a few examples of when we can use linear regression. Unfortunately, linear regression doesn't work for every data set. In fact, for a linear regression algorithm to work properly, the data needs to meet five basic assumptions.

The first one we already talked about quite a bit in the last guide: there needs to be a linear relationship. Or, to be a little more technical, linearity is the property of a mathematical relationship that can be graphically represented as a straight line.

[Diagram: a linear relationship, data points falling along a straight line]
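If you want a quick visual check for linearity, a scatter plot with a fitted straight line usually does the job. Here's a minimal sketch in Python; the data is synthetic and just for illustration, and numpy and matplotlib are assumed to be installed:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a roughly linear relationship plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 100)

# Fit a straight line (degree-1 polynomial) with ordinary least squares.
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y, alpha=0.6, label="data")
plt.plot(np.sort(x), slope * np.sort(x) + intercept, color="red", label="fit line")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()
```

If the points curve away from the line in a systematic way, that's a sign the linearity assumption doesn't hold.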

The second assumption that needs to be met is that the data follows the pattern of homoscedasticity, which, when you break the word down, means "same variance." This one is a lot easier to explain with a diagram. To be homoscedastic, the data needs to look something like this: all of the data points stay roughly the same distance from the regression line along its entire length. This concept is important because, if the data isn't homoscedastic, data points that don't represent the data very well are weighted with the same importance as the points that can actually generate an accurate prediction.

[Diagram: homoscedastic data, points evenly scattered about the regression line along its whole length]
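One common way to eyeball this is a residual plot: graph each point's residual (its vertical distance from the fit line) against the predicted value and look for a band of constant width. A minimal sketch with synthetic data, again assuming numpy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 200)

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
residuals = y - predicted  # observed minus predicted

# Homoscedastic data forms a band of constant width around zero;
# a funnel shape (wider on one side) suggests heteroscedasticity.
plt.scatter(predicted, residuals, alpha=0.6)
plt.axhline(0, color="red")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.show()
```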

The third condition that needs to be met is multivariate normality. All this really means is that the distribution of Y is normal at every value of X along the entire regression line. So at one value of X, the Y values are normally distributed about the regression line; at another value of X, the values are again normally distributed about the same fit line. Along the entire fit line, all the Y values are normally distributed.

[Diagram: normal distributions of Y values centered on the regression line at several values of X]
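In practice, this is usually checked on the residuals, for example with a Q-Q plot or the Shapiro-Wilk test. Here's a minimal sketch using scipy (assumed installed), on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 200)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Q-Q plot: points hugging the diagonal suggest normal residuals.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test: a large p-value is consistent with normality.
statistic, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```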

The fourth assumption is the statistical independence of the errors. In statistics, an error is the difference between a predicted value and an observed value, so in linear regression it's how far a data point is from the regression line. If all the errors are independent of each other, the data will look randomly scattered along the entire regression line. But if the errors are correlated in some way, the points drift in a pattern instead, and if you compare a regression line fit on correlated errors with one fit on independent errors, you can easily see how the correlation affects the outcome.
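A standard numeric check for correlated errors, especially when the observations have a natural order like time, is the Durbin-Watson statistic. A minimal sketch using statsmodels (assumed installed), with synthetic data:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.5, 200)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)  # the errors, in observation order

# Durbin-Watson ranges from 0 to 4: values near 2 suggest independent
# errors, near 0 positive correlation, near 4 negative correlation.
print(f"Durbin-Watson statistic: {durbin_watson(residuals):.2f}")
```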

The last assumption is that the data lacks multicollinearity. Multicollinearity occurs when the independent variables in a regression model are correlated with each other. That correlation is a problem because, by definition, independent variables should be independent. If the degree of correlation between variables is high enough, it can cause some pretty big problems when you try to fit the model and interpret the results.
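A common diagnostic here is the variance inflation factor (VIF): one number per independent variable, with values above roughly 5 to 10 usually treated as a warning sign. This sketch builds a deliberately correlated feature so you can see the effect; the variable names and data are made up, and pandas and statsmodels are assumed to be installed:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(0, 1, n)
x2 = 0.9 * x1 + rng.normal(0, 0.3, n)  # deliberately correlated with x1
x3 = rng.normal(0, 1, n)               # independent of the others

X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per column; high values flag multicollinearity.
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept's VIF isn't meaningful
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.2f}")
```

Running this, x1 and x2 should show noticeably higher VIFs than x3, reflecting the correlation we built in.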

And with that, we've reached the end of this guide. I'll see you in the next one.