Key Assumptions in Linear Regression

Htoo Latt
5 min read · Nov 12, 2020

When building a linear regression model, there are several assumptions we have to make. We can only rely on the outputs of a regression model if these assumptions hold, so it is necessary to check them. If they do not hold, we might be misled into drawing the wrong conclusions from unreliable regression outputs. In this blog post, I would like to go over several assumptions that need to be made for linear regression and the problems that arise when they do not hold.

The key assumptions

1. Linear Relationship between target and features

2. Homoscedasticity

3. Normal Distribution of Residuals

4. No or little multicollinearity (The variables are independent of each other.)

Assumption 1 — Linearity

The first assumption is linearity, meaning that the target and the features must have a linear relationship. If you have chosen linear regression as your model, you are already assuming there is a linear relationship and trying to build a model for it.

Ways to detect

  1. Scatter plot
  2. Residual plots

The figure above shows four different scatterplots with different levels of linearity and correlation between the two variables. The first graph shows a strong positive linear relationship between the two variables; you can easily imagine where the best-fit line would go. In the second figure, the relationship is negative and weaker than in the first. In the third figure, it is hard to identify a relationship at all, and it is extremely difficult to imagine where the best-fit line would go. The fourth figure shows a strong non-linear relationship, most likely a polynomial one.

When plotting a residual plot, we want to see a nice, even spread to confirm that there is a linear relationship. A residual plot is also useful for checking several of the other assumptions.
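Below is a minimal sketch of such a residual plot, using made-up data and an OLS fit from statsmodels (the data and the `results` name are purely illustrative assumptions, not from any particular dataset):

# A rough sketch of a residual plot for an OLS fit on made-up, roughly linear data
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 * x + rng.normal(size=100)       # roughly linear relationship with noise

X = sm.add_constant(x)                 # add an intercept column
results = sm.OLS(y, X).fit()

plt.scatter(results.fittedvalues, results.resid, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')   # residuals should scatter evenly around zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

If the relationship really is linear, the points should scatter evenly around the zero line with no obvious curve or pattern.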

Solutions

We can deal with this problem by transforming the data, for example by taking logs or square roots of the variable, the target, or both. It is also possible to add polynomial terms of the variable to the data so that the relationship becomes linear.

(Adding a polynomial term of the variable to the model is fine because the model only needs to be linear in its coefficients; the relationship between the target and the transformed variable, in this case the polynomial term, is still modeled linearly.)
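As a small sketch of these transformations (the DataFrame and column names here are hypothetical, purely for illustration):

import numpy as np
import pandas as pd

# Hypothetical data: 'x' is a feature, 'y' is the target
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.1, 4.3, 9.2, 16.5, 25.8]})

df['log_x'] = np.log(df['x'])      # log transform of the feature
df['log_y'] = np.log(df['y'])      # log transform of the target
df['x_squared'] = df['x'] ** 2     # polynomial term, still entering the model linearly

The transformed columns can then be used as ordinary features in the regression; the model remains linear in its coefficients.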

Problems if the assumption does not hold

If the linearity assumption is violated, both the coefficients and the standard errors in the output become unreliable, making the regression useless.

Assumption 2 — No Heteroskedasticity

The technical definition of heteroskedasticity is that the variance of Y given X is not constant.

When there is heteroskedasticity in the data the standard errors in the output become unreliable.

Ways to detect

(Figure: residual plot showing heteroskedasticity)

1. Residual plots — Once again, you want to see a nice, even spread in the residual plot, which indicates homoskedasticity in the data. If the residual spread is uneven, it shows that the variance is not constant across the predicted values.

2. Run a Goldfeld-Quandt test.

Steps

  • Order the data by the offending variable.
  • Split the data into two segments and then run a separate regression on each segment.
  • Compare the variance estimates between the regressions (sketched in code below).
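A minimal manual sketch of these steps, assuming some made-up heteroskedastic data (the column names and the variance pattern are purely illustrative):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data where the noise spread grows with x (heteroskedastic)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2 * x + rng.normal(scale=0.3 * x)                # noise spread increases with x
df = pd.DataFrame({'x': x, 'y': y}).sort_values('x') # 1. order by the offending variable

half = len(df) // 2                                  # 2. split into two segments

def resid_var(segment):
    X = sm.add_constant(segment[['x']])
    fit = sm.OLS(segment['y'], X).fit()              #    run a separate regression on each
    return fit.mse_resid                             #    estimate of the residual variance

# 3. compare the variance estimates between the regressions (an F-ratio)
f_stat = resid_var(df.iloc[half:]) / resid_var(df.iloc[:half])
print(f_stat)

In practice, statsmodels provides this test directly: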
# How to run the Goldfeld-Quandt test in Python using the statsmodels library
import statsmodels.stats.api as sms

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
list(zip(name, test))

The null hypothesis for the Goldfeld-Quandt test is homoskedasticity. The larger the F-statistic, the more evidence there is against the homoskedasticity assumption and the more likely it is that we have heteroskedasticity in our model.

Solutions

One of the easiest possible solutions is to log-transform the variable and the target and then check for heteroskedasticity again.
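As a hedged sketch of that solution, reusing the made-up heteroskedastic df from the manual Goldfeld-Quandt sketch above (the formula interface and column names are assumptions for illustration):

import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms

# Refit on the log-transformed target and feature, then re-run the test
log_results = smf.ols('np.log(y) ~ np.log(x)', data=df).fit()
f_stat, p_value, _ = sms.het_goldfeldquandt(log_results.resid,
                                            log_results.model.exog)
print(f_stat, p_value)

If the logging helped, the F-statistic should be smaller and the p-value larger than before.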

Assumption 3 — Normality of Error Terms (Residuals)

This assumption states that the residuals follow a normal distribution. There is a misconception that the target and feature variables themselves have to be normally distributed, but more often than not you will find that they are neither identically distributed nor normal.

Ways to detect

1. Look at a histogram of the residuals to see if the distribution follows a nice bell-shaped curve.

2. Use a QQ-Plot

(Figure: QQ-plot where the residuals are close to being perfectly normally distributed)

When looking at a residual plot, you should see more points clustered around the zero line, with the points getting sparser as you move farther from it.
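A short sketch of both checks, assuming a fitted statsmodels OLS results object called `results` (for example, the one from the residual-plot sketch earlier in this post):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Histogram of the residuals: look for a roughly bell-shaped curve
plt.hist(results.resid, bins=30)
plt.xlabel('Residual')
plt.show()

# QQ-plot: points should fall close to the 45-degree line if the residuals are normal
sm.qqplot(results.resid, line='45', fit=True)
plt.show()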

The normality of error terms can be considered a weak assumption for regression, since it is not usually a problem, particularly if you have a large number of observations. The reason is that with many observations the central limit theorem applies, and the true relationship will still come out. Even if the residual distribution looks strange, it is quite possible that we are still being given correct coefficients and standard errors.

The problem only becomes significant if you have a small number of observations; then the standard errors in the output are affected and become unreliable.

Solutions

1. The best method would simply be to get more observations.

2. Try log transforming the data by trial and error to see if you can get a normally distributed error term.

Assumption 4 — No multicollinearity between the variables.

Multicollinearity occurs when the x variables are themselves related to one another. The point of regression is to isolate the individual effect of each variable on the target: a coefficient is interpreted as the expected change in the target when that variable increases while all other variables are held constant. If there is multicollinearity, that last part becomes impossible, and the model will have a hard time identifying which of the variables actually made the difference in the target.

The issues that occur when there is multicollinearity once again include the standard errors and coefficients becoming unreliable. With a very high degree of multicollinearity the estimates become extremely unstable, and with perfect multicollinearity (a correlation equal to 1) the model cannot be estimated at all.

This is the reason why, when creating dummy variables for categorical data, it is necessary to drop one of the columns. If one category is not dropped, you can predict the value of any one dummy column from the others, which creates perfect multicollinearity.
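For example, pandas can drop one dummy column automatically (the category values here are made up):

import pandas as pd

# Hypothetical categorical feature; drop_first=True drops one dummy column
# to avoid the perfect multicollinearity of the dummy variable trap
colors = pd.Series(['red', 'blue', 'green', 'blue', 'red'])
dummies = pd.get_dummies(colors, prefix='color', drop_first=True)
print(dummies)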

Ways to detect

1. Look at the correlation between the variables, for example with a correlation matrix (sketched below).
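A quick sketch of that check with a hypothetical feature DataFrame:

import pandas as pd

# Made-up features; look for pairs with a high absolute correlation
X = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                  'x2': [2, 4, 6, 8, 10],    # perfectly correlated with x1
                  'x3': [5, 3, 6, 2, 7]})
print(X.corr())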

Solutions

1. The best method would be to simply get rid of one of the correlated variables.
