Linear Regression Assumptions in Python IN SHORT
If the assumptions are not satisfied, the interpretation of the results will not always be valid.
The discussion of each assumption is divided into 5 parts. They are as follows:
- Definition of the assumption
- Why it can happen
- What it will affect
- How to detect it
- How to fix it
So overall we have 5 assumptions in Linear Regression (MANHL)
Assumption 1: Multicollinearity (M) [Explained third]
Assumption 2: Autocorrelation (A) [Explained fourth]
Assumption 3: Normality (N) [Explained second]
Assumption 4: Homoscedasticity (H) [Explained fifth]
Assumption 5: Linearity (L) [Explained first, because it is the easiest one]
ASSUMPTION 1: Linearity
Definition (Linearity Assumption):
This assumes that there is a linear relationship between the predictors (the independent variables, or features) and the response variable.
Why it can happen (Linearity Assumption):
The relationship in the data may simply not be linear. Modelling is about trying to estimate a function that explains a process, and linear regression would not be a fitting estimator if there is no linear relationship.
What it will affect (Linearity Assumption):
The predictions will be extremely inaccurate because our model is underfitting.
How to detect it (Linearity Assumption):
If there is only one predictor, we can test for linearity with a simple scatter plot. Most cases aren’t so simple, so instead we use a scatter plot of our predicted values versus the actual values (in other words, a view of the residuals). Ideally, the points should lie on or around a diagonal line on the scatter plot.
How to fix it (Linearity Assumption):
Either add polynomial terms to some of the predictors or apply nonlinear transformations. If those do not work, try adding additional variables to help capture the relationship between the predictors and the label.
Python Code and Inference
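Below is a minimal sketch of the predicted-vs-actual check described above, assuming scikit-learn and matplotlib are installed. Recent scikit-learn releases no longer ship the Boston dataset used in the original article, so the bundled diabetes dataset stands in for it here.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Stand-in dataset (the original article used the Boston dataset)
X, y = load_diabetes(return_X_y=True)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# Linearity check: the predicted vs. actual points should hug the diagonal
plt.scatter(y, predictions, alpha=0.5)
lims = [y.min(), y.max()]
plt.plot(lims, lims, color="red", linestyle="--")  # ideal 45-degree line
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Linearity check: predicted vs. actual")
plt.show()
```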
ASSUMPTION 2: Normality of the Error Terms
Definition (Normality Assumption):
More specifically, this assumes that the error terms of the model are normally distributed. Linear regressions other than Ordinary Least Squares (OLS) may also assume the normality of the predictors or the label, but that is not the case here.
NOTE: In general, it is said that the Central Limit Theorem “kicks in” at an N of about 30. In other words, as long as the sample is based on 30 or more observations, the sampling distribution of the mean can be safely assumed to be normal, so the normality assumption is less of a concern when the sample size is greater than about 30.
Why it can happen (Normality Assumption):
This can actually happen if either the predictors or the label are significantly non-normal. Other potential reasons could include the linearity assumption being violated or outliers affecting our model.
What it will affect (Normality Assumption):
A violation of this assumption could cause issues with either shrinking or inflating our confidence intervals.
How to detect it (Normality Assumption):
We’ll look at both a histogram and the p-value from the Anderson-Darling test for normality.
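A hedged sketch of that check follows, assuming statsmodels is available; it uses `normal_ad` from `statsmodels.stats.diagnostic`, which reports the Anderson-Darling statistic together with a p-value for the normality null hypothesis.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from statsmodels.stats.diagnostic import normal_ad

X, y = load_diabetes(return_X_y=True)  # stand-in dataset
residuals = y - LinearRegression().fit(X, y).predict(X)

# Anderson-Darling test: a small p-value (e.g. < 0.05) suggests non-normal residuals
ad_statistic, p_value = normal_ad(residuals)
print(f"Anderson-Darling p-value: {p_value:.4f}")

# Histogram of the residuals: ideally roughly bell-shaped and centred on zero
plt.hist(residuals, bins=30, edgecolor="black")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.title("Distribution of the error terms")
plt.show()
```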
How to fix it (Normality Assumption):
It depends on the root cause, but there are a few options: nonlinear transformations of the variables, excluding specific variables (such as long-tailed variables), or removing outliers may solve this problem.
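As an illustration of the transformation route, a log transform of a long-tailed, non-negative target is a common first attempt. This is purely a sketch; whether it helps depends entirely on the data.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)  # stand-in dataset

# Log-transform a long-tailed, non-negative target before refitting
y_log = np.log1p(y)  # log(1 + y) avoids problems when y contains zeros
model_log = LinearRegression().fit(X, y_log)
```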
ASSUMPTION 3: No Multicollinearity among Predictors
Definition (Multicollinearity Assumption):
This assumes that the predictors used in the regression are not correlated with each other.
Why it can happen (Multicollinearity Assumption):
A lot of data is just naturally correlated. For example, if trying to predict a house price with square footage, the number of bedrooms, and the number of bathrooms, we can expect to see a correlation between those three variables because bedrooms and bathrooms make up a portion of square footage.
What it will affect (Multicollinearity Assumption):
- Multicollinearity causes issues with the interpretation of the coefficients.
- Specifically, you can interpret a coefficient as “an increase of 1 in this predictor results in a change of (coefficient) in the response variable, holding all other predictors constant.”
- This becomes problematic when multicollinearity is present because we can’t hold correlated predictors constant.
- Additionally, it increases the standard error of the coefficients, which results in them potentially showing as statistically insignificant when they might actually be significant.
How to detect it (Multicollinearity Assumption):
There are a few ways, but we will use a heatmap of the correlation as a visual aid and examine the variance inflation factor (VIF).
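A minimal sketch of both checks, assuming seaborn and statsmodels are installed; VIF values above roughly 10 are commonly treated as a sign of problematic multicollinearity.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.datasets import load_diabetes
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = load_diabetes(as_frame=True).data  # stand-in dataset

# Correlation heatmap as a visual aid
sns.heatmap(X.corr(), cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Correlation between predictors")
plt.show()

# Variance inflation factor (a constant column is added so the VIFs are meaningful)
X_const = sm.add_constant(X)
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
})
print(vif[vif["feature"] != "const"])  # the constant's own VIF is not of interest
```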
How to fix it (Multicollinearity Assumption):
This can be fixed by either removing predictors with a high variance inflation factor (VIF) or performing dimensionality reduction.
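Two hedged illustrations of the fix follow; the column dropped below is only an example choice, and which predictor to remove should be driven by the VIFs on your own data.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X = load_diabetes(as_frame=True).data  # stand-in dataset

# Option 1: drop a predictor flagged with a high VIF ("s1" is only an illustrative choice)
X_reduced = X.drop(columns=["s1"])

# Option 2: dimensionality reduction, keeping enough components for 95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(X)
print(X.shape, X_reduced.shape, X_pca.shape)
```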
ASSUMPTION 4: No Autocorrelation of the Error Terms
Definition (Autocorrelation Assumption):
This assumes no autocorrelation of the error terms. Autocorrelation being present typically indicates that we are missing some information that should be captured by the model.
Autocorrelation refers to the degree of correlation of the same variable between two successive time intervals. Put another way, autocorrelation is the degree of closeness or correlation between values of the same variable or data series at different periods. The result can vary from -1 to 1: if the value is -1 or close to -1, it is a negative correlation; if the value is 1 or close to 1, it is a positive correlation.
For example, in the stock market, stock traders and analysts apply it to understand the degree of similarity in moving patterns on charts, comprehend the impact of past prices, and predict future prices.
Why it can happen (Autocorrelation Assumption):
In a time series scenario, there could be information about the past that we aren’t capturing. In a non-time series scenario, our model could be systematically biased by either under or over-predicting certain conditions. Lastly, this could be a result of a violation of the linearity assumption.
What it will affect (Autocorrelation Assumption):
This will impact our model estimates; in particular, the standard errors of the coefficients will be biased, which undermines significance tests and confidence intervals.
How to detect it (Autocorrelation Assumption):
We will perform a Durbin-Watson test to determine if either a positive or negative correlation is present. Alternatively, you could create plots of residual autocorrelations.
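A short sketch of the test, assuming statsmodels; the Durbin-Watson statistic ranges from 0 to 4, with values near 2 indicating no autocorrelation, values below about 1.5 suggesting positive autocorrelation, and values above about 2.5 suggesting negative autocorrelation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from statsmodels.stats.stattools import durbin_watson

X, y = load_diabetes(return_X_y=True)  # stand-in dataset
residuals = y - LinearRegression().fit(X, y).predict(X)

# Durbin-Watson statistic: ~2 means no autocorrelation,
# < 1.5 suggests positive and > 2.5 suggests negative autocorrelation
dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw:.3f}")
```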
How to fix it (Autocorrelation Assumption):
Adding lag variables is a simple fix that often solves this problem. Alternatively, interaction terms, additional variables, or additional transformations may help.
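A toy sketch of the lag-variable idea on synthetic, autocorrelated data; the data-generating process below is entirely made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series with autocorrelated noise (purely illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 2 * df["x"] + np.cumsum(rng.normal(scale=0.5, size=200))

# Add the previous value of y as a lag feature, then refit
df["y_lag1"] = df["y"].shift(1)
df = df.dropna()
model = LinearRegression().fit(df[["x", "y_lag1"]], df["y"])
print(model.coef_)
```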
ASSUMPTION 5: Homoscedasticity
Definition (Homoscedasticity Assumption):
This assumes homoscedasticity, meaning the variance of our error terms is the same throughout. Heteroscedasticity, the violation of homoscedasticity, occurs when we don’t have an even variance across the error terms.
Homoskedastic (also spelt “homoscedastic”) refers to a condition in which the variance of the residual, or error term, in a regression model is constant. That is, the error term does not vary much as the value of the predictor variable changes. Another way of saying this is that the variance of the residuals is roughly the same across all data points.
This suggests a level of consistency and makes it easier to model and work with the data through regression; however, the lack of homoskedasticity may suggest that the regression model may need to include additional predictor variables to explain the performance of the dependent variable.
Conversely, heteroskedasticity occurs when the variance of the error term is not constant.
Examples of homoscedasticity and heteroskedasticity
For example, suppose I wanted to explain student test scores using the amount of time each student spent studying. In this case, the test scores would be the dependent variable and the time spent studying would be the predictor variable.
The error term would show the amount of variance in the test scores that was not explained by the amount of time studying. If that variance is uniform, or homoskedastic, that would suggest the model adequately explains test performance in terms of time spent studying.
But the variance may be heteroskedastic. A plot of the error terms may show that large amounts of study time corresponded very closely with high test scores, while test scores for low study time varied widely and even included some very high scores.
So the variance of scores would not be well explained by the single predictor variable, the amount of time studying. In this case, some other factor is probably at work, and the model may need to be enhanced in order to identify it.
Why it can happen (Homoscedasticity Assumption):
Our model may be giving too much weight to a subset of the data, particularly where the error variance was the largest.
What it will affect (Homoscedasticity Assumption):
Significance tests for the coefficients will be unreliable because the standard errors are biased. Additionally, the confidence intervals will be either too wide or too narrow.
How to detect it (Homoscedasticity Assumption):
Plot the residuals and see if the variance appears to be uniform.
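A minimal sketch of that residual plot; the residuals should form a roughly even band around zero rather than a funnel shape.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)  # stand-in dataset
model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residuals vs. fitted values: look for an even band around zero, not a funnel
plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Homoscedasticity check: residuals vs. fitted values")
plt.show()
```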
How to fix it (Homoscedasticity Assumption):
Heteroscedasticity (can you tell I like the scedasticity words?) can be solved either by using weighted least squares regression instead of the standard OLS or by transforming either the dependent or highly skewed variables. Performing a log transformation on the dependent variable is not a bad place to start.
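Two hedged sketches of those fixes with statsmodels; the weights used for the WLS fit below are only one common heuristic (inverse squared fitted values from a first-pass OLS), not a prescription.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)  # stand-in dataset
X_const = sm.add_constant(X)

# Option 1: log-transform the (non-negative) dependent variable and refit with OLS
ols_log = sm.OLS(np.log1p(y), X_const).fit()

# Option 2: weighted least squares, down-weighting observations with large fitted values
ols = sm.OLS(y, X_const).fit()
weights = 1.0 / (ols.fittedvalues ** 2)  # heuristic weight choice, for illustration only
wls = sm.WLS(y, X_const, weights=weights).fit()
print(wls.params[:3])
```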
Conclusion:
We can clearly see that a linear regression model on the Boston dataset violates a number of assumptions, and these violations cause significant problems with the interpretation of the model itself.
GitHub Code:
If it was helpful, please give a thumbs up. Thank you, and please follow:
Medium: https://medium.com/@sandipanpaul
GitHub: https://github.com/sandipanpaul21