If you work on regression problems with multiple explanatory (independent) variables, you have probably come across the term multicollinearity.
DEFINITION:
Let’s try to understand what multicollinearity is and why it is considered a problem. Multicollinearity arises when your regression model has more than one explanatory variable (X variable) and these explanatory variables are correlated with each other. As a result, you cannot disentangle the separate effect of each variable on your dependent variable (Y variable). More precisely, the coefficient estimates for the correlated explanatory variables will not give you an accurate picture of each variable's predictive power. In more statistical terms, multicollinearity inflates the variance of the estimated coefficients, which in turn deflates the t-ratios and can make coefficients look insignificant. Also, from a practical or business perspective, there is simply no point in keeping two very similar variables in your model. As simple as that!
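A quick simulation makes the variance inflation concrete. Everything below is illustrative, made-up data (not from any real study): two nearly identical regressors blow up the classical OLS standard errors compared to an independent pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: x2 is almost a copy of x1, so the two regressors
# are severely collinear
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + noise

def ols_se(cols, y):
    """Classical OLS standard errors for the slope coefficients."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))[1:]  # drop the intercept's SE

se_collinear = ols_se([x1, x2], y)

# Same model, but with an unrelated second regressor for comparison
x2_indep = rng.normal(size=n)
se_independent = ols_se([x1, x2_indep], 2.0 * x1 + 1.0 * x2_indep + noise)

print("collinear SEs:  ", se_collinear)
print("independent SEs:", se_independent)
```

With the collinear pair, the standard errors are an order of magnitude larger, which is exactly why the t-ratios collapse.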
DETECTION:
Now if you have plenty of explanatory variables, you will want to drop a few based on the Variance Inflation Factor (VIF). The formula for the VIF of the j-th variable is

VIF_j = 1 / (1 − R_j²)
One important point to keep in mind: R_j² is not our usual adjusted R². R_j² is the coefficient of determination from regressing X_j on the other explanatory variables, with the dependent variable (Y) excluded. (Tip: this is a very popular interview question.)
In most textbooks you will find that VIF < 10 is low enough to ignore the presence of multicollinearity. There is no strict rule, though; the lower the better. Based on the VIFs you can drop a few variables from your list.
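The definition above translates directly into code. Here is a minimal NumPy-only sketch (the data at the bottom is made up purely for illustration): for each column we regress it on the remaining columns and plug the resulting R² into the formula.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from
    regressing column j of X on the remaining columns plus an
    intercept. Note that the dependent variable y plays no role here."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to both
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 well above the usual 10 cutoff, x3 near 1
```

In practice you would reach for a library routine such as statsmodels' `variance_inflation_factor`, but the hand-rolled version makes the definition transparent.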
SOLUTION:
Now come two very important questions. If you identify multicollinearity in your model, how are you going to remove it? And if two explanatory variables are very similar and highly correlated, how are you going to decide which one to keep?
- One of the easiest ways to overcome the problem is to transform the variables. The intensity of multicollinearity is likely to drop when the variables are transformed (e.g. first differences, lags, etc.). Another, rather impractical, way is to increase the sample size, if you are lucky enough to be able to. Data constraints are always an issue for researchers, but if you do have the luxury of a larger sample, the variances of the estimated coefficients will most likely fall.
- Another very popular way to deal with multicollinearity is to combine the correlated variables through Principal Component Analysis (PCA). I will talk about PCA in detail in another post.
- We can also combine the correlated variables directly: if two variables measure the same thing, we can average them or build a composite index.
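The first-differencing idea from the transformation bullet is easy to demonstrate on simulated data (entirely hypothetical series): two trending series are almost perfectly correlated in levels, but their first differences are not.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(300)

# Hypothetical series: both share the same deterministic trend,
# but their period-to-period innovations are independent
x1 = 1.0 * t + np.cumsum(rng.normal(size=300))
x2 = 1.0 * t + np.cumsum(rng.normal(size=300))

corr_levels = np.corrcoef(x1, x2)[0, 1]
corr_diffs = np.corrcoef(np.diff(x1), np.diff(x2))[0, 1]
# The shared trend dominates the levels; differencing removes it,
# leaving only the unrelated innovations
print(corr_levels, corr_diffs)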
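As a small preview of the PCA route (a minimal sketch using plain NumPy SVD rather than any particular library, on made-up data): three regressors driven by one common factor collapse into a single principal component.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
factor = rng.normal(size=n)

# Three explanatory variables that all measure the same underlying factor
X = np.column_stack([factor + 0.1 * rng.normal(size=n) for _ in range(3)])

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)

# The first principal component captures almost all of the shared
# variation and can replace the three collinear regressors
pc1 = Xc @ Vt[0]
print(explained)
```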
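And the averaging/index idea can be sketched like this (the variable names are invented for illustration): standardize the near-duplicate variables so they share a common scale, then average them into one composite regressor.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Two hypothetical variables that measure essentially the same thing
spend_online = rng.normal(100, 15, size=n)
spend_app = spend_online + rng.normal(0, 3, size=n)

def zscore(v):
    """Put a variable on a common (mean 0, sd 1) scale before averaging."""
    return (v - v.mean()) / v.std()

# A single composite index replaces the two collinear regressors
spend_index = (zscore(spend_online) + zscore(spend_app)) / 2
```

Standardizing first matters: averaging raw variables measured on different scales would let the larger-variance one dominate the index.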
Multicollinearity doesn’t always need fixing; it matters mainly when you need interpretable coefficients. Use VIFs, correlation analysis, or PCA to diagnose it, and apply variable removal, regularization, or transformation to resolve it.
