1. What are the assumptions of the Classical Linear Regression model?

  • The model is linear in parameters, i.e. the dependent variable (Y) can be expressed as a linear combination of the explanatory variables (Xs) plus an error term.
  • The number of observations is not less than the number of parameters to be estimated, and no two explanatory variables have an exact linear relationship between them.
  • The explanatory variables are independent of the error term: E(e|X) = 0.
  • The error terms are i.i.d. (independently and identically distributed).
  • The error term is normally distributed with mean zero and constant variance.

2. What is R sq?

R-sq measures the proportion of the total variation in the dependent variable that is explained by all the explanatory variables together. For further details, please check my post here.
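As a quick illustration, here is a minimal numpy sketch of the definition (the data values are made up for demonstration):

```python
import numpy as np

# Actual values and model predictions (illustrative numbers)
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([3.2, 4.8, 7.1, 8.9, 11.0])

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_sq = 1 - ss_res / ss_tot
print(round(r_sq, 4))   # → 0.9975
```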

3. What is the Difference between R sq and Adjusted R sq?

A very common problem with R-sq is that it is an increasing function of the number of explanatory variables in a model. So if we build two models to predict the target variable using different numbers of explanatory variables, it is not prudent to compare their R-sq values directly to judge the goodness of fit of both models. Hence, to compare the two R-sq values we need to take into account the number of explanatory variables in each model.

So, if our R-sq = 1 − ∑eᵢ²/∑Yᵢ²

then the adjusted R-sq will be

Adjusted R-sq = 1 − (∑eᵢ²/(n − k)) / (∑Yᵢ²/(n − 1))

Both sums of squares have been divided by their degrees of freedom, n − k and n − 1 respectively. Please note here that ∑eᵢ² is the residual sum of squares and ∑Yᵢ² is the total sum of squares.
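The penalty can be seen numerically. A small sketch using the algebraically equivalent form Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k); the numbers are illustrative:

```python
def adjusted_r_sq(r_sq, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k),
    # where n is the sample size and k counts the estimated parameters
    return 1 - (1 - r_sq) * (n - 1) / (n - k)

# A model with more parameters and a slightly higher R^2
# can still have a lower adjusted R^2
print(adjusted_r_sq(0.90, n=50, k=3))
print(adjusted_r_sq(0.91, n=50, k=10))
```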

4. What are the properties of OLS estimator?

The four properties of the OLS estimator are Unbiasedness, Minimum Variance, Efficiency and Consistency. For further details about each of the properties please see this post.

5. What is Degrees of freedom?

Degrees of Freedom are the number of values in a calculation that are free to vary. For example, if somebody asks you for 4 numbers whose mean is 6, you could say {3, 6, 6, 9} or {2, 4, 8, 10}. Note that you can choose whatever you want for the first 3 numbers, but the last one is fixed in order to ensure that the mean is 6. So only the first 3 values are free to vary. Usually, the formula for degrees of freedom is as follows –

Degrees of Freedom = n-k (where, n = sample size and k = number of parameters)
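The 4-numbers example above can be written as a tiny sketch:

```python
# Four numbers with mean 6: the first three are free, the last is forced
free = [3, 6, 6]
last = 4 * 6 - sum(free)       # the only value that makes the mean equal 6
sample = free + [last]
print(last, sum(sample) / 4)   # → 9 6.0
```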

6. How to compute t test values for coefficient estimates in Regression Analysis?

The value of the ‘t’ statistic is calculated by dividing each coefficient estimate (β) by its standard error (SE). Here, the degrees of freedom equal n − (k + 1), or n − k − 1, where k + 1 is the number of parameters in the model including the intercept.
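As a sketch, the t statistics can be computed by hand with numpy on simulated data (the coefficients, seed and sample size are made up for illustration):

```python
import numpy as np

# Simulated data: true intercept 2, true slope 3 (illustrative values)
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
n_params = X.shape[1]                        # k + 1 parameters incl. intercept

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS coefficient estimates
resid = y - X @ beta
sigma2 = resid @ resid / (n - n_params)      # residual variance, df = n - (k+1)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se                          # t = estimate / standard error
print(t_stats)
```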

7. What are the null hypothesis for F test and t tests?

The difference between the F-test and the t-test in the context of the null hypothesis is that the F-test checks the overall significance of the model fit, while the t-test checks whether an individual coefficient is significantly different from zero.

So, for t-test the null hypothesis and alternative hypothesis are as follows-

  • H0 : βi = 0
  • HA : βi ≠ 0

And, for F -test, the null hypothesis and alternative hypothesis are as follows-

  • H0 : β1 = β2 = β3 = …=  βn =0
  • HA : Not all βs are simultaneously zero
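The overall F statistic can also be computed directly from R-sq. A minimal sketch with illustrative numbers (n observations, k slope coefficients):

```python
# Overall F statistic from R-squared; numbers are illustrative:
# n = 50 observations, k = 3 slope coefficients
r_sq, n, k = 0.40, 50, 3
f_stat = (r_sq / k) / ((1 - r_sq) / (n - k - 1))
print(round(f_stat, 2))   # → 10.22
```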

8. What happens when the p value for the F test is lower than alpha, i.e. what do you conclude?

When the p value is lower than alpha, the chosen level of significance, we reject the null hypothesis and conclude that the model is overall significant. In other words, we say that not all the coefficients are simultaneously zero.


9. What happens when the p value for the t test is lower than alpha, i.e. what do you conclude?

When the p value is lower than alpha, the chosen level of significance, we reject the null hypothesis and conclude that the coefficient estimate is statistically significant in explaining the variation in the dependent variable.

10. What is a residual? What are the assumptions regarding error term in a linear regression model?

A residual is the difference between the actual value of the dependent variable and the value predicted by the model. It is the sample counterpart of the unobservable error term.

For the assumptions on the error term please see this post.


11. Is there any difference in interpreting the slope parameter in simple linear regression vs multiple linear regression?

In a simple linear regression model with one X variable, if the slope parameter (β) is statistically significant, then for every unit increase in X there is an average change in Y of β. However, when we include more than one variable, i.e. in multiple linear regression, it is very likely that the value of β will change, and it may even become statistically insignificant in the presence of the other explanatory variables. The interpretation differs as well: if βi is significant, then for every 1 unit increase in Xi, holding all other explanatory variables constant, there is an average change in Y of βi.


12. How do you interpret the intercept term in Linear regression?


The intercept term gives us the expected mean value of the dependent variable when all the explanatory variables are zero. The intercept is often not meaningful on its own, but it is still important to include it in the regression equation.

13. What is Multicollinearity and how do you solve the problem of multicollinearity?

Multicollinearity arises when 2 or more explanatory variables in the model are correlated amongst themselves, so we cannot disentangle the separate effect of each explanatory variable on the dependent variable. The presence of multicollinearity inflates the variance of the coefficient estimates; the standard errors increase, which correspondingly deflates the t-statistics. So, all in all, the coefficient estimates may not be reliable in the presence of multicollinearity in the model.

For further details please check out my post here.

14. What is VIF?

VIF is known as Variance Inflation Factor which is a measure of multicollinearity. The formula for VIF is –

VIF = 1 / (1 − Rj²)

where Rj² is the R-squared value from regressing one explanatory variable against all the other explanatory variables present in the model.
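A sketch of this auxiliary-regression computation in numpy, on simulated data where two variables are deliberately near-collinear:

```python
import numpy as np

def vif(X, j):
    # Regress column j on all other columns (plus an intercept),
    # then VIF_j = 1 / (1 - R_j^2)
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    r_sq = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_sq)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # unrelated to the others

X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))            # x1's VIF is huge, x3's is near 1
```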

15. What is autocorrelation? what are the impacts of autocorrelation on the estimates?

Absence of autocorrelation is another assumption of linear regression models. Autocorrelation arises when the residual of the model from period t is correlated with the residuals from periods t−1, t−2, and so on. When that correlation is positive it is called positive autocorrelation, and when it is negative it is called negative autocorrelation. Such correlation often arises from omitted variables that the error term captures. The correlation between Ut and Ut−1 is the first order autocorrelation; similarly, the correlation between Ut and Ut−2 is called the second order autocorrelation. The effects of autocorrelated errors on least squares estimators are

  • If there are no lagged dependent variables among the explanatory variables, the estimators are unbiased but inefficient. However, the estimated variances are biased.
  • If there are lagged dependent variables, the least square estimators are not at all consistent.

16. How can you test autocorrelation?

There are multiple tests for detecting autocorrelation. The most popular ones are the Ljung-Box test and the Durbin-Watson (DW) test. The DW statistic ranges from 0 to 4. A value near 2 means no autocorrelation, a value below 2 and closer to 0 implies strong positive autocorrelation, and a value above 2 and closer to 4 implies strong negative autocorrelation.

The Ljung-Box test is a hypothesis test whose null hypothesis is that there is no autocorrelation in the residuals.
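The DW statistic is simple to compute by hand. A numpy sketch on simulated residuals (the seed and AR coefficient are illustrative):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive differences / sum of squared residuals;
    # near 2 -> no autocorrelation, near 0 -> positive, near 4 -> negative
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
white = rng.normal(size=500)       # uncorrelated residuals

ar = np.zeros(500)                 # AR(1) residuals with coefficient 0.8
noise = rng.normal(size=500)
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + noise[t]

print(round(durbin_watson(white), 2), round(durbin_watson(ar), 2))
```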

17. What is heteroscedasticity? How do you test it?

One of the crucial assumptions of linear regression model is that the residual has constant variance which is also known as the assumption of homoscedasticity. When the residual variance is not constant it is a case of heteroscedasticity.

There are multiple tests to check for the presence of heteroscedasticity. The most used are the Breusch-Pagan test and the White test. Heteroscedasticity can also be identified visually from the residuals-vs-fitted plot: if the points fan out in a cone shape, the residual variance is not constant.
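A rough numpy sketch of the Breusch-Pagan idea — regress the squared residuals on the explanatory variables and form the LM statistic n·R² (data simulated with variance growing in x; in practice you would use a library implementation and a chi-squared p-value):

```python
import numpy as np

def breusch_pagan_lm(resid, X):
    # LM statistic: n * R^2 from regressing squared residuals on the
    # explanatory variables; under homoscedasticity it is approximately
    # chi-squared with (number of slope terms) degrees of freedom
    n = len(resid)
    u = resid ** 2
    Z = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Z, u, rcond=None)[0]
    fitted = Z @ beta
    r_sq = 1 - np.sum((u - fitted) ** 2) / np.sum((u - u.mean()) ** 2)
    return n * r_sq

# Simulated regression whose error standard deviation grows with x
rng = np.random.default_rng(3)
x = rng.uniform(1, 5, size=300)
y = 1 + 2 * x + x * rng.normal(size=300)

X = np.column_stack([np.ones(300), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

lm = breusch_pagan_lm(resid, x.reshape(-1, 1))
print(lm)   # well above the chi-squared(1) 5% critical value of about 3.84
```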

18. How do you test a model’s performance?

There are multiple ways of testing a linear regression based model’s performance, for example the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Chi-squared goodness-of-fit test, and the Spearman rank correlation between the actual and predicted values. For further details, please check the post here.
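RMSE and MAE can be computed in a couple of lines (the values are illustrative):

```python
import numpy as np

# Actual vs predicted values (illustrative numbers)
y_true = np.array([3.0, 5.0, 8.0, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err ** 2))      # penalizes large errors more heavily
mae = np.mean(np.abs(err))             # average absolute miss
print(round(rmse, 4), round(mae, 2))   # → 0.7906 0.75
```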

19. Difference between Linear and Logistic Regression?

Linear Regression is used to predict continuous values of a dependent variable. Logistic Regression is used to predict a binary dependent variable (e.g. a variable which takes the values 0 and 1).

To perform linear regression we require a linear relationship between the dependent and independent variables, but logistic regression instead assumes a linear relationship between the explanatory variables and the log-odds of the outcome. Linear regression fits a straight line to the data while logistic regression fits an S-shaped curve through the logit function. A linear regression based model assumes a Gaussian (normal) distribution of the errors, while a logistic regression based model assumes a binomial distribution of the dependent variable (Y).
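The logit link is what keeps logistic predictions inside (0, 1). A sketch with hypothetical coefficients (b0 and b1 are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Inverse of the logit function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

b0, b1 = -4.0, 2.0          # hypothetical fitted coefficients
x = np.array([0.0, 2.0, 4.0])

p = sigmoid(b0 + b1 * x)    # predicted probabilities stay inside (0, 1)
print(np.round(p, 4))
```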

20. How do you include qualitative information in your regression model?

To include qualitative information as a variable in the model we need to create dummy variables, which are binary indicators (0 or 1). If we have n categories capturing the qualitative information, we should create n − 1 dummy variables for our model. For further details please check my post here.
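A sketch of building n − 1 dummies by hand for a hypothetical 3-category variable (the omitted category becomes the baseline):

```python
import numpy as np

# Hypothetical qualitative variable with n = 3 categories
colors = np.array(["red", "green", "blue", "green", "red"])
keep = ["red", "green"]            # n - 1 dummies; "blue" is the baseline

dummies = {c: (colors == c).astype(int) for c in keep}
print(dummies["red"], dummies["green"])
```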
