Sometimes the explanatory variables in a regression model are not quantitative. You may instead have qualitative information about your observations, such as whether a person holds a particular degree. How do you use it in your model? This is where dummy variables come in. These are artificial variables that modelers construct to quantify an attribute or quality. They are binary, i.e., they take the value 1 or 0.
DEFINITION
A dummy variable is a numerical variable used in statistical modeling and regression analysis to represent categorical data by assigning binary values (0 or 1) to different categories.
- Dummy variable = 1: the attribute is present
- Dummy variable = 0: the attribute is absent
EXAMPLE
Let’s understand this with an example. Suppose we want to estimate the effect of having a PhD on a professor's salary. We define a dummy variable X as follows:
- X = 1 if the professor has a PhD
- X = 0 if the professor does not have a PhD
The formal model can be written as Yi = α + β Xi + εi, where Yi is the salary of professor i and εi is an error term with E[εi | Xi] = 0.
Now E[Yi | Xi = 0] = α, so the intercept α is the mean salary of professors without a PhD.
E[Yi | Xi = 1] = α + β, so the slope β is the difference in mean salary between professors with and without a PhD.
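To make this interpretation concrete, here is a minimal Python sketch. The salary figures are made up purely for illustration, and `ols_simple` is a hypothetical helper that applies the textbook closed-form formulas for a one-regressor OLS fit. It checks that the fitted intercept matches the mean salary of non-PhDs and that the fitted slope matches the difference in group means.

```python
from statistics import mean

# Hypothetical salaries (in thousands) and a PhD dummy; illustrative numbers only.
salary = [60.0, 64.0, 62.0, 80.0, 84.0, 82.0, 86.0]
has_phd = [0, 0, 0, 1, 1, 1, 1]


def ols_simple(x, y):
    """Closed-form OLS estimates for y = alpha + beta * x + error."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


alpha, beta = ols_simple(has_phd, salary)

# Group means computed directly from the data, for comparison.
mean_non_phd = mean(s for s, d in zip(salary, has_phd) if d == 0)
mean_phd = mean(s for s, d in zip(salary, has_phd) if d == 1)

print("intercept:", alpha, "| mean salary of non-PhDs:", mean_non_phd)
print("slope:", beta, "| difference in means:", mean_phd - mean_non_phd)
```

The two numbers on each printed line should agree up to floating-point rounding, which is the sample counterpart of the two conditional expectations above.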
In the example above, the qualitative variable had only two possible values. We can also build a model where the qualitative information has more than two possible values. In those cases we need more than one dummy variable. When the explanatory variable has n possible values and the model includes an intercept, we create n-1 dummy variables to represent it; the omitted category serves as the reference group.
Let’s extend the previous example. Suppose we want to estimate how the highest degree held by teachers in a school affects their salaries. There are three possible degrees: Bachelors, Masters or PhD.
For this case, the model will be
Yi = α + β1 X1 + β2 X2 + εi
where
- X1 = 1 if the person has a PhD, and 0 otherwise
- X2 = 1 if the person has a Masters, and 0 otherwise
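Therefore, the mean salaries are
E[Yi | X1 = 1, X2 = 0] = α + β1 (PhD)
E[Yi | X1 = 0, X2 = 1] = α + β2 (Masters)
E[Yi | X1 = 0, X2 = 0] = α (Bachelors, the reference category)
So β1 and β2 measure how much the mean salaries of PhD and Masters holders differ from the mean salary of Bachelors holders.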
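The encoding can be sketched in a few lines of Python. The data and the `encode` helper below are hypothetical, written only for this illustration. Because the model has one parameter per group, its OLS fit reproduces the group means, so the sketch recovers α, β1 and β2 directly from those means instead of running a full regression.

```python
from statistics import mean

# Hypothetical teachers: highest degree and salary (in thousands).
education = ["Bachelors", "Masters", "PhD", "Masters", "Bachelors", "PhD"]
salary = [40.0, 50.0, 70.0, 54.0, 44.0, 74.0]


def encode(levels, categories, reference):
    """Build one 0/1 column per category, skipping the reference category."""
    return {
        cat: [1 if level == cat else 0 for level in levels]
        for cat in categories
        if cat != reference
    }


dummies = encode(education, ["PhD", "Masters", "Bachelors"], reference="Bachelors")
x1, x2 = dummies["PhD"], dummies["Masters"]  # n = 3 categories, n - 1 = 2 dummies


def group_mean(d1, d2):
    """Mean salary of observations with X1 = d1 and X2 = d2."""
    return mean(y for y, a, b in zip(salary, x1, x2) if (a, b) == (d1, d2))


alpha = group_mean(0, 0)          # mean salary of the reference group (Bachelors)
beta1 = group_mean(1, 0) - alpha  # PhD premium over Bachelors
beta2 = group_mean(0, 1) - alpha  # Masters premium over Bachelors

print("X1:", x1)
print("X2:", x2)
print("alpha, beta1, beta2:", alpha, beta1, beta2)
```

Choosing a different reference category changes the values of the coefficients but not the fitted group means; only the baseline for comparison moves.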
HOW TO AVOID THE DUMMY VARIABLE TRAP
The dummy variable trap is a common mistake in regression analysis: including a dummy for every category of a variable alongside the intercept, which makes the predictors perfectly multicollinear.
Why is it a Problem?
- Regression models assume that no predictor is an exact linear combination of the others.
- If we include a dummy for every category, the dummies sum to 1 for each observation, which duplicates the intercept column. The model then cannot separate their individual effects, and the coefficients cannot be uniquely estimated.
- To avoid the trap, if a categorical variable has n categories, use n-1 dummy variables and treat the omitted category as the reference.
- Example: For “Season” (Spring, Summer, Fall, Winter), we create 3 dummy variables (one category is the reference).
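The Season example can be checked numerically. The sketch below uses made-up observations and a small hypothetical `rank` helper (plain Gaussian elimination with exact fractions) to compare two design matrices: one with an intercept plus all four season dummies, and one with an intercept plus three dummies.

```python
from fractions import Fraction

# Hypothetical observations of the categorical variable "Season".
categories = ["Spring", "Summer", "Fall", "Winter"]
seasons = ["Spring", "Summer", "Fall", "Winter", "Spring", "Fall"]


def rank(matrix):
    """Matrix rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


# Trap: intercept column plus one dummy for every category (5 columns).
full = [[1] + [1 if s == c else 0 for c in categories] for s in seasons]

# Fix: intercept plus n - 1 dummies, with Spring as the reference (4 columns).
reduced = [[1] + [1 if s == c else 0 for c in categories[1:]] for s in seasons]

# In the full matrix, the intercept equals the sum of the dummies in every row.
print(all(row[0] == sum(row[1:]) for row in full))
print("full:", len(full[0]), "columns, rank", rank(full))
print("reduced:", len(reduced[0]), "columns, rank", rank(reduced))
```

A rank below the number of columns means at least one column is redundant, which is why the coefficients cannot be uniquely estimated. Dropping one dummy restores full column rank.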
Applications:
Dummy variables are useful in setting up many econometric models. We will discuss some real-life examples using dummies in a separate post.
