Dummy variables enter a regression model as explanatory variables in the same way as quantitative variables. For example, consider a Mincer-type regression model of wage determination in which wages depend on gender (qualitative) and years of education (quantitative):
ln(wage) = α_{0} + δ_{0} female + α_{1} education + u
where u ∼ N(0, σ^{2}) is the error term. In the model, female = 1 when the person is female and female = 0 when the person is male.
δ_{0} can be interpreted as the difference in log wages between females and males, holding education constant. Thus, δ_{0} helps to determine whether there is wage discrimination between males and females. For example, if δ_{0} > 0 (positive coefficient), then women earn a higher wage than men (keeping other factors constant). The coefficients attached to dummy variables are called differential intercept coefficients. The model can be depicted graphically as an intercept shift between females and males. The figure shows the case δ_{0} < 0 (in which men earn a higher wage than women).
Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons:
D_{1} = 1 if the observation is for summer, and 0 otherwise;
D_{2} = 1 if the observation is for autumn, and 0 otherwise;
D_{3} = 1 if the observation is for winter, and 0 otherwise; and
D_{4} = 1 if the observation is for spring, and 0 otherwise.
In panel data, the fixed effects estimator creates dummies for each of the cross-sectional units (e.g. firms or countries) or for each of the periods in a pooled time series. However, in such regressions either the constant term or one of the dummies has to be removed, with the category of the removed dummy becoming the base category against which the others are assessed, in order to avoid the dummy variable trap:
The constant term in all regression equations is a coefficient multiplied by a regressor equal to one. When the regression is expressed as a matrix equation, the matrix of regressors then consists of a column of ones (the constant term), vectors of zeros and ones (the dummies), and possibly other regressors. If one includes both male and female dummies, say, the sum of these vectors is a vector of ones, since every observation is categorized as either male or female. This sum is thus equal to the constant term's regressor, the first vector of ones. As a result, the regression cannot be solved by the usual least-squares formula, because the matrix that the formula needs to invert is singular. In other words: if both the vector-of-ones (constant term) regressor and an exhaustive set of dummies are present, perfect multicollinearity occurs, and the system of equations formed by the regression does not have a unique solution. This is referred to as the dummy variable trap. The trap can be avoided by removing either the constant term or one of the offending dummies. The removed dummy then becomes the base category against which the other categories are compared.
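The rank argument can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the six observations are made up for illustration and are not taken from any data set discussed here.

```python
import numpy as np

# Hypothetical sample: 1 = female, 0 = male (made-up values for illustration).
female = np.array([1, 0, 1, 1, 0, 0])
male = 1 - female
const = np.ones_like(female)

# Constant plus an exhaustive set of dummies: male + female equals the constant
# column, so the three columns are linearly dependent.
X_trap = np.column_stack([const, male, female])

# Dropping one dummy (male becomes the base category) removes the dependence.
X_ok = np.column_stack([const, female])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns:", X_trap.shape[1], "rank:", rank_trap)
print("columns:", X_ok.shape[1], "rank:", rank_ok)
```

A design matrix whose rank is smaller than its number of columns is exactly the situation in which the coefficients are not uniquely determined.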
A regression model in which the dependent variable is quantitative in nature but all the explanatory variables are dummies (qualitative in nature) is called an Analysis of Variance (ANOVA) model.
Suppose we want to run a regression to find out whether the average annual salary of public school teachers differs among three geographical regions in Country A with 51 states: (1) North (21 states), (2) South (17 states), and (3) West (13 states). Say that the simple arithmetic average salaries are as follows: $24,424.14 (North), $22,894 (South), $26,158.62 (West). The arithmetic averages are different, but are they statistically different from each other? To compare the mean values, Analysis of Variance techniques can be used. The regression model can be defined as:
Y_{i} = α_{1} + α_{2}D_{2i} + α_{3}D_{3i} + u_{i},
where
Y_{i} = average annual salary of public school teachers in state i
D_{2i} = 1 if state i is in the North Region, 0 otherwise (any region other than North)
D_{3i} = 1 if state i is in the South Region, 0 otherwise
In this model, we have only qualitative regressors, taking the value of 1 if the observation belongs to a specific category and 0 if it belongs to any other category. This makes it an ANOVA model.
Now, taking the expectation of both sides, we obtain the following:
Mean salary of public school teachers in the North Region:
E(Y_{i} | D_{2i} = 1, D_{3i} = 0) = α_{1} + α_{2}
Mean salary of public school teachers in the South Region:
E(Y_{i} | D_{2i} = 0, D_{3i} = 1) = α_{1} + α_{3}
Mean salary of public school teachers in the West Region:
E(Y_{i} | D_{2i} = 0, D_{3i} = 0) = α_{1}
(The error term does not get included in the expectation values as it is assumed that it satisfies the usual OLS conditions, i.e., E(u_{i}) = 0.)
The expected values can be interpreted as follows: the mean salary of public school teachers in the West is equal to the intercept term α_{1} in the multiple regression equation, and the differential intercept coefficients, α_{2} and α_{3}, show by how much the mean salaries of teachers in the North and South Regions differ from that of teachers in the West. Thus, the mean salaries of teachers in the North and South are compared against the mean salary of teachers in the West. Hence, the West Region becomes the base group or benchmark group, i.e., the group against which the comparisons are made. The omitted category, i.e., the category to which no dummy is assigned, is taken as the base group.
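The link between the dummy coefficients and the group means can be illustrated with a small least-squares fit. This is only a sketch, assuming NumPy is available; the eight salary figures are invented for the illustration and are not the data behind the averages quoted above.

```python
import numpy as np

# Made-up salaries (in thousands) for three regions; illustrative values only.
salary = np.array([24.0, 25.0, 26.0, 22.0, 23.0, 27.0, 26.0, 28.0])
region = np.array(["North", "North", "North", "South", "South",
                   "West", "West", "West"])

d2 = (region == "North").astype(float)  # D2 = 1 for North, 0 otherwise
d3 = (region == "South").astype(float)  # D3 = 1 for South, 0 otherwise

# West has no dummy of its own, so it is the omitted (base) category.
X = np.column_stack([np.ones(len(salary)), d2, d3])
a1, a2, a3 = np.linalg.lstsq(X, salary, rcond=None)[0]

mean_north = salary[region == "North"].mean()
mean_south = salary[region == "South"].mean()
mean_west = salary[region == "West"].mean()

print("intercept vs West mean:", a1, mean_west)
print("a1 + a2 vs North mean: ", a1 + a2, mean_north)
print("a1 + a3 vs South mean: ", a1 + a3, mean_south)
```

In this sketch the intercept matches the mean of the base group, and each dummy coefficient matches the gap between that region's mean and the base group's mean.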
Using the given data, the result of the regression would be:
Ŷ_{i} = 26,158.62 − 1734.473D_{2i} − 3264.615D_{3i}
se = (1128.523) (1435.953) (1499.615)
t = (23.1759) (−1.2078) (−2.1776)
p = (0.0000) (0.2330) (0.0349)
R^{2} = 0.0901
where se = standard error, t = t statistic, p = p value
The regression result can be interpreted as follows: the mean salary of teachers in the West (base group) is about $26,158; the salary of teachers in the North is lower by about $1734 ($26,158.62 − $1734.473 = $24,424.14, which is the average salary of teachers in the North); and that of teachers in the South is lower by about $3265 ($26,158.62 − $3264.615 = $22,894, which is the average salary of teachers in the South).
To find out whether the mean salaries of teachers in the North and South are statistically different from that of teachers in the West (the comparison category), we have to find out whether the coefficients on the dummies in the regression result are statistically significant. For this, we consider the p values. The estimated coefficient for the North is not statistically significant, as its p value is about 23 percent; however, that of the South is statistically significant at the 5% level, as its p value is only around 3.5 percent. Thus the overall result is that the mean salaries of teachers in the West and North are not statistically different from each other, but the mean salary of teachers in the South is statistically lower than that in the West by around $3265. The model is shown diagrammatically in Figure 2. This model is an ANOVA model with one qualitative variable having 3 categories.
Suppose we consider an ANOVA model having two qualitative variables, each with two categories: Hourly Wages are to be explained in terms of the qualitative variables Marital Status (Married / Unmarried) and Geographical Region (North / Non-North). Here, Marital Status and Geographical Region are the two explanatory dummy variables.
Say the regression output on the basis of some given data appears as follows:
Ŷ_{i} = 8.8148 + 1.0997D_{2} − 1.6729D_{3}
where,
Y = hourly wages (in $)
D_{2} = marital status, 1 = married, 0 = otherwise
D_{3} = geographical region, 1 = North, 0 = otherwise
In this model, a single dummy is assigned to each qualitative variable, one less than the number of categories included in each.
Here, the base group is the omitted category: Unmarried, Non-North region (unmarried people who do not live in the North region). All comparisons are made in relation to this base group. The mean hourly wage in the base category is about $8.81 (intercept term). In comparison, the mean hourly wage of those who are married (and live outside the North) is higher by about $1.10 and is equal to about $9.91 ($8.81 + $1.10). In contrast, the mean hourly wage of those who live in the North (and are unmarried) is lower by about $1.67 and is about $7.14 ($8.81 − $1.67).
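The arithmetic for every combination of the two dummies can be laid out as follows. This is a small sketch that uses only the coefficients reported in the regression output above; the function name `mean_wage` and the category labels are chosen here for illustration.

```python
# Coefficients as reported in the regression output above.
intercept, b_married, b_north = 8.8148, 1.0997, -1.6729


def mean_wage(married: int, north: int) -> float:
    """Predicted mean hourly wage for one combination of the two dummies."""
    return intercept + b_married * married + b_north * north


cells = {
    ("unmarried", "non-North"): mean_wage(0, 0),  # base group
    ("married", "non-North"): mean_wage(1, 0),
    ("unmarried", "North"): mean_wage(0, 1),
    ("married", "North"): mean_wage(1, 1),
}

for (status, place), wage in cells.items():
    print(f"{status:>9}, {place:<9}: {wage:.2f}")
```

Because the model contains no interaction term, the married and North differentials simply add up in the fourth cell.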
Thus, if more than one qualitative variable is included in the regression, the omitted categories jointly define the benchmark category, and all comparisons are made in relation to it. The intercept term shows the expected value for the benchmark category, and the coefficients on the dummies show by how much the other categories differ from the benchmark (omitted) category.
A regression model that contains a mixture of both quantitative and qualitative variables is called an Analysis of Covariance (ANCOVA) model. ANCOVA models are extensions of ANOVA models. They statistically control for the effects of quantitative explanatory variables (also called covariates or control variables).
To illustrate how qualitative and quantitative regressors are included to form ANCOVA models, suppose we consider the same example used in the ANOVA model with one qualitative variable: average annual salary of public school teachers in three geographical regions of Country A. If we include a quantitative variable, State Government expenditure on public schools per pupil, in this regression, we get the following model:
Y_{i} = α_{1} + α_{2}D_{2i} + α_{3}D_{3i} + α_{4}X_{i} + U_{i}
where,
Y_{i} = average annual salary of public school teachers in state i
X_{i} = State expenditure on public schools per pupil
D_{2i} = 1 if state i is in the North Region, 0 otherwise
D_{3i} = 1 if state i is in the South Region, 0 otherwise
Say the regression output for this model is
Ŷ_{i} = 13,269.11 − 1673.514D_{2i} − 1144.157D_{3i} + 3.2889X_{i}
The result suggests that, for every $1 increase in State expenditure per pupil on public schools, a public school teacher's average salary goes up by about $3.29. Further, for a state in the North region, the mean salary of teachers is lower than that of the West region by about $1673, and for a state in the South region, it is lower than that of the West region by about $1144, holding expenditure constant. Figure 3 depicts this model diagrammatically. The average salary lines are parallel to each other because the model assumes that the coefficient of expenditure does not vary by region. The relationship shown separately in the graph for each category is between the two quantitative variables: public school teachers' salaries (Y) and State expenditure per pupil on public schools (X).
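The parallel-lines property can be seen by evaluating the fitted equation at different expenditure levels. The sketch below uses the coefficients reported above; the two spending levels (3000 and 5000 dollars per pupil) and the function name `predicted_salary` are assumptions made only for illustration.

```python
# Coefficients as reported in the regression output above.
A1, A2, A3, A4 = 13269.11, -1673.514, -1144.157, 3.2889


def predicted_salary(region: str, spending: float) -> float:
    """Fitted average salary for a state in `region` with the given spending per pupil."""
    d2 = 1.0 if region == "North" else 0.0  # D2: North dummy
    d3 = 1.0 if region == "South" else 0.0  # D3: South dummy
    return A1 + A2 * d2 + A3 * d3 + A4 * spending


for spending in (3000.0, 5000.0):  # hypothetical spending levels
    west = predicted_salary("West", spending)
    north_gap = predicted_salary("North", spending) - west
    south_gap = predicted_salary("South", spending) - west
    print(f"X = {spending:.0f}: West = {west:.2f}, "
          f"North gap = {north_gap:.2f}, South gap = {south_gap:.2f}")
```

The gaps between regions stay the same at every spending level, which is what makes the three salary lines parallel.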
Quantitative regressors in regression models often interact with one another. In the same way, qualitative regressors, or dummies, can also have interaction effects between each other, and these interactions can be depicted in the regression model. For example, in a regression involving determination of wages, if two qualitative variables are considered, namely, gender and marital status, there could be an interaction between marital status and gender. These interactions can be shown in the regression equation as illustrated by the example below.
With the two qualitative variables being gender and marital status and with the quantitative explanator being years of education, a regression that is purely linear in the explanators would be
Y_{i} = β_{1} + β_{2}D_{2,i} + β_{3}D_{3,i} + αX_{i} + U_{i}
where
i denotes the particular individual
Y = Hourly Wages (in $)
X = Years of education
D_{2} = 1 if female, 0 otherwise
D_{3} = 1 if married, 0 otherwise
This specification does not allow for the possibility that there may be an interaction that occurs between the two qualitative variables, D_{2} and D_{3}. For example, a female who is married may earn wages that differ from those of an unmarried male by an amount that is not the same as the sum of the differentials for solely being female and solely being married. Then the effect of the interacting dummies on the mean of Y is not simply additive as in the case of the above specification, but multiplicative also, and the determination of wages can be specified as:
Y_{i} = β_{1} + β_{2}D_{2,i} + β_{3}D_{3,i} + β_{4}(D_{2,i}D_{3,i}) + αX_{i} + U_{i}
Here,
β_{2} = differential effect of being a female
β_{3} = differential effect of being married
β_{4} = further differential effect of being both female and married
By this equation, when the error term is zero, the wage of an unmarried male is β_{1} + αX_{i}, that of an unmarried female is β_{1} + β_{2} + αX_{i}, that of a married male is β_{1} + β_{3} + αX_{i}, and that of a married female is β_{1} + β_{2} + β_{3} + β_{4} + αX_{i} (where any of the estimates of the coefficients of the dummies could turn out to be positive, zero, or negative).
Thus, an interaction dummy (product of two dummies) can alter the dependent variable from the value that it gets when the two dummies are considered individually.
However, the use of products of dummy variables to capture interactions can be avoided by using a different scheme for categorizing the data, one that specifies categories in terms of combinations of characteristics. If we let
D_{4} = 1 if unmarried female, 0 otherwise
D_{5} = 1 if married male, 0 otherwise
D_{6} = 1 if married female, 0 otherwise
then it suffices to specify the regression
Y_{i} = δ_{1} + δ_{4}D_{4,i} + δ_{5}D_{5,i} + δ_{6}D_{6,i} + αX_{i} + U_{i}.
Then, with a zero error term, the value of the dependent variable is δ_{1} + αX_{i} for the base category unmarried males, δ_{1} + δ_{4} + αX_{i} for unmarried females, δ_{1} + δ_{5} + αX_{i} for married males, and δ_{1} + δ_{6} + αX_{i} for married females. This specification involves the same number of right-hand-side variables as the previous specification with an interaction term, and the regression results for the predicted value of the dependent variable contingent on X_{i}, for any combination of qualitative traits, are identical between the two specifications.
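The equivalence of the two specifications can be checked on simulated data. The sketch below assumes NumPy is available; the sample size, the random seed, and the coefficients used to generate the wages are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (made-up) sample: two dummies and years of education.
female = rng.integers(0, 2, n).astype(float)
married = rng.integers(0, 2, n).astype(float)
educ = rng.integers(8, 21, n).astype(float)
wage = (5.0 + 0.5 * educ - 1.0 * female + 0.8 * married
        - 0.6 * female * married + rng.normal(0.0, 1.0, n))

ones = np.ones(n)

# Specification with an interaction term: columns 1, D2, D3, D2*D3, X.
X_inter = np.column_stack([ones, female, married, female * married, educ])

# Specification with combined categories: columns 1, D4, D5, D6, X.
d4 = female * (1 - married)        # unmarried female
d5 = (1 - female) * married        # married male
d6 = female * married              # married female
X_cells = np.column_stack([ones, d4, d5, d6, educ])

b = np.linalg.lstsq(X_inter, wage, rcond=None)[0]  # beta_1..beta_4, alpha
d = np.linalg.lstsq(X_cells, wage, rcond=None)[0]  # delta_1, delta_4..6, alpha

fitted_inter = X_inter @ b
fitted_cells = X_cells @ d

print("max gap in fitted values:", np.max(np.abs(fitted_inter - fitted_cells)))
print("delta_6 vs beta_2 + beta_3 + beta_4:", d[3], b[1] + b[2] + b[3])
```

In this sketch the fitted values coincide, and the coefficients map onto each other as δ_{4} = β_{2}, δ_{5} = β_{3}, and δ_{6} = β_{2} + β_{3} + β_{4}.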
A model with a dummy dependent variable (also known as a qualitative dependent variable) is one in which the dependent variable, as influenced by the explanatory variables, is qualitative in nature. Some decisions regarding 'how much' of an act must be performed involve a prior decision on whether to perform the act at all. For example, the amount of output to produce and the cost to be incurred involve prior decisions on whether to produce or not and whether to spend or not. Such "prior decisions" become dependent dummies in the regression model.
For example, the decision of a worker to be a part of the labour force becomes a dummy dependent variable. The decision is dichotomous, i.e., the decision has two possible outcomes: yes and no. So the dependent dummy variable Participation would take on the value 1 if participating, 0 if not participating. Some other examples of dichotomous dependent dummies are cited below:
Decision: Choice of Occupation. Dependent Dummy: Supervisory = 1 if supervisor, 0 if not supervisor.
Decision: Affiliation to a Political Party. Dependent Dummy: Affiliation = 1 if affiliated to the party, 0 if not affiliated.
Decision: Retirement. Dependent Dummy: Retired = 1 if retired, 0 if not retired.
When the qualitative dependent dummy variable has more than two values (such as affiliation to many political parties), the model becomes a multi-response, multinomial, or polychotomous model.
Analysis of dependent dummy variable models can be done through different methods. One such method is the usual OLS method, which in this context is called the linear probability model. An alternative method is to assume that there is an unobservable continuous latent variable Y^{*} and that the observed dichotomous variable Y = 1 if Y^{*} > 0, 0 otherwise. This is the underlying concept of the logit and probit models. These models are discussed in brief below.
An ordinary least squares model in which the dependent variable Y is a dichotomous dummy, taking the values of 0 and 1, is the linear probability model (LPM). Suppose we consider the following regression:
Y_{i} = α_{1} + α_{2}X_{i} + u_{i}
where
X = family income
Y = 1 if a house is owned by the family, 0 if a house is not owned by the family
The model is called the linear probability model because the regression is linear. Because Y_{i} takes only the values 0 and 1, the conditional mean of Y_{i} given X_{i}, written as E(Y_{i} | X_{i}), is interpreted as the conditional probability that the event will occur for that value of X_{i}, that is, Pr(Y_{i} = 1 | X_{i}). In this example, E(Y_{i} | X_{i}) gives the probability of a house being owned by a family whose income is given by X_{i}.
Now, using the OLS assumption E(u_{i} | X_{i}) = 0, we get
E(Y_{i} | X_{i}) = α_{1} + α_{2}X_{i}
Some problems are inherent in the LPM:
 The regression line will not fit the data well, and hence measures of goodness of fit, such as R^{2}, will not be reliable.
 Models that are analyzed using the LPM approach will have heteroscedastic disturbances.
 The error term will have a non-normal distribution.
 The LPM may give predicted values of the dependent variable that are greater than 1 or less than 0. These are difficult to interpret, as the predicted values are intended to be probabilities, which must lie between 0 and 1.
 There might exist a nonlinear relationship between the variables of the LPM, in which case the linear regression will not fit the data accurately.
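The out-of-range problem is easy to reproduce. The following sketch fits an LPM by the closed-form OLS formulas for a single regressor; the ten income and ownership observations are made up for illustration.

```python
# Made-up data: family income (arbitrary units) and home ownership (1 = owns).
income = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
owns = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(owns) / n

# Closed-form OLS with one regressor: slope = S_xy / S_xx.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, owns))
s_xx = sum((x - mean_x) ** 2 for x in income)
a2 = s_xy / s_xx
a1 = mean_y - a2 * mean_x

# Fitted "probabilities" from the linear probability model.
fitted = [a1 + a2 * x for x in income]

print("intercept:", round(a1, 4), "slope:", round(a2, 4))
print("lowest fitted value: ", round(min(fitted), 4))
print("highest fitted value:", round(max(fitted), 4))
```

With these illustrative numbers the fitted value falls below 0 for the lowest income and rises above 1 for the highest, so it cannot be read as a probability at either end.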
To avoid the limitations of the LPM, what is needed is a model in which, as the explanatory variable X_{i} increases, P_{i} = E(Y_{i} | X_{i}) = Pr(Y_{i} = 1 | X_{i}) remains within the range between 0 and 1. Thus the relationship between the independent and dependent variables is necessarily nonlinear.
For this purpose, a cumulative distribution function (CDF) can be used to estimate the dependent dummy variable regression. Figure 4 shows an S-shaped curve, which resembles the CDF of a random variable. In this model, the probability lies between 0 and 1 and the nonlinearity is captured. The remaining question is which CDF to use.
Two alternative CDFs can be used: the logistic and normal CDFs. The logistic CDF gives rise to the logit model and the normal CDF gives rise to the probit model.
The shortcomings of the LPM led to the development of a more refined and improved model called the logit model. In the logit model, the cumulative distribution of the error term in the regression equation is logistic. The regression is more realistic in that it is nonlinear.
The logit model is estimated using the maximum likelihood approach. In this model, P(Y = 1 | X), which is the probability of the dependent variable taking the value of 1 given the independent variable, is:
P_{i} = 1 / (1 + e^{−z_{i}}) = e^{z_{i}} / (1 + e^{z_{i}})
where z_{i} = α_{1} + α_{2}X_{i}.
The model is then expressed in terms of the odds: what is modeled in the logistic regression is the natural logarithm of the odds, the odds being defined as P / (1 − P). Taking the natural log of the odds, the logit (L_{i}) is expressed as
L_{i} = ln(P_{i} / (1 − P_{i})) = z_{i} = α_{1} + α_{2}X_{i}.
This relationship shows that L_{i} is linear in relation to X_{i}, but the probabilities are not linear in terms of X_{i}.
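The two statements (linear log-odds, nonlinear probabilities) can be illustrated numerically. In the sketch below the coefficients α_{1} = −4 and α_{2} = 0.5 and the values of X are hypothetical, chosen only to trace out the curve.

```python
import math


def logistic(z: float) -> float:
    """Logistic CDF: P = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p: float) -> float:
    """Logit: natural log of the odds P / (1 - P)."""
    return math.log(p / (1.0 - p))


a1, a2 = -4.0, 0.5  # hypothetical coefficients, for illustration only

for x in (0, 4, 8, 12, 16):
    z = a1 + a2 * x
    p = logistic(z)
    print(f"X = {x:2d}  z = {z:5.1f}  P = {p:.4f}  ln(P/(1-P)) = {log_odds(p):.4f}")
```

Equal steps in X move the log-odds by equal amounts, while the probability changes quickly near 0.5 and slowly near 0 and 1, always staying inside the unit interval.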
Another model that was developed to offset the disadvantages of the LPM is the probit model. The probit model uses the same approach to nonlinearity as does the logit model; however, it uses the normal CDF instead of the logistic CDF.
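The difference between the two link functions can be seen by evaluating both CDFs at the same index values. This is a minimal sketch using Python's standard library (`statistics.NormalDist`); the function names and the index values are chosen here for illustration.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def logit_prob(z: float) -> float:
    """Probability implied by the logistic CDF (logit model)."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_prob(z: float) -> float:
    """Probability implied by the standard normal CDF (probit model)."""
    return STANDARD_NORMAL.cdf(z)


for z in (-3.0, -1.0, 0.0, 1.0, 3.0):  # illustrative index values
    print(f"z = {z:4.1f}  logit P = {logit_prob(z):.4f}  probit P = {probit_prob(z):.4f}")
```

Both functions map any index value into the interval between 0 and 1 and agree at z = 0; the logistic CDF has somewhat heavier tails than the normal CDF.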