(Twenty-second in a series)
To date, all of the independent variables we have used in our regression equations have been quantitative; they could easily be counted or measured. Sometimes, however, it is important to understand the relationship between a dependent variable and a categorical, or qualitative, variable. For example, an economist who is developing a model to predict salaries for persons in a particular profession might want to see whether salaries differ depending on whether the employee is male or female; white or non-white; or college-degreed or non-degreed. Whereas quantitative independent variables can be continuous (any value from negative to positive infinity), qualitative variables like those the economist is considering have only two values; that is, they are discrete.
Qualitative Variables in Regression Analysis
Heather Hanley is a high school physics teacher who is interested in predicting student scores on the final exam. Heather has a strong hunch that the midterm score is a good predictor of performance on the final, but she is also concerned that female students are not performing as well on the final as males. Heather wants to test this hypothesis to see if she needs to adjust her teaching style so that she can help the girls in her class prepare better for the final.
Since Heather teaches her class more or less the same way each year, she pulls the midterm and final scores for the thirty students in her class last year, and notes their gender. She has the following data:
Student # | Gender | Midterm | Final
1 | Male | 91 | 98
2 | Male | 51 | 59
3 | Female | 56 | 53
4 | Male | 79 | 84
5 | Female | 74 | 77
6 | Female | 91 | 90
7 | Male | 65 | 69
8 | Male | 88 | 97
9 | Female | 69 | 73
10 | Female | 84 | 84
11 | Female | 79 | 75
12 | Female | 53 | 59
13 | Male | 85 | 91
14 | Female | 97 | 97
15 | Male | 91 | 93
16 | Female | 81 | 84
17 | Male | 86 | 90
18 | Male | 84 | 89
19 | Male | 79 | 87
20 | Male | 70 | 77
21 | Female | 82 | 85
22 | Female | 82 | 86
23 | Male | 70 | 80
24 | Male | 62 | 80
25 | Female | 77 | 78
26 | Female | 79 | 85
27 | Male | 81 | 90
28 | Female | 85 | 88
29 | Female | 84 | 86
30 | Male | 91 | 91
“Male” and “Female” are not quantitative states of nature, so how does Heather create a variable that accounts for gender? She can create a dummy variable. Heather creates a variable called FEMALE. If the student is a girl, then FEMALE=1; otherwise, if the student is a boy, then FEMALE=0. So Heather’s new table looks like this:
Student # | Gender | FEMALE | Midterm | Final
1 | Male | 0 | 91 | 98
2 | Male | 0 | 51 | 59
3 | Female | 1 | 56 | 53
4 | Male | 0 | 79 | 84
5 | Female | 1 | 74 | 77
6 | Female | 1 | 91 | 90
7 | Male | 0 | 65 | 69
8 | Male | 0 | 88 | 97
9 | Female | 1 | 69 | 73
10 | Female | 1 | 84 | 84
11 | Female | 1 | 79 | 75
12 | Female | 1 | 53 | 59
13 | Male | 0 | 85 | 91
14 | Female | 1 | 97 | 97
15 | Male | 0 | 91 | 93
16 | Female | 1 | 81 | 84
17 | Male | 0 | 86 | 90
18 | Male | 0 | 84 | 89
19 | Male | 0 | 79 | 87
20 | Male | 0 | 70 | 77
21 | Female | 1 | 82 | 85
22 | Female | 1 | 82 | 86
23 | Male | 0 | 70 | 80
24 | Male | 0 | 62 | 80
25 | Female | 1 | 77 | 78
26 | Female | 1 | 79 | 85
27 | Male | 0 | 81 | 90
28 | Female | 1 | 85 | 88
29 | Female | 1 | 84 | 86
30 | Male | 0 | 91 | 91
Now, with FEMALE as a quantitative representation of gender, Heather can run her regression just as easily, using FEMALE and MIDTERM as her independent variables and FINAL as her dependent variable.
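If Heather kept her gradebook in code rather than a spreadsheet, the recoding step might look like the minimal sketch below. The list name `genders` and the five sample entries (the first five students in the table) are just for illustration; the only rule taken from the post is FEMALE=1 for girls and FEMALE=0 for boys.

```python
# Gender labels for the first five students in Heather's table.
genders = ["Male", "Male", "Female", "Male", "Female"]

# Dummy variable: 1 if the student is female, 0 otherwise.
female = [1 if g == "Female" else 0 for g in genders]

for g, f in zip(genders, female):
    print(f"{g:<6} -> FEMALE={f}")
```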
Regression Results
Heather runs her regression and obtains the following fitted equation:
FINAL = 15.53 + 0.89*MIDTERM - 5.00*FEMALE
Heather’s R^{2}=0.913, which means that her model fits the data quite well; her t-values for MIDTERM and FEMALE are 16.38 and -4.02, respectively, both highly significant; and her F-statistic is a strong 142.23, suggesting that the model as a whole is significant.
The regression results confirm Heather’s concern: for a given midterm score, female students tend to underperform male students on the physics final by an average of 5 points. This does not mean that female students underperform male students on the final because they are female; remember, regression analysis cannot prove causality. As indicated earlier, Heather feels her style for prepping students for the final might favor boys over girls (unintentionally, of course)!
In addition, each one-point increase in the midterm score is associated with an average increase of 0.89 points on the final.
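For readers who want to reproduce the coefficients themselves, here is a rough sketch that fits the same model by ordinary least squares with NumPy. The post does not say which software Heather used, so the choice of `numpy.linalg.lstsq` and the variable names are assumptions; the data are the thirty rows from the table above.

```python
import numpy as np

# (gender, midterm, final) for the 30 students in Heather's table.
records = [
    ("Male", 91, 98), ("Male", 51, 59), ("Female", 56, 53),
    ("Male", 79, 84), ("Female", 74, 77), ("Female", 91, 90),
    ("Male", 65, 69), ("Male", 88, 97), ("Female", 69, 73),
    ("Female", 84, 84), ("Female", 79, 75), ("Female", 53, 59),
    ("Male", 85, 91), ("Female", 97, 97), ("Male", 91, 93),
    ("Female", 81, 84), ("Male", 86, 90), ("Male", 84, 89),
    ("Male", 79, 87), ("Male", 70, 77), ("Female", 82, 85),
    ("Female", 82, 86), ("Male", 70, 80), ("Male", 62, 80),
    ("Female", 77, 78), ("Female", 79, 85), ("Male", 81, 90),
    ("Female", 85, 88), ("Female", 84, 86), ("Male", 91, 91),
]

# Build the FEMALE dummy and the two score columns.
female = np.array([1.0 if g == "Female" else 0.0 for g, _, _ in records])
midterm = np.array([m for _, m, _ in records], dtype=float)
final = np.array([f for _, _, f in records], dtype=float)

# Design matrix: a column of ones (intercept), MIDTERM, FEMALE.
X = np.column_stack([np.ones(len(records)), midterm, female])

# Ordinary least squares fit of FINAL on the design matrix.
coef, *_ = np.linalg.lstsq(X, final, rcond=None)
intercept, b_midterm, b_female = coef

print(f"intercept={intercept:.2f}  MIDTERM={b_midterm:.2f}  FEMALE={b_female:.2f}")
```

The printed coefficients should land close to the intercept, MIDTERM, and FEMALE estimates quoted in this post. This sketch does not compute the t-values, F-statistic, or R^{2}; a full regression package would report those.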
Interpretation of the Dummy Variable
Essentially, the model tells us that if two of Heather’s students this year, say Joe (male) and Shari (female), each score a 60 on the midterm, Joe’s expected score on the final will be 68.93 (0.89*60 - 5.00*0 + 15.53), while Shari’s will be 63.93 (0.89*60 - 5.00*1 + 15.53).
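The arithmetic for Joe and Shari can be wrapped in a small helper. This is only a sketch: the function name `predict_final` and the constant names are made up for illustration, while the three coefficients are the rounded estimates used in the calculation above.

```python
# Rounded coefficient estimates from Heather's regression.
INTERCEPT = 15.53
B_MIDTERM = 0.89
B_FEMALE = -5.00


def predict_final(midterm, female):
    """Expected final score; female is 1 for a girl, 0 for a boy."""
    return INTERCEPT + B_MIDTERM * midterm + B_FEMALE * female


joe = predict_final(60, 0)    # male student with a 60 on the midterm
shari = predict_final(60, 1)  # female student with a 60 on the midterm

print(f"Joe: {joe:.2f}, Shari: {shari:.2f}, gap: {joe - shari:.2f}")
```

Whatever midterm score you plug in, the gap between the two predictions stays at 5 points, because the dummy variable moves only the intercept.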
Notice that although we are using multiple regression, we only have a single slope: 0.89. The coefficient for FEMALE has no effect if the student is male; yet the coefficient for MIDTERM applies to all students, male or female. Hence, this dummy variable affects the intercept of the regression equation.
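You can visually depict this as two parallel regression lines, one for male students and one for female students, with the same slope but different intercepts.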
Essentially, the dummy variable captures a structural shift (like a difference in gender) that might shift the intercept of the Final exam score model lower. Since the value of FEMALE is zero if the student is a male, then the intercept is 15.53; but if the student is female, the value of FEMALE is 1; multiply that by its coefficient, and you get -5.00. When applied towards the intercept, it’s as if female students’ intercept is 10.53.
Multiple Dummy Variables
Could Heather have instead created a variable MALE, and given it a value of 1 if male and 0 if female? Most definitely. In that case, the parameter estimate would still have a value of 5.00, but would have a positive sign, and the intercept would become 10.53, the intercept for female students.
What if Heather had sophomores, juniors, and seniors in her class? Can she use dummy variables for those three outcomes? She sure can. However, she would need to create two dummy variables. Perhaps one for SENIOR and one for JUNIOR. Or she could do SOPHOMORE and JUNIOR. It doesn’t matter. In the former, a senior would be flagged as 1 for SENIOR and 0 for JUNIOR. A junior would be flagged as 0 for SENIOR and 1 for JUNIOR. A sophomore would be flagged as 0 for both.
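As a quick illustration of the SENIOR/JUNIOR scheme just described, here is a minimal sketch. The function name `class_year_dummies` and the lowercase labels are hypothetical; sophomores serve as the baseline category, flagged 0 on both dummies.

```python
def class_year_dummies(year):
    """Return the (SENIOR, JUNIOR) dummies; sophomore is the baseline."""
    if year not in ("sophomore", "junior", "senior"):
        raise ValueError(f"unexpected class year: {year!r}")
    senior = 1 if year == "senior" else 0
    junior = 1 if year == "junior" else 0
    return senior, junior


for year in ("senior", "junior", "sophomore"):
    senior, junior = class_year_dummies(year)
    print(f"{year:<9} SENIOR={senior} JUNIOR={junior}")
```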
Why can’t Heather keep just one dummy variable and label sophomores 0, juniors 1 and seniors 2? Because that would treat a qualitative, discrete variable as if it were quantitative. It would impose an ordering and force the effect of moving from sophomore to junior to equal the effect of moving from junior to senior, which the data may not support. And why couldn’t she label FEMALE as 1 for female and 2 for male? With only two categories the fit would not change, but the intercept would no longer correspond to either group, which makes the results harder to interpret. The 0/1 coding avoids both problems.
The number of dummy variables you use is always 1 less than the number of possible outcomes you’re trying to classify. So, if you want to compare five categories, you would need four dummy variables.
Why One Less Dummy Variable Than Number of Outcomes?
Why don’t you use a dummy variable for all categories? Very simple. Let’s go back to our example of FEMALE. If we also create a dummy variable MALE, it is going to cause a problem. Whenever FEMALE=1, MALE=0; whenever FEMALE=0, MALE=1. Notice the problem: perfect multicollinearity! Each variable is completely determined by the other (MALE = 1 - FEMALE), so the regression cannot separate their effects.
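One way to see the problem numerically is to check the rank of the design matrix. The sketch below uses a made-up six-student sample and NumPy’s `matrix_rank`; the variable names are for illustration only. With an intercept column plus both FEMALE and MALE, the three columns carry only two columns’ worth of information.

```python
import numpy as np

# Made-up sample of six students: 1 = female, 0 = male.
female = np.array([0, 0, 1, 0, 1, 1], dtype=float)
male = 1.0 - female            # MALE is fully determined by FEMALE
const = np.ones_like(female)   # intercept column

# Design matrix with the intercept and BOTH dummies.
X_both = np.column_stack([const, female, male])
rank_both = np.linalg.matrix_rank(X_both)

# Design matrix with the intercept and only FEMALE.
X_one = np.column_stack([const, female])
rank_one = np.linalg.matrix_rank(X_one)

print(f"both dummies: {X_both.shape[1]} columns, rank {rank_both}")
print(f"one dummy:    {X_one.shape[1]} columns, rank {rank_one}")
```

When the rank is lower than the number of columns, the least-squares coefficients are not uniquely determined, which is exactly why one category has to be left out.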
Can Dummy Variables be Overdone?
Absolutely. Remember, each dummy variable you add reduces your degrees of freedom. On a small data set, several dummy variables could lead to insignificant t-values. Also, if you have two dummy variables for three outcomes, there’s always the possibility that one or both may not produce significant t-values, at which point you may need to collapse the categories into fewer classifications. So dummy variables should be used sparingly, and for qualitative variables with as few classifications as possible.
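To make the degrees-of-freedom point concrete, here is a small sketch using the standard count for a regression with an intercept: observations minus independent variables minus one. The function name `residual_df` and the ten-student scenario are hypothetical.

```python
def residual_df(n_obs, n_predictors):
    """Residual degrees of freedom for a regression with an intercept."""
    return n_obs - n_predictors - 1


# Heather's model: 30 students, MIDTERM and FEMALE.
heather_df = residual_df(30, 2)

# Hypothetical small class: 10 students, MIDTERM plus four dummies.
small_class_df = residual_df(10, 5)

print(f"Heather's model: {heather_df} degrees of freedom")
print(f"Small class with four dummies: {small_class_df} degrees of freedom")
```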
Next Forecast Friday Topic:
Testing For Seasonal Effects with Dummy Variables
In this example, dummy variables helped Heather Hanley determine that the gender of her students is associated with performance on the final. Yet dummy variables can be used for much more, like testing for seasonal effects. Next week, we will discuss how to test for seasonality using dummy variables.