## Forecast Friday Topic: Dummy Variables

(Twenty-second in a series)

To date, all of the independent variables we have used in our regression equations have been quantitative; they could be measured or counted. Sometimes, it is important to understand the relationship between a dependent variable and a categorical, or qualitative, variable. For example, an economist who is developing a model to predict salaries for persons in a particular profession might want to see if salaries are different if the employee is male vs. female; white vs. non-white; or college-degreed vs. non-degreed. Whereas quantitative independent variables can be continuous (taking any value from negative to positive infinity), qualitative variables like those the economist is considering take only two values; that is, they are discrete.

Qualitative Variables in Regression Analysis

Heather Hanley is a high school physics teacher who is interested in predicting student scores on the final exam. Heather has a strong hunch that the midterm score is a good predictor of performance on the final, but she is also concerned that female students are not performing as well on the final as male students. Heather wants to test this hypothesis to see whether she needs to adjust her teaching style so that she can help the girls in her class prepare better for the final.

Since Heather teaches her class more or less the same way each year, she pulls the midterm and final scores for the thirty students in her class last year, and notes their gender. She has the following data:

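| Student # | Gender | Midterm | Final |
|---|---|---|---|
| 1 | Male | 91 | 98 |
| 2 | Male | 51 | 59 |
| 3 | Female | 56 | 53 |
| 4 | Male | 79 | 84 |
| 5 | Female | 74 | 77 |
| 6 | Female | 91 | 90 |
| 7 | Male | 65 | 69 |
| 8 | Male | 88 | 97 |
| 9 | Female | 69 | 73 |
| 10 | Female | 84 | 84 |
| 11 | Female | 79 | 75 |
| 12 | Female | 53 | 59 |
| 13 | Male | 85 | 91 |
| 14 | Female | 97 | 97 |
| 15 | Male | 91 | 93 |
| 16 | Female | 81 | 84 |
| 17 | Male | 86 | 90 |
| 18 | Male | 84 | 89 |
| 19 | Male | 79 | 87 |
| 20 | Male | 70 | 77 |
| 21 | Female | 82 | 85 |
| 22 | Female | 82 | 86 |
| 23 | Male | 70 | 80 |
| 24 | Male | 62 | 80 |
| 25 | Female | 77 | 78 |
| 26 | Female | 79 | 85 |
| 27 | Male | 81 | 90 |
| 28 | Female | 85 | 88 |
| 29 | Female | 84 | 86 |
| 30 | Male | 91 | 91 |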

“Male” and “Female” are not quantitative states of nature, so how does Heather create a variable that accounts for gender? She can create a dummy variable. Heather creates a variable called FEMALE. If the student is a girl, then FEMALE=1; otherwise, if the student is a boy, then FEMALE=0. So Heather’s new table looks like this:

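| Student # | Gender | FEMALE | Midterm | Final |
|---|---|---|---|---|
| 1 | Male | 0 | 91 | 98 |
| 2 | Male | 0 | 51 | 59 |
| 3 | Female | 1 | 56 | 53 |
| 4 | Male | 0 | 79 | 84 |
| 5 | Female | 1 | 74 | 77 |
| 6 | Female | 1 | 91 | 90 |
| 7 | Male | 0 | 65 | 69 |
| 8 | Male | 0 | 88 | 97 |
| 9 | Female | 1 | 69 | 73 |
| 10 | Female | 1 | 84 | 84 |
| 11 | Female | 1 | 79 | 75 |
| 12 | Female | 1 | 53 | 59 |
| 13 | Male | 0 | 85 | 91 |
| 14 | Female | 1 | 97 | 97 |
| 15 | Male | 0 | 91 | 93 |
| 16 | Female | 1 | 81 | 84 |
| 17 | Male | 0 | 86 | 90 |
| 18 | Male | 0 | 84 | 89 |
| 19 | Male | 0 | 79 | 87 |
| 20 | Male | 0 | 70 | 77 |
| 21 | Female | 1 | 82 | 85 |
| 22 | Female | 1 | 82 | 86 |
| 23 | Male | 0 | 70 | 80 |
| 24 | Male | 0 | 62 | 80 |
| 25 | Female | 1 | 77 | 78 |
| 26 | Female | 1 | 79 | 85 |
| 27 | Male | 0 | 81 | 90 |
| 28 | Female | 1 | 85 | 88 |
| 29 | Female | 1 | 84 | 86 |
| 30 | Male | 0 | 91 | 91 |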
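As a rough illustration of the recoding step, here is a minimal Python sketch. The helper name `female_dummy` is hypothetical (the post does not say what tool Heather used), and only the first five students from the table are included to keep it short.

```python
# Hypothetical helper: map a gender label to the 0/1 dummy variable FEMALE.
def female_dummy(gender: str) -> int:
    """Return 1 for a female student and 0 for a male student."""
    label = gender.strip().lower()
    if label == "female":
        return 1
    if label == "male":
        return 0
    raise ValueError(f"unexpected gender label: {gender!r}")


# First five students from Heather's table: (gender, midterm, final).
students = [
    ("Male", 91, 98),
    ("Male", 51, 59),
    ("Female", 56, 53),
    ("Male", 79, 84),
    ("Female", 74, 77),
]

# Replace the text label with the dummy so every column is numeric.
rows = [(female_dummy(gender), midterm, final) for gender, midterm, final in students]
print(rows)
```

Raising an error on an unexpected label, rather than silently returning 0, keeps a typo in the data from being counted as "male".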

Now, with FEMALE as a quantitative representation of gender, Heather can run her regression just as easily, using FEMALE and MIDTERM as her independent variables and FINAL as her dependent variable.

Regression Results

Heather runs her regression and obtains the following equation:

FINAL = 15.53 + 0.89*MIDTERM - 5.00*FEMALE

Heather’s R² is 0.913, which means that her model fits the data quite well; her t-values for MIDTERM and FEMALE are 16.38 and -4.02, respectively, both highly significant; and her F-statistic is a strong 142.23, indicating that the model as a whole is significant.
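For readers who want to check the fit themselves, the sketch below runs ordinary least squares on Heather's thirty observations using only the Python standard library. It is one way to do it, not necessarily how the original results were produced; the function names `solve` and `fit_ols` are made up for this example, and in practice a statistics package would also report the t-values and F-statistic.

```python
# Heather's data as (FEMALE, MIDTERM, FINAL), in student order 1..30.
DATA = [
    (0, 91, 98), (0, 51, 59), (1, 56, 53), (0, 79, 84), (1, 74, 77),
    (1, 91, 90), (0, 65, 69), (0, 88, 97), (1, 69, 73), (1, 84, 84),
    (1, 79, 75), (1, 53, 59), (0, 85, 91), (1, 97, 97), (0, 91, 93),
    (1, 81, 84), (0, 86, 90), (0, 84, 89), (0, 79, 87), (0, 70, 77),
    (1, 82, 85), (1, 82, 86), (0, 70, 80), (0, 62, 80), (1, 77, 78),
    (1, 79, 85), (0, 81, 90), (1, 85, 88), (1, 84, 86), (0, 91, 91),
]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(data):
    """Least-squares fit of FINAL on an intercept, MIDTERM and FEMALE."""
    X = [[1.0, float(mid), float(fem)] for fem, mid, _ in data]
    y = [float(final) for _, _, final in data]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return beta, 1.0 - sse / sst


(intercept, b_midterm, b_female), r_squared = fit_ols(DATA)
print(f"FINAL = {intercept:.2f} + {b_midterm:.2f}*MIDTERM + ({b_female:.2f})*FEMALE")
print(f"R-squared = {r_squared:.3f}")
```

The printed coefficients and R-squared should agree, after rounding, with the values Heather reports.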

The regression results confirm Heather’s concern: holding the midterm score constant, female students tend to score an average of 5 points lower than male students on the physics final. This does not mean that female students underperform male students on the final because they are female; remember, regression analysis cannot prove causality. As indicated earlier, Heather feels her style for prepping students for the final might favor boys over girls (unintentionally, of course)!

In addition, each one-point increase in the midterm score is associated with an average increase of 0.89 points on the final.

Interpretation of the Dummy Variable

Essentially, the model tells us that if two of Heather’s students this year, say Joe (male) and Shari (female), each score a 60 on the midterm, Joe’s expected score on the final will be 68.93 (0.89*60 - 5.00*0 + 15.53), while Shari’s will be 63.93 (0.89*60 - 5.00*1 + 15.53).
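The same arithmetic can be written as a small prediction function. This is only a sketch: the function name `predict_final` is an assumption, and the coefficients are the rounded values quoted in this post.

```python
# Coefficients as reported in the post (rounded to two decimals).
INTERCEPT = 15.53
B_MIDTERM = 0.89
B_FEMALE = -5.00


def predict_final(midterm: float, female: int) -> float:
    """Expected final score given a midterm score and the FEMALE dummy (0 or 1)."""
    return INTERCEPT + B_MIDTERM * midterm + B_FEMALE * female


joe = predict_final(60, female=0)
shari = predict_final(60, female=1)
print(f"Joe: {joe:.2f}, Shari: {shari:.2f}, gap: {joe - shari:.2f}")
```

Whatever midterm score you plug in, the gap between the two predictions stays at 5 points, because the dummy only moves the intercept.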

Notice that although we are using multiple regression, we only have a single slope: 0.89. The coefficient for FEMALE has no effect if the student is male; yet the coefficient for MIDTERM applies to all students, male or female. Hence, this dummy variable affects the intercept of the regression equation.

You can picture this as two parallel lines, each with a slope of 0.89: one for male students and one, five points lower, for female students.

Essentially, the dummy variable captures a structural shift (here, a difference in gender) that moves the intercept of the final exam score model. Since the value of FEMALE is zero if the student is male, the intercept for male students is 15.53. If the student is female, FEMALE is 1; multiply that by its coefficient and you get -5.00. Added to the intercept, this gives female students an effective intercept of 10.53.

Multiple Dummy Variables

Could Heather have instead created a variable MALE, and given it a value of 1 if male and 0 if female? Most definitely. In that case, the parameter estimate would still have a value of 5.00, but would have a positive sign, and the intercept would become 10.53, the baseline for female students.
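To see why the two codings describe the same model, substitute FEMALE = 1 - MALE into the fitted equation. The short sketch below checks this numerically with the rounded coefficients from this post; both function names are hypothetical.

```python
def predict_with_female(midterm, female):
    # Coding used in the post: FEMALE = 1 for girls, 0 for boys.
    return 15.53 + 0.89 * midterm - 5.00 * female


def predict_with_male(midterm, male):
    # Alternative coding: MALE = 1 - FEMALE. Substituting into the equation
    # above gives an intercept of 15.53 - 5.00 = 10.53 and a +5.00 coefficient.
    return 10.53 + 0.89 * midterm + 5.00 * male


for midterm in (50, 60, 75, 90):
    for female in (0, 1):
        a = predict_with_female(midterm, female)
        b = predict_with_male(midterm, 1 - female)
        assert abs(a - b) < 1e-9  # identical predictions under either coding
        print(midterm, "female" if female else "male", round(a, 2))
```

Only the reference group changes: with FEMALE the intercept belongs to male students, and with MALE it belongs to female students.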

What if Heather had sophomores, juniors, and seniors in her class? Can she use dummy variables for those three outcomes? She sure can. However, she would need to create two dummy variables, perhaps one for SENIOR and one for JUNIOR. Or she could use SOPHOMORE and JUNIOR. It doesn’t matter. In the former case, a senior would be flagged as 1 for SENIOR and 0 for JUNIOR. A junior would be flagged as 0 for SENIOR and 1 for JUNIOR. A sophomore would be flagged as 0 for both.
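The flagging scheme for three class years can be sketched as follows. The helper `make_dummies` and its arguments are invented for illustration; it creates one 0/1 dummy for every level except a chosen reference level, which is the group flagged 0 on everything.

```python
def make_dummies(value, levels, reference):
    """Return a dict of 0/1 dummies for every level except the reference."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {lvl.upper(): int(value == lvl) for lvl in levels if lvl != reference}


LEVELS = ["sophomore", "junior", "senior"]

# Sophomores are the reference group here, so they get 0 on both dummies.
for year in LEVELS:
    print(year, make_dummies(year, LEVELS, reference="sophomore"))
```

Choosing a different reference level changes which group the other coefficients are compared against, but not the fit of the model.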

Why can’t Heather keep just one variable and label sophomores 0, juniors 1 and seniors 2? Because that would treat a qualitative variable as if it were quantitative. It imposes an ordering and forces the gap between sophomores and juniors to equal the gap between juniors and seniors, which the data may not support. And why not label FEMALE as 1 for female and 2 for male? With only two categories the fit would be unchanged, but the intercept would no longer correspond to either group, making the results harder to interpret. The 0/1 coding avoids both problems.

The number of dummy variables you use is always 1 less than the number of possible outcomes you’re trying to classify. So, if you want to compare five categories, you would need four dummy variables.

Why One Less Dummy Variable Than Number of Outcomes?

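Why don’t you use a dummy variable for all categories? Very simple. Let’s go back to our example of FEMALE. If we also create a dummy variable MALE and include both in the model, it is going to cause a problem. Whenever FEMALE=1, MALE=0; whenever FEMALE=0, MALE=1. In other words, FEMALE + MALE always equals 1, which is exactly the constant that the intercept already represents. Notice the problem: perfect multicollinearity! The regression cannot separate the effects of the two dummies from the intercept.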
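A small numeric check makes the problem concrete. The sketch below uses a made-up sample of six students (not Heather's data) and shows that the matrix X'X, which least squares has to invert, has a determinant of zero when the intercept, FEMALE and MALE are all included. The helper names are hypothetical.

```python
# Hypothetical mini-sample: FEMALE flags for six students.
female = [0, 0, 1, 0, 1, 1]
male = [1 - f for f in female]

# Design matrices: intercept + both dummies, and intercept + FEMALE only.
X_both = [[1, f, m] for f, m in zip(female, male)]
X_one = [[1, f] for f in female]


def gram(X):
    """Compute X'X, the matrix least squares needs to invert."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def det3(a):
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


print("both dummies:", det3(gram(X_both)))  # zero: X'X cannot be inverted
print("FEMALE only: ", det2(gram(X_one)))   # nonzero: the model can be estimated
```

Dropping either dummy (or, less commonly, dropping the intercept) removes the redundancy.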

Can Dummy Variables be Overdone?

Absolutely. Remember, each dummy variable you add reduces your degrees of freedom. On a small data set, several dummy variables could lead to insignificant t-values. Also, if you have two dummy variables for three outcomes, there’s always the possibility that one or both may not produce significant t-values, at which point you may need to collapse the categories into fewer classifications. So dummy variables should be used sparingly, and for qualitative variables with as few classifications as possible.

Next Forecast Friday Topic:
Testing For Seasonal Effects with Dummy Variables

In this example, dummy variables helped Heather Hanley determine that the gender of her students is correlated with performance on the final. Yet dummy variables can be used for much more, like testing for seasonal effects. Next week, we will discuss how to test for seasonality using dummy variables.
