Archive for September, 2010

Forecast Friday Topic: Seasonal Dummy Variables

September 30, 2010

(Twenty-third in a series)

Last week, I introduced you to the use of dummy variables as a means of incorporating qualitative information into a regression model. Dummy variables can also be used to account for seasonality. A couple of weeks ago, we discussed adjusting your data for seasonality before constructing your model; as you saw, that can be pretty time-consuming. A faster approach is to take the raw time series data and add a dummy variable for each season of the year less one. So, if you're working with quarterly data, you would use three dummy variables; if you have monthly data, you would add 11 dummy variables.

For example, the fourth quarter of the year is typically the busiest for retailers. If a retail chain didn't seasonally adjust its data, it might create three dummy variables: D1, D2, and D3. The first quarter of the year would be D1; the second quarter, D2; and the third quarter, D3. As we discussed last week, we always want one fewer dummy variable than we have outcomes. In this example, since the fourth quarter is the busiest, we would expect all three dummy variables to be significant and negative.
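If you prefer to build these dummies in code, here is a minimal sketch using pandas; the column names and the sales figures are hypothetical, invented purely to illustrate the setup:

```python
import pandas as pd

# Three years of hypothetical quarterly sales (values invented for
# illustration); the fourth quarter is the busiest each year.
df = pd.DataFrame({
    "quarter": [1, 2, 3, 4] * 3,
    "sales":   [210, 240, 230, 390, 225, 255, 245, 420, 240, 270, 260, 450],
})

# Three dummies for four quarters: Q4 is the omitted baseline, so each
# dummy's coefficient measures that quarter's typical gap versus Q4.
for q in (1, 2, 3):
    df[f"D{q}"] = (df["quarter"] == q).astype(int)

print(df.head())
```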

Revisiting Billie Burton

A couple of weeks ago, while discussing how to decompose a time series, I used the example of Billie Burton, a businesswoman who makes gift baskets. Billie had been trying to forecast orders for planning and budgeting purposes. She had five years of monthly order data:

TOTAL GIFT BASKET ORDERS

| Month     | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|------|------|------|------|------|
| January   | 15   | 18   | 22   | 26   | 31   |
| February  | 30   | 36   | 43   | 52   | 62   |
| March     | 25   | 18   | 22   | 43   | 32   |
| April     | 15   | 30   | 36   | 27   | 52   |
| May       | 13   | 16   | 19   | 23   | 28   |
| June      | 14   | 17   | 20   | 24   | 29   |
| July      | 12   | 14   | 17   | 20   | 24   |
| August    | 22   | 26   | 31   | 37   | 44   |
| September | 20   | 24   | 29   | 35   | 42   |
| October   | 14   | 17   | 20   | 24   | 29   |
| November  | 35   | 42   | 50   | 60   | 72   |
| December  | 40   | 48   | 58   | 70   | 84   |

You recall the painstaking effort we went through to adjust Billie’s orders for seasonality. Is there a simpler way? Yes. We can use dummy variables. Let’s first assume Billie ran her regression on the data just as it is, with no adjustment for seasonality. She ends up with the following regression equation:

Ŷ = 0.518t + 15.829

This model suggests an upward trend with each passing month but doesn't fit the data quite as well as we would like: R² is just 0.313 and the F-statistic is just 26.47.

Imagine now that Billie decides to use seasonal dummy variables. Since her data is monthly, Billie must use 11 dummy variables. Since December is her busiest month, Billie makes it the baseline and creates one dummy variable for each month from January to November: D1 is January, D2 is February, and so on up to D11, which is November. Hence, in January, D1 will be flagged as a 1 and D2 through D11 will be 0. In February, D2 will equal 1 while all the other dummies will be zero. And so forth. Note that all dummies will be zero in December.

Picture in your mind a table with 60 rows and 13 columns, one row per month from January 2005 through December 2009. The first column is the number of orders for the month; the second is the time period, t, which runs from 1 to 60 (that's the independent variable from our original model). The next eleven columns are the dummy variables. Billie enters these into Excel and runs her regression. What does she get?
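If Billie preferred scripting to Excel, she could build the same 60-row, 13-column layout and fit the model in Python. This is just a sketch, assuming the pandas and statsmodels libraries; the order figures are her actual data from the table above:

```python
import pandas as pd
import statsmodels.api as sm

# Billie's monthly orders; each list holds the 2005-2009 values for that month.
orders_by_month = {
    "Jan": [15, 18, 22, 26, 31], "Feb": [30, 36, 43, 52, 62],
    "Mar": [25, 18, 22, 43, 32], "Apr": [15, 30, 36, 27, 52],
    "May": [13, 16, 19, 23, 28], "Jun": [14, 17, 20, 24, 29],
    "Jul": [12, 14, 17, 20, 24], "Aug": [22, 26, 31, 37, 44],
    "Sep": [20, 24, 29, 35, 42], "Oct": [14, 17, 20, 24, 29],
    "Nov": [35, 42, 50, 60, 72], "Dec": [40, 48, 58, 70, 84],
}
months = list(orders_by_month)

# Flatten to chronological order: Jan 2005, Feb 2005, ..., Dec 2009.
orders = [orders_by_month[m][yr] for yr in range(5) for m in months]

df = pd.DataFrame({"orders": orders})
df["t"] = range(1, 61)

# Eleven dummies, D1 (January) through D11 (November); December is the baseline.
for i in range(11):
    df[f"D{i+1}"] = [1 if k % 12 == i else 0 for k in range(60)]

X = sm.add_constant(df[["t"] + [f"D{i+1}" for i in range(11)]])
model = sm.OLS(df["orders"], X).fit()
print(model.summary())  # coefficients should match the table below
```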

I’m going to show you the resulting equation in tabular form, as it would look far too complicated in standard form. Billie gets the following output:

| Parameter      | Coefficient | t-Statistic |
|----------------|-------------|-------------|
| Intercept      | 42.93       | 15.93       |
| t              | 0.47        | 12.13       |
| D1 (January)   | -32.38      | -9.88       |
| D2 (February)  | -10.66      | -3.26       |
| D3 (March)     | -27.73      | -8.48       |
| D4 (April)     | -24.21      | -7.41       |
| D5 (May)       | -36.88      | -11.31      |
| D6 (June)      | -36.35      | -11.16      |
| D7 (July)      | -40.23      | -12.35      |
| D8 (August)    | -26.10      | -8.02       |
| D9 (September) | -28.58      | -8.79       |
| D10 (October)  | -38.25      | -11.76      |
| D11 (November) | -7.73       | -2.38       |

Billie gets a great model: all of the dummy coefficients are significant and negative, consistent with December being the busiest month. Billie's R² has now shot up to 0.919, indicating a much better fit, and the F-statistic has risen to 44.73, more significant than before.

How does this compare to Billie’s model on her seasonally-adjusted data? Recall that when doing her regressions on seasonally adjusted data, Billie got the following results:

Ŷ = 0.47t + 17.12

Her model had an R² of 0.872, but her F-statistic was almost 395! So, even though Billie gained a few more points of R² with the seasonal dummies, her F-statistic wasn't as large. However, Billie's F-statistic using the dummy variables is still very strong, and I would argue more stable. Recall that the F-statistic is the mean square of the regression divided by the mean squared error of the residuals. The mean square of the regression is the sum of squares due to regression (SSR) divided by the number of independent variables in the model, k; the mean squared error of the residuals is the sum of squared errors (SSE) divided by the number of observations less the number of independent variables, less one more. That is, F = (SSR / k) / (SSE / (n - k - 1)). To illustrate, here is a side-by-side comparison:

| Statistic                       | Seasonally Adjusted Model | Seasonal Dummy Model |
|---------------------------------|---------------------------|----------------------|
| # Observations                  | 60                        | 60                   |
| SSR                             | 3,982                     | 14,179               |
| # Independent Variables         | 1                         | 12                   |
| Mean Square of Regression       | 3,982                     | 1,182                |
| SSE                             | 585                       | 1,241                |
| Residual Degrees of Freedom     | 58                        | 47                   |
| Mean Squared Error of Residuals | 10.08                     | 26.41                |
| F-Statistic                     | 394.91                    | 44.73                |

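A quick sketch to verify the arithmetic in the table (small rounding differences aside):

```python
def f_statistic(ssr: float, sse: float, k: int, n: int) -> float:
    """F = (SSR / k) / (SSE / (n - k - 1))."""
    msr = ssr / k            # mean square of the regression
    mse = sse / (n - k - 1)  # mean squared error of the residuals
    return msr / mse

print(f_statistic(ssr=3982,  sse=585,  k=1,  n=60))   # ~394.9
print(f_statistic(ssr=14179, sse=1241, k=12, n=60))   # ~44.7
```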
So, although the F-statistic is much lower for the seasonal dummy model, so is its mean square of the regression. As a result, the dummy model's F-statistic is still quite significant, and the model is arguably more stable than the one-variable model built on the seasonally adjusted data.

It is important to note that sometimes data sets do not lend themselves well to seasonal dummies, and that the manual adjustment process we worked through a few weeks ago may be a better approach.

Next Forecast Friday Topic: Slope Dummy Variables

The dummy variables we worked with last week and this week are intercept dummies: they alter the Y-intercept of the regression equation. Sometimes, however, it is the slope of the equation that needs to change. We will discuss how slope dummies are used in next week's Forecast Friday post.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Data, Data Everywhere

September 29, 2010

Every time we use a cell phone, surf the Web, interact on Facebook, make a purchase, what have you, we create data that businesses, charities, and other organizations analyze to learn more about us.

While such a scenario sounds Orwellian, it is not necessarily terrible.  For example, if your local supermarket chain knows from your frequent shopper card that you buy Kashi Go-Lean cereal four or five packages at a time, you might appreciate them telling you when Kashi goes on sale.  It’s a win-win situation: you want to stock up on Kashi for the best price, and the store wants to bait you with the Kashi in the hopes you’ll buy more than the cereal.

But I digress.  The fact that we create data seamlessly and almost instantaneously with every one of life's transactions has greatly increased the demand for tools, and for specialized professionals, to analyze that data and help companies turn it into actionable information.  In fact, IBM is banking its future growth on analytics, a market estimated to be worth $100 billion, with its planned purchase of Netezza, announced last week.

Analytics is big business, and even if your job description doesn't require you to analyze data, you should be aware of it.  Almost anything electronic can be tracked and monitored these days.  Anytime you get an email offer from an online retailer you've done business with, or direct mail from a charity or other retailer, you've been selected by analytical tools that examined your past purchasing and giving history.

If you run a business, you should be cognizant of all the data you accumulate and the ways in which you accumulate it.  What's more, you should weigh the data you're currently collecting against the decisions it helps you make, so that you can identify additional data you may need.  This can be a goldmine for helping you better understand your customers' needs and wants, identify new trends and changing patterns, and develop new products and services in response to those changing needs and wants.

Data and the need to analyze it are here to stay.

Data Mining Meets Online Dating

September 28, 2010

The September 27 issue of Fortune Magazine had two stories in it that pertain to data mining and predictive modeling.  One of them, eHarmony's Algorithm of Love, is an interesting account of how eHarmony uses predictive analytics tools to maximize the likelihood of a couple being a good match.  Since the article is brief, any commentary I might add – other than "it drives home the points I've been making" – would simply parrot the article.  So I thought I'd let you click on the link and enjoy!

Forecast Friday Topic: Dummy Variables

September 23, 2010

(Twenty-second in a series)

To date, all of the independent variables we have used in our regression equations have been quantitative: they can be measured numerically. Sometimes, however, it is important to understand the relationship between a dependent variable and a categorical, or qualitative, variable. For example, an economist developing a model to predict salaries in a particular profession might want to see whether salaries differ by whether the employee is male vs. female, white vs. non-white, or college-degreed vs. non-degreed. Whereas quantitative independent variables can be continuous (taking any value from negative to positive infinity), qualitative variables like those the economist is considering take only two values; that is, they are discrete.

Qualitative Variables in Regression Analysis

Heather Hanley is a high school physics teacher who is interested in predicting student scores on the final exam. Heather has a strong hunch that the midterm score is a good predictor of performance on the final, but she is also concerned that female students are not performing as well on the final as males. Heather wants to test this hypothesis to see whether she needs to adjust her teaching style so that she can help the girls in her class prepare better for the final.

Since Heather teaches her class more or less the same way each year, she pulls the midterm and final scores for the thirty students in her class last year, and notes their gender. She has the following data:

| Student # | Gender | Midterm | Final |
|-----------|--------|---------|-------|
| 1         | Male   | 91      | 98    |
| 2         | Male   | 51      | 59    |
| 3         | Female | 56      | 53    |
| 4         | Male   | 79      | 84    |
| 5         | Female | 74      | 77    |
| 6         | Female | 91      | 90    |
| 7         | Male   | 65      | 69    |
| 8         | Male   | 88      | 97    |
| 9         | Female | 69      | 73    |
| 10        | Female | 84      | 84    |
| 11        | Female | 79      | 75    |
| 12        | Female | 53      | 59    |
| 13        | Male   | 85      | 91    |
| 14        | Female | 97      | 97    |
| 15        | Male   | 91      | 93    |
| 16        | Female | 81      | 84    |
| 17        | Male   | 86      | 90    |
| 18        | Male   | 84      | 89    |
| 19        | Male   | 79      | 87    |
| 20        | Male   | 70      | 77    |
| 21        | Female | 82      | 85    |
| 22        | Female | 82      | 86    |
| 23        | Male   | 70      | 80    |
| 24        | Male   | 62      | 80    |
| 25        | Female | 77      | 78    |
| 26        | Female | 79      | 85    |
| 27        | Male   | 81      | 90    |
| 28        | Female | 85      | 88    |
| 29        | Female | 84      | 86    |
| 30        | Male   | 91      | 91    |

“Male” and “Female” are not quantitative states of nature, so how does Heather create a variable that accounts for gender? She can create a dummy variable. Heather creates a variable called FEMALE. If the student is a girl, then FEMALE=1; otherwise, if the student is a boy, then FEMALE=0. So Heather’s new table looks like this:

| Student # | Gender | FEMALE | Midterm | Final |
|-----------|--------|--------|---------|-------|
| 1         | Male   | 0      | 91      | 98    |
| 2         | Male   | 0      | 51      | 59    |
| 3         | Female | 1      | 56      | 53    |
| 4         | Male   | 0      | 79      | 84    |
| 5         | Female | 1      | 74      | 77    |
| 6         | Female | 1      | 91      | 90    |
| 7         | Male   | 0      | 65      | 69    |
| 8         | Male   | 0      | 88      | 97    |
| 9         | Female | 1      | 69      | 73    |
| 10        | Female | 1      | 84      | 84    |
| 11        | Female | 1      | 79      | 75    |
| 12        | Female | 1      | 53      | 59    |
| 13        | Male   | 0      | 85      | 91    |
| 14        | Female | 1      | 97      | 97    |
| 15        | Male   | 0      | 91      | 93    |
| 16        | Female | 1      | 81      | 84    |
| 17        | Male   | 0      | 86      | 90    |
| 18        | Male   | 0      | 84      | 89    |
| 19        | Male   | 0      | 79      | 87    |
| 20        | Male   | 0      | 70      | 77    |
| 21        | Female | 1      | 82      | 85    |
| 22        | Female | 1      | 82      | 86    |
| 23        | Male   | 0      | 70      | 80    |
| 24        | Male   | 0      | 62      | 80    |
| 25        | Female | 1      | 77      | 78    |
| 26        | Female | 1      | 79      | 85    |
| 27        | Male   | 0      | 81      | 90    |
| 28        | Female | 1      | 85      | 88    |
| 29        | Female | 1      | 84      | 86    |
| 30        | Male   | 0      | 91      | 91    |

Now, with FEMALE as a quantitative representation of gender, Heather can run her regression just as easily, using FEMALE and MIDTERM as her independent variables and FINAL as her dependent variable.
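For readers who would rather reproduce this in Python than in a spreadsheet, here is a sketch (assuming the pandas and statsmodels libraries) using the data from the table above:

```python
import pandas as pd
import statsmodels.api as sm

# Heather's 30 students: (gender, midterm, final), in student-number order.
records = [
    ("M", 91, 98), ("M", 51, 59), ("F", 56, 53), ("M", 79, 84), ("F", 74, 77),
    ("F", 91, 90), ("M", 65, 69), ("M", 88, 97), ("F", 69, 73), ("F", 84, 84),
    ("F", 79, 75), ("F", 53, 59), ("M", 85, 91), ("F", 97, 97), ("M", 91, 93),
    ("F", 81, 84), ("M", 86, 90), ("M", 84, 89), ("M", 79, 87), ("M", 70, 77),
    ("F", 82, 85), ("F", 82, 86), ("M", 70, 80), ("M", 62, 80), ("F", 77, 78),
    ("F", 79, 85), ("M", 81, 90), ("F", 85, 88), ("F", 84, 86), ("M", 91, 91),
]
df = pd.DataFrame(records, columns=["gender", "midterm", "final"])
df["FEMALE"] = (df["gender"] == "F").astype(int)  # the dummy variable

X = sm.add_constant(df[["midterm", "FEMALE"]])
results = sm.OLS(df["final"], X).fit()
print(results.params)    # should reproduce the coefficients reported below
print(results.rsquared)  # ~0.913
```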

Regression Results

Heather runs her regression and obtains the following equation:

Ŷ = 0.89*MIDTERM – 5.00*FEMALE + 15.53

Heather's R² = 0.913, which means her model fits the data quite well; her t-values for MIDTERM and FEMALE are 16.38 and -4.02, respectively, both very significant; and her F-statistic is a strong 142.23, suggesting a valid model.

The regression results confirm Heather’s concern: female students tend to underperform male students on the physics final by an average of 5 points. This does not mean that female students underperform male students on the final because they are female; remember, regression analysis cannot prove causality. As indicated earlier, Heather feels her style for prepping students for the final might favor boys over girls (unintentionally, of course)!

In addition, each one-point increase in the midterm score is associated with an average increase of 0.89 points on the final.

Interpretation of the Dummy Variable

Essentially, the model tells us that if two of Heather's students this year, say Joe (male) and Shari (female), each score a 60 on the midterm, Joe's expected score on the final will be 68.93 (0.89*60 – 5.00*0 + 15.53), while Shari's will be 63.93 (0.89*60 – 5.00*1 + 15.53).
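In code, the fitted equation is just a one-liner (coefficients rounded as reported above):

```python
def predict_final(midterm: float, female: int) -> float:
    # Fitted equation from Heather's regression, coefficients rounded.
    return 0.89 * midterm - 5.00 * female + 15.53

print(predict_final(60, female=0))  # Joe:   68.93
print(predict_final(60, female=1))  # Shari: 63.93
```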

Notice that although we are using multiple regression, we only have a single slope: 0.89. The coefficient for FEMALE has no effect if the student is male; yet the coefficient for MIDTERM applies to all students, male or female. Hence, this dummy variable affects the intercept of the regression equation.

You can visually depict this as follows:

[Figure: two parallel fitted lines of Final vs. Midterm, both with slope 0.89 – the male students' line with intercept 15.53, and the female students' line shifted down 5 points.]
Essentially, the dummy variable captures a structural difference (here, gender) that shifts the intercept of the final-exam-score model. Since FEMALE is zero when the student is male, the intercept for males is 15.53; when the student is female, FEMALE is 1, and multiplying by its coefficient gives -5.00. Applied to the intercept, it's as if female students' intercept is 10.53.

Multiple Dummy Variables

Could Heather have instead created a variable MALE, giving it a value of 1 if male and 0 if female? Most definitely. In that case, the parameter estimate would still have a magnitude of 5.00, but a positive sign.

What if Heather had sophomores, juniors, and seniors in her class? Could she create dummy variables for those three outcomes? She sure can; however, she would need two dummy variables. Perhaps one for SENIOR and one for JUNIOR. Or she could use SOPHOMORE and JUNIOR. It doesn't matter. In the former case, a senior would be flagged as 1 for SENIOR and 0 for JUNIOR; a junior would be flagged as 0 for SENIOR and 1 for JUNIOR; a sophomore would be flagged as 0 for both.

Why can't Heather keep just one dummy variable and label sophomores 0, juniors 1, and seniors 2? Or why couldn't she label FEMALE as 1 for female and 2 for male? Because both of these would cause serious statistical problems: they would treat these qualitative, discrete variables as if they were quantitative, imposing an ordering and spacing that is somehow linearly related to the dependent variable, which is incorrect.

The number of dummy variables you use is always 1 less than the number of possible outcomes you’re trying to classify. So, if you want to compare five categories, you would need four dummy variables.
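If you manage your data in Python, pandas can generate these dummies automatically; a minimal sketch (the class-year values here are hypothetical):

```python
import pandas as pd

# Hypothetical class-year column for five students.
years = pd.Series(["Sophomore", "Junior", "Senior", "Junior", "Sophomore"])

# drop_first=True keeps one fewer dummy than categories, so one category
# (here "Junior", the alphabetically first) becomes the baseline.
dummies = pd.get_dummies(years, drop_first=True)
print(dummies)  # columns: Senior, Sophomore
```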

Why One Less Dummy Variable Than Number of Outcomes?

Why don't you use a dummy variable for every category? Very simple. Let's go back to our example of FEMALE. If we also create a dummy variable MALE, it is going to cause a problem: whenever FEMALE = 1, MALE = 0, and whenever FEMALE = 0, MALE = 1. MALE is completely determined by FEMALE, and the two columns always sum to the intercept's column of ones. Notice the problem: perfect multicollinearity!
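You can see the problem numerically: with a dummy for every category, the design matrix loses full rank. A small sketch using numpy:

```python
import numpy as np

female = np.array([1, 0, 1, 0, 1])
male = 1 - female       # MALE is perfectly determined by FEMALE
intercept = np.ones(5)

# The columns are linearly dependent (FEMALE + MALE = intercept), so the
# matrix has rank 2 instead of 3 and least squares has no unique solution.
X = np.column_stack([intercept, female, male])
print(np.linalg.matrix_rank(X))  # 2 -- the "dummy variable trap"
```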

Can Dummy Variables be Overdone?

Absolutely. Remember, each dummy variable you add costs a degree of freedom. On a small data set, several dummy variables can lead to insignificant t-values. Also, if you have two dummy variables for three outcomes, there's always the possibility that one or both will not produce significant t-values, at which point you may need to collapse the categories into fewer classifications. So dummy variables should be used sparingly, and for qualitative variables with as few classifications as possible.

Next Forecast Friday Topic: Testing for Seasonal Effects with Dummy Variables

In this example, dummy variables helped Heather Hanley determine that her students' gender is correlated with performance on the final exam. Yet dummy variables can be used for much more, like testing for seasonal effects. Next week, we will discuss how to test for seasonality using dummy variables.

*************************


Insight Central will Resume Thursday 09/23

September 21, 2010

I'm traveling on business this week and will not be posting on Insight Central these next couple of days.  Insight Central will resume this Thursday, September 23, with our Forecast Friday post.

I’m sorry for the inconvenience.

Alex