## Archive for September, 2010

### Forecast Friday Topic: Seasonal Dummy Variables

September 30, 2010

(Twenty-third in a series)

Last week, I introduced you to the use of dummy variables as a means of incorporating qualitative information into a regression model. Dummy variables can also be used to account for seasonality. A couple of weeks ago, we discussed adjusting your data for seasonality before constructing your model. As you saw, that can be pretty time consuming. A faster approach is to take the raw time series data and add a dummy variable for each season of the year, less one. So, if you’re working with quarterly data, you would use three dummy variables; if you have monthly data, you would add 11 dummy variables.

For example, the fourth quarter of the year is often the busiest for most retailers. If a retail chain didn’t seasonally adjust its data, it might choose to create three dummy variables: D1, D2, and D3. The first quarter of the year would be D1; the second quarter, D2; and the third quarter, D3. As we discussed last week, we always want to have one fewer dummy variable than we do outcomes. In our example, if we know the fourth quarter is the busiest quarter, then we would expect our three dummy variables to be significant and negative.
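As a minimal sketch of this coding scheme (the function name is just illustrative): with quarterly data and the fourth quarter as the omitted base category, each quarter maps to a triple of dummy values.

```python
def quarter_dummies(quarter):
    """Return (D1, D2, D3) for quarter in 1..4; Q4 (the base category) maps to all zeros."""
    return tuple(int(quarter == q) for q in (1, 2, 3))

# One row of dummies per quarter of the year
rows = [quarter_dummies(q) for q in (1, 2, 3, 4)]
# → [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
```

Notice that the fourth quarter is identified by all three dummies being zero, which is why only three variables are needed for four outcomes.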

Revisiting Billie Burton

A couple of weeks ago, while discussing how to decompose a time series, I used the example of Billie Burton, a businesswoman who makes gift baskets. Billie had been trying to forecast orders for planning and budgeting purposes. She had five years of monthly order data:

**Total Gift Basket Orders**

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|------|------|------|------|------|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |

You recall the painstaking effort we went through to adjust Billie’s orders for seasonality. Is there a simpler way? Yes. We can use dummy variables. Let’s first assume Billie ran her regression on the data just as it is, with no adjustment for seasonality. She ends up with the following regression equation:

Ŷ = 0.518t + 15.829

This model suggests an upward trend with each passing month but doesn’t fit the data quite as well as we would like: R² is just 0.313 and the F-statistic is just 26.47.

Imagine now that Billie decides to use seasonal dummy variables. Since her data is monthly, Billie must use 11 dummy variables. Since December is her busiest month, Billie decides to make one dummy variable for each month from January to November. D1 is January; D2 is February; and so on until D11, which is November. Hence, in January, D1 will be flagged as a 1 and D2 to D11 will be 0. In February, D2 will equal 1 while all the other dummies will be zero. And so forth. Note that all dummies will be zero in December.

Picture in your mind a table with 60 rows and 13 columns. Each row contains the monthly data from January 2005 to December 2009. The first column is the number of orders for the month; the second is the time period, t, which is 1 to 60. That is our independent variable from our original model. The next eleven columns are the dummy variables. Billie enters these into Excel and runs her regression. What does she get?

I’m going to show you the resulting equation in tabular form, as it would look far too complicated in standard form. Billie gets the following output:

| Parameter | Coefficient | t Stat |
|----------------|-------------|--------|
| Intercept | 42.93 | 15.93 |
| t | 0.47 | 12.13 |
| D1 (January) | -32.38 | -9.88 |
| D2 (February) | -10.66 | -3.26 |
| D3 (March) | -27.73 | -8.48 |
| D4 (April) | -24.21 | -7.41 |
| D5 (May) | -36.88 | -11.31 |
| D6 (June) | -36.35 | -11.16 |
| D7 (July) | -40.23 | -12.35 |
| D8 (August) | -26.10 | -8.02 |
| D9 (September) | -28.58 | -8.79 |
| D10 (October) | -38.25 | -11.76 |
| D11 (November) | -7.73 | -2.38 |

Billie gets a great model: notice that all the parameter estimates are significant, and the dummies are all negative, indicating that December is the busiest month. Billie’s R² has now shot up to 0.919, indicating a much better fit, and the F-statistic has risen to 44.73, making the model more significant than the trend-only model.
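As a sketch of what Billie’s spreadsheet regression is doing (the data dictionary below is transcribed from the order table above, and numpy’s least-squares routine stands in for Excel’s regression tool):

```python
import numpy as np

# Billie's monthly orders, transcribed from the table above (2005..2009 per month)
orders_by_month = {
    "Jan": [15, 18, 22, 26, 31], "Feb": [30, 36, 43, 52, 62],
    "Mar": [25, 18, 22, 43, 32], "Apr": [15, 30, 36, 27, 52],
    "May": [13, 16, 19, 23, 28], "Jun": [14, 17, 20, 24, 29],
    "Jul": [12, 14, 17, 20, 24], "Aug": [22, 26, 31, 37, 44],
    "Sep": [20, 24, 29, 35, 42], "Oct": [14, 17, 20, 24, 29],
    "Nov": [35, 42, 50, 60, 72], "Dec": [40, 48, 58, 70, 84],
}
months = list(orders_by_month)  # Jan..Dec, in insertion order

# Flatten into one series: Jan 2005, Feb 2005, ..., Dec 2009 (60 observations)
y = np.array([orders_by_month[m][yr] for yr in range(5) for m in months], float)

n = y.size                               # 60 rows
t = np.arange(1, n + 1)                  # time index 1..60
month_idx = np.tile(np.arange(12), 5)    # 0 = Jan .. 11 = Dec

# Design matrix: intercept, t, and 11 monthly dummies (December is the base month)
X = np.column_stack(
    [np.ones(n), t] + [(month_idx == m).astype(float) for m in range(11)]
)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, trend, dummies = beta[0], beta[1], beta[2:]
```

If the data are entered as shown, the fit should reproduce the table above: a positive trend coefficient and eleven negative dummy coefficients relative to December.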

How does this compare to Billie’s model on her seasonally-adjusted data? Recall that when doing her regressions on seasonally adjusted data, Billie got the following results:

Ŷ = 0.47t + 17.12

Her model had an R² of 0.872, but her F-statistic was almost 395! So, even though Billie gained a few more points in R² with the seasonal dummies, her F-statistic wasn’t quite as significant. However, Billie’s F-statistic using the dummy variables is still very strong, and, I would argue, more stable. Recall that the F-statistic is computed by dividing the mean square error of the regression by the mean square error of the residuals. The mean square error of the regression is the sum of squares due to regression (SSR) divided by the number of independent variables in the model; the mean square error of the residuals is the sum of squared error (SSE) divided by the number of observations less the number of independent variables, less one more. To illustrate, here is a side-by-side comparison:

| | Seasonally Adjusted Model | Seasonal Dummy Model |
|----------------------------------|--------|--------|
| # Observations | 60 | 60 |
| SSR | 3,982 | 14,179 |
| # Independent Variables | 1 | 12 |
| Mean Square Error of Regression | 3,982 | 1,182 |
| SSE | 585 | 1,241 |
| Degrees of Freedom | 58 | 47 |
| Mean Square Error of Residuals | 10.08 | 26.41 |
| F-Statistic | 394.91 | 44.73 |

So, although the F-statistic is much lower for the seasonal dummy model, so is its mean square error of the regression. As a result, the F-statistic is still quite significant, and much more stable than that of our one-variable model built on the seasonally adjusted data.
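Using the quantities from the comparison table, the F-statistic calculation can be sketched in a few lines (small differences from the table’s 394.91 and 44.73 are just rounding in the inputs):

```python
def f_statistic(ssr, sse, n_obs, k):
    """F = mean square regression / mean square residual
         = (SSR / k) / (SSE / (n_obs - k - 1))."""
    return (ssr / k) / (sse / (n_obs - k - 1))

# Numbers from the side-by-side comparison table above
f_adjusted = f_statistic(ssr=3_982, sse=585, n_obs=60, k=1)      # ≈ 394.8
f_dummies  = f_statistic(ssr=14_179, sse=1_241, n_obs=60, k=12)  # ≈ 44.75
```

Note how the eleven extra dummies both divide the SSR across twelve numerator degrees of freedom and cut the residual degrees of freedom from 58 to 47, which is why the dummy model’s F-statistic is so much smaller despite the larger SSR.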

It is important to note that sometimes data sets do not lend themselves well to seasonal dummies, and that the manual adjustment process we worked through a few weeks ago may be a better approach.

Next Forecast Friday Topic: Slope Dummy Variables

The dummy variables we worked with last week and this week are intercept dummies. These dummy variables alter the Y-intercept of the regression equation. Sometimes, it is necessary to affect the slope of the equation. We will discuss how slope dummies are used in next week’s Forecast Friday post.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Data, Data Everywhere

September 29, 2010

Every time we use a cell phone, surf the Web, interact on Facebook, make a purchase, what have you, we create data that businesses, charities, and other organizations analyze to learn more about us.

While such a scenario sounds Orwellian, it is not necessarily terrible.  For example, if your local supermarket chain knows from your frequent shopper card that you buy Kashi Go-Lean cereal four or five packages at a time, you might appreciate them telling you when Kashi goes on sale.  It’s a win-win situation: you want to stock up on Kashi for the best price, and the store wants to bait you with the Kashi in the hopes you’ll buy more than the cereal.

But I digress.  The fact that we create data both seamlessly and almost instantaneously with every one of life’s transactions has greatly increased the demand for tools and specialized professionals to analyze that data and help companies turn it into actionable information.  In fact, IBM is banking its future growth on analytics, a market estimated to be worth $100 billion, with its planned purchase of Netezza, announced last week.

Analytics is big business, and even if your job description doesn’t require you to analyze data, you should be aware of it.  Almost anything electronic can be tracked and/or monitored these days.  Anytime you get an email offer from an online retailer you’ve done business with, or direct mail from a charity or other retailer, you’ve been selected by analytical tools that have examined your past purchasing and giving history.

If you run a business, you should be cognizant of all the data you accumulate and the ways in which you accumulate it.  What’s more, you should weigh the data you’re currently collecting against the decisions it helps you make, so that you can identify additional data you may need.  This can be a goldmine in helping you better understand your customers’ needs and wants, identify new trends and changing patterns, and develop new products and services in response to those changing needs and wants.

Data and the need to analyze it are here to stay.

### Data Mining Meets Online Dating

September 28, 2010

The September 27 issue of Fortune Magazine had two stories in it that pertain to data mining and predictive modeling.  One of them, eHarmony’s Algorithm of Love, is an interesting account of how eHarmony is using predictive analytics tools to maximize the likelihood of a couple being a good match.  Since the article is brief, any commentary I might add – other than “it drives home the points I’ve been making” – will simply parrot the article.  So I thought I’d let you click on the link and enjoy!

### Forecast Friday Topic: Dummy Variables

September 23, 2010

(Twenty-second in a series)

To date, all of the independent variables we have used in our regression equations have been quantitative; they could easily be counted. Sometimes, it is important to understand the relationship between a dependent variable and a categorical, or qualitative, variable. For example, an economist who is developing a model to predict salaries for persons in a particular profession might want to see if salaries are different if the employee is male vs. female; white vs. non-white; or college-degreed vs. non-degreed. Whereas quantitative independent variables can be continuous (any value from negative to positive infinity), qualitative variables like those the economist is considering have only two values; that is, they are discrete.

Qualitative Variables in Regression Analysis

Heather Hanley is a high school physics teacher who is interested in predicting student scores on the final exam. Heather has a strong hunch that the midterm score is a good predictor of performance on the final, but she is also concerned that female students are not performing as well on the final as males. Heather wants to test this hypothesis to see if she needs to adjust her teaching style so that she can help the girls in her class prepare better for the final.

Since Heather teaches her class more or less the same way each year, she pulls the midterm and final scores for the thirty students in her class last year, and notes their gender. She has the following data:

| Student # | Gender | Midterm | Final |
|-----------|--------|---------|-------|
| 1 | Male | 91 | 98 |
| 2 | Male | 51 | 59 |
| 3 | Female | 56 | 53 |
| 4 | Male | 79 | 84 |
| 5 | Female | 74 | 77 |
| 6 | Female | 91 | 90 |
| 7 | Male | 65 | 69 |
| 8 | Male | 88 | 97 |
| 9 | Female | 69 | 73 |
| 10 | Female | 84 | 84 |
| 11 | Female | 79 | 75 |
| 12 | Female | 53 | 59 |
| 13 | Male | 85 | 91 |
| 14 | Female | 97 | 97 |
| 15 | Male | 91 | 93 |
| 16 | Female | 81 | 84 |
| 17 | Male | 86 | 90 |
| 18 | Male | 84 | 89 |
| 19 | Male | 79 | 87 |
| 20 | Male | 70 | 77 |
| 21 | Female | 82 | 85 |
| 22 | Female | 82 | 86 |
| 23 | Male | 70 | 80 |
| 24 | Male | 62 | 80 |
| 25 | Female | 77 | 78 |
| 26 | Female | 79 | 85 |
| 27 | Male | 81 | 90 |
| 28 | Female | 85 | 88 |
| 29 | Female | 84 | 86 |
| 30 | Male | 91 | 91 |

“Male” and “Female” are not quantitative states of nature, so how does Heather create a variable that accounts for gender? She can create a dummy variable. Heather creates a variable called FEMALE. If the student is a girl, then FEMALE=1; otherwise, if the student is a boy, then FEMALE=0. So Heather’s new table looks like this:

| Student # | Gender | FEMALE | Midterm | Final |
|-----------|--------|--------|---------|-------|
| 1 | Male | 0 | 91 | 98 |
| 2 | Male | 0 | 51 | 59 |
| 3 | Female | 1 | 56 | 53 |
| 4 | Male | 0 | 79 | 84 |
| 5 | Female | 1 | 74 | 77 |
| 6 | Female | 1 | 91 | 90 |
| 7 | Male | 0 | 65 | 69 |
| 8 | Male | 0 | 88 | 97 |
| 9 | Female | 1 | 69 | 73 |
| 10 | Female | 1 | 84 | 84 |
| 11 | Female | 1 | 79 | 75 |
| 12 | Female | 1 | 53 | 59 |
| 13 | Male | 0 | 85 | 91 |
| 14 | Female | 1 | 97 | 97 |
| 15 | Male | 0 | 91 | 93 |
| 16 | Female | 1 | 81 | 84 |
| 17 | Male | 0 | 86 | 90 |
| 18 | Male | 0 | 84 | 89 |
| 19 | Male | 0 | 79 | 87 |
| 20 | Male | 0 | 70 | 77 |
| 21 | Female | 1 | 82 | 85 |
| 22 | Female | 1 | 82 | 86 |
| 23 | Male | 0 | 70 | 80 |
| 24 | Male | 0 | 62 | 80 |
| 25 | Female | 1 | 77 | 78 |
| 26 | Female | 1 | 79 | 85 |
| 27 | Male | 0 | 81 | 90 |
| 28 | Female | 1 | 85 | 88 |
| 29 | Female | 1 | 84 | 86 |
| 30 | Male | 0 | 91 | 91 |

Now, with FEMALE as a quantitative representation of gender, Heather can run her regression just as easily, using FEMALE and MIDTERM as her independent variables and FINAL as her dependent variable.
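As a sketch of what that regression run looks like (student data transcribed from the table above; numpy’s least-squares routine stands in for whatever spreadsheet or statistics package Heather uses):

```python
import numpy as np

# (gender, midterm, final) for Heather's 30 students, from the table above
students = [
    ("M", 91, 98), ("M", 51, 59), ("F", 56, 53), ("M", 79, 84), ("F", 74, 77),
    ("F", 91, 90), ("M", 65, 69), ("M", 88, 97), ("F", 69, 73), ("F", 84, 84),
    ("F", 79, 75), ("F", 53, 59), ("M", 85, 91), ("F", 97, 97), ("M", 91, 93),
    ("F", 81, 84), ("M", 86, 90), ("M", 84, 89), ("M", 79, 87), ("M", 70, 77),
    ("F", 82, 85), ("F", 82, 86), ("M", 70, 80), ("M", 62, 80), ("F", 77, 78),
    ("F", 79, 85), ("M", 81, 90), ("F", 85, 88), ("F", 84, 86), ("M", 91, 91),
]

female = np.array([1.0 if g == "F" else 0.0 for g, _, _ in students])
midterm = np.array([m for _, m, _ in students], float)
final = np.array([f for _, _, f in students], float)

# Design matrix: intercept, MIDTERM, FEMALE
X = np.column_stack([np.ones(len(students)), midterm, female])
beta, *_ = np.linalg.lstsq(X, final, rcond=None)
intercept, b_midterm, b_female = beta

# Goodness of fit: R-squared
resid = final - X @ beta
r2 = 1 - (resid ** 2).sum() / ((final - final.mean()) ** 2).sum()
```

With the data entered as shown, the fit should yield a positive MIDTERM coefficient and a negative FEMALE coefficient, matching the results Heather obtains below.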

Regression Results

Heather runs her regression and obtains the following results:

Heather’s R² = 0.913, which means that her model fits the data quite well; her t-values for MIDTERM and FEMALE are 16.38 and -4.02, respectively, both very significant; and her F-statistic is a strong 142.23, suggesting a valid model.

The regression results confirm Heather’s concern: female students tend to underperform male students on the physics final by an average of 5 points. This does not mean that female students underperform male students on the final because they are female; remember, regression analysis cannot prove causality. As indicated earlier, Heather feels her style for prepping students for the final might favor boys over girls (unintentionally, of course)!

In addition, each one-point increase in the midterm score is associated with an average increase of 0.89 points on the final.

Interpretation of the Dummy Variable

Essentially, the model tells us that if two of Heather’s students this year, say Joe (male) and Shari (female), each score a 60 on the midterm, Joe’s expected score on the final will be 68.93 (0.89*60 – 5.00*0 + 15.53), while Shari’s will be 63.93 (0.89*60 – 5.00*1 + 15.53).
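The arithmetic above can be sketched as a tiny helper (the function name is illustrative; the coefficients are the ones from Heather’s fitted model):

```python
def predict_final(midterm, female):
    """Predicted final exam score: intercept 15.53, MIDTERM 0.89, FEMALE -5.00."""
    return 15.53 + 0.89 * midterm - 5.00 * female

joe = predict_final(60, 0)    # male student, midterm of 60  → 68.93
shari = predict_final(60, 1)  # female student, midterm of 60 → 63.93
```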

Notice that although we are using multiple regression, we only have a single slope: 0.89. The coefficient for FEMALE has no effect if the student is male; yet the coefficient for MIDTERM applies to all students, male or female. Hence, this dummy variable affects the intercept of the regression equation.

You can visually depict this as two parallel regression lines with the same slope of 0.89: one for male students with an intercept of 15.53, and one for female students shifted down five points.

Essentially, the dummy variable captures a structural shift (like a difference in gender) that might shift the intercept of the Final exam score model lower. Since the value of FEMALE is zero if the student is a male, then the intercept is 15.53; but if the student is female, the value of FEMALE is 1; multiply that by its coefficient, and you get -5.00. When applied towards the intercept, it’s as if female students’ intercept is 10.53.

Multiple Dummy Variables

Could Heather have instead created a variable MALE, and given it a value of 1 if male and 0 if female? Most definitely. In that case, the parameter estimate would still have a value of 5.00, but would have a positive sign.

What if Heather had sophomores, juniors, and seniors in her class? Can she do a dummy variable for those three outcomes? She sure can. However, she would need to create two dummy variables. Perhaps one for SENIOR and one for JUNIOR. Or she could do SOPHOMORE and JUNIOR. It doesn’t matter. In the former, a senior would be flagged as 1 for SENIOR and 0 for JUNIOR. A junior would be flagged as 0 for SENIOR and 1 for JUNIOR. A sophomore would be flagged as 0 for both.

Why can’t Heather still keep just one dummy variable and label sophomores 0, juniors 1, and seniors 2? Or why couldn’t she label FEMALE as 1 for female and 2 for male? Because both of these would cause serious statistical problems. Doing so would treat these qualitative, discrete variables as if they were quantitative, implying an ordering that is somehow linearly related to the dependent variable, which is incorrect.

The number of dummy variables you use is always 1 less than the number of possible outcomes you’re trying to classify. So, if you want to compare five categories, you would need four dummy variables.

Why One Less Dummy Variable Than Number of Outcomes?

Why don’t you use a dummy variable for all categories? Very simple. Let’s go back to our example of FEMALE. If we also create a dummy variable MALE, it is going to cause a problem. Every time FEMALE = 1, MALE = 0; and every time FEMALE = 0, MALE = 1. Notice the problem: perfect multicollinearity!
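You can see the linear dependence numerically: with an intercept in the model, the FEMALE and MALE columns always sum to the intercept column, so the design matrix loses a rank. A minimal sketch with numpy (the five-student sample is made up for illustration):

```python
import numpy as np

# Hypothetical five-student sample: 1 = female, 0 = male
female = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
male = 1.0 - female          # MALE is completely determined by FEMALE
intercept = np.ones(5)

# Design matrix with intercept plus BOTH dummies
X = np.column_stack([intercept, female, male])

# FEMALE + MALE = intercept exactly, so the three columns span only two dimensions
rank = np.linalg.matrix_rank(X)  # 2, not 3
```

Because the matrix is rank-deficient, the regression has no unique solution, which is exactly the perfect-multicollinearity problem described above; dropping either dummy restores full rank.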

Can Dummy Variables be Overdone?

Absolutely. Remember, each dummy variable you add reduces your degrees of freedom. On a small data set, several dummy variables could lead to insignificant t-values. Also, if you have two dummy variables for three outcomes, there’s always the possibility that one or both may not produce significant t-values, at which point you may need to collapse the categories into fewer classifications. So dummy variables should be used sparingly, and for qualitative variables with as few classifications as possible.

Next Forecast Friday Topic:
Testing For Seasonal Effects with Dummy Variables

In this example, dummy variables helped Heather Hanley determine that the gender of her students is correlated with performance on the final. Yet dummy variables can be used for much more, like testing for seasonal effects. Next week, we will discuss how to test for seasonality using dummy variables.


### Insight Central will Resume Thursday 09/23

September 21, 2010

I’m traveling on business this week and will not be posting on Insight Central these next couple of days.  Insight Central will resume this Thursday, September 23 with our Forecast Friday  post.

I’m sorry for the inconvenience.

Alex