(Twenty-third in a series)
Last week, I introduced you to the use of dummy variables as a means of incorporating qualitative information into a regression model. Dummy variables can also be used to account for seasonality. A couple of weeks ago, we discussed adjusting your data for seasonality before constructing your model. As you saw, that can be pretty time-consuming. A faster approach is to take the raw time series data and add a dummy variable for each season of the year less one. So, if you're working with quarterly data, you would use three dummy variables; if you have monthly data, you would add in 11 dummy variables.
For example, the fourth quarter of the year is often the busiest for retailers. If a retail chain didn't seasonally adjust its data, it might choose to create three dummy variables: D_{1}, D_{2}, and D_{3}. The first quarter of the year would be D_{1}; the second quarter, D_{2}; and the third quarter, D_{3}. As we discussed last week, we always want to have one fewer dummy variable than we do outcomes. In our example, if we know the fourth quarter is the busiest quarter, then we would expect our three dummy variables to be significant and negative.
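To make the coding concrete, here is a minimal Python sketch of how those three quarterly dummies would be assigned (the function name `quarter_dummies` is just illustrative, not from the retailer example):

```python
# Dummy-coding quarters, with Q4 (the busiest quarter) as the omitted baseline.
# Each observation gets three 0/1 indicators; a Q4 observation is all zeros.

def quarter_dummies(quarter):
    """Return (D1, D2, D3) for quarter 1-4; quarter 4 is the baseline."""
    return tuple(1 if quarter == q else 0 for q in (1, 2, 3))

for q in (1, 2, 3, 4):
    print("Q%d ->" % q, quarter_dummies(q))
```

Notice that the fourth quarter needs no dummy of its own: it is represented by all three dummies being zero, which is why we always use one fewer dummy than outcomes.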
Revisiting Billie Burton
A couple of weeks ago, while discussing how to decompose a time series, I used the example of Billie Burton, a businesswoman who makes gift baskets. Billie had been trying to forecast orders for planning and budgeting purposes. She had five years of monthly order data:
TOTAL GIFT BASKET ORDERS

| Month     | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|------|------|------|------|------|
| January   | 15   | 18   | 22   | 26   | 31   |
| February  | 30   | 36   | 43   | 52   | 62   |
| March     | 25   | 18   | 22   | 43   | 32   |
| April     | 15   | 30   | 36   | 27   | 52   |
| May       | 13   | 16   | 19   | 23   | 28   |
| June      | 14   | 17   | 20   | 24   | 29   |
| July      | 12   | 14   | 17   | 20   | 24   |
| August    | 22   | 26   | 31   | 37   | 44   |
| September | 20   | 24   | 29   | 35   | 42   |
| October   | 14   | 17   | 20   | 24   | 29   |
| November  | 35   | 42   | 50   | 60   | 72   |
| December  | 40   | 48   | 58   | 70   | 84   |
You recall the painstaking effort we went through to adjust Billie’s orders for seasonality. Is there a simpler way? Yes. We can use dummy variables. Let’s first assume Billie ran her regression on the data just as it is, with no adjustment for seasonality. She ends up with the following regression equation:
Ŷ = 0.518t + 15.829
This model suggests an upward trend with each passing month but doesn't fit the data quite as well as we would like: R² is just 0.313 and the F-statistic is just 26.47.
Imagine now that Billie decides to use seasonal dummy variables. Since her data is monthly, Billie must use 11 dummy variables. Since December is her busiest month, Billie makes it the baseline and creates one dummy variable for each month from January to November: D_{1} is January; D_{2} is February; and so on, up to D_{11}, which is November. Hence, in January, D_{1} will be flagged as a 1 and D_{2} through D_{11} will be 0. In February, D_{2} will equal 1 while all the other dummies will be zero. And so forth. Note that all dummies will be zero in December.
Picture in your mind a table with 60 rows and 13 columns. Each row contains the monthly data from January 2005 to December 2009. The first column is the number of orders for the month; the second is the time period, t, which is 1 to 60. That is our independent variable from our original model. The next eleven columns are the dummy variables. Billie enters these into Excel and runs her regression. What does she get?
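Billie uses Excel, but the same design matrix can be built and fit with ordinary least squares in a few lines of Python. This is just a sketch using NumPy's `lstsq` (the `orders` list is Billie's data from the table above, in time order):

```python
import numpy as np

# Billie's monthly gift basket orders, Jan 2005 - Dec 2009.
orders = [
    15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40,  # 2005
    18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48,  # 2006
    22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58,  # 2007
    26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70,  # 2008
    31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84,  # 2009
]

n = len(orders)            # 60 observations
t = np.arange(1, n + 1)    # time period, 1..60
month = (t - 1) % 12       # 0 = January ... 11 = December

# Design matrix: intercept, t, and 11 monthly dummies (December omitted).
X = np.column_stack(
    [np.ones(n), t] + [(month == m).astype(float) for m in range(11)]
)
y = np.array(orders, dtype=float)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept:", round(beta[0], 2), " trend:", round(beta[1], 2))
```

The trend coefficient comes out positive and every monthly dummy coefficient negative, since December is the largest month in every year of the data.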
I’m going to show you the resulting equation in tabular form, as it would look far too complicated in standard form. Billie gets the following output:
| Parameter         | Coefficient | t-Stat |
|-------------------|-------------|--------|
| Intercept         | 42.93       | 15.93  |
| t                 | 0.47        | 12.13  |
| D_{1} (January)   | -32.38      | -9.88  |
| D_{2} (February)  | -10.66      | -3.26  |
| D_{3} (March)     | -27.73      | -8.48  |
| D_{4} (April)     | -24.21      | -7.41  |
| D_{5} (May)       | -36.88      | -11.31 |
| D_{6} (June)      | -36.35      | -11.16 |
| D_{7} (July)      | -40.23      | -12.35 |
| D_{8} (August)    | -26.10      | -8.02  |
| D_{9} (September) | -28.58      | -8.79  |
| D_{10} (October)  | -38.25      | -11.76 |
| D_{11} (November) | -7.73       | -2.38  |
Billie gets a great model: notice that all the parameter estimates are significant, and the dummy coefficients are all negative, confirming December as the busiest month. Billie's R² has now shot up to 0.919, indicating a much better fit. And the F-statistic is up to 44.73, more significant than the 26.47 of the unadjusted model.
How does this compare to Billie's model on her seasonally-adjusted data? Recall that when running her regression on the seasonally adjusted data, Billie got the following results:
Ŷ = 0.47t + 17.12
Her model had an R² of 0.872, but her F-statistic was almost 395! So, even though Billie gained a few more points in R² with the seasonal dummies, her F-statistic wasn't quite as significant. However, Billie's F-statistic using the dummy variables is still very strong, and I would argue more stable. Recall that the F-statistic is determined by dividing the mean square error of the regression by the mean squared error of the residuals. The mean square error of the regression is the sum of squares regression (SSR) divided by the number of independent variables in the model; the mean squared error of the residuals is the sum of squared error (SSE) divided by the number of observations less the number of independent variables, less one more. To illustrate, here is a side-by-side comparison:
|                                  | Seasonally Adjusted Model | Seasonal Dummy Model |
|----------------------------------|---------------------------|----------------------|
| # Observations                   | 60                        | 60                   |
| SSR                              | 3,982                     | 14,179               |
| # Independent Variables          | 1                         | 12                   |
| Mean Square Error of Regression  | 3,982                     | 1,182                |
| SSE                              | 585                       | 1,241                |
| Degrees of Freedom               | 58                        | 47                   |
| Mean Squared Error of Residuals  | 10.08                     | 26.41                |
| F-Statistic                      | 394.91                    | 44.73                |
So, although the F-statistic is much lower for the seasonal dummy model, the mean square error of the regression is also much lower. As a result, the F-statistic is still quite significant, and arguably more stable than in our one-variable model built on the seasonally-adjusted data.
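The arithmetic behind the comparison table can be checked with a small helper (a sketch; `f_statistic` is just an illustrative name):

```python
# F = (SSR / k) / (SSE / (n - k - 1)), as described above.

def f_statistic(ssr, sse, n_obs, k):
    """Mean square error of the regression over mean squared error of residuals."""
    msr = ssr / k                  # SSR / number of independent variables
    mse = sse / (n_obs - k - 1)    # SSE / residual degrees of freedom
    return msr / mse

f_adjusted = f_statistic(ssr=3982, sse=585, n_obs=60, k=1)     # roughly 395
f_dummies = f_statistic(ssr=14179, sse=1241, n_obs=60, k=12)   # roughly 44.7
print(round(f_adjusted, 2), round(f_dummies, 2))
```

The results match the table to within rounding: the dummy model spends 12 degrees of freedom instead of 1, which shrinks its mean square regression and inflates its residual mean square, pulling the F-statistic down even though the fit is better.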
It is important to note that sometimes data sets do not lend themselves well to seasonal dummies, and that the manual adjustment process we worked through a few weeks ago may be a better approach.
Next Forecast Friday Topic: Slope Dummy Variables
The dummy variables we worked with last week and this week are intercept dummies: they alter the Y-intercept of the regression equation. Sometimes it is necessary to alter the slope of the equation instead. We will discuss how slope dummies are used in next week's Forecast Friday post.
*************************
If you Like Our Posts, Then “Like” Us on Facebook and Twitter!
Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to "Like" us on Facebook! By "Liking" us on Facebook, you'll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.
Tags: Analysights, degrees of freedom, dummy variable, f statistic, Forecast Friday, Forecasting, forecasts, intercept dummies, regression analysis, seasonal dummy variable, sum of squares, tvalues