
Forecast Friday Topic: Seasonal Dummy Variables

September 30, 2010

(Twenty-third in a series)

Last week, I introduced you to the use of dummy variables as a means of incorporating qualitative information into a regression model. Dummy variables can also be used to account for seasonality. A couple of weeks ago, we discussed adjusting your data for seasonality before constructing your model. As you saw, that could be pretty time-consuming. One faster approach is to take the raw time series data and add a dummy variable for each season of the year, less one. So, if you’re working with quarterly data, you would use three dummy variables; if you have monthly data, you would add 11 dummy variables.

For example, the fourth quarter of the year is often the busiest for most retailers. If a retail chain didn’t seasonally adjust its data, it might choose to create three dummy variables: D1, D2, and D3. The first quarter of the year would be D1; the second quarter, D2; and the third quarter, D3. As we discussed last week, we always want one fewer dummy variable than we have outcomes. In our example, since we know the fourth quarter is the busiest, we would expect all three dummy variables to be significant and negative.
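If you prefer to build such dummies programmatically rather than by hand, here is a minimal sketch in Python with pandas; the library choice and the quarterly sales figures are assumptions for illustration only, not data from this post.

```python
import pandas as pd

# Hypothetical quarterly data: three years of observations, quarters labeled 1-4.
df = pd.DataFrame({"quarter": [1, 2, 3, 4] * 3,
                   "sales":   [110, 120, 115, 180,   # made-up figures,
                               118, 126, 122, 195,   # for illustration only
                               125, 133, 128, 210]})

# One indicator column per quarter, then drop Q4 so it serves as the base case.
dummies = pd.get_dummies(df["quarter"], prefix="D").astype(int).drop(columns="D_4")
df = pd.concat([df, dummies], axis=1)
print(df)
```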

Revisiting Billie Burton

A couple of weeks ago, while discussing how to decompose a time series, I used the example of Billie Burton, a businesswoman who makes gift baskets. Billie had been trying to forecast orders for planning and budgeting purposes. She had five years of monthly order data:

Total Gift Basket Orders

| Month     | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|------|------|------|------|------|
| January   | 15   | 18   | 22   | 26   | 31   |
| February  | 30   | 36   | 43   | 52   | 62   |
| March     | 25   | 18   | 22   | 43   | 32   |
| April     | 15   | 30   | 36   | 27   | 52   |
| May       | 13   | 16   | 19   | 23   | 28   |
| June      | 14   | 17   | 20   | 24   | 29   |
| July      | 12   | 14   | 17   | 20   | 24   |
| August    | 22   | 26   | 31   | 37   | 44   |
| September | 20   | 24   | 29   | 35   | 42   |
| October   | 14   | 17   | 20   | 24   | 29   |
| November  | 35   | 42   | 50   | 60   | 72   |
| December  | 40   | 48   | 58   | 70   | 84   |

You recall the painstaking effort we went through to adjust Billie’s orders for seasonality. Is there a simpler way? Yes. We can use dummy variables. Let’s first assume Billie ran her regression on the data just as it is, with no adjustment for seasonality. She ends up with the following regression equation:

Ŷ = 0.518t + 15.829

This model suggests an upward trend with each passing month but doesn’t fit the data quite as well as we would like: R² is just 0.313 and the F-statistic is just 26.47.

Imagine now that Billie decides to use seasonal dummy variables. Since her data is monthly, Billie must use 11 dummy variables. Since December is her busiest month, Billie decides to make one dummy variable for each month from January to November. D1 is January; D2 is February; and so on until D11, which is November. Hence, in January, D1 will be flagged as a 1 and D2 to D11 will be 0. In February, D2 will equal 1 while all the other dummies will be zero. And so forth. Note that all dummies will be zero in December.

Picture in your mind a table with 60 rows and 13 columns. Each row contains the monthly data from January 2005 to December 2009. The first column is the number of orders for the month; the second is the time period, t, which is 1 to 60. That is our independent variable from our original model. The next eleven columns are the dummy variables. Billie enters these into Excel and runs her regression. What does she get?
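Billie works in Excel, but the same design matrix and regression can be sketched in Python with pandas and statsmodels; the libraries and column names here are my own assumptions for illustration, using the order data from the table above.

```python
import pandas as pd
import statsmodels.api as sm

# Billie's monthly orders from the table above; columns are the years 2005-2009.
by_month = {
    "Jan": [15, 18, 22, 26, 31],  "Feb": [30, 36, 43, 52, 62],
    "Mar": [25, 18, 22, 43, 32],  "Apr": [15, 30, 36, 27, 52],
    "May": [13, 16, 19, 23, 28],  "Jun": [14, 17, 20, 24, 29],
    "Jul": [12, 14, 17, 20, 24],  "Aug": [22, 26, 31, 37, 44],
    "Sep": [20, 24, 29, 35, 42],  "Oct": [14, 17, 20, 24, 29],
    "Nov": [35, 42, 50, 60, 72],  "Dec": [40, 48, 58, 70, 84],
}
months = list(by_month)
orders = [by_month[m][yr] for yr in range(5) for m in months]   # 60 observations

df = pd.DataFrame({"orders": orders, "t": range(1, 61)})
df["month"] = months * 5
# Eleven dummies D1..D11; December is the omitted (base) month.
for i, m in enumerate(months[:-1], start=1):
    df[f"D{i}"] = (df["month"] == m).astype(int)

X = sm.add_constant(df[["t"] + [f"D{i}" for i in range(1, 12)]])
model = sm.OLS(df["orders"], X).fit()
print(model.summary())   # should closely reproduce the coefficient table below
```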

I’m going to show you the resulting equation in tabular form, as it would look far too complicated in standard form. Billie gets the following output:

| Parameter      | Coefficient | t Stat |
|----------------|-------------|--------|
| Intercept      | 42.93       | 15.93  |
| t              | 0.47        | 12.13  |
| D1 (January)   | -32.38      | -9.88  |
| D2 (February)  | -10.66      | -3.26  |
| D3 (March)     | -27.73      | -8.48  |
| D4 (April)     | -24.21      | -7.41  |
| D5 (May)       | -36.88      | -11.31 |
| D6 (June)      | -36.35      | -11.16 |
| D7 (July)      | -40.23      | -12.35 |
| D8 (August)    | -26.10      | -8.02  |
| D9 (September) | -28.58      | -8.79  |
| D10 (October)  | -38.25      | -11.76 |
| D11 (November) | -7.73       | -2.38  |

Billie gets a great model: all the parameter estimates are significant, and the dummy coefficients are all negative, consistent with December being the busiest month. Billie’s R² has shot up to 0.919, indicating a much better fit, and the F-statistic is up to 44.73, which is even more significant.

How does this compare to Billie’s model on her seasonally-adjusted data? Recall that when doing her regressions on seasonally adjusted data, Billie got the following results:

Ŷ = 0.47t + 17.12

Her model had an R² of 0.872, but her F-statistic was almost 395! So, even though Billie gained a few more points in R² with the seasonal dummies, her F-statistic wasn’t quite as significant. However, Billie’s F-statistic using the dummy variables is still very strong, and I would argue more stable. Recall that the F-statistic is determined by dividing the mean squared error of the regression by the mean squared error of the residuals. The mean squared error of the regression is the regression sum of squares (SSR in the table below) divided by the number of independent variables in the model; the mean squared error of the residuals is the sum of squared errors (SSE) divided by the number of observations less the number of independent variables, less one more. To illustrate, here is a side-by-side comparison:

|                                  | Seasonally Adjusted Model | Seasonal Dummy Model |
|----------------------------------|---------------------------|----------------------|
| # Observations                   | 60                        | 60                   |
| SSR                              | 3,982                     | 14,179               |
| # Independent Variables          | 1                         | 12                   |
| Mean Square Error of Regression  | 3,982                     | 1,182                |
| SSE                              | 585                       | 1,241                |
| Degrees of Freedom               | 58                        | 47                   |
| Mean Squared Error of Residuals  | 10.08                     | 26.41                |
| F-Statistic                      | 394.91                    | 44.73                |

So, although the F-statistic is much lower for the seasonal dummy model, so is its mean square error of the regression. As a result, the F-statistic is still quite significant, and arguably more stable than that of the one-variable model built on the seasonally adjusted data.
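To make the arithmetic concrete: F = (SSR ÷ k) ÷ (SSE ÷ (n − k − 1)), where k is the number of independent variables and n is the number of observations. Here is a minimal Python check using the figures from the table above; small differences from the printed F-statistics are due to rounding of SSR and SSE.

```python
# Quick check of the two F-statistics in the comparison table above.
def f_stat(ssr, sse, k, n):
    msr = ssr / k                  # mean square error of the regression
    mse = sse / (n - k - 1)        # mean squared error of the residuals
    return msr / mse

print(f_stat(ssr=3_982, sse=585, k=1, n=60))      # seasonally adjusted model: ~394.8
print(f_stat(ssr=14_179, sse=1_241, k=12, n=60))  # seasonal dummy model: ~44.7
```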

It is important to note that sometimes data sets do not lend themselves well to seasonal dummies, and that the manual adjustment process we worked through a few weeks ago may be a better approach.

Next Forecast Friday Topic: Slope Dummy Variables

The dummy variables we worked with last week and this week are intercept dummies. These dummy variables alter the Y-intercept of the regression equation. Sometimes, it is necessary to affect the slope of the equation. We will discuss how slope dummies are used in next week’s Forecast Friday post.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.


Forecast Friday Topic: Simple Regression Analysis

May 27, 2010

(Sixth in a series)

Today, we begin our discussion of regression analysis as a time series forecasting tool. This discussion will take the next few weeks, as there is much behind it. As always, I will make sure everything is simplified and easy for you to digest. Regression is a powerful tool that can be very helpful for mid- and long-range forecasting. Quite often, the business decisions we make require us to consider relationships between two or more variables. Rarely can we make changes to our promotion, pricing, and/or product development strategies without them having an impact of some kind on our sales. Just how big an impact would that be? How do we measure the relationship between two or more variables? And does a real relationship even exist between those variables? Regression analysis helps us find out.

One thing I must point out: Remember the “deviations” we discussed in the posts on moving average and exponential smoothing techniques: The difference between the forecasted and actual values for each observation, of which we took the absolute value? Good. In regression analysis, we refer to the deviations as the “error terms” or “residuals.” In regression analysis, the residuals – which we will square, rather than take the absolute value – become very important in gauging the regression model’s accuracy, validity, efficiency, and “goodness of fit.”

Simple Linear Regression Analysis

Sue Stone, owner of Stone & Associates, looked at her CPA practice’s monthly receipts from January to December 2009. The sales were as follows:

| Month     | Sales   |
|-----------|---------|
| January   | $10,000 |
| February  | $11,000 |
| March     | $10,500 |
| April     | $11,500 |
| May       | $12,500 |
| June      | $12,000 |
| July      | $14,000 |
| August    | $13,000 |
| September | $13,500 |
| October   | $15,000 |
| November  | $14,500 |
| December  | $15,500 |

Sue is trying to predict what sales will be for each month in the first quarter of 2010, but is unsure of how to go about it. Moving average and exponential smoothing techniques rarely go more than one period ahead. So, what is Sue to do?

When we are presented with a set of numbers, one of the ways we try to make sense of them is by taking their average. Perhaps Sue can average all 12 months’ sales – $12,750 – and use that as her forecast for each of the next three months. But how accurately would that average represent each month of 2009? How spread out are each month’s sales from the average? Sue subtracts the average from each month’s sales and examines the differences:

| Month     | Sales   | Sales Less Average Sales |
|-----------|---------|--------------------------|
| January   | $10,000 | -$2,750                  |
| February  | $11,000 | -$1,750                  |
| March     | $10,500 | -$2,250                  |
| April     | $11,500 | -$1,250                  |
| May       | $12,500 | -$250                    |
| June      | $12,000 | -$750                    |
| July      | $14,000 | $1,250                   |
| August    | $13,000 | $250                     |
| September | $13,500 | $750                     |
| October   | $15,000 | $2,250                   |
| November  | $14,500 | $1,750                   |
| December  | $15,500 | $2,750                   |

Sue notices that the error between actual and average is quite high in both the first four months of 2009 and in the last three months of 2009. She wants to understand the overall error in using the average as a forecast of sales. However, when she sums up all the errors from month to month, Sue finds they sum to zero. That tells her nothing. So she squares each month’s error value and sums them:

| Month       | Sales   | Error   | Error Squared |
|-------------|---------|---------|---------------|
| January     | $10,000 | -$2,750 | $7,562,500    |
| February    | $11,000 | -$1,750 | $3,062,500    |
| March       | $10,500 | -$2,250 | $5,062,500    |
| April       | $11,500 | -$1,250 | $1,562,500    |
| May         | $12,500 | -$250   | $62,500       |
| June        | $12,000 | -$750   | $562,500      |
| July        | $14,000 | $1,250  | $1,562,500    |
| August      | $13,000 | $250    | $62,500       |
| September   | $13,500 | $750    | $562,500      |
| October     | $15,000 | $2,250  | $5,062,500    |
| November    | $14,500 | $1,750  | $3,062,500    |
| December    | $15,500 | $2,750  | $7,562,500    |
| Total Error |         |         | $35,750,000   |

In totaling these squared errors, Sue derives the total sum of squares, or TSS: 35,750,000. Is there any way she can improve on that? Sue thinks for a while. She doesn’t know much more about her 2009 sales except the month in which each was generated.
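For readers who would rather script it, here is a minimal Python sketch that reproduces the mean and the total sum of squares from the tables above (Python is an assumption for illustration; Sue works in Excel).

```python
# The mean of Sue's 2009 sales and the total sum of squares (TSS) around it.
sales = [10_000, 11_000, 10_500, 11_500, 12_500, 12_000,
         14_000, 13_000, 13_500, 15_000, 14_500, 15_500]
mean = sum(sales) / len(sales)                    # 12,750
tss = sum((y - mean) ** 2 for y in sales)         # 35,750,000
print(mean, tss)
```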

She plots the sales on a chart and notices that sales by month appear to be on an upward trend. Sue thinks for a moment. “All I know is the sales and the month,” she says to herself. “How can I develop a model to forecast accurately?” Sue reads about a statistical procedure called regression analysis and, seeing that each month’s sales are in sequential order, she wonders whether the mere passage of time simply causes sales to go higher. Sue numbers each month, with January assigned a 1 and December, a 12.

She also realizes that she is trying to predict sales with each passing month, so she hypothesizes that the change in sales depends on the change in the month. Sales is therefore Sue’s dependent variable. Because the month number is used to estimate the change in sales, it is her independent variable. In regression analysis, the relationship between an independent and a dependent variable is expressed as:

Y = α + βX + ε

    Where: Y is the value of the dependent variable

    X is the value of the independent variable

    α is a population parameter, called the intercept, which would be the value of Y when X=0

    β is also a population parameter – the slope of the regression line – representing the change in Y associated with each one-unit change in X.

    ε is the error term.

Sue further reads that the goal of regression analysis is to minimize the error sum of squares, which is why it is referred to as ordinary least squares (OLS) regression. She also notices that she is building her regression on a sample, so there is a sample regression equation used to estimate what the true regression is for the population:

Ŷi = a + bXi

Essentially, the equation is the same as the one above, except that the terms refer to the sample. The Ŷ term (called “Y hat”) is the sample forecasted value of the dependent variable (sales) at period i; a is the sample estimate of α; b is the sample estimate of β; Xi is the value of the independent variable at period i; and ei is the error, the difference between Ŷ (the forecasted value) and actual Y for period i. Sue needs to find the values of a and b – the estimates of the population parameters – that minimize the error sum of squares.

Sue reads that the equations for estimating a and b are derived from calculus, but can be expressed algebraically as:

b = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)²

a = Ȳ – bX̄

Sue learns that the X and Y terms with lines above them, known as “X bar” and “Y bar,” are the averages of all the X and Y values, respectively. She also reads that the Σ notation – the Greek letter sigma – represents a sum. Hence, Sue realizes a few things:

  1. She must estimate b before she can estimate a.
  2. To estimate b, she must first compute the numerator:
    1. subtract the average month number (X-bar) from each observation’s month number (X minus X-bar),
    2. subtract average sales (Y-bar) from each observation’s sales (Y minus Y-bar),
    3. multiply those two differences together, and
    4. add up those products across all observations.
  3. To get the denominator for calculating b, she must:
    1. again subtract X-bar from X, but then square the difference, for each observation, and
    2. sum those squares.
  4. Calculating b is then easy: she needs only to divide the result from (2) by the result from (3).
  5. Calculating a is also easy: she multiplies her b value by the average month number (X-bar) and subtracts the result from average sales (Y-bar). (A short code sketch of these steps follows below.)
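Here is the same recipe as a short Python sketch, using Sue’s data from the table above; Sue does this in Excel, so the language is just an assumption for illustration.

```python
# Sue's five steps, scripted: estimate b, then a, by least squares.
months = list(range(1, 13))     # X: month numbers 1 through 12
sales = [10_000, 11_000, 10_500, 11_500, 12_500, 12_000,
         14_000, 13_000, 13_500, 15_000, 14_500, 15_500]   # Y

x_bar = sum(months) / len(months)    # 6.5
y_bar = sum(sales) / len(sales)      # 12,750

# Numerator: sum of (X - X-bar)(Y - Y-bar); denominator: sum of (X - X-bar) squared.
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales))    # 68,500
denominator = sum((x - x_bar) ** 2 for x in months)                          # 143

b = numerator / denominator          # ~479.02
a = y_bar - b * x_bar                # ~9,636.36
print(b, a)
```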

Sue now goes to work to compute her regression equation. She goes into Excel and enters her monthly sales data in a table, and computes the averages for sales and month number:

 

| Month (X)    | Sales (Y)         |
|--------------|-------------------|
| 1            | $10,000           |
| 2            | $11,000           |
| 3            | $10,500           |
| 4            | $11,500           |
| 5            | $12,500           |
| 6            | $12,000           |
| 7            | $14,000           |
| 8            | $13,000           |
| 9            | $13,500           |
| 10           | $15,000           |
| 11           | $14,500           |
| 12           | $15,500           |
| Average: 6.5 | Average: $12,750  |

Sue goes ahead and subtracts the respective averages from the X and Y values, and computes the components she needs (the “Product” column is the result of multiplying the values in the first two columns together):

| X minus X-bar | Y minus Y-bar | Product | (X minus X-bar) Squared |
|---------------|---------------|---------|-------------------------|
| -5.5          | -$2,750       | $15,125 | 30.25                   |
| -4.5          | -$1,750       | $7,875  | 20.25                   |
| -3.5          | -$2,250       | $7,875  | 12.25                   |
| -2.5          | -$1,250       | $3,125  | 6.25                    |
| -1.5          | -$250         | $375    | 2.25                    |
| -0.5          | -$750         | $375    | 0.25                    |
| 0.5           | $1,250        | $625    | 0.25                    |
| 1.5           | $250          | $375    | 2.25                    |
| 2.5           | $750          | $1,875  | 6.25                    |
| 3.5           | $2,250        | $7,875  | 12.25                   |
| 4.5           | $1,750        | $7,875  | 20.25                   |
| 5.5           | $2,750        | $15,125 | 30.25                   |
| Total         |               | $68,500 | 143                     |

Sue computes b:

b = $68,500/143

= $479.02

Now that Sue knows b, she calculates a:

a = $12,750 – $479.02(6.5)

= $12,750 – $3,113.64

= $9,636.36

Hence, setting the error term to zero, Sue’s least-squares regression equation is:

Ŷ = $9,636.36 + $479.02X

Or, in business terminology:

Forecasted Sales = $9,636.36 + $479.02 * Month number.

This means that each passing month is associated with an average increase in sales of $479.02 for Sue’s CPA firm. How accurately does this regression model predict sales? Sue estimates the error by plugging each month’s number into the equation and then comparing her forecast for that month with the actual sales:

| Month (X) | Sales (Y) | Forecasted Sales | Error     |
|-----------|-----------|------------------|-----------|
| 1         | $10,000   | $10,115.38       | -$115.38  |
| 2         | $11,000   | $10,594.41       | $405.59   |
| 3         | $10,500   | $11,073.43       | -$573.43  |
| 4         | $11,500   | $11,552.45       | -$52.45   |
| 5         | $12,500   | $12,031.47       | $468.53   |
| 6         | $12,000   | $12,510.49       | -$510.49  |
| 7         | $14,000   | $12,989.51       | $1,010.49 |
| 8         | $13,000   | $13,468.53       | -$468.53  |
| 9         | $13,500   | $13,947.55       | -$447.55  |
| 10        | $15,000   | $14,426.57       | $573.43   |
| 11        | $14,500   | $14,905.59       | -$405.59  |
| 12        | $15,500   | $15,384.62       | $115.38   |

Sue’s actual and forecasted sales appear to be pretty close, except for her July estimate, which is off by a little over $1,000. But does her model predict better than if she simply used average sales as her forecast for each month? To find out, she must compute the error sum of squares (ESS). Sue squares the error term for each observation and sums them to obtain ESS:

ESS = Σeᵢ²

| Error     | Squared Error |
|-----------|---------------|
| -$115.38  | $13,313.61    |
| $405.59   | $164,506.82   |
| -$573.43  | $328,818.04   |
| -$52.45   | $2,750.75     |
| $468.53   | $219,521.74   |
| -$510.49  | $260,599.54   |
| $1,010.49 | $1,021,089.05 |
| -$468.53  | $219,521.74   |
| -$447.55  | $200,303.19   |
| $573.43   | $328,818.04   |
| -$405.59  | $164,506.82   |
| $115.38   | $13,313.61    |
| ESS       | $2,937,062.94 |

Notice Sue’s error sum of squares. This is the error, or unexplained, sum of squared deviations between the forecasted and actual sales. The difference between the total sum of squares (TSS) and the error sum of squares (ESS) is the regression sum of squares, RSS – the sum of squared deviations that are explained by the regression. RSS can also be calculated directly, as the sum of the squared differences between each forecasted value of sales and the average of sales:

| Forecasted Sales | Average Sales | Regression Error | Reg. Error Squared |
|------------------|---------------|------------------|--------------------|
| $10,115.38       | $12,750       | -$2,634.62       | $6,941,198.22      |
| $10,594.41       | $12,750       | -$2,155.59       | $4,646,587.24      |
| $11,073.43       | $12,750       | -$1,676.57       | $2,810,898.45      |
| $11,552.45       | $12,750       | -$1,197.55       | $1,434,131.86      |
| $12,031.47       | $12,750       | -$718.53         | $516,287.47        |
| $12,510.49       | $12,750       | -$239.51         | $57,365.27         |
| $12,989.51       | $12,750       | $239.51          | $57,365.27         |
| $13,468.53       | $12,750       | $718.53          | $516,287.47        |
| $13,947.55       | $12,750       | $1,197.55        | $1,434,131.86      |
| $14,426.57       | $12,750       | $1,676.57        | $2,810,898.45      |
| $14,905.59       | $12,750       | $2,155.59        | $4,646,587.24      |
| $15,384.62       | $12,750       | $2,634.62        | $6,941,198.22      |
| RSS              |               |                  | $32,812,937.06     |

Sue immediately adds the RSS and the ESS and sees they match the TSS: $35,750,000. She also knows that nearly 33 million of that TSS is explained by her regression model, so she divides her RSS by the TSS:

32,812,937.06 / 35,750,000

= 0.917, or 91.7%

This quotient, known as the coefficient of determination and denoted R², tells Sue that the passing months explain 91.7% of the variation in her monthly sales. Put another way, Sue improved her forecast accuracy by 91.7% by using this simple model instead of the simple average. As you will find out in subsequent blog posts, maximizing R² isn’t the be-all and end-all; in fact, there is still much to do with this model, which will be discussed in next week’s Forecast Friday post. But for now, Sue’s model seems to have reduced a great deal of error.
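As a cross-check, this short Python sketch reproduces Sue’s ESS, RSS, TSS, and R²; it uses the rounded coefficients from above, so the totals match the tables only approximately.

```python
# Reproduce Sue's decomposition: ESS, RSS, TSS, and the coefficient of determination.
a, b = 9_636.36, 479.02        # rounded estimates from above
sales = [10_000, 11_000, 10_500, 11_500, 12_500, 12_000,
         14_000, 13_000, 13_500, 15_000, 14_500, 15_500]
mean = sum(sales) / len(sales)                              # 12,750

forecasts = [a + b * x for x in range(1, 13)]
ess = sum((y - f) ** 2 for y, f in zip(sales, forecasts))   # ~2.94 million
rss = sum((f - mean) ** 2 for f in forecasts)               # ~32.8 million
tss = sum((y - mean) ** 2 for y in sales)                   # 35,750,000

print(ess, rss, tss)
print(rss / tss)                                            # R-squared, ~0.917
```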

It is important to note that while each month does seem to be related to sales, the passing months do not cause the increase in sales. Correlation does not mean causation. There could be something behind the scenes (e.g., Sue’s advertising, or the types of projects she works on, etc.) that is driving the upward trend in her sales.

Using the Regression Equation to Forecast Sales

Now Sue can use the same model to forecast sales for January and February 2010. She has her equation; since January 2010 is period 13, she plugs in 13 for X and gets a forecast of $15,863.64. For February 2010 (period 14), she gets $16,342.66.
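In code, under the same rounded coefficients (so the results agree with Sue’s figures to within a few cents):

```python
# January 2010 is period 13, February 2010 is period 14.
a, b = 9_636.36, 479.02
for period in (13, 14):
    print(period, round(a + b * period, 2))   # ~15,863.62 and ~16,342.64
```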

Recap and Plan for Next Week

You have now learned the basics of simple regression analysis. You have learned how to estimate the parameters for the regression equation, how to measure the improvement in accuracy from the regression model, and how to generate forecasts. Next week, we will be checking the validity of Sue’s equation, and discussing the important assumptions underlying regression analysis. Until then, you have a basic overview of what regression analysis is.