## Posts Tagged ‘ordinary least squares’

### Forecast Friday Topic: Detecting Heteroscedasticity – Analytical Approaches

August 19, 2010

(Eighteenth in a series)

Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are not biased, but rather inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.

Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but can be a bit tedious to employ.

Reviewing Our Model

Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which drew INCOME and OWNRATIO values from 59 census tracts in Pierce County, Washington, compiled during the 1980 Census. We had the following data:

**Housing Data**

| Tract | Income | Ownratio |
| --- | --- | --- |
| 601 | $24,909 | 7.220 |
| 602 | $11,875 | 1.094 |
| 603 | $19,308 | 3.587 |
| 604 | $20,375 | 5.279 |
| 605 | $20,132 | 3.508 |
| 606 | $15,351 | 0.789 |
| 607 | $14,821 | 1.837 |
| 608 | $18,816 | 5.150 |
| 609 | $19,179 | 2.201 |
| 609 | $21,434 | 1.932 |
| 610 | $15,075 | 0.919 |
| 611 | $15,634 | 1.898 |
| 612 | $12,307 | 1.584 |
| 613 | $10,063 | 0.901 |
| 614 | $5,090 | 0.128 |
| 615 | $8,110 | 0.059 |
| 616 | $4,399 | 0.022 |
| 616 | $5,411 | 0.172 |
| 617 | $9,541 | 0.916 |
| 618 | $13,095 | 1.265 |
| 619 | $11,638 | 1.019 |
| 620 | $12,711 | 1.698 |
| 621 | $12,839 | 2.188 |
| 623 | $15,202 | 2.850 |
| 624 | $15,932 | 3.049 |
| 625 | $14,178 | 2.307 |
| 626 | $12,244 | 0.873 |
| 627 | $10,391 | 0.410 |
| 628 | $13,934 | 1.151 |
| 629 | $14,201 | 1.274 |
| 630 | $15,784 | 1.751 |
| 631 | $18,917 | 5.074 |
| 632 | $17,431 | 4.272 |
| 633 | $17,044 | 3.868 |
| 634 | $14,870 | 2.009 |
| 635 | $19,384 | 2.256 |
| 701 | $18,250 | 2.471 |
| 705 | $14,212 | 3.019 |
| 706 | $15,817 | 2.154 |
| 710 | $21,911 | 5.190 |
| 711 | $19,282 | 4.579 |
| 712 | $21,795 | 3.717 |
| 713 | $22,904 | 3.720 |
| 713 | $22,507 | 6.127 |
| 714 | $19,592 | 4.468 |
| 714 | $16,900 | 2.110 |
| 718 | $12,818 | 0.782 |
| 718 | $9,849 | 0.259 |
| 719 | $16,931 | 1.233 |
| 719 | $23,545 | 3.288 |
| 720 | $9,198 | 0.235 |
| 721 | $22,190 | 1.406 |
| 721 | $19,646 | 2.206 |
| 724 | $24,750 | 5.650 |
| 726 | $18,140 | 5.078 |
| 728 | $21,250 | 1.433 |
| 731 | $22,231 | 7.452 |
| 731 | $19,788 | 5.738 |
| 735 | $13,269 | 1.364 |

Data taken from the U.S. Bureau of the Census, 1980 Census, Pierce County, WA; reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991), pp. 198–200.

And we got the following regression equation:

Ŷ = 0.000297*INCOME – 2.221

With an R² = 0.597, an F-ratio of 84.31, the t-ratios for INCOME (9.182) and the intercept (-4.094) both solidly significant, and the positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.

The Goldfeld-Quandt Test

The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test checks whether the variance associated with high values of the independent variable, X, is significantly different from the variance associated with low values. Essentially, you run separate regressions for the low values of X and for the high values, and then compare the ratio of their error sums of squares against the F-distribution.

The Goldfeld-Quandt test has four steps:

Step #1: Sort the data

Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:

**Housing Data (sorted by INCOME)**

| Tract | Income | Ownratio |
| --- | --- | --- |
| 616 | $4,399 | 0.022 |
| 614 | $5,090 | 0.128 |
| 616 | $5,411 | 0.172 |
| 615 | $8,110 | 0.059 |
| 720 | $9,198 | 0.235 |
| 617 | $9,541 | 0.916 |
| 718 | $9,849 | 0.259 |
| 613 | $10,063 | 0.901 |
| 627 | $10,391 | 0.410 |
| 619 | $11,638 | 1.019 |
| 602 | $11,875 | 1.094 |
| 626 | $12,244 | 0.873 |
| 612 | $12,307 | 1.584 |
| 620 | $12,711 | 1.698 |
| 718 | $12,818 | 0.782 |
| 621 | $12,839 | 2.188 |
| 618 | $13,095 | 1.265 |
| 735 | $13,269 | 1.364 |
| 628 | $13,934 | 1.151 |
| 625 | $14,178 | 2.307 |
| 629 | $14,201 | 1.274 |
| 705 | $14,212 | 3.019 |
| 607 | $14,821 | 1.837 |
| 634 | $14,870 | 2.009 |
| 610 | $15,075 | 0.919 |
| 623 | $15,202 | 2.850 |
| 606 | $15,351 | 0.789 |
| 611 | $15,634 | 1.898 |
| 630 | $15,784 | 1.751 |
| 706 | $15,817 | 2.154 |
| 624 | $15,932 | 3.049 |
| 714 | $16,900 | 2.110 |
| 719 | $16,931 | 1.233 |
| 633 | $17,044 | 3.868 |
| 632 | $17,431 | 4.272 |
| 726 | $18,140 | 5.078 |
| 701 | $18,250 | 2.471 |
| 608 | $18,816 | 5.150 |
| 631 | $18,917 | 5.074 |
| 609 | $19,179 | 2.201 |
| 711 | $19,282 | 4.579 |
| 603 | $19,308 | 3.587 |
| 635 | $19,384 | 2.256 |
| 714 | $19,592 | 4.468 |
| 721 | $19,646 | 2.206 |
| 731 | $19,788 | 5.738 |
| 605 | $20,132 | 3.508 |
| 604 | $20,375 | 5.279 |
| 728 | $21,250 | 1.433 |
| 609 | $21,434 | 1.932 |
| 712 | $21,795 | 3.717 |
| 710 | $21,911 | 5.190 |
| 721 | $22,190 | 1.406 |
| 731 | $22,231 | 7.452 |
| 713 | $22,507 | 6.127 |
| 713 | $22,904 | 3.720 |
| 719 | $23,545 | 3.288 |
| 724 | $24,750 | 5.650 |
| 601 | $24,909 | 7.220 |

Step #2: Omit the middle observations

Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There’s no hard-and-fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit 13 observations:

The lowest 23 observations run from tract 616 ($4,399) through tract 607 ($14,821), and the highest 23 run from tract 701 ($18,250) through tract 601 ($24,909). The 13 middle observations we omit (highlighted in the original table) are:

**Omitted Middle Observations**

| Tract | Income | Ownratio |
| --- | --- | --- |
| 634 | $14,870 | 2.009 |
| 610 | $15,075 | 0.919 |
| 623 | $15,202 | 2.850 |
| 606 | $15,351 | 0.789 |
| 611 | $15,634 | 1.898 |
| 630 | $15,784 | 1.751 |
| 706 | $15,817 | 2.154 |
| 624 | $15,932 | 3.049 |
| 714 | $16,900 | 2.110 |
| 719 | $16,931 | 1.233 |
| 633 | $17,044 | 3.868 |
| 632 | $17,431 | 4.272 |
| 726 | $18,140 | 5.078 |

Step #3: Run two separate regressions, one for the low values, one for the high

We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren’t concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom.

Step #4: Divide the ESS of the higher-value regression by the ESS of the lower-value regression, and compare the quotient to the F-table

The higher-value regression produced an ESS of 61.489 and the lower-value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now we go to the F-table and check the critical F-value at a 5% significance level, with 21 degrees of freedom in both the numerator and denominator, which is a value of 2.10. Since our quotient of 11.851 is greater than the critical F-value, we can conclude there is strong evidence of heteroscedasticity in the model.
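
If you’d rather script the Goldfeld-Quandt procedure than run two regressions by hand, here is a minimal Python sketch of the four steps. It is my illustration of the procedure described above, not code from the original post, and the `income` and `ownratio` array names are assumptions:

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(x, y, n_omit):
    """Goldfeld-Quandt test for a simple regression of y on x."""
    # Step 1: sort the data by the suspect independent variable
    order = np.argsort(x)
    x, y = x[order], y[order]

    # Step 2: omit the middle observations
    n_group = (len(x) - n_omit) // 2

    # Step 3: run separate regressions on the low and high groups,
    # keeping only the error sum of squares (ESS) from each
    def ess(xg, yg):
        X = np.column_stack([np.ones(len(xg)), xg])
        _, resid_ss, *_ = np.linalg.lstsq(X, yg, rcond=None)
        return resid_ss[0]

    # Step 4: the ratio of the high-group ESS to the low-group ESS
    # follows an F distribution under homoscedasticity
    f_stat = ess(x[-n_group:], y[-n_group:]) / ess(x[:n_group], y[:n_group])
    df = n_group - 2   # n - k - 1, with k = 1 regressor
    return f_stat, df, stats.f.sf(f_stat, df, df)

# e.g. f_stat, df, p = goldfeld_quandt(income, ownratio, n_omit=13)
# With the 59 census tracts, this should reproduce the F-ratio of
# roughly 11.85 on (21, 21) degrees of freedom.
```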

The Breusch-Pagan Test

The Breusch-Pagan test is also pretty simple, but it’s a very powerful test, in that it can be used to detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ²) to determine the presence of heteroscedasticity, and it works best with large samples. There are five steps to the Breusch-Pagan test:

Step #1: Run the regular regression model and collect the residuals

Step #2: Estimate the variance of the regression residuals

To do this, we square each residual, sum the squares, and then divide by the number of observations. Our formula is:

σ̂² = Σe² / n

Our residuals and their squares are as follows:

| Observation | Predicted Ownratio | Residual | Residual Squared |
| --- | --- | --- | --- |
| 1 | 5.165 | 2.055 | 4.222 |
| 2 | 1.300 | (0.206) | 0.043 |
| 3 | 3.504 | 0.083 | 0.007 |
| 4 | 3.821 | 1.458 | 2.126 |
| 5 | 3.749 | (0.241) | 0.058 |
| 6 | 2.331 | (1.542) | 2.378 |
| 7 | 2.174 | (0.337) | 0.113 |
| 8 | 3.358 | 1.792 | 3.209 |
| 9 | 3.466 | (1.265) | 1.601 |
| 10 | 4.135 | (2.203) | 4.852 |
| 11 | 2.249 | (1.330) | 1.769 |
| 12 | 2.415 | (0.517) | 0.267 |
| 13 | 1.428 | 0.156 | 0.024 |
| 14 | 0.763 | 0.138 | 0.019 |
| 15 | (0.712) | 0.840 | 0.705 |
| 16 | 0.184 | (0.125) | 0.016 |
| 17 | (0.917) | 0.939 | 0.881 |
| 18 | (0.617) | 0.789 | 0.622 |
| 19 | 0.608 | 0.308 | 0.095 |
| 20 | 1.662 | (0.397) | 0.158 |
| 21 | 1.230 | (0.211) | 0.045 |
| 22 | 1.548 | 0.150 | 0.022 |
| 23 | 1.586 | 0.602 | 0.362 |
| 24 | 2.287 | 0.563 | 0.317 |
| 25 | 2.503 | 0.546 | 0.298 |
| 26 | 1.983 | 0.324 | 0.105 |
| 27 | 1.410 | (0.537) | 0.288 |
| 28 | 0.860 | (0.450) | 0.203 |
| 29 | 1.911 | (0.760) | 0.577 |
| 30 | 1.990 | (0.716) | 0.513 |
| 31 | 2.459 | (0.708) | 0.502 |
| 32 | 3.388 | 1.686 | 2.841 |
| 33 | 2.948 | 1.324 | 1.754 |
| 34 | 2.833 | 1.035 | 1.071 |
| 35 | 2.188 | (0.179) | 0.032 |
| 36 | 3.527 | (1.271) | 1.615 |
| 37 | 3.191 | (0.720) | 0.518 |
| 38 | 1.993 | 1.026 | 1.052 |
| 39 | 2.469 | (0.315) | 0.099 |
| 40 | 4.276 | 0.914 | 0.835 |
| 41 | 3.497 | 1.082 | 1.171 |
| 42 | 4.242 | (0.525) | 0.275 |
| 43 | 4.571 | (0.851) | 0.724 |
| 44 | 4.453 | 1.674 | 2.802 |
| 45 | 3.589 | 0.879 | 0.773 |
| 46 | 2.790 | (0.680) | 0.463 |
| 47 | 1.580 | (0.798) | 0.637 |
| 48 | 0.699 | (0.440) | 0.194 |
| 49 | 2.800 | (1.567) | 2.454 |
| 50 | 4.761 | (1.473) | 2.169 |
| 51 | 0.506 | (0.271) | 0.074 |
| 52 | 4.359 | (2.953) | 8.720 |
| 53 | 3.605 | (1.399) | 1.956 |
| 54 | 5.118 | 0.532 | 0.283 |
| 55 | 3.158 | 1.920 | 3.686 |
| 56 | 4.080 | (2.647) | 7.008 |
| 57 | 4.371 | 3.081 | 9.492 |
| 58 | 3.647 | 2.091 | 4.373 |
| 59 | 1.714 | (0.350) | 0.122 |

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.

Step #3: Compute the square of the standardized residuals

Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, giving us our squared standardized residuals, si²:

| Obs. | Predicted Ownratio | Residual | Standardized Residual | Square of Standardized Residual |
| --- | --- | --- | --- | --- |
| 1 | 5.165 | 2.055 | 1.450 | 2.103 |
| 2 | 1.300 | (0.206) | (0.146) | 0.021 |
| 3 | 3.504 | 0.083 | 0.058 | 0.003 |
| 4 | 3.821 | 1.458 | 1.029 | 1.059 |
| 5 | 3.749 | (0.241) | (0.170) | 0.029 |
| 6 | 2.331 | (1.542) | (1.088) | 1.185 |
| 7 | 2.174 | (0.337) | (0.238) | 0.057 |
| 8 | 3.358 | 1.792 | 1.264 | 1.599 |
| 9 | 3.466 | (1.265) | (0.893) | 0.797 |
| 10 | 4.135 | (2.203) | (1.555) | 2.417 |
| 11 | 2.249 | (1.330) | (0.939) | 0.881 |
| 12 | 2.415 | (0.517) | (0.365) | 0.133 |
| 13 | 1.428 | 0.156 | 0.110 | 0.012 |
| 14 | 0.763 | 0.138 | 0.097 | 0.009 |
| 15 | (0.712) | 0.840 | 0.593 | 0.351 |
| 16 | 0.184 | (0.125) | (0.088) | 0.008 |
| 17 | (0.917) | 0.939 | 0.662 | 0.439 |
| 18 | (0.617) | 0.789 | 0.557 | 0.310 |
| 19 | 0.608 | 0.308 | 0.217 | 0.047 |
| 20 | 1.662 | (0.397) | (0.280) | 0.079 |
| 21 | 1.230 | (0.211) | (0.149) | 0.022 |
| 22 | 1.548 | 0.150 | 0.106 | 0.011 |
| 23 | 1.586 | 0.602 | 0.425 | 0.180 |
| 24 | 2.287 | 0.563 | 0.397 | 0.158 |
| 25 | 2.503 | 0.546 | 0.385 | 0.148 |
| 26 | 1.983 | 0.324 | 0.229 | 0.052 |
| 27 | 1.410 | (0.537) | (0.379) | 0.143 |
| 28 | 0.860 | (0.450) | (0.318) | 0.101 |
| 29 | 1.911 | (0.760) | (0.536) | 0.288 |
| 30 | 1.990 | (0.716) | (0.505) | 0.255 |
| 31 | 2.459 | (0.708) | (0.500) | 0.250 |
| 32 | 3.388 | 1.686 | 1.190 | 1.415 |
| 33 | 2.948 | 1.324 | 0.935 | 0.874 |
| 34 | 2.833 | 1.035 | 0.730 | 0.534 |
| 35 | 2.188 | (0.179) | (0.127) | 0.016 |
| 36 | 3.527 | (1.271) | (0.897) | 0.805 |
| 37 | 3.191 | (0.720) | (0.508) | 0.258 |
| 38 | 1.993 | 1.026 | 0.724 | 0.524 |
| 39 | 2.469 | (0.315) | (0.222) | 0.049 |
| 40 | 4.276 | 0.914 | 0.645 | 0.416 |
| 41 | 3.497 | 1.082 | 0.764 | 0.584 |
| 42 | 4.242 | (0.525) | (0.370) | 0.137 |
| 43 | 4.571 | (0.851) | (0.600) | 0.361 |
| 44 | 4.453 | 1.674 | 1.182 | 1.396 |
| 45 | 3.589 | 0.879 | 0.621 | 0.385 |
| 46 | 2.790 | (0.680) | (0.480) | 0.231 |
| 47 | 1.580 | (0.798) | (0.563) | 0.317 |
| 48 | 0.699 | (0.440) | (0.311) | 0.097 |
| 49 | 2.800 | (1.567) | (1.106) | 1.223 |
| 50 | 4.761 | (1.473) | (1.040) | 1.081 |
| 51 | 0.506 | (0.271) | (0.192) | 0.037 |
| 52 | 4.359 | (2.953) | (2.084) | 4.344 |
| 53 | 3.605 | (1.399) | (0.987) | 0.974 |
| 54 | 5.118 | 0.532 | 0.375 | 0.141 |
| 55 | 3.158 | 1.920 | 1.355 | 1.836 |
| 56 | 4.080 | (2.647) | (1.868) | 3.491 |
| 57 | 4.371 | 3.081 | 2.175 | 4.728 |
| 58 | 3.647 | 2.091 | 1.476 | 2.179 |
| 59 | 1.714 | (0.350) | (0.247) | 0.061 |

Step #4: Run another regression with all your independent variables, using the squared standardized residuals as the dependent variable

In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.

Step #5: Divide the RSS by 2 and compare with the χ2 table’s critical value for the appropriate degrees of freedom

Dividing the RSS by 2, we get 7.747. We look up the critical χ² value for one degree of freedom at a 5% significance level, and get 3.84. Since our χ² value exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
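
For readers who prefer code, here is a sketch of those five steps in Python. It follows this post’s convention of standardizing each residual by the variance estimate itself before squaring; the `income` and `ownratio` array names are again my own assumptions:

```python
import numpy as np
from scipy import stats

def breusch_pagan(x, y):
    """Breusch-Pagan test, following the five steps in the post."""
    X = np.column_stack([np.ones(len(x)), x])

    # Step 1: run the regular regression and collect the residuals
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Step 2: estimate the residual variance (sum of squares over n)
    var_e = np.sum(resid ** 2) / len(y)

    # Step 3: square of the standardized residuals
    s2 = (resid / var_e) ** 2   # the post divides by the variance itself

    # Step 4: regress s2 on the independent variable(s) and take the
    # regression sum of squares (RSS) of that auxiliary regression
    g, *_ = np.linalg.lstsq(X, s2, rcond=None)
    rss = np.sum((X @ g - s2.mean()) ** 2)

    # Step 5: RSS / 2 is compared against chi-squared, with degrees of
    # freedom equal to the number of independent variables (here, 1)
    chi2 = rss / 2.0
    return chi2, stats.chi2.sf(chi2, df=X.shape[1] - 1)

# e.g. chi2, p = breusch_pagan(income, ownratio)
# This should give chi2 ≈ 7.75, matching the 7.747 computed above.
```

(If you use statsmodels, `statsmodels.stats.diagnostic.het_breuschpagan` implements a standard version of this test in a single call.)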

The Park Test

Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model’s independent variables. The steps involved are as follows:

Step #1: Run your original regression model and collect the residuals

Done.

Step #2: Square the regression residuals, then take the logs of the squared residuals and of the suspected independent variable’s values

We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:

| Tract | Residual Squared | Ln(Residual Squared) | Ln(Income) |
| --- | --- | --- | --- |
| 601 | 4.222 | 1.440 | 10.123 |
| 602 | 0.043 | (3.157) | 9.382 |
| 603 | 0.007 | (4.987) | 9.868 |
| 604 | 2.126 | 0.754 | 9.922 |
| 605 | 0.058 | (2.848) | 9.910 |
| 606 | 2.378 | 0.866 | 9.639 |
| 607 | 0.113 | (2.176) | 9.604 |
| 608 | 3.209 | 1.166 | 9.842 |
| 609 | 1.601 | 0.470 | 9.862 |
| 609 | 4.852 | 1.579 | 9.973 |
| 610 | 1.769 | 0.571 | 9.621 |
| 611 | 0.267 | (1.320) | 9.657 |
| 612 | 0.024 | (3.720) | 9.418 |
| 613 | 0.019 | (3.960) | 9.217 |
| 614 | 0.705 | (0.349) | 8.535 |
| 615 | 0.016 | (4.162) | 9.001 |
| 616 | 0.881 | (0.127) | 8.389 |
| 616 | 0.622 | (0.475) | 8.596 |
| 617 | 0.095 | (2.356) | 9.163 |
| 618 | 0.158 | (1.847) | 9.480 |
| 619 | 0.045 | (3.112) | 9.362 |
| 620 | 0.022 | (3.796) | 9.450 |
| 621 | 0.362 | (1.015) | 9.460 |
| 623 | 0.317 | (1.148) | 9.629 |
| 624 | 0.298 | (1.211) | 9.676 |
| 625 | 0.105 | (2.255) | 9.559 |
| 626 | 0.288 | (1.245) | 9.413 |
| 627 | 0.203 | (1.596) | 9.249 |
| 628 | 0.577 | (0.549) | 9.542 |
| 629 | 0.513 | (0.668) | 9.561 |
| 630 | 0.502 | (0.689) | 9.667 |
| 631 | 2.841 | 1.044 | 9.848 |
| 632 | 1.754 | 0.562 | 9.766 |
| 633 | 1.071 | 0.069 | 9.744 |
| 634 | 0.032 | (3.437) | 9.607 |
| 635 | 1.615 | 0.479 | 9.872 |
| 701 | 0.518 | (0.658) | 9.812 |
| 705 | 1.052 | 0.051 | 9.562 |
| 706 | 0.099 | (2.309) | 9.669 |
| 710 | 0.835 | (0.180) | 9.995 |
| 711 | 1.171 | 0.158 | 9.867 |
| 712 | 0.275 | (1.289) | 9.989 |
| 713 | 0.724 | (0.323) | 10.039 |
| 713 | 2.802 | 1.030 | 10.022 |
| 714 | 0.773 | (0.257) | 9.883 |
| 714 | 0.463 | (0.770) | 9.735 |
| 718 | 0.637 | (0.452) | 9.459 |
| 718 | 0.194 | (1.640) | 9.195 |
| 719 | 2.454 | 0.898 | 9.737 |
| 719 | 2.169 | 0.774 | 10.067 |
| 720 | 0.074 | (2.608) | 9.127 |
| 721 | 8.720 | 2.166 | 10.007 |
| 721 | 1.956 | 0.671 | 9.886 |
| 724 | 0.283 | (1.263) | 10.117 |
| 726 | 3.686 | 1.305 | 9.806 |
| 728 | 7.008 | 1.947 | 9.964 |
| 731 | 9.492 | 2.250 | 10.009 |
| 731 | 4.373 | 1.476 | 9.893 |
| 735 | 0.122 | (2.102) | 9.493 |

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable

That results in the following regression equation:

Ln(e²) = 1.957(LnIncome) – 19.592

Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.

The parameter estimate for LnIncome is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
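
A Python sketch of the Park test is even shorter. Again, this is my illustration of the steps above, with assumed array names:

```python
import numpy as np
from scipy import stats

def park_test(x, y):
    """Park test: regress ln(e^2) on ln(x) and inspect the slope."""
    # Step 1: run the original regression and collect the residuals
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Steps 2-3: log of squared residuals regressed on log of x
    slope, intercept, _, p_value, stderr = stats.linregress(
        np.log(x), np.log(resid ** 2))

    # Step 4: a significant t-ratio on the slope signals heteroscedasticity
    return slope, slope / stderr, p_value

# e.g. slope, t_ratio, p = park_test(income, ownratio)
# With the census-tract data this should recover approximately
# Ln(e²) = 1.957(LnIncome) – 19.592, with t ≈ 3.5.
```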

Next Forecast Friday Topic: Correcting Heteroscedasticity

Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week’s Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity: weighted least squares (WLS) regression, and show you how to apply it. Next week’s Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.


### Forecast Friday Topic: Simple Regression Analysis (Continued)

June 3, 2010

(Seventh in a series)

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm’s sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameters, α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination – R² – which indicates how much of the change in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then talk about a couple of other things so that we can move on to next week’s topic on multiple regression.

Revisiting Sue Stone

Last week, we helped Sue Stone develop a model using simple regression analysis, so that she could forecast sales. She had 12 months of sales data, which was her dependent variable, or Y, and each month (numbered from 1 to 12) was her independent variable, or X. Sue’s regression equation was as follows:

Ŷi = $9,636.36 + $479.02Xi

where i is the period number corresponding to the month. So, in June 2009, i would be equal to 6; in January 2010, i would be equal to 13. Of course, since X is the month number, X = i in this example. Recall that Sue’s equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue’s R² = .917, which says 91.7% of the variation in Sue’s monthly sales is explained by the passing months.

Are these claims valid? We need to do some further work here.

Are the Parameter Estimates Statistically Significant?

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to the population. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the population’s true average and standard deviation, and they are subject to sampling error. Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate the population parameters (the population’s true α and β), we must be concerned that they might actually have values of zero. Even though we have derived the values a = $9,636.36 and b = $479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H0: β = 0

HA: β ≠ 0

Our first step here is to compute the standard error of the estimate – that is, how spread out the actual values of the dependent variable (sales) are around the values the regression predicts. Since we are sampling from a population, we are looking for the estimator of the standard error of the estimate. That equation is:

sε = √(ESS / (n – k – 1))

where ESS is the error sum of squares – $2,937,062.94 – from Sue’s equation; n is the sample size, 12; and k is the number of independent variables in the model, in this case just 1. When we plug those numbers into the equation, we divide the ESS by 10 and take the square root, so Sue’s estimator is:

sε = $541.95

Now that we know the estimator for the standard error of the estimate, we need to use that to find the estimator for the standard deviation of the regression slope (b). That equation is given by:

sb = sε / √(Σ(Xi – X̄)²)

Remember from last week’s blog post that the sum of all the (X – X̄) squared values was 143. Since we have the estimator for the standard error of the estimate, we divide $541.95 by the square root of 143 to get sb = $45.32. Next we need to compute the t-statistic. If Sue’s t-statistic is greater in magnitude than her critical t-value, then she’ll know the parameter estimate of $479.02 is significant. In Sue’s regression, she has 12 observations, and thus 10 degrees of freedom: (n – k – 1) = (12 – 1 – 1) = 10. At a 95% confidence level, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t-value is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimate is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b coefficient ($479.02) by its standard error, sb ($45.32). She ends up with a t-statistic of 10.57, which far exceeds the critical value, so her slope is significant.

Next, Sue must do the same for her intercept value, a. To do this, Sue must compute the estimator of the standard deviation of the intercept. The equation for this estimator is:

sa = sε × √(1/n + X̄² / Σ(Xi – X̄)²)

All she needs to do is plug in her numbers from earlier: sε = $541.95; n = 12; her average X̄ of 6.5, squared, is 42.25; and the denominator is the same 143. Working that all in, Sue gets a standard error of 333.545. She divides her intercept value of $9,636.36 by 333.545 and gets a t-statistic of 28.891, which exceeds the critical t of 2.228, so her intercept is also significant.
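
If you’d like to verify these computations, here is a short Python sketch that reproduces them from Sue’s twelve observations. It is my own illustration of the formulas above, not code from the original post:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 13)                      # month numbers 1..12
y = np.array([10000, 11000, 10500, 11500, 12500, 12000,
              14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

n, k = len(x), 1
sxx = np.sum((x - x.mean()) ** 2)         # 143
b = np.sum((x - x.mean()) * (y - y.mean())) / sxx   # ≈ 479.02
a = y.mean() - b * x.mean()               # ≈ 9,636.36

ess = np.sum((y - (a + b * x)) ** 2)      # ≈ 2,937,062.94
s_e = np.sqrt(ess / (n - k - 1))          # ≈ 541.95

s_b = s_e / np.sqrt(sxx)                  # ≈ 45.32
s_a = s_e * np.sqrt(1 / n + x.mean() ** 2 / sxx)    # ≈ 333.55

t_b, t_a = b / s_b, a / s_a               # ≈ 10.57 and 28.89
t_crit = stats.t.ppf(0.975, df=n - k - 1) # ≈ 2.228
print(abs(t_b) > t_crit, abs(t_a) > t_crit)   # True True: both significant
```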

Prediction Intervals in Forecasting

Whew! Aren’t you glad those t-statistic calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it’s very important that you understand how they were derived and the theory behind them. Now, we move back to forecasting. In last week’s post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X, and got a point forecast for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a prediction interval. A prediction interval is simply the point forecast plus or minus the critical t-value (2.228) for the desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). So, Sue’s prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

= $14,656.18 to $17,071.10

So, since Sue had chosen a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10.
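
Here is the same interval in a few lines of Python, using the rounded figures derived in the post. Note that the post uses the simplified interval – point forecast ± t × sε – without the usual adjustment for the forecast’s distance from the mean of X, and I follow that convention here:

```python
from scipy import stats

a, b, s_e = 9636.36, 479.02, 541.95       # values derived in the post
t_crit = stats.t.ppf(0.975, df=10)        # ≈ 2.228

point = a + b * 13                        # January 2010 is period 13
low, high = point - t_crit * s_e, point + t_crit * s_e
print(f"${low:,.2f} to ${high:,.2f}")     # ≈ $14,656 to $17,071
```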

Recap and Plan for Next Week’s Post

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You also learned how to compute the estimator of the standard error of the estimate, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics you need to determine whether those parameter estimates were indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess the model against those assumptions, and generate projections.

### Forecast Friday Topic: Simple Regression Analysis

May 27, 2010

(Sixth in a series)

Today, we begin our discussion of regression analysis as a time series forecasting tool. This discussion will take the next few weeks, as there is much behind it. As always, I will make sure everything is simplified and easy for you to digest. Regression is a powerful tool that can be very helpful for mid- and long-range forecasting. Quite often, the business decisions we make require us to consider relationships between two or more variables. Rarely can we make changes to our promotion, pricing, and/or product development strategies without them having an impact of some kind on our sales. Just how big an impact would that be? How do we measure the relationship between two or more variables? And does a real relationship even exist between those variables? Regression analysis helps us find out.

One thing I must point out: Remember the “deviations” we discussed in the posts on moving average and exponential smoothing techniques: The difference between the forecasted and actual values for each observation, of which we took the absolute value? Good. In regression analysis, we refer to the deviations as the “error terms” or “residuals.” In regression analysis, the residuals – which we will square, rather than take the absolute value – become very important in gauging the regression model’s accuracy, validity, efficiency, and “goodness of fit.”

Simple Linear Regression Analysis

Sue Stone, owner of Stone & Associates, looked at her CPA practice’s monthly receipts from January to December 2009. The sales were as follows:

| Month | Sales |
| --- | --- |
| January | $10,000 |
| February | $11,000 |
| March | $10,500 |
| April | $11,500 |
| May | $12,500 |
| June | $12,000 |
| July | $14,000 |
| August | $13,000 |
| September | $13,500 |
| October | $15,000 |
| November | $14,500 |
| December | $15,500 |

Sue is trying to predict what sales will be for each month in the first quarter of 2010, but is unsure of how to go about it. Moving average and exponential smoothing techniques rarely go more than one period ahead. So, what is Sue to do?

When we are presented with a set of numbers, one of the ways we try to make sense of it is by taking its average. Perhaps Sue can average all 12 months’ sales – $12,750 – and use that as her forecast for each of the next three months. But how accurately would that have measured each month of 2009? How spread out are each month’s sales from the average? Sue subtracts the average from each month’s sales and examines the differences:

| Month | Sales | Sales Less Average Sales |
| --- | --- | --- |
| January | $10,000 | -$2,750 |
| February | $11,000 | -$1,750 |
| March | $10,500 | -$2,250 |
| April | $11,500 | -$1,250 |
| May | $12,500 | -$250 |
| June | $12,000 | -$750 |
| July | $14,000 | $1,250 |
| August | $13,000 | $250 |
| September | $13,500 | $750 |
| October | $15,000 | $2,250 |
| November | $14,500 | $1,750 |
| December | $15,500 | $2,750 |

Sue notices that the error between actual and average is quite high in both the first four months of 2009 and in the last three months of 2009. She wants to understand the overall error in using the average as a forecast of sales. However, when she sums up all the errors from month to month, Sue finds they sum to zero. That tells her nothing. So she squares each month’s error value and sums them:

| Month | Sales | Error | Error Squared |
| --- | --- | --- | --- |
| January | $10,000 | -$2,750 | $7,562,500 |
| February | $11,000 | -$1,750 | $3,062,500 |
| March | $10,500 | -$2,250 | $5,062,500 |
| April | $11,500 | -$1,250 | $1,562,500 |
| May | $12,500 | -$250 | $62,500 |
| June | $12,000 | -$750 | $562,500 |
| July | $14,000 | $1,250 | $1,562,500 |
| August | $13,000 | $250 | $62,500 |
| September | $13,500 | $750 | $562,500 |
| October | $15,000 | $2,250 | $5,062,500 |
| November | $14,500 | $1,750 | $3,062,500 |
| December | $15,500 | $2,750 | $7,562,500 |
| **Total Error** | | | **$35,750,000** |

In totaling these squared errors, Sue derives the total sum of squares (TSS): 35,750,000. Is there any way she can improve upon that? Sue thinks for a while. She doesn’t know much more about her 2009 sales except the month in which they were generated, so she plots the sales on a chart.

Sue notices that sales by month appear to be on an upward trend. Sue thinks for a moment. “All I know is the sales and the month,” she says to herself, “How can I develop a model to forecast accurately?” Sue reads about a statistical procedure called regression analysis and, seeing that each month’s sales is in sequential order, she wonders whether the mere passage of time simply causes sales to go higher. Sue numbers each month, with January assigned a 1 and December, a 12.

She also realizes that she is trying to predict sales with each passing month, so she hypothesizes that the change in sales depends on the change in the month. Hence, sales is Sue’s dependent variable. Because the month number is used to estimate the change in sales, it is her independent variable. In regression analysis, the relationship between an independent and a dependent variable is expressed as:

Y = α + βX + ε

Where: Y is the value of the dependent variable

X is the value of the independent variable

α is a population parameter, called the intercept, which would be the value of Y when X=0

β is also a population parameter – the slope of the regression line – representing the change in Y associated with each one-unit change in X.

ε is the error term.

Sue further reads that the goal of regression analysis is to minimize the error sum of squares, which is why it is referred to as ordinary least squares (OLS) regression. She also notices that she is building her regression on a sample, so there is a sample regression equation used to estimate what the true regression is for the population:

Ŷi = a + bXi

Essentially, the equation is the same as the one above; however, the terms indicate the sample. Ŷi (called “Y hat”) is the sample forecasted value of the dependent variable (sales) at period i; a is the sample estimate of α; b is the sample estimate of β; Xi is the value of the independent variable at period i; and ei is the error, or difference between Ŷi (the forecasted value) and actual Y for period i. Sue needs to find the values for a and b – the estimates of the population parameters – that minimize the error sum of squares.

Sue reads that the equations for estimating a and b are derived from calculus, but expressed algebraically as:

b = Σ(Xi – X̄)(Yi – Ȳ) / Σ(Xi – X̄)²

a = Ȳ – bX̄

Sue learns that the X and Y terms with lines above them, known as “X bar” and “Y bar,” are the averages of all the X and Y values, respectively. She also reads that the Σ notation – the Greek letter sigma – represents a sum. Hence, Sue realizes a few things:

1. She must estimate b before she can estimate a.
2. To estimate b, she must take care of the numerator:
	a. first, subtract the average month number from each observation’s month number (X minus X-bar);
	b. next, subtract average sales from each observation’s sales (Y minus Y-bar);
	c. multiply those two differences together; and
	d. add up the products from (2c) for all observations.
3. To get the denominator for calculating b, she must:
	a. again subtract X-bar from X, but this time square the difference, for each observation; and
	b. sum the squared differences.
4. Calculating b is easy: she needs only to divide the result from (2) by the result from (3).
5. Calculating a is also easy: she multiplies her b value by the average month (X-bar) and subtracts the result from average sales (Y-bar).

Sue now goes to work to compute her regression equation. She goes into Excel and enters her monthly sales data in a table, and computes the averages for sales and month number:

| Month (X) | Sales (Y) |
| --- | --- |
| 1 | $10,000 |
| 2 | $11,000 |
| 3 | $10,500 |
| 4 | $11,500 |
| 5 | $12,500 |
| 6 | $12,000 |
| 7 | $14,000 |
| 8 | $13,000 |
| 9 | $13,500 |
| 10 | $15,000 |
| 11 | $14,500 |
| 12 | $15,500 |
| **Average: 6.5** | **$12,750** |

Sue goes ahead and subtracts the X and Y values from their respective averages, and computes the components she needs (the “Product” is the result of multiplying the values in the first two columns together):

| X minus X-bar | Y minus Y-bar | Product | (X minus X-bar) Squared |
| --- | --- | --- | --- |
| -5.5 | -$2,750 | $15,125 | 30.25 |
| -4.5 | -$1,750 | $7,875 | 20.25 |
| -3.5 | -$2,250 | $7,875 | 12.25 |
| -2.5 | -$1,250 | $3,125 | 6.25 |
| -1.5 | -$250 | $375 | 2.25 |
| -0.5 | -$750 | $375 | 0.25 |
| 0.5 | $1,250 | $625 | 0.25 |
| 1.5 | $250 | $375 | 2.25 |
| 2.5 | $750 | $1,875 | 6.25 |
| 3.5 | $2,250 | $7,875 | 12.25 |
| 4.5 | $1,750 | $7,875 | 20.25 |
| 5.5 | $2,750 | $15,125 | 30.25 |
| **Total** | | **$68,500** | **143** |

Sue computes b:

b = $68,500 / 143

= $479.02

Now that Sue knows b, she calculates a:

a = $12,750 – $479.02(6.5)

= $12,750 – $3,113.64

= $9,636.36

Hence, with the expected error equal to zero, Sue’s least-squares regression equation is:

Ŷ = $9,636.36 + $479.02X

Forecasted Sales = $9,636.36 + $479.02 × Month number
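
If you’d rather let a script do Sue’s spreadsheet arithmetic, here is a minimal sketch of the same least-squares computation – my own illustration, not part of the original post:

```python
import numpy as np

x = np.arange(1, 13)                      # month numbers (X)
y = np.array([10000, 11000, 10500, 11500, 12500, 12000,
              14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

numerator = np.sum((x - x.mean()) * (y - y.mean()))   # 68,500
denominator = np.sum((x - x.mean()) ** 2)             # 143
b = numerator / denominator                           # ≈ 479.02
a = y.mean() - b * x.mean()                           # ≈ 9,636.36

print(f"Forecasted Sales = {a:,.2f} + {b:,.2f} * Month")
```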

This means that each passing month is associated with an average increase in sales of $479.02 for Sue’s CPA firm. How accurately does this regression model predict sales? Sue estimates the error by plugging each month’s number into the equation and then comparing her forecast for that month with the actual sales:

| Month (X) | Sales (Y) | Forecasted Sales | Error |
| --- | --- | --- | --- |
| 1 | $10,000 | $10,115.38 | -$115.38 |
| 2 | $11,000 | $10,594.41 | $405.59 |
| 3 | $10,500 | $11,073.43 | -$573.43 |
| 4 | $11,500 | $11,552.45 | -$52.45 |
| 5 | $12,500 | $12,031.47 | $468.53 |
| 6 | $12,000 | $12,510.49 | -$510.49 |
| 7 | $14,000 | $12,989.51 | $1,010.49 |
| 8 | $13,000 | $13,468.53 | -$468.53 |
| 9 | $13,500 | $13,947.55 | -$447.55 |
| 10 | $15,000 | $14,426.57 | $573.43 |
| 11 | $14,500 | $14,905.59 | -$405.59 |
| 12 | $15,500 | $15,384.62 | $115.38 |

Sue’s actual and forecasted sales appear to be pretty close, except for her July estimate, which is off by a little over $1,000. But does her model predict better than if she simply used average sales as her forecast for each month? To find out, she must compute the error sum of squares (ESS): she squares the error term for each observation and sums the squares:

ESS = Σe²

| Error | Squared Error |
| --- | --- |
| -$115.38 | $13,313.61 |
| $405.59 | $164,506.82 |
| -$573.43 | $328,818.04 |
| -$52.45 | $2,750.75 |
| $468.53 | $219,521.74 |
| -$510.49 | $260,599.54 |
| $1,010.49 | $1,021,089.05 |
| -$468.53 | $219,521.74 |
| -$447.55 | $200,303.19 |
| $573.43 | $328,818.04 |
| -$405.59 | $164,506.82 |
| $115.38 | $13,313.61 |
| **ESS** | **$2,937,062.94** |

Notice Sue’s error sum of squares. This is the error, or unexplained, sum of squared deviations between the forecasted and actual sales. The difference between the total sum of squares (TSS) and the error sum of squares (ESS) is the regression sum of squares, RSS, and that is the sum of squared deviations that are explained by the regression. RSS can also be calculated directly, by squaring and summing the difference between each forecasted value of sales and average sales:

| Forecasted Sales | Average Sales | Regression Error | Reg. Error Squared |
| --- | --- | --- | --- |
| $10,115.38 | $12,750 | -$2,634.62 | $6,941,198.22 |
| $10,594.41 | $12,750 | -$2,155.59 | $4,646,587.24 |
| $11,073.43 | $12,750 | -$1,676.57 | $2,810,898.45 |
| $11,552.45 | $12,750 | -$1,197.55 | $1,434,131.86 |
| $12,031.47 | $12,750 | -$718.53 | $516,287.47 |
| $12,510.49 | $12,750 | -$239.51 | $57,365.27 |
| $12,989.51 | $12,750 | $239.51 | $57,365.27 |
| $13,468.53 | $12,750 | $718.53 | $516,287.47 |
| $13,947.55 | $12,750 | $1,197.55 | $1,434,131.86 |
| $14,426.57 | $12,750 | $1,676.57 | $2,810,898.45 |
| $14,905.59 | $12,750 | $2,155.59 | $4,646,587.24 |
| $15,384.62 | $12,750 | $2,634.62 | $6,941,198.22 |
| **RSS** | | | **$32,812,937.06** |

Sue immediately adds the RSS and the ESS and sees they match the TSS: $35,750,000. She also knows that nearly 33 million of that TSS is explained by her regression model, so she divides her RSS by the TSS:

32,812,937.06 / 35,750,000

= .917, or 91.7%

This quotient, known as the coefficient of determination and denoted R², tells Sue that the passing months explain 91.7% of the variation in her monthly sales. Put another way, Sue reduced her squared forecast error by 91.7% by using this simple model instead of the simple average. As you will find out in subsequent blog posts, maximizing R² isn’t the “be all and end all.” In fact, there is still much to do with this model, which will be discussed in next week’s Forecast Friday post. But for now, Sue’s model seems to have reduced a great deal of error.
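
The whole sum-of-squares decomposition takes only a few lines of code. A sketch, again my own illustration using Sue’s data:

```python
import numpy as np

x = np.arange(1, 13)
y = np.array([10000, 11000, 10500, 11500, 12500, 12000,
              14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                         # fitted (forecasted) sales

tss = np.sum((y - y.mean()) ** 2)         # 35,750,000
ess = np.sum((y - y_hat) ** 2)            # ≈ 2,937,062.94
rss = np.sum((y_hat - y.mean()) ** 2)     # ≈ 32,812,937.06

print(np.isclose(ess + rss, tss))         # True: ESS + RSS = TSS
print(rss / tss)                          # R² ≈ 0.917
```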

It is important to note that while each month does seem to be related to sales, the passing months do not cause the increase in sales. Correlation does not mean causation. There could be something behind the scenes (e.g., Sue’s advertising, or the types of projects she works on, etc.) that is driving the upward trend in her sales.

Using the Regression Equation to Forecast Sales

Now Sue can use the same model to forecast sales for January 2010, February 2010, and beyond. Since January 2010 is period 13, she plugs 13 into her equation for X and gets a forecast of $15,863.64; for February (period 14), she gets $16,342.66.
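
Generating these point forecasts in code is a one-liner once the coefficients are known. A sketch using the rounded values from the post (the pennies differ slightly from the post’s figures, which were computed from unrounded coefficients):

```python
a, b = 9636.36, 479.02                    # Sue's estimated coefficients
for period in (13, 14):                   # January and February 2010
    print(period, round(a + b * period, 2))   # ≈ 15,863.62 and 16,342.64
```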

Recap and Plan for Next Week

You have now learned the basics of simple regression analysis. You have learned how to estimate the parameters for the regression equation, how to measure the improvement in accuracy from the regression model, and how to generate forecasts. Next week, we will be checking the validity of Sue’s equation, and discussing the important assumptions underlying regression analysis. Until then, you have a basic overview of what regression analysis is.