## Posts Tagged ‘error terms’

### Forecast Friday Topic: Detecting Heteroscedasticity – Analytical Approaches

August 19, 2010

(Eighteenth in a series)

Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are not biased, but rather inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.

Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but can be a bit tedious to employ.

Reviewing Our Model

Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which pulled INCOME and OWNRATIOs from 59 census tracts in Pierce County, Washington, which were compiled during the 1980 Census. We had the following data:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 601 | \$24,909 | 7.220 |
| 602 | \$11,875 | 1.094 |
| 603 | \$19,308 | 3.587 |
| 604 | \$20,375 | 5.279 |
| 605 | \$20,132 | 3.508 |
| 606 | \$15,351 | 0.789 |
| 607 | \$14,821 | 1.837 |
| 608 | \$18,816 | 5.150 |
| 609 | \$19,179 | 2.201 |
| 609 | \$21,434 | 1.932 |
| 610 | \$15,075 | 0.919 |
| 611 | \$15,634 | 1.898 |
| 612 | \$12,307 | 1.584 |
| 613 | \$10,063 | 0.901 |
| 614 | \$5,090 | 0.128 |
| 615 | \$8,110 | 0.059 |
| 616 | \$4,399 | 0.022 |
| 616 | \$5,411 | 0.172 |
| 617 | \$9,541 | 0.916 |
| 618 | \$13,095 | 1.265 |
| 619 | \$11,638 | 1.019 |
| 620 | \$12,711 | 1.698 |
| 621 | \$12,839 | 2.188 |
| 623 | \$15,202 | 2.850 |
| 624 | \$15,932 | 3.049 |
| 625 | \$14,178 | 2.307 |
| 626 | \$12,244 | 0.873 |
| 627 | \$10,391 | 0.410 |
| 628 | \$13,934 | 1.151 |
| 629 | \$14,201 | 1.274 |
| 630 | \$15,784 | 1.751 |
| 631 | \$18,917 | 5.074 |
| 632 | \$17,431 | 4.272 |
| 633 | \$17,044 | 3.868 |
| 634 | \$14,870 | 2.009 |
| 635 | \$19,384 | 2.256 |
| 701 | \$18,250 | 2.471 |
| 705 | \$14,212 | 3.019 |
| 706 | \$15,817 | 2.154 |
| 710 | \$21,911 | 5.190 |
| 711 | \$19,282 | 4.579 |
| 712 | \$21,795 | 3.717 |
| 713 | \$22,904 | 3.720 |
| 713 | \$22,507 | 6.127 |
| 714 | \$19,592 | 4.468 |
| 714 | \$16,900 | 2.110 |
| 718 | \$12,818 | 0.782 |
| 718 | \$9,849 | 0.259 |
| 719 | \$16,931 | 1.233 |
| 719 | \$23,545 | 3.288 |
| 720 | \$9,198 | 0.235 |
| 721 | \$22,190 | 1.406 |
| 721 | \$19,646 | 2.206 |
| 724 | \$24,750 | 5.650 |
| 726 | \$18,140 | 5.078 |
| 728 | \$21,250 | 1.433 |
| 731 | \$22,231 | 7.452 |
| 731 | \$19,788 | 5.738 |
| 735 | \$13,269 | 1.364 |

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

And we got the following regression equation:

Ŷ= 0.000297*Income – 2.221

With an R² of 0.597, an F-ratio of 84.31, the t-ratios for INCOME (9.182) and the intercept (-4.094) both solidly significant, and the positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.

The Goldfeld-Quandt Test

The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test postulates that the error variances associated with high values of the independent variable, X, are significantly different from those associated with low values. Essentially, you run separate regression analyses for the low values of X and the high values, and then compare their error sums of squares using an F-ratio.

The Goldfeld-Quandt test has four steps:

Step #1: Sort the data

Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 616 | \$4,399 | 0.022 |
| 614 | \$5,090 | 0.128 |
| 616 | \$5,411 | 0.172 |
| 615 | \$8,110 | 0.059 |
| 720 | \$9,198 | 0.235 |
| 617 | \$9,541 | 0.916 |
| 718 | \$9,849 | 0.259 |
| 613 | \$10,063 | 0.901 |
| 627 | \$10,391 | 0.410 |
| 619 | \$11,638 | 1.019 |
| 602 | \$11,875 | 1.094 |
| 626 | \$12,244 | 0.873 |
| 612 | \$12,307 | 1.584 |
| 620 | \$12,711 | 1.698 |
| 718 | \$12,818 | 0.782 |
| 621 | \$12,839 | 2.188 |
| 618 | \$13,095 | 1.265 |
| 735 | \$13,269 | 1.364 |
| 628 | \$13,934 | 1.151 |
| 625 | \$14,178 | 2.307 |
| 629 | \$14,201 | 1.274 |
| 705 | \$14,212 | 3.019 |
| 607 | \$14,821 | 1.837 |
| 634 | \$14,870 | 2.009 |
| 610 | \$15,075 | 0.919 |
| 623 | \$15,202 | 2.850 |
| 606 | \$15,351 | 0.789 |
| 611 | \$15,634 | 1.898 |
| 630 | \$15,784 | 1.751 |
| 706 | \$15,817 | 2.154 |
| 624 | \$15,932 | 3.049 |
| 714 | \$16,900 | 2.110 |
| 719 | \$16,931 | 1.233 |
| 633 | \$17,044 | 3.868 |
| 632 | \$17,431 | 4.272 |
| 726 | \$18,140 | 5.078 |
| 701 | \$18,250 | 2.471 |
| 608 | \$18,816 | 5.150 |
| 631 | \$18,917 | 5.074 |
| 609 | \$19,179 | 2.201 |
| 711 | \$19,282 | 4.579 |
| 603 | \$19,308 | 3.587 |
| 635 | \$19,384 | 2.256 |
| 714 | \$19,592 | 4.468 |
| 721 | \$19,646 | 2.206 |
| 731 | \$19,788 | 5.738 |
| 605 | \$20,132 | 3.508 |
| 604 | \$20,375 | 5.279 |
| 728 | \$21,250 | 1.433 |
| 609 | \$21,434 | 1.932 |
| 712 | \$21,795 | 3.717 |
| 710 | \$21,911 | 5.190 |
| 721 | \$22,190 | 1.406 |
| 731 | \$22,231 | 7.452 |
| 713 | \$22,507 | 6.127 |
| 713 | \$22,904 | 3.720 |
| 719 | \$23,545 | 3.288 |
| 724 | \$24,750 | 5.650 |
| 601 | \$24,909 | 7.220 |

Step #2: Omit the middle observations

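Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There’s no hard and fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit the 13 middle observations (marked “Omitted” in the table below), leaving the 23 lowest-income and the 23 highest-income tracts:

Housing Data

| Tract | Income | Ownratio | Group |
|---|---|---|---|
| 616 | \$4,399 | 0.022 | Low |
| 614 | \$5,090 | 0.128 | Low |
| 616 | \$5,411 | 0.172 | Low |
| 615 | \$8,110 | 0.059 | Low |
| 720 | \$9,198 | 0.235 | Low |
| 617 | \$9,541 | 0.916 | Low |
| 718 | \$9,849 | 0.259 | Low |
| 613 | \$10,063 | 0.901 | Low |
| 627 | \$10,391 | 0.410 | Low |
| 619 | \$11,638 | 1.019 | Low |
| 602 | \$11,875 | 1.094 | Low |
| 626 | \$12,244 | 0.873 | Low |
| 612 | \$12,307 | 1.584 | Low |
| 620 | \$12,711 | 1.698 | Low |
| 718 | \$12,818 | 0.782 | Low |
| 621 | \$12,839 | 2.188 | Low |
| 618 | \$13,095 | 1.265 | Low |
| 735 | \$13,269 | 1.364 | Low |
| 628 | \$13,934 | 1.151 | Low |
| 625 | \$14,178 | 2.307 | Low |
| 629 | \$14,201 | 1.274 | Low |
| 705 | \$14,212 | 3.019 | Low |
| 607 | \$14,821 | 1.837 | Low |
| 634 | \$14,870 | 2.009 | Omitted |
| 610 | \$15,075 | 0.919 | Omitted |
| 623 | \$15,202 | 2.850 | Omitted |
| 606 | \$15,351 | 0.789 | Omitted |
| 611 | \$15,634 | 1.898 | Omitted |
| 630 | \$15,784 | 1.751 | Omitted |
| 706 | \$15,817 | 2.154 | Omitted |
| 624 | \$15,932 | 3.049 | Omitted |
| 714 | \$16,900 | 2.110 | Omitted |
| 719 | \$16,931 | 1.233 | Omitted |
| 633 | \$17,044 | 3.868 | Omitted |
| 632 | \$17,431 | 4.272 | Omitted |
| 726 | \$18,140 | 5.078 | Omitted |
| 701 | \$18,250 | 2.471 | High |
| 608 | \$18,816 | 5.150 | High |
| 631 | \$18,917 | 5.074 | High |
| 609 | \$19,179 | 2.201 | High |
| 711 | \$19,282 | 4.579 | High |
| 603 | \$19,308 | 3.587 | High |
| 635 | \$19,384 | 2.256 | High |
| 714 | \$19,592 | 4.468 | High |
| 721 | \$19,646 | 2.206 | High |
| 731 | \$19,788 | 5.738 | High |
| 605 | \$20,132 | 3.508 | High |
| 604 | \$20,375 | 5.279 | High |
| 728 | \$21,250 | 1.433 | High |
| 609 | \$21,434 | 1.932 | High |
| 712 | \$21,795 | 3.717 | High |
| 710 | \$21,911 | 5.190 | High |
| 721 | \$22,190 | 1.406 | High |
| 731 | \$22,231 | 7.452 | High |
| 713 | \$22,507 | 6.127 | High |
| 713 | \$22,904 | 3.720 | High |
| 719 | \$23,545 | 3.288 | High |
| 724 | \$24,750 | 5.650 | High |
| 601 | \$24,909 | 7.220 | High |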

Step #3: Run two separate regressions, one for the low values, one for the high

We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren’t concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom.

Step #4: Divide the ESS of the higher value regression by the ESS of the lower value regression, and compare quotient to the F-table.

The higher value regression produced an ESS of 61.489 and the lower value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now, we need to go to the F-table and check the critical F-value for a 5% significance level (95% confidence) with 21 degrees of freedom in both the numerator and the denominator, which is a value of about 2.10. Since our quotient of 11.851 is greater than the critical F-value, we can conclude there is strong evidence of heteroscedasticity in the model.
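The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`fit_line`, `error_sum_of_squares`, `goldfeld_quandt`) are made up for this sketch, and the toy data are invented rather than taken from the census table.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def error_sum_of_squares(xs, ys):
    """ESS: sum of squared residuals around the fitted line."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def goldfeld_quandt(xs, ys, n_omit):
    """Return (ESS_high / ESS_low, degrees of freedom per regression)."""
    pairs = sorted(zip(xs, ys))                    # Step 1: sort by X
    n_group = (len(pairs) - n_omit) // 2           # Step 2: drop the middle
    low, high = pairs[:n_group], pairs[-n_group:]
    ess_low = error_sum_of_squares(*zip(*low))     # Step 3: two regressions
    ess_high = error_sum_of_squares(*zip(*high))
    return ess_high / ess_low, n_group - 2         # Step 4: the quotient


# Invented toy data: the scatter around the line doubles for the high-X group.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_y = [2, 1, 2, 5, 5, 6, 9, 6, 7, 12]
f_ratio, df = goldfeld_quandt(toy_x, toy_y, n_omit=2)
print(round(f_ratio, 3), df)
```

With the 59 tracts and `n_omit=13`, the same function would leave 23 observations per group and 21 degrees of freedom each, as in the example; the quotient is then compared with the critical F-value by hand.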

The Breusch-Pagan Test

The Breusch-Pagan test is also pretty simple, but it’s a very powerful test, in that it can be used to detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ²) to determine the presence of heteroscedasticity, and works best with large sample sets. There are five steps to the Breusch-Pagan test:

Step #1: Run the regular regression model and collect the residuals

Step #2: Estimate the variance of the regression residuals

To do this, we square each residual, sum the squares, and then divide by the number of observations. Our formula is:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n}$$

Our residuals and their squares are as follows:

Values in parentheses are negative.

| Observation | Predicted Ownratio | Residuals | Residuals Squared |
|---|---|---|---|
| 1 | 5.165 | 2.055 | 4.222 |
| 2 | 1.300 | (0.206) | 0.043 |
| 3 | 3.504 | 0.083 | 0.007 |
| 4 | 3.821 | 1.458 | 2.126 |
| 5 | 3.749 | (0.241) | 0.058 |
| 6 | 2.331 | (1.542) | 2.378 |
| 7 | 2.174 | (0.337) | 0.113 |
| 8 | 3.358 | 1.792 | 3.209 |
| 9 | 3.466 | (1.265) | 1.601 |
| 10 | 4.135 | (2.203) | 4.852 |
| 11 | 2.249 | (1.330) | 1.769 |
| 12 | 2.415 | (0.517) | 0.267 |
| 13 | 1.428 | 0.156 | 0.024 |
| 14 | 0.763 | 0.138 | 0.019 |
| 15 | (0.712) | 0.840 | 0.705 |
| 16 | 0.184 | (0.125) | 0.016 |
| 17 | (0.917) | 0.939 | 0.881 |
| 18 | (0.617) | 0.789 | 0.622 |
| 19 | 0.608 | 0.308 | 0.095 |
| 20 | 1.662 | (0.397) | 0.158 |
| 21 | 1.230 | (0.211) | 0.045 |
| 22 | 1.548 | 0.150 | 0.022 |
| 23 | 1.586 | 0.602 | 0.362 |
| 24 | 2.287 | 0.563 | 0.317 |
| 25 | 2.503 | 0.546 | 0.298 |
| 26 | 1.983 | 0.324 | 0.105 |
| 27 | 1.410 | (0.537) | 0.288 |
| 28 | 0.860 | (0.450) | 0.203 |
| 29 | 1.911 | (0.760) | 0.577 |
| 30 | 1.990 | (0.716) | 0.513 |
| 31 | 2.459 | (0.708) | 0.502 |
| 32 | 3.388 | 1.686 | 2.841 |
| 33 | 2.948 | 1.324 | 1.754 |
| 34 | 2.833 | 1.035 | 1.071 |
| 35 | 2.188 | (0.179) | 0.032 |
| 36 | 3.527 | (1.271) | 1.615 |
| 37 | 3.191 | (0.720) | 0.518 |
| 38 | 1.993 | 1.026 | 1.052 |
| 39 | 2.469 | (0.315) | 0.099 |
| 40 | 4.276 | 0.914 | 0.835 |
| 41 | 3.497 | 1.082 | 1.171 |
| 42 | 4.242 | (0.525) | 0.275 |
| 43 | 4.571 | (0.851) | 0.724 |
| 44 | 4.453 | 1.674 | 2.802 |
| 45 | 3.589 | 0.879 | 0.773 |
| 46 | 2.790 | (0.680) | 0.463 |
| 47 | 1.580 | (0.798) | 0.637 |
| 48 | 0.699 | (0.440) | 0.194 |
| 49 | 2.800 | (1.567) | 2.454 |
| 50 | 4.761 | (1.473) | 2.169 |
| 51 | 0.506 | (0.271) | 0.074 |
| 52 | 4.359 | (2.953) | 8.720 |
| 53 | 3.605 | (1.399) | 1.956 |
| 54 | 5.118 | 0.532 | 0.283 |
| 55 | 3.158 | 1.920 | 3.686 |
| 56 | 4.080 | (2.647) | 7.008 |
| 57 | 4.371 | 3.081 | 9.492 |
| 58 | 3.647 | 2.091 | 4.373 |
| 59 | 1.714 | (0.350) | 0.122 |

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.

Step #3: Compute the square of the standardized residuals

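Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, so that we get our square of standardized residuals, sᵢ². (The textbook form of the test instead divides each squared residual by the variance, eᵢ²/1.417. Dividing before squaring, as we do here, makes every value in the last column smaller by a further factor of 1.417 and the RSS in Step #4 smaller by a factor of 1.417² ≈ 2.01, so the test below is, if anything, conservative.)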

| Obs. | Predicted Ownratio | Residuals | Standardized Residuals | Square of Standardized Residuals |
|---|---|---|---|---|
| 1 | 5.165 | 2.055 | 1.450 | 2.103 |
| 2 | 1.300 | (0.206) | (0.146) | 0.021 |
| 3 | 3.504 | 0.083 | 0.058 | 0.003 |
| 4 | 3.821 | 1.458 | 1.029 | 1.059 |
| 5 | 3.749 | (0.241) | (0.170) | 0.029 |
| 6 | 2.331 | (1.542) | (1.088) | 1.185 |
| 7 | 2.174 | (0.337) | (0.238) | 0.057 |
| 8 | 3.358 | 1.792 | 1.264 | 1.599 |
| 9 | 3.466 | (1.265) | (0.893) | 0.797 |
| 10 | 4.135 | (2.203) | (1.555) | 2.417 |
| 11 | 2.249 | (1.330) | (0.939) | 0.881 |
| 12 | 2.415 | (0.517) | (0.365) | 0.133 |
| 13 | 1.428 | 0.156 | 0.110 | 0.012 |
| 14 | 0.763 | 0.138 | 0.097 | 0.009 |
| 15 | (0.712) | 0.840 | 0.593 | 0.351 |
| 16 | 0.184 | (0.125) | (0.088) | 0.008 |
| 17 | (0.917) | 0.939 | 0.662 | 0.439 |
| 18 | (0.617) | 0.789 | 0.557 | 0.310 |
| 19 | 0.608 | 0.308 | 0.217 | 0.047 |
| 20 | 1.662 | (0.397) | (0.280) | 0.079 |
| 21 | 1.230 | (0.211) | (0.149) | 0.022 |
| 22 | 1.548 | 0.150 | 0.106 | 0.011 |
| 23 | 1.586 | 0.602 | 0.425 | 0.180 |
| 24 | 2.287 | 0.563 | 0.397 | 0.158 |
| 25 | 2.503 | 0.546 | 0.385 | 0.148 |
| 26 | 1.983 | 0.324 | 0.229 | 0.052 |
| 27 | 1.410 | (0.537) | (0.379) | 0.143 |
| 28 | 0.860 | (0.450) | (0.318) | 0.101 |
| 29 | 1.911 | (0.760) | (0.536) | 0.288 |
| 30 | 1.990 | (0.716) | (0.505) | 0.255 |
| 31 | 2.459 | (0.708) | (0.500) | 0.250 |
| 32 | 3.388 | 1.686 | 1.190 | 1.415 |
| 33 | 2.948 | 1.324 | 0.935 | 0.874 |
| 34 | 2.833 | 1.035 | 0.730 | 0.534 |
| 35 | 2.188 | (0.179) | (0.127) | 0.016 |
| 36 | 3.527 | (1.271) | (0.897) | 0.805 |
| 37 | 3.191 | (0.720) | (0.508) | 0.258 |
| 38 | 1.993 | 1.026 | 0.724 | 0.524 |
| 39 | 2.469 | (0.315) | (0.222) | 0.049 |
| 40 | 4.276 | 0.914 | 0.645 | 0.416 |
| 41 | 3.497 | 1.082 | 0.764 | 0.584 |
| 42 | 4.242 | (0.525) | (0.370) | 0.137 |
| 43 | 4.571 | (0.851) | (0.600) | 0.361 |
| 44 | 4.453 | 1.674 | 1.182 | 1.396 |
| 45 | 3.589 | 0.879 | 0.621 | 0.385 |
| 46 | 2.790 | (0.680) | (0.480) | 0.231 |
| 47 | 1.580 | (0.798) | (0.563) | 0.317 |
| 48 | 0.699 | (0.440) | (0.311) | 0.097 |
| 49 | 2.800 | (1.567) | (1.106) | 1.223 |
| 50 | 4.761 | (1.473) | (1.040) | 1.081 |
| 51 | 0.506 | (0.271) | (0.192) | 0.037 |
| 52 | 4.359 | (2.953) | (2.084) | 4.344 |
| 53 | 3.605 | (1.399) | (0.987) | 0.974 |
| 54 | 5.118 | 0.532 | 0.375 | 0.141 |
| 55 | 3.158 | 1.920 | 1.355 | 1.836 |
| 56 | 4.080 | (2.647) | (1.868) | 3.491 |
| 57 | 4.371 | 3.081 | 2.175 | 4.728 |
| 58 | 3.647 | 2.091 | 1.476 | 2.179 |
| 59 | 1.714 | (0.350) | (0.247) | 0.061 |

Step #4: Run another regression with all your independent variables, using the square of the standardized residuals as the dependent variable

In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.

Step #5: Divide the RSS by 2 and compare with the χ² table’s critical value for the appropriate degrees of freedom

Dividing the RSS by 2, we get 7.747. We look up the critical χ² value for one degree of freedom and, for a 5% significance level, the table gives us 3.84. Since our χ² value exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
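For readers who would rather script the five steps, here is a minimal sketch for the one-variable case. It is an assumption-laden illustration: `fit_line` and `breusch_pagan` are names invented here, the toy data are made up, and Step 3 uses the textbook scaling (each squared residual divided by the estimated variance), so its output will not match the 7.747 computed above.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def breusch_pagan(xs, ys):
    """Return RSS / 2 from the auxiliary regression (one regressor, 1 df)."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # Step 1
    variance = sum(e * e for e in residuals) / n            # Step 2
    scaled = [e * e / variance for e in residuals]          # Step 3 (textbook scaling)
    a_aux, b_aux = fit_line(xs, scaled)                     # Step 4
    mean_scaled = sum(scaled) / n
    rss = sum((a_aux + b_aux * x - mean_scaled) ** 2 for x in xs)
    return rss / 2                                          # Step 5


CHI2_CRITICAL_1DF_5PCT = 3.84  # the table value quoted in the post

# Invented toy data: small scatter for low X, large scatter for high X.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [1.1, 1.9, 2.9, 4.1, 7.0, 4.0, 5.0, 10.0]
statistic = breusch_pagan(toy_x, toy_y)
print(round(statistic, 3), statistic > CHI2_CRITICAL_1DF_5PCT)
```

With more than one independent variable, the auxiliary regression in Step 4 would need a multiple-regression routine, and the degrees of freedom would equal the number of independent variables.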

The Park Test

Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model’s independent variables. The steps involved are as follows:

Step #1: Run your original regression model and collect the residuals

Done.

Step #2: Square the regression residuals and compute the logs of the squared residuals and the values of the suspected independent variable.

We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:

| Tract | Residual Squared | Ln Residual Squared | Ln Income |
|---|---|---|---|
| 601 | 4.222 | 1.440 | 10.123 |
| 602 | 0.043 | (3.157) | 9.382 |
| 603 | 0.007 | (4.987) | 9.868 |
| 604 | 2.126 | 0.754 | 9.922 |
| 605 | 0.058 | (2.848) | 9.910 |
| 606 | 2.378 | 0.866 | 9.639 |
| 607 | 0.113 | (2.176) | 9.604 |
| 608 | 3.209 | 1.166 | 9.842 |
| 609 | 1.601 | 0.470 | 9.862 |
| 609 | 4.852 | 1.579 | 9.973 |
| 610 | 1.769 | 0.571 | 9.621 |
| 611 | 0.267 | (1.320) | 9.657 |
| 612 | 0.024 | (3.720) | 9.418 |
| 613 | 0.019 | (3.960) | 9.217 |
| 614 | 0.705 | (0.349) | 8.535 |
| 615 | 0.016 | (4.162) | 9.001 |
| 616 | 0.881 | (0.127) | 8.389 |
| 616 | 0.622 | (0.475) | 8.596 |
| 617 | 0.095 | (2.356) | 9.163 |
| 618 | 0.158 | (1.847) | 9.480 |
| 619 | 0.045 | (3.112) | 9.362 |
| 620 | 0.022 | (3.796) | 9.450 |
| 621 | 0.362 | (1.015) | 9.460 |
| 623 | 0.317 | (1.148) | 9.629 |
| 624 | 0.298 | (1.211) | 9.676 |
| 625 | 0.105 | (2.255) | 9.559 |
| 626 | 0.288 | (1.245) | 9.413 |
| 627 | 0.203 | (1.596) | 9.249 |
| 628 | 0.577 | (0.549) | 9.542 |
| 629 | 0.513 | (0.668) | 9.561 |
| 630 | 0.502 | (0.689) | 9.667 |
| 631 | 2.841 | 1.044 | 9.848 |
| 632 | 1.754 | 0.562 | 9.766 |
| 633 | 1.071 | 0.069 | 9.744 |
| 634 | 0.032 | (3.437) | 9.607 |
| 635 | 1.615 | 0.479 | 9.872 |
| 701 | 0.518 | (0.658) | 9.812 |
| 705 | 1.052 | 0.051 | 9.562 |
| 706 | 0.099 | (2.309) | 9.669 |
| 710 | 0.835 | (0.180) | 9.995 |
| 711 | 1.171 | 0.158 | 9.867 |
| 712 | 0.275 | (1.289) | 9.989 |
| 713 | 0.724 | (0.323) | 10.039 |
| 713 | 2.802 | 1.030 | 10.022 |
| 714 | 0.773 | (0.257) | 9.883 |
| 714 | 0.463 | (0.770) | 9.735 |
| 718 | 0.637 | (0.452) | 9.459 |
| 718 | 0.194 | (1.640) | 9.195 |
| 719 | 2.454 | 0.898 | 9.737 |
| 719 | 2.169 | 0.774 | 10.067 |
| 720 | 0.074 | (2.608) | 9.127 |
| 721 | 8.720 | 2.166 | 10.007 |
| 721 | 1.956 | 0.671 | 9.886 |
| 724 | 0.283 | (1.263) | 10.117 |
| 726 | 3.686 | 1.305 | 9.806 |
| 728 | 7.008 | 1.947 | 9.964 |
| 731 | 9.492 | 2.250 | 10.009 |
| 731 | 4.373 | 1.476 | 9.893 |
| 735 | 0.122 | (2.102) | 9.493 |

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable

That results in the following regression equation:

Ln(e²) = 1.957(LnIncome) – 19.592

Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.

The parameter estimate for LnIncome is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
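The Park test steps can be sketched as follows. Treat it as an illustration only: `park_test` is a name invented for this sketch, it takes residuals you have already collected in Step #1, and the toy inputs are made up so that the arithmetic is easy to follow.

```python
import math


def park_test(xs, residuals):
    """Regress ln(e^2) on ln(x); return (slope, t_ratio of the slope)."""
    log_x = [math.log(x) for x in xs]                  # Step 2: logs of X
    log_e2 = [math.log(e * e) for e in residuals]      # Step 2: logs of e^2
    n = len(xs)
    x_bar, y_bar = sum(log_x) / n, sum(log_e2) / n
    sxx = sum((x - x_bar) ** 2 for x in log_x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_x, log_e2))
    slope = sxy / sxx                                  # Step 3: the regression
    intercept = y_bar - slope * x_bar
    ess = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(log_x, log_e2))
    std_err = math.sqrt(ess / (n - 2) / sxx)           # standard error of slope
    return slope, slope / std_err                      # Step 4: inspect t-ratio


# Invented toy inputs: ln(x) = 0, 1, 2, 3 and ln(e^2) = 1, 1, 3, 7.
toy_x = [math.exp(k) for k in range(4)]
toy_residuals = [math.exp(v / 2) for v in (1, 1, 3, 7)]
slope, t_ratio = park_test(toy_x, toy_residuals)
print(round(slope, 3), round(t_ratio, 3))
```

The slope is the useful by-product. In the result above, a slope of 1.957 suggests the error variance grows roughly with the square of INCOME, which is the kind of information a correction can build on.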

Next Forecast Friday Topic: Correcting Heteroscedasticity

Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week’s Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity: weighted least squares (WLS) regression, and show you how to apply it. Next week’s Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.


### Forecast Friday Topic: Correcting Autocorrelation

August 5, 2010

(Sixteenth in a series)

Last week, we discussed how to detect autocorrelation – the violation of the regression assumption that the error terms are not correlated with one another – in your forecasting model. Models exhibiting autocorrelation have parameter estimates that are inefficient, and R² values and t-ratios that seem overly inflated. As a result, your model generates forecasts that are too good to be true and has a tendency to miss turning points in your time series. In last week’s Forecast Friday post, we showed you how to diagnose autocorrelation: examining the model’s parameter estimates, visually inspecting the data, and computing the Durbin-Watson statistic. Today, we’re going to discuss how to correct it.

Revisiting our Data Set

Recall our data set: average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987, as reported in the Survey of Current Business (September issues from 1986 and 1987), and reprinted in Data Analysis Using Microsoft ® Excel, by Michael R. Middleton, page 219:

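| Month | t | Wage |
|---|---|---|
| Jan-86 | 1 | 5.82 |
| Feb-86 | 2 | 5.79 |
| Mar-86 | 3 | 5.80 |
| Apr-86 | 4 | 5.81 |
| May-86 | 5 | 5.78 |
| Jun-86 | 6 | 5.79 |
| Jul-86 | 7 | 5.79 |
| Aug-86 | 8 | 5.83 |
| Sep-86 | 9 | 5.91 |
| Oct-86 | 10 | 5.87 |
| Nov-86 | 11 | 5.87 |
| Dec-86 | 12 | 5.90 |
| Jan-87 | 13 | 5.94 |
| Feb-87 | 14 | 5.93 |
| Mar-87 | 15 | 5.93 |
| Apr-87 | 16 | 5.94 |
| May-87 | 17 | 5.89 |
| Jun-87 | 18 | 5.91 |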

We generated the following regression model:

Ŷ = 5.7709 + 0.0095t

Our model had an R² of .728, and t-ratios of about 368 for the intercept term and 6.55 for the parameter estimate, t. The Durbin-Watson statistic was 1.05, indicating positive autocorrelation. How do we correct for autocorrelation?
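Before turning to the fix, here is a short sketch of how the trend model and its Durbin-Watson statistic can be recomputed from the wage table. The helper names (`fit_line`, `durbin_watson`) are invented for this illustration; the statistic is the usual sum of squared changes in successive residuals divided by the sum of squared residuals.

```python
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
periods = list(range(1, len(wages) + 1))   # t = 1 .. 18


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    changes = sum((residuals[i] - residuals[i - 1]) ** 2
                  for i in range(1, len(residuals)))
    return changes / sum(e * e for e in residuals)


intercept, slope = fit_line(periods, wages)
residuals = [w - (intercept + slope * t) for t, w in zip(periods, wages)]
dw = durbin_watson(residuals)
print(round(intercept, 4), round(slope, 4), round(dw, 2))
# prints: 5.7709 0.0095 1.05
```

A value near 2 indicates little first-order autocorrelation; values well below 2, like this one, point to positive autocorrelation.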

Lagging the Dependent Variable

One of the most common remedies for autocorrelation is to lag the dependent variable one or more periods and then make the lagged dependent variable the independent variable. So, in our data set above, you would take the first value of the dependent variable, \$5.82, and make it the independent variable for period 2, with \$5.79 being the dependent variable; in like manner, \$5.79 will also become the independent variable for the next period, whose dependent variable has a value of \$5.80, and so on. Since the error terms from one period to another exhibit correlation, by using the previous value of the dependent variable to predict the next one, you reduce that correlation of errors.

You can lag for as many periods as you need to; however, note that you lose the first observation when you lag one period (unless you know the value for the period just before the start of the data set, you have nothing with which to predict the first observation). You’ll lose two observations if you lag two periods, and so on. If you have a very small data set, the loss of degrees of freedom can lead to Type II error – failing to identify a parameter estimate as significant when in fact it is. So, you must be careful here.

In this case, by lagging our data by one period, we have the following data set:

| Month | Wage | Lag1 Wage |
|---|---|---|
| Feb-86 | \$5.79 | \$5.82 |
| Mar-86 | \$5.80 | \$5.79 |
| Apr-86 | \$5.81 | \$5.80 |
| May-86 | \$5.78 | \$5.81 |
| Jun-86 | \$5.79 | \$5.78 |
| Jul-86 | \$5.79 | \$5.79 |
| Aug-86 | \$5.83 | \$5.79 |
| Sep-86 | \$5.91 | \$5.83 |
| Oct-86 | \$5.87 | \$5.91 |
| Nov-86 | \$5.87 | \$5.87 |
| Dec-86 | \$5.90 | \$5.87 |
| Jan-87 | \$5.94 | \$5.90 |
| Feb-87 | \$5.93 | \$5.94 |
| Mar-87 | \$5.93 | \$5.93 |
| Apr-87 | \$5.94 | \$5.93 |
| May-87 | \$5.89 | \$5.94 |
| Jun-87 | \$5.91 | \$5.89 |

So, we have created a new independent variable, Lag1_Wage. Notice that we are not going to regress time period t as an independent variable. This doesn’t mean that we should or shouldn’t; in this case, we’re only trying to demonstrate the effect of the lagging.

Rerunning the Regression

Now we do our regression analysis. We come up with the following equation:

Ŷ = 0.8253 + 0.8600*Lag1_Wage

Apparently, from this model, each \$1 change in hourly wage from the previous month is associated with an average \$0.86 change in hourly wages for the current month. The R² for this model was virtually unchanged, 0.730. However, the Durbin-Watson statistic is now 2.01 – just about the total eradication of autocorrelation. Unfortunately, the intercept has a t-ratio of 1.04, indicating it is not significant. The t-ratio for the Lag1_Wage parameter estimate is about 6.37, not much different from the t-ratio for t in our previous model. However, we did get rid of the autocorrelation.
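The lagged regression can be reproduced with a few lines of plain Python. As before, this is a sketch under assumptions: the helper names are invented here, and the lag is built simply by pairing each month's wage with the month before it.

```python
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]

lag1_wage = wages[:-1]   # independent variable: Jan-86 .. May-87
wage = wages[1:]         # dependent variable:   Feb-86 .. Jun-87


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    changes = sum((residuals[i] - residuals[i - 1]) ** 2
                  for i in range(1, len(residuals)))
    return changes / sum(e * e for e in residuals)


intercept, slope = fit_line(lag1_wage, wage)
residuals = [y - (intercept + slope * x) for x, y in zip(lag1_wage, wage)]
dw = durbin_watson(residuals)
print(len(wage), round(intercept, 4), round(slope, 4), round(dw, 2))
# prints: 17 0.8253 0.86 2.01
```

Note that only 17 of the 18 months survive the lag, which is the lost observation discussed earlier.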

The statistically insignificant intercept term resulting from this lagging may be a Type II error brought on by the loss of a degree of freedom in a small sample. Perhaps if we had several more months of data, we might have had a significant intercept estimate.

Other Approaches to Correcting Autocorrelation

There are other approaches to correcting autocorrelation. One other important way might be to identify important independent variables that have been omitted from the model. Perhaps if we had data on the average years of work experience of the textile and apparel labor force from month to month, that might have increased our R², and reduced correlations in the error term. Another thing we could do is difference the data. Differencing works like lagging, only we subtract the value of the dependent and independent variables of the first observation from their respective values in the second observation; then we subtract those of the second observation’s original values from those of the third, and so on. Then we run a regression on the differences in observations. The problem here is that again, your data set is reduced by one observation and your transformed model will not have an intercept term, which can cause issues in some studies.
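Differencing is easy to sketch with the wage data. The assumptions here are mine: the time index t is used as the independent variable, `first_differences` is a made-up helper, and the regression on the differences is fitted through the origin, since the transformed model has no intercept term.

```python
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
periods = list(range(1, len(wages) + 1))


def first_differences(values):
    """Each observation minus the one before it (one observation is lost)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


d_wage = first_differences(wages)     # dependent variable, differenced
d_t = first_differences(periods)      # independent variable, differenced

# Regression through the origin: slope = sum(dx * dy) / sum(dx^2)
slope = (sum(dx * dy for dx, dy in zip(d_t, d_wage))
         / sum(dx * dx for dx in d_t))
print(len(d_wage), round(slope, 4))
```

Because every difference in t equals 1, the slope here is just the average month-to-month change in the wage. With a genuine independent variable the differences would vary, and the same formula would apply.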

Other approaches to correcting autocorrelation include quasi-differencing, the Cochrane-Orcutt Procedure, the Hildreth-Lu Procedure, and the Durbin Two-Step Method. These methods are iterative, require a lot of tedious effort and are beyond the scope of our post. But many college-level forecasting textbooks have sections on these procedures if you’re interested in further reading on them.

Next Forecast Friday Topic: Detecting Heteroscedasticity

Next week, we’ll discuss the last of the regression violations, heteroscedasticity, which is the violation of the assumption that error terms have a constant variance. We will discuss why heteroscedasticity exists and how to diagnose it. The week after that, we’ll discuss remedying heteroscedasticity. Once we have completed our discussions on the regression violations, we will spend a couple of weeks discussing regression modeling techniques like transforming independent variables, using categorical variables, adjusting for seasonality, and other regression techniques. These topics will be far less theoretical and more practical in terms of forecasting.

### Forecast Friday Topic: Prelude to Multiple Regression Analysis – Regression Assumptions

June 10, 2010

(Eighth in a series)

In last week’s Forecast Friday post, we continued our discussion of simple linear regression analysis, discussing how to check both the slope and intercept coefficients for significance. We then discussed how to create a prediction interval for our forecasts. I had intended this week’s Forecast Friday post to delve straight into multiple regression analysis, but have decided instead to spend some time talking about the assumptions that go into building a regression model.  These assumptions apply to both simple and multiple regression analysis, but their importance is especially noticeable with multiple regression, and I feel it is best to make you aware of them, so that when we discuss multiple regression both as a time series and as a causal/econometric forecasting tool, you’ll know how to detect and correct regression models that violate these assumptions. We will formally begin our discussion of multiple regression methods next week.

Five Key Assumptions for Ordinary Least Squares (OLS) Regression

When we develop our parameter estimates for our regression model, we want to make sure that all of our estimators have the smallest variance. Recall that when you were computing the value of your estimate, b, for the parameter β, in the equation below:

$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

You were subtracting your independent variable’s average from each of its actual values, and doing likewise for the dependent variable. You then multiplied those two quantities together (for each observation) and summed them up to get the numerator of that calculation. To get the denominator, you again subtracted the independent variable’s mean from each of its actual values and then squared them. Then you summed those up. The calculation of the denominator is the focal point here: the value you get for your estimate of β is the estimate that minimizes the squared error for your model. Hence, the term, least squares. If you were to take the denominator of the equation above and divide it by your sample size less one (n-1), you would get the variance of your independent variable, X. The variance you want to minimize, however, is that of the estimate b itself, which under the assumptions below equals the error variance divided by that same denominator; when it is as small as possible, your estimate of β is efficient. When your parameter estimates are efficient, you can make stronger statistical statements about them.
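The calculation just described can be written out step by step. This is a small sketch with invented names (`slope_and_x_variance`) and made-up numbers, meant only to mirror the numerator, the denominator, and the variance of X mentioned above.

```python
def slope_and_x_variance(xs, ys):
    """Return (b, sample variance of X) from the least-squares formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of (X - mean of X) * (Y - mean of Y)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of X from its mean
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator, denominator / (n - 1)


# Made-up observations for illustration.
toy_x = [1, 2, 3, 4]
toy_y = [2, 1, 2, 5]
b, x_variance = slope_and_x_variance(toy_x, toy_y)
print(round(b, 3), round(x_variance, 3))
```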

We also want to be sure that our estimators are free of bias. That is, we want to be sure that our sample estimate, b, is on average, equal to our true population parameter, β. That is, if we calculated several estimates of β, the average of our b’s should equal β.

Essentially, there are five assumptions that must be made to ensure our estimators are unbiased and efficient:

Assumption #1: The regression equation correctly specifies the true model.

In order to correctly specify the true model, the relationship between the dependent and independent variable must be linear. Also, we must neither exclude relevant independent variables from nor include irrelevant independent variables in our regression equation. If any of these conditions are not met – that is, Assumption #1 is violated – then our parameter estimates will exhibit bias, particularly specification bias.

In addition, our independent and dependent variables must be measured accurately. For example, if we are trying to estimate salary based on years of schooling, we want to make sure our model is measuring years of schooling as actual years of schooling, and not desired years of schooling.

Assumption #2: The independent variables are fixed numbers and not correlated with error terms.

I warned you at the start of our discussion of linear regression that the error terms were going to be important. Let’s start with the notion of fixed numbers. When you are running a regression analysis, the values of each independent variable should not change every time you test the equation. That is, the values of your independent variables are known and controlled by you. In addition, the independent variables should not be correlated with the error term. If an independent variable is correlated with the error term, then it is very possible a relevant independent variable was excluded from the equation. If Assumption #2 is violated, then your parameter estimates will be biased.

Assumption #3: The error terms, ε, have a mean, or expected value, of zero.

As you noticed in the past blog post, when we developed our regression equation for Sue Stone’s monthly sales, we went back in and plugged each observation’s independent variable into our model and generated estimates of sales for that month. We then subtracted the estimated sales from the actual. Some of our estimates were higher than the actual values, some were lower. Summed up, these errors should equal zero. If they don’t, they will result in a biased estimate of the intercept, a (which we use to estimate α). This assumption is not of serious concern, however, since the intercept is often of secondary importance to the slope estimate. We also assume that the error terms are normally distributed.

Assumption #4: The error terms have a constant variance.

The variance of the error term for all values of Xi should be constant, that is, the error terms should be homoscedastic. Visually, if you were to plot the line generated by your regression equation, and then plot the error terms for each observation as points above or below the regression line, the points should cluster around the line in a band of equal width above and below the regression line. If, instead, the points began to move further and further away from the regression line as the value of X increased, then the error terms are heteroscedastic, and the constant variance assumption is violated. Heteroscedasticity does not bias parameter estimates, but makes them inefficient, or untrustworthy.

Why does heteroscedasticity occur? Sometimes, a data set has some observations whose values for the independent variable are vastly different from those of the other observations. These cases are known as outliers. For example, if you have five observations, and their X values were as follows:

{ 5, 6, 6, 7, 20}

The fifth observation would be the outlier, since its X value of 20 is so different from that of the four previous observations. Regression equations place excessive weight on extreme values. Let’s assume that you were trying to construct a model to predict new car purchases based on income. You choose “household income” as your independent variable and “new car spending” as the dependent variable. You survey 10 people who bought a new car, and you record both their income and the amount they paid for the car. You sort each respondent in order by income and look at their spending, as depicted in the table below:

| Respondent | Annual Income | New Car Purchase Price |
|---|---|---|
| 1 | \$30,000 | \$25,900 |
| 2 | \$32,500 | \$27,500 |
| 3 | \$35,000 | \$26,000 |
| 4 | \$37,500 | \$29,000 |
| 5 | \$40,000 | \$32,000 |
| 6 | \$42,500 | \$30,500 |
| 7 | \$45,000 | \$34,000 |
| 8 | \$47,500 | \$26,500 |
| 9 | \$50,000 | \$38,000 |
| 10 | \$52,500 | \$40,000 |

Do you notice the pattern that as income increases, the new car purchase price tends to move upward? For the most part, it does. But, does it go up consistently? No. Notice how respondent #3 spent less for a car than the two respondents with lower incomes; respondent #8 spent much less for a car than lower-income respondents 4-7. Respondent #8 is an outlier. This happens because lower-income households are limited in their options for new cars, while higher-income households have more options. A low-income respondent may be limited to buying a Ford Focus or a Honda Civic; but a higher-income respondent may be able to buy a Lexus or BMW, yet still choose to buy the Civic or the Focus. Heteroscedasticity is very likely to occur with this data set. In case you haven’t guessed, heteroscedasticity is more likely to occur with cross-sectional data, rather than with time series data.
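To see the pattern in numbers, one can fit a line to the ten respondents and look at the errors. This sketch uses only the table above; the helper name `fit_line` and the split into a lower-income and a higher-income half are choices made here for illustration.

```python
incomes = [30000, 32500, 35000, 37500, 40000,
           42500, 45000, 47500, 50000, 52500]
prices = [25900, 27500, 26000, 29000, 32000,
          30500, 34000, 26500, 38000, 40000]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


intercept, slope = fit_line(incomes, prices)
residuals = [p - (intercept + slope * x) for x, p in zip(incomes, prices)]

# Respondent (numbered from 1) with the largest error in absolute terms
worst_respondent = 1 + max(range(len(residuals)),
                           key=lambda i: abs(residuals[i]))

# Squared errors for the five lower-income and five higher-income respondents
ess_low = sum(e * e for e in residuals[:5])
ess_high = sum(e * e for e in residuals[5:])
print(worst_respondent, ess_high > ess_low)
```

The largest error belongs to respondent #8, and the squared errors are larger in the higher-income half, which is the widening scatter described above.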

Assumption #5: The error terms are not correlated with each other.

Knowing the error term for any of our observations should not allow us to predict the error term of any other observation; the errors must be truly random. If they aren’t, autocorrelation results and the parameter estimates are inefficient, though unbiased. Autocorrelation is much more common with time series data than with cross-sectional data, and occurs because past occurrences can influence future ones. A good example of this is when I was building a regression model to help a college forecast enrollment. I started by building a simple time series regression model, then examined the errors and detected autocorrelation. How did it happen? Because most students who are enrolled in the Fall term are also likely to be enrolled again in the consecutive Spring term. Hence, I needed to correct for that autocorrelation. Similarly, while a company’s advertising expenditures in April may impact its sales in April, they are also likely to have some impact on its sales in May. This too can cause autocorrelation.

When these assumptions are kept, your regression equation is likely to contain parameter estimates that are the “best linear unbiased estimators,” or BLUE. Keep these in mind as we go through our upcoming discussions on multiple regression.

Next Forecast Friday Topic: Regression with Two or More Independent Variables

Next week, we will plunge into our discussion of multiple regression. I will give you an example of how multiple variables are used to forecast a single dependent variable, and how to check for validity. As we go through the next couple of discussions, I will show you how to analyze the error terms to find violations of the regression assumptions. I will also show you how to determine the validity of the model, and to identify whether all independent variables within your model are relevant.