Forecast Friday Topic: Detecting Heteroscedasticity – Analytical Approaches

(Eighteenth in a series)

Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are not biased, but rather inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.

Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but can be a bit tedious to employ.

Reviewing Our Model

Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which pulled INCOME and OWNRATIO values from 59 census tracts in Pierce County, Washington, compiled during the 1980 Census. We had the following data:

Housing Data

Tract    Income     Ownratio
601      $24,909    7.220
602      $11,875    1.094
603      $19,308    3.587
604      $20,375    5.279
605      $20,132    3.508
606      $15,351    0.789
607      $14,821    1.837
608      $18,816    5.150
609      $19,179    2.201
609      $21,434    1.932
610      $15,075    0.919
611      $15,634    1.898
612      $12,307    1.584
613      $10,063    0.901
614      $5,090     0.128
615      $8,110     0.059
616      $4,399     0.022
616      $5,411     0.172
617      $9,541     0.916
618      $13,095    1.265
619      $11,638    1.019
620      $12,711    1.698
621      $12,839    2.188
623      $15,202    2.850
624      $15,932    3.049
625      $14,178    2.307
626      $12,244    0.873
627      $10,391    0.410
628      $13,934    1.151
629      $14,201    1.274
630      $15,784    1.751
631      $18,917    5.074
632      $17,431    4.272
633      $17,044    3.868
634      $14,870    2.009
635      $19,384    2.256
701      $18,250    2.471
705      $14,212    3.019
706      $15,817    2.154
710      $21,911    5.190
711      $19,282    4.579
712      $21,795    3.717
713      $22,904    3.720
713      $22,507    6.127
714      $19,592    4.468
714      $16,900    2.110
718      $12,818    0.782
718      $9,849     0.259
719      $16,931    1.233
719      $23,545    3.288
720      $9,198     0.235
721      $22,190    1.406
721      $19,646    2.206
724      $24,750    5.650
726      $18,140    5.078
728      $21,250    1.433
731      $22,231    7.452
731      $19,788    5.738
735      $13,269    1.364

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

And we got the following regression equation:

Ŷ = 0.000297*INCOME – 2.221

With an R² = 0.597, an F-ratio of 84.31, the t-ratios for INCOME (9.182) and the intercept (-4.094) both solidly significant, and the positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.
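If you would like to follow along, here is a minimal sketch of how this regression might be reproduced in Python. It is not the textbook's own computation; it assumes the 59 observations have been loaded into NumPy arrays named income and ownratio (names of our choosing):

```python
# A minimal sketch, assuming the 59 INCOME and OWNRATIO values have been
# loaded into the hypothetical arrays `income` and `ownratio`.
import numpy as np
from scipy import stats

income = np.array([24909, 11875, 19308])    # ...all 59 INCOME values
ownratio = np.array([7.220, 1.094, 3.587])  # ...all 59 OWNRATIO values

fit = stats.linregress(income, ownratio)
resid = ownratio - (fit.slope * income + fit.intercept)  # for the tests below

print(f"OWNRATIO-hat = {fit.slope:.6f}*INCOME + {fit.intercept:.3f}")
print(f"R-squared = {fit.rvalue**2:.3f}, t-ratio = {fit.slope/fit.stderr:.3f}")
```

The residuals collected here are the raw material for all three tests that follow.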

The Goldfeld-Quandt Test

The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test postulates that the variances associated with high values of the independent variable, X, are significantly different from those associated with low values. Essentially, you run separate regression analyses for the low values of X and the high values, and then compare the ratio of their error sums of squares to a critical F-value.

The Goldfeld-Quandt test has four steps:

Step #1: Sort the data

Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:

Housing Data

Tract    Income     Ownratio
616      $4,399     0.022
614      $5,090     0.128
616      $5,411     0.172
615      $8,110     0.059
720      $9,198     0.235
617      $9,541     0.916
718      $9,849     0.259
613      $10,063    0.901
627      $10,391    0.410
619      $11,638    1.019
602      $11,875    1.094
626      $12,244    0.873
612      $12,307    1.584
620      $12,711    1.698
718      $12,818    0.782
621      $12,839    2.188
618      $13,095    1.265
735      $13,269    1.364
628      $13,934    1.151
625      $14,178    2.307
629      $14,201    1.274
705      $14,212    3.019
607      $14,821    1.837
634      $14,870    2.009
610      $15,075    0.919
623      $15,202    2.850
606      $15,351    0.789
611      $15,634    1.898
630      $15,784    1.751
706      $15,817    2.154
624      $15,932    3.049
714      $16,900    2.110
719      $16,931    1.233
633      $17,044    3.868
632      $17,431    4.272
726      $18,140    5.078
701      $18,250    2.471
608      $18,816    5.150
631      $18,917    5.074
609      $19,179    2.201
711      $19,282    4.579
603      $19,308    3.587
635      $19,384    2.256
714      $19,592    4.468
721      $19,646    2.206
731      $19,788    5.738
605      $20,132    3.508
604      $20,375    5.279
728      $21,250    1.433
609      $21,434    1.932
712      $21,795    3.717
710      $21,911    5.190
721      $22,190    1.406
731      $22,231    7.452
713      $22,507    6.127
713      $22,904    3.720
719      $23,545    3.288
724      $24,750    5.650
601      $24,909    7.220

Step #2: Omit the middle observations

Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There’s no hard and fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit 13 observations (the middle group in the table below):

Housing Data

Tract    Income     Ownratio

Lowest 23 observations (kept):

616      $4,399     0.022
614      $5,090     0.128
616      $5,411     0.172
615      $8,110     0.059
720      $9,198     0.235
617      $9,541     0.916
718      $9,849     0.259
613      $10,063    0.901
627      $10,391    0.410
619      $11,638    1.019
602      $11,875    1.094
626      $12,244    0.873
612      $12,307    1.584
620      $12,711    1.698
718      $12,818    0.782
621      $12,839    2.188
618      $13,095    1.265
735      $13,269    1.364
628      $13,934    1.151
625      $14,178    2.307
629      $14,201    1.274
705      $14,212    3.019
607      $14,821    1.837

Middle 13 observations (omitted):

634      $14,870    2.009
610      $15,075    0.919
623      $15,202    2.850
606      $15,351    0.789
611      $15,634    1.898
630      $15,784    1.751
706      $15,817    2.154
624      $15,932    3.049
714      $16,900    2.110
719      $16,931    1.233
633      $17,044    3.868
632      $17,431    4.272
726      $18,140    5.078

Highest 23 observations (kept):

701      $18,250    2.471
608      $18,816    5.150
631      $18,917    5.074
609      $19,179    2.201
711      $19,282    4.579
603      $19,308    3.587
635      $19,384    2.256
714      $19,592    4.468
721      $19,646    2.206
731      $19,788    5.738
605      $20,132    3.508
604      $20,375    5.279
728      $21,250    1.433
609      $21,434    1.932
712      $21,795    3.717
710      $21,911    5.190
721      $22,190    1.406
731      $22,231    7.452
713      $22,507    6.127
713      $22,904    3.720
719      $23,545    3.288
724      $24,750    5.650
601      $24,909    7.220

Step #3: Run two separate regressions, one for the low values, one for the high

We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren’t concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom (23 observations minus the two estimated parameters).

Step #4: Divide the ESS of the higher value regression by the ESS of the lower value regression, and compare the quotient to the critical value from the F-table.

The higher value regression produced an ESS of 61.489 and the lower value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now, we need to go to the F-table and check the critical F-value for a 5% significance level with 21 degrees of freedom in both the numerator and the denominator, which is a value of 2.10. Since our quotient of 11.851 is greater than the critical F-value, we can conclude there is strong evidence of heteroscedasticity in the model.
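Here is how the four Goldfeld-Quandt steps might look in code. This is a sketch under the same assumption as before, namely that the full 59-row income and ownratio arrays from the earlier snippet are available:

```python
# A sketch of the four Goldfeld-Quandt steps, assuming the full 59-row
# hypothetical `income` and `ownratio` arrays from the earlier snippet.
import numpy as np
from scipy import stats

def ess(x, y):
    """Error sum of squares from a simple OLS regression of y on x."""
    slope, intercept, *_ = stats.linregress(x, y)
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

# Step 1: sort by the suspect independent variable.
order = np.argsort(income)
x, y = income[order], ownratio[order]

# Step 2: omit the middle 13 observations, keeping 23 low and 23 high.
n_low, n_omit = 23, 13
x_low, y_low = x[:n_low], y[:n_low]
x_high, y_high = x[n_low + n_omit:], y[n_low + n_omit:]

# Step 3: separate regressions for each group; Step 4: ratio of the ESSs.
gq = ess(x_high, y_high) / ess(x_low, y_low)   # ~11.851 in our example

# Each sub-regression has 23 - 2 = 21 degrees of freedom.
f_crit = stats.f.ppf(0.95, dfn=21, dfd=21)     # about 2.1
print(f"GQ statistic = {gq:.3f}, critical F(21, 21) = {f_crit:.2f}")
```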

The Breusch-Pagan Test

The Breusch-Pagan test is also pretty simple, but it’s a very powerful test, in that it can detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ²) to determine the presence of heteroscedasticity, and works best with large samples. There are five steps to the Breusch-Pagan test:

Step #1: Run the regular regression model and collect the residuals

We already did that.

Step #2: Estimate the variance of the regression residuals

To do this, we square each residual, sum the squares, and then divide by the number of observations, n. Our formula is:

σ̂² = (Σeᵢ²)/n

Our residuals and their squares are as follows:

Observation    Predicted Ownratio    Residuals    Residuals Squared
1              5.165                 2.055        4.222
2              1.300                 (0.206)      0.043
3              3.504                 0.083        0.007
4              3.821                 1.458        2.126
5              3.749                 (0.241)      0.058
6              2.331                 (1.542)      2.378
7              2.174                 (0.337)      0.113
8              3.358                 1.792        3.209
9              3.466                 (1.265)      1.601
10             4.135                 (2.203)      4.852
11             2.249                 (1.330)      1.769
12             2.415                 (0.517)      0.267
13             1.428                 0.156        0.024
14             0.763                 0.138        0.019
15             (0.712)               0.840        0.705
16             0.184                 (0.125)      0.016
17             (0.917)               0.939        0.881
18             (0.617)               0.789        0.622
19             0.608                 0.308        0.095
20             1.662                 (0.397)      0.158
21             1.230                 (0.211)      0.045
22             1.548                 0.150        0.022
23             1.586                 0.602        0.362
24             2.287                 0.563        0.317
25             2.503                 0.546        0.298
26             1.983                 0.324        0.105
27             1.410                 (0.537)      0.288
28             0.860                 (0.450)      0.203
29             1.911                 (0.760)      0.577
30             1.990                 (0.716)      0.513
31             2.459                 (0.708)      0.502
32             3.388                 1.686        2.841
33             2.948                 1.324        1.754
34             2.833                 1.035        1.071
35             2.188                 (0.179)      0.032
36             3.527                 (1.271)      1.615
37             3.191                 (0.720)      0.518
38             1.993                 1.026        1.052
39             2.469                 (0.315)      0.099
40             4.276                 0.914        0.835
41             3.497                 1.082        1.171
42             4.242                 (0.525)      0.275
43             4.571                 (0.851)      0.724
44             4.453                 1.674        2.802
45             3.589                 0.879        0.773
46             2.790                 (0.680)      0.463
47             1.580                 (0.798)      0.637
48             0.699                 (0.440)      0.194
49             2.800                 (1.567)      2.454
50             4.761                 (1.473)      2.169
51             0.506                 (0.271)      0.074
52             4.359                 (2.953)      8.720
53             3.605                 (1.399)      1.956
54             5.118                 0.532        0.283
55             3.158                 1.920        3.686
56             4.080                 (2.647)      7.008
57             4.371                 3.081        9.492
58             3.647                 2.091        4.373
59             1.714                 (0.350)      0.122

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.

Step #3: Compute the square of the standardized residuals

Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, giving us our square of standardized residuals, sᵢ²:

Obs.    Predicted Ownratio    Residuals    Standardized Residuals    Square of Standardized Residuals
1       5.165                 2.055        1.450                     2.103
2       1.300                 (0.206)      (0.146)                   0.021
3       3.504                 0.083        0.058                     0.003
4       3.821                 1.458        1.029                     1.059
5       3.749                 (0.241)      (0.170)                   0.029
6       2.331                 (1.542)      (1.088)                   1.185
7       2.174                 (0.337)      (0.238)                   0.057
8       3.358                 1.792        1.264                     1.599
9       3.466                 (1.265)      (0.893)                   0.797
10      4.135                 (2.203)      (1.555)                   2.417
11      2.249                 (1.330)      (0.939)                   0.881
12      2.415                 (0.517)      (0.365)                   0.133
13      1.428                 0.156        0.110                     0.012
14      0.763                 0.138        0.097                     0.009
15      (0.712)               0.840        0.593                     0.351
16      0.184                 (0.125)      (0.088)                   0.008
17      (0.917)               0.939        0.662                     0.439
18      (0.617)               0.789        0.557                     0.310
19      0.608                 0.308        0.217                     0.047
20      1.662                 (0.397)      (0.280)                   0.079
21      1.230                 (0.211)      (0.149)                   0.022
22      1.548                 0.150        0.106                     0.011
23      1.586                 0.602        0.425                     0.180
24      2.287                 0.563        0.397                     0.158
25      2.503                 0.546        0.385                     0.148
26      1.983                 0.324        0.229                     0.052
27      1.410                 (0.537)      (0.379)                   0.143
28      0.860                 (0.450)      (0.318)                   0.101
29      1.911                 (0.760)      (0.536)                   0.288
30      1.990                 (0.716)      (0.505)                   0.255
31      2.459                 (0.708)      (0.500)                   0.250
32      3.388                 1.686        1.190                     1.415
33      2.948                 1.324        0.935                     0.874
34      2.833                 1.035        0.730                     0.534
35      2.188                 (0.179)      (0.127)                   0.016
36      3.527                 (1.271)      (0.897)                   0.805
37      3.191                 (0.720)      (0.508)                   0.258
38      1.993                 1.026        0.724                     0.524
39      2.469                 (0.315)      (0.222)                   0.049
40      4.276                 0.914        0.645                     0.416
41      3.497                 1.082        0.764                     0.584
42      4.242                 (0.525)      (0.370)                   0.137
43      4.571                 (0.851)      (0.600)                   0.361
44      4.453                 1.674        1.182                     1.396
45      3.589                 0.879        0.621                     0.385
46      2.790                 (0.680)      (0.480)                   0.231
47      1.580                 (0.798)      (0.563)                   0.317
48      0.699                 (0.440)      (0.311)                   0.097
49      2.800                 (1.567)      (1.106)                   1.223
50      4.761                 (1.473)      (1.040)                   1.081
51      0.506                 (0.271)      (0.192)                   0.037
52      4.359                 (2.953)      (2.084)                   4.344
53      3.605                 (1.399)      (0.987)                   0.974
54      5.118                 0.532        0.375                     0.141
55      3.158                 1.920        1.355                     1.836
56      4.080                 (2.647)      (1.868)                   3.491
57      4.371                 3.081        2.175                     4.728
58      3.647                 2.091        1.476                     2.179
59      1.714                 (0.350)      (0.247)                   0.061

Step #4: Run another regression with all your independent variables, using the square of the standardized residuals as the dependent variable

In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.

Step #5: Divide the RSS by 2 and compare it with the χ² table’s critical value for the appropriate degrees of freedom

Dividing the RSS by 2, we get 7.747. We look up the critical χ² value for one degree of freedom at a 5% significance level, which is 3.84. Since our χ² statistic exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
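Here is a sketch of the five Breusch-Pagan steps in code, mirroring the arithmetic above and reusing the hypothetical income, ownratio, and resid arrays from the earlier snippets. (If you use statsmodels, its ready-made statsmodels.stats.diagnostic.het_breuschpagan uses a slightly different standard formulation, so its numbers may not match the hand computation shown here.)

```python
# A sketch of the five Breusch-Pagan steps, reusing the hypothetical
# `income` and `resid` arrays from the earlier snippets.
import numpy as np
from scipy import stats

n = len(resid)

# Step 2: estimate the variance of the residuals (1.417 in our example).
sigma2 = np.sum(resid ** 2) / n

# Step 3: square of the standardized residuals. (Following the text, each
# residual is divided by the variance estimate and then squared.)
s2 = (resid / sigma2) ** 2

# Step 4: regress s2 on the independent variable(s) and take the
# regression (explained) sum of squares -- 15.493 in our example.
slope, intercept, *_ = stats.linregress(income, s2)
fitted = slope * income + intercept
rss = np.sum((fitted - s2.mean()) ** 2)

# Step 5: compare RSS/2 with the chi-squared critical value, with degrees
# of freedom equal to the number of independent variables (here, 1).
bp = rss / 2
chi2_crit = stats.chi2.ppf(0.95, df=1)   # 3.84
print(f"BP statistic = {bp:.3f}, critical chi-squared(1) = {chi2_crit:.2f}")
```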

The Park Test

Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model’s independent variables. The steps involved are as follows:

Step #1: Run your original regression model and collect the residuals

Done.

Step #2: Square the regression residuals and compute the logs of the squared residuals and the values of the suspected independent variable.

We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:

Tract    Residual Squared    Ln(Residual Squared)    Ln(Income)
601      4.222               1.440                   10.123
602      0.043               (3.157)                 9.382
603      0.007               (4.987)                 9.868
604      2.126               0.754                   9.922
605      0.058               (2.848)                 9.910
606      2.378               0.866                   9.639
607      0.113               (2.176)                 9.604
608      3.209               1.166                   9.842
609      1.601               0.470                   9.862
609      4.852               1.579                   9.973
610      1.769               0.571                   9.621
611      0.267               (1.320)                 9.657
612      0.024               (3.720)                 9.418
613      0.019               (3.960)                 9.217
614      0.705               (0.349)                 8.535
615      0.016               (4.162)                 9.001
616      0.881               (0.127)                 8.389
616      0.622               (0.475)                 8.596
617      0.095               (2.356)                 9.163
618      0.158               (1.847)                 9.480
619      0.045               (3.112)                 9.362
620      0.022               (3.796)                 9.450
621      0.362               (1.015)                 9.460
623      0.317               (1.148)                 9.629
624      0.298               (1.211)                 9.676
625      0.105               (2.255)                 9.559
626      0.288               (1.245)                 9.413
627      0.203               (1.596)                 9.249
628      0.577               (0.549)                 9.542
629      0.513               (0.668)                 9.561
630      0.502               (0.689)                 9.667
631      2.841               1.044                   9.848
632      1.754               0.562                   9.766
633      1.071               0.069                   9.744
634      0.032               (3.437)                 9.607
635      1.615               0.479                   9.872
701      0.518               (0.658)                 9.812
705      1.052               0.051                   9.562
706      0.099               (2.309)                 9.669
710      0.835               (0.180)                 9.995
711      1.171               0.158                   9.867
712      0.275               (1.289)                 9.989
713      0.724               (0.323)                 10.039
713      2.802               1.030                   10.022
714      0.773               (0.257)                 9.883
714      0.463               (0.770)                 9.735
718      0.637               (0.452)                 9.459
718      0.194               (1.640)                 9.195
719      2.454               0.898                   9.737
719      2.169               0.774                   10.067
720      0.074               (2.608)                 9.127
721      8.720               2.166                   10.007
721      1.956               0.671                   9.886
724      0.283               (1.263)                 10.117
726      3.686               1.305                   9.806
728      7.008               1.947                   9.964
731      9.492               2.250                   10.009
731      4.373               1.476                   9.893
735      0.122               (2.102)                 9.493

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable

That results in the following regression equation:

Ln(e²) = 1.957*Ln(INCOME) – 19.592

Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.

The parameter estimate for Ln(INCOME) is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
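A sketch of the Park test in code, under the same assumptions as the earlier snippets (the hypothetical income and resid arrays), might look like this:

```python
# A sketch of the Park test, again assuming the hypothetical `income`
# and `resid` arrays from the earlier snippets.
import numpy as np
from scipy import stats

ln_e2 = np.log(resid ** 2)   # log of the squared residuals
ln_income = np.log(income)   # log of the suspect independent variable

fit = stats.linregress(ln_income, ln_e2)
t_ratio = fit.slope / fit.stderr          # ~3.499 in our example

# Two-sided critical t at the 5% level with n - 2 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=len(ln_e2) - 2)   # about 2.00
print(f"t-ratio = {t_ratio:.3f}, critical t = {t_crit:.2f}")
```

The slope from this auxiliary regression is also what next week's weighted least squares correction will draw on.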

Next Forecast Friday Topic: Correcting Heteroscedasticity

Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week’s Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity: weighted least squares (WLS) regression, and show you how to apply it. Next week’s Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.

*************************

Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has over 160 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!
