(Eighteenth in a series)
Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are unbiased but inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.
Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but can be a bit tedious to employ.
Reviewing Our Model
Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which pulled INCOME and OWNRATIOs from 59 census tracts in Pierce County, Washington, which were compiled during the 1980 Census. We had the following data:
Housing Data

Tract   Income    Ownratio
601     $24,909   7.220
602     $11,875   1.094
603     $19,308   3.587
604     $20,375   5.279
605     $20,132   3.508
606     $15,351   0.789
607     $14,821   1.837
608     $18,816   5.150
609     $19,179   2.201
609     $21,434   1.932
610     $15,075   0.919
611     $15,634   1.898
612     $12,307   1.584
613     $10,063   0.901
614     $5,090    0.128
615     $8,110    0.059
616     $4,399    0.022
616     $5,411    0.172
617     $9,541    0.916
618     $13,095   1.265
619     $11,638   1.019
620     $12,711   1.698
621     $12,839   2.188
623     $15,202   2.850
624     $15,932   3.049
625     $14,178   2.307
626     $12,244   0.873
627     $10,391   0.410
628     $13,934   1.151
629     $14,201   1.274
630     $15,784   1.751
631     $18,917   5.074
632     $17,431   4.272
633     $17,044   3.868
634     $14,870   2.009
635     $19,384   2.256
701     $18,250   2.471
705     $14,212   3.019
706     $15,817   2.154
710     $21,911   5.190
711     $19,282   4.579
712     $21,795   3.717
713     $22,904   3.720
713     $22,507   6.127
714     $19,592   4.468
714     $16,900   2.110
718     $12,818   0.782
718     $9,849    0.259
719     $16,931   1.233
719     $23,545   3.288
720     $9,198    0.235
721     $22,190   1.406
721     $19,646   2.206
724     $24,750   5.650
726     $18,140   5.078
728     $21,250   1.433
731     $22,231   7.452
731     $19,788   5.738
735     $13,269   1.364

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.
And we got the following regression equation:
Ŷ= 0.000297*Income – 2.221
With an R^{2} of 0.597, an F-ratio of 84.31, t-ratios for INCOME (9.182) and the intercept (-4.094) that are both solidly significant, and a positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.
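For readers who want to reproduce the fit, here is a minimal sketch in plain Python. It assumes the 59 INCOME and OWNRATIO pairs are typed in from the Housing Data table in table order; the helper name fit_line is our own and not from the textbook. If the data are entered correctly, the slope and intercept it prints should land close to the regression equation above.

```python
# Data transcribed from the Housing Data table, in table order.
INCOME = [24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
          15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
          11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
          15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
          19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
          9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269]
OWNRATIO = [7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
            0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
            1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
            1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
            4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
            0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364]


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(INCOME, OWNRATIO)

# Residual = actual OWNRATIO minus the value predicted by the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(INCOME, OWNRATIO)]
sse = sum(e ** 2 for e in residuals)

print(f"OWNRATIO = {slope:.6f} * INCOME + ({intercept:.3f})")
print(f"sum of squared residuals = {sse:.3f}")
```

The residuals list is what the three tests below work from, so it is worth keeping around.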
The Goldfeld-Quandt Test
The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test postulates that the error variances associated with high values of the independent variable, X, are significantly different from those associated with low values. Essentially, you run separate regression analyses for the low values of X and the high values, and then compare their error sums of squares with an F-test.
The Goldfeld-Quandt test has four steps:
Step #1: Sort the data
Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:
Housing Data

Tract   Income    Ownratio
616     $4,399    0.022
614     $5,090    0.128
616     $5,411    0.172
615     $8,110    0.059
720     $9,198    0.235
617     $9,541    0.916
718     $9,849    0.259
613     $10,063   0.901
627     $10,391   0.410
619     $11,638   1.019
602     $11,875   1.094
626     $12,244   0.873
612     $12,307   1.584
620     $12,711   1.698
718     $12,818   0.782
621     $12,839   2.188
618     $13,095   1.265
735     $13,269   1.364
628     $13,934   1.151
625     $14,178   2.307
629     $14,201   1.274
705     $14,212   3.019
607     $14,821   1.837
634     $14,870   2.009
610     $15,075   0.919
623     $15,202   2.850
606     $15,351   0.789
611     $15,634   1.898
630     $15,784   1.751
706     $15,817   2.154
624     $15,932   3.049
714     $16,900   2.110
719     $16,931   1.233
633     $17,044   3.868
632     $17,431   4.272
726     $18,140   5.078
701     $18,250   2.471
608     $18,816   5.150
631     $18,917   5.074
609     $19,179   2.201
711     $19,282   4.579
603     $19,308   3.587
635     $19,384   2.256
714     $19,592   4.468
721     $19,646   2.206
731     $19,788   5.738
605     $20,132   3.508
604     $20,375   5.279
728     $21,250   1.433
609     $21,434   1.932
712     $21,795   3.717
710     $21,911   5.190
721     $22,190   1.406
731     $22,231   7.452
713     $22,507   6.127
713     $22,904   3.720
719     $23,545   3.288
724     $24,750   5.650
601     $24,909   7.220

Step #2: Omit the middle observations
Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There's no hard and fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit the 13 middle observations (marked "omitted" in the table below), leaving the 23 lowest and the 23 highest values of INCOME:
Housing Data

Tract   Income    Ownratio   Status
616     $4,399    0.022      low
614     $5,090    0.128      low
616     $5,411    0.172      low
615     $8,110    0.059      low
720     $9,198    0.235      low
617     $9,541    0.916      low
718     $9,849    0.259      low
613     $10,063   0.901      low
627     $10,391   0.410      low
619     $11,638   1.019      low
602     $11,875   1.094      low
626     $12,244   0.873      low
612     $12,307   1.584      low
620     $12,711   1.698      low
718     $12,818   0.782      low
621     $12,839   2.188      low
618     $13,095   1.265      low
735     $13,269   1.364      low
628     $13,934   1.151      low
625     $14,178   2.307      low
629     $14,201   1.274      low
705     $14,212   3.019      low
607     $14,821   1.837      low
634     $14,870   2.009      omitted
610     $15,075   0.919      omitted
623     $15,202   2.850      omitted
606     $15,351   0.789      omitted
611     $15,634   1.898      omitted
630     $15,784   1.751      omitted
706     $15,817   2.154      omitted
624     $15,932   3.049      omitted
714     $16,900   2.110      omitted
719     $16,931   1.233      omitted
633     $17,044   3.868      omitted
632     $17,431   4.272      omitted
726     $18,140   5.078      omitted
701     $18,250   2.471      high
608     $18,816   5.150      high
631     $18,917   5.074      high
609     $19,179   2.201      high
711     $19,282   4.579      high
603     $19,308   3.587      high
635     $19,384   2.256      high
714     $19,592   4.468      high
721     $19,646   2.206      high
731     $19,788   5.738      high
605     $20,132   3.508      high
604     $20,375   5.279      high
728     $21,250   1.433      high
609     $21,434   1.932      high
712     $21,795   3.717      high
710     $21,911   5.190      high
721     $22,190   1.406      high
731     $22,231   7.452      high
713     $22,507   6.127      high
713     $22,904   3.720      high
719     $23,545   3.288      high
724     $24,750   5.650      high
601     $24,909   7.220      high

Step #3: Run two separate regressions, one for the low values, one for the high
We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren't concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom (23 observations minus the two estimated parameters).
Step #4: Divide the ESS of the higher value regression by the ESS of the lower value regression, and compare the quotient to the F-table.
The higher value regression produced an ESS of 61.489 and the lower value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now, we need to go to the F-table and check the critical F-value at the 5% significance level (95% confidence) with 21 degrees of freedom in both the numerator and the denominator, which is approximately 2.10. Since our quotient of 11.851 is greater than the critical F-value, we can conclude there is strong evidence of heteroscedasticity in the model.
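The four steps can be sketched in plain Python as follows. This is an illustration under our own assumptions rather than the textbook's code: the function names are hypothetical, the data are typed in from the Housing Data table, and the critical value of 2.10 is simply the figure quoted in the text rather than something the script looks up.

```python
# Data transcribed from the Housing Data table (original table order).
INCOME = [24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
          15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
          11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
          15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
          19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
          9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269]
OWNRATIO = [7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
            0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
            1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
            1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
            4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
            0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364]


def error_sum_of_squares(x, y):
    """Fit y = a + b*x by least squares and return the error sum of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(x, y, n_omit):
    """Return (ESS_high / ESS_low, degrees of freedom of each sub-regression)."""
    pairs = sorted(zip(x, y))                     # Step 1: sort by X, low to high
    n_keep = (len(pairs) - n_omit) // 2           # Step 2: drop the middle rows
    low, high = pairs[:n_keep], pairs[-n_keep:]
    ess_low = error_sum_of_squares(*zip(*low))    # Step 3: two regressions
    ess_high = error_sum_of_squares(*zip(*high))
    return ess_high / ess_low, n_keep - 2         # Step 4: the F quotient


ratio, df = goldfeld_quandt(INCOME, OWNRATIO, n_omit=13)
F_CRITICAL = 2.10  # value quoted in the text; not computed by this script

print(f"F = {ratio:.3f} with ({df}, {df}) degrees of freedom")
print("heteroscedasticity indicated" if ratio > F_CRITICAL else "no evidence")
```

Putting the higher-value ESS in the numerator matters here: the test as described assumes the error variance grows with X, so a quotient well above 1 is what signals trouble.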
The Breusch-Pagan Test
The Breusch-Pagan test is also pretty simple, but it's a very powerful test, in that it can be used to detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ^{2}) to determine the presence of heteroscedasticity, and works best with large samples. There are five steps to the Breusch-Pagan test:
Step #1: Run the regular regression model and collect the residuals
We already did that.
Step #2: Estimate the variance of the regression residuals
To do this, we square each residual, sum the squares, and then divide by the number of observations, n. Our formula is:
\hat{\sigma}^{2} = \frac{\sum e_{i}^{2}}{n}
Our residuals and their squares are as follows:
Observation   Predicted Ownratio   Residuals   Residuals Squared
1             5.165                2.055       4.222
2             1.300                (0.206)     0.043
3             3.504                0.083       0.007
4             3.821                1.458       2.126
5             3.749                (0.241)     0.058
6             2.331                (1.542)     2.378
7             2.174                (0.337)     0.113
8             3.358                1.792       3.209
9             3.466                (1.265)     1.601
10            4.135                (2.203)     4.852
11            2.249                (1.330)     1.769
12            2.415                (0.517)     0.267
13            1.428                0.156       0.024
14            0.763                0.138       0.019
15            (0.712)              0.840       0.705
16            0.184                (0.125)     0.016
17            (0.917)              0.939       0.881
18            (0.617)              0.789       0.622
19            0.608                0.308       0.095
20            1.662                (0.397)     0.158
21            1.230                (0.211)     0.045
22            1.548                0.150       0.022
23            1.586                0.602       0.362
24            2.287                0.563       0.317
25            2.503                0.546       0.298
26            1.983                0.324       0.105
27            1.410                (0.537)     0.288
28            0.860                (0.450)     0.203
29            1.911                (0.760)     0.577
30            1.990                (0.716)     0.513
31            2.459                (0.708)     0.502
32            3.388                1.686       2.841
33            2.948                1.324       1.754
34            2.833                1.035       1.071
35            2.188                (0.179)     0.032
36            3.527                (1.271)     1.615
37            3.191                (0.720)     0.518
38            1.993                1.026       1.052
39            2.469                (0.315)     0.099
40            4.276                0.914       0.835
41            3.497                1.082       1.171
42            4.242                (0.525)     0.275
43            4.571                (0.851)     0.724
44            4.453                1.674       2.802
45            3.589                0.879       0.773
46            2.790                (0.680)     0.463
47            1.580                (0.798)     0.637
48            0.699                (0.440)     0.194
49            2.800                (1.567)     2.454
50            4.761                (1.473)     2.169
51            0.506                (0.271)     0.074
52            4.359                (2.953)     8.720
53            3.605                (1.399)     1.956
54            5.118                0.532       0.283
55            3.158                1.920       3.686
56            4.080                (2.647)     7.008
57            4.371                3.081       9.492
58            3.647                2.091       4.373
59            1.714                (0.350)     0.122

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.
Step #3: Compute the square of the standardized residuals
Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, so that we get our square of standardized residuals, s_{i}^{2}:
Obs.   Predicted Ownratio   Residuals   Standardized Residuals   Square of Standardized Residuals
1      5.165                2.055       1.450                    2.103
2      1.300                (0.206)     (0.146)                  0.021
3      3.504                0.083       0.058                    0.003
4      3.821                1.458       1.029                    1.059
5      3.749                (0.241)     (0.170)                  0.029
6      2.331                (1.542)     (1.088)                  1.185
7      2.174                (0.337)     (0.238)                  0.057
8      3.358                1.792       1.264                    1.599
9      3.466                (1.265)     (0.893)                  0.797
10     4.135                (2.203)     (1.555)                  2.417
11     2.249                (1.330)     (0.939)                  0.881
12     2.415                (0.517)     (0.365)                  0.133
13     1.428                0.156       0.110                    0.012
14     0.763                0.138       0.097                    0.009
15     (0.712)              0.840       0.593                    0.351
16     0.184                (0.125)     (0.088)                  0.008
17     (0.917)              0.939       0.662                    0.439
18     (0.617)              0.789       0.557                    0.310
19     0.608                0.308       0.217                    0.047
20     1.662                (0.397)     (0.280)                  0.079
21     1.230                (0.211)     (0.149)                  0.022
22     1.548                0.150       0.106                    0.011
23     1.586                0.602       0.425                    0.180
24     2.287                0.563       0.397                    0.158
25     2.503                0.546       0.385                    0.148
26     1.983                0.324       0.229                    0.052
27     1.410                (0.537)     (0.379)                  0.143
28     0.860                (0.450)     (0.318)                  0.101
29     1.911                (0.760)     (0.536)                  0.288
30     1.990                (0.716)     (0.505)                  0.255
31     2.459                (0.708)     (0.500)                  0.250
32     3.388                1.686       1.190                    1.415
33     2.948                1.324       0.935                    0.874
34     2.833                1.035       0.730                    0.534
35     2.188                (0.179)     (0.127)                  0.016
36     3.527                (1.271)     (0.897)                  0.805
37     3.191                (0.720)     (0.508)                  0.258
38     1.993                1.026       0.724                    0.524
39     2.469                (0.315)     (0.222)                  0.049
40     4.276                0.914       0.645                    0.416
41     3.497                1.082       0.764                    0.584
42     4.242                (0.525)     (0.370)                  0.137
43     4.571                (0.851)     (0.600)                  0.361
44     4.453                1.674       1.182                    1.396
45     3.589                0.879       0.621                    0.385
46     2.790                (0.680)     (0.480)                  0.231
47     1.580                (0.798)     (0.563)                  0.317
48     0.699                (0.440)     (0.311)                  0.097
49     2.800                (1.567)     (1.106)                  1.223
50     4.761                (1.473)     (1.040)                  1.081
51     0.506                (0.271)     (0.192)                  0.037
52     4.359                (2.953)     (2.084)                  4.344
53     3.605                (1.399)     (0.987)                  0.974
54     5.118                0.532       0.375                    0.141
55     3.158                1.920       1.355                    1.836
56     4.080                (2.647)     (1.868)                  3.491
57     4.371                3.081       2.175                    4.728
58     3.647                2.091       1.476                    2.179
59     1.714                (0.350)     (0.247)                  0.061

Step #4: Run another regression with all your independent variables, using the square of the standardized residuals as the dependent variable
In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.
Step #5: Divide the RSS by 2 and compare with the χ^{2} table’s critical value for the appropriate degrees of freedom
Dividing the RSS by 2, we get 7.747. We look up the critical χ^{2} value for one degree of freedom in the table and, for a 5% significance level, we get 3.84. Since our χ^{2} value exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
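Here is a rough Python sketch of the five steps, with function names of our own choosing and the data typed in from the Housing Data table. One assumption to flag loudly: the sketch uses the usual textbook scaling, in which each squared residual is divided by the variance (e_{i}^{2} / 1.417), rather than dividing each residual by 1.417 and then squaring as in the table above. Its statistic will therefore not equal 7.747, although it should still clear the 3.84 critical value comfortably.

```python
# Data transcribed from the Housing Data table (original table order).
INCOME = [24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
          15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
          11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
          15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
          19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
          9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269]
OWNRATIO = [7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
            0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
            1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
            1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
            4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
            0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364]


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def breusch_pagan(x, y):
    """Return (residual variance, test statistic) for a single regressor."""
    n = len(x)
    slope, intercept = fit_line(x, y)                        # Step 1
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e ** 2 for e in resid) / n                  # Step 2
    g = [e ** 2 / sigma2 for e in resid]                     # Step 3 (textbook scaling)
    g_slope, g_intercept = fit_line(x, g)                    # Step 4
    mean_g = sum(g) / n
    # Regression (explained) sum of squares of the auxiliary regression.
    rss = sum((g_intercept + g_slope * xi - mean_g) ** 2 for xi in x)
    return sigma2, rss / 2                                   # Step 5


sigma2, statistic = breusch_pagan(INCOME, OWNRATIO)
CHI2_CRITICAL = 3.84  # 5% level, one degree of freedom, as quoted in the text

print(f"residual variance = {sigma2:.3f}")
print(f"Breusch-Pagan statistic = {statistic:.3f}")
print("heteroscedasticity indicated" if statistic > CHI2_CRITICAL else "no evidence")
```

With more than one independent variable, Step 4 would become a multiple regression and the degrees of freedom would equal the number of regressors; this sketch handles only the one-variable case discussed here.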
The Park Test
Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model's independent variables. The steps involved are as follows:
Step #1: Run your original regression model and collect the residuals
Done.
Step #2: Square the regression residuals and compute the logs of the squared residuals and the values of the suspected independent variable.
We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:
Tract   Residual Squared   LnResidual Squared   LnIncome
601     4.222              1.440                10.123
602     0.043              (3.157)              9.382
603     0.007              (4.987)              9.868
604     2.126              0.754                9.922
605     0.058              (2.848)              9.910
606     2.378              0.866                9.639
607     0.113              (2.176)              9.604
608     3.209              1.166                9.842
609     1.601              0.470                9.862
609     4.852              1.579                9.973
610     1.769              0.571                9.621
611     0.267              (1.320)              9.657
612     0.024              (3.720)              9.418
613     0.019              (3.960)              9.217
614     0.705              (0.349)              8.535
615     0.016              (4.162)              9.001
616     0.881              (0.127)              8.389
616     0.622              (0.475)              8.596
617     0.095              (2.356)              9.163
618     0.158              (1.847)              9.480
619     0.045              (3.112)              9.362
620     0.022              (3.796)              9.450
621     0.362              (1.015)              9.460
623     0.317              (1.148)              9.629
624     0.298              (1.211)              9.676
625     0.105              (2.255)              9.559
626     0.288              (1.245)              9.413
627     0.203              (1.596)              9.249
628     0.577              (0.549)              9.542
629     0.513              (0.668)              9.561
630     0.502              (0.689)              9.667
631     2.841              1.044                9.848
632     1.754              0.562                9.766
633     1.071              0.069                9.744
634     0.032              (3.437)              9.607
635     1.615              0.479                9.872
701     0.518              (0.658)              9.812
705     1.052              0.051                9.562
706     0.099              (2.309)              9.669
710     0.835              (0.180)              9.995
711     1.171              0.158                9.867
712     0.275              (1.289)              9.989
713     0.724              (0.323)              10.039
713     2.802              1.030                10.022
714     0.773              (0.257)              9.883
714     0.463              (0.770)              9.735
718     0.637              (0.452)              9.459
718     0.194              (1.640)              9.195
719     2.454              0.898                9.737
719     2.169              0.774                10.067
720     0.074              (2.608)              9.127
721     8.720              2.166                10.007
721     1.956              0.671                9.886
724     0.283              (1.263)              10.117
726     3.686              1.305                9.806
728     7.008              1.947                9.964
731     9.492              2.250                10.009
731     4.373              1.476                9.893
735     0.122              (2.102)              9.493

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable
That results in the following regression equation:
Ln(e^{2}) = 1.957(LnIncome) – 19.592
Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.
The parameter estimate for LnIncome is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
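The Park test steps can be sketched in plain Python as well. As before, the function names are our own, the data are typed in from the Housing Data table, and the cutoff of 2.0 for the t-ratio is only a rough two-tailed 5% rule of thumb for 57 degrees of freedom, not a figure from the textbook.

```python
import math

# Data transcribed from the Housing Data table (original table order).
INCOME = [24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
          15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
          11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
          15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
          19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
          9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269]
OWNRATIO = [7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
            0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
            1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
            1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
            4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
            0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364]


def fit_line(x, y):
    """OLS with one predictor; returns (slope, intercept, t-ratio of slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ess = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(ess / (n - 2) / sxx)   # standard error of the slope
    return slope, intercept, slope / se_slope


def park_test(x, y):
    """Regress ln(e^2) on ln(x); returns (slope, intercept, t-ratio)."""
    slope, intercept, _ = fit_line(x, y)                          # Step 1
    ln_e2 = [math.log((yi - (intercept + slope * xi)) ** 2)      # Step 2
             for xi, yi in zip(x, y)]
    ln_x = [math.log(xi) for xi in x]
    return fit_line(ln_x, ln_e2)                                  # Step 3


b, a, t_ratio = park_test(INCOME, OWNRATIO)

print(f"Ln(e^2) = {b:.3f} * LnIncome + ({a:.3f})")
print(f"t-ratio for LnIncome = {t_ratio:.3f}")                    # Step 4
print("heteroscedasticity indicated" if abs(t_ratio) > 2.0 else "no evidence")
```

Note that the log of a squared residual is undefined if a residual is exactly zero; that does not happen with this data set, but a more defensive version would need to handle it.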
Next Forecast Friday Topic: Correcting Heteroscedasticity
Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week's Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity, weighted least squares (WLS) regression, and show you how to apply it. Next week's Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.