Posts Tagged ‘heteroscedasticity’

Forecast Friday Topic: The Linear Probability Model

October 28, 2010

(Twenty-seventh in a series)

Up to now, we have talked about how to build regression models with continuous dependent variables. Such models are intended to answer questions like, “How much can sales increase if we spend $5,000 more on advertising?”; or “How much impact does each year of formal education have on salaries in a given occupation?”; or “How much does each $1 per bushel change in the price of wheat affect the number of boxes of cereal that Kashi produces?” Such business questions are quantitative and estimate the impact of independent variables on the dependent variable at a macro level. But what if you wanted to predict a scenario that has only two outcomes?

Consider these business questions: “Is a particular individual more likely to respond to a direct marketing solicitation or not?” “Is a family more likely to vote Democrat or Republican?” “Is a particular subscriber likely to renew his/her subscription or let it lapse?” Notice that these business questions pertain to specific individuals, and each has only one of two possible outcomes. Moreover, these questions are qualitative and involve individual choice and preferences; hence they seek to understand customers at a micro level.

Only in the last 20-30 years has it become easier to develop qualitative choice models. The increased use of surveys, the growth of customer and transactional databases, and improvements in data collection processes have made it more feasible to develop models that predict phenomena with discrete outcomes. Essentially, a qualitative choice model works the same way as a regression model. You have your independent variables and your dependent variable. However, the dependent variable is a dummy variable – it has only two outcomes: 1 if it is a “yes” for a particular outcome, and 0 if it is a “no.” Generally, the number of observations with a dependent variable of 1 is much smaller than the number whose dependent variable is zero. Think about a catalog mailing. The catalog might be sent to a million people, but only one or two percent – 10,000-20,000 – will actually respond.

The Linear Probability Model

The Linear Probability Model (LPM) was one of the first ways analysts began to develop qualitative choice models. The LPM consists of running OLS regression with a dichotomous dependent variable. The resulting fitted values are used to assess the probability of an outcome: a score close to 0 means the outcome has a low probability of occurring, while a score close to 1 means the outcome is almost certain to occur. Consider the following example.

Mark Moretti, Circulation Director for the Baywood Bugle, a small town local newspaper, wants to determine how likely a subscriber is to renew his/her subscription to the Bugle. Mark has been concerned that delivery issues are causing subscribers to let their subscriptions lapse, and he also suspects that the Bugle is having a hard time trying to retain relatively new subscribers. So Mark randomly selected a sample of 30 subscribers whose subscriptions recently came in for renewal. Nine of these did let their subscriptions lapse, while the other 21 renewed. Mark also pulled the number of complaints each of these subscribers logged in the last 12 months, as well as their tenure (in years) at the time their subscription came up for renewal.

For his dependent variable, Mark used whether the subscriber renewed: 1 for yes, 0 for no. The number of complaints and the tenure served as the independent variables. Mark’s sample looked like this:

Subscriber # | Complaints | Subscriber Tenure | Renewed
1 | 16 | 1 | 0
2 | 13 | 10 | 1
3 | 5 | 14 | 1
4 | 8 | 10 | 1
5 | 0 | 8 | 1
6 | 5 | 7 | 1
7 | 5 | 7 | 1
8 | 13 | 15 | 1
9 | 9 | 10 | 1
10 | 14 | 11 | 1
11 | 6 | 10 | 1
12 | 4 | 14 | 1
13 | 16 | 10 | 0
14 | 12 | 2 | 0
15 | 9 | 9 | 1
16 | 12 | 7 | 1
17 | 20 | 4 | 0
18 | 17 | 1 | 0
19 | 2 | 11 | 1
20 | 13 | 14 | 1
21 | 5 | 13 | 1
22 | 7 | 2 | 0
23 | 9 | 12 | 1
24 | 10 | 8 | 0
25 | 0 | 10 | 1
26 | 2 | 13 | 1
27 | 19 | 4 | 0
28 | 12 | 3 | 0
29 | 10 | 9 | 1
30 | 4 | 9 | 1

 

Despite the fact that there are only two outcomes for Renewed, Mark decides to run OLS regression on these 30 subscribers. The results suggest that each one-year increase in tenure increases a subscriber’s likelihood of renewal by just under seven percent, while each additional complaint reduces the subscriber’s likelihood of renewal by just over three percent. We would expect these variables to exhibit the relationships they do, since the former is a measure of customer loyalty and the latter of customer dissatisfaction.

Mark also gets an R2=0.689 and an F-statistic of 29.93, suggesting a very good fit.
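For readers who want to try this at home, here is a minimal sketch of fitting the LPM with Python and statsmodels. The library choice and variable names are mine, not part of the original post; the data are the 30 subscribers in the table above, and the fitted coefficients and fit statistics should come out close to the figures quoted here.

```python
import numpy as np
import statsmodels.api as sm

# Mark's 30 subscribers (from the table above): complaints, tenure in years,
# and the 0/1 renewal flag used as the dependent variable
complaints = np.array([16, 13,  5,  8,  0,  5,  5, 13,  9, 14,
                        6,  4, 16, 12,  9, 12, 20, 17,  2, 13,
                        5,  7,  9, 10,  0,  2, 19, 12, 10,  4])
tenure     = np.array([ 1, 10, 14, 10,  8,  7,  7, 15, 10, 11,
                       10, 14, 10,  2,  9,  7,  4,  1, 11, 14,
                       13,  2, 12,  8, 10, 13,  4,  3,  9,  9])
renewed    = np.array([ 0,  1,  1,  1,  1,  1,  1,  1,  1,  1,
                        1,  1,  0,  0,  1,  1,  0,  0,  1,  1,
                        1,  0,  1,  0,  1,  1,  0,  0,  1,  1])

# Linear Probability Model: plain OLS with a dichotomous dependent variable
X = sm.add_constant(np.column_stack([complaints, tenure]))
lpm = sm.OLS(renewed, X).fit()

print(lpm.params)      # intercept, complaints and tenure coefficients
print(lpm.rsquared)    # R-squared
print(lpm.fvalue)      # F-statistic
print(lpm.predict(X))  # fitted "probabilities" -- several fall outside [0, 1]
```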

However, Mark’s model exhibits serious flaws. Among them:

The Error Terms are Non-Normal

In the LPM, the fitted values of the equation represent the probability that Yi=1 for the given values of Xi. The error terms, however, are not normally distributed. Because Y can take only the values 0 and 1, the error terms follow a binomial, not a normal, distribution:

If Yi = 0, then 0 = α + β1X1i + β2X2i + εi, such that:

εi = –α – β1X1i – β2X2i

If Yi = 1, then 1 = α + β1X1i + β2X2i + εi, such that:

εi = 1 – α – β1X1i – β2X2i

 

The absence of normally distributed error terms, combined with Mark’s small sample, means that significance tests on his parameter estimates cannot be trusted. If Mark’s sample were much larger, the errors would approach a normal distribution.

The Error Terms are Heteroscedastic!

The residuals do not have a constant variance. With a continuous dependent variable, if two or more observations have the same value for X, their Y values are not likely to be far apart. When the dependent variable is discrete, however, observations with the same value of an X can have a Y value of either 0 or 1. Let’s look at how the residuals in Mark’s model compare to each independent variable:

Visual inspection suggests heteroscedasticity, which makes the parameter estimates in Mark’s model inefficient.

Unacceptable Values for Ŷ!

The dependent variable can only take two values: 0 or 1. Because the LPM is intended to deliver a probability score for each observation, its predicted values should fall between 0 and 1. However, look at the following predicted probabilities the LPM calculated for nine of the thirty subscribers:

Subscriber # | Predicted Renewal
1 | (0.034)
3 | 1.204
8 | 1.031
12 | 1.234
18 | (0.064)
19 | 1.086
21 | 1.135
25 | 1.077
26 | 1.225

As the table shows, subscribers #1 and #18 have predicted probabilities of less than 0, and the other seven have predicted probabilities in excess of 1. In actuality, subscribers 1 and 18 did not renew while the other seven did, so the predictions point in the right direction; the probabilities themselves, however, are unrealistic. In this case, only 30% of the predicted values fall outside the 0-to-1 region, so the model can probably be salvaged by capping values that fall outside the region to just barely within it, as the sketch below illustrates.
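If you do constrain the model this way, the capping itself is trivial; here is a sketch that reuses the `lpm` model and design matrix `X` from the earlier snippet (the 0.001/0.999 bounds are my own arbitrary choice of “just barely within the region”):

```python
import numpy as np

raw_scores = lpm.predict(X)                 # LPM scores, some below 0 or above 1
capped = np.clip(raw_scores, 0.001, 0.999)  # force every score just inside the 0-to-1 region
```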

R2 is Useless

Another problem with Mark’s model is that R2, despite its high value, cannot be relied upon. Only a few data points lie close to the fitted regression line, as a scatter plot of each independent variable against Renewed makes clear: with a 0/1 dependent variable, the observations sit in two horizontal bands that the line cannot pass through.

 

This example is something of an exception: most LPMs generate very low R2 values for this very reason. Hence, R2 is generally disregarded in models with qualitative dependent variables.

So Why Do We Use Linear Probability Models?

Before statistical packages were widely available, the LPM was one of the only ways analysts could model qualitative dependent variables. Moreover, as an approximation, LPM results are often not terribly far from those of more appropriate qualitative choice approaches such as logistic regression. And despite both their misuse and their inferiority to those approaches, LPMs are easy to explain and conceptualize.

Next Forecast Friday Topic: Logistic Regression

Logistic regression is the more appropriate tool to use in situations like this one. Next week I will walk you through the concepts of logistic regression and illustrate a simple, one-variable model. You will see how the logistic – or logit – model is used to compute a more accurate estimate of the likelihood of an outcome occurring. You will also discover how the logistic regression model provides three values that are simply three different ways of expressing the same thing. That’s next week.
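As a teaser for next week, here is a hedged sketch of how the same renewal data could be fit with logistic regression in statsmodels. This is my own illustration rather than next week’s worked example, and it reuses the `X` and `renewed` arrays from the LPM snippet above.

```python
import numpy as np
import statsmodels.api as sm

# Logistic regression on the same data used for the LPM above
logit = sm.Logit(renewed, X).fit()

print(logit.params)          # coefficients on the log-odds (logit) scale
print(np.exp(logit.params))  # the same fit expressed as odds ratios
print(logit.predict(X))      # and as predicted probabilities, always between 0 and 1
```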

 

*************************

Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has 175 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!


Forecast Friday Topic: Correcting Heteroscedasticity

August 26, 2010

(Nineteenth in a series)

In last week’s Forecast Friday post, we discussed the three most commonly used analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. We continued to work with our data set of 59 census tracts in Pierce County, WA, from which we were trying to determine what, if any, influence the tract’s median family income had on the ratio of the number of families in the tract who own their home to the number of families who rent. As we saw, heteroscedasticity was present in our model, caused largely by the wide variation in income from one census tract to the other.

Recall that while INCOME, for the most part, had a positive relationship with OWNRATIO, we found many census tracts that, despite having high median family incomes, had low OWNRATIOs. This is because, unlike low-income families whose housing options are limited, high-income families have many more housing options. The fact that the wealthier census tracts have more options increases the variability within the relationship between INCOME and OWNRATIO, giving us error terms that do not have a constant variance and parameter estimates that are inefficient.

Today, we turn our attention to correcting heteroscedasticity, and we will do that by transforming our model using Weighted Least Squares (WLS) regression. And we’ll show how our results from the Park test can enable us to approximate the weights to use in our WLS model.

Weighted Least Squares Regression

The reason wide variances in the value of one or more independent variables cause heteroscedastic errors is that the least-squares procedure places heavier weight on extreme values. By weighting each observation in the data set, we offset that tendency. But how do we know what weights to use? That depends on whether the variances of the individual observations are known or unknown.

If the variances are known, then you would simply divide each observation by its standard deviation and then run your regression to get a transformed model. Rarely, however, is the individual variance known, so we need to apply a more intricate approach.

Returning to our housing model, our regression equation was:

Ŷ= 0.000297*Income – 2.221

With an R2=0.597, an F-ratio of 84.31, and t-ratios of 9.182 for INCOME and -4.094 for the intercept.

We know that INCOME, our independent variable, is the source of the heteroscedasticity. Let’s also assume that the “correct” housing model has a linear functional form like our model above. In this case, we divide each observation’s dependent variable (OWNRATIO) by the value of its independent variable, INCOME, forming a new dependent variable (OwnRatio_Income), and we take the reciprocal of INCOME to form a new independent variable, IncomeReciprocal.

Recalling the Park Test

How do we know to choose the reciprocal? Remember when we did the Park test last week? We got the following equation:

Ln(e2) = 1.957(LnIncome) – 19.592

The parameter estimate for LnIncome is 1.957. The Park test assumes that the variance of the heteroscedastic error equals the variance of the homoscedastic error times Xi raised to an exponent, and that coefficient is the estimate of the exponent. Since the Park test is performed by regressing a double-log function, we divide the coefficient by two to arrive at the power of Xi by which to weight our observations.

Essentially, we are saying that:

Var(heteroscedastic errors in housing model) = Var(homoscedastic errors in housing model) × Income^1.957

For simplicity’s sake, let’s round the coefficient from 1.957 to 2, so the weight is the square root of Xi2, which is simply Xi. Hence, we divide our dependent variable by Xi and replace our independent variable with its reciprocal, 1/Xi.

Estimating the Housing Model Using WLS

We weight the values for each census tract’s housing data accordingly:

OwnRatio_Income | IncomeReciprocal
0.000290 | 0.000040
0.000092 | 0.000084
0.000186 | 0.000052
0.000259 | 0.000049
0.000174 | 0.000050
0.000051 | 0.000065
0.000124 | 0.000067
0.000274 | 0.000053
0.000115 | 0.000052
0.000090 | 0.000047
0.000061 | 0.000066
0.000121 | 0.000064
0.000129 | 0.000081
0.000090 | 0.000099
0.000025 | 0.000196
0.000007 | 0.000123
0.000005 | 0.000227
0.000032 | 0.000185
0.000096 | 0.000105
0.000097 | 0.000076
0.000088 | 0.000086
0.000134 | 0.000079
0.000170 | 0.000078
0.000187 | 0.000066
0.000191 | 0.000063
0.000163 | 0.000071
0.000071 | 0.000082
0.000039 | 0.000096
0.000083 | 0.000072
0.000090 | 0.000070
0.000111 | 0.000063
0.000268 | 0.000053
0.000245 | 0.000057
0.000227 | 0.000059
0.000135 | 0.000067
0.000116 | 0.000052
0.000135 | 0.000055
0.000212 | 0.000070
0.000136 | 0.000063
0.000237 | 0.000046
0.000237 | 0.000052
0.000171 | 0.000046
0.000162 | 0.000044
0.000272 | 0.000044
0.000228 | 0.000051
0.000125 | 0.000059
0.000061 | 0.000078
0.000026 | 0.000102
0.000073 | 0.000059
0.000140 | 0.000042
0.000026 | 0.000109
0.000063 | 0.000045
0.000112 | 0.000051
0.000228 | 0.000040
0.000280 | 0.000055
0.000067 | 0.000047
0.000335 | 0.000045
0.000290 | 0.000051
0.000103 | 0.000075

 

And we run a regression, to get a model of this form:

 

OwnRatio_Incomei = α* + β1*IncomeReciprocali + εi*

Notice the asterisks on the parameter estimates; they denote the transformed model. Performing our transformed regression, we get an R2 of .596 for the transformed model, not much different from that of our original model. However, notice that the intercept of our transformed model is almost equal to the coefficient of INCOME from our original model. That’s because when you divided each observation by Xi, you essentially divided 0.000297*INCOME by INCOME, turning the slope into the intercept! Since heteroscedasticity doesn’t bias parameter estimates, we would expect the slope of our original model and the intercept of our transformed model to be equivalent. Those parameter estimates are averages, and heteroscedasticity biases not the average but the variance.

Note that the t-ratio for the intercept in our transformed model is much stronger than that of the coefficient for INCOME in our original model (12.19 vs. 9.182), suggesting that the transformed model has generated a more efficient estimate of the slope parameter. That’s because the standard error of the parameter estimate is smaller in our transformed model. We divide the parameter estimate by its standard error to get the t-ratio, so because the standard error is smaller, our estimate is more trustworthy.
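Here is a sketch of the whole transformation in code, using Python and statsmodels (my choice of tools, not the original post’s). The data arrays repeat the housing table; the transformed OLS regression and the equivalent one-step weighted least squares call, with weights proportional to 1/INCOME2, should both reproduce the pattern described above, with the original slope reappearing as the transformed model’s intercept.

```python
import numpy as np
import statsmodels.api as sm

# Median family income and own-to-rent ratio for the 59 Pierce County census tracts
income = np.array([24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
                   15075, 15634, 12307, 10063,  5090,  8110,  4399,  5411,  9541, 13095,
                   11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
                   15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
                   19282, 21795, 22904, 22507, 19592, 16900, 12818,  9849, 16931, 23545,
                    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269], dtype=float)
ownratio = np.array([7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
                     0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
                     1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
                     1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
                     4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
                     0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364])

# Original (heteroscedastic) model, for comparison
original = sm.OLS(ownratio, sm.add_constant(income)).fit()

# Manual transformation: divide the dependent variable by INCOME and
# regress it on the reciprocal of INCOME
ownratio_income   = ownratio / income
income_reciprocal = 1.0 / income
transformed = sm.OLS(ownratio_income, sm.add_constant(income_reciprocal)).fit()

# The transformed intercept should be close to the original slope on INCOME
print(original.params, transformed.params)

# Equivalent one-step shortcut: WLS with weights proportional to 1/INCOME^2
wls = sm.WLS(ownratio, sm.add_constant(income), weights=1.0 / income**2).fit()
print(wls.params)
```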

Recap

This concludes our discussions of the violations that can occur in regression analysis and the problems they cause. You now understand that omitting important independent variables, multicollinearity, autocorrelation, and heteroscedasticity can all cause you to generate models that produce unacceptable forecasts and predictions. You now know how to diagnose these violations and how to correct them. One thing you’ve probably also noticed as we went through these discussions is that data is never perfect. No matter how good our data is, we must still work with it and adapt it in a way that lets us derive actionable insights from it.

Forecast Friday Will Resume Two Weeks From Today

Next week is the weekend before Labor Day, and I am forecasting that many of you will be leaving the office early for the long weekend, so I have decided to make the next edition of Forecast Friday for September 9. The other two posts that appear earlier in the week will continue as scheduled. Beginning with the September 9 Forecast Friday post, we will talk about additional regression analysis topics that are much less theoretical than these last few posts’ topics, and much more practical. Until then, Analysights wishes you and your family a great Labor Day weekend!

****************************************************

Help us Reach 200 Fans on Facebook by Tomorrow!
Thanks to all of you, Analysights now has over 160 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

Forecast Friday Topic: Detecting Heteroscedasticity – Analytical Approaches

August 19, 2010

(Eighteenth in a series)

Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are not biased, but rather inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.

Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but they can be a bit tedious to employ.

Reviewing Our Model

Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which pulled INCOME and OWNRATIOs from 59 census tracts in Pierce County, Washington, which were compiled during the 1980 Census. We had the following data:

Housing Data

Tract | Income | Ownratio
601 | $24,909 | 7.220
602 | $11,875 | 1.094
603 | $19,308 | 3.587
604 | $20,375 | 5.279
605 | $20,132 | 3.508
606 | $15,351 | 0.789
607 | $14,821 | 1.837
608 | $18,816 | 5.150
609 | $19,179 | 2.201
609 | $21,434 | 1.932
610 | $15,075 | 0.919
611 | $15,634 | 1.898
612 | $12,307 | 1.584
613 | $10,063 | 0.901
614 | $5,090 | 0.128
615 | $8,110 | 0.059
616 | $4,399 | 0.022
616 | $5,411 | 0.172
617 | $9,541 | 0.916
618 | $13,095 | 1.265
619 | $11,638 | 1.019
620 | $12,711 | 1.698
621 | $12,839 | 2.188
623 | $15,202 | 2.850
624 | $15,932 | 3.049
625 | $14,178 | 2.307
626 | $12,244 | 0.873
627 | $10,391 | 0.410
628 | $13,934 | 1.151
629 | $14,201 | 1.274
630 | $15,784 | 1.751
631 | $18,917 | 5.074
632 | $17,431 | 4.272
633 | $17,044 | 3.868
634 | $14,870 | 2.009
635 | $19,384 | 2.256
701 | $18,250 | 2.471
705 | $14,212 | 3.019
706 | $15,817 | 2.154
710 | $21,911 | 5.190
711 | $19,282 | 4.579
712 | $21,795 | 3.717
713 | $22,904 | 3.720
713 | $22,507 | 6.127
714 | $19,592 | 4.468
714 | $16,900 | 2.110
718 | $12,818 | 0.782
718 | $9,849 | 0.259
719 | $16,931 | 1.233
719 | $23,545 | 3.288
720 | $9,198 | 0.235
721 | $22,190 | 1.406
721 | $19,646 | 2.206
724 | $24,750 | 5.650
726 | $18,140 | 5.078
728 | $21,250 | 1.433
731 | $22,231 | 7.452
731 | $19,788 | 5.738
735 | $13,269 | 1.364

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

And we got the following regression equation:

Ŷ= 0.000297*Income – 2.221

With an R2=0.597, an F-ratio of 84.31, the t-ratios for INCOME (9.182) and the intercept (-4.094) both solidly significant, and the positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.

The Goldfeld-Quandt Test

The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test postulates that the variances associated with high values of the independent variable, X, are significantly different from those associated with low values. Essentially, you run separate regressions for the low values of X and the high values, and then compare the ratio of their error sums of squares to the F distribution.

The Goldfeld-Quandt test has four steps:

Step #1: Sort the data

Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:

Housing Data

Tract | Income | Ownratio
616 | $4,399 | 0.022
614 | $5,090 | 0.128
616 | $5,411 | 0.172
615 | $8,110 | 0.059
720 | $9,198 | 0.235
617 | $9,541 | 0.916
718 | $9,849 | 0.259
613 | $10,063 | 0.901
627 | $10,391 | 0.410
619 | $11,638 | 1.019
602 | $11,875 | 1.094
626 | $12,244 | 0.873
612 | $12,307 | 1.584
620 | $12,711 | 1.698
718 | $12,818 | 0.782
621 | $12,839 | 2.188
618 | $13,095 | 1.265
735 | $13,269 | 1.364
628 | $13,934 | 1.151
625 | $14,178 | 2.307
629 | $14,201 | 1.274
705 | $14,212 | 3.019
607 | $14,821 | 1.837
634 | $14,870 | 2.009
610 | $15,075 | 0.919
623 | $15,202 | 2.850
606 | $15,351 | 0.789
611 | $15,634 | 1.898
630 | $15,784 | 1.751
706 | $15,817 | 2.154
624 | $15,932 | 3.049
714 | $16,900 | 2.110
719 | $16,931 | 1.233
633 | $17,044 | 3.868
632 | $17,431 | 4.272
726 | $18,140 | 5.078
701 | $18,250 | 2.471
608 | $18,816 | 5.150
631 | $18,917 | 5.074
609 | $19,179 | 2.201
711 | $19,282 | 4.579
603 | $19,308 | 3.587
635 | $19,384 | 2.256
714 | $19,592 | 4.468
721 | $19,646 | 2.206
731 | $19,788 | 5.738
605 | $20,132 | 3.508
604 | $20,375 | 5.279
728 | $21,250 | 1.433
609 | $21,434 | 1.932
712 | $21,795 | 3.717
710 | $21,911 | 5.190
721 | $22,190 | 1.406
731 | $22,231 | 7.452
713 | $22,507 | 6.127
713 | $22,904 | 3.720
719 | $23,545 | 3.288
724 | $24,750 | 5.650
601 | $24,909 | 7.220

Step #2: Omit the middle observations

Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There’s no hard-and-fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit the middle 13 observations (marked in the table below):

Housing Data

Tract | Income | Ownratio
616 | $4,399 | 0.022
614 | $5,090 | 0.128
616 | $5,411 | 0.172
615 | $8,110 | 0.059
720 | $9,198 | 0.235
617 | $9,541 | 0.916
718 | $9,849 | 0.259
613 | $10,063 | 0.901
627 | $10,391 | 0.410
619 | $11,638 | 1.019
602 | $11,875 | 1.094
626 | $12,244 | 0.873
612 | $12,307 | 1.584
620 | $12,711 | 1.698
718 | $12,818 | 0.782
621 | $12,839 | 2.188
618 | $13,095 | 1.265
735 | $13,269 | 1.364
628 | $13,934 | 1.151
625 | $14,178 | 2.307
629 | $14,201 | 1.274
705 | $14,212 | 3.019
607 | $14,821 | 1.837
[middle 13 observations, omitted from the test]
634 | $14,870 | 2.009
610 | $15,075 | 0.919
623 | $15,202 | 2.850
606 | $15,351 | 0.789
611 | $15,634 | 1.898
630 | $15,784 | 1.751
706 | $15,817 | 2.154
624 | $15,932 | 3.049
714 | $16,900 | 2.110
719 | $16,931 | 1.233
633 | $17,044 | 3.868
632 | $17,431 | 4.272
726 | $18,140 | 5.078
[end of omitted observations]
701 | $18,250 | 2.471
608 | $18,816 | 5.150
631 | $18,917 | 5.074
609 | $19,179 | 2.201
711 | $19,282 | 4.579
603 | $19,308 | 3.587
635 | $19,384 | 2.256
714 | $19,592 | 4.468
721 | $19,646 | 2.206
731 | $19,788 | 5.738
605 | $20,132 | 3.508
604 | $20,375 | 5.279
728 | $21,250 | 1.433
609 | $21,434 | 1.932
712 | $21,795 | 3.717
710 | $21,911 | 5.190
721 | $22,190 | 1.406
731 | $22,231 | 7.452
713 | $22,507 | 6.127
713 | $22,904 | 3.720
719 | $23,545 | 3.288
724 | $24,750 | 5.650
601 | $24,909 | 7.220

 

Step #3: Run two separate regressions, one for the low values, one for the high

We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren’t concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom.

Step #4: Divide the ESS of the higher-value regression by the ESS of the lower-value regression, and compare the quotient to the F-table.

The higher-value regression produced an ESS of 61.489 and the lower-value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now we go to the F-table and check the critical F-value at the 5% significance level with 21 degrees of freedom in the numerator and denominator, which is 2.10. Since our quotient of 11.851 is greater than the critical F-value, we conclude there is strong evidence of heteroscedasticity in the model.
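Here is a sketch of the four steps in code, assuming the `income` and `ownratio` arrays defined in the WLS sketch earlier on this page; the use of scipy for the critical F-value is my own choice.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# income, ownratio: the 59 census-tract values (see the WLS sketch earlier on this page)

# Step 1: sort by the suspect independent variable
order = np.argsort(income)
inc_sorted, own_sorted = income[order], ownratio[order]

# Step 2: omit the middle 13 observations, keeping the 23 lowest and 23 highest
low_inc,  low_own  = inc_sorted[:23],  own_sorted[:23]
high_inc, high_own = inc_sorted[-23:], own_sorted[-23:]

# Step 3: run separate regressions and collect each error sum of squares (ESS)
ess_low  = sm.OLS(low_own,  sm.add_constant(low_inc)).fit().ssr
ess_high = sm.OLS(high_own, sm.add_constant(high_inc)).fit().ssr

# Step 4: compare the ESS ratio to the critical F value with (21, 21) degrees of freedom
gq_ratio = ess_high / ess_low
f_crit = stats.f.ppf(0.95, 21, 21)
print(gq_ratio, f_crit, gq_ratio > f_crit)  # True flags heteroscedasticity
```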

The Breusch-Pagan Test

The Breusch-Pagan test is also pretty simple, but it’s a very powerful test, in that it can be used to detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ2) to determine the presence of heteroscedasticity, and works best with large sample sets. There are five steps to the Breusch-Pagan test:

Step #1: Run the regular regression model and collect the residuals

We already did that.

Step #2: Estimate the variance of the regression residuals

To do this, we square each residual, sum the squares, and then divide by the number of observations. Our formula is: variance = Σei2 / n, the sum of the squared residuals divided by the number of observations.

Our residuals and their squares are as follows:

Observation | Predicted Ownratio | Residuals | Residuals Squared
1 | 5.165 | 2.055 | 4.222
2 | 1.300 | (0.206) | 0.043
3 | 3.504 | 0.083 | 0.007
4 | 3.821 | 1.458 | 2.126
5 | 3.749 | (0.241) | 0.058
6 | 2.331 | (1.542) | 2.378
7 | 2.174 | (0.337) | 0.113
8 | 3.358 | 1.792 | 3.209
9 | 3.466 | (1.265) | 1.601
10 | 4.135 | (2.203) | 4.852
11 | 2.249 | (1.330) | 1.769
12 | 2.415 | (0.517) | 0.267
13 | 1.428 | 0.156 | 0.024
14 | 0.763 | 0.138 | 0.019
15 | (0.712) | 0.840 | 0.705
16 | 0.184 | (0.125) | 0.016
17 | (0.917) | 0.939 | 0.881
18 | (0.617) | 0.789 | 0.622
19 | 0.608 | 0.308 | 0.095
20 | 1.662 | (0.397) | 0.158
21 | 1.230 | (0.211) | 0.045
22 | 1.548 | 0.150 | 0.022
23 | 1.586 | 0.602 | 0.362
24 | 2.287 | 0.563 | 0.317
25 | 2.503 | 0.546 | 0.298
26 | 1.983 | 0.324 | 0.105
27 | 1.410 | (0.537) | 0.288
28 | 0.860 | (0.450) | 0.203
29 | 1.911 | (0.760) | 0.577
30 | 1.990 | (0.716) | 0.513
31 | 2.459 | (0.708) | 0.502
32 | 3.388 | 1.686 | 2.841
33 | 2.948 | 1.324 | 1.754
34 | 2.833 | 1.035 | 1.071
35 | 2.188 | (0.179) | 0.032
36 | 3.527 | (1.271) | 1.615
37 | 3.191 | (0.720) | 0.518
38 | 1.993 | 1.026 | 1.052
39 | 2.469 | (0.315) | 0.099
40 | 4.276 | 0.914 | 0.835
41 | 3.497 | 1.082 | 1.171
42 | 4.242 | (0.525) | 0.275
43 | 4.571 | (0.851) | 0.724
44 | 4.453 | 1.674 | 2.802
45 | 3.589 | 0.879 | 0.773
46 | 2.790 | (0.680) | 0.463
47 | 1.580 | (0.798) | 0.637
48 | 0.699 | (0.440) | 0.194
49 | 2.800 | (1.567) | 2.454
50 | 4.761 | (1.473) | 2.169
51 | 0.506 | (0.271) | 0.074
52 | 4.359 | (2.953) | 8.720
53 | 3.605 | (1.399) | 1.956
54 | 5.118 | 0.532 | 0.283
55 | 3.158 | 1.920 | 3.686
56 | 4.080 | (2.647) | 7.008
57 | 4.371 | 3.081 | 9.492
58 | 3.647 | 2.091 | 4.373
59 | 1.714 | (0.350) | 0.122

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.

Step #3: Compute the square of the standardized residuals

Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, so that we get our square of standardized residuals, si2:

Obs. | Predicted Ownratio | Residuals | Standardized Residuals | Square of Standardized Residuals
1 | 5.165 | 2.055 | 1.450 | 2.103
2 | 1.300 | (0.206) | (0.146) | 0.021
3 | 3.504 | 0.083 | 0.058 | 0.003
4 | 3.821 | 1.458 | 1.029 | 1.059
5 | 3.749 | (0.241) | (0.170) | 0.029
6 | 2.331 | (1.542) | (1.088) | 1.185
7 | 2.174 | (0.337) | (0.238) | 0.057
8 | 3.358 | 1.792 | 1.264 | 1.599
9 | 3.466 | (1.265) | (0.893) | 0.797
10 | 4.135 | (2.203) | (1.555) | 2.417
11 | 2.249 | (1.330) | (0.939) | 0.881
12 | 2.415 | (0.517) | (0.365) | 0.133
13 | 1.428 | 0.156 | 0.110 | 0.012
14 | 0.763 | 0.138 | 0.097 | 0.009
15 | (0.712) | 0.840 | 0.593 | 0.351
16 | 0.184 | (0.125) | (0.088) | 0.008
17 | (0.917) | 0.939 | 0.662 | 0.439
18 | (0.617) | 0.789 | 0.557 | 0.310
19 | 0.608 | 0.308 | 0.217 | 0.047
20 | 1.662 | (0.397) | (0.280) | 0.079
21 | 1.230 | (0.211) | (0.149) | 0.022
22 | 1.548 | 0.150 | 0.106 | 0.011
23 | 1.586 | 0.602 | 0.425 | 0.180
24 | 2.287 | 0.563 | 0.397 | 0.158
25 | 2.503 | 0.546 | 0.385 | 0.148
26 | 1.983 | 0.324 | 0.229 | 0.052
27 | 1.410 | (0.537) | (0.379) | 0.143
28 | 0.860 | (0.450) | (0.318) | 0.101
29 | 1.911 | (0.760) | (0.536) | 0.288
30 | 1.990 | (0.716) | (0.505) | 0.255
31 | 2.459 | (0.708) | (0.500) | 0.250
32 | 3.388 | 1.686 | 1.190 | 1.415
33 | 2.948 | 1.324 | 0.935 | 0.874
34 | 2.833 | 1.035 | 0.730 | 0.534
35 | 2.188 | (0.179) | (0.127) | 0.016
36 | 3.527 | (1.271) | (0.897) | 0.805
37 | 3.191 | (0.720) | (0.508) | 0.258
38 | 1.993 | 1.026 | 0.724 | 0.524
39 | 2.469 | (0.315) | (0.222) | 0.049
40 | 4.276 | 0.914 | 0.645 | 0.416
41 | 3.497 | 1.082 | 0.764 | 0.584
42 | 4.242 | (0.525) | (0.370) | 0.137
43 | 4.571 | (0.851) | (0.600) | 0.361
44 | 4.453 | 1.674 | 1.182 | 1.396
45 | 3.589 | 0.879 | 0.621 | 0.385
46 | 2.790 | (0.680) | (0.480) | 0.231
47 | 1.580 | (0.798) | (0.563) | 0.317
48 | 0.699 | (0.440) | (0.311) | 0.097
49 | 2.800 | (1.567) | (1.106) | 1.223
50 | 4.761 | (1.473) | (1.040) | 1.081
51 | 0.506 | (0.271) | (0.192) | 0.037
52 | 4.359 | (2.953) | (2.084) | 4.344
53 | 3.605 | (1.399) | (0.987) | 0.974
54 | 5.118 | 0.532 | 0.375 | 0.141
55 | 3.158 | 1.920 | 1.355 | 1.836
56 | 4.080 | (2.647) | (1.868) | 3.491
57 | 4.371 | 3.081 | 2.175 | 4.728
58 | 3.647 | 2.091 | 1.476 | 2.179
59 | 1.714 | (0.350) | (0.247) | 0.061

 

Step #4: Run another regression with all your independent variables, using the square of the standardized residuals as the dependent variable

In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.

Step #5: Divide the RSS by 2 and compare with the χ2 table’s critical value for the appropriate degrees of freedom

Dividing the RSS by 2, we get 7.747. We look up the critical χ2 value for one degree of freedom at a 5% significance level and get 3.84. Since our χ2 value exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
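For anyone who would rather not build the test by hand, statsmodels ships a Breusch-Pagan routine. The sketch below is my own illustration (again assuming the `income` and `ownratio` arrays from the WLS sketch earlier on this page); its LM statistic comes from the standard form of the test, so it won’t match the hand-calculated RSS/2 figure exactly, but it leads to the same conclusion.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# income, ownratio: the 59 census-tract values (see the WLS sketch earlier on this page)
X = sm.add_constant(income)
resid = sm.OLS(ownratio, X).fit().resid

# Returns (LM statistic, LM p-value, F statistic, F p-value)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_stat, lm_pvalue)  # a p-value below 0.05 is evidence of heteroscedasticity
```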

The Park Test

Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, it provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model’s independent variables. The steps involved are as follows:

Step #1: Run your original regression model and collect the residuals

Done.

Step #2: Square the regression residuals and compute the logs of the squared residuals and the values of the suspected independent variable.

We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:

Tract | Residual Squared | LnResidual Squared | LnIncome
601 | 4.222 | 1.440 | 10.123
602 | 0.043 | (3.157) | 9.382
603 | 0.007 | (4.987) | 9.868
604 | 2.126 | 0.754 | 9.922
605 | 0.058 | (2.848) | 9.910
606 | 2.378 | 0.866 | 9.639
607 | 0.113 | (2.176) | 9.604
608 | 3.209 | 1.166 | 9.842
609 | 1.601 | 0.470 | 9.862
609 | 4.852 | 1.579 | 9.973
610 | 1.769 | 0.571 | 9.621
611 | 0.267 | (1.320) | 9.657
612 | 0.024 | (3.720) | 9.418
613 | 0.019 | (3.960) | 9.217
614 | 0.705 | (0.349) | 8.535
615 | 0.016 | (4.162) | 9.001
616 | 0.881 | (0.127) | 8.389
616 | 0.622 | (0.475) | 8.596
617 | 0.095 | (2.356) | 9.163
618 | 0.158 | (1.847) | 9.480
619 | 0.045 | (3.112) | 9.362
620 | 0.022 | (3.796) | 9.450
621 | 0.362 | (1.015) | 9.460
623 | 0.317 | (1.148) | 9.629
624 | 0.298 | (1.211) | 9.676
625 | 0.105 | (2.255) | 9.559
626 | 0.288 | (1.245) | 9.413
627 | 0.203 | (1.596) | 9.249
628 | 0.577 | (0.549) | 9.542
629 | 0.513 | (0.668) | 9.561
630 | 0.502 | (0.689) | 9.667
631 | 2.841 | 1.044 | 9.848
632 | 1.754 | 0.562 | 9.766
633 | 1.071 | 0.069 | 9.744
634 | 0.032 | (3.437) | 9.607
635 | 1.615 | 0.479 | 9.872
701 | 0.518 | (0.658) | 9.812
705 | 1.052 | 0.051 | 9.562
706 | 0.099 | (2.309) | 9.669
710 | 0.835 | (0.180) | 9.995
711 | 1.171 | 0.158 | 9.867
712 | 0.275 | (1.289) | 9.989
713 | 0.724 | (0.323) | 10.039
713 | 2.802 | 1.030 | 10.022
714 | 0.773 | (0.257) | 9.883
714 | 0.463 | (0.770) | 9.735
718 | 0.637 | (0.452) | 9.459
718 | 0.194 | (1.640) | 9.195
719 | 2.454 | 0.898 | 9.737
719 | 2.169 | 0.774 | 10.067
720 | 0.074 | (2.608) | 9.127
721 | 8.720 | 2.166 | 10.007
721 | 1.956 | 0.671 | 9.886
724 | 0.283 | (1.263) | 10.117
726 | 3.686 | 1.305 | 9.806
728 | 7.008 | 1.947 | 9.964
731 | 9.492 | 2.250 | 10.009
731 | 4.373 | 1.476 | 9.893
735 | 0.122 | (2.102) | 9.493

 

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable

That results in the following regression equation:

Ln(e2) = 1.957(LnIncome) – 19.592

Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.

The parameter estimate for LnIncome is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
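The Park test is short enough to script directly; here is a sketch, assuming the same `income` and `ownratio` arrays used in the earlier snippets on this page.

```python
import numpy as np
import statsmodels.api as sm

# income, ownratio: the 59 census-tract values (see the WLS sketch earlier on this page)
resid = sm.OLS(ownratio, sm.add_constant(income)).fit().resid

# Regress the log of the squared residuals on the log of the suspect variable
park = sm.OLS(np.log(resid**2), sm.add_constant(np.log(income))).fit()

print(park.params)   # the slope is the Park coefficient -- the post reports 1.957
print(park.tvalues)  # a significant slope t-ratio signals heteroscedasticity
```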

Next Forecast Friday Topic: Correcting Heteroscedasticity

Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week’s Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity: weighted least squares (WLS) regression, and show you how to apply it. Next week’s Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.

*************************

Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has over 160 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

Forecast Friday Topic: Heteroscedasticity

August 12, 2010

(Seventeenth in a series)

Recall that one of the important assumptions in regression analysis is that a regression equation exhibit homoscedasticity: the condition that the error terms have a constant variance. Today we discuss heteroscedasticity, the violation of that assumption.

Heteroscedasticity, like autocorrelation and multicollinearity, results in inefficient parameter estimates. The standard errors of the parameter estimates tend to be biased, which means that the t-ratios and confidence intervals calculated around the suspect independent variable will not be valid, and will generate dubious predictions.

Heteroscedasticity occurs mostly in cross-sectional, as opposed to time series, data and mostly in large data sets. When data sets are large, the range of values for an independent variable can be quite wide. This is especially the case in data where income or other measures of wealth are used as independent variables. Persons with low income have few options about how to spend their money, while persons with high incomes have many options. If you were trying to predict whether the conviction rate for crimes differs between low-income and high-income counties, your model might exhibit heteroscedasticity, because a low-income defendant may not have the funds for an adequate defense and may be restricted to a public defender or other inexpensive attorney. A wealthy individual, on the other hand, can hire the very best defense lawyer money can buy; or he could choose an inexpensive lawyer, or even the public defender. The wealthy individual may even be able to make restitution in lieu of a conviction.

How does this disparity affect your model? Recall from our earlier discussions on regression analysis that the least-squares method places more weight on extreme values. When outliers exist in data, they generate large residuals that get scattered out from those of the remaining observations. While heteroscedastic error terms will still have a mean of zero, their variance is greatly out of whack, resulting in inefficient parameter estimates.

In today’s Forecast Friday post, we will look at a data set for a regional housing market, perform a regression, and show how to detect heteroscedasticity visually.

Heteroscedasticity in the Housing Market

The best depiction of heteroscedasticity comes from my college econometrics textbook, Introducing Econometrics, by William S. Brown. In the chapter on heteroscedasticity, Brown provides a data set of housing statistics from the 1980 Census for Pierce County, Washington, which I am going to use for our model. The housing market is certainly one market where heteroscedasticity is deeply entrenched, since there is a dramatic range for both incomes and home market values. In our data set, we have 59 census tracts within Pierce County. Our independent variable is the median family income for the census tract; our dependent variable is the OwnRatio – the ratio of the number of families who own their homes to the number of families who rent. Our data set is as follows:

Housing Data

Tract | Income | Ownratio
601 | $24,909 | 7.220
602 | $11,875 | 1.094
603 | $19,308 | 3.587
604 | $20,375 | 5.279
605 | $20,132 | 3.508
606 | $15,351 | 0.789
607 | $14,821 | 1.837
608 | $18,816 | 5.150
609 | $19,179 | 2.201
609 | $21,434 | 1.932
610 | $15,075 | 0.919
611 | $15,634 | 1.898
612 | $12,307 | 1.584
613 | $10,063 | 0.901
614 | $5,090 | 0.128
615 | $8,110 | 0.059
616 | $4,399 | 0.022
616 | $5,411 | 0.172
617 | $9,541 | 0.916
618 | $13,095 | 1.265
619 | $11,638 | 1.019
620 | $12,711 | 1.698
621 | $12,839 | 2.188
623 | $15,202 | 2.850
624 | $15,932 | 3.049
625 | $14,178 | 2.307
626 | $12,244 | 0.873
627 | $10,391 | 0.410
628 | $13,934 | 1.151
629 | $14,201 | 1.274
630 | $15,784 | 1.751
631 | $18,917 | 5.074
632 | $17,431 | 4.272
633 | $17,044 | 3.868
634 | $14,870 | 2.009
635 | $19,384 | 2.256
701 | $18,250 | 2.471
705 | $14,212 | 3.019
706 | $15,817 | 2.154
710 | $21,911 | 5.190
711 | $19,282 | 4.579
712 | $21,795 | 3.717
713 | $22,904 | 3.720
713 | $22,507 | 6.127
714 | $19,592 | 4.468
714 | $16,900 | 2.110
718 | $12,818 | 0.782
718 | $9,849 | 0.259
719 | $16,931 | 1.233
719 | $23,545 | 3.288
720 | $9,198 | 0.235
721 | $22,190 | 1.406
721 | $19,646 | 2.206
724 | $24,750 | 5.650
726 | $18,140 | 5.078
728 | $21,250 | 1.433
731 | $22,231 | 7.452
731 | $19,788 | 5.738
735 | $13,269 | 1.364

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

When we run our regression, we get the following equation:

Ŷ= 0.000297*Income – 2.221

Both the intercept and independent variable’s parameter estimates are significant, with the intercept parameter having a t-ratio of -4.094 and the income estimate having one of 9.182. R2 is 0.597, and the F-statistic is a strong 84.31. The model seems to be pretty good – strong t-ratios and F-statistic, a high coefficient of determination, and the sign on the parameter estimate for Income is positive, as we would expect. Generally, the higher the income, the greater the Own-to-rent ratio. So far so good.

The problem comes when we do a visual inspection of our data: first the independent variable against the dependent variable and the independent variable against the regression residuals. First, let’s take a look at the scatter plot of Income and OwnRatio:

Without even looking at the residuals, we can see that as median family income increases, the data points begin to spread out. Look at what happens to the distance between data points above and below the line when median family incomes reach $20,000: OwnRatios vary drastically.

Now let’s plot Income against the regression’s residuals:

This scatter plot shows essentially the same phenomenon as the previous graph, but from a different perspective. We can clearly see the error terms fanning out as Income increases. In fact, we can see the residuals diverging at increasing rates once Income moves from $10,000 to $15,000, and the spread just compounds as incomes go higher. Roughly half the residuals fall on the positive side and half on the negative side, so we still meet the regression assumption that the residuals have a mean of zero, and hence our parameter estimates are not biased. However, because we violated the constant variance assumption, the standard error of our regression is biased, so inferences drawn from our parameter estimates are suspect.
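The two plots described above are easy to reproduce; here is a minimal matplotlib sketch (my own construction), assuming `income` and `ownratio` are NumPy arrays holding the 59 values from the housing table.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# income, ownratio: NumPy arrays of the 59 census-tract values from the table above
fit = sm.OLS(ownratio, sm.add_constant(income)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Income vs. OwnRatio -- the scatter widens as income rises
ax1.scatter(income, ownratio)
ax1.set_xlabel("Median family income")
ax1.set_ylabel("OwnRatio")

# Income vs. residuals -- the tell-tale fanning-out pattern
ax2.scatter(income, fit.resid)
ax2.axhline(0)
ax2.set_xlabel("Median family income")
ax2.set_ylabel("Residual")

plt.tight_layout()
plt.show()
```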

Visual Inspection Only Gets You So Far

By visually inspecting our residuals, we can clearly see that our error terms are not homoscedastic. When you build a regression model, especially for cross-sectional data sets like this one, you should visually inspect every independent variable against the dependent variable and against the error terms in order to get an a priori indication of heteroscedasticity. However, visual inspection alone is not proof that heteroscedasticity exists. There are three particularly simple methods for detecting heteroscedasticity, which we will discuss in next week’s Forecast Friday post: the Park Test, the Goldfeld-Quandt Test, and the Breusch-Pagan Test.

*************************

Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has 150 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights!  By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter.   Thanks for your help!

Forecast Friday Topic: Correcting Autocorrelation

August 5, 2010

(Sixteenth in a series)

Last week, we discussed how to detect autocorrelation – the violation of the regression assumption that the error terms are not correlated with one another – in your forecasting model. Models exhibiting autocorrelation have parameter estimates that are inefficient, and R2s and t-ratios that seem overly inflated. As a result, your model generates forecasts that are too good to be true and has a tendency to miss turning points in your time series. In last week’s Forecast Friday post, we showed you how to diagnose autocorrelation: examining the model’s parameter estimates, visually inspecting the data, and computing the Durbin-Watson statistic. Today, we’re going to discuss how to correct it.

Revisiting our Data Set

Recall our data set: average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987, as reported in the Survey of Current Business (September issues from 1986 and 1987), and reprinted in Data Analysis Using Microsoft ® Excel, by Michael R. Middleton, page 219:

Month | t | Wage
Jan-86 | 1 | 5.82
Feb-86 | 2 | 5.79
Mar-86 | 3 | 5.80
Apr-86 | 4 | 5.81
May-86 | 5 | 5.78
Jun-86 | 6 | 5.79
Jul-86 | 7 | 5.79
Aug-86 | 8 | 5.83
Sep-86 | 9 | 5.91
Oct-86 | 10 | 5.87
Nov-86 | 11 | 5.87
Dec-86 | 12 | 5.90
Jan-87 | 13 | 5.94
Feb-87 | 14 | 5.93
Mar-87 | 15 | 5.93
Apr-87 | 16 | 5.94
May-87 | 17 | 5.89
Jun-87 | 18 | 5.91

We generated the following regression model:

Ŷ = 5.7709 + 0.0095t

Our model had an R2 of .728, and t-ratios of about 368 for the intercept term and 6.55 for the parameter estimate, t. The Durbin-Watson statistic was 1.05, indicating positive autocorrelation. How do we correct for autocorrelation?

Lagging the Dependent Variable

One of the most common remedies for autocorrelation is to lag the dependent variable one or more periods and then make the lagged dependent variable the independent variable. So, in our data set above, you would take the first value of the dependent variable, $5.82, and make it the independent variable for period 2, with $5.79 being the dependent variable; in like manner, $5.79 will also become the independent variable for the next period, whose dependent variable has a value of $5.80, and so on. Since the error terms from one period to another exhibit correlation, by using the previous value of the dependent variable to predict the next one, you reduce that correlation of errors.

You can lag for as many periods as you need to; however, note that you lose the first observation when you lag one period (unless you know the previous period before the start of the data set, you have nothing to predict the first observation). You’ll lose two observations if you lag two periods, and so on. If you have a very small data set, the loss of degrees of freedom can lead to Type II error – failing to identify a parameter estimate as significant, when in fact it is. So, you must be careful here.

In this case, by lagging our data by one period, we have the following data set:

Month | Wage | Lag1 Wage
Feb-86 | $5.79 | $5.82
Mar-86 | $5.80 | $5.79
Apr-86 | $5.81 | $5.80
May-86 | $5.78 | $5.81
Jun-86 | $5.79 | $5.78
Jul-86 | $5.79 | $5.79
Aug-86 | $5.83 | $5.79
Sep-86 | $5.91 | $5.83
Oct-86 | $5.87 | $5.91
Nov-86 | $5.87 | $5.87
Dec-86 | $5.90 | $5.87
Jan-87 | $5.94 | $5.90
Feb-87 | $5.93 | $5.94
Mar-87 | $5.93 | $5.93
Apr-87 | $5.94 | $5.93
May-87 | $5.89 | $5.94
Jun-87 | $5.91 | $5.89

 

So, we have created a new independent variable, Lag1_Wage. Notice that we are not including the time period t as an independent variable. This doesn’t mean that we should or shouldn’t; in this case, we’re only trying to demonstrate the effect of the lagging.

Rerunning the Regression

Now we do our regression analysis. We come up with the following equation:

Ŷ = 0.8253 + 0.8600*Lag1_Wage

Apparently, from this model, each $1 change in hourly wage from the previous month is associated with an average $0.86 change in hourly wages for the current month. The R2 for this model was virtually unchanged, at 0.730. However, the Durbin-Watson statistic is now 2.01 – just about the total eradication of autocorrelation. Unfortunately, the intercept has a t-ratio of 1.04, indicating it is not significant. The t-ratio for Lag1_Wage is about 6.37, not much different from the t-ratio for t in our previous model. However, we did get rid of the autocorrelation.

The statistically insignificant intercept term is likely a consequence of the Type II error that comes with losing a degree of freedom in a small sample. Perhaps if we had several more months of data, we might have had a significant intercept estimate.
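Here is a sketch of the lag-and-refit step in pandas and statsmodels (my own construction; the wage figures are those in the table above, and the reported coefficient and Durbin-Watson statistic should come out close to the numbers quoted in this post).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Average hourly wages, January 1986 through June 1987 (from the table above)
wage = pd.Series([5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
                  5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91])

# Lag the dependent variable one period; the first month is lost in the process
df = pd.DataFrame({"wage": wage, "lag1_wage": wage.shift(1)}).dropna()

# Regress the current wage on last month's wage
model = sm.OLS(df["wage"], sm.add_constant(df["lag1_wage"])).fit()

print(model.params)                # intercept and Lag1_Wage coefficient
print(durbin_watson(model.resid))  # should be near 2, indicating little remaining autocorrelation
```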

Other Approaches to Correcting Autocorrelation

There are other approaches to correcting autocorrelation. One important remedy is to identify independent variables that have been omitted from the model. Perhaps if we had data on the average years of work experience of the textile and apparel labor force from month to month, that might have increased our R2 and reduced the correlation in the error terms. Another thing we could do is difference the data. Differencing works like lagging, only we subtract the values of the dependent and independent variables in the first observation from their respective values in the second observation; then we subtract the second observation’s original values from the third’s, and so on. We then run a regression on the differences between observations. The problem here is that, again, your data set is reduced by one observation, and your transformed model will not have an intercept term, which can cause issues in some studies. A short sketch of the differencing step appears below.
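This sketch continues from the `wage` series in the snippet above; the regression on the differenced data proceeds as described in the text.

```python
# First differences of the wage series: each month's wage minus the previous month's.
# One observation is lost, and the differenced regression is run without an intercept.
d_wage = wage.diff().dropna()
print(d_wage)
```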

Other approaches to correcting autocorrelation include quasi-differencing, the Cochrane-Orcutt procedure, the Hildreth-Lu procedure, and the Durbin two-step method. These methods are iterative, require a lot of tedious effort, and are beyond the scope of this post. But many college-level forecasting textbooks have sections on these procedures if you’re interested in further reading on them.

Next Forecast Friday Topic: Detecting Heteroscedasticity

Next week, we’ll discuss the last of the regression violations, heteroscedasticity, which is the violation of the assumption that error terms have a constant variance. We will discuss why heteroscedasticity exists and how to diagnose it. The week after that, we’ll discuss remedying heteroscedasticity. Once we have completed our discussions on the regression violations, we will spend a couple of weeks discussing regression modeling techniques like transforming independent variables, using categorical variables, adjusting for seasonality, and other regression techniques. These topics will be far less theoretical and more practical in terms of forecasting.