Forecast Friday Topic: Correcting Heteroscedasticity

(Nineteenth in a series)

In last week’s Forecast Friday post, we discussed the three most commonly used analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. We continued to work with our data set of 59 census tracts in Pierce County, WA, from which we were trying to determine what influence, if any, a tract’s median family income had on the ratio of the number of families in the tract who own their home to the number who rent. As we saw, heteroscedasticity was present in our model, caused largely by the wide variation in income from one census tract to another.

Recall that although INCOME, for the most part, had a positive relationship with OWNRATIO, we found many census tracts that, despite having high median family incomes, had low OWNRATIOs. This is because, unlike low-income families whose housing options are limited, high-income families have many more housing options. The fact that the wealthier census tracts have more options increases the variability in the relationship between INCOME and OWNRATIO, so our model generates errors that don’t have a constant variance and produces parameter estimates that don’t seem to make sense.

Today, we turn our attention to correcting heteroscedasticity, and we will do that by transforming our model using Weighted Least Squares (WLS) regression. And we’ll show how our results from the Park test can enable us to approximate the weights to use in our WLS model.

Weighted Least Squares Regression

The reason wide variances in the value of one or more independent variables cause heteroscedastic errors is that the regression model places heavier weight on extreme values. By weighting each observation in the data set, we eliminate that tendency. But how do we know what weights to use? That depends on whether the variances of the individual observations are known or unknown.

If the variances are known, then you would simply divide each observation by its standard deviation and then run your regression to get a transformed model. Rarely, however, is the individual variance known, so we need to apply a more intricate approach.
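To make the known-variance case concrete, here is a minimal Python sketch using the statsmodels library. The data values and the `known_variances` array are made up purely for illustration; only the weighting scheme reflects the approach just described (weighting by the reciprocal of each observation’s variance is equivalent to dividing each observation by its standard deviation).

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: five observations whose error variances are known.
y = np.array([1.2, 2.3, 3.1, 4.8, 5.9])          # dependent variable
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])     # independent variable
known_variances = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Weighting each observation by 1/variance is equivalent to dividing it
# by its standard deviation before running the regression.
X = sm.add_constant(x)                            # adds the intercept column
results = sm.WLS(y, X, weights=1.0 / known_variances).fit()
print(results.params)
```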

Returning to our housing model, our regression equation was:

Ŷ = 0.000297*INCOME – 2.221

With an R² of 0.597, an F-ratio of 84.31, and t-ratios of 9.182 for INCOME and -4.094 for the intercept.

We know that INCOME, our independent variable, is the source of the heteroscedasticity. Let’s also assume that the “correct” housing model has the same linear functional form as our model above. In this case, we divide each observation’s dependent variable (OWNRATIO) by the value of its independent variable, INCOME, forming a new dependent variable (OwnRatio_Income), and we take the reciprocal of the INCOME value to form a new independent variable, IncomeReciprocal.
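In Python, this transformation is just a pair of element-wise divisions. A minimal sketch, assuming the tract data sit in NumPy arrays named `ownratio` and `income` (the names and the three placeholder values are ours, not from the original workbook):

```python
import numpy as np

# Placeholder values standing in for the 59 census tracts
ownratio = np.array([7.25, 1.10, 3.58])             # owner-to-renter ratio
income = np.array([25_000.0, 11_900.0, 19_200.0])   # median family income

own_ratio_income = ownratio / income                # new dependent variable
income_reciprocal = 1.0 / income                    # new independent variable
```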

Recalling the Park Test

How do we know to choose the reciprocal? Remember when we did the Park test last week? We got the following equation:

Ln(e²) = 1.957(LnIncome) – 19.592

The parameter estimate for LnIncome is 1.957. The Park test assumes that the variance of the heteroscedastic error equals the variance of the homoscedastic error times Xi raised to an exponent, and the coefficient on LnIncome estimates that exponent. Since the Park test is performed by regressing a double-log function, we divide that coefficient by two to arrive at the exponent of the Xi value by which to weight our observations: 1.957 ÷ 2 = 0.9785.

Essentially, we are saying that:

Var(heteroscedastic errors in housing model) = Var(homoscedastic errors in housing model) × Xi^1.957

For simplicity’s sake, let’s round the coefficient from 1.957 to 2. Hence, we divide each observation by Xi^(2/2) = Xi. Dividing the original equation through by INCOME turns the dependent variable into OWNRATIO/INCOME and leaves the reciprocal of INCOME as the new independent variable:

OWNRATIO/INCOME = β + α·(1/INCOME) + ε/INCOME
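As a quick arithmetic sketch of that weighting rule in Python (the placeholder incomes are ours, purely for illustration):

```python
import numpy as np

gamma = 1.957                 # Park-test coefficient on Ln(INCOME)
gamma_rounded = round(gamma)  # the post rounds 1.957 to 2
exponent = gamma_rounded / 2  # halved because of the double-log form: 1.0

income = np.array([25_000.0, 11_900.0, 19_200.0])  # placeholder incomes
divisor = income ** exponent  # each observation is divided by INCOME**1
print(divisor)                # equals income itself when the exponent is 1
```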

Estimating the Housing Model Using WLS

We weight the values for each census tract’s housing data accordingly:

OwnRatio_Income    IncomeReciprocal
0.000290           0.000040
0.000092           0.000084
0.000186           0.000052
0.000259           0.000049
0.000174           0.000050
0.000051           0.000065
0.000124           0.000067
0.000274           0.000053
0.000115           0.000052
0.000090           0.000047
0.000061           0.000066
0.000121           0.000064
0.000129           0.000081
0.000090           0.000099
0.000025           0.000196
0.000007           0.000123
0.000005           0.000227
0.000032           0.000185
0.000096           0.000105
0.000097           0.000076
0.000088           0.000086
0.000134           0.000079
0.000170           0.000078
0.000187           0.000066
0.000191           0.000063
0.000163           0.000071
0.000071           0.000082
0.000039           0.000096
0.000083           0.000072
0.000090           0.000070
0.000111           0.000063
0.000268           0.000053
0.000245           0.000057
0.000227           0.000059
0.000135           0.000067
0.000116           0.000052
0.000135           0.000055
0.000212           0.000070
0.000136           0.000063
0.000237           0.000046
0.000237           0.000052
0.000171           0.000046
0.000162           0.000044
0.000272           0.000044
0.000228           0.000051
0.000125           0.000059
0.000061           0.000078
0.000026           0.000102
0.000073           0.000059
0.000140           0.000042
0.000026           0.000109
0.000063           0.000045
0.000112           0.000051
0.000228           0.000040
0.000280           0.000055
0.000067           0.000047
0.000335           0.000045
0.000290           0.000051
0.000103           0.000075

And we run a regression to get a model of this form:

OwnRatio_Income_i = α* + β1*·IncomeReciprocal_i + ε_i*

Notice the asterisks on each of the parameter estimates; they denote the transformed model. Performing our transformed regression, we get the following results.



We get an R² of .596 for the transformed model, not much different from that of our original model. However, compare the intercept of our transformed model with the coefficient of INCOME from our original model: they are almost equal. That’s because when you divide each observation by Xi, you essentially divide 0.000297*INCOME by INCOME, turning the slope into the intercept! Since heteroscedasticity doesn’t bias parameter estimates, we would expect the slope of our original model and the intercept of our transformed model to be equivalent. Those parameter estimates are averages, and heteroscedasticity biases not the average but the variance.
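As an end-to-end sketch, here is how the original and transformed regressions might be run in Python with statsmodels. The data below are simulated to mimic the housing model’s structure (the actual 59-tract values aren’t reproduced here), so the printed numbers will only roughly echo the results above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the 59 census tracts (NOT the actual data):
income = rng.uniform(8_000, 30_000, size=59)
# True relationship from the original model, with error variance that
# grows with INCOME to simulate heteroscedasticity:
ownratio = -2.221 + 0.000297 * income + rng.normal(size=59) * income / 10_000

# Original (heteroscedastic) model: OWNRATIO = alpha + beta*INCOME
original = sm.OLS(ownratio, sm.add_constant(income)).fit()

# Transformed model: OWNRATIO/INCOME = beta* + alpha*(1/INCOME)
transformed = sm.OLS(ownratio / income, sm.add_constant(1.0 / income)).fit()

# The slope of the original model and the intercept of the transformed
# model estimate the same quantity, so they should be nearly equal:
print(original.params[1], transformed.params[0])
```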

Note that the t-ratio for the intercept in our transformed model is much stronger than that for the coefficient of INCOME in our original model (12.19 vs. 9.182), suggesting that the transformed model has generated a more efficient estimate of the slope parameter. That’s because the standard error of the parameter estimate is smaller in our transformed model. We divide the parameter estimate by its standard error to get the t-ratio; because the standard error is smaller, our estimate is more trustworthy.
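To see why, recall that t = (parameter estimate) ÷ (standard error). Taking the reported figures at face value, an estimate of roughly 0.000297 with a t-ratio of 12.19 implies a standard error of about 0.000297 ÷ 12.19 ≈ 0.0000244, while the same estimate with the original model’s t-ratio of 9.182 implies a standard error of about 0.0000323. (This back-calculation is ours, for illustration; the exact figures come from the regression output.) That roughly 25 percent smaller standard error is what makes the transformed estimate more efficient.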

Recap

This concludes our discussion of the violations that can occur in regression analysis and the problems they cause. You now understand that omitting important independent variables, multicollinearity, autocorrelation, and heteroscedasticity can all lead you to generate models that produce unacceptable forecasts and predictions, and you know how to diagnose these violations and correct them. One thing you’ve probably also noticed as we went through these discussions is that data is never perfect; no matter how good our data is, we must still work with it and adapt it so we can derive actionable insights from it.

Forecast Friday Will Resume Two Weeks From Today

Next week is the weekend before Labor Day, and I am forecasting that many of you will be leaving the office early for the long weekend, so I have decided to publish the next edition of Forecast Friday on September 9. The other two posts that appear earlier in the week will continue as scheduled. Beginning with the September 9 Forecast Friday post, we will discuss additional regression analysis topics that are much less theoretical, and much more practical, than those of the last few posts. Until then, Analysights wishes you and your family a great Labor Day weekend!

****************************************************

Help us Reach 200 Fans on Facebook by Tomorrow!
Thanks to all of you, Analysights now has over 160 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!
