(Nineteenth in a series)
In last week’s Forecast Friday post, we discussed the three most commonly used analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. We continued to work with our data set of 59 census tracts in Pierce County, WA, from which we were trying to determine what influence, if any, the tract’s median family income had on the ratio of the number of families in the tract who own their home to the number who rent. As we saw, heteroscedasticity was present in our model, caused largely by the wide variation in income from one census tract to another.
Recall that while INCOME, for the most part, had a positive relationship with OWNRATIO, we found many census tracts that, despite having high median family incomes, had low OWNRATIOs. This is because, unlike low-income families, whose housing options are limited, high-income families have many more housing options. The fact that the wealthier census tracts have more options increases the variability in the relationship between INCOME and OWNRATIO, generating errors that don’t have a constant variance and parameter estimates that don’t seem to make sense.
Today, we turn our attention to correcting heteroscedasticity, and we will do that by transforming our model using Weighted Least Squares (WLS) regression. And we’ll show how our results from the Park test can enable us to approximate the weights to use in our WLS model.
Weighted Least Squares Regression
The reason wide variances in the value of one or more independent variables cause heteroscedastic errors is that the regression model places heavier weight on extreme values. By weighting each observation in the data set, we eliminate that tendency. But how do we know what weights to use? That depends on whether the variances of the individual observations are known or unknown.
If the variances are known, then you would simply divide each observation by its standard deviation and then run your regression to get a transformed model. Rarely, however, is the individual variance known, so we need to apply a more intricate approach.
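When the variances are known, that division can be carried out in a few lines. The sketch below is not from the original post: the data are simulated, and the variable names and use of NumPy are illustrative assumptions. It divides every term of the regression, intercept column included, by each observation’s known error standard deviation, then runs ordinary least squares on the transformed data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 59 observations with KNOWN, unequal error variances.
n = 59
x = rng.uniform(10_000, 60_000, n)      # e.g. median family income
sigma = 0.00001 * x                     # known std. dev. of each observation's error
y = -2.221 + 0.000297 * x + rng.normal(0, sigma)

# Divide every term of the regression -- the intercept column included --
# by each observation's standard deviation, then run ordinary least squares.
X = np.column_stack([np.ones(n), x])    # [intercept, income]
Xw = X / sigma[:, None]
yw = y / sigma

beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(beta)   # [intercept, slope] -- close to the true (-2.221, 0.000297)
```

Dividing by the standard deviation makes every transformed error have the same variance, which is exactly the homoscedasticity that ordinary least squares assumes.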
Returning to our housing model, our regression equation was:
Ŷ = 0.000297*INCOME – 2.221
With an R^2 = 0.597, an F-ratio of 84.31, and t-ratios of 9.182 for INCOME and -4.094 for the intercept.
We know that INCOME, our independent variable, is the source of the heteroscedasticity. Let’s also assume that the “correct” housing model has the same linear functional form as our model above. In this case, we divide each observation’s dependent variable (OWNRATIO) by the value of its independent variable, INCOME, forming a new dependent variable (OwnRatio_Income), and we replace INCOME with its reciprocal, forming a new independent variable (IncomeReciprocal).
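As a minimal sketch of that transformation (the values and column names below are illustrative assumptions, not the actual Pierce County figures):

```python
# Illustrative tract data -- not the original 59-tract figures.
income = [31000.0, 25000.0, 19000.0]   # median family income
own_ratio = [9.0, 2.3, 3.5]            # owners-to-renters ratio

# New dependent variable: OWNRATIO divided by INCOME
own_ratio_income = [y / x for y, x in zip(own_ratio, income)]

# New independent variable: the reciprocal of INCOME
income_reciprocal = [1.0 / x for x in income]
```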
Recalling the Park Test
How do we know to choose the reciprocal? Remember when we did the Park test last week? We got the following equation:
Ln(e^2) = 1.957*Ln(INCOME) – 19.592
The parameter estimate for LnIncome is 1.957. The Park test assumes that the variance of the heteroscedastic error equals the variance of the homoscedastic error times X_i raised to some exponent, and the coefficient on LnIncome estimates that exponent. Essentially, we are saying that:

Var(heterosc. errors in housing model) = Var(homosc. errors in housing model) * Income_i^1.957

Because we weight each observation by its standard deviation, which is the square root of its variance, we divide that exponent by two to arrive at the power of X_i by which to deflate our observations.
For simplicity’s sake, let’s round the exponent from 1.957 to 2, so the standard deviation of the error is proportional to X_i^(2/2) = X_i. Hence, we divide both sides of the regression equation by X_i: the dependent variable becomes OWNRATIO/X_i, and the original intercept, divided by X_i, becomes the coefficient on the new independent variable, the reciprocal 1/X_i.
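The arithmetic from the Park test coefficient to the weighting power can be sketched as follows (the tract income shown is an illustrative assumption):

```python
# Park test slope from last week's post: Ln(e^2) = 1.957*Ln(INCOME) - 19.592
park_slope = 1.957

# Variance is proportional to X_i ** park_slope, so the standard deviation
# -- the quantity we divide by -- is X_i ** (park_slope / 2).
weight_exponent = park_slope / 2       # ~0.98, rounded to 1 in the post

income_i = 20_000.0                    # an illustrative tract income
divisor = income_i ** round(weight_exponent)   # divide the whole equation by X_i
```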
Estimating the Housing Model Using WLS
We weight the values for each census tract’s housing data accordingly:
| OwnRatio_Income | IncomeReciprocal |
| --------------- | ---------------- |
| 0.000290 | 0.000040 |
| 0.000092 | 0.000084 |
| 0.000186 | 0.000052 |
| 0.000259 | 0.000049 |
| 0.000174 | 0.000050 |
| 0.000051 | 0.000065 |
| 0.000124 | 0.000067 |
| 0.000274 | 0.000053 |
| 0.000115 | 0.000052 |
| 0.000090 | 0.000047 |
| 0.000061 | 0.000066 |
| 0.000121 | 0.000064 |
| 0.000129 | 0.000081 |
| 0.000090 | 0.000099 |
| 0.000025 | 0.000196 |
| 0.000007 | 0.000123 |
| 0.000005 | 0.000227 |
| 0.000032 | 0.000185 |
| 0.000096 | 0.000105 |
| 0.000097 | 0.000076 |
| 0.000088 | 0.000086 |
| 0.000134 | 0.000079 |
| 0.000170 | 0.000078 |
| 0.000187 | 0.000066 |
| 0.000191 | 0.000063 |
| 0.000163 | 0.000071 |
| 0.000071 | 0.000082 |
| 0.000039 | 0.000096 |
| 0.000083 | 0.000072 |
| 0.000090 | 0.000070 |
| 0.000111 | 0.000063 |
| 0.000268 | 0.000053 |
| 0.000245 | 0.000057 |
| 0.000227 | 0.000059 |
| 0.000135 | 0.000067 |
| 0.000116 | 0.000052 |
| 0.000135 | 0.000055 |
| 0.000212 | 0.000070 |
| 0.000136 | 0.000063 |
| 0.000237 | 0.000046 |
| 0.000237 | 0.000052 |
| 0.000171 | 0.000046 |
| 0.000162 | 0.000044 |
| 0.000272 | 0.000044 |
| 0.000228 | 0.000051 |
| 0.000125 | 0.000059 |
| 0.000061 | 0.000078 |
| 0.000026 | 0.000102 |
| 0.000073 | 0.000059 |
| 0.000140 | 0.000042 |
| 0.000026 | 0.000109 |
| 0.000063 | 0.000045 |
| 0.000112 | 0.000051 |
| 0.000228 | 0.000040 |
| 0.000280 | 0.000055 |
| 0.000067 | 0.000047 |
| 0.000335 | 0.000045 |
| 0.000290 | 0.000051 |
| 0.000103 | 0.000075 |
We then run a regression to get a model of this form:
OwnRatio_Income_i = α* + β1*·IncomeReciprocal_i + ε_i*
Notice the asterisks on each of the parameter estimates; they denote the transformed model. We then perform the transformed regression.
We get an R^2 of 0.596 for the transformed model, not much different from that of our original model. However, notice that the intercept of our transformed model and the coefficient of INCOME from our original model are almost equal. That’s because when we divided each observation by X_i, we essentially divided 0.000297*INCOME by INCOME, turning the slope into the intercept, while the original intercept, -2.221, became the coefficient on IncomeReciprocal. Since heteroscedasticity doesn’t bias parameter estimates, we would expect the slope of our original model and the intercept of our transformed model to be equivalent. Those parameter estimates are averages, and heteroscedasticity doesn’t bias the average; what it distorts is the variance.
Note that the t-ratio for the intercept in our transformed model is much stronger than that of the coefficient for INCOME in our original model (12.19 vs. 9.182), suggesting that the transformed model has generated a more efficient estimate of the slope parameter. That’s because the standard error of the parameter estimate is smaller in our transformed model. We divide a parameter estimate by its standard error to get its t-ratio; because the standard error is smaller, our estimate is more trustworthy.
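To see the coefficients trade places for yourself, you can simulate data with the same structure. The sketch below uses simulated stand-in data (the original 59-tract figures are not reproduced here), with error variance growing with INCOME squared to match the rounded Park test exponent, and fits both the original and the transformed models:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated stand-in for the 59-tract data: the error's standard deviation
# grows in proportion to INCOME, i.e. its variance grows with INCOME squared.
n = 59
income = rng.uniform(5_000, 45_000, n)
err = rng.normal(0, 0.00003 * income)            # heteroscedastic errors
own_ratio = -2.221 + 0.000297 * income + err

def ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Original model: OWNRATIO on INCOME
a0, b0 = ols(income, own_ratio)

# Transformed model: (OWNRATIO / INCOME) on (1 / INCOME)
a1, b1 = ols(1.0 / income, own_ratio / income)

# The slope of the original model reappears as the intercept of the
# transformed model, and the original intercept becomes the new slope.
print(b0, a1)   # both near 0.000297
print(a0, b1)   # both near -2.221
```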
Recap
This concludes our discussion of the violations that can occur in regression analysis and the problems they can cause. You now understand that omitting important independent variables, multicollinearity, autocorrelation, and heteroscedasticity can all cause you to generate models that produce unacceptable forecasts and predictions. You now know how to diagnose these violations and how to correct them. One thing you’ve probably also noticed as we went through these discussions is that data is never perfect; no matter how good our data is, we must still work with it and adapt it so that we can derive actionable insights from it.
Forecast Friday Will Resume Two Weeks From Today
Next week is the weekend before Labor Day, and I am forecasting that many of you will be leaving the office early for the long weekend, so I have decided to make the next edition of Forecast Friday for September 9. The other two posts that appear earlier in the week will continue as scheduled. Beginning with the September 9 Forecast Friday post, we will talk about additional regression analysis topics that are much less theoretical than these last few posts’ topics, and much more practical. Until then, Analysights wishes you and your family a great Labor Day weekend!
****************************************************
Help us Reach 200 Fans on Facebook by Tomorrow!
Thanks to all of you, Analysights now has over 160 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!