(Seventeenth in a series)
Recall that one of the important assumptions in regression analysis is that the regression equation exhibits homoscedasticity: the condition that the error terms have a constant variance. Today we discuss heteroscedasticity, the violation of that assumption.
Heteroscedasticity, like autocorrelation and multicollinearity, results in inefficient parameter estimates. The standard errors of the parameter estimates tend to be biased, which means that the t-ratios and confidence intervals calculated for the suspect independent variable will not be valid, and any conclusions or predictions drawn from them will be dubious.
Heteroscedasticity occurs mostly in cross-sectional, as opposed to time series, data and mostly in large data sets. When data sets are large, the range of values for an independent variable can be quite wide. This is especially the case in data where income or other measures of wealth are used as independent variables. Persons with low incomes have few options about how to spend their money, while persons with high incomes have many options. If you were trying to model how the conviction rate for crimes differs between low-income and high-income counties, your model might exhibit heteroscedasticity because a low-income person may not have the funds for an adequate defense, and may be restricted to a public defender or another inexpensive attorney. A wealthy individual, on the other hand, can hire the very best defense lawyer money can buy, or he can choose an inexpensive lawyer, or even the public defender. The wealthy individual may even be able to make restitution in lieu of a conviction.
How does this disparity affect your model? Recall from our earlier discussions on regression analysis that the least-squares method places more weight on extreme values. When outliers exist in data, they generate large residuals that get scattered out from those of the remaining observations. While heteroscedastic error terms will still have a mean of zero, their variance is no longer constant across observations, resulting in inefficient parameter estimates.
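To make the "mean of zero, but non-constant variance" idea concrete, here is a minimal simulation sketch. It is not drawn from any real data: the predictor ranges, the `scale` of 0.1, and the seed are all assumptions chosen purely for illustration. It draws error terms whose standard deviation grows in proportion to the predictor and then compares the low and high groups.

```python
import random
from statistics import mean, pvariance

random.seed(17)  # fixed seed so the sketch is repeatable


def simulate_errors(x_values, scale=0.1):
    """Draw one zero-mean error per x, with standard deviation scale * x."""
    return [random.gauss(0.0, scale * x) for x in x_values]


# Hypothetical predictor values (think of income in thousands of dollars).
low_x = [random.uniform(5, 10) for _ in range(4000)]
high_x = [random.uniform(20, 25) for _ in range(4000)]

low_err = simulate_errors(low_x)
high_err = simulate_errors(high_x)

overall_mean = mean(low_err + high_err)  # should sit close to zero
var_low = pvariance(low_err)             # spread of errors at low x
var_high = pvariance(high_err)           # spread of errors at high x

print(f"mean of all errors:       {overall_mean:.3f}")
print(f"error variance, low x:    {var_low:.3f}")
print(f"error variance, high x:   {var_high:.3f}")
```

In a run of this sketch, the overall mean should hover near zero while the variance in the high group comes out several times larger than in the low group. That is the pattern heteroscedasticity produces.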
In today’s Forecast Friday post, we will look at a data set for a regional housing market, perform a regression, and show how to detect heteroscedasticity visually.
Heteroscedasticity in the Housing Market
The best depiction of heteroscedasticity comes from my college econometrics textbook, Introducing Econometrics, by William S. Brown. In the chapter on heteroscedasticity, Brown provides a data set of housing statistics from the 1980 Census for Pierce County, Washington, which I am going to use for our model. The housing market is certainly one market where heteroscedasticity is deeply entrenched, since there is a dramatic range for both incomes and home market values. In our data set, we have 59 census tracts within Pierce County. Our independent variable is the median family income for the census tract; our dependent variable is the OwnRatio – the ratio of the number of families who own their homes to the number of families who rent. Our data set is as follows:
Housing Data

| Tract | Income | OwnRatio |
|-------|---------|----------|
| 601 | $24,909 | 7.220 |
| 602 | $11,875 | 1.094 |
| 603 | $19,308 | 3.587 |
| 604 | $20,375 | 5.279 |
| 605 | $20,132 | 3.508 |
| 606 | $15,351 | 0.789 |
| 607 | $14,821 | 1.837 |
| 608 | $18,816 | 5.150 |
| 609 | $19,179 | 2.201 |
| 609 | $21,434 | 1.932 |
| 610 | $15,075 | 0.919 |
| 611 | $15,634 | 1.898 |
| 612 | $12,307 | 1.584 |
| 613 | $10,063 | 0.901 |
| 614 | $5,090 | 0.128 |
| 615 | $8,110 | 0.059 |
| 616 | $4,399 | 0.022 |
| 616 | $5,411 | 0.172 |
| 617 | $9,541 | 0.916 |
| 618 | $13,095 | 1.265 |
| 619 | $11,638 | 1.019 |
| 620 | $12,711 | 1.698 |
| 621 | $12,839 | 2.188 |
| 623 | $15,202 | 2.850 |
| 624 | $15,932 | 3.049 |
| 625 | $14,178 | 2.307 |
| 626 | $12,244 | 0.873 |
| 627 | $10,391 | 0.410 |
| 628 | $13,934 | 1.151 |
| 629 | $14,201 | 1.274 |
| 630 | $15,784 | 1.751 |
| 631 | $18,917 | 5.074 |
| 632 | $17,431 | 4.272 |
| 633 | $17,044 | 3.868 |
| 634 | $14,870 | 2.009 |
| 635 | $19,384 | 2.256 |
| 701 | $18,250 | 2.471 |
| 705 | $14,212 | 3.019 |
| 706 | $15,817 | 2.154 |
| 710 | $21,911 | 5.190 |
| 711 | $19,282 | 4.579 |
| 712 | $21,795 | 3.717 |
| 713 | $22,904 | 3.720 |
| 713 | $22,507 | 6.127 |
| 714 | $19,592 | 4.468 |
| 714 | $16,900 | 2.110 |
| 718 | $12,818 | 0.782 |
| 718 | $9,849 | 0.259 |
| 719 | $16,931 | 1.233 |
| 719 | $23,545 | 3.288 |
| 720 | $9,198 | 0.235 |
| 721 | $22,190 | 1.406 |
| 721 | $19,646 | 2.206 |
| 724 | $24,750 | 5.650 |
| 726 | $18,140 | 5.078 |
| 728 | $21,250 | 1.433 |
| 731 | $22,231 | 7.452 |
| 731 | $19,788 | 5.738 |
| 735 | $13,269 | 1.364 |
Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.
When we run our regression, we get the following equation:
Ŷ = 0.000297*Income - 2.221
Both the intercept and independent variable’s parameter estimates are significant, with the intercept parameter having a t-ratio of -4.094 and the income estimate having one of 9.182. R² is 0.597, and the F-statistic is a strong 84.31. The model seems to be pretty good: strong t-ratios and F-statistic, a high coefficient of determination, and the sign on the parameter estimate for Income is positive, as we would expect. Generally, the higher the income, the greater the own-to-rent ratio. So far so good.
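As a quick worked example of reading the equation, the short sketch below plugs a few incomes into it. The function name `predict_own_ratio` and the three income levels are our own illustrative choices; the coefficients are simply the rounded values reported above.

```python
def predict_own_ratio(income):
    """Predicted OwnRatio from the fitted equation reported in this post."""
    return 0.000297 * income - 2.221


# Illustrative income levels (in dollars); not specific tracts from the table.
for income in (10000, 15000, 20000):
    print(f"Income ${income:,}: predicted OwnRatio = {predict_own_ratio(income):.3f}")
# Income $10,000: predicted OwnRatio = 0.749
# Income $15,000: predicted OwnRatio = 2.234
# Income $20,000: predicted OwnRatio = 3.719
```

Each additional $1,000 of median family income raises the predicted ratio by about 0.297. Note also that the line crosses zero at an income of roughly $7,478, so for the lowest-income tracts in the table it predicts a negative ratio, which cannot occur in practice.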
The problem comes when we do a visual inspection of our data: first the independent variable against the dependent variable, and then the independent variable against the regression residuals. Let’s start with the scatter plot of Income and OwnRatio:
Without even looking at the residuals, we can see that as median family income increases, the data points begin to spread out. Look at what happens to the distance between data points above and below the line when median family incomes reach $20,000: OwnRatios vary drastically.
Now let’s plot Income against the regression’s residuals:
This scatter plot shows essentially the same phenomenon as the previous graph, but from a different perspective. We can clearly see the error terms fanning out as Income increases. In fact, we can see the residuals diverging at increasing rates once Income starts moving from $10,000 to $15,000, and the spread just compounds as incomes go higher. Roughly half the residuals fall on the positive side and half on the negative side, which is consistent with the regression assumption that the residuals have a mean of zero; hence our parameter estimates are not biased. However, because we violated the constant variance assumption, the standard errors of our parameter estimates are biased, so the t-ratios and confidence intervals built on them are suspect.
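For readers who want to check the fanning-out numerically rather than by eye, here is a rough sketch using only the Python standard library. It refits the line to the Housing Data table by ordinary least squares, computes the residuals, and compares their spread below and above an income cutoff. The $15,000 cutoff and the helper names are our own assumptions, and this is only an informal comparison, not a formal test. If the table has been transcribed faithfully, the fitted coefficients should land close to the ones reported above, though we make no promise that they match to every decimal.

```python
from statistics import pstdev

# (Income, OwnRatio) pairs transcribed from the Housing Data table.
DATA = [
    (24909, 7.220), (11875, 1.094), (19308, 3.587), (20375, 5.279),
    (20132, 3.508), (15351, 0.789), (14821, 1.837), (18816, 5.150),
    (19179, 2.201), (21434, 1.932), (15075, 0.919), (15634, 1.898),
    (12307, 1.584), (10063, 0.901), (5090, 0.128), (8110, 0.059),
    (4399, 0.022), (5411, 0.172), (9541, 0.916), (13095, 1.265),
    (11638, 1.019), (12711, 1.698), (12839, 2.188), (15202, 2.850),
    (15932, 3.049), (14178, 2.307), (12244, 0.873), (10391, 0.410),
    (13934, 1.151), (14201, 1.274), (15784, 1.751), (18917, 5.074),
    (17431, 4.272), (17044, 3.868), (14870, 2.009), (19384, 2.256),
    (18250, 2.471), (14212, 3.019), (15817, 2.154), (21911, 5.190),
    (19282, 4.579), (21795, 3.717), (22904, 3.720), (22507, 6.127),
    (19592, 4.468), (16900, 2.110), (12818, 0.782), (9849, 0.259),
    (16931, 1.233), (23545, 3.288), (9198, 0.235), (22190, 1.406),
    (19646, 2.206), (24750, 5.650), (18140, 5.078), (21250, 1.433),
    (22231, 7.452), (19788, 5.738), (13269, 1.364),
]


def fit_ols(pairs):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def residuals(pairs, slope, intercept):
    """Actual minus predicted value for each observation."""
    return [y - (intercept + slope * x) for x, y in pairs]


def spread_by_income(pairs, resid, cutoff):
    """Standard deviation of residuals below and at-or-above the cutoff."""
    low = [e for (x, _), e in zip(pairs, resid) if x < cutoff]
    high = [e for (x, _), e in zip(pairs, resid) if x >= cutoff]
    return pstdev(low), pstdev(high)


slope, intercept = fit_ols(DATA)
resid = residuals(DATA, slope, intercept)
# The $15,000 cutoff is an arbitrary split chosen for illustration.
sd_low, sd_high = spread_by_income(DATA, resid, cutoff=15000)

print(f"slope = {slope:.6f}, intercept = {intercept:.3f}")
print(f"mean residual = {sum(resid) / len(resid):.6f}")
print(f"residual SD below $15,000:       {sd_low:.3f}")
print(f"residual SD at or above $15,000: {sd_high:.3f}")
```

A mean residual of essentially zero alongside a noticeably larger standard deviation in the higher-income group is the numerical counterpart of the fan shape in the residual plot.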
Visual Inspection Only Gets You So Far
By visually inspecting our residuals, we can clearly see that our error terms are not homoscedastic. When you have a regression model, especially for cross-sectional data sets like this, you should visually inspect every independent variable against the dependent variable and against the error terms in order to get an a priori indication of heteroscedasticity. However, visual inspection alone is not a guarantee that heteroscedasticity exists. There are three particularly simple methods for detecting heteroscedasticity, which we will discuss in next week’s Forecast Friday post: the Park Test, the Goldfeld-Quandt Test, and the Breusch-Pagan Test.