## Posts Tagged ‘econometrics’

### Forecast Friday Topic: The Identification Problem

November 11, 2010

(Twenty-ninth in a series)

Regression analysis assumes that outside factors determine each of the independent variables in the model; such variables are said to be exogenous to the system. This assumption is of special interest to economists, who have long used econometric models to forecast demand and supply for various goods. The price the market will bear for a good or service, for example, is not determined by a single equation, but by the interaction of the equations for both supply and demand. If price were what we were trying to forecast, a single equation would do us little good. In fact, since price is part of a multi-equation system, performing regression analysis for demand without supply, or vice versa, will result in biased parameter estimates.

This post begins our three-part “series within a series” on “Simultaneous Equations and Two-Stage Least Squares Regression”. Although this topic sounds intimidating, I will not be covering it in much technical detail. My purpose in discussing it is to make you aware of these concepts, so that you can determine when to look beyond a simple regression analysis.

Hence, we start with the most basic concept of simultaneous equations: the Identification problem. Let’s assume that you are the supply chain manager for a beer company. You need to forecast the price of barley, so your company can budget how much money it needs to spend in order to have enough barley to produce its beer; determine whether the price is on an upward trend, so that it could purchase derivatives to hedge its risk; and determine the final price for its beer.

You have statistics for the price and traded quantity of barley for the last several years. You also remember three concepts from your college economics class:

1. The price and quantity supplied of a good have a direct relationship – producers supply more as the price goes up and less as the price goes down;
2. The price and quantity demanded of a good have an inverse relationship – consumers purchase less as the price goes up and vice-versa; and
3. The market price is determined by the interaction of the supply and demand equations.

Since price and quantity are positively sloped for supply and negatively sloped for demand, with only the two variables of quantity and price, you cannot determine – that is identify – the supply and demand equations using regression analysis; the information is insufficient. However, if you can identify variables that are in one equation and not the other, you will be able to identify the individual relations.

In agriculture, the supply of a crop is greatly affected by weather. If you can obtain information on the amount of rainfall in barley producing regions during the years for which you have data, you might be able to identify the different equations. Moreover, production costs impact supply. So if you can obtain information on the costs of planting and harvesting the barley, that too would help. On the demand side, barley’s quantity can be influenced by changes in tastes. If beer demand goes up, so too will the demand for barley; if farm animal raising increases, farmers may need to purchase more barley for animal fodder; and various health fads may emerge, increasing the demands for barley breads and soups. If you can obtain these kinds of information, you are on your way to identifying the supply and demand curves.
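To see why a variable that appears in only one equation helps, here is a minimal simulation sketch. Everything in it is hypothetical: the demand and supply coefficients, the noise levels, and the single `rainfall` shifter are made up for illustration and are not estimates for the barley market. It regresses quantity on price alone, then uses rainfall (which moves only the supply curve) to trace out the demand curve.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


# Hypothetical structural equations (all coefficients are assumptions):
#   demand: q = 10 - 1.0 * p + u
#   supply: q =  2 + 1.0 * p + 1.0 * rain + v
n = 20000
price, quantity, rainfall = [], [], []
for _ in range(n):
    rain = random.gauss(0, 1)  # exogenous supply shifter
    u = random.gauss(0, 1)     # demand shock
    v = random.gauss(0, 1)     # supply shock
    # Market clearing: 10 - p + u = 2 + p + rain + v, solved for p
    p = (10 - 2 - rain + u - v) / 2
    q = 10 - p + u
    price.append(p)
    quantity.append(q)
    rainfall.append(rain)

# Regressing quantity on price alone blends the two curves together.
ols_slope = cov(quantity, price) / cov(price, price)

# Rainfall shifts supply only, so the price changes it causes
# move the market along the demand curve.
iv_slope = cov(quantity, rainfall) / cov(price, rainfall)

print(f"Slope from quantity on price alone: {ols_slope:.3f}")
print(f"Slope using rainfall as a shifter:  {iv_slope:.3f}")
print("True demand slope: -1.0, true supply slope: +1.0")
```

With these made-up numbers, the price-only regression lands between the two true slopes and matches neither curve, while the rainfall-based estimate should come out close to the demand slope. The same logic run with a demand-only shifter (beer demand, say) would trace out the supply curve instead.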

Exogenous and Endogenous Variables

Since rainfall affects the supply of barley, but the barley market does not influence the amount of rainfall, rainfall is said to be an exogenous variable, because its value is determined by factors outside of the equation system. Since the demand for beer helps derive the demand for barley, but not the other way around, beer demand is an exogenous variable.

Because price and quantity of barley are part of a demand and supply system, they are determined by the interaction of the two equations – that is by the equation system – so they are said to be endogenous variables.

Identifying an Equation

If you are trying to identify an equation that is part of a multi-equation system, the number of variables excluded from that equation (but included elsewhere in the system) must be at least one less than the number of equations. Hence, if you have a two-equation system, at least one variable that appears in the other equation must be excluded from the equation you’re trying to identify; if your system has three equations, you need at least two such excluded variables, and so on.

When the number of exogenous variables excluded from one equation, but present in the other equation(s), is exactly the minimum required, your equation is just identified. You can use several econometric techniques to estimate just identified systems; however, such systems are quite rare in practice. When you have no exogenous variables that are unique to one equation in the system, your equations are underidentified and cannot be estimated with any econometric technique. Most often, equations are overidentified, because there are more exogenous variables excluded from one equation than required by the number of equations in the system. When overidentification is the case, two-stage least squares (the topic of the third post of this miniseries) is required in order to tell which of the variables is causing your supply (or demand) curve to shift along the fixed demand (or supply) curve.
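The counting rule above can be written as a small helper. This is only a sketch of the counting rule as described in this post (often called the order condition, which is necessary but not sufficient for identification); the function name and its arguments are my own, not from any library.

```python
def order_condition(excluded_exogenous: int, n_equations: int) -> str:
    """Classify one equation using the counting rule described above.

    excluded_exogenous: exogenous variables that appear elsewhere in the
        system but are left out of the equation being identified.
    n_equations: number of equations in the system.
    """
    required = n_equations - 1
    if excluded_exogenous < required:
        return "underidentified"
    if excluded_exogenous == required:
        return "just identified"
    return "overidentified"


# Two-equation barley system, looking at the demand equation.
print(order_condition(0, 2))  # price and quantity only -> underidentified
print(order_condition(1, 2))  # rainfall excluded from demand -> just identified
print(order_condition(2, 2))  # rainfall and production costs -> overidentified
```

In the barley example, adding both rainfall and production costs to the supply equation leaves the demand equation with two excluded variables where only one is required, which is the overidentified case.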

Next Forecast Friday Topic: Structural and Reduced Forms

Next week’s Forecast Friday topic builds on today’s topic with a discussion of structural and reduced forms of equations. These are the first steps in Two-Stage Least Squares Regression analysis, and are part of the effort to solve the identification problem.

*************************

Thanks to all of you, Analysights now has nearly 200 fans on Facebook … and we’d love more!  If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you and they will be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

### Forecast Friday Topic: Detecting Heteroscedasticity – Analytical Approaches

August 19, 2010

(Eighteenth in a series)

Last week, we discussed the violation of the homoscedasticity assumption of regression analysis: the assumption that the error terms have a constant variance. When the error terms do not exhibit a constant variance, they are said to be heteroscedastic. A model that exhibits heteroscedasticity produces parameter estimates that are not biased, but rather inefficient. Heteroscedasticity most often appears in cross-sectional data and is frequently caused by a wide range of possible values for one or more independent variables.

Last week, we showed you how to detect heteroscedasticity by visually inspecting the plot of the error terms against the independent variable. Today, we are going to discuss three simple, but very powerful, analytical approaches to detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. These approaches are quite simple, but can be a bit tedious to employ.

Reviewing Our Model

Recall our model from last week. We were trying to determine the relationship between a census tract’s median family income (INCOME) and the ratio of the number of families who own their homes to the number of families who rent (OWNRATIO). Our hypothesis was that census tracts with higher median family incomes had a higher proportion of families who owned their homes. I snatched an example from my college econometrics textbook, which pulled INCOME and OWNRATIOs from 59 census tracts in Pierce County, Washington, which were compiled during the 1980 Census. We had the following data:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 601 | \$24,909 | 7.220 |
| 602 | \$11,875 | 1.094 |
| 603 | \$19,308 | 3.587 |
| 604 | \$20,375 | 5.279 |
| 605 | \$20,132 | 3.508 |
| 606 | \$15,351 | 0.789 |
| 607 | \$14,821 | 1.837 |
| 608 | \$18,816 | 5.150 |
| 609 | \$19,179 | 2.201 |
| 609 | \$21,434 | 1.932 |
| 610 | \$15,075 | 0.919 |
| 611 | \$15,634 | 1.898 |
| 612 | \$12,307 | 1.584 |
| 613 | \$10,063 | 0.901 |
| 614 | \$5,090 | 0.128 |
| 615 | \$8,110 | 0.059 |
| 616 | \$4,399 | 0.022 |
| 616 | \$5,411 | 0.172 |
| 617 | \$9,541 | 0.916 |
| 618 | \$13,095 | 1.265 |
| 619 | \$11,638 | 1.019 |
| 620 | \$12,711 | 1.698 |
| 621 | \$12,839 | 2.188 |
| 623 | \$15,202 | 2.850 |
| 624 | \$15,932 | 3.049 |
| 625 | \$14,178 | 2.307 |
| 626 | \$12,244 | 0.873 |
| 627 | \$10,391 | 0.410 |
| 628 | \$13,934 | 1.151 |
| 629 | \$14,201 | 1.274 |
| 630 | \$15,784 | 1.751 |
| 631 | \$18,917 | 5.074 |
| 632 | \$17,431 | 4.272 |
| 633 | \$17,044 | 3.868 |
| 634 | \$14,870 | 2.009 |
| 635 | \$19,384 | 2.256 |
| 701 | \$18,250 | 2.471 |
| 705 | \$14,212 | 3.019 |
| 706 | \$15,817 | 2.154 |
| 710 | \$21,911 | 5.190 |
| 711 | \$19,282 | 4.579 |
| 712 | \$21,795 | 3.717 |
| 713 | \$22,904 | 3.720 |
| 713 | \$22,507 | 6.127 |
| 714 | \$19,592 | 4.468 |
| 714 | \$16,900 | 2.110 |
| 718 | \$12,818 | 0.782 |
| 718 | \$9,849 | 0.259 |
| 719 | \$16,931 | 1.233 |
| 719 | \$23,545 | 3.288 |
| 720 | \$9,198 | 0.235 |
| 721 | \$22,190 | 1.406 |
| 721 | \$19,646 | 2.206 |
| 724 | \$24,750 | 5.650 |
| 726 | \$18,140 | 5.078 |
| 728 | \$21,250 | 1.433 |
| 731 | \$22,231 | 7.452 |
| 731 | \$19,788 | 5.738 |
| 735 | \$13,269 | 1.364 |

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

And we got the following regression equation:

Ŷ= 0.000297*Income – 2.221

With an R² of 0.597, an F-ratio of 84.31, t-ratios for INCOME (9.182) and the intercept (-4.094) that are both solidly significant, and a positive sign on the parameter estimate for INCOME, our model appeared to do very well. However, visual inspection of the regression residuals suggested the presence of heteroscedasticity. Unfortunately, visual inspection can only suggest; we need more objective ways of determining the presence of heteroscedasticity. Hence our three tests below.
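If you would like to check the regression yourself, the sketch below fits the same one-variable model to the 59 tracts in the table above using only Python's standard library. The variable names are my own, and the formulas are the ordinary least squares formulas for a single independent variable; treat it as one way to reproduce the figures, not as the tool originally used for this post.

```python
# Pierce County housing data from the table above (59 census tracts).
INCOME = [
    24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
    15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
    11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
    15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
    19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269,
]
OWNRATIO = [
    7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
    0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
    1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
    1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
    4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
    0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364,
]

n = len(INCOME)
mean_x = sum(INCOME) / n
mean_y = sum(OWNRATIO) / n
sxx = sum((x - mean_x) ** 2 for x in INCOME)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(INCOME, OWNRATIO))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in zip(INCOME, OWNRATIO)]
ess = sum(e ** 2 for e in residuals)            # error sum of squares
tss = sum((y - mean_y) ** 2 for y in OWNRATIO)  # total sum of squares
r_squared = 1 - ess / tss
f_ratio = (tss - ess) / (ess / (n - 2))         # one regressor, n - 2 error df

print(f"slope     = {slope:.6f}")
print(f"intercept = {intercept:.3f}")
print(f"R-squared = {r_squared:.3f}")
print(f"F-ratio   = {f_ratio:.2f}")
```

The `residuals` list is the starting point for all three tests that follow.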

The Goldfeld-Quandt Test

The Goldfeld-Quandt test is a computationally simple, and perhaps the most commonly used, method for detecting heteroscedasticity. Since a model with heteroscedastic error terms does not have a constant variance, the Goldfeld-Quandt test postulates that the variances associated with high values of the independent variable, X, are significantly different from those associated with low values. Essentially, you would run separate regression analyses for the low values of X and the high values, and then compare their error sums of squares with an F-test.

The Goldfeld-Quandt test has four steps:

Step #1: Sort the data

Take the independent variable you suspect to be the source of the heteroscedasticity and sort your data set by the X value in low-to-high order:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 616 | \$4,399 | 0.022 |
| 614 | \$5,090 | 0.128 |
| 616 | \$5,411 | 0.172 |
| 615 | \$8,110 | 0.059 |
| 720 | \$9,198 | 0.235 |
| 617 | \$9,541 | 0.916 |
| 718 | \$9,849 | 0.259 |
| 613 | \$10,063 | 0.901 |
| 627 | \$10,391 | 0.410 |
| 619 | \$11,638 | 1.019 |
| 602 | \$11,875 | 1.094 |
| 626 | \$12,244 | 0.873 |
| 612 | \$12,307 | 1.584 |
| 620 | \$12,711 | 1.698 |
| 718 | \$12,818 | 0.782 |
| 621 | \$12,839 | 2.188 |
| 618 | \$13,095 | 1.265 |
| 735 | \$13,269 | 1.364 |
| 628 | \$13,934 | 1.151 |
| 625 | \$14,178 | 2.307 |
| 629 | \$14,201 | 1.274 |
| 705 | \$14,212 | 3.019 |
| 607 | \$14,821 | 1.837 |
| 634 | \$14,870 | 2.009 |
| 610 | \$15,075 | 0.919 |
| 623 | \$15,202 | 2.850 |
| 606 | \$15,351 | 0.789 |
| 611 | \$15,634 | 1.898 |
| 630 | \$15,784 | 1.751 |
| 706 | \$15,817 | 2.154 |
| 624 | \$15,932 | 3.049 |
| 714 | \$16,900 | 2.110 |
| 719 | \$16,931 | 1.233 |
| 633 | \$17,044 | 3.868 |
| 632 | \$17,431 | 4.272 |
| 726 | \$18,140 | 5.078 |
| 701 | \$18,250 | 2.471 |
| 608 | \$18,816 | 5.150 |
| 631 | \$18,917 | 5.074 |
| 609 | \$19,179 | 2.201 |
| 711 | \$19,282 | 4.579 |
| 603 | \$19,308 | 3.587 |
| 635 | \$19,384 | 2.256 |
| 714 | \$19,592 | 4.468 |
| 721 | \$19,646 | 2.206 |
| 731 | \$19,788 | 5.738 |
| 605 | \$20,132 | 3.508 |
| 604 | \$20,375 | 5.279 |
| 728 | \$21,250 | 1.433 |
| 609 | \$21,434 | 1.932 |
| 712 | \$21,795 | 3.717 |
| 710 | \$21,911 | 5.190 |
| 721 | \$22,190 | 1.406 |
| 731 | \$22,231 | 7.452 |
| 713 | \$22,507 | 6.127 |
| 713 | \$22,904 | 3.720 |
| 719 | \$23,545 | 3.288 |
| 724 | \$24,750 | 5.650 |
| 601 | \$24,909 | 7.220 |

Step #2: Omit the middle observations

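Next, take out the observations in the middle. This usually amounts to between one-fifth and one-third of your observations. There’s no hard and fast rule about how many observations to omit, and if your data set is small, you may not be able to omit any. In our example, we can omit the 13 middle observations (marked “Omitted” in the Group column below), which leaves the 23 lowest and the 23 highest incomes:

Housing Data

| Tract | Income | Ownratio | Group |
|---|---|---|---|
| 616 | \$4,399 | 0.022 | Low |
| 614 | \$5,090 | 0.128 | Low |
| 616 | \$5,411 | 0.172 | Low |
| 615 | \$8,110 | 0.059 | Low |
| 720 | \$9,198 | 0.235 | Low |
| 617 | \$9,541 | 0.916 | Low |
| 718 | \$9,849 | 0.259 | Low |
| 613 | \$10,063 | 0.901 | Low |
| 627 | \$10,391 | 0.410 | Low |
| 619 | \$11,638 | 1.019 | Low |
| 602 | \$11,875 | 1.094 | Low |
| 626 | \$12,244 | 0.873 | Low |
| 612 | \$12,307 | 1.584 | Low |
| 620 | \$12,711 | 1.698 | Low |
| 718 | \$12,818 | 0.782 | Low |
| 621 | \$12,839 | 2.188 | Low |
| 618 | \$13,095 | 1.265 | Low |
| 735 | \$13,269 | 1.364 | Low |
| 628 | \$13,934 | 1.151 | Low |
| 625 | \$14,178 | 2.307 | Low |
| 629 | \$14,201 | 1.274 | Low |
| 705 | \$14,212 | 3.019 | Low |
| 607 | \$14,821 | 1.837 | Low |
| 634 | \$14,870 | 2.009 | Omitted |
| 610 | \$15,075 | 0.919 | Omitted |
| 623 | \$15,202 | 2.850 | Omitted |
| 606 | \$15,351 | 0.789 | Omitted |
| 611 | \$15,634 | 1.898 | Omitted |
| 630 | \$15,784 | 1.751 | Omitted |
| 706 | \$15,817 | 2.154 | Omitted |
| 624 | \$15,932 | 3.049 | Omitted |
| 714 | \$16,900 | 2.110 | Omitted |
| 719 | \$16,931 | 1.233 | Omitted |
| 633 | \$17,044 | 3.868 | Omitted |
| 632 | \$17,431 | 4.272 | Omitted |
| 726 | \$18,140 | 5.078 | Omitted |
| 701 | \$18,250 | 2.471 | High |
| 608 | \$18,816 | 5.150 | High |
| 631 | \$18,917 | 5.074 | High |
| 609 | \$19,179 | 2.201 | High |
| 711 | \$19,282 | 4.579 | High |
| 603 | \$19,308 | 3.587 | High |
| 635 | \$19,384 | 2.256 | High |
| 714 | \$19,592 | 4.468 | High |
| 721 | \$19,646 | 2.206 | High |
| 731 | \$19,788 | 5.738 | High |
| 605 | \$20,132 | 3.508 | High |
| 604 | \$20,375 | 5.279 | High |
| 728 | \$21,250 | 1.433 | High |
| 609 | \$21,434 | 1.932 | High |
| 712 | \$21,795 | 3.717 | High |
| 710 | \$21,911 | 5.190 | High |
| 721 | \$22,190 | 1.406 | High |
| 731 | \$22,231 | 7.452 | High |
| 713 | \$22,507 | 6.127 | High |
| 713 | \$22,904 | 3.720 | High |
| 719 | \$23,545 | 3.288 | High |
| 724 | \$24,750 | 5.650 | High |
| 601 | \$24,909 | 7.220 | High |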

Step #3: Run two separate regressions, one for the low values, one for the high

We ran separate regressions for the 23 observations with the lowest values for INCOME and the 23 observations with the highest values. In these regressions, we weren’t concerned with whether the t-ratios of the parameter estimates were significant. Rather, we wanted to look at their Error Sum of Squares (ESS). Each model has 21 degrees of freedom.

Step #4: Divide the ESS of the higher value regression by the ESS of the lower value regression, and compare quotient to the F-table.

The higher value regression produced an ESS of 61.489 and the lower value regression produced an ESS of 5.189. Dividing the former by the latter, we get a quotient of 11.851. Now, we need to go to the F-table and check the critical F-value at the 5% significance level (95% confidence) with 21 degrees of freedom in both the numerator and the denominator, which is a value of 2.10. Since our quotient of 11.851 is greater than the critical F-value, we can conclude there is strong evidence of heteroscedasticity in the model.
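The four steps can be sketched in a few lines of standard-library Python. The helper name `ess_of_fit` and the constant `F_CRITICAL` are my own; the critical value is simply the 2.10 quoted from the F-table above rather than something the code computes.

```python
INCOME = [
    24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
    15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
    11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
    15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
    19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269,
]
OWNRATIO = [
    7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
    0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
    1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
    1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
    4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
    0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364,
]


def ess_of_fit(x, y):
    """Error sum of squares from a least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Step 1: sort by the suspect variable, low to high.
pairs = sorted(zip(INCOME, OWNRATIO))

# Step 2: keep the 23 lowest and 23 highest, dropping the 13 in the middle.
k = 23
low, high = pairs[:k], pairs[-k:]

# Step 3: separate regressions for each group.
ess_low = ess_of_fit([p[0] for p in low], [p[1] for p in low])
ess_high = ess_of_fit([p[0] for p in high], [p[1] for p in high])

# Step 4: ratio of the two error sums of squares against the F-table value.
gq_ratio = ess_high / ess_low
F_CRITICAL = 2.10  # 5% level, 21 and 21 degrees of freedom, as quoted above

print(f"ESS (high incomes) = {ess_high:.3f}")
print(f"ESS (low incomes)  = {ess_low:.3f}")
print(f"Goldfeld-Quandt ratio = {gq_ratio:.3f}")
print("Heteroscedasticity indicated" if gq_ratio > F_CRITICAL else "No evidence")
```

Changing `k` lets you see how sensitive the ratio is to the number of middle observations you drop.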

The Breusch-Pagan Test

The Breusch-Pagan test is also pretty simple, but it’s a very powerful test, in that it can be used to detect whether more than one independent variable is causing the heteroscedasticity. Since it can involve multiple variables, the Breusch-Pagan test relies on critical values of chi-squared (χ²) to determine the presence of heteroscedasticity, and works best with large sample sets. There are five steps to the Breusch-Pagan test:

Step #1: Run the regular regression model and collect the residuals

Step #2: Estimate the variance of the regression residuals

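To do this, we square each residual, sum the squares, and then divide the total by the number of observations. Our formula is:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n}$$

where $e_i$ is the residual for observation $i$ and $n$ is the number of observations.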

Our residuals and their squares are as follows:

| Observation | Predicted Ownratio | Residuals | Residuals Squared |
|---|---|---|---|
| 1 | 5.165 | 2.055 | 4.222 |
| 2 | 1.300 | (0.206) | 0.043 |
| 3 | 3.504 | 0.083 | 0.007 |
| 4 | 3.821 | 1.458 | 2.126 |
| 5 | 3.749 | (0.241) | 0.058 |
| 6 | 2.331 | (1.542) | 2.378 |
| 7 | 2.174 | (0.337) | 0.113 |
| 8 | 3.358 | 1.792 | 3.209 |
| 9 | 3.466 | (1.265) | 1.601 |
| 10 | 4.135 | (2.203) | 4.852 |
| 11 | 2.249 | (1.330) | 1.769 |
| 12 | 2.415 | (0.517) | 0.267 |
| 13 | 1.428 | 0.156 | 0.024 |
| 14 | 0.763 | 0.138 | 0.019 |
| 15 | (0.712) | 0.840 | 0.705 |
| 16 | 0.184 | (0.125) | 0.016 |
| 17 | (0.917) | 0.939 | 0.881 |
| 18 | (0.617) | 0.789 | 0.622 |
| 19 | 0.608 | 0.308 | 0.095 |
| 20 | 1.662 | (0.397) | 0.158 |
| 21 | 1.230 | (0.211) | 0.045 |
| 22 | 1.548 | 0.150 | 0.022 |
| 23 | 1.586 | 0.602 | 0.362 |
| 24 | 2.287 | 0.563 | 0.317 |
| 25 | 2.503 | 0.546 | 0.298 |
| 26 | 1.983 | 0.324 | 0.105 |
| 27 | 1.410 | (0.537) | 0.288 |
| 28 | 0.860 | (0.450) | 0.203 |
| 29 | 1.911 | (0.760) | 0.577 |
| 30 | 1.990 | (0.716) | 0.513 |
| 31 | 2.459 | (0.708) | 0.502 |
| 32 | 3.388 | 1.686 | 2.841 |
| 33 | 2.948 | 1.324 | 1.754 |
| 34 | 2.833 | 1.035 | 1.071 |
| 35 | 2.188 | (0.179) | 0.032 |
| 36 | 3.527 | (1.271) | 1.615 |
| 37 | 3.191 | (0.720) | 0.518 |
| 38 | 1.993 | 1.026 | 1.052 |
| 39 | 2.469 | (0.315) | 0.099 |
| 40 | 4.276 | 0.914 | 0.835 |
| 41 | 3.497 | 1.082 | 1.171 |
| 42 | 4.242 | (0.525) | 0.275 |
| 43 | 4.571 | (0.851) | 0.724 |
| 44 | 4.453 | 1.674 | 2.802 |
| 45 | 3.589 | 0.879 | 0.773 |
| 46 | 2.790 | (0.680) | 0.463 |
| 47 | 1.580 | (0.798) | 0.637 |
| 48 | 0.699 | (0.440) | 0.194 |
| 49 | 2.800 | (1.567) | 2.454 |
| 50 | 4.761 | (1.473) | 2.169 |
| 51 | 0.506 | (0.271) | 0.074 |
| 52 | 4.359 | (2.953) | 8.720 |
| 53 | 3.605 | (1.399) | 1.956 |
| 54 | 5.118 | 0.532 | 0.283 |
| 55 | 3.158 | 1.920 | 3.686 |
| 56 | 4.080 | (2.647) | 7.008 |
| 57 | 4.371 | 3.081 | 9.492 |
| 58 | 3.647 | 2.091 | 4.373 |
| 59 | 1.714 | (0.350) | 0.122 |

Summing the last column, we get 83.591. We divide this by 59, and get 1.417.

Step #3: Compute the square of the standardized residuals

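Now that we know the variance of the regression residuals – 1.417 – we compute the standardized residuals by dividing each residual by 1.417 and then squaring the results, so that we get our square of standardized residuals, $s_i^2$. (Textbook presentations of the test usually divide each squared residual by the variance instead, which is the same as dividing each residual by the standard deviation, √1.417 ≈ 1.190, before squaring. Dividing by the variance itself, as we do here, shrinks every value in the last column by a constant factor of 1.417 and the Step #5 statistic by 1.417² ≈ 2.01, so the textbook version would only strengthen the conclusion reached in Step #5.)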

| Obs. | Predicted Ownratio | Residuals | Standardized Residuals | Square of Standardized Residuals |
|---|---|---|---|---|
| 1 | 5.165 | 2.055 | 1.450 | 2.103 |
| 2 | 1.300 | (0.206) | (0.146) | 0.021 |
| 3 | 3.504 | 0.083 | 0.058 | 0.003 |
| 4 | 3.821 | 1.458 | 1.029 | 1.059 |
| 5 | 3.749 | (0.241) | (0.170) | 0.029 |
| 6 | 2.331 | (1.542) | (1.088) | 1.185 |
| 7 | 2.174 | (0.337) | (0.238) | 0.057 |
| 8 | 3.358 | 1.792 | 1.264 | 1.599 |
| 9 | 3.466 | (1.265) | (0.893) | 0.797 |
| 10 | 4.135 | (2.203) | (1.555) | 2.417 |
| 11 | 2.249 | (1.330) | (0.939) | 0.881 |
| 12 | 2.415 | (0.517) | (0.365) | 0.133 |
| 13 | 1.428 | 0.156 | 0.110 | 0.012 |
| 14 | 0.763 | 0.138 | 0.097 | 0.009 |
| 15 | (0.712) | 0.840 | 0.593 | 0.351 |
| 16 | 0.184 | (0.125) | (0.088) | 0.008 |
| 17 | (0.917) | 0.939 | 0.662 | 0.439 |
| 18 | (0.617) | 0.789 | 0.557 | 0.310 |
| 19 | 0.608 | 0.308 | 0.217 | 0.047 |
| 20 | 1.662 | (0.397) | (0.280) | 0.079 |
| 21 | 1.230 | (0.211) | (0.149) | 0.022 |
| 22 | 1.548 | 0.150 | 0.106 | 0.011 |
| 23 | 1.586 | 0.602 | 0.425 | 0.180 |
| 24 | 2.287 | 0.563 | 0.397 | 0.158 |
| 25 | 2.503 | 0.546 | 0.385 | 0.148 |
| 26 | 1.983 | 0.324 | 0.229 | 0.052 |
| 27 | 1.410 | (0.537) | (0.379) | 0.143 |
| 28 | 0.860 | (0.450) | (0.318) | 0.101 |
| 29 | 1.911 | (0.760) | (0.536) | 0.288 |
| 30 | 1.990 | (0.716) | (0.505) | 0.255 |
| 31 | 2.459 | (0.708) | (0.500) | 0.250 |
| 32 | 3.388 | 1.686 | 1.190 | 1.415 |
| 33 | 2.948 | 1.324 | 0.935 | 0.874 |
| 34 | 2.833 | 1.035 | 0.730 | 0.534 |
| 35 | 2.188 | (0.179) | (0.127) | 0.016 |
| 36 | 3.527 | (1.271) | (0.897) | 0.805 |
| 37 | 3.191 | (0.720) | (0.508) | 0.258 |
| 38 | 1.993 | 1.026 | 0.724 | 0.524 |
| 39 | 2.469 | (0.315) | (0.222) | 0.049 |
| 40 | 4.276 | 0.914 | 0.645 | 0.416 |
| 41 | 3.497 | 1.082 | 0.764 | 0.584 |
| 42 | 4.242 | (0.525) | (0.370) | 0.137 |
| 43 | 4.571 | (0.851) | (0.600) | 0.361 |
| 44 | 4.453 | 1.674 | 1.182 | 1.396 |
| 45 | 3.589 | 0.879 | 0.621 | 0.385 |
| 46 | 2.790 | (0.680) | (0.480) | 0.231 |
| 47 | 1.580 | (0.798) | (0.563) | 0.317 |
| 48 | 0.699 | (0.440) | (0.311) | 0.097 |
| 49 | 2.800 | (1.567) | (1.106) | 1.223 |
| 50 | 4.761 | (1.473) | (1.040) | 1.081 |
| 51 | 0.506 | (0.271) | (0.192) | 0.037 |
| 52 | 4.359 | (2.953) | (2.084) | 4.344 |
| 53 | 3.605 | (1.399) | (0.987) | 0.974 |
| 54 | 5.118 | 0.532 | 0.375 | 0.141 |
| 55 | 3.158 | 1.920 | 1.355 | 1.836 |
| 56 | 4.080 | (2.647) | (1.868) | 3.491 |
| 57 | 4.371 | 3.081 | 2.175 | 4.728 |
| 58 | 3.647 | 2.091 | 1.476 | 2.179 |
| 59 | 1.714 | (0.350) | (0.247) | 0.061 |

Step #4: Run another regression with all your independent variables using the square of the standardized residuals as the dependent variable

In this case, we had only one independent variable, INCOME. We will now run a regression substituting the last column of the table above for OWNRATIO, and making it the dependent variable. Again, we’re not interested in the parameter estimates. We are, however, interested in the regression sum of squares (RSS), which is 15.493.

Step #5: Divide the RSS by 2 and compare with the χ² table’s critical value for the appropriate degrees of freedom

Dividing the RSS by 2, we get 7.747. We look up the critical χ² value for one degree of freedom in the table and, for a 5% significance level, we get 3.84. Since our χ² value exceeds the critical value, we can conclude there is strong evidence of heteroscedasticity present.
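Here is a minimal standard-library sketch of the five steps. It computes the statistic two ways: dividing each residual by the variance before squaring, as in Step #3 above, and the form more commonly found in textbooks, which divides each squared residual by the variance. The helper names are my own, and the 3.84 critical value is the table value quoted above, not something the code derives.

```python
INCOME = [
    24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
    15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
    11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
    15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
    19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269,
]
OWNRATIO = [
    7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
    0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
    1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
    1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
    4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
    0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364,
]


def fit(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def regression_ss(x, y):
    """Regression (explained) sum of squares for y regressed on x."""
    slope, intercept = fit(x, y)
    my = sum(y) / len(y)
    return sum((intercept + slope * xi - my) ** 2 for xi in x)


# Steps 1 and 2: residuals from the original model and their variance.
slope, intercept = fit(INCOME, OWNRATIO)
residuals = [y - (intercept + slope * x) for x, y in zip(INCOME, OWNRATIO)]
variance = sum(e ** 2 for e in residuals) / len(residuals)

# Step 3, computed two ways.
g_post = [(e / variance) ** 2 for e in residuals]    # as described in Step #3
g_textbook = [e ** 2 / variance for e in residuals]  # common textbook form

# Steps 4 and 5: regress on INCOME, halve the regression sum of squares.
bp_post = regression_ss(INCOME, g_post) / 2
bp_textbook = regression_ss(INCOME, g_textbook) / 2
CHI2_CRITICAL = 3.84  # 5% level, 1 degree of freedom, as quoted above

print(f"Residual variance        = {variance:.3f}")
print(f"Statistic (post's steps) = {bp_post:.3f}")
print(f"Statistic (textbook)     = {bp_textbook:.3f}")
print(f"Critical chi-squared     = {CHI2_CRITICAL}")
```

The two statistics differ only by a constant factor (the variance squared), so for these data they point to the same conclusion.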

The Park Test

Last, but certainly not least, comes the Park test. I saved this one for last because it is the simplest of the three methods and, unlike the other two, provides information that can help eliminate the heteroscedasticity. The Park test assumes there is a relationship between the error variance and one of the regression model’s independent variables. The steps involved are as follows:

Step #1: Run your original regression model and collect the residuals

Done.

Step #2: Square the regression residuals and compute the logs of the squared residuals and the values of the suspected independent variable.

We’ll square the regression residuals, and take their natural log. We will also take the natural log of INCOME:

| Tract | Residual Squared | LnResidual Squared | LnIncome |
|---|---|---|---|
| 601 | 4.222 | 1.440 | 10.123 |
| 602 | 0.043 | (3.157) | 9.382 |
| 603 | 0.007 | (4.987) | 9.868 |
| 604 | 2.126 | 0.754 | 9.922 |
| 605 | 0.058 | (2.848) | 9.910 |
| 606 | 2.378 | 0.866 | 9.639 |
| 607 | 0.113 | (2.176) | 9.604 |
| 608 | 3.209 | 1.166 | 9.842 |
| 609 | 1.601 | 0.470 | 9.862 |
| 609 | 4.852 | 1.579 | 9.973 |
| 610 | 1.769 | 0.571 | 9.621 |
| 611 | 0.267 | (1.320) | 9.657 |
| 612 | 0.024 | (3.720) | 9.418 |
| 613 | 0.019 | (3.960) | 9.217 |
| 614 | 0.705 | (0.349) | 8.535 |
| 615 | 0.016 | (4.162) | 9.001 |
| 616 | 0.881 | (0.127) | 8.389 |
| 616 | 0.622 | (0.475) | 8.596 |
| 617 | 0.095 | (2.356) | 9.163 |
| 618 | 0.158 | (1.847) | 9.480 |
| 619 | 0.045 | (3.112) | 9.362 |
| 620 | 0.022 | (3.796) | 9.450 |
| 621 | 0.362 | (1.015) | 9.460 |
| 623 | 0.317 | (1.148) | 9.629 |
| 624 | 0.298 | (1.211) | 9.676 |
| 625 | 0.105 | (2.255) | 9.559 |
| 626 | 0.288 | (1.245) | 9.413 |
| 627 | 0.203 | (1.596) | 9.249 |
| 628 | 0.577 | (0.549) | 9.542 |
| 629 | 0.513 | (0.668) | 9.561 |
| 630 | 0.502 | (0.689) | 9.667 |
| 631 | 2.841 | 1.044 | 9.848 |
| 632 | 1.754 | 0.562 | 9.766 |
| 633 | 1.071 | 0.069 | 9.744 |
| 634 | 0.032 | (3.437) | 9.607 |
| 635 | 1.615 | 0.479 | 9.872 |
| 701 | 0.518 | (0.658) | 9.812 |
| 705 | 1.052 | 0.051 | 9.562 |
| 706 | 0.099 | (2.309) | 9.669 |
| 710 | 0.835 | (0.180) | 9.995 |
| 711 | 1.171 | 0.158 | 9.867 |
| 712 | 0.275 | (1.289) | 9.989 |
| 713 | 0.724 | (0.323) | 10.039 |
| 713 | 2.802 | 1.030 | 10.022 |
| 714 | 0.773 | (0.257) | 9.883 |
| 714 | 0.463 | (0.770) | 9.735 |
| 718 | 0.637 | (0.452) | 9.459 |
| 718 | 0.194 | (1.640) | 9.195 |
| 719 | 2.454 | 0.898 | 9.737 |
| 719 | 2.169 | 0.774 | 10.067 |
| 720 | 0.074 | (2.608) | 9.127 |
| 721 | 8.720 | 2.166 | 10.007 |
| 721 | 1.956 | 0.671 | 9.886 |
| 724 | 0.283 | (1.263) | 10.117 |
| 726 | 3.686 | 1.305 | 9.806 |
| 728 | 7.008 | 1.947 | 9.964 |
| 731 | 9.492 | 2.250 | 10.009 |
| 731 | 4.373 | 1.476 | 9.893 |
| 735 | 0.122 | (2.102) | 9.493 |

Step #3: Run the regression equation using the log of the squared residuals as the dependent variable and the log of the suspected independent variable as the independent variable

That results in the following regression equation:

Ln(e²) = 1.957(LnIncome) – 19.592

Step #4: If the t-ratio for the transformed independent variable is significant, you can conclude heteroscedasticity is present.

The parameter estimate for LnIncome is significant, with a t-ratio of 3.499, so we conclude that heteroscedasticity is present.
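The Park test is short enough to sketch end to end with the standard library. The variable names are my own, and the t-ratio uses the usual standard error formula for a one-variable regression; the cutoff of roughly 2.0 in the final comment is an approximate 5% two-sided critical value for 57 degrees of freedom, supplied by me rather than taken from the post.

```python
import math

INCOME = [
    24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
    15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
    11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
    15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
    19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269,
]
OWNRATIO = [
    7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
    0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
    1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
    1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
    4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
    0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364,
]


def fit(x, y):
    """Least-squares slope, intercept, and t-ratio of the slope."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ess = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(ess / (n - 2) / sxx)
    return slope, intercept, slope / std_err


# Step 1: original regression and its residuals.
b, a, _ = fit(INCOME, OWNRATIO)
residuals = [y - (a + b * x) for x, y in zip(INCOME, OWNRATIO)]

# Step 2: natural logs of the squared residuals and of INCOME.
ln_e2 = [math.log(e ** 2) for e in residuals]
ln_income = [math.log(x) for x in INCOME]

# Step 3: regress ln(e^2) on ln(INCOME).
park_slope, park_intercept, park_t = fit(ln_income, ln_e2)

# Step 4: a t-ratio well above about 2.0 (approximate 5% cutoff, 57 df)
# points to heteroscedasticity.
print(f"Ln(e^2) = {park_slope:.3f}(LnIncome) + ({park_intercept:.3f})")
print(f"t-ratio on LnIncome = {park_t:.3f}")
```

The slope from this second regression is the piece of information the Park test hands you for the correction step, since it describes how the error variance grows with INCOME.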

Next Forecast Friday Topic: Correcting Heteroscedasticity

Thanks for your patience! Now you know the three most common methods for detecting heteroscedasticity: the Goldfeld-Quandt test, the Breusch-Pagan test, and the Park test. As you will see in next week’s Forecast Friday post, the Park test will be beneficial in helping us eliminate the heteroscedasticity. We will discuss the most common approach to correcting heteroscedasticity: weighted least squares (WLS) regression, and show you how to apply it. Next week’s Forecast Friday post will conclude our discussion of regression violations, and allow us to resume discussions of more practical applications in forecasting.


### Forecast Friday Topic: Heteroscedasticity

August 12, 2010

(Seventeenth in a series)

Recall that one of the important assumptions in regression analysis is that a regression equation exhibit homoscedasticity: the condition that the error terms have a constant variance. Today we discuss heteroscedasticity, the violation of that assumption.

Heteroscedasticity, like autocorrelation and multicollinearity, results in inefficient parameter estimates. The standard errors of the parameter estimates tend to be biased, which means that the t-ratios and confidence intervals calculated around the suspect independent variable will not be valid, and will generate dubious predictions.

Heteroscedasticity occurs mostly in cross-sectional, as opposed to time series, data and mostly in large data sets. When data sets are large, the range of values for an independent variable can be quite wide. This is especially the case in data where income or other measures of wealth are used as independent variables. Persons with low income have few options about how to spend their money while persons with high incomes have many options. If you were trying to predict whether the conviction rate for crimes differed between low income and high income counties, your model may exhibit heteroscedasticity because a low-income person may not have the funds for an adequate defense, and may be restricted to a public defender, or other inexpensive attorney. A wealthy individual, on the other hand, can hire the very best defense lawyer money could buy; or he could choose an inexpensive lawyer, or even the public defender. The wealthy individual may even be able to make restitution in lieu of a conviction.

How does this disparity affect your model? Recall from our earlier discussions on regression analysis that the least-squares method places more weight on extreme values. When outliers exist in data, they generate large residuals that get scattered out from those of the remaining observations. While heteroscedastic error terms will still have a mean of zero, their variance is greatly out of whack, resulting in inefficient parameter estimates.

In today’s Forecast Friday post, we will look at a data set for a regional housing market, perform a regression, and show how to detect heteroscedasticity visually.

Heteroscedasticity in the Housing Market

The best depiction of heteroscedasticity comes from my college econometrics textbook, Introducing Econometrics, by William S. Brown. In the chapter on heteroscedasticity, Brown provides a data set of housing statistics from the 1980 Census for Pierce County, Washington, which I am going to use for our model. The housing market is certainly one market where heteroscedasticity is deeply entrenched, since there is a dramatic range for both incomes and home market values. In our data set, we have 59 census tracts within Pierce County. Our independent variable is the median family income for the census tract; our dependent variable is the OwnRatio – the ratio of the number of families who own their homes to the number of families who rent. Our data set is as follows:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 601 | \$24,909 | 7.220 |
| 602 | \$11,875 | 1.094 |
| 603 | \$19,308 | 3.587 |
| 604 | \$20,375 | 5.279 |
| 605 | \$20,132 | 3.508 |
| 606 | \$15,351 | 0.789 |
| 607 | \$14,821 | 1.837 |
| 608 | \$18,816 | 5.150 |
| 609 | \$19,179 | 2.201 |
| 609 | \$21,434 | 1.932 |
| 610 | \$15,075 | 0.919 |
| 611 | \$15,634 | 1.898 |
| 612 | \$12,307 | 1.584 |
| 613 | \$10,063 | 0.901 |
| 614 | \$5,090 | 0.128 |
| 615 | \$8,110 | 0.059 |
| 616 | \$4,399 | 0.022 |
| 616 | \$5,411 | 0.172 |
| 617 | \$9,541 | 0.916 |
| 618 | \$13,095 | 1.265 |
| 619 | \$11,638 | 1.019 |
| 620 | \$12,711 | 1.698 |
| 621 | \$12,839 | 2.188 |
| 623 | \$15,202 | 2.850 |
| 624 | \$15,932 | 3.049 |
| 625 | \$14,178 | 2.307 |
| 626 | \$12,244 | 0.873 |
| 627 | \$10,391 | 0.410 |
| 628 | \$13,934 | 1.151 |
| 629 | \$14,201 | 1.274 |
| 630 | \$15,784 | 1.751 |
| 631 | \$18,917 | 5.074 |
| 632 | \$17,431 | 4.272 |
| 633 | \$17,044 | 3.868 |
| 634 | \$14,870 | 2.009 |
| 635 | \$19,384 | 2.256 |
| 701 | \$18,250 | 2.471 |
| 705 | \$14,212 | 3.019 |
| 706 | \$15,817 | 2.154 |
| 710 | \$21,911 | 5.190 |
| 711 | \$19,282 | 4.579 |
| 712 | \$21,795 | 3.717 |
| 713 | \$22,904 | 3.720 |
| 713 | \$22,507 | 6.127 |
| 714 | \$19,592 | 4.468 |
| 714 | \$16,900 | 2.110 |
| 718 | \$12,818 | 0.782 |
| 718 | \$9,849 | 0.259 |
| 719 | \$16,931 | 1.233 |
| 719 | \$23,545 | 3.288 |
| 720 | \$9,198 | 0.235 |
| 721 | \$22,190 | 1.406 |
| 721 | \$19,646 | 2.206 |
| 724 | \$24,750 | 5.650 |
| 726 | \$18,140 | 5.078 |
| 728 | \$21,250 | 1.433 |
| 731 | \$22,231 | 7.452 |
| 731 | \$19,788 | 5.738 |
| 735 | \$13,269 | 1.364 |

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

When we run our regression, we get the following equation:

Ŷ= 0.000297*Income – 2.221

Both the intercept and independent variable’s parameter estimates are significant, with the intercept parameter having a t-ratio of -4.094 and the income estimate having one of 9.182. R² is 0.597, and the F-statistic is a strong 84.31. The model seems to be pretty good – strong t-ratios and F-statistic, a high coefficient of determination, and the sign on the parameter estimate for Income is positive, as we would expect. Generally, the higher the income, the greater the Own-to-rent ratio. So far so good.

The problem comes when we do a visual inspection of our data: first the independent variable against the dependent variable, and then the independent variable against the regression residuals. First, let’s take a look at the scatter plot of Income and OwnRatio:

Without even looking at the residuals, we can see that as median family income increases, the data points begin to spread out. Look at what happens to the distance between data points above and below the line when median family incomes reach \$20,000: OwnRatios vary drastically.

Now let’s plot Income against the regression’s residuals:

This scatter plot shows essentially the same phenomenon as the previous graph, but from a different perspective. We can clearly see the error terms fanning out as Income increases. In fact, we can see the residuals diverging at increasing rates once Income starts moving from \$10,000 to \$15,000, and just compounding as incomes go higher. Roughly half the residuals fall on the positive side and half on the negative side, which is consistent with the regression assumption that our residuals have a mean of zero; hence our parameter estimates are not biased. However, because we violated the constant variance assumption, the standard errors of our regression are biased, so the t-ratios and confidence intervals built around our parameter estimates are suspect.
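If you want a number to put beside the picture, one rough check is to compare the spread of the residuals across income bands. The sketch below is my own illustration, not part of the original analysis: it refits the model with the standard library, sorts the tracts by income, cuts them into three bands (19, 19, and 21 tracts, an arbitrary choice), and prints the standard deviation of the residuals in each band.

```python
import statistics

INCOME = [
    24909, 11875, 19308, 20375, 20132, 15351, 14821, 18816, 19179, 21434,
    15075, 15634, 12307, 10063, 5090, 8110, 4399, 5411, 9541, 13095,
    11638, 12711, 12839, 15202, 15932, 14178, 12244, 10391, 13934, 14201,
    15784, 18917, 17431, 17044, 14870, 19384, 18250, 14212, 15817, 21911,
    19282, 21795, 22904, 22507, 19592, 16900, 12818, 9849, 16931, 23545,
    9198, 22190, 19646, 24750, 18140, 21250, 22231, 19788, 13269,
]
OWNRATIO = [
    7.220, 1.094, 3.587, 5.279, 3.508, 0.789, 1.837, 5.150, 2.201, 1.932,
    0.919, 1.898, 1.584, 0.901, 0.128, 0.059, 0.022, 0.172, 0.916, 1.265,
    1.019, 1.698, 2.188, 2.850, 3.049, 2.307, 0.873, 0.410, 1.151, 1.274,
    1.751, 5.074, 4.272, 3.868, 2.009, 2.256, 2.471, 3.019, 2.154, 5.190,
    4.579, 3.717, 3.720, 6.127, 4.468, 2.110, 0.782, 0.259, 1.233, 3.288,
    0.235, 1.406, 2.206, 5.650, 5.078, 1.433, 7.452, 5.738, 1.364,
]

# Fit OwnRatio on Income by ordinary least squares.
n = len(INCOME)
mean_x = sum(INCOME) / n
mean_y = sum(OWNRATIO) / n
sxx = sum((x - mean_x) ** 2 for x in INCOME)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(INCOME, OWNRATIO))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Pair each income with its residual and sort by income, low to high.
by_income = sorted(
    (x, y - (intercept + slope * x)) for x, y in zip(INCOME, OWNRATIO)
)

# Three income bands; the cut points are arbitrary.
third = n // 3
bands = {
    "low": by_income[:third],
    "middle": by_income[third:2 * third],
    "high": by_income[2 * third:],
}

spread = {}
for name, rows in bands.items():
    spread[name] = statistics.stdev(e for _, e in rows)
    lo, hi = rows[0][0], rows[-1][0]
    print(f"{name:>6} band (${lo:,} to ${hi:,}): residual std dev = {spread[name]:.3f}")
```

Under homoscedasticity the three standard deviations would be roughly equal; a spread that widens from the low band to the high band is the numerical counterpart of the fanning pattern described above.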

Visual Inspection Only Gets You So Far

By visually inspecting our residuals, we can clearly see that our error terms are not homoscedastic. When you have a regression model, especially for cross-sectional data sets like this, you should visually inspect every independent variable against the dependent variable and against the error terms in order to get an a priori indication of heteroscedasticity. However, visual inspection alone is not a guarantee that heteroscedasticity exists. There are three particularly simple methods for detecting heteroscedasticity, which we will discuss in next week’s Forecast Friday post: the Park Test, the Goldfeld-Quandt Test, and the Breusch-Pagan Test.


### Objective and Subjective Forecasting Approaches

May 3, 2010

(Second in a series)

Today we discuss the various categories of forecasting methods that are available to businesses.  Forecasting methods can be either objective (using quantitative approaches) or subjective (using more intuitive or qualitative approaches), depending on what data is available and the distance into the future for which a forecast is desired.  Forecasting approaches will typically be more objective for nearer term forecasting horizons and for events where there is plenty of quantitative data available.  More distant time periods, or events with a lack of historical quantitative data will often call for more subjective approaches.  We will discuss these two classes of forecasting methods, and the categories within each.

Objective Forecasting Approaches

Objective forecasting approaches are quantitative in nature and lend themselves well to an abundance of data.  There are three categories of objective forecasting methods: time series, causal/econometric,  and artificial intelligence.  AI approaches are outside my experience, so I won’t be covering them in this series, but mention them as another alternative, in case you wish to investigate them on your own.

Time Series Methods

Time series methods attempt to estimate future outcomes on the basis of historical data.  In many cases, prior sales of a product can be a good predictor of upcoming sales because of prior period marketing efforts, repeat business, brand awareness, and other factors.  When an analyst employs time series methods, he/she is assuming that the future will continue to look like the past.  In rapidly changing industries or environments, time series forecasts are not ideal, and may be useless.

Because time series data are historical, they exhibit four components that emerge over time: trend, seasonal, cyclical, and random (or irregular).  Before any forecasting is done on time series data, the data must be adjusted for each of these components.  Decomposing time series data will be discussed later in this series.

The most common time series methods include moving average (both straight and weighted), exponential smoothing, and regression analysis.  Each of these approaches will be discussed later in the series.
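To give a feel for how the first two families work before they get their own posts, here is a minimal Python sketch of a straight moving average, a weighted moving average, and simple exponential smoothing. The sales figures, the weights, and the smoothing constant `alpha` are all hypothetical values chosen for illustration, and the function names are my own rather than anything from a particular library.

```python
def moving_average(series, window):
    """Forecast the next period as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

def weighted_moving_average(series, weights):
    """Forecast the next period as a weighted mean of the most recent observations.

    `weights` run from oldest to newest and are assumed to sum to 1.
    """
    recent = series[-len(weights):]
    return sum(w * v for w, v in zip(weights, recent))

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; returns the forecast for the next period.

    The first forecast is seeded with the first observation (one common choice).
    """
    forecast = series[0]
    for actual in series[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

sales = [100, 110, 105, 120]  # hypothetical sales for four periods

print(moving_average(sales, window=3))
print(weighted_moving_average(sales, weights=[0.2, 0.3, 0.5]))
print(exponential_smoothing(sales, alpha=0.5))
```

Notice that all three forecasts are built from nothing but past values of the series itself, which is exactly the "future looks like the past" assumption described above.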

Causal/Econometric Methods

Causal or econometric forecasting methods attempt to predict outcomes based on changes in factors that are known – or believed – to impact those outcomes.  For example, temperature may be used to forecast sales of ice cream; advertising expenditures may be used to predict sales; or the unemployment rate might be used to forecast the incidence of crime in a neighborhood.  It is important to note, however, that just because a model finds two events that are correlated (i.e., they tend to occur together), it does not necessarily mean that one event has caused the other.
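As a rough illustration of the ice cream example, the Python sketch below fits a one-variable least squares line relating temperature to sales and then uses it to forecast sales at a new temperature. The temperatures, sales figures, and the function name `fit_line` are hypothetical; a real model would use far more observations and would need to be checked for the kinds of problems discussed elsewhere in this series.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical observations: daily high temperature and ice cream sales
temperature = [60, 70, 80, 90]
ice_cream_sales = [210, 240, 310, 340]

intercept, slope = fit_line(temperature, ice_cream_sales)

# Forecast sales for a day expected to reach 85 degrees
forecast = intercept + slope * 85
print(f"sales = {intercept:.1f} + {slope:.1f} * temperature")
print(f"forecast at 85 degrees: {forecast:.1f}")
```

The fitted slope describes how sales and temperature move together in this sample; by itself it says nothing about whether temperature causes the change in sales.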

Regression analysis also falls under the causal/econometric umbrella, as it can be used to predict an outcome based on changes in other factors (e.g., an SAT score may be used to predict the likelihood of being accepted to a college).  Econometric forecasting methods include Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models.  ARIMA modeling is also known as the Box-Jenkins approach.  ARMA and ARIMA models are useful in certain cases, but most of the time are unnecessary.  Although these two methods won’t be covered in much depth later in the series, there will be a brief description of them and when they are needed.

Subjective Forecasting Approaches

Subjective forecasts are more qualitative.  These approaches rely most heavily on judgment and educated guesses, since there is little data available for forecasting.  This is especially the case in long-range forecasting.  It’s easy to forecast next week’s sales of ice cream – and possibly even of individual flavors, since you’ll likely have months or years of past weekly ice cream sales data.  However, if you’re trying to get an idea of what ice cream consumption or flavor preferences will be 10 years from now, quantitative approaches will be of little use.  Changes in tastes, technology, and political, economic, and social factors occur and can dramatically alter the course of trends.  Hence, the opinion of subject matter experts is often called upon.  There is essentially only one category of subjective forecast approaches – and it is rightly called “Judgmental” forecasts.

Judgmental Methods

Judgmental forecasting methods rely heavily on expert opinion and educated guesses.  But just because they have little quantitative or objective basis doesn’t mean they should be dismissed or not measured for accuracy.  The most common types of judgmental forecasting methods are composite forecasts, extrapolation, surveys, the Delphi method, scenario writing, and simulation.  Each of these methods will be discussed in detail later in the series.

Introducing “Forecast Fridays” – ON THURSDAYS!!!

Beginning with part 3, which will discuss moving average forecasts, the forecasting series will begin posting weekly so that the remaining days of the week can still be devoted to other topics in the marketing research and analytics field.  The weekly post will be called “Forecast Friday.”  However, it will be posted every Thursday!  Why?  Find out in tomorrow’s post!