Posts Tagged ‘independent variable’

Forecast Friday Topic: Heteroscedasticity

August 12, 2010

(Seventeenth in a series)

Recall that one of the important assumptions in regression analysis is that a regression equation exhibit homoscedasticity: the condition that the error terms have a constant variance. Today we discuss heteroscedasticity, the violation of that assumption.

Heteroscedasticity, like autocorrelation and multicollinearity, results in inefficient parameter estimates. The standard errors of the parameter estimates tend to be biased, which means that the t-ratios and confidence intervals calculated around the suspect independent variable will not be valid, and the model may generate dubious predictions.

Heteroscedasticity occurs mostly in cross-sectional, as opposed to time series, data and mostly in large data sets. When data sets are large, the range of values for an independent variable can be quite wide. This is especially the case in data where income or other measures of wealth are used as independent variables. Persons with low income have few options about how to spend their money, while persons with high incomes have many options. If you were trying to model how the conviction rate for crimes differs between low-income and high-income counties, your model may exhibit heteroscedasticity because a low-income person may not have the funds for an adequate defense and may be restricted to a public defender or other inexpensive attorney. A wealthy individual, on the other hand, can hire the very best defense lawyer money can buy, or he could choose an inexpensive lawyer or even the public defender. The wealthy individual may even be able to make restitution in lieu of a conviction. Outcomes for the wealthy can therefore vary far more widely than outcomes for the poor.

How does this disparity affect your model? Recall from our earlier discussions on regression analysis that the least-squares method places more weight on extreme values. When outliers exist in data, they generate large residuals that get scattered out from those of the remaining observations. While heteroscedastic error terms will still have a mean of zero, their variance is greatly out of whack, resulting in inefficient parameter estimates.

In today’s Forecast Friday post, we will look at a data set for a regional housing market, perform a regression, and show how to detect heteroscedasticity visually.

Heteroscedasticity in the Housing Market

The best depiction of heteroscedasticity comes from my college econometrics textbook, Introducing Econometrics, by William S. Brown. In the chapter on heteroscedasticity, Brown provides a data set of housing statistics from the 1980 Census for Pierce County, Washington, which I am going to use for our model. The housing market is certainly one market where heteroscedasticity is deeply entrenched, since there is a dramatic range for both incomes and home market values. In our data set, we have 59 census tracts within Pierce County. Our independent variable is the median family income for the census tract; our dependent variable is the OwnRatio – the ratio of the number of families who own their homes to the number of families who rent. Our data set is as follows:

Housing Data

| Tract | Income | Ownratio |
|---|---|---|
| 601 | \$24,909 | 7.220 |
| 602 | \$11,875 | 1.094 |
| 603 | \$19,308 | 3.587 |
| 604 | \$20,375 | 5.279 |
| 605 | \$20,132 | 3.508 |
| 606 | \$15,351 | 0.789 |
| 607 | \$14,821 | 1.837 |
| 608 | \$18,816 | 5.150 |
| 609 | \$19,179 | 2.201 |
| 609 | \$21,434 | 1.932 |
| 610 | \$15,075 | 0.919 |
| 611 | \$15,634 | 1.898 |
| 612 | \$12,307 | 1.584 |
| 613 | \$10,063 | 0.901 |
| 614 | \$5,090 | 0.128 |
| 615 | \$8,110 | 0.059 |
| 616 | \$4,399 | 0.022 |
| 616 | \$5,411 | 0.172 |
| 617 | \$9,541 | 0.916 |
| 618 | \$13,095 | 1.265 |
| 619 | \$11,638 | 1.019 |
| 620 | \$12,711 | 1.698 |
| 621 | \$12,839 | 2.188 |
| 623 | \$15,202 | 2.850 |
| 624 | \$15,932 | 3.049 |
| 625 | \$14,178 | 2.307 |
| 626 | \$12,244 | 0.873 |
| 627 | \$10,391 | 0.410 |
| 628 | \$13,934 | 1.151 |
| 629 | \$14,201 | 1.274 |
| 630 | \$15,784 | 1.751 |
| 631 | \$18,917 | 5.074 |
| 632 | \$17,431 | 4.272 |
| 633 | \$17,044 | 3.868 |
| 634 | \$14,870 | 2.009 |
| 635 | \$19,384 | 2.256 |
| 701 | \$18,250 | 2.471 |
| 705 | \$14,212 | 3.019 |
| 706 | \$15,817 | 2.154 |
| 710 | \$21,911 | 5.190 |
| 711 | \$19,282 | 4.579 |
| 712 | \$21,795 | 3.717 |
| 713 | \$22,904 | 3.720 |
| 713 | \$22,507 | 6.127 |
| 714 | \$19,592 | 4.468 |
| 714 | \$16,900 | 2.110 |
| 718 | \$12,818 | 0.782 |
| 718 | \$9,849 | 0.259 |
| 719 | \$16,931 | 1.233 |
| 719 | \$23,545 | 3.288 |
| 720 | \$9,198 | 0.235 |
| 721 | \$22,190 | 1.406 |
| 721 | \$19,646 | 2.206 |
| 724 | \$24,750 | 5.650 |
| 726 | \$18,140 | 5.078 |
| 728 | \$21,250 | 1.433 |
| 731 | \$22,231 | 7.452 |
| 731 | \$19,788 | 5.738 |
| 735 | \$13,269 | 1.364 |

Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

When we run our regression, we get the following equation:

Ŷ= 0.000297*Income – 2.221

Both the intercept and independent variable’s parameter estimates are significant, with the intercept parameter having a t-ratio of -4.094 and the income estimate having one of 9.182. R2 is 0.597, and the F-statistic is a strong 84.31. The model seems to be pretty good – strong t-ratios and F-statistic, a high coefficient of determination, and the sign on the parameter estimate for Income is positive, as we would expect. Generally, the higher the income, the greater the Own-to-rent ratio. So far so good.

The problem comes when we do a visual inspection of our data: first the independent variable against the dependent variable, and then the independent variable against the regression residuals. First, let’s take a look at the scatter plot of Income and OwnRatio:

Without even looking at the residuals, we can see that as median family income increases, the data points begin to spread out. Look at what happens to the distance between data points above and below the line when median family incomes reach \$20,000: OwnRatios vary drastically.

Now let’s plot Income against the regression’s residuals:

This scatter plot shows essentially the same phenomenon as the previous graph, but from a different perspective. We can clearly see the error terms fanning out as Income increases. In fact, we can see the residuals diverging at increasing rates once Income starts moving from \$10,000 to \$15,000, and just compounding as incomes go higher. Roughly half the residuals fall on each side of zero, so we still meet the regression assumption that the residuals have a mean of zero; hence our parameter estimates are not biased. However, because we violated the constant variance assumption, the standard errors of our estimates are biased, so the t-ratios and confidence intervals built on them are suspect.
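The fanning-out of the residuals can also be checked numerically. The following is a minimal Python sketch, not the procedure used to produce the charts: it refits the line to the income and OwnRatio pairs transcribed from the housing table, then compares the spread of the residuals for tracts below the median income with the spread for tracts at or above it. The median split and the names `fit_line`, `spread_low`, and `spread_high` are assumptions made purely for illustration; if the table was transcribed faithfully, the fitted slope and intercept should land close to the equation reported above.

```python
from statistics import mean, pstdev

# (median family income, OwnRatio) pairs transcribed from the housing table.
DATA = [
    (24909, 7.220), (11875, 1.094), (19308, 3.587), (20375, 5.279), (20132, 3.508),
    (15351, 0.789), (14821, 1.837), (18816, 5.150), (19179, 2.201), (21434, 1.932),
    (15075, 0.919), (15634, 1.898), (12307, 1.584), (10063, 0.901), (5090, 0.128),
    (8110, 0.059), (4399, 0.022), (5411, 0.172), (9541, 0.916), (13095, 1.265),
    (11638, 1.019), (12711, 1.698), (12839, 2.188), (15202, 2.850), (15932, 3.049),
    (14178, 2.307), (12244, 0.873), (10391, 0.410), (13934, 1.151), (14201, 1.274),
    (15784, 1.751), (18917, 5.074), (17431, 4.272), (17044, 3.868), (14870, 2.009),
    (19384, 2.256), (18250, 2.471), (14212, 3.019), (15817, 2.154), (21911, 5.190),
    (19282, 4.579), (21795, 3.717), (22904, 3.720), (22507, 6.127), (19592, 4.468),
    (16900, 2.110), (12818, 0.782), (9849, 0.259), (16931, 1.233), (23545, 3.288),
    (9198, 0.235), (22190, 1.406), (19646, 2.206), (24750, 5.650), (18140, 5.078),
    (21250, 1.433), (22231, 7.452), (19788, 5.738), (13269, 1.364),
]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


incomes = [income for income, _ in DATA]
ratios = [ratio for _, ratio in DATA]
slope, intercept = fit_line(incomes, ratios)

# Residual = actual OwnRatio minus the fitted value.
residuals = [y - (intercept + slope * x) for x, y in DATA]

# Split the tracts at the median income and compare residual spread in each half.
cutoff = sorted(incomes)[len(incomes) // 2]
low = [e for x, e in zip(incomes, residuals) if x < cutoff]
high = [e for x, e in zip(incomes, residuals) if x >= cutoff]
spread_low, spread_high = pstdev(low), pstdev(high)

print(f"slope={slope:.6f}, intercept={intercept:.3f}")
print(f"residual std dev: low-income half={spread_low:.3f}, "
      f"high-income half={spread_high:.3f}")
```

A noticeably larger standard deviation in the high-income half is the numerical counterpart of the fan shape in the residual plot. It is only a rough screen and does not replace a formal test.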

Visual Inspection Only Gets You So Far

By visually inspecting our residuals, we can clearly see that our error terms are not homoscedastic. When you have a regression model, especially for cross-sectional data sets like this, you should visually inspect every independent variable against the dependent variable and against the error terms in order to get an a priori indication of heteroscedasticity. However, visual inspection alone is not a guarantee that heteroscedasticity exists. There are three particularly simple methods for detecting heteroscedasticity, which we will discuss in next week’s Forecast Friday post: the Park Test, the Goldfeld-Quandt Test, and the Breusch-Pagan Test.


Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation” to perform the regression. In fact, if we did that with our model, we end up with a model (not shown here) that has an R2=0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value for both the hours of sleep and hours of studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply the two differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering the interaction term, which typically makes it less correlated with the original variables than a raw product would be.
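The centering steps can be sketched in a few lines of Python. This is only an illustration: the study and sleep figures are made-up values, and the function names `centered_interaction` and `correlation` are hypothetical helpers rather than anything from the original model.

```python
from statistics import mean


def centered_interaction(x1, x2):
    """Subtract each variable's mean, then multiply the deviations pairwise."""
    m1, m2 = mean(x1), mean(x2)
    return [(a - m1) * (b - m2) for a, b in zip(x1, x2)]


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical observations, invented for this example only.
study_hours = [1, 2, 3, 4, 5, 6]
sleep_hours = [8, 7, 7, 6, 5, 6]

raw_product = [s * z for s, z in zip(study_hours, sleep_hours)]
sleep_study_interaction = centered_interaction(study_hours, sleep_hours)

# Compare how strongly each version of the term tracks study_hours.
r_raw = correlation(study_hours, raw_product)
r_centered = correlation(study_hours, sleep_study_interaction)
print(f"raw product vs. study hours:      r = {r_raw:.2f}")
print(f"centered product vs. study hours: r = {r_centered:.2f}")
```

In this toy data the raw product is almost collinear with study hours, while the centered version is far less so, which is the reason for centering before multiplying.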

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediately prior observation’s value from the current observation’s value. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing can reduce multicollinearity by removing the trend component of the time series: if all the independent variables follow more or less the same trend, they can end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. Subtracting the period before is taking a “first difference.” Differencing the first differences again gives a “second difference,” and so on. Note also that with differencing, we lose one observation at the start of the data for each round of differencing, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
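A short Python sketch may make the mechanics of differencing concrete. The two trending series below are invented for illustration, and `difference` and `correlation` are hypothetical helper names; the point is only to show that each round of differencing drops one observation and that series sharing a trend can be much less correlated once the trend is removed.

```python
from statistics import mean


def difference(series, order=1):
    """Apply first differencing `order` times; each pass drops one observation."""
    values = list(series)
    for _ in range(order):
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Two hypothetical independent variables that both trend upward over time.
x1_levels = [10, 12, 13, 15, 18, 20, 21, 24]
x2_levels = [100, 104, 109, 111, 116, 121, 123, 128]

x1_diff = difference(x1_levels)           # first differences
x2_diff = difference(x2_levels)
x1_second = difference(x1_levels, order=2)  # differences of the first differences

r_levels = correlation(x1_levels, x2_levels)
r_diffs = correlation(x1_diff, x2_diff)
print(f"correlation in levels:      {r_levels:.2f}")
print(f"correlation in differences: {r_diffs:.2f}")
```

The dependent variable would be differenced the same way before rerunning the regression on the shortened series.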

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
3. You should suspect multicollinearity if:
   a. You have a high R2 but low t-statistics;
   b. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy for multicollinearity contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and against the resulting impact on your forecast accuracy.

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.


Forecast Friday Topic: Multicollinearity – How to Detect it; How to Correct it

July 15, 2010

(Thirteenth in a series)

In last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history of 20 contributors to a nonprofit organization, and developed a model based on the recency, frequency, and monetary value (RFM) of their past donations. We derived the following regression equation:

Ŷ = 87.27 – 1.80X1 + 2.45X2 + 0.35X3

where X1 is the number of months since the last donation, X2 is the number of times donated in the last 12 months, and X3 is the average contribution in the last 12 months.

We were pleased to see that our model had a coefficient of determination (R2) of 0.933, indicating that our model explained 93.3% of the variation in the donor’s current contribution (our Y). But we were a little disheartened when we looked at the t-statistics of each of our regression coefficients. Recall that we found our recency coefficient was not significant:

| Parameter | Coefficient | T-statistic | Significant? |
|---|---|---|---|
| Intercept | 87.27 | 4.32 | Yes |
| Months since Last | (1.80) | (1.44) | No |
| Times Donated | 2.45 | 2.87 | Yes |
| Average Contribution | 0.35 | 3.26 | Yes |

Yet, most direct marketing professionals know clearly that RFM theory postulates that all three variables are significant indicators of whether and how much a donor will give (or a customer will buy). When our model doesn’t replicate what a tried and true theory has long maintained, there could possibly be something wrong.

Multicollinearity

Most times, when something doesn’t look right in the results of a regression model, it is safe to assume that one of the regression assumptions has been violated. The problem is trying to determine which assumption – or assumptions – was violated. Since the coefficient for “Months Since Last Contribution” has a t-statistic that indicates it isn’t statistically significant, we might suspect that the specification assumption is violated: that is, we may believe that “Months Since Last Contribution” is an extraneous, irrelevant variable that should not have been included in the model and should therefore be removed.

But is that really the case? There can be other reasons why a parameter estimate does not come up significant. If two or more independent variables are highly correlated, the resulting multicollinearity can cause the regression model to assign a statistically insignificant parameter estimate to an important independent variable. So, how can we detect multicollinearity?

Detecting Multicollinearity: Correlation Matrix

The first step in detecting multicollinearity is to examine the correlation among the independent variables. We do this by looking at a correlation matrix. You can run a correlation matrix in Excel by using its Data Analysis ToolPak. Looking at the correlation matrix for our variables, we find:

Correlation Matrix – Original Variables

| Variable | Contribution (Y) | Months Since Last Donation (X1) | Times Donated in Last 12 Months (X2) | Average Contribution in Last 12 Months (X3) |
|---|---|---|---|---|
| Contribution (Y) | 1.00 | | | |
| Months Since Last Donation (X1) | -0.93 | 1.00 | | |
| Times Donated in Last 12 Months (X2) | 0.89 | -0.88 | 1.00 | |
| Average Contribution in Last 12 Months (X3) | 0.88 | -0.84 | 0.69 | 1.00 |

A correlation of 1.00 means two variables are perfectly positively correlated; a correlation of 0.00 means there is no linear correlation at all. The cells in the matrix above where the correlation is 1.00 show the correlation of a variable with itself – we would expect a perfectly correlated relationship. What is most important to us are the numbers below the 1.00 correlations. The first column shows our dependent variable, “Contribution”. As you go down the column, row by row, you see that each of our independent variables is strongly correlated with the dependent variable, indicating that they are all strong predictors.

The correlation between “Months Since Last Donation” (X1) and the donor’s Contribution (Y) is strongly negative (-0.93), while the correlations of the dependent variable with each of the other two independent variables are strongly positive (0.89 and 0.88). When writing these in shorthand, we use the Greek letter rho, ρ, to denote correlation. Hence, to show the correlation of each independent variable with the dependent variable, we would express them as follows:

ρX1Y = -0.93

ρX2Y = 0.89

ρX3Y = 0.88

But now, let’s look at the correlations among our independent variables:

ρX1X2 = -0.88

ρX1X3 = -0.84

ρX2X3 = 0.69

Notice that all of our independent variables are highly correlated with one another. The relationship between “Times Donated in Last 12 Months” and “Average Contribution in Last 12 Months” is not as strong as the correlation between those individual variables with “Months Since Last Donation,” but the correlation is still very strong.

Hence, we can conclude that multicollinearity is present in this model.
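For readers who prefer to screen a correlation matrix programmatically, here is a minimal Python sketch using the correlations reported above. The 0.8 cutoff is an arbitrary assumption chosen for illustration, not a rule from this series, and `flag_collinear_pairs` is a hypothetical helper name.

```python
from itertools import combinations

# Pairwise correlations copied from the matrix above.
CORRELATIONS = {
    ("Y", "X1"): -0.93, ("Y", "X2"): 0.89, ("Y", "X3"): 0.88,
    ("X1", "X2"): -0.88, ("X1", "X3"): -0.84, ("X2", "X3"): 0.69,
}


def flag_collinear_pairs(correlations, predictors, threshold=0.8):
    """Return predictor pairs whose absolute correlation meets the threshold,
    strongest first. The threshold is an illustrative assumption."""
    flagged = []
    for a, b in combinations(predictors, 2):
        r = correlations[(a, b)]
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return sorted(flagged, key=lambda item: -abs(item[2]))


pairs = flag_collinear_pairs(CORRELATIONS, ["X1", "X2", "X3"])
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:+.2f}")
```

With the 0.8 cutoff only the two pairs involving X1 are flagged; lowering the threshold to 0.6 would flag all three pairs, including the 0.69 correlation between X2 and X3.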

Correcting Multicollinearity: Dropping Variables

In today’s post, we will discuss one of the remedies for multicollinearity – dropping a highly correlated independent variable. Next week, we’ll discuss the other approaches to correcting multicollinearity. Sometimes, when a variable is “iffy,” we can save ourselves some trouble and just kick it out. If we were to ignore “Months Since Last Donation,” and run our regression with the remaining two variables, we end up with the following regression equation:

Ŷ= 60.68 + 3.37X2 + 0.45X3

We get R2 =0.924, suggesting that we didn’t lose much explanatory power by excluding “Months Since Last Donation.” We also get an F statistic of 103.36, much higher than the 73.90 we had in our original model. A higher F-statistic indicates a model that is more statistically valid. It also reflects the exclusion of one or more extraneous variables. Also, the t-statistics for both independent variables are significant, and they’re even higher than they were in the original model, further indicating increased validity:

| Parameter | Coefficient | T-statistic | Significant? |
|---|---|---|---|
| Intercept | 60.68 | 7.24 | Yes |
| Times Donated | 3.37 | 5.83 | Yes |
| Average Contribution | 0.45 | 5.49 | Yes |

Dropping “Months Since Last Donation” from our analysis worked here. However, dropping variables without a rational decision process can cause new problems. In some cases, dropping a variable can result in specification bias, as we saw in our previous example of predicting profit margin for savings and loan associations a few weeks ago. So, consider dropping variables cautiously.

Next Forecast Friday Topic: More Multicollinearity Remedies

Today, we described one of the ways to remedy multicollinearity – dropping variables. Next week, we will explore two other ways of correcting multicollinearity: obtaining more data and transforming variables. We will also discuss the pitfalls of all three of these remedies, and we will discuss when it’s not worth it to reduce the impact of multicollinearity.


Forecast Friday Topic: Multiple Regression Analysis (continued)

June 24, 2010

(Tenth in a series)

Today we resume our discussion of multiple regression analysis. Last week, we built a model to determine the extent of any relationship between U.S. savings & loan associations’ percent profit margin and two independent variables, net revenues per deposit dollar and number of S&L offices. Today, we will compute the 95% confidence interval for each parameter estimate; determine whether the model is valid; check for autocorrelation; and use the model to forecast. Recall that our resulting model was:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

Where Yt is the percent profit margin for the S&L in Year t; X1t is the net revenues per deposit dollar in Year t; and X2t is the number of S&L offices in the U.S. in Year t. Recall that the R2 is .865, indicating that 86.5% of the change in percentage profit margin is explained by changes in net revenues per deposit dollar and number of S&L offices.

Determining the 95% Confidence Interval for the Partial Slope Coefficients

In multiple regression analysis, since there are multiple independent variables, the parameter estimate for each independent variable contributes to the slope of the fitted relationship; hence the coefficients β1 and β2 are referred to as partial slope estimates. As with simple linear regression, we need to determine the 95% confidence interval for each parameter estimate, so that we can get an idea of where the true population parameter lies. Recall from our June 3 post, we did that by determining the equation for the standard error of the estimate, sε, and then the standard error of the regression slope, sb. That worked well for simple regression, but for multiple regression, it is more complicated. Unfortunately, deriving the standard error of the partial regression coefficients requires the use of linear algebra, and would be too complicated to discuss here. Several statistical programs and Excel compute these values for us. So, we will state the values of sb1 and sb2 and go from there.

Sb1=0.05556

Sb2=0.00003

Also, we need our critical-t value for 22 degrees of freedom (n – k – 1 = 25 – 2 – 1), which is 2.074.

Hence, our 95% confidence interval for β1 is calculated as:

0.23720 ± 2.074 × 0.05556

=0.12197 to 0.35243

Hence, we are saying that we can be 95% confident that the true parameter β1 lies somewhere between the values of 0.12197 and 0.35243.

For β2, the procedure is the same:

-0.000249 ± 2.074 × 0.00003

=-0.00032 to -0.00018

Hence, we can be 95% confident that the true parameter β2 lies somewhere between the values of -0.00032 and -0.00018. Also, the confidence interval for the intercept, α, ranges from 1.40 to 1.73.

Note that in all of these cases, the confidence interval does not contain a value of zero within its range. The confidence intervals for α and β1 are positive; that for β2 is negative. If any parameter’s confidence interval ranges crossed zero, then the parameter estimate would not be significant.
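The interval arithmetic above is easy to reproduce. The sketch below is a minimal Python illustration that takes the estimates, standard errors, and critical t-value exactly as printed in this post; `confidence_interval` is a hypothetical helper name. Because sb2 is shown rounded to 0.00003, the β2 interval computed here comes out slightly narrower than the published one, which was presumably calculated from an unrounded standard error.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) bounds: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


T_CRITICAL = 2.074  # two-tailed 95% critical t for 22 degrees of freedom, as stated above

b1_low, b1_high = confidence_interval(0.23720, 0.05556, T_CRITICAL)
b2_low, b2_high = confidence_interval(-0.000249, 0.00003, T_CRITICAL)

print(f"beta1: {b1_low:.5f} to {b1_high:.5f}")
print(f"beta2: {b2_low:.5f} to {b2_high:.5f}")

# A parameter is significant at this level only if its interval excludes zero.
for name, (low, high) in {"beta1": (b1_low, b1_high), "beta2": (b2_low, b2_high)}.items():
    print(name, "interval excludes zero:", not (low <= 0 <= high))
```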

Is Our Model Valid?

The next thing we want to do is determine if our model is valid. When validating our model we are trying to prove that our independent variables explain the variation in the dependent variable. So we start with a hypothesis test:

H0: β1 = β2 = 0

HA: at least one β ≠ 0

Our null hypothesis says that our independent variables, net revenue per deposit dollar and number of S&L offices, explain nothing of the variation in an S&L’s percentage profit margin, and hence, that our model is not valid. Our alternative hypothesis says that at least one of our independent variables explains some of the variation in an S&L’s percentage profit margin, and thus that the model is valid.

So how do we do it? Enter the F-test. Like the t-test, the F-test is a means for hypothesis testing. Let’s first start by calculating the F-statistic for our model. We do that with the following equation:

Fcalc = (RSS / k) / (ESS / (n – k – 1))

Remember that RSS is the regression sum of squares and ESS is the error sum of squares. The May 27th Forecast Friday post showed you how to calculate RSS and ESS. For this model, our RSS=0.4015, and our ESS=0.0625; k is the number of independent variables (2), and n is the sample size (25). Our equation reduces to:

Fcalc = (0.4015 / 2) / (0.0625 / 22) = 70.66

If our Fcalc is greater than the critical F value for the distribution, then we can reject our null hypothesis and conclude that there is strong evidence that at least one of our independent variables explains some of the variation in an S&L’s percentage profit margin. How do we determine our critical F? There is yet another table in any statistics book or statistics Web site called the “F Distribution” table. In it, you look for two sets of degrees of freedom – one for the numerator and one for the denominator of your Fcalc equation. In the numerator, we have two degrees of freedom (k); in the denominator, 22 (n – k – 1). In the F Distribution table, the columns represent numerator degrees of freedom and the rows represent denominator degrees of freedom. When we find column (2), row (22) in the table for the 1% significance level, we end up with an F-value of 5.72.

Our Fcalc is greater than that, so we can conclude that our model is valid.
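The F-statistic calculation can be reproduced with a few lines of Python. This is a minimal sketch using the sums of squares quoted in this post; the function name `f_statistic` is a hypothetical helper, and the critical value is simply the table figure cited above rather than something computed here.

```python
def f_statistic(rss, ess, n, k):
    """F = (RSS / k) / (ESS / (n - k - 1)), where RSS is the regression sum of
    squares, ESS the error sum of squares, n the sample size, and k the number
    of independent variables."""
    return (rss / k) / (ess / (n - k - 1))


RSS, ESS = 0.4015, 0.0625   # sums of squares quoted in the post
N, K = 25, 2                # 25 yearly observations, 2 independent variables
F_CRITICAL = 5.72           # table value cited above for (2, 22) degrees of freedom

f_calc = f_statistic(RSS, ESS, N, K)
r_squared = RSS / (RSS + ESS)  # same sums of squares also reproduce R-squared

print(f"F = {f_calc:.2f}, R-squared = {r_squared:.3f}")
print("Reject the null hypothesis:", f_calc > F_CRITICAL)
```

As a consistency check, the same two sums of squares also recover the R2 of roughly 0.865 reported for this model.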

Is Our Model Free of Autocorrelation?

Recall from our assumptions that none of our error terms should be correlated with one another. If they are, autocorrelation results, rendering our parameter estimates inefficient. To check for autocorrelation, we need to look at our error terms, which we get by comparing our predicted percentage profit margin, Ŷ, with our actual, Y (in the table below, the error is the predicted value minus the actual value, and parentheses denote negative numbers):

| Year | Actual Percentage Profit Margin (Yt) | Predicted by Model (Ŷt) | Error |
|---|---|---|---|
| 1 | 0.75 | 0.68 | (0.0735) |
| 2 | 0.71 | 0.71 | 0.0033 |
| 3 | 0.66 | 0.70 | 0.0391 |
| 4 | 0.61 | 0.67 | 0.0622 |
| 5 | 0.70 | 0.68 | (0.0162) |
| 6 | 0.72 | 0.71 | (0.0124) |
| 7 | 0.77 | 0.74 | (0.0302) |
| 8 | 0.74 | 0.76 | 0.0186 |
| 9 | 0.90 | 0.79 | (0.1057) |
| 10 | 0.82 | 0.79 | (0.0264) |
| 11 | 0.75 | 0.80 | 0.0484 |
| 12 | 0.77 | 0.83 | 0.0573 |
| 13 | 0.78 | 0.80 | 0.0222 |
| 14 | 0.84 | 0.80 | (0.0408) |
| 15 | 0.79 | 0.75 | (0.0356) |
| 16 | 0.70 | 0.73 | 0.0340 |
| 17 | 0.68 | 0.70 | 0.0249 |
| 18 | 0.72 | 0.69 | (0.0270) |
| 19 | 0.55 | 0.64 | 0.0851 |
| 20 | 0.63 | 0.61 | (0.0173) |
| 21 | 0.56 | 0.57 | 0.0101 |
| 22 | 0.41 | 0.48 | 0.0696 |
| 23 | 0.51 | 0.44 | (0.0725) |
| 24 | 0.47 | 0.40 | (0.0746) |
| 25 | 0.32 | 0.38 | 0.0574 |

The next thing we need to do is subtract the previous period’s error from the current period’s error. After that, we square our result. Note that we will only have 24 observations (we can’t subtract anything from the first observation):

| Year | Error | Difference in Errors | Squared Difference in Errors |
|---|---|---|---|
| 1 | (0.07347) | | |
| 2 | 0.00334 | 0.07681 | 0.00590 |
| 3 | 0.03910 | 0.03576 | 0.00128 |
| 4 | 0.06218 | 0.02308 | 0.00053 |
| 5 | (0.01624) | (0.07842) | 0.00615 |
| 6 | (0.01242) | 0.00382 | 0.00001 |
| 7 | (0.03024) | (0.01781) | 0.00032 |
| 8 | 0.01860 | 0.04883 | 0.00238 |
| 9 | (0.10569) | (0.12429) | 0.01545 |
| 10 | (0.02644) | 0.07925 | 0.00628 |
| 11 | 0.04843 | 0.07487 | 0.00561 |
| 12 | 0.05728 | 0.00884 | 0.00008 |
| 13 | 0.02217 | (0.03511) | 0.00123 |
| 14 | (0.04075) | (0.06292) | 0.00396 |
| 15 | (0.03557) | 0.00519 | 0.00003 |
| 16 | 0.03397 | 0.06954 | 0.00484 |
| 17 | 0.02489 | (0.00909) | 0.00008 |
| 18 | (0.02697) | (0.05185) | 0.00269 |
| 19 | 0.08509 | 0.11206 | 0.01256 |
| 20 | (0.01728) | (0.10237) | 0.01048 |
| 21 | 0.01012 | 0.02740 | 0.00075 |
| 22 | 0.06964 | 0.05952 | 0.00354 |
| 23 | (0.07252) | (0.14216) | 0.02021 |
| 24 | (0.07460) | (0.00208) | 0.00000 |
| 25 | 0.05738 | 0.13198 | 0.01742 |

If we sum up the last column, we get 0.1218. If we then divide that by our ESS (the error sum of squares, that is, the sum of the squared errors) of 0.0625, we get a value of 1.95. What does this mean?

We have just computed what is known as the Durbin-Watson Statistic, which is used to detect the presence of autocorrelation. The Durbin-Watson statistic, d, can be anywhere from zero to 4. Generally, when d is close to zero, it suggests the presence of positive autocorrelation; a value close to 2 indicates no autocorrelation; while a value close to 4 indicates negative autocorrelation. In any case, you want your Durbin-Watson statistic to be as close to two as possible, and ours is.

Hence, our model seems to be free of autocorrelation.
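The calculation above can be sketched in a few lines of Python. This is only an illustration: the residuals are copied from the error table (predicted minus actual, with parentheses converted to minus signs), and the variable names are ours.

```python
# Residuals for years 1-25, copied from the error table above.
errors = [
    -0.07347, 0.00334, 0.03910, 0.06218, -0.01624,
    -0.01242, -0.03024, 0.01860, -0.10569, -0.02644,
    0.04843, 0.05728, 0.02217, -0.04075, -0.03557,
    0.03397, 0.02489, -0.02697, 0.08509, -0.01728,
    0.01012, 0.06964, -0.07252, -0.07460, 0.05738,
]

# Numerator: squared differences between consecutive residuals (24 terms,
# because the first year has no previous error to subtract).
sum_sq_diff = sum(
    (errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors))
)

# Denominator: error sum of squares over all 25 years.
ess = sum(e ** 2 for e in errors)

# Durbin-Watson statistic: close to 2 suggests no autocorrelation.
d = sum_sq_diff / ess

print(f"Sum of squared differences = {sum_sq_diff:.4f}")
print(f"ESS = {ess:.4f}")
print(f"Durbin-Watson d = {d:.2f}")
```

Because the table values are rounded to five decimals, the last digit of each result may differ slightly from a calculation on unrounded residuals.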

Now, Let’s Go Forecast!

Now that we have validated our model, and seen that it is free of autocorrelation, we can be comfortable forecasting. Let’s say that for years 26 and 27, we have the following forecasts for net revenues per deposit dollar, X1t, and number of S&L offices, X2t:

X1,26 = 4.70 and X2,26 = 9,350

X1,27 = 4.80 and X2,27 = 9,400

Plugging each of these into our equation, we generate the following forecasts:

Ŷ26 = 1.56450 + 0.23720 * 4.70 – 0.000249 * 9,350

=0.3504

Ŷ27 = 1.56450 + 0.23720 * 4.80 – 0.000249 * 9,400

=0.3617
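Here is a small sketch of the same plug-in step, using the coefficients exactly as printed in the equation. The function name is hypothetical. Note that with these rounded coefficients the results differ from the figures above in the third decimal place, presumably because the published forecasts were computed with unrounded coefficients.

```python
def forecast_margin(x1, x2, a=1.56450, b1=0.23720, b2=-0.000249):
    """Predicted percentage profit margin from the fitted equation.

    x1: net revenues per deposit dollar; x2: number of S&L offices.
    The default coefficients are the rounded values printed in the post.
    """
    return a + b1 * x1 + b2 * x2

# Assumed inputs for years 26 and 27, as given in the text.
y26 = forecast_margin(4.70, 9350)
y27 = forecast_margin(4.80, 9400)

print(f"Year 26 forecast: {y26:.4f}")
print(f"Year 27 forecast: {y27:.4f}")
```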

Next Week’s Forecast Friday Topic: The Effect of Omitting an Important Variable

Now that we’ve walked you through this process, you know how to forecast and run multiple regression. Next week, we will discuss what happens when a key independent variable is omitted from a regression model and all the problems it causes when we violate the regression assumption that “all relevant and no irrelevant independent variables are included in the model.” Next week’s post will show a complete demonstration of such an impact. Stay tuned!

Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the price of competitors’ products, the time of the year (if our product is seasonal), and also demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will be spending the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but is quite easy to forget if not used often.

Another thing to note: regression analysis is used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables on static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or some statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.

Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. Also, the coefficients (your parameter estimates) will have subscripts that correspond to their respective independent variables. Hence, your model will take on the form:

Yi = α + β1X1i + β2X2i + εi

Now, how do you go about computing α, β1 and β2? The process is similar to that of a two-variable model, but a little more involved. Take a look:

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.
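To make the computation concrete, here is a minimal sketch of the standard textbook least-squares formulas for two independent variables, written in deviation form (each value minus its mean). The notation may differ from the way the equations are usually typeset, and the variable names are ours. The data are the 25 S&L observations from the example later in this post.

```python
# S&L data from the example later in this post (25 annual observations).
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82, 0.75, 0.77, 0.78,
     0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42, 3.45, 3.58, 3.66,
      3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352, 6361, 6369, 6546,
      6672, 6890, 7115, 7327, 7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]


def mean(values):
    return sum(values) / len(values)


def deviations(values):
    m = mean(values)
    return [v - m for v in values]


dy, d1, d2 = deviations(y), deviations(x1), deviations(x2)

# Sums of squares and cross-products of the deviations.
s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))

# Both slopes share this denominator; note that X1 and X2 appear in each formula.
denominator = s11 * s22 - s12 ** 2
beta1 = (s1y * s22 - s2y * s12) / denominator
beta2 = (s2y * s11 - s1y * s12) / denominator
alpha = mean(y) - beta1 * mean(x1) - beta2 * mean(x2)

print(f"alpha = {alpha:.4f}, beta1 = {beta1:.4f}, beta2 = {beta2:.6f}")
```

The cross-product term s12 is what ties the two coefficients together: if X1 and X2 were completely unrelated, it would be zero and each slope would reduce to its simple-regression value.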

What do you notice about the formulas for computing β1 and β2? First, you notice that the independent variables, X1 and X2, are included in the calculation for each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are likely to be related linearly as well. In fact, they need to be in order to perform multiple regression analysis. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to those variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 will be incorrect because X1 has some relationship with X2. Moreover, X2’s values are likely to be accounted for in the error terms, and because of its relationship with X1, X1 will be correlated with the error term, violating the second assumption above. Hence, you will end up with an incorrect, biased estimator for your regression coefficient, β1.
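As a rough illustration of how severe this bias can be, the sketch below uses the S&L data from the example later in this post and estimates the coefficient on X1 twice: once with X2 left out (simple regression) and once with both variables included. The formulas are the standard least-squares expressions in deviation form, and the variable names are ours.

```python
# S&L data from the example later in this post (25 annual observations).
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82, 0.75, 0.77, 0.78,
     0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42, 3.45, 3.58, 3.66,
      3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49, 4.70, 4.58, 4.69, 4.71, 4.78]
x2 = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352, 6361, 6369, 6546,
      6672, 6890, 7115, 7327, 7546, 7931, 8097, 8468, 8717, 8991, 9179, 9318]


def deviations(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


dy, d1, d2 = deviations(y), deviations(x1), deviations(x2)

s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))

# X2 omitted: simple regression slope of Y on X1 alone.
beta1_x2_omitted = s1y / s11

# X2 included: multiple regression slope on X1.
beta1_full_model = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

print(f"beta1 with X2 omitted:  {beta1_x2_omitted:.4f}")
print(f"beta1 with X2 included: {beta1_full_model:.4f}")
```

With these data, leaving X2 out does more than distort the size of the estimate: the coefficient on X1 comes out negative, while the full model gives a positive one.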

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, and that reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable – both dependent and independent – you are considering, you should have at least five data points. So, for a model with three independent variables (four variables in all, counting the dependent variable), your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re doing a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables. Every Fahrenheit temperature corresponds to exactly one Celsius temperature (F = 1.8C + 32), so the second variable adds no new information. In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. A relationship between independent variables need not be perfectly linear for multicollinearity to exist. Highly correlated variables can do the same thing. For example, “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage,” are pairs of independent variables that are likely to be highly correlated.

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.
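The temperature example can be sketched numerically. In the standard deviation-form least-squares formulas for two independent variables (a textbook result, stated here as an assumption rather than taken from this post), both slope estimates share the denominator Σx1²Σx2² − (Σx1x2)². The made-up temperatures below show that this denominator collapses to zero when one variable is an exact linear function of the other, so the coefficients cannot be computed at all.

```python
# Hypothetical temperature readings, for illustration only.
celsius = [-10.0, 0.0, 5.0, 12.5, 20.0, 30.0]
fahrenheit = [1.8 * c + 32 for c in celsius]   # exact linear function of Celsius


def deviations(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


dc, df = deviations(celsius), deviations(fahrenheit)

s_cc = sum(a * a for a in dc)
s_ff = sum(b * b for b in df)
s_cf = sum(a * b for a, b in zip(dc, df))

# Correlation between the two "independent" variables.
correlation = s_cf / (s_cc * s_ff) ** 0.5

# Shared denominator of the two slope formulas.
denominator = s_cc * s_ff - s_cf ** 2

print(f"Correlation: {correlation:.6f}")
print(f"Denominator relative to its terms: {denominator / (s_cc * s_ff):.2e}")
```

With highly but not perfectly correlated variables the denominator is small rather than zero, which is why the estimates can still be computed but become unstable.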

Now Can We Get Into Multiple Regression????

Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. This data set consists of 25 annual observations used to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X1) and number of offices (X2). The data are as follows:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|---|---|---|---|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S&L is making more revenue on the same dollar, it suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up? They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R², and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.
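As a quick check on that figure, here is a minimal sketch of the usual calculation, R² = 1 − ESS/TSS. It takes the actual profit margins from the data table and assumes the error sum of squares (ESS) of 0.0625 quoted for this model in the autocorrelation discussion elsewhere in the series. The variable names are ours.

```python
# Actual percentage profit margins (Yt) for years 1-25, from the data table.
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82, 0.75, 0.77, 0.78,
     0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63, 0.56, 0.41, 0.51, 0.47, 0.32]

# Error sum of squares for the fitted model (rounded value, assumed input).
ess = 0.0625

# Total sum of squares: variation of Y around its own mean.
y_mean = sum(y) / len(y)
tss = sum((value - y_mean) ** 2 for value in y)

# Share of the total variation that the model explains.
r_squared = 1 - ess / tss

print(f"Mean of Y = {y_mean:.4f}")
print(f"TSS = {tss:.4f}")
print(f"R-squared = {r_squared:.3f}")
```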

Are the Parameter Estimates Statistically Significant?

We have 25 observations, and three parameters – two coefficients for the independent variables, and one intercept – hence we have 22 degrees of freedom (25-3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the confidence intervals around our parameter estimates would contain the true parameter approximately 95 times. To test significance, we need to look at the t-values for each parameter estimate. For a two-tailed test at the 95% confidence level with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, then there is a statistically significant positive relationship between the independent variable and the dependent variable; if the t-statistic for the parameter estimate is less than -2.074, then there is a statistically significant negative relationship. This is what we get:

| Parameter | Value | T-Statistic | Significant? |
|---|---|---|---|
| Intercept | 1.5645000 | 19.70 | Yes |
| B1t | 0.2372000 | 4.27 | Yes |
| B2t | (0.0002490) | (7.77) | Yes |

So, yes, all our parameter estimates are significant.
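The decision rule is simple enough to sketch in code. The t-statistics and the critical value of 2.074 are taken from the text above; the dictionary keys and variable names are ours, for illustration only.

```python
n_observations = 25
n_parameters = 3    # intercept plus two slope coefficients
degrees_of_freedom = n_observations - n_parameters   # 22

critical_t = 2.074  # two-tailed, 95% confidence, 22 degrees of freedom (from the text)

# t-statistics from the table above (parentheses converted to a minus sign).
t_statistics = {
    "Intercept": 19.70,
    "B1 (net revenues per deposit dollar)": 4.27,
    "B2 (number of offices)": -7.77,
}

# A parameter is significant when its t-statistic lies outside +/- critical_t.
results = {name: abs(t) > critical_t for name, t in t_statistics.items()}

for name, significant in results.items():
    direction = "positive" if t_statistics[name] > 0 else "negative"
    verdict = "significant" if significant else "not significant"
    print(f"{name}: t = {t_statistics[name]:.2f} ({direction}, {verdict})")
```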

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished. We’re going to stop here and continue with further analysis of this example next week. Next week, we will discuss computing the 95% confidence interval for the parameter estimates; determining whether the model is valid; and checking for autocorrelation. The following Forecast Friday (July 1) blog post will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.