(Eleventh in a series)
In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.
Revisiting our Multiple Regression Example
Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:
| Year | Percentage Profit Margin (Y_{t}) | Net Revenues Per Deposit Dollar (X_{1t}) | Number of Offices (X_{2t}) |
|------|----------------------------------|------------------------------------------|----------------------------|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |
Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95. Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137.
Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Y_{t}) was positively related to net revenues per deposit dollar (one of our independent variables, X_{1t}), and negatively related to the number of S&L offices (our other independent variable, X_{2t}). When we ran our regression, we got the following model:
Y_{t} = 1.56450 + 0.23720X_{1t} – 0.000249X_{2t}
We also checked to see if the model parameters were significant, and obtained the following information:
| Parameter | Value | T-Statistic | Significant? |
|-----------|------------|-------------|--------------|
| Intercept | 1.5645000 | 19.70 | Yes |
| B_{1t} | 0.2372000 | 4.27 | Yes |
| B_{2t} | -0.0002490 | -7.77 | Yes |
We also had a coefficient of determination – R^{2} – of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.
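As a quick check, you can reproduce these coefficients from the table with ordinary least squares. Here is a minimal sketch in Python with NumPy (the tooling choice is mine; the original analysis was not necessarily run this way):

```python
import numpy as np

# The 25 annual observations from the table above
Y  = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
               0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
               0.56, 0.41, 0.51, 0.47, 0.32])
X1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])
X2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)

# Design matrix: intercept column, X1, X2
X = np.column_stack([np.ones_like(Y), X1, X2])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# R-squared from the residuals
resid = Y - X @ beta
r2 = 1.0 - resid @ resid / np.sum((Y - Y.mean()) ** 2)
print(beta)  # roughly [1.5645, 0.2372, -0.000249]
print(r2)    # roughly 0.865
```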
Welcome to the World of Specification Bias…
Let’s deliberately leave out the number of S&L offices (X_{2t}) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:
Y_{t} = 1.32616 – 0.16913X_{1t}
We also get an R^{2} of 0.495. The t-statistics for our intercept and parameter B_{1t} are as follows:
| Parameter | Value | T-Statistic | Significant? |
|-----------|----------|-------------|--------------|
| Intercept | 1.32616 | 9.57 | Yes |
| B_{1t} | -0.16913 | -4.75 | Yes |
Compare these new results with our previous ones. What do you notice? The second regression stands in sharp contrast to the first. Our new model has far less explanatory power – R^{2} dropped from 0.865 to 0.495 – and the sign of the parameter estimate for net revenues per deposit dollar has changed: the coefficient of X_{1t} was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.
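The sign flip is easy to reproduce yourself: drop X_{2t} from the design matrix and refit. A minimal sketch with NumPy (again, my choice of tooling, not the article's):

```python
import numpy as np

# The same 25 observations from the table above
Y  = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
               0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
               0.56, 0.41, 0.51, 0.47, 0.32])
X1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])

# Simple regression: intercept and X1 only, with X2 deliberately omitted
X = np.column_stack([np.ones_like(Y), X1])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

resid = Y - X @ beta
r2 = 1.0 - resid @ resid / np.sum((Y - Y.mean()) ** 2)
print(beta)  # roughly [1.32616, -0.16913]: the slope is now negative
print(r2)    # roughly 0.495
```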
… and to the Land of Autocorrelation…
Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated with one another, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week, we computed the Durbin-Watson test statistic, d, which is an indicator of autocorrelation. It is bad to have either positive autocorrelation (d close to zero), or negative autocorrelation (d close to 4). Generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was pretty much nonexistent. In our second model, d=0.85, suggesting the presence of significant positive autocorrelation!
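The Durbin-Watson statistic itself is simple to compute from the residuals: d is the sum of squared successive residual differences divided by the residual sum of squares. A sketch in NumPy (my tooling choice) that fits both models on the table's data and compares their d values:

```python
import numpy as np

# The same 25 observations from the table above
Y  = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
               0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
               0.56, 0.41, 0.51, 0.47, 0.32])
X1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])
X2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)

def ols_residuals(cols, y):
    """Fit OLS with an intercept on the given predictor columns; return residuals."""
    X = np.column_stack([np.ones_like(y)] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def durbin_watson(e):
    """d = sum of squared successive differences of residuals / residual sum of squares."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

d_full    = durbin_watson(ols_residuals([X1, X2], Y))
d_reduced = durbin_watson(ols_residuals([X1], Y))
print(d_full, d_reduced)  # roughly 1.95 and 0.85, matching the values above
```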
How did this happen? Basically, when an important variable is omitted from a regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.
…Which Leads to Yet Another Violation!
The presence of autocorrelation in our second regression reveals the presence of another violation, not in the incomplete regression, but in the full regression. Recall the sentence above: “if the omitted independent variable is correlated with any of the included independent variables…” Remember the other assumption: “no linear relationship between two or more independent variables?” Basically, the detection of autocorrelation in the incomplete regression revealed that the full regression violated this very assumption – and thus exhibits multicollinearity! Generally, a coefficient that flips sign when one or more variables are omitted is an indicator of multicollinearity.
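You can see the collinearity directly in the table's data: the two predictors move together. A quick NumPy check (the variance inflation factor formula, VIF = 1/(1 − r²), is the standard diagnostic for a two-predictor model; the tooling choice is mine):

```python
import numpy as np

# The two independent variables from the table above
X1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])
X2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)

# Pairwise correlation between the predictors
r = np.corrcoef(X1, X2)[0, 1]

# Variance inflation factor; values well above 1 signal multicollinearity
vif = 1.0 / (1.0 - r ** 2)
print(r)    # strongly positive: the two predictors rise and fall together
print(vif)
```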
So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions, of observations. However, with time series data, the number of observations available is limited to how many periods of data have been recorded.
Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.
And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!
Bottom Line:
The lesson for us as forecasters and analysts, therefore, is that we must accept that models are far from perfect, and we must weigh the impact of various regression model specifications. Is the multicollinearity present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or even forecast accuracy? Building the model is easy – it’s these normative considerations that are challenging.
Next Forecast Friday Topic: Building Regression Models Using Excel
In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4^{th} of July weekend!
*************************
Analysights is now on Facebook!
Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page.