Multiple Regression: Specification Bias

(Eleventh in a series)

In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.

Revisiting our Multiple Regression Example

Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:

Year

Percentage Profit Margin (Yt)

Net Revenues Per Deposit Dollar (X1t)

Number of Offices (X2t)

1

0.75

3.92

7,298

2

0.71

3.61

6,855

3

0.66

3.32

6,636

4

0.61

3.07

6,506

5

0.70

3.06

6,450

6

0.72

3.11

6,402

7

0.77

3.21

6,368

8

0.74

3.26

6,340

9

0.90

3.42

6,349

10

0.82

3.42

6,352

11

0.75

3.45

6,361

12

0.77

3.58

6,369

13

0.78

3.66

6,546

14

0.84

3.78

6,672

15

0.79

3.82

6,890

16

0.70

3.97

7,115

17

0.68

4.07

7,327

18

0.72

4.25

7,546

19

0.55

4.41

7,931

20

0.63

4.49

8,097

21

0.56

4.70

8,468

22

0.41

4.58

8,717

23

0.51

4.69

8,991

24

0.47

4.71

9,179

25

0.32

4.78

9,318

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Yt) was positively related to net revenues per deposit dollar (one of our independent variables, X1t), and negatively related to the number of S&L offices (our other independent variable, X2t). When we ran our regression, we got the following model:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

We also checked to see if the model parameters were significant, and obtained the following information:

Parameter

Value

T-Statistic

Significant?

Intercept

1.5645000

19.70

Yes

B1t

0.2372000

4.27

Yes

B2t

(0.0002490)

(7.77)

Yes

We also had a coefficient of determination – R2 – of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.

Welcome to the World of Specification Bias…

Let’s deliberately leave out the number of S&L offices (X2t) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:

Yt = 1.32616 – 0.16913X1t

We also get an R2 of 0.495. The t-statistics for our intercept and parameter B1t are as follows:

Parameter

Value

T-Statistic

Significant?

Intercept

1.32616

9.57

Yes

B1t

(0.16913)

(4.75)

Yes

 

Compare these new results with our previous results and what do you notice? The results of our second regression are in sharp contrast to those of our first regression. Our new model has far less explanatory power – R2 dropped from 0.865 to 0.495 – and the sign of the parameter estimate for net revenue per deposit dollar has changed: The coefficient of X1t was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.

… and to the Land of Autocorrelation…

Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated with one another, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week, we computed the Durbin-Watson test statistic, d, which is an indicator of autocorrelation. It is bad to have either positive autocorrelation (d close to zero), or negative autocorrelation (d close to 4). Generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was pretty much nonexistent. In our second model, d=0.85, suggesting the presence of significant positive autocorrelation!

How did this happen? Basically, when an important variable is omitted from regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.

…Which Leads to Yet Another Violation!

The presence of autocorrelation in our second regression reveals the presence of another violation, not in the incomplete regression, but in the full regression. As the sentence above read: “if the independent variable is correlated with any of the included independent variables…” Remember the other assumption: “no linear relationship between two or more independent variables?” Basically, the detection of autocorrelation in the incomplete regression revealed that the full regression violated this very assumption – and thus exhibits multicollinearity! Generally, a coefficient changing between positive and negative (either direction) when one or more variables is omitted is an indicator of multicollinearity.

So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions of observations. However, with time series data, the number of observations available is limited to how many periods of data have been recorded.

Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.

And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!

Bottom Line:

The lesson, for us as forecasters and analysts, therefore is that we must accept that models are far from perfect and we must weigh the impact of various regression model specifications. Is the multicollinearity that is present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or even forecast accuracy? Building the model is easy – but it’s these normative considerations that’s challenging.

Next Forecast Friday Topic: Building Regression Models Using Excel

In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4th of July weekend!

*************************

Analysights is now on Facebook!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page.

Tags: , , , , , , , , , , , , , , , , , , , , , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: