
Forecast Friday Topic: Multicollinearity – How to Detect it; How to Correct it

July 15, 2010

(Thirteenth in a series)

In last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history of 20 contributors to a nonprofit organization, and developed a model based on the recency, frequency, and monetary value (RFM) of their past donations. We derived the following regression equation:

Ŷ = 87.27 – 1.80X1 + 2.45X2 + 0.35X3

where X1 is the number of months since the donor’s last donation, X2 is the number of times the donor gave in the last 12 months, and X3 is the donor’s average contribution over that period.

We were pleased to see that our model had a coefficient of determination – R² = 0.933 – indicating that our model explained 93.3% of the variation in the donors’ current contributions (our Ŷ). But we were a little disheartened when we looked at the t-statistics of each of our regression coefficients. Recall that we found our recency coefficient was not significant:

Parameter              Coefficient    T-statistic    Significant?
Intercept              87.27          4.32           Yes
Months since Last      (1.80)         (1.44)         No
Times Donated          2.45           2.87           Yes
Average Contribution   0.35           3.26           Yes

Yet most direct marketing professionals know that RFM theory postulates all three variables as significant indicators of whether and how much a donor will give (or a customer will buy). When our model fails to replicate what a tried-and-true theory has long maintained, something may well be wrong.

Multicollinearity

Most times, when something doesn’t look right in the results of a regression model, it is safe to assume that one of the regression assumptions has been violated. The problem is determining which assumption – or assumptions – has been violated. Since the coefficient for “Months Since Last Contribution” has a t-statistic indicating it isn’t statistically significant, we might suspect that the specification assumption is violated: that is, we may believe that “Months Since Last Contribution” is an extraneous, irrelevant variable that should not have been included in the model and thus should be removed.

But is that really the case? There can be other reasons why a parameter estimate does not come up significant. If two or more independent variables are highly correlated, the resulting multicollinearity can cause the regression model to assign a statistically insignificant parameter estimate to an important independent variable. So, how can we detect multicollinearity?

Detecting Multicollinearity: Correlation Matrix

The first step in detecting multicollinearity is to examine the correlation among the independent variables. We do this by looking at a correlation matrix. You can run a correlation matrix in Excel by using its Data Analysis ToolPak. Looking at the correlation matrix for our variables, we find:

Correlation Matrix – Original Variables

Variable                                       Y        X1       X2       X3
Contribution (Y)                               1.00
Months Since Last Donation (X1)               -0.93     1.00
Times Donated in Last 12 Months (X2)           0.89    -0.88     1.00
Average Contribution in Last 12 Months (X3)    0.88    -0.84     0.69     1.00

A correlation of 1.00 means two variables are perfectly correlated; a correlation of 0.00 means there is no linear relationship at all. The cells in the matrix above where the correlation is 1.00 show the correlation of a variable with itself – which, of course, is perfect. What is most important to us are the numbers below those 1.00 correlations. The first column shows our dependent variable, “Contribution.” As you go down that column, row by row, you see that each of our independent variables is strongly correlated with the dependent variable, indicating that they are all strong predictors.

The correlation between “Months Since Last Donation” (X1) and the donor’s contribution (Y) is strongly negative (-0.93), while the correlations of the contribution with each of the other two independent variables are strongly positive (0.89 and 0.88). In shorthand, we use the Greek letter rho, ρ, to denote correlation. Hence, to show the correlation of each independent variable with the dependent variable, we would write:

ρX1Y = -0.93

ρX2Y = 0.89

ρX3Y = 0.88

But now, let’s look at the correlations among our independent variables:

ρX1X2 = -0.88

ρX1X3 = -0.84

ρX2X3 = 0.69

 

Notice that all of our independent variables are highly correlated with one another. The correlation between “Times Donated in Last 12 Months” and “Average Contribution in Last 12 Months” is not as strong as each of those variables’ correlation with “Months Since Last Donation,” but it is still very strong.

Hence, we can conclude that multicollinearity is present in this model.
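If you prefer to compute the correlation matrix outside Excel, here is a minimal sketch in Python using pandas. The file name donors.csv and the column names are illustrative assumptions on my part – the 20-donor data set isn’t reproduced in this post – so substitute your own data:

```python
import pandas as pd

# Hypothetical file and column names -- substitute your own donor data.
# contribution = Y, months_since_last = X1, times_donated = X2, avg_contribution = X3
donors = pd.read_csv("donors.csv")

cols = ["contribution", "months_since_last", "times_donated", "avg_contribution"]

# Pairwise Pearson correlations among the dependent and independent variables
corr_matrix = donors[cols].corr()
print(corr_matrix.round(2))

# A common informal rule of thumb (an assumption here, not from the post):
# flag pairs of *independent* variables whose correlation is high in magnitude
predictors = cols[1:]
print(corr_matrix.loc[predictors, predictors].abs() > 0.8)
```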

Correcting Multicollinearity: Dropping Variables

In today’s post, we will discuss one of the remedies for multicollinearity – dropping a highly correlated independent variable. Next week, we’ll discuss the other approaches to correcting multicollinearity. Sometimes, when a variable is “iffy,” we can save ourselves some trouble and just kick it out. If we were to ignore “Months Since Last Donation,” and run our regression with the remaining two variables, we end up with the following regression equation:

Ŷ = 60.68 + 3.37X2 + 0.45X3

We get R² = 0.924, suggesting that we didn’t lose much explanatory power by excluding “Months Since Last Donation.” We also get an F-statistic of 103.36, much higher than the 73.90 we had in our original model; a higher F-statistic indicates stronger overall statistical significance for the model, and here it also reflects the exclusion of an extraneous variable. In addition, the t-statistics for both independent variables are significant – and even higher than they were in the original model – further indicating increased validity:

Parameter              Coefficient    T-statistic    Significant?
Intercept              60.68          7.24           Yes
Times Donated          3.37           5.83           Yes
Average Contribution   0.45           5.49           Yes

Dropping “Months Since Last Donation” from our analysis worked here. However, dropping variables without a rational decision process can cause new problems. In some cases, dropping a variable can result in specification bias, as we saw in our previous example of predicting profit margin for savings and loan associations a few weeks ago. So, consider dropping variables cautiously.
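For readers who want to try this outside Excel, here is a minimal sketch of refitting the reduced model in Python with statsmodels, using the same assumed donors.csv layout as the earlier sketch. The figures reported in this post came from Excel, so treat this purely as an illustration of the workflow:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file/column names (the 20-donor data set is not reproduced here)
donors = pd.read_csv("donors.csv")

y = donors["contribution"]

# Drop "Months Since Last Donation" (X1) and keep X2 and X3
X_reduced = sm.add_constant(donors[["times_donated", "avg_contribution"]])
reduced = sm.OLS(y, X_reduced).fit()

print(reduced.params)     # intercept and slope estimates
print(reduced.rsquared)   # compare against the full model's R-squared
print(reduced.tvalues)    # t-statistics for each parameter
print(reduced.fvalue)     # overall F-statistic
```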

Next Forecast Friday Topic: More Multicollinearity Remedies

Today, we described one of the ways to remedy multicollinearity – dropping variables. Next week, we will explore two other ways of correcting multicollinearity: obtaining more data and transforming variables. We will also discuss the pitfalls of all three of these remedies, and we will discuss when it’s not worth it to reduce the impact of multicollinearity.

*************************************

Let Analysights Take the Pain out of Forecasting!

Multicollinearity is but one of the many problems you can encounter when forecasting. Let Analysights walk you through the forecasting process so that you can spend more time making strategic decisions and less time trying to guess where business is going. We will make your forecasting efforts seamless, so you can concentrate on running your business. Check out our Web site or call (847) 895-2565.

Multiple Regression: Specification Bias

July 1, 2010

(Eleventh in a series)

In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.

Revisiting our Multiple Regression Example

Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:

Year   Percentage Profit Margin (Yt)   Net Revenues Per Deposit Dollar (X1t)   Number of Offices (X2t)
1      0.75                            3.92                                    7,298
2      0.71                            3.61                                    6,855
3      0.66                            3.32                                    6,636
4      0.61                            3.07                                    6,506
5      0.70                            3.06                                    6,450
6      0.72                            3.11                                    6,402
7      0.77                            3.21                                    6,368
8      0.74                            3.26                                    6,340
9      0.90                            3.42                                    6,349
10     0.82                            3.42                                    6,352
11     0.75                            3.45                                    6,361
12     0.77                            3.58                                    6,369
13     0.78                            3.66                                    6,546
14     0.84                            3.78                                    6,672
15     0.79                            3.82                                    6,890
16     0.70                            3.97                                    7,115
17     0.68                            4.07                                    7,327
18     0.72                            4.25                                    7,546
19     0.55                            4.41                                    7,931
20     0.63                            4.49                                    8,097
21     0.56                            4.70                                    8,468
22     0.41                            4.58                                    8,717
23     0.51                            4.69                                    8,991
24     0.47                            4.71                                    9,179
25     0.32                            4.78                                    9,318

Data taken from Spellman, L. J., “Entry and profitability in a rate-free savings and loan market,” Quarterly Review of Economics and Business 18, no. 2 (1978): 87-95. Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd ed., Cincinnati (1994): 136-137.

Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Yt) was positively related to net revenues per deposit dollar (one of our independent variables, X1t), and negatively related to the number of S&L offices (our other independent variable, X2t). When we ran our regression, we got the following model:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

We also checked to see if the model parameters were significant, and obtained the following information:

Parameter    Value          T-Statistic    Significant?
Intercept    1.5645000      19.70          Yes
B1t          0.2372000      4.27           Yes
B2t          (0.0002490)    (7.77)         Yes

We also had a coefficient of determination – R² – of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.
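Because the full 25-observation data set appears in the table above, you can re-estimate this model yourself. Here is a minimal sketch in Python using statsmodels; it should reproduce the coefficients, t-statistics, and R² reported above to within rounding:

```python
import numpy as np
import statsmodels.api as sm

# S&L data from the table above (Spellman, 1978; 25 annual observations)
y = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
              0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
              0.56, 0.41, 0.51, 0.47, 0.32])                 # percentage profit margin (Yt)
x1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])                # net revenues per deposit dollar (X1t)
x2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)   # number of offices (X2t)

# Full model: regress Yt on X1t and X2t with an intercept
X_full = sm.add_constant(np.column_stack([x1, x2]))
full = sm.OLS(y, X_full).fit()

print(full.params)    # reported above as roughly 1.5645, 0.2372, -0.000249
print(full.tvalues)   # roughly 19.70, 4.27, -7.77
print(full.rsquared)  # roughly 0.865
```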

Welcome to the World of Specification Bias…

Let’s deliberately leave out the number of S&L offices (X2t) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:

Yt = 1.32616 – 0.16913X1t

We also get an R² of 0.495. The t-statistics for our intercept and parameter B1t are as follows:

Parameter    Value        T-Statistic    Significant?
Intercept    1.32616      9.57           Yes
B1t          (0.16913)    (4.75)         Yes

Compare these new results with our previous results, and what do you notice? The results of our second regression are in sharp contrast to those of our first. Our new model has far less explanatory power – R² dropped from 0.865 to 0.495 – and the sign of the parameter estimate for net revenues per deposit dollar has changed: the coefficient of X1t was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.
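To see the contrast for yourself, refit the model with X1t alone. This short continuation reuses the y and x1 arrays defined in the sketch above:

```python
import statsmodels.api as sm

# 'y' and 'x1' are the arrays defined in the full-model sketch above
X_incomplete = sm.add_constant(x1)
incomplete = sm.OLS(y, X_incomplete).fit()

print(incomplete.params)    # reported above as roughly 1.32616 and -0.16913 -- the sign flips
print(incomplete.tvalues)   # roughly 9.57 and -4.75
print(incomplete.rsquared)  # roughly 0.495, down from 0.865
```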

… and to the Land of Autocorrelation…

Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week we computed the Durbin-Watson test statistic, d, an indicator of autocorrelation. Either positive autocorrelation (d close to zero) or negative autocorrelation (d close to 4) is bad; generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was essentially nonexistent. In our second model, d = 0.85, suggesting the presence of significant positive autocorrelation!

How did this happen? Basically, when an important variable is omitted from a regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.
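Continuing with the two fitted models from the sketches above, you can check this by computing d from the residuals – the sum of squared successive differences of the residuals divided by the sum of squared residuals – or by using the helper function in statsmodels:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals (about 2 means no
    autocorrelation, near 0 positive, near 4 negative)."""
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# 'full' and 'incomplete' are the fitted models from the two sketches above
print(dw(full.resid), durbin_watson(full.resid))              # full model: reported as about 1.95
print(dw(incomplete.resid), durbin_watson(incomplete.resid))  # incomplete model: about 0.85
```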

…Which Leads to Yet Another Violation!

The presence of autocorrelation in our second regression reveals another violation – not in the incomplete regression, but in the full regression. As the sentence above read: “If the omitted independent variable is correlated with any of the included independent variables…” Remember the other assumption: “no linear relationship between two or more independent variables”? Basically, the detection of autocorrelation in the incomplete regression revealed that the full regression violated this very assumption – and thus exhibited multicollinearity! Generally, a coefficient changing sign in either direction when one or more variables is omitted is an indicator of multicollinearity.

So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions, of observations. However, with time series data, the number of observations available is limited by how many periods of data have been recorded.

Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.

And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!

Bottom Line:

The lesson for us as forecasters and analysts, therefore, is that we must accept that models are far from perfect, and we must weigh the impact of various regression model specifications. Is the multicollinearity present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or even forecast accuracy? Building the model is easy – it’s these normative considerations that are challenging.

Next Forecast Friday Topic: Building Regression Models Using Excel

In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4th of July weekend!

*************************

Analysights is now on Facebook!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page.