(Thirteenth in a series)
In last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history of 20 contributors to a nonprofit organization, and developed a model based on the recency, frequency, and monetary value (RFM) of their past donations. We derived the following regression equation:
We were pleased to see that our model had a coefficient of determination – or R^{2}=0.933, indicating that our model explained 93.3% of the change in the donor’s current contribution (our Ŷ). But we were a little disheartened when we looked at the tstatistics of each of our regression coefficients. Recall that we found our recency coefficient was not significant:
Parameter 
Coefficient 
Tstatistic 
Significant? 
Intercept 
87.27 
4.32 
Yes 
Months since Last 
(1.80) 
(1.44) 
No 
Times Donated 
2.45 
2.87 
Yes 
Average Contribution 
0.35 
3.26 
Yes 
Yet, most direct marketing professionals know clearly that RFM theory postulates that all three variables are significant indicators of whether and how much a donor will give (or a customer will buy). When our model doesn’t replicate what a tried and true theory has long maintained, there could possibly be something wrong.
Multicollinearity
Most times, when something doesn’t look right in the results of a regression model, it is safe to assume that one of the regression assumptions has been violated. The problem is trying to determine which assumption – or assumptions – was violated. Since the coefficient for “Months Since Last Contribution” has a tstatistic that indicates it isn’t statistically significant, we might suspect that the specification assumption is violated: that is, we may believe that “Months Since Last Contribution” is an extraneous, irrelevant variable that should not have been included in the model and, thus, be removed.
But is that really the case? There can be other reasons why a parameter estimate does not come up significant. If two or more independent variables are highly correlated, the resulting multicollinearity can cause the regression model to assign a statistically insignificant parameter estimate to an important independent variable. So, how can we detect multicollinearity?
Detecting Multicollinearity: Correlation Matrix
The first step in detecting multicollinearity is to examine the correlation among the independent variables. We do this by looking at a correlation matrix. You can run a correlation matrix in Excel by using its Data Analysis ToolPak. Looking at the correlation matrix for our variables, we find:
Correlation Matrix – Original Variables 

Variable 
Contribution Y 
Months Since Last Donation X_{1} 
Times Donated in last 12 months X_{2} 
Average Contribution in last 12 months X_{3} 
Contribution (Y) 
1.00 



Months Since Last Donation – X_{1} 
0.93 
1.00 


Times Donated in last 12 months – X_{2} 
0.89 
0.88 
1.00 

Average Contribution Last 12 mo. – X_{3} 
0.88 
0.84 
0.69 
1.00 
A correlation of 1.00 means two variables are perfectly correlated; a correlation of 0.00 means there is absolutely no correlation. The cells in the matrix above, where the correlation is 1.00, shows the correlation of an independent variable with itself – we would expect a perfectly correlated relationship. What is most important to us are the numbers below the 1.00 correlations. The first column shows our dependent variable, “Contribution”. As you go down the column, row by row, you see that each of our independent variables is strongly correlated with the dependent variable, indicating that they are all strong predictors.
The correlation between “Months Since Last Donation” (X_{1}) and the donor’s Contribution (Y) shows a correlation that is almost perfectly negative (0.93), while those correlations of the dependent variable with each of the other two independent variables is almost perfectly positive with the contribution (0.89 and 0.88). When writing these in shorthand, we use the Greek letter rho, ρ, to denote correlation. Hence, to show the correlation between each independent variable with the dependent variable, we would express them as follows:
ρX_{1}Y = 0.93 
ρX_{2}Y = 0.89 
ρX_{3}Y = 0.88 
But now, let’s look at the correlations among our independent variables:
ρX_{1}X_{2}= 0.88 
ρX_{1}X_{3}= 0.84 
ρX_{2}X_{3}= 0.69 
Notice that all of our independent variables are highly correlated with one another. The relationship between “Times Donated in Last 12 Months” and “Average Contribution in Last 12 Months” is not as strong as the correlation between those individual variables with “Months Since Last Donation,” but the correlation is still very strong.
Hence, we can conclude that multicollinearity is present in this model.
Correcting Multicollinearity: Dropping Variables
In today’s post, we will discuss one of the remedies for multicollinearity – dropping a highly correlated independent variable. Next week, we’ll discuss the other approaches to correcting multicollinearity. Sometimes, when a variable is “iffy,” we can save ourselves some trouble and just kick it out. If we were to ignore “Months Since Last Donation,” and run our regression with the remaining two variables, we end up with the following regression equation:
Ŷ= 60.68 + 3.37X_{2} + 0.45X_{3}
We get R^{2} =0.924, suggesting that we didn’t lose much explanatory power by excluding “Months Since Last Donation.” We also get an F statistic of 103.36, much higher than the 73.90 we had in our original model. A higher Fstatistic indicates a model that is more statistically valid. It also reflects the exclusion of one or more extraneous variables. Also, the tstatistics for both independent variables are significant, and they’re even higher than they were in the original model, further indicating increased validity:
Parameter 
Coefficient 
Tstatistic 
Significant? 
Intercept 
60.68 
7.24 
Yes 
Times Donated 
3.37 
5.83 
Yes 
Average Contribution 
0.45 
5.49 
Yes 
Dropping “Months Since Last Donation” from our analysis worked here. However, dropping variables without a rational decision process can cause new problems. In some cases, dropping a variable can result in specification bias, as we saw in our previous example of predicting profit margin for savings and loan associations a few weeks ago. So, consider dropping variables cautiously.
Next Forecast Friday Topic: More Multicollinearity Remedies
Today, we described one of the ways to remedy multicollinearity – dropping variables. Next week, we will explore two other ways of correcting multicollinearity: obtaining more data and transforming variables. We will also discuss the pitfalls of all three of these remedies, and we will discuss when it’s not worth it to reduce the impact of multicollinearity.
*************************************
Let Analysights Take the Pain out of Forecasting!
Multicollinearity is but one of the many problems you can encounter when forecasting. Let Analysights walk you through the forecasting process so that you can spend more time making strategic decisions and less time trying to guess first where business is going. We will make your forecasting efforts seamless, so you can concentrate on running your business. Check out our Web site or call (847) 8952565.
Tags: correlation, correlation matrix, dependent variable, ftest, forecast, Forecast Friday, Forecasting, independent variable, modeling, multicollinearity, multiple linear regression, multiple regression, multiple regression analysis, parameter estimates, predictive modeling, regression, regression analysis, RFM, RFM analysis, specification bias, statistical modeling, statistical significance, tstatistic
July 19, 2010 at 8:19 am 
Having large correlations is an indicator of collinearity but is not a requirement.
July 22, 2010 at 12:08 am 
[…] last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect […]
July 17, 2014 at 4:57 pm 
Spot on with this writeup, I honestly believe
this site needs much more attention. I’ll probably be returning to read more, thanks for the advice!