Posts Tagged ‘correlation matrix’

Forecast Friday Topic: Selecting the Variables for a Regression

October 14, 2010

(Twenty-fifth in a series)

When it comes to building a regression model, for many companies there’s good news and bad news. The good news: there are plenty of independent variables from which to choose. The bad news: there are plenty of independent variables from which to choose! While it may be possible to run a regression with all possible independent variables, each one included in your model reduces your degrees of freedom and increases the risk that the model overfits the data on which it is built, resulting in less reliable forecasts when new data is introduced.

So how do you come up with your short list of independent variables?

Some analysts have tried plotting the dependent variable (Y) against each individual independent variable (Xi) and selecting those that show some noticeable relationship. Another tried method is to produce a correlation matrix of all the independent variables and, if a large correlation between two of them is discovered, drop one from consideration (so as to avoid multicollinearity). Still another approach has been to perform a multiple linear regression on all possible explanatory variables and then drop those whose t-values are insignificant. These approaches are often selected because they are quick and simple, but they are not reliable for coming up with a decent regression model.

Stepwise Regression

Other approaches are a bit more complex, but more reliable. Perhaps the most common of these approaches is stepwise regression. Stepwise regression works by first identifying the independent variable with the highest correlation with the dependent variable. Once that variable is identified, a one-variable regression model is run. The residuals of that model are then obtained. Recall from previous Forecast Friday posts that if an important variable is omitted from a regression model, its effect on the dependent variable gets factored into the residuals. Hence, the next step in a stepwise regression is to identify the one unselected independent variable with the highest correlation with the residuals. Now you have your second independent variable, and you run a two-variable regression model. You then look at the residuals of that model and select the independent variable with the highest correlation with them, and so forth. Repeat the process until no remaining variable improves the model enough to justify adding it.
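The procedure described above can be sketched in a few lines of Python with NumPy. This is only a rough illustration of the residual-correlation idea, not what any particular statistics package does: the function name residual_stepwise, the stopping cutoff min_abs_corr, and the synthetic demo data are all assumptions made for the example.

```python
import numpy as np

def residual_stepwise(X, y, min_abs_corr=0.3):
    """Forward selection driven by correlation with the current residuals.

    min_abs_corr is a hypothetical stopping rule: we stop once no unselected
    variable correlates with the residuals at least this strongly.
    """
    n, p = X.shape
    selected = []
    # Before any variable is in the model, the "residuals" are just y minus
    # its mean, so the first pick is the variable most correlated with y.
    residuals = y - y.mean()
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        corrs = {j: abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_abs_corr:
            break
        selected.append(best)
        # Refit the regression (with intercept) on all selected variables
        # and recompute the residuals for the next round.
        A = np.column_stack([np.ones(n), X[:, selected]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
    return selected

# Synthetic demo data (made up for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.5 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)
print(residual_stepwise(X_demo, y_demo))
```

With data built this way, the selection should pick up the two columns that actually drive y and then stop. Real packages typically use F-tests or p-values to decide when to stop, rather than a fixed correlation cutoff.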

Many statistical analysis packages do stepwise regression seamlessly. Keep in mind, however, that stepwise regression is not guaranteed to produce the optimal set of variables for your model.

Other Approaches

Other approaches to variable selection include best subsets regression, which involves taking various subsets of the available independent variables and running models with them, choosing the subset with the best R². (When comparing subsets of different sizes, adjusted R² is the fairer yardstick, because plain R² never decreases when a variable is added.) Many statistical software packages can help determine the various subsets to choose from. Principal components analysis of all the variables is another approach, but it is beyond the scope of this discussion.
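For a small number of candidate variables, best subsets regression can be done by brute force. The sketch below is one possible way to do it, under some assumptions of mine: it tries every subset, ranks them by adjusted R² rather than plain R² (so that larger subsets don't win automatically), and uses made-up synthetic data. The function names adjusted_r2 and best_subset are hypothetical.

```python
from itertools import combinations

import numpy as np

def adjusted_r2(X_sub, y):
    """Fit OLS with an intercept on X_sub and return adjusted R-squared."""
    n, k = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def best_subset(X, y):
    """Try every non-empty subset of columns; keep the best adjusted R-squared."""
    p = X.shape[1]
    best_cols, best_score = (), -np.inf
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            score = adjusted_r2(X[:, list(cols)], y)
            if score > best_score:
                best_cols, best_score = cols, score
    return best_cols, best_score

# Synthetic demo data (made up for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(1)
X_bs = rng.normal(size=(200, 4))
y_bs = 3.0 * X_bs[:, 0] + 1.5 * X_bs[:, 2] + rng.normal(scale=0.5, size=200)
cols, score = best_subset(X_bs, y_bs)
print(cols, round(score, 3))
```

The number of subsets doubles with every variable you add, so this exhaustive search is only practical for short candidate lists. Note also that adjusted R² can still let an occasional noise variable into the winning subset.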

Despite systematic techniques like stepwise regression, variable selection in regression models is as much an art as a science. Whatever variables you select for your model should have a valid rationale for being there.

Next Forecast Friday Topic: I haven’t decided yet!

Let me surprise you. In the meantime, have a great weekend and be well!

Forecast Friday Topic: Multicollinearity – How to Detect it; How to Correct it

July 15, 2010

(Thirteenth in a series)

In last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history of 20 contributors to a nonprofit organization, and developed a model based on the recency, frequency, and monetary value (RFM) of their past donations. We derived the following regression equation:

We were pleased to see that our model had a coefficient of determination, R², of 0.933, indicating that our model explained 93.3% of the variation in the donor’s current contribution (our Y). But we were a little disheartened when we looked at the t-statistics of each of our regression coefficients. Recall that we found our recency coefficient was not significant:

[Table: t-statistics for Months Since Last Donation, Times Donated, and Average Contribution; values not available]

Yet most direct marketing professionals know that RFM theory postulates that all three variables are significant indicators of whether and how much a donor will give (or a customer will buy). When our model doesn’t replicate what a tried and true theory has long maintained, something may be wrong.


Most times, when something doesn’t look right in the results of a regression model, it is safe to assume that one of the regression assumptions has been violated. The problem is trying to determine which assumption – or assumptions – was violated. Since the coefficient for “Months Since Last Contribution” has a t-statistic that indicates it isn’t statistically significant, we might suspect that the specification assumption is violated: that is, we may believe that “Months Since Last Contribution” is an extraneous, irrelevant variable that should not have been included in the model and thus should be removed.

But is that really the case? There can be other reasons why a parameter estimate does not come up significant. If two or more independent variables are highly correlated, the resulting multicollinearity can cause the regression model to assign a statistically insignificant parameter estimate to an important independent variable. So, how can we detect multicollinearity?

Detecting Multicollinearity: Correlation Matrix

The first step in detecting multicollinearity is to examine the correlation among the independent variables. We do this by looking at a correlation matrix. You can run a correlation matrix in Excel by using its Data Analysis ToolPak. Looking at the correlation matrix for our variables, we find:

Correlation Matrix – Original Variables

                                             Y       X1      X2      X3
Contribution (Y)                             1.00
Months Since Last Donation (X1)             -0.93    1.00
Times Donated in Last 12 Months (X2)         0.89   -0.88    1.00
Average Contribution Last 12 Mo. (X3)        0.88   -0.84    0.69    1.00

A correlation of 1.00 means two variables are perfectly correlated; a correlation of 0.00 means there is no linear relationship at all. The cells in the matrix above where the correlation is 1.00 show the correlation of a variable with itself, where we would expect a perfect correlation. What is most important to us are the numbers below the 1.00 correlations. The first column shows our dependent variable, “Contribution”. As you go down the column, row by row, you see that each of our independent variables is strongly correlated with the dependent variable, indicating that they are all strong predictors.

The correlation between “Months Since Last Donation” (X1) and the donor’s Contribution (Y) is almost perfectly negative (-0.93), while the correlations of the dependent variable with each of the other two independent variables are strongly positive (0.89 and 0.88). When writing these in shorthand, we use the Greek letter rho, ρ, to denote correlation. Hence, the correlation of each independent variable with the dependent variable is expressed as follows:

ρX1Y = -0.93

ρX2Y = 0.89

ρX3Y = 0.88

But now, let’s look at the correlations among our independent variables:

ρX1X2= -0.88

ρX1X3= -0.84

ρX2X3= 0.69


Notice that all of our independent variables are highly correlated with one another. The correlation between “Times Donated in Last 12 Months” and “Average Contribution in Last 12 Months” is not as strong as the correlation of each of those variables with “Months Since Last Donation,” but it is still strong.

Hence, we can conclude that multicollinearity is present in this model.
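If you would rather scan the matrix programmatically than by eye, a short Python sketch like the following could do it. It uses the correlations among the three independent variables reported above; the helper name flag_correlated_pairs and the 0.65 cutoff are my own illustrative assumptions, not a standard rule. (With raw data in a NumPy array, np.corrcoef(data, rowvar=False) would produce the matrix itself.)

```python
import numpy as np

# Correlations among the independent variables, as reported in this post.
names = ["X1", "X2", "X3"]
corr = np.array([
    [ 1.00, -0.88, -0.84],
    [-0.88,  1.00,  0.69],
    [-0.84,  0.69,  1.00],
])

def flag_correlated_pairs(corr, names, threshold):
    """Return (name_a, name_b, rho) for every pair with |rho| >= threshold."""
    pairs = []
    n = len(names)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so each pair appears once
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# The 0.65 cutoff is an arbitrary choice for illustration.
flagged = flag_correlated_pairs(corr, names, threshold=0.65)
for a, b, rho in flagged:
    print(f"{a} and {b}: rho = {rho:+.2f}")
```

At this cutoff all three pairs are flagged, which matches the conclusion above. Raising the cutoff narrows the list to the pairs involving “Months Since Last Donation.”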

Correcting Multicollinearity: Dropping Variables

In today’s post, we will discuss one of the remedies for multicollinearity – dropping a highly correlated independent variable. Next week, we’ll discuss the other approaches to correcting multicollinearity. Sometimes, when a variable is “iffy,” we can save ourselves some trouble and just kick it out. If we were to ignore “Months Since Last Donation,” and run our regression with the remaining two variables, we end up with the following regression equation:

Ŷ= 60.68 + 3.37X2 + 0.45X3
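To see how this two-variable equation is used, here is a minimal Python sketch that plugs values into it. The coefficients come from the equation above; the function name predict_contribution and the example donor (4 donations in the last 12 months, averaging $50) are hypothetical.

```python
def predict_contribution(times_donated, avg_contribution):
    """Apply the two-variable equation: Y-hat = 60.68 + 3.37*X2 + 0.45*X3."""
    return 60.68 + 3.37 * times_donated + 0.45 * avg_contribution

# Hypothetical donor: 4 donations in the last 12 months, averaging $50 each.
# 60.68 + 3.37*4 + 0.45*50 = 60.68 + 13.48 + 22.50 = 96.66
example = predict_contribution(times_donated=4, avg_contribution=50.0)
print(round(example, 2))
```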

We get R² = 0.924, suggesting that we didn’t lose much explanatory power by excluding “Months Since Last Donation.” We also get an F-statistic of 103.36, much higher than the 73.90 we had in our original model. A higher F-statistic means stronger evidence that the model as a whole is significant; here, much of the increase comes from dropping a variable that added little explanatory power while using up a degree of freedom. Also, the t-statistics for both independent variables are significant, and they’re even higher than they were in the original model:

[Table: t-statistics for Times Donated and Average Contribution; values not available]
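As a rough check on the two F-statistics quoted above, the overall F can be recovered from R², the number of observations n, and the number of independent variables k, using the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below assumes n = 20 donors (from last week’s example) and uses the rounded R² values from this post, so it will only approximately match Excel’s 73.90 and 103.36. The function name f_statistic is hypothetical.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from R-squared."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)

# Original three-variable model vs. the model without "Months Since Last Donation".
f_full = f_statistic(0.933, n_obs=20, n_predictors=3)
f_reduced = f_statistic(0.924, n_obs=20, n_predictors=2)
print(round(f_full, 2), round(f_reduced, 2))
```

Both results land close to the reported figures (roughly 74 and 103). The formula also shows why F rose even though R² slipped a little: the explained variation is now spread over two variables instead of three, and the residual variation is averaged over one more degree of freedom.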




Dropping “Months Since Last Donation” from our analysis worked here. However, dropping variables without a rational decision process can cause new problems. In some cases, dropping a variable can result in specification bias, as we saw in our previous example of predicting profit margin for savings and loan associations a few weeks ago. So, consider dropping variables cautiously.

Next Forecast Friday Topic: More Multicollinearity Remedies

Today, we described one of the ways to remedy multicollinearity – dropping variables. Next week, we will explore two other ways of correcting multicollinearity: obtaining more data and transforming variables. We will also discuss the pitfalls of all three of these remedies, and we will discuss when it’s not worth it to reduce the impact of multicollinearity.


Let Analysights Take the Pain out of Forecasting!

Multicollinearity is but one of the many problems you can encounter when forecasting. Let Analysights walk you through the forecasting process so that you can spend more time making strategic decisions and less time trying to guess where business is going. We will make your forecasting efforts seamless, so you can concentrate on running your business. Check out our Web site or call (847) 895-2565.