(Fourteenth in a series)
In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.
Obtaining More Data
Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.
Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!
Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.
Combine Some Variables
The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.
So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation” to perform the regression. In fact, if we did that with our model, we end up with a model (not shown here) that has an R2=0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.
Centered Interaction Terms
Sometimes we can reduce multicollinearity by creating an interaction term between variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related with hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value for both the hours of sleep and hours of studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply their differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering interaction term, which means you’re taking into account central tendency in your data.
If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation – subtract the immediate prior observation’s values for those variables from the current observation. Now, instead of working with original data, you are working with the change in data from one period to the next. Differencing eliminates multicollinearity by removing the trend component of the time series. If all independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. In this case, subtracting the period before was taking a “first difference.” If we subtracted two periods before, it’s a “second difference,” and so on. Note also that with differencing, we lose the first observations in the data, depending on how many periods we have to difference, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type I Error: concluding that an independent variable is not statistically significant when, in truth it is.
Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.
Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.
Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.
Limits of Data Transformation
Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.
Living With Multicollinearity
Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.
A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.
- Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
- A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
You should suspect multicollinearity if:
- You have a high R2 but low t-statistics;
- The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
- Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
- There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy for multicollinearity contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity on time and resources needed to do so, and the resulting impact on your forecast accuracy.
Next Forecast Friday Topic: Autocorrelation
These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.
If you Like Our Posts, Then “Like” Us on Facebook and Twitter!
Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.
Tags: Analysights, autocorrelation, bias, business forecasting, centered interaction, centering, coefficient of determination, data transformation, dependent variable, durbin-watson, error term, forecast, Forecast Friday, Forecasting, increasing sample size, independent variable, interaction term, multicollinearity, multiple linear regression, multiple regression, parameter estimates, predictive modeling, r-squared, regression, regression analysis, residuals, sales forecasting, sample size, Sampling, statistical analysis, statistical significance, Statistics, t-value, time series, time series analysis, transforming data, type I error