(Twenty-fifth in a series)
When it comes to building a regression model, for many companies there’s good news and bad news. The good news: there are plenty of independent variables from which to choose. The bad news: there are plenty of independent variables from which to choose! While it may be possible to run a regression with all possible independent variables, each one you include costs a degree of freedom and raises the risk of overfitting the model to the data on which it was built, resulting in less reliable forecasts when new data is introduced.
So how do you come up with your short list of independent variables?
Some analysts have tried plotting the dependent variable (Y) against each independent variable (Xi) and selecting those that show a noticeable relationship. Another common method is to produce a correlation matrix of all the independent variables and, if a large correlation between two of them is discovered, drop one from consideration (so as to avoid multicollinearity). Still another approach is to perform a multiple linear regression on all possible explanatory variables and then drop those whose t-values are insignificant. These approaches are often chosen because they are quick and simple, but they are not reliable for coming up with a decent regression model.
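The correlation-matrix screen mentioned above can be sketched in a few lines of Python with NumPy. This is only an illustration: the variable names (price, ad_spend, promo), the simulated data, and the 0.8 cutoff are all assumptions made for the example, and the screen shares the reliability limits just described.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.8):
    """List pairs of independent variables whose |correlation| >= threshold.

    The 0.8 default is an arbitrary illustrative cutoff, not a standard.
    """
    corr = np.corrcoef(X, rowvar=False)  # columns of X are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Simulated (hypothetical) data: promo is built to nearly duplicate ad_spend.
rng = np.random.default_rng(0)
price = rng.normal(size=200)
ad_spend = rng.normal(size=200)
promo = 0.95 * ad_spend + 0.05 * rng.normal(size=200)

X = np.column_stack([price, ad_spend, promo])
flagged = highly_correlated_pairs(X, ["price", "ad_spend", "promo"])
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

The function only flags pairs; deciding which member of a pair to drop is still a judgment call.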
Other approaches are a bit more complex, but more reliable. Perhaps the most common of these approaches is stepwise regression. Stepwise regression works by first identifying the independent variable with the strongest correlation (positive or negative) with the dependent variable. Once that variable is identified, a one-variable regression model is run. The residuals of that model are then obtained. Recall from previous Forecast Friday posts that if an important variable is omitted from a regression model, its effect on the dependent variable gets factored into the residuals. Hence, the next step in a stepwise regression is to identify the one unselected independent variable with the strongest correlation with the residuals. Now you have your second independent variable, and you run a two-variable regression model. You then look at the residuals of that model and select the remaining independent variable with the strongest correlation to them, and so forth. Repeat the process until none of the remaining variables adds meaningfully to the model.
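The procedure just described can be sketched in Python with NumPy. Treat this as a minimal illustration under stated assumptions, not as what any statistical package actually does: the function names, the simulated data, and especially the stopping rule (quit when no remaining variable has an absolute correlation of at least 0.2 with the residuals) are my own choices for the example. Packaged stepwise routines typically decide entry and removal with significance tests instead.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, y - A @ beta

def forward_select(X, y, names, min_corr=0.2):
    """Add variables one at a time by their correlation with the residuals.

    min_corr is an assumed, illustrative stopping threshold.
    """
    selected = []
    # Before any model is fit, the "residuals" are just y minus its mean,
    # so the first pick is the variable most correlated with y itself.
    residuals = y - y.mean()
    while len(selected) < X.shape[1]:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        corrs = {j: abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_corr:
            break  # nothing left explains much of the residuals
        selected.append(best)
        _, residuals = fit_ols(X[:, selected], y)  # refit with the new variable
    return [names[j] for j in selected]

# Simulated (hypothetical) data: only x1 and x2 actually drive y.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=500)

chosen = forward_select(X, y, ["x1", "x2", "x3", "x4"])
print(chosen)
```

Because the simulated y depends only on x1 and x2, the sketch should pick those two and stop; real data is rarely this tidy.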
Many statistical analysis packages perform stepwise regression seamlessly. Keep in mind, however, that stepwise regression is not guaranteed to produce the optimal set of variables for your model.
Other approaches to variable selection include best subsets regression, which involves taking various subsets of the available independent variables and running models with them, choosing the subset with the best R². Because R² never decreases when a variable is added, adjusted R² is the fairer yardstick when the subsets being compared differ in size. Many statistical software packages can help determine the various subsets to choose from. Principal components analysis of all the variables is another approach, but it is beyond the scope of this discussion.
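For a small number of candidate variables, best subsets regression can be sketched by brute force in Python with NumPy. A few assumptions to flag: the function names and simulated data are hypothetical, every possible subset is tried (which quickly becomes impractical as the number of variables grows), and subsets are ranked by adjusted R² rather than plain R², since plain R² would always favor the largest subset.

```python
import itertools
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit (with intercept) of y on X's columns."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def best_subset(X, y, names):
    """Try every non-empty subset of columns; keep the best adjusted R-squared."""
    best_score, best_names = -np.inf, ()
    for k in range(1, X.shape[1] + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            score = adjusted_r2(X[:, list(cols)], y)
            if score > best_score:
                best_score, best_names = score, tuple(names[j] for j in cols)
    return best_names, best_score

# Simulated (hypothetical) data: only x1 and x2 actually drive y.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=200)

subset, score = best_subset(X, y, ["x1", "x2", "x3", "x4"])
print(subset, round(score, 3))
```

Even adjusted R² can let an irrelevant variable slip into the winning subset by chance, which is one more reason each variable should have a rationale for being there.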
Despite systematic techniques like stepwise regression, variable selection in regression models is as much an art as a science. Whatever variables you select for your model should have a valid rationale for being there.
Next Forecast Friday Topic: I haven’t decided yet!
Let me surprise you. In the meantime, have a great weekend and be well!