(Ninth in a series)
Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the price of competitors’ products, the time of year (if our product is seasonal), and the demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will be spending the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but is quite easy to forget if not used often.
Another thing to note: regression analysis is used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables in static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.
Determining Parameter Estimates for Multiple Regression
When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or a statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you with the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.
Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X_{1} and X_{2}, respectively. Also, the coefficients – your parameter estimates – will have subscripts corresponding to their respective independent variables. Hence, your model will take on the form:

Y_{i} = α + β_{1}X_{1i} + β_{2}X_{2i} + ε_{i}
Now, how do you go about computing α, β_{1}, and β_{2}? The process is similar to that of a two-variable model, but a little more involved. Working in deviations from the means – that is, y_{i} = Y_{i} – Ȳ, x_{1i} = X_{1i} – X̄_{1}, and x_{2i} = X_{2i} – X̄_{2} – the least-squares estimates are:

β_{1} = [Σx_{2i}² Σx_{1i}y_{i} – Σx_{1i}x_{2i} Σx_{2i}y_{i}] / [Σx_{1i}² Σx_{2i}² – (Σx_{1i}x_{2i})²]

β_{2} = [Σx_{1i}² Σx_{2i}y_{i} – Σx_{1i}x_{2i} Σx_{1i}y_{i}] / [Σx_{1i}² Σx_{2i}² – (Σx_{1i}x_{2i})²]

α = Ȳ – β_{1}X̄_{1} – β_{2}X̄_{2}

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.
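Those deviation-form formulas are easy to verify in code. Here is a minimal Python/NumPy sketch (the function name and the simulated data are mine, purely for illustration):

```python
import numpy as np

def two_var_ols(x1, x2, y):
    """Estimate alpha, beta1, beta2 for Y = alpha + beta1*X1 + beta2*X2
    using the deviation-form least-squares formulas."""
    x1d, x2d, yd = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
    s11, s22, s12 = (x1d**2).sum(), (x2d**2).sum(), (x1d * x2d).sum()
    s1y, s2y = (x1d * yd).sum(), (x2d * yd).sum()
    denom = s11 * s22 - s12**2   # must be nonzero: no perfect collinearity
    b1 = (s22 * s1y - s12 * s2y) / denom
    b2 = (s11 * s2y - s12 * s1y) / denom
    a = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return a, b1, b2

# Sanity check on noise-free simulated data with known coefficients
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 0.5 * x1 - 1.5 * x2    # no error term, so recovery is exact
a, b1, b2 = two_var_ols(x1, x2, y)
print(round(a, 4), round(b1, 4), round(b2, 4))  # → 2.0 0.5 -1.5
```

With no error term in the simulated data, the formulas recover the true coefficients exactly (up to floating-point precision).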
What do you notice about the formulas for computing β_{1} and β_{2}? First, you notice that both independent variables, X_{1} and X_{2}, are included in the calculation of each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are often related to one another to some degree, and the formulas must account for that overlap so that each coefficient measures only its own variable’s contribution. If either β_{1} or β_{2} turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to those variables in the model, we run into serious problems, namely:
Specification Bias (Regression Assumptions Revisited)
Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model, and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β_{1} and β_{2}, we can see that if we excluded one of the independent variables, say X_{2}, from the model, the value derived for β_{1} would be incorrect because X_{1} has some relationship with X_{2}. Moreover, X_{2}’s influence is likely to be absorbed into the error term, and because of its relationship with X_{1}, X_{1} will be correlated with the error term, violating the second assumption above. Hence, you will end up with an incorrect, biased estimator for your regression coefficient, β_{1}.
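To see this bias concretely, here is a small simulation sketch (Python/NumPy; the coefficients and data are invented for illustration): when x2 is omitted but correlated with x1, the estimated coefficient on x1 absorbs part of x2’s effect.

```python
import numpy as np

# Simulated illustration of omitted-variable bias (made-up coefficients)
rng = np.random.default_rng(7)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Correctly specified model: include both regressors
full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(full, y, rcond=None)

# Misspecified model: x2 omitted
short = np.column_stack([np.ones(n), x1])
b_short, *_ = np.linalg.lstsq(short, y, rcond=None)

print(round(b_full[1], 2))   # near the true value of 2.0
print(round(b_short[1], 2))  # biased toward 2.0 + 3.0*0.8 = 4.4
```

The omitted variable's effect loads onto the included variable in proportion to how strongly the two are related, which is exactly the bias described above.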
Omitted Variables are Bad, but Excessive Variables Aren’t Much Better
Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, and that reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable – both dependent and independent – you are considering, you should have at least five data points. So, for a model with three independent variables (four variables in all, counting the dependent variable), your data set should have at least 20 observations.
Another Important Regression Assumption
One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:
No combination of independent variables should have an exact linear relationship with one another.
OK, so what does this mean? Let’s assume you’re building a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables: every value of Fahrenheit temperature corresponds to exactly one value of Celsius temperature (C = (F – 32) × 5/9). In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. The relationship between independent variables need not be perfectly linear for multicollinearity to be a problem; highly correlated variables can do the same thing. For example, “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage,” are pairs of independent variables that are likely to be highly correlated.
You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.
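One common diagnostic is the variance inflation factor (VIF): regress each independent variable on the others and compute 1/(1 – R²); values above roughly 10 are a conventional warning sign. Here is a sketch (Python/NumPy; the data are simulated, and the threshold is a rule of thumb, not from this post):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column).
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Simulated example: husband and wife ages are nearly collinear; income is not
rng = np.random.default_rng(1)
husband_age = rng.uniform(25, 65, size=200)
wife_age = husband_age + rng.normal(scale=2, size=200)
income = rng.uniform(30, 120, size=200)
vifs = vif(np.column_stack([husband_age, wife_age, income]))
print([round(v, 1) for v in vifs])  # first two large (collinear), third near 1
```

The two age variables flag each other as collinear, while the unrelated income variable stays near a VIF of 1.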
Now Can We Get Into Multiple Regression????
Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. This data set consists of 25 annual observations used to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X_{1}) and the number of offices (X_{2}). The data are as follows:
| Year | Percentage Profit Margin (Y_{t}) | Net Revenues Per Deposit Dollar (X_{1t}) | Number of Offices (X_{2t}) |
|------|----------------------------------|------------------------------------------|----------------------------|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |
Data taken from Spellman, L. J., “Entry and profitability in a rate-free savings and loan market,” Quarterly Review of Economics and Business 18, no. 2 (1978): 87-95. Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2^{nd} Edition, Cincinnati (1994): 136-137.
What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.
Net Revenue Per Deposit Dollar (X_{1}) and Percentage Profit Margin (Y)
Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S&L is making more revenue on the same dollar, that suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β_{1}, for net revenue per deposit dollar to have a “+” sign.
Number of S&L Offices (X_{2}) and Percentage Profit Margin (Y)
Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β_{2}, for number of S&L offices to have a “-” sign.
Are our Expectations Correct?
Do our relationship expectations hold up? They certainly do. The estimated multiple regression model is:
Y_{t} = 1.56450 + 0.23720X_{1t} - 0.000249X_{2t}
What do the Parameter Estimates Mean?
Essentially, the model says that if net revenues per deposit dollar (X_{1t}) increase by one unit, then percentage profit margin (Y_{t}) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X_{2t}) increases by one, then percentage profit margin (Y_{t}) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.
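A quick arithmetic check of what this means in practice: plugging the year-1 values from the table into the estimated equation gives a fitted profit margin (this uses only the coefficients reported above):

```python
# Point forecast from the estimated equation, using the year-1 inputs
alpha, b1, b2 = 1.56450, 0.23720, -0.000249
x1, x2 = 3.92, 7298        # year-1 net revenues per deposit dollar, offices
y_hat = alpha + b1 * x1 + b2 * x2
print(round(y_hat, 3))     # → 0.677 (the actual year-1 margin was 0.75)
```

The difference between the fitted value (0.677) and the actual value (0.75) is the year-1 residual.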
Do Changes in the Independent Variables Explain Changes in The Dependent Variable?
We compute the coefficient of determination, R^{2}, and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.
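As a sanity check, both the coefficients and the R² can be reproduced by refitting the model on the 25 observations above. A Python/NumPy sketch (any regression tool, Excel included, should give the same figures):

```python
import numpy as np

# Refit the S&L model from the 25 observations in the table above
y = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
              0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
              0.56, 0.41, 0.51, 0.47, 0.32])
x1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])
x2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)

X = np.column_stack([np.ones(len(y)), x1, x2])
(alpha, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ np.array([alpha, b1, b2])
r2 = 1 - (resid**2).sum() / ((y - y.mean())**2).sum()

print(f"Y = {alpha:.5f} + {b1:.5f}*X1 + ({b2:.6f})*X2, R^2 = {r2:.3f}")
# Should print coefficients close to 1.56450, 0.23720, -0.000249 and R^2 near 0.865
```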
Are the Parameter Estimates Statistically Significant?
We have 25 observations and three parameters – two coefficients for the independent variables and one intercept – hence we have 22 degrees of freedom (25 – 3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the confidence intervals constructed around our parameter estimates would contain the true parameter approximately 95 times. To test significance, we look at the t-statistic for each parameter estimate: the estimate divided by its standard error. For a two-tailed test at the 95% level with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, there is a statistically significant positive relationship between the independent variable and the dependent variable; if the t-statistic is less than -2.074, there is a statistically significant negative relationship. This is what we get:
| Parameter | Value | t-Statistic | Significant? |
|-----------|-------|-------------|--------------|
| Intercept (α) | 1.56450 | 19.70 | Yes |
| β_{1} | 0.23720 | 4.27 | Yes |
| β_{2} | -0.000249 | -7.77 | Yes |
So, yes, all our parameter estimates are significant.
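For reference, here is how such t-statistics are computed in general: each estimate is divided by its standard error, which comes from s²(XᵀX)⁻¹. The sketch below uses simulated data, not the S&L data set; the coefficient values are invented.

```python
import numpy as np

# t-statistics for OLS coefficients on simulated data (illustration only)
rng = np.random.default_rng(42)
n = 25
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.0 * x2 + rng.normal(scale=0.3, size=n)  # x2 irrelevant

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]              # 25 - 3 = 22 degrees of freedom
s2 = (resid**2).sum() / dof       # unbiased estimate of the error variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se
for name, t_i in zip(["intercept", "beta1", "beta2"], t):
    verdict = "significant" if abs(t_i) > 2.074 else "not significant"
    print(f"{name}: t = {t_i:.2f} ({verdict})")
```

With 22 degrees of freedom, the same 2.074 critical value used above applies here.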
Next Forecast Friday: Building on What You Learned
I think you’ve had enough for this week! But we are still not finished. We’re going to stop here and continue our analysis of this example next week, when we will discuss computing the 95% confidence intervals for the parameter estimates, determining whether the model is valid, and checking for autocorrelation. The following Forecast Friday (July 1) blog post will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.