## Posts Tagged ‘Introductory Business & Economic Forecasting’

### Forecast Friday Topic: Procedures for Combining Forecasts

March 24, 2011

(Forty-first in a series)

We have gone through a series of different forecasting approaches over the last several months. Many times, companies will have multiple forecasts for the same item, usually produced by different people across the enterprise, often using different methodologies, assumptions, and data collection processes, and typically for different business problems. Rarely is one forecasting method or forecast consistently superior to another, especially over time. Hence, many companies will opt to combine the forecasts they generate into a composite forecast.

Considerable empirical evidence suggests that combining forecasts works very well in practice. If all the forecasts generated by the alternative approaches are unbiased, then that lack of bias carries over into the composite forecast, a desirable outcome to have.

Two common procedures for combining forecasts are simple averaging and assigning weights inversely proportional to the sum of squared errors. We will discuss both procedures in this post.

Simple Average

The quickest, easiest way to combine forecasts is to simply take the forecasts generated by each method and average them. With a simple average, each forecasting method is given equal weight. So, if you are presented with the following five forecasts:

You’ll get the average of \$83,000 as your composite forecast.

The simplicity and quickness of this procedure are its main advantages. The chief drawback is that any information you have about which methods consistently predict better or worse is disregarded in the combination. Moreover, look at the wide variation in the forecasts above. The forecasts range from \$50,000 to \$120,000. Clearly, one or more of these methods’ forecasts will be way off. While the combination of forecasts can dampen the impact of forecast error, the outliers can easily skew the composite forecast. If you suspect one or more forecasts may be inferior to the others, you may just choose to exclude them and apply simple averaging to the forecasts for which you have some reasonable degree of confidence.
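To make the arithmetic concrete, here is a minimal Python sketch of simple averaging. The five forecast values are hypothetical: they were picked only so that they span the \$50,000 to \$120,000 range and average \$83,000, as quoted above. The rule for deciding which forecasts to distrust is likewise just an illustration.

```python
from statistics import mean

# Hypothetical forecasts (in dollars), chosen only to match the range
# ($50,000 to $120,000) and the $83,000 average quoted in the post.
forecasts = [50_000, 70_000, 80_000, 95_000, 120_000]

# Simple average: every method gets the same weight (1/5 here).
composite = mean(forecasts)
print(composite)

# Variation discussed above: exclude the forecasts you distrust and
# average the rest. Dropping the two extremes is an illustrative choice,
# not a rule from the post.
trusted = [f for f in forecasts if 50_000 < f < 120_000]
composite_trusted = mean(trusted)
print(round(composite_trusted, 2))
```

Notice how much the composite moves once the two extreme forecasts are left out; that sensitivity is exactly the drawback of giving every method an equal say.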

Assigning Weights in (Inverse) Proportion to Sum of Squared Errors

If you know the past performance of individual forecasting methods available to you, and you need to combine multiple forecasts, it’s likely you will want to assign greater weights to those forecast methods that have performed best. You will also want to allow the weighting scheme to adapt over time, since the relative performance of forecasting methods can change. One way to do that would be to assign weights to each forecast in inverse proportion to its sum of squared forecast errors.

Let’s assume you have 12 months of actual sales data (Xt) and three forecasting methods, each generating a forecast for each month (f1t, f2t, and f3t). Each of those three methods has also generated a forecast for month 13, which you are trying to predict. The table below shows these 12 months of actuals and forecasts, along with each method’s forecast for month 13:

How much weight do you give each forecast? Calculate the sum of squared errors (SSE) for each:

To get the weight for one forecast method, you divide the sum of the other two methods’ squared errors by the total sum of the squared errors for all three methods, and then divide by 2 (3 methods minus 1). You must then do the same for the other two methods, so that the weights sum to 1. Hence, the weights are as follows:

$$w_1 = \frac{1}{2}\cdot\frac{SSE_2 + SSE_3}{SSE_1 + SSE_2 + SSE_3}, \qquad w_2 = \frac{1}{2}\cdot\frac{SSE_1 + SSE_3}{SSE_1 + SSE_2 + SSE_3}, \qquad w_3 = \frac{1}{2}\cdot\frac{SSE_1 + SSE_2}{SSE_1 + SSE_2 + SSE_3}$$

Notice that the higher weights are given to the forecast methods with the lowest sum of squared errors. So, since each method generated a forecast for month 13, our composite forecast would be:

$$\hat{X}_{13} = w_1 f_{1,13} + w_2 f_{2,13} + w_3 f_{3,13}$$

Hence, we would estimate approximately 795 as our composite forecast for month 13. When we obtain month 13’s actual sales, we would repeat this process using the sum of squared errors from months 1-13 for each individual forecast, reassign the weights, and then apply them to each method’s forecast for month 14.

Also, notice the fraction ½ at the beginning of each weight equation. Its denominator depends on the number of weights we are generating. In this case, we are generating three weights, so our denominator is (3-1)=2. If we had used four methods, the fraction in each weight equation would have been one-third; and if we had only two methods, there would be no fraction, because it would be one.
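The weighting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are my own, and the actuals and forecasts are small hypothetical numbers, not the 12-month table from this post.

```python
def sse(actuals, forecasts):
    """Sum of squared forecast errors for one method."""
    return sum((a - f) ** 2 for a, f in zip(actuals, forecasts))


def combination_weights(sse_values):
    """Weight for method i = (sum of the OTHER methods' SSEs) / (total SSE) / (k - 1)."""
    k = len(sse_values)
    if k < 2:
        raise ValueError("need at least two forecasting methods")
    total = sum(sse_values)
    if total == 0:
        # Every method was perfect so far; fall back to equal weights.
        return [1 / k] * k
    return [(total - s) / total / (k - 1) for s in sse_values]


def composite_forecast(weights, next_forecasts):
    """Weighted sum of each method's forecast for the next period."""
    return sum(w * f for w, f in zip(weights, next_forecasts))


# Hypothetical data: four periods of actuals and three methods' forecasts.
actuals = [100, 110, 120, 130]
method_forecasts = [
    [102, 108, 121, 129],  # method 1: SSE = 4 + 4 + 1 + 1 = 10
    [96, 112, 124, 128],   # method 2: SSE = 16 + 4 + 16 + 4 = 40
    [105, 105, 125, 125],  # method 3: SSE = 25 + 25 + 25 + 25 = 100
]

sse_values = [sse(actuals, f) for f in method_forecasts]
weights = combination_weights(sse_values)

# Each method's (hypothetical) forecast for the next period.
next_forecasts = [140, 138, 150]
composite = composite_forecast(weights, next_forecasts)

print(sse_values)
print([round(w, 4) for w in weights])
print(round(composite, 2))
```

When the next actual arrives, you would append it to `actuals`, append each method's forecast for that period, and rerun the same three steps to get fresh weights.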

Regression-Based Weights – Another Procedure

Another way to assign weights would be with regression, but that’s beyond the scope of this post. While the weighting approach above is simple, it’s also ad hoc. Regression-based weights can be much more theoretically correct. However, in most cases, you will not have many months of forecasts for estimating regression parameters. Also, you run the risk of autocorrelated errors, most certainly for forecasts beyond one step ahead. More information on regression-based weights can be found in Newbold & Bos, Introductory Business & Economic Forecasting, Second Edition, pp. 504-508.

Next Forecast Friday Topic: Effectiveness of Combining Forecasts

Next week, we’ll take a look at the effectiveness of combining forecasts, with a look at the empirical evidence that has been accumulated.

********************************************************

Follow us on Facebook and Twitter!

For the latest insights on marketing research, predictive modeling, and forecasting, be sure to check out Analysights on Facebook and Twitter! “Like-ing” us on Facebook and following us on Twitter will allow you to stay informed of each new Insight Central post published, new information about analytics, discussions Analysights will be hosting, and other opportunities for feedback. So check us out on Facebook and Twitter!


### Forecast Friday Changes; Resumes February 3

January 17, 2011

Readers,

We’re currently in the phase of the Forecast Friday series that discusses ARIMA models. This week’s post was to discuss the autoregressive (AR), moving average (MA) and autoregressive moving average (ARMA) models, and then the posts for the next three weeks would delve into ARIMA models. Given the complexity of the topic, along with increasing client load at Analysights, I no longer have the time to cover this topic in the detail it requires. Therefore, I have decided to pull ARIMA out of the series. Forecast Friday will resume February 3, when we will begin our discussion of judgmental forecasting methods.

For those of you interested in learning about ARIMA, I invite you to check out some resources that have helped me through college and graduate school:

1. Introductory Business & Economic Forecasting, 2nd Edition. Newbold, P. and Bos, T., Chapter 7.
2. Forecasting Methods and Applications, 3rd Edition. Makridakis, S., Wheelwright, S. and Hyndman, R., Chapters 7-8.
3. Introducing Econometrics. Brown, W., Chapter 9.

I apologize for this inconvenience, and thank you for your understanding.

Alex

### Forecast Friday Topic: Stationarity in Time Series Data

January 13, 2011

(Thirty-fifth in a series)

In last week’s Forecast Friday post, we began our coverage of ARIMA modeling with a discussion of the Autocorrelation Function (ACF). We also learned that in order to generate forecasts from a time series, the series needed to exhibit no trend (either up or down), fluctuate around a constant mean and variance, and have covariances between terms in the series that depended only on the time interval between the terms, and not their absolute locations in the time series. A time series that meets these criteria is said to be stationary. When a time series appears to have a constant mean, then it is said to be stationary in the mean. Similarly, if the variance of the series doesn’t appear to change, then the series is also stationary in the variance.

Stationarity is nothing new in our discussions of time series forecasting. While we may not have discussed it in detail, we did note that the absence of stationarity made moving average methods less accurate for short-term forecasting, which led into our discussion of exponential smoothing. When the time series exhibited a trend, we relied upon double exponential smoothing to adjust for nonstationarity; in our discussions of regression analysis, we ensured stationarity by decomposing the time series (removing the trend, seasonal, cyclical, and irregular components), adding seasonal dummy variables into the model, and lagging the dependent variable. The ACF is another way of detecting nonstationarity. And that is what we’ll discuss today.

Recall our ACF from last week’s Forecast Friday discussion:

Because there is no discernible pattern, and because the lags pierce the ±1.96 standard error boundaries less than 5% (in fact, zero percent) of the time, this time series is stationary. Let’s do a simple plot of our time series:

A simple eyeballing of the time series plot shows that the series’ mean and variance both seem to hold fairly constant for the duration of the data set.

But now let’s take a look at another data set. In the table below, which I snatched from my graduate school forecasting textbook, we have 160 quarterly observations on real gross national product:

160 Quarters of U.S. Real Gross Domestic Product

| t | Xt | t | Xt | t | Xt | t | Xt |
|---|---|---|---|---|---|---|---|
| 1 | 1,148.2 | 41 | 1,671.6 | 81 | 2,408.6 | 121 | 3,233.4 |
| 2 | 1,181.0 | 42 | 1,666.8 | 82 | 2,406.5 | 122 | 3,157.0 |
| 3 | 1,225.3 | 43 | 1,668.4 | 83 | 2,435.8 | 123 | 3,159.1 |
| 4 | 1,260.2 | 44 | 1,654.1 | 84 | 2,413.8 | 124 | 3,199.2 |
| 5 | 1,286.6 | 45 | 1,671.3 | 85 | 2,478.6 | 125 | 3,261.1 |
| 6 | 1,320.4 | 46 | 1,692.1 | 86 | 2,478.4 | 126 | 3,250.2 |
| 7 | 1,349.8 | 47 | 1,716.3 | 87 | 2,491.1 | 127 | 3,264.6 |
| 8 | 1,356.0 | 48 | 1,754.9 | 88 | 2,491.0 | 128 | 3,219.0 |
| 9 | 1,369.2 | 49 | 1,777.9 | 89 | 2,545.6 | 129 | 3,170.4 |
| 10 | 1,365.9 | 50 | 1,796.4 | 90 | 2,595.1 | 130 | 3,179.9 |
| 11 | 1,378.2 | 51 | 1,813.1 | 91 | 2,622.1 | 131 | 3,154.5 |
| 12 | 1,406.8 | 52 | 1,810.1 | 92 | 2,671.3 | 132 | 3,159.3 |
| 13 | 1,431.4 | 53 | 1,834.6 | 93 | 2,734.0 | 133 | 3,186.6 |
| 14 | 1,444.9 | 54 | 1,860.0 | 94 | 2,741.0 | 134 | 3,258.3 |
| 15 | 1,438.2 | 55 | 1,892.5 | 95 | 2,738.3 | 135 | 3,306.4 |
| 16 | 1,426.6 | 56 | 1,906.1 | 96 | 2,762.8 | 136 | 3,365.1 |
| 17 | 1,406.8 | 57 | 1,948.7 | 97 | 2,747.4 | 137 | 3,451.7 |
| 18 | 1,401.2 | 58 | 1,965.4 | 98 | 2,755.2 | 138 | 3,498.0 |
| 19 | 1,418.0 | 59 | 1,985.2 | 99 | 2,719.3 | 139 | 3,520.6 |
| 20 | 1,438.8 | 60 | 1,993.7 | 100 | 2,695.4 | 140 | 3,535.2 |
| 21 | 1,469.6 | 61 | 2,036.9 | 101 | 2,642.7 | 141 | 3,577.5 |
| 22 | 1,485.7 | 62 | 2,066.4 | 102 | 2,669.6 | 142 | 3,599.2 |
| 23 | 1,505.5 | 63 | 2,099.3 | 103 | 2,714.9 | 143 | 3,635.8 |
| 24 | 1,518.7 | 64 | 2,147.6 | 104 | 2,752.7 | 144 | 3,662.4 |
| 25 | 1,515.7 | 65 | 2,190.1 | 105 | 2,804.4 | 145 | 2,721.1 |
| 26 | 1,522.6 | 66 | 2,195.8 | 106 | 2,816.9 | 146 | 3,704.6 |
| 27 | 1,523.7 | 67 | 2,218.3 | 107 | 2,828.6 | 147 | 3,712.4 |
| 28 | 1,540.6 | 68 | 2,229.2 | 108 | 2,856.8 | 148 | 3,733.6 |
| 29 | 1,553.3 | 69 | 2,241.8 | 109 | 2,896.0 | 149 | 3,781.2 |
| 30 | 1,552.4 | 70 | 2,255.2 | 110 | 2,942.7 | 150 | 3,820.3 |
| 31 | 1,561.5 | 71 | 2,287.7 | 111 | 3,001.8 | 151 | 3,858.9 |
| 32 | 1,537.3 | 72 | 2,300.6 | 112 | 2,994.1 | 152 | 3,920.7 |
| 33 | 1,506.1 | 73 | 2,327.3 | 113 | 3,020.5 | 153 | 3,970.2 |
| 34 | 1,514.2 | 74 | 2,366.9 | 114 | 3,115.9 | 154 | 4,005.8 |
| 35 | 1,550.0 | 75 | 2,385.3 | 115 | 3,142.6 | 155 | 4,032.1 |
| 36 | 1,586.7 | 76 | 2,383.0 | 116 | 3,181.6 | 156 | 4,059.3 |
| 37 | 1,606.4 | 77 | 2,416.5 | 117 | 3,181.7 | 157 | 4,095.7 |
| 38 | 1,637.0 | 78 | 2,419.8 | 118 | 3,178.7 | 158 | 4,112.2 |
| 39 | 1,629.5 | 79 | 2,433.2 | 119 | 3,207.4 | 159 | 4,129.7 |
| 40 | 1,643.4 | 80 | 2,423.5 | 120 | 3,201.3 | 160 | 4,133.2 |

Reprinted from Introductory Business & Economic Forecasting, 2nd Ed., Newbold, P. and Bos, T., Cincinnati, 1994, pp. 362-3.

Let’s plot the series:

As you can see, the series is on a steady, upward climb. The mean of the series appears to be changing, and moving upward; hence the series is likely not stationary. Let’s take a look at the ACF:

Wow! The ACF for the real GDP is in sharp contrast to our random series example above. Notice the lags: they are not cutting off. Each lag is quite strong. And the fact that most of them pierce the ±1.96 standard error line is clearly proof that the series is not white noise. Since the lags in the ACF are declining very slowly, that means that terms in the series are correlated with terms several periods in the past.

Because this series is not stationary, we must transform it into a stationary time series so that we can build a model with it.
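If you want to check this kind of ACF pattern yourself, the sketch below computes sample autocorrelations in plain Python and compares them with the ±1.96 standard error band. Two assumptions are mine rather than the post's: the standard error is approximated as 1/√n (the usual large-sample value for white noise), and the series is a made-up straight-line trend rather than the GDP data.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


# Hypothetical steadily climbing series: 1, 2, ..., 40.
trend = [float(t) for t in range(1, 41)]

acf = sample_acf(trend, 10)

# Approximate 95% band, assuming a standard error of 1/sqrt(n).
bound = 1.96 / math.sqrt(len(trend))

# Lags whose autocorrelation pierces the band.
outside = [k for k, r in enumerate(acf, start=1) if abs(r) > bound]

print([round(r, 3) for r in acf])
print(round(bound, 3))
print(outside)
```

For a trending series like this one, the autocorrelations start close to 1 and fall off slowly, which is the same signature described for the GDP series.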

Removing Nonstationarity: Differencing

The most common way to remove nonstationarity is to difference the time series. We talked about differencing in our discussion on correcting multicollinearity, and we mentioned quasi-differencing in our discussion on correcting autocorrelation. The concept is the same here. Differencing a series is pretty straightforward. We subtract the first value from the second, the second value from the third, and so forth. Subtracting a period’s value from its immediate subsequent period’s value is called first differencing. The formula for a first difference is given as:

$$\Delta X_t = X_t - X_{t-1}$$

Let’s try it with our series:

When we difference our series, our plot of the differenced data looks like this:

As you can see, the differenced series is much smoother, except towards the end where we have two points where real GDP dropped or increased sharply. The ACF looks much better too:

As you can see, only the first lag breaks through the ±1.96 standard errors line. Since it is only 5% of the lags displayed, we can conclude that the differenced series is stationary.
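First differencing is easy to do by hand or in code. Here is a small Python sketch applied to the first five quarters from the table above; the helper name `difference` is my own.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# First five quarters (t = 1..5) from the real GDP table in this post.
x = [1148.2, 1181.0, 1225.3, 1260.2, 1286.6]

# Rounded to one decimal to hide floating-point noise.
first_diff = [round(d, 1) for d in difference(x)]
print(first_diff)  # [32.8, 44.3, 34.9, 26.4]
```

Note that the differenced series is one observation shorter than the original, because the first quarter has no prior value to subtract.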

Second Order Differencing

Sometimes, first differencing doesn’t eliminate all nonstationarity, so a differencing must be performed on the differenced series. This is called second order differencing. Differencing can go on multiple times, but very rarely does an analyst need to go beyond second order differencing to achieve stationarity. The formula for second order differencing is as follows:

$$\Delta^2 X_t = \Delta X_t - \Delta X_{t-1} = X_t - 2X_{t-1} + X_{t-2}$$

where $\Delta X_t = X_t - X_{t-1}$ is the first difference. We won’t show an example of second order differencing in this post, and it is important to note that second order differencing is not to be confused with second differencing, which is to subtract the value two periods prior to the current period from the value of the current period, that is, $X_t - X_{t-2}$.
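Because the two terms are so easy to mix up, here is a short Python sketch that computes both on the first five quarters of the GDP table. It is only an illustration of the distinction, and the helper name `difference` is an assumption of mine.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# First five quarters (t = 1..5) from the real GDP table in this post.
x = [1148.2, 1181.0, 1225.3, 1260.2, 1286.6]

# Second ORDER differencing: difference the first differences.
second_order = [round(d, 1) for d in difference(difference(x))]

# Second differencing: subtract the value two periods back, X_t - X_(t-2).
lag_two = [round(d, 1) for d in difference(x, lag=2)]

print(second_order)  # [11.5, -9.4, -8.5]
print(lag_two)       # [77.1, 79.2, 61.3]
```

The two results have the same length here but answer different questions: the first measures the change in the quarter-to-quarter change, the second measures the change over a two-quarter span.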

Seasonal Differencing

Seasonality can greatly affect a time series and make it appear nonstationary. As a result, the data set must be differenced for seasonality, very similar to seasonally adjusting a time series before performing a regression analysis. We will discuss seasonal differencing later in this ARIMA miniseries.

Recap

Before we can generate forecasts upon a time series, we must be sure our data set is stationary. Trend and seasonal components must be removed in order to generate accurate forecasts. We built on last week’s discussion of the autocorrelation function (ACF) to show how it could be used to detect stationarity – or the absence of it. When a data series is not stationary, one of the key ways to remove the nonstationarity is through differencing. The concept behind differencing is not unlike the other methods we’ve used in past discussions on forecasting: seasonal adjustment, seasonal dummy variables, lagging dependent variables, and time series decomposition.

Next Forecast Friday Topic: MA, AR, and ARMA Models

Our discussion of ARIMA models begins to hit critical mass with next week’s discussion on moving average (MA), autoregressive (AR), and autoregressive moving average (ARMA) models. This is where we begin the process of identifying the model to build for a dataset, and how to use the ACF and partial ACF (PACF) to determine whether an MA, AR, or ARMA model is the best fit for the data. That discussion will lay the foundation for our next three Forecast Friday discussions, where we delve deeply into ARIMA models.


### Multiple Regression: Specification Bias

July 1, 2010

(Eleventh in a series)

In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.

Revisiting our Multiple Regression Example

Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|---|---|---|---|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Yt) was positively related to net revenues per deposit dollar (one of our independent variables, X1t), and negatively related to the number of S&L offices (our other independent variable, X2t). When we ran our regression, we got the following model:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

We also checked to see if the model parameters were significant, and obtained the following information:

| Parameter | Value | T-Statistic | Significant? |
|---|---|---|---|
| Intercept | 1.5645000 | 19.70 | Yes |
| B1t | 0.2372000 | 4.27 | Yes |
| B2t | -0.0002490 | -7.77 | Yes |

We also had a coefficient of determination, R², of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.

Welcome to the World of Specification Bias…

Let’s deliberately leave out the number of S&L offices (X2t) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:

Yt = 1.32616 – 0.16913X1t

We also get an R² of 0.495. The t-statistics for our intercept and parameter B1t are as follows:

| Parameter | Value | T-Statistic | Significant? |
|---|---|---|---|
| Intercept | 1.32616 | 9.57 | Yes |
| B1t | -0.16913 | -4.75 | Yes |

Compare these new results with our previous results and what do you notice? The results of our second regression are in sharp contrast to those of our first regression. Our new model has far less explanatory power (R² dropped from 0.865 to 0.495), and the sign of the parameter estimate for net revenue per deposit dollar has changed: the coefficient of X1t was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.
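You can replicate the second (incomplete) regression without any special software. The sketch below uses the textbook simple-regression formulas in plain Python, with the Y and X1 columns copied from the table above; the function name `simple_ols` is my own. The results should come out close to the intercept, slope, and R² quoted above.

```python
# Percentage profit margin (Y) and net revenues per deposit dollar (X1),
# years 1-25, copied from the S&L table in this post.
y = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
     0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
     0.56, 0.41, 0.51, 0.47, 0.32]
x1 = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
      3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
      4.70, 4.58, 4.69, 4.71, 4.78]


def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, R-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


intercept, slope, r_squared = simple_ols(x1, y)
print(round(intercept, 5), round(slope, 5), round(r_squared, 3))
```

The negative slope is the specification bias at work: with the number of offices left out, X1 is forced to carry the effect of the variable it is correlated with.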

… and to the Land of Autocorrelation…

Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated with one another, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week, we computed the Durbin-Watson test statistic, d, which is an indicator of autocorrelation. It is bad to have either positive autocorrelation (d close to zero), or negative autocorrelation (d close to 4). Generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was pretty much nonexistent. In our second model, d=0.85, suggesting the presence of significant positive autocorrelation!
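As a reminder of how d behaves, here is a small Python sketch of the Durbin-Watson statistic, which is the sum of squared changes in consecutive residuals divided by the sum of squared residuals. The two residual series are hypothetical and deliberately extreme; they are not the residuals from either S&L regression.

```python
def durbin_watson(residuals):
    """d = sum of (e_t - e_(t-1))^2 divided by sum of e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that stay on one side for a while, then switch:
# neighbors look alike, so d lands well below 2 (positive autocorrelation).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

# Hypothetical residuals that flip sign every period:
# neighbors are opposites, so d lands well above 2 (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_persistent = durbin_watson(persistent)
d_alternating = durbin_watson(alternating)

print(round(d_persistent, 3))   # 0.667
print(round(d_alternating, 3))  # 3.333
```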

How did this happen? Basically, when an important variable is omitted from regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.

…Which Leads to Yet Another Violation!

The presence of autocorrelation in our second regression reveals the presence of another violation, not in the incomplete regression, but in the full regression. As the sentence above read: “If the omitted independent variable is correlated with any of the included independent variables…” Remember the other assumption, “no linear relationship between two or more independent variables”? Basically, the detection of autocorrelation in the incomplete regression revealed that the full regression violated this very assumption, and thus exhibits multicollinearity! Generally, a coefficient changing between positive and negative (in either direction) when one or more variables are omitted is an indicator of multicollinearity.

So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions of observations. However, with time series data, the number of observations available is limited to how many periods of data have been recorded.

Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.

And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!

Bottom Line:

The lesson for us as forecasters and analysts, therefore, is that we must accept that models are far from perfect and we must weigh the impact of various regression model specifications. Is the multicollinearity that is present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or even forecast accuracy? Building the model is easy – but it’s these normative considerations that are challenging.

Next Forecast Friday Topic: Building Regression Models Using Excel

In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4th of July weekend!


### Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the price of competitors’ products, the time of the year (if our product is seasonal), and also demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will be spending the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but is quite easy to forget if not used often.

Another thing to note: regression analysis is used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables on static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or some statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.

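Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. Also, the coefficients (your parameter estimates) will have similar subscripts to correspond to their respective independent variables. Hence, your model will take on the form:

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

Now, how do you go about computing α, β1 and β2? The process is similar to that of a two-variable model, but a little more involved. Take a look, writing lowercase letters for deviations from the mean ($y_i = Y_i - \bar{Y}$, $x_{1i} = X_{1i} - \bar{X}_1$, $x_{2i} = X_{2i} - \bar{X}_2$):

$$\beta_1 = \frac{\left(\sum x_{1i} y_i\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{2i} y_i\right)\left(\sum x_{1i} x_{2i}\right)}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2}$$

$$\beta_2 = \frac{\left(\sum x_{2i} y_i\right)\left(\sum x_{1i}^2\right) - \left(\sum x_{1i} y_i\right)\left(\sum x_{1i} x_{2i}\right)}{\left(\sum x_{1i}^2\right)\left(\sum x_{2i}^2\right) - \left(\sum x_{1i} x_{2i}\right)^2}$$

$$\alpha = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2$$

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.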

What do you notice about the formulas for computing β1 and β2? First, you notice that the independent variables, X1 and X2, are included in the calculation for each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are likely to be related linearly as well. In fact, they need to be in order to perform multiple regression analysis. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to those variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 will be incorrect because X1 has some relationship with X2. Moreover, X2’s values are likely to be accounted for in the error terms, and because of its relationship with X1, X1 will be correlated with the error term, violating the second assumption above. Hence, you will end up with an incorrect, biased estimator for your regression coefficient, β1.
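To see the bias in action, here is a small Python sketch using the standard least-squares solution for two independent variables. Everything in it is an illustration under my own assumptions: the function names are hypothetical, and the data are made up so that Y is exactly 1 + 2·X1 − 3·X2, which lets you check the answers by eye.

```python
def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def ols_two_regressors(y, x1, x2):
    """Least-squares estimates (alpha, beta1, beta2) for y = a + b1*x1 + b2*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2  # zero if x1 and x2 are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2


def ols_one_regressor(y, x):
    """Least-squares estimates (alpha, beta) for y = a + b*x."""
    b = cross(x, y) / cross(x, x)
    return mean(y) - b * mean(x), b


# Hypothetical data in which X1 and X2 tend to move together.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]  # true model, no noise

full = ols_two_regressors(y, x1, x2)
short = ols_one_regressor(y, x1)  # X2 wrongly omitted

print([round(v, 6) for v in full])   # recovers 1, 2, -3
print([round(v, 6) for v in short])  # slope on X1 is now -1, not +2
```

With X2 left out, the coefficient on X1 absorbs part of X2's negative effect and even changes sign, although the true effect of X1 is +2.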

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, and that reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable you are considering, both dependent and independent, you should have at least five data points. So, for a model with three independent variables (four variables in all, counting the dependent variable), your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re doing a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables. Every time you use a particular value of Fahrenheit temperature, you will get the same value of Celsius temperature. In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. A relationship between independent variables need not be perfectly linear for multicollinearity to exist. Highly correlated variables can do the same thing. For example, independent variables such as “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage” are examples of independent variables that are highly correlated.

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.
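The Celsius/Fahrenheit case can be checked numerically. In the minimal sketch below (with hypothetical temperature readings and helper names of my own), the sample correlation between the two variables is exactly 1, and the quantity S11·S22 − S12², which the usual two-variable least-squares solution divides by, comes out to zero. That is why no unique coefficients exist when the relationship is perfectly linear.

```python
import math


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


# Hypothetical temperature readings.
celsius = [0.0, 10.0, 20.0, 30.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # exact linear function of Celsius

s_cc = cross(celsius, celsius)
s_ff = cross(fahrenheit, fahrenheit)
s_cf = cross(celsius, fahrenheit)

correlation = s_cf / math.sqrt(s_cc * s_ff)

# Denominator of the two-regressor least-squares solution.
det = s_cc * s_ff - s_cf ** 2

print(correlation)  # 1.0
print(det)          # 0.0
```

With highly (but not perfectly) correlated variables, this denominator is small rather than zero, so the estimates exist but become very sensitive to small changes in the data.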

Now Can We Get Into Multiple Regression????

Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. This data set is 25 annual observations to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X1) and number of offices (X2). The data are as follows:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|---|---|---|---|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S&L is making more revenue on the same dollar, it suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up?  They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.
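A quick way to see this interpretation is to plug values into the estimated equation and change one variable at a time. The sketch below does that in plain Python. The starting values (net revenues of 4.00 and 7,000 offices) are hypothetical inputs picked from within the range of the data, and `predicted_margin` is just an illustrative helper name.

```python
def predicted_margin(net_revenue, offices):
    """Estimated equation from the post: Yt = 1.56450 + 0.23720*X1t - 0.000249*X2t."""
    return 1.56450 + 0.23720 * net_revenue - 0.000249 * offices


# Hypothetical starting point inside the range of the data.
base = predicted_margin(4.00, 7000)

# Raise net revenues by one unit, holding the number of offices fixed.
revenue_effect = predicted_margin(5.00, 7000) - base

# Add one office, holding net revenues fixed.
office_effect = predicted_margin(4.00, 7001) - base

print(round(base, 4))            # 0.7703
print(round(revenue_effect, 4))  # 0.2372
print(round(office_effect, 6))   # -0.000249
```

Whatever starting point you choose, the two differences come out the same, because each coefficient measures the change in predicted margin for a one-unit change in its own variable with the other held constant.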

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R², and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.
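If you would like to check the estimated equation and the R² yourself, the following is a minimal sketch in plain Python. It uses the textbook least squares formulas for two independent variables, written in terms of deviations from the means. This is an assumption about how one might reproduce the numbers, not necessarily how they were originally computed. The variable names are illustrative.

```python
# Data transcribed from the table in this post (25 annual observations).
margin = [0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
          0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
          0.56, 0.41, 0.51, 0.47, 0.32]
net_revenue = [3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78]
offices = [7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
           6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
           8468, 8717, 8991, 9179, 9318]


def mean(values):
    return sum(values) / len(values)


def cross_deviations(a, b):
    """Sum of cross-products of deviations from the means."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


s11 = cross_deviations(net_revenue, net_revenue)
s22 = cross_deviations(offices, offices)
s12 = cross_deviations(net_revenue, offices)
s1y = cross_deviations(net_revenue, margin)
s2y = cross_deviations(offices, margin)
syy = cross_deviations(margin, margin)

# Least squares solution for two independent variables plus an intercept.
determinant = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / determinant
b2 = (s11 * s2y - s12 * s1y) / determinant
b0 = mean(margin) - b1 * mean(net_revenue) - b2 * mean(offices)

# R-squared: explained variation divided by total variation in Y.
r_squared = (b1 * s1y + b2 * s2y) / syy

print(f"intercept = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.6f}")
print(f"R-squared = {r_squared:.3f}")
```

In practice you would let a spreadsheet or statistical package do this work, but writing out the formulas once shows where each number in the output comes from.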

Are the Parameter Estimates Statistically Significant?

We have 25 observations and three parameters (two coefficients for the independent variables and one intercept), hence we have 22 degrees of freedom (25-3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the confidence intervals around our parameter estimates would contain the true parameter approximately 95 times. To do this, we need to look at the t-values for each parameter estimate. For a two-tailed test at the 5% significance level with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, then there is a statistically significant positive relationship between the independent variable and the dependent variable; if the t-statistic for the parameter estimate is less than -2.074, then there is a statistically significant negative relationship. This is what we get:

| Parameter | Value | T-Statistic | Significant? |
|---|---|---|---|
| Intercept | 1.5645000 | 19.70 | Yes |
| B1t | 0.2372000 | 4.27 | Yes |
| B2t | -0.0002490 | -7.77 | Yes |

So, yes, all our parameter estimates are significant.
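The decision rule used here is simple enough to write down as a short sketch. The t-statistics and the critical value of 2.074 are taken straight from this post; the dictionary keys and variable names are illustrative only.

```python
# Critical value quoted in the post: two-tailed test, 5% level, 22 degrees of freedom.
CRITICAL_T = 2.074

# t-statistics as reported in the table above.
t_statistics = {
    "Intercept": 19.70,
    "Net revenues per deposit dollar (X1)": 4.27,
    "Number of offices (X2)": -7.77,
}

results = {}
for name, t_value in t_statistics.items():
    # Significant when the t-statistic lies beyond the critical value in either tail.
    significant = abs(t_value) > CRITICAL_T
    direction = "positive" if t_value > 0 else "negative"
    results[name] = (significant, direction)
    print(f"{name}: t = {t_value:.2f}, significant = {significant}, sign = {direction}")
```

The same comparison works for any other confidence level or sample size; only the critical value changes.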

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished. We’re going to stop here and continue with further analysis of this example next week. Next week, we will discuss computing the 95% confidence interval for the parameter estimates; determining whether the model is valid; and checking for autocorrelation. The following Forecast Friday (July 1) blog post will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.