Posts Tagged ‘durbin-watson’

Forecast Friday Topic: Correcting Autocorrelation

August 5, 2010

(Sixteenth in a series)

Last week, we discussed how to detect autocorrelation – the violation of the regression assumption that the error terms are not correlated with one another – in your forecasting model. Models exhibiting autocorrelation have parameter estimates that are inefficient, and R2s and t-ratios that seem overly inflated. As a result, your model generates forecasts that are too good to be true and has a tendency to miss turning points in your time series. In last week’s Forecast Friday post, we showed you how to diagnose autocorrelation: examining the model’s parameter estimates, visually inspecting the data, and computing the Durbin-Watson statistic. Today, we’re going to discuss how to correct it.

Revisiting our Data Set

Recall our data set: average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987, as reported in the Survey of Current Business (September issues from 1986 and 1987), and reprinted in Data Analysis Using Microsoft ® Excel, by Michael R. Middleton, page 219:

Month

t

Wage

Jan-86

1

5.82

Feb-86

2

5.79

Mar-86

3

5.8

Apr-86

4

5.81

May-86

5

5.78

Jun-86

6

5.79

Jul-86

7

5.79

Aug-86

8

5.83

Sep-86

9

5.91

Oct-86

10

5.87

Nov-86

11

5.87

Dec-86

12

5.9

Jan-87

13

5.94

Feb-87

14

5.93

Mar-87

15

5.93

Apr-87

16

5.94

May-87

17

5.89

Jun-87

18

5.91

We generated the following regression model:

Ŷ = 5.7709 + 0.0095t

Our model had an R2 of .728, and t-ratios of about 368 for the intercept term and 6.55 for the parameter estimate, t. The Durbin-Watson statistic was 1.05, indicating positive autocorrelation. How do we correct for autocorrelation?

Lagging the Dependent Variable

One of the most common remedies for autocorrelation is to lag the dependent variable one or more periods and then make the lagged dependent variable the independent variable. So, in our data set above, you would take the first value of the dependent variable, $5.82, and make it the independent variable for period 2, with $5.79 being the dependent variable; in like manner, $5.79 will also become the independent variable for the next period, whose dependent variable has a value of $5.80, and so on. Since the error terms from one period to another exhibit correlation, by using the previous value of the dependent variable to predict the next one, you reduce that correlation of errors.

You can lag for as many periods as you need to; however, note that you lose the first observation when you lag one period (unless you know the previous period before the start of the data set, you have nothing to predict the first observation). You’ll lose two observations if you lag two periods, and so on. If you have a very small data set, the loss of degrees of freedom can lead to Type II error – failing to identify a parameter estimate as significant, when in fact it is. So, you must be careful here.

In this case, by lagging our data by one period, we have the following data set:

Month

Wage

Lag1 Wage

Feb-86

$5.79

$5.82

Mar-86

$5.80

$5.79

Apr-86

$5.81

$5.80

May-86

$5.78

$5.81

Jun-86

$5.79

$5.78

Jul-86

$5.79

$5.79

Aug-86

$5.83

$5.79

Sep-86

$5.91

$5.83

Oct-86

$5.87

$5.91

Nov-86

$5.87

$5.87

Dec-86

$5.90

$5.87

Jan-87

$5.94

$5.90

Feb-87

$5.93

$5.94

Mar-87

$5.93

$5.93

Apr-87

$5.94

$5.93

May-87

$5.89

$5.94

Jun-87

$5.91

$5.89

 

So, we have created a new independent variable, Lag1_Wage. Notice that we are not going to regress time period t as an independent variable. This doesn’t mean that we should or shouldn’t; in this case, we’re only trying to demonstrate the effect of the lagging.

Rerunning the Regression

Now we do our regression analysis. We come up with the following equation:

Ŷ = 0.8253 + 0.8600*Lag1_Wage

Apparently, from this model, each $1 change in hourly wage from the previous month is associated with an average $0.86 change in hourly wages for the current month. The R2 for this model was virtually unchanged, 0.730. However, the Durbin-Watson statistic is now 2.01 – just about the total eradication of autocorrelation. Unfortunately, the intercept has a t-ratio of 1.04, indicating it is not significant. The parameter estimate for Lag1_Wage is about 6.37, not much different than the parameter estimate for t in our previous model. However, we did get rid of the autocorrelation.

The statistically insignificant intercept term resulting from this lagging is a result of the Type II error involved with the loss of a degree of freedom in a small sample size. Perhaps if we had several more months of data, we might have had a significant intercept estimate.

Other Approaches to Correcting Autocorrelation

There are other approaches to correcting autocorrelation. One other important way might be to identify important independent variables that have been omitted from the model. Perhaps if we had data on the average years work experience of the textile and apparel labor force from month to month, that might have increased our R2, and reduced correlations in the error term. Another thing we could do is difference the data. Differencing works like lagging, only we subtract the value of the dependent and independent variables of the first observation from their respective values in the second observation; then we subtract those of the second observation’s original values from those of the third, and so on. Then we run a regression on the differences in observations. The problem here is that again, your data set is reduced by one observation and your transformed model will not have an intercept term, which can cause issues in some studies.

Other approaches to correcting autocorrelation include quasi-differencing, the Cochran-Orcutt Procedure, the Hildreth-Lu Procedure, and the Durbin Two-Step Method. These methods are iterative, require a lot of tedious effort and are beyond the scope of our post. But many college-level forecasting textbooks have sections on these procedures if you’re interested in further reading on them.

Next Forecast Friday Topic: Detecting Heteroscedasticity

Next week, we’ll discuss the last of the regression violations, heteroscedasticity, which is the violation of the assumption that error terms have a constant variance. We will discuss why heteroscedasticity exists and how to diagnose it. The week after that, we’ll discuss remedying heteroscedasticity. Once we have completed our discussions on the regression violations, we will spend a couple of weeks discussing regression modeling techniques like transforming independent variables, using categorical variables, adjusting for seasonality, and other regression techniques. These topics will be far less theoretical and more practical in terms of forecasting.

Forecast Friday Topic: Detecting Autocorrelation

July 29, 2010

(Fifteenth in a series)

We have spent the last few Forecast Friday posts discussing violations of different assumptions in regression analysis. So far, we have discussed the effects of specification bias and multicollinearity on parameter estimates, and their corresponding effect on your forecasts. Today, we will discuss another violation, autocorrelation, which occurs when sequential residual (error) terms are correlated with one another.

When working with time series data, autocorrelation is the most common problem forecasters face. When the assumption of uncorrelated residuals is violated, we end up with models that have inefficient parameter estimates and upwardly-biased t-ratios and R2 values. These inflated values make our forecasting model appear better than it really is, and can cause our model to miss turning points. Hence, if you’re model is predicting an increase in sales and you, in actuality, see sales plunge, it may be due to autocorrelation.

What Does Autocorrelation Look Like?

Autocorrelation can take on two types: positive or negative. In positive autocorrelation, consecutive errors usually have the same sign: positive residuals are almost always followed by positive residuals, while negative residuals are almost always followed by negative residuals. In negative autocorrelation, consecutive errors typically have opposite signs: positive residuals are almost always followed by negative residuals and vice versa.

In addition, there are different orders of autocorrelation. The simplest, most common kind of autocorrelation, first-order autocorrelation, occurs when the consecutive errors are correlated. Second-order autocorrelation occurs when error terms two periods apart are correlated, and so forth. Here, we will concentrate solely on first-order autocorrelation.

You will see a visual depiction of positive autocorrelation later in this post.

What Causes Autocorrelation?

The two main culprits for autocorrelation are sluggishness in the business cycle (also known as inertia) and omitted variables from the model. At various turning points in a time series, inertia is very common. At the time when a time series turns upward (downward), its observations build (lose) momentum, and continue going up (down) until the series reaches its peak (trough). As a result, successive observations and the error terms associated with them depend on each other.

Another example of inertia happens when forecasting a time series where the same observations can be in multiple successive periods. For example, I once developed a model to forecast enrollment for a community college, and found autocorrelation to be present in my initial model. This happened because many of the students enrolled during the spring term were also enrolled in the previous fall term. As a result, I needed to correct for that.

The other main cause of autocorrelation is omitted variables from the model. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable has a positive correlation with the dependent variable, it is likely to cause error terms that are positively correlated.

How Do We Detect Autocorrelation?

To illustrate how we go about detecting autocorrelation, let’s first start with a data set. I have pulled the average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987. The original source was the Survey of Current Business, September issues from 1986 and 1987, but this data set was reprinted in Data Analysis Using Microsoft ® Excel, by Michael R. Middleton, page 219:

Month

t

Wage

Jan-86

1

5.82

Feb-86

2

5.79

Mar-86

3

5.8

Apr-86

4

5.81

May-86

5

5.78

Jun-86

6

5.79

Jul-86

7

5.79

Aug-86

8

5.83

Sep-86

9

5.91

Oct-86

10

5.87

Nov-86

11

5.87

Dec-86

12

5.9

Jan-87

13

5.94

Feb-87

14

5.93

Mar-87

15

5.93

Apr-87

16

5.94

May-87

17

5.89

Jun-87

18

5.91

Now, let’s run a simple regression model, using time period t as the independent variable and Wage as the dependent variable. Using the data set above, we derive the following model:

Ŷ = 5.7709 + 0.0095t

Examine the Model Output

Notice also the following model diagnostic statistics:

R2=

0.728

Variable

Coefficient

t-ratio

Intercept

5.7709

367.62

t

0.0095

6.55

 

You can see that the R2 is a high number, with changes in t explaining nearly three-quarters the variation in average hourly wage. Note also the t-ratios for both the intercept and the parameter estimate for t. Both are very high. Recall that a high R2 and high t-ratios are symptoms of autocorrelation.

Visually Inspect Residuals

Just because a model has a high R2 and parameters with high t-ratios doesn’t mean autocorrelation is present. More work must be done to detect autocorrelation. Another way to check for autocorrelation is to visually inspect the residuals. The best way to do this is through plotting the average hourly wage predicted by the model against the actual average hourly wage, as Middleton has done:

Notice the green line representing the Predicted Wage. It is a straight, upward line. This is to be expected, since the independent variable is sequential and shows an increasing trend. The red line depicts the actual wage in the time series. Notice that the model’s forecast is higher than actual for months 5 through 8, and for months 17 and 18. The model also underpredicts for months 12 through 16. This clearly illustrates the presence of positive, first-order autocorrelation.

The Durbin-Watson Statistic

Examining the model components and visually inspecting the residuals are intuitive, but not definitive ways to diagnose autocorrelation. To really be sure if autocorrelation exists, we must compute the Durbin-Watson statistic, often denoted as d.

In our June 24 Forecast Friday post, we demonstrated how to calculate the Durbin-Watson statistic. The actual formula is:

That is, beginning with the error term for the second observation, we subtract the immediate previous error term from it; then we square the difference. We do this for each observation from the second one onward. Then we sum all of those squared differences together. Next, we square the error terms for each observation, and sum those together. Then we divide the sum of squared differences by the sum of squared error terms, to get our Durbin-Watson statistic.

For our example, we have the following:

t

Error

Squared Error

et-et-1

Squared Difference

1

0.0396

0.0016

     

2

0.0001

0.0000

(0.0395) 0.0016

3

0.0006

0.0000

0.0005 0.0000

4

0.0011

0.0000

0.0005 0.0000

5

(0.0384)

0.0015

(0.0395) 0.0016

6

(0.0379)

0.0014

0.0005 0.0000

7

(0.0474)

0.0022

(0.0095) 0.0001

8

(0.0169)

0.0003

0.0305 0.0009

9

0.0536

0.0029

0.0705 0.0050

10

0.0041

0.0000

(0.0495) 0.0024

11

(0.0054)

0.0000

(0.0095) 0.0001

12

0.0152

0.0002

0.0205 0.0004

13

0.0457

0.0021

0.0305 0.0009

14

0.0262

0.0007

(0.0195) 0.0004

15

0.0167

0.0003

(0.0095) 0.0001

16

0.0172

0.0003

0.0005 0.0000

17

(0.0423)

0.0018

(0.0595) 0.0035

18

(0.0318)

0.0010

0.0105 0.0001
  

Sum:

0.0163

  

0.0171

 

To obtain our Durbin-Watson statistic, we plug our sums into the formula:

= 1.050

What Does the Durbin-Watson Statistic Tell Us?

Our Durbin-Watson statistic is 1.050. What does that mean? The Durbin-Watson statistic is interpreted as follows:

  • If d is close to zero (0), then positive autocorrelation is probably present;
  • If d is close to two (2), then the model is likely free of autocorrelation; and
  • If d is close to four (4), then negative autocorrelation is probably present.

As we saw from our visual examination of the residuals, we appear to have positive autocorrelation, and the fact that our Durbin-Watson statistic is about halfway between zero and two suggests the presence of positive autocorrelation.

Next Forecast Friday Topic: Correcting Autocorrelation

Today we went through the process of understanding the causes and effect of autocorrelation, and how to suspect and detect its presence. Next week, we will discuss how to correct for autocorrelation and eliminate it so that we can have more efficient parameter estimates.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation” to perform the regression. In fact, if we did that with our model, we end up with a model (not shown here) that has an R2=0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related with hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value for both the hours of sleep and hours of studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply their differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering interaction term, which means you’re taking into account central tendency in your data.

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation – subtract the immediate prior observation’s values for those variables from the current observation. Now, instead of working with original data, you are working with the change in data from one period to the next. Differencing eliminates multicollinearity by removing the trend component of the time series. If all independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. In this case, subtracting the period before was taking a “first difference.” If we subtracted two periods before, it’s a “second difference,” and so on. Note also that with differencing, we lose the first observations in the data, depending on how many periods we have to difference, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type I Error: concluding that an independent variable is not statistically significant when, in truth it is.

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

  1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
  2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
  3. You should suspect multicollinearity if:
    1. You have a high R2 but low t-statistics;
    2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
  4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
  5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy for multicollinearity contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity on time and resources needed to do so, and the resulting impact on your forecast accuracy.

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Multiple Regression: Specification Bias

July 1, 2010

(Eleventh in a series)

In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.

Revisiting our Multiple Regression Example

Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:

Year

Percentage Profit Margin (Yt)

Net Revenues Per Deposit Dollar (X1t)

Number of Offices (X2t)

1

0.75

3.92

7,298

2

0.71

3.61

6,855

3

0.66

3.32

6,636

4

0.61

3.07

6,506

5

0.70

3.06

6,450

6

0.72

3.11

6,402

7

0.77

3.21

6,368

8

0.74

3.26

6,340

9

0.90

3.42

6,349

10

0.82

3.42

6,352

11

0.75

3.45

6,361

12

0.77

3.58

6,369

13

0.78

3.66

6,546

14

0.84

3.78

6,672

15

0.79

3.82

6,890

16

0.70

3.97

7,115

17

0.68

4.07

7,327

18

0.72

4.25

7,546

19

0.55

4.41

7,931

20

0.63

4.49

8,097

21

0.56

4.70

8,468

22

0.41

4.58

8,717

23

0.51

4.69

8,991

24

0.47

4.71

9,179

25

0.32

4.78

9,318

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Yt) was positively related to net revenues per deposit dollar (one of our independent variables, X1t), and negatively related to the number of S&L offices (our other independent variable, X2t). When we ran our regression, we got the following model:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

We also checked to see if the model parameters were significant, and obtained the following information:

Parameter

Value

T-Statistic

Significant?

Intercept

1.5645000

19.70

Yes

B1t

0.2372000

4.27

Yes

B2t

(0.0002490)

(7.77)

Yes

We also had a coefficient of determination – R2 – of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.

Welcome to the World of Specification Bias…

Let’s deliberately leave out the number of S&L offices (X2t) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:

Yt = 1.32616 – 0.16913X1t

We also get an R2 of 0.495. The t-statistics for our intercept and parameter B1t are as follows:

Parameter

Value

T-Statistic

Significant?

Intercept

1.32616

9.57

Yes

B1t

(0.16913)

(4.75)

Yes

 

Compare these new results with our previous results and what do you notice? The results of our second regression are in sharp contrast to those of our first regression. Our new model has far less explanatory power – R2 dropped from 0.865 to 0.495 – and the sign of the parameter estimate for net revenue per deposit dollar has changed: The coefficient of X1t was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.

… and to the Land of Autocorrelation…

Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated with one another, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week, we computed the Durbin-Watson test statistic, d, which is an indicator of autocorrelation. It is bad to have either positive autocorrelation (d close to zero), or negative autocorrelation (d close to 4). Generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was pretty much nonexistent. In our second model, d=0.85, suggesting the presence of significant positive autocorrelation!

How did this happen? Basically, when an important variable is omitted from regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.

…Which Leads to Yet Another Violation!

The presence of autocorrelation in our second regression reveals the presence of another violation, not in the incomplete regression, but in the full regression. As the sentence above read: “if the independent variable is correlated with any of the included independent variables…” Remember the other assumption: “no linear relationship between two or more independent variables?” Basically, the detection of autocorrelation in the incomplete regression revealed that the full regression violated this very assumption – and thus exhibits multicollinearity! Generally, a coefficient changing between positive and negative (either direction) when one or more variables is omitted is an indicator of multicollinearity.

So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions of observations. However, with time series data, the number of observations available is limited to how many periods of data have been recorded.

Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.

And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!

Bottom Line:

The lesson, for us as forecasters and analysts, therefore is that we must accept that models are far from perfect and we must weigh the impact of various regression model specifications. Is the multicollinearity that is present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or even forecast accuracy? Building the model is easy – but it’s these normative considerations that’s challenging.

Next Forecast Friday Topic: Building Regression Models Using Excel

In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4th of July weekend!

*************************

Analysights is now on Facebook!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page.

Forecast Friday Topic: Multiple Regression Analysis (continued)

June 24, 2010

(Tenth in a series)

Today we resume our discussion of multiple regression analysis. Last week, we built a model to determine the extent of any relationship between U.S. savings & loan associations’ percent profit margin and two independent variables, net revenues per deposit dollar and number of S&L offices. Today, we will compute the 95% confidence interval for each parameter estimate; determine whether the model is valid; check for autocorrelation; and use the model to forecast. Recall that our resulting model was:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

Where Yt is the percent profit margin for the S&L in Year t; X1t is the net revenues per deposit dollar in Year t; and X2t is the number of S&L offices in the U.S. in Year t. Recall that the R2 is .865, indicating that 86.5% of the change in percentage profit margin is explained by changes in net revenues per deposit dollar and number of S&L offices.

Determining the 95% Confidence Interval for the Partial Slope Coefficients

In multiple regression analysis, since there are multiple independent variables, the parameter estimates for each independent variable both impact the slope of the line; hence the coefficients β1t and β2t are referred to as partial slope estimates. As with simple linear regression, we need to determine the 95% confidence interval for each parameter estimate, so that we could get an idea where the true population parameter lies. Recall from our June 3 post, we did that by determining the equation for the standard error of the estimate, sε, and then the standard error of the regression slope, sb. That worked well for simple regression, but for multiple regression, it is more complicated. Unfortunately, deriving the standard error of the partial regression coefficients requires the use of linear algebra, and would be too complicated to discuss here. Several statistical programs and Excel compute these values for us. So, we will state the values of sb1 and sb2 and go from there.

Sb1=0.05556

Sb2=0.00003

Also, we need our critical-t value for 22 degrees of freedom, which is 2.074.

Hence, our 95% confidence interval for β1 is denoted as:

0.23720 ± 2.074 × 0.05556

=0.12197 to 0.35243

Hence, we are saying that we can be 95% confident that the true parameter β1 lies somewhere between the values of 0.12197 and 0.35243.

Similarly, for β2, the procedure is similar:

-0.000249 ± 2.074 × 0.00003

=-0.00032 to -0.00018

Hence, we can be 95% confident that the true parameter β2 lies somewhere between the values of -0.00032 and -0.00018. Also, the confidence interval for the intercept, α, ranges from 1.40 to 1.73.

Note that in all of these cases, the confidence interval does not contain a value of zero within its range. The confidence intervals for α and β1 are positive; that for β2 is negative. If any parameter’s confidence interval ranges crossed zero, then the parameter estimate would not be significant.

Is Our Model Valid?

The next thing we want to do is determine if our model is valid. When validating our model we are trying to prove that our independent variables explain the variation in the dependent variable. So we start with a hypothesis test:

H0: β1 = β2 = 0

HA: at least one β ≠ 0

Our null hypothesis says that our independent variables, net revenue per deposit dollar and number of S&L offices, explain nothing of the variation in an S&L percentage profit margin, and hence, that our model is not valid. Our alternative hypothesis says that at least one of our independent variable explains some of the variation in an S&L’s percentage profit margin, and thus is valid.

So how do we do it? Enter the F-test. Like the T-test, the F-test is a means for hypothesis testing. Let’s first start by calculating our F-statistic for our model. We do that with the following equation:

Remember that RSS is the regression sum of squares and ESS is the error sum of squares. The May 27th Forecast Friday post showed you how to calculate RSS and ESS. For this model, our RSS=0.4015, and our ESS=0.0625; k is the number of independent variables, and n is the sample. Our equation reduces to:


= 70.66

If our Fcalc is greater than the critical F value for the distribution, then we can reject our null hypothesis and conclude that there is strong evidence that at least one of our independent variables explains some of the variation in an S&L’s percentage profit margin. How do we determine our critical F? There is yet another table in any statistics book or statistics Web site called the “F Distribution” table. In it, you look for two sets of degrees of freedom – one for the numerator and one for the denominator of your Fcalc equation. In the numerator, we have two degrees of freedom; in the denominator, 22. So we look at the F Distribution table notice the columns represent numerator degrees of freedom, and the rows, denominator degrees of freedom. When we find column (2), row (22), we end up with an F-value of 5.72.

Our Fcalc is greater than that, so we can conclude that our model is valid.

Is Our Model Free of Autocorrelation?

Recall from our assumptions that none of our error terms should be correlated with one another. If they are, autocorrelation results, rendering our parameter estimates inefficient. Check for autocorrelation, we need to look at our error terms, when we compare our predicted percentage profit margin, Ŷ, with our actual, Y:

Year

Percentage Profit Margin

Actual (Yt)

Predicted by Model (Ŷt)

Error

1

0.75

0.68

(0.0735)

2

0.71

0.71

0.0033

3

0.66

0.70

0.0391

4

0.61

0.67

0.0622

5

0.7

0.68

(0.0162)

6

0.72

0.71

(0.0124)

7

0.77

0.74

(0.0302)

8

0.74

0.76

0.0186

9

0.9

0.79

(0.1057)

10

0.82

0.79

(0.0264)

11

0.75

0.80

0.0484

12

0.77

0.83

0.0573

13

0.78

0.80

0.0222

14

0.84

0.80

(0.0408)

15

0.79

0.75

(0.0356)

16

0.7

0.73

0.0340

17

0.68

0.70

0.0249

18

0.72

0.69

(0.0270)

19

0.55

0.64

0.0851

20

0.63

0.61

(0.0173)

21

0.56

0.57

0.0101

22

0.41

0.48

0.0696

23

0.51

0.44

(0.0725)

24

0.47

0.40

(0.0746)

25

0.32

0.38

0.0574

The next thing we need to do is subtract the previous period’s error from the current period’s error. After that, we square our result. Note that we will only have 24 observations (we can’t subtract anything from the first observation):

Year

Error

Difference in Errors

Squared Difference in Errors

1

(0.07347)

  

  

2

0.00334

0.07681

0.00590

3

0.03910

0.03576

0.00128

4

0.06218

0.02308

0.00053

5

(0.01624)

(0.07842)

0.00615

6

(0.01242)

0.00382

0.00001

7

(0.03024)

(0.01781)

0.00032

8

0.01860

0.04883

0.00238

9

(0.10569)

(0.12429)

0.01545

10

(0.02644)

0.07925

0.00628

11

0.04843

0.07487

0.00561

12

0.05728

0.00884

0.00008

13

0.02217

(0.03511)

0.00123

14

(0.04075)

(0.06292)

0.00396

15

(0.03557)

0.00519

0.00003

16

0.03397

0.06954

0.00484

17

0.02489

(0.00909)

0.00008

18

(0.02697)

(0.05185)

0.00269

19

0.08509

0.11206

0.01256

20

(0.01728)

(0.10237)

0.01048

21

0.01012

0.02740

0.00075

22

0.06964

0.05952

0.00354

23

(0.07252)

(0.14216)

0.02021

24

(0.07460)

(0.00208)

0.00000

25

0.05738

0.13198

0.01742

 

If we sum up the last column, we will get .1218, if we then divide that by our ESS of 0.0625, we get a value of 1.95. What does this mean?

We have just computed what is known as the Durbin-Watson Statistic, which is used to detect the presence of autocorrelation. The Durbin-Watson statistic, d, can be anywhere from zero to 4. Generally, when d is close to zero, it suggests the presence of positive autocorrelation; a value close to 2 indicates no autocorrelation; while a value close to 4 indicates negative autocorrelation. In any case, you want your Durbin-Watson statistic to be as close to two as possible, and ours is.

Hence, our model seems to be free of autocorrelation.

Now, Let’s Go Forecast!

Now that we have validated our model, and saw that it was free of autocorrelation, we can be comfortable forecasting. Let’s say that for years 26 and 27, we have the following forecasts for net revenues per deposit dollar, X1t and number of S&L offices, X2t. They are as follows:

X1,26 = 4.70 and X2,26 = 9,350

X1,27 = 4.80 and X2,27 = 9,400

Plugging each of these into our equations, we generate the following forecasts:

Ŷ26 = 1.56450 + 0.23720 * 4.70 – 0.000249 * 9,350

=0.3504

Ŷ27 = 1.56450 + 0.23720 * 4.80 – 0.000249 * 9,400

=0.3617

Next Week’s Forecast Friday Topic: The Effect of Omitting an Important Variable

Now that we’ve walked you through this process, you know how to forecast and run multiple regression. Next week, we will discuss what happens when a key independent variable is omitted from a regression model and all the problems it causes when we violate the regression assumption that “all relevant and no irrelevant independent variables are included in the model.” Next week’s post will show a complete demonstration of such an impact. Stay tuned!


Follow

Get every new post delivered to your Inbox.

Join 86 other followers