(Sixteenth in a series)
Last week, we discussed how to detect autocorrelation – the violation of the regression assumption that the error terms are not correlated with one another – in your forecasting model. Models exhibiting autocorrelation have parameter estimates that are inefficient, and R²s and t-ratios that appear inflated. As a result, your model generates forecasts that are too good to be true and has a tendency to miss turning points in your time series. In last week's Forecast Friday post, we showed you how to diagnose autocorrelation: examining the model's parameter estimates, visually inspecting the data, and computing the Durbin-Watson statistic. Today, we're going to discuss how to correct it.
Revisiting our Data Set
Recall our data set: average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987, as reported in the Survey of Current Business (September issues from 1986 and 1987), and reprinted in Data Analysis Using Microsoft® Excel, by Michael R. Middleton, page 219:
| Month | t | Wage |
|--------|----|------|
| Jan-86 | 1 | 5.82 |
| Feb-86 | 2 | 5.79 |
| Mar-86 | 3 | 5.80 |
| Apr-86 | 4 | 5.81 |
| May-86 | 5 | 5.78 |
| Jun-86 | 6 | 5.79 |
| Jul-86 | 7 | 5.79 |
| Aug-86 | 8 | 5.83 |
| Sep-86 | 9 | 5.91 |
| Oct-86 | 10 | 5.87 |
| Nov-86 | 11 | 5.87 |
| Dec-86 | 12 | 5.90 |
| Jan-87 | 13 | 5.94 |
| Feb-87 | 14 | 5.93 |
| Mar-87 | 15 | 5.93 |
| Apr-87 | 16 | 5.94 |
| May-87 | 17 | 5.89 |
| Jun-87 | 18 | 5.91 |
We generated the following regression model:
Ŷ = 5.7709 + 0.0095t
Our model had an R² of .728, and t-ratios of about 368 for the intercept term and 6.55 for the time variable, t. The Durbin-Watson statistic was 1.05, indicating positive autocorrelation. How do we correct for autocorrelation?
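The original post worked these numbers out in Excel, but the same trend regression and Durbin-Watson statistic can be reproduced with a short script. The following is an illustrative sketch in Python (not from the original post), using the standard single-predictor OLS and Durbin-Watson formulas:

```python
# Reproduce the trend regression and Durbin-Watson statistic for the wage data.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
t = list(range(1, len(wages) + 1))  # t = 1, 2, ..., 18

# Ordinary least squares for one predictor: b = Sxy / Sxx, a = ybar - b * xbar.
n = len(wages)
xbar = sum(t) / n
ybar = sum(wages) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(t, wages))
sxx = sum((x - xbar) ** 2 for x in t)
b = sxy / sxx        # slope: about 0.0095
a = ybar - b * xbar  # intercept: about 5.7709

# Durbin-Watson: sum of squared successive residual differences
# divided by the sum of squared residuals.
resid = [y - (a + b * x) for x, y in zip(t, wages)]
dw = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / \
     sum(e ** 2 for e in resid)
```

A Durbin-Watson value well below 2 (here, about 1.05) signals positive autocorrelation.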
Lagging the Dependent Variable
One of the most common remedies for autocorrelation is to lag the dependent variable one or more periods and then make the lagged dependent variable the independent variable. So, in our data set above, you would take the first value of the dependent variable, $5.82, and make it the independent variable for period 2, with $5.79 being the dependent variable; in like manner, $5.79 will also become the independent variable for the next period, whose dependent variable has a value of $5.80, and so on. Since the error terms from one period to another exhibit correlation, by using the previous value of the dependent variable to predict the next one, you reduce that correlation of errors.
You can lag for as many periods as you need to; however, note that you lose the first observation when you lag one period (unless you know the value for the period before the start of the data set, you have nothing with which to predict the first observation). You'll lose two observations if you lag two periods, and so on. If you have a very small data set, the loss of degrees of freedom can lead to Type II error – failing to identify a parameter estimate as significant when in fact it is. So you must be careful here.
In this case, by lagging our data by one period, we have the following data set:
| Month | Wage | Lag1 Wage |
|--------|-------|-----------|
| Feb-86 | $5.79 | $5.82 |
| Mar-86 | $5.80 | $5.79 |
| Apr-86 | $5.81 | $5.80 |
| May-86 | $5.78 | $5.81 |
| Jun-86 | $5.79 | $5.78 |
| Jul-86 | $5.79 | $5.79 |
| Aug-86 | $5.83 | $5.79 |
| Sep-86 | $5.91 | $5.83 |
| Oct-86 | $5.87 | $5.91 |
| Nov-86 | $5.87 | $5.87 |
| Dec-86 | $5.90 | $5.87 |
| Jan-87 | $5.94 | $5.90 |
| Feb-87 | $5.93 | $5.94 |
| Mar-87 | $5.93 | $5.93 |
| Apr-87 | $5.94 | $5.93 |
| May-87 | $5.89 | $5.94 |
| Jun-87 | $5.91 | $5.89 |
So, we have created a new independent variable, Lag1_Wage. Notice that we are not including the time period t as an independent variable in this regression. That doesn't mean we shouldn't; in this case, we're only trying to demonstrate the effect of lagging.
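Constructing the lagged variable amounts to pairing each month's wage with the previous month's. A minimal sketch in Python (illustrative, not from the original post):

```python
# Build the lagged data set: each month's wage paired with the prior month's wage.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]

lag1_wage = wages[:-1]  # independent variable: wage in the previous month
wage = wages[1:]        # dependent variable: wage in the current month

# The first observation (Jan-86) is lost: it has no prior month to predict it with.
```

Note that `wage` and `lag1_wage` each have 17 elements, one fewer than the original 18 observations, which is the loss of a degree of freedom discussed above.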
Rerunning the Regression
Now we do our regression analysis. We come up with the following equation:
Ŷ = 0.8253 + 0.8600*Lag1_Wage
According to this model, each $1 change in hourly wage from the previous month is associated with an average $0.86 change in hourly wages for the current month. The R² for this model was virtually unchanged, 0.730. However, the Durbin-Watson statistic is now 2.01 – the autocorrelation has essentially been eliminated. Unfortunately, the intercept has a t-ratio of 1.04, indicating it is not significant. The t-ratio for Lag1_Wage is about 6.37, not much different from the t-ratio for t in our previous model. However, we did get rid of the autocorrelation.
The statistically insignificant intercept is likely a consequence of the small sample: lagging cost us an observation, and the reduced degrees of freedom raise the risk of Type II error. Perhaps if we had several more months of data, we might have had a significant intercept estimate.
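Rerunning the regression on the lagged data set uses the same OLS arithmetic as before, now with Lag1_Wage as the predictor. Again an illustrative sketch, not the post's original Excel workflow:

```python
# Regress current wage on the prior month's wage and recompute Durbin-Watson.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
x = wages[:-1]  # Lag1_Wage
y = wages[1:]   # current wage

n = len(y)
xbar = sum(x) / n
ybar = sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)  # slope: about 0.86
a = ybar - b * xbar                     # intercept: about 0.83

# Durbin-Watson on the new model's residuals.
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
dw = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / \
     sum(e ** 2 for e in resid)
```

A Durbin-Watson statistic near 2 indicates little remaining autocorrelation, consistent with the 2.01 reported above.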
Other Approaches to Correcting Autocorrelation
There are other approaches to correcting autocorrelation. One important approach is to identify relevant independent variables that have been omitted from the model. Perhaps if we had data on the average years of work experience of the textile and apparel labor force from month to month, that might have increased our R² and reduced the correlation in the error terms. Another option is to difference the data. Differencing works like lagging, except that we subtract each observation's values of the dependent and independent variables from the next observation's values: the first observation's values are subtracted from the second's, the second's from the third's, and so on. We then run a regression on these differences. The drawback, again, is that the data set is reduced by one observation, and the transformed model will not have an intercept term, which can cause issues in some studies.
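The differencing transformation described above can be sketched as follows (illustrative Python, not from the original post):

```python
# First differences of the wage series: month-over-month change in wage.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]

# Subtract each observation from the next; one observation is lost.
d_wage = [round(wages[i] - wages[i - 1], 2) for i in range(1, len(wages))]

# The independent variable t differences to a constant (t - (t-1) = 1),
# which is why the transformed model has no intercept term of its own.
```

The regression would then be run on the differenced series rather than the levels.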
Other approaches to correcting autocorrelation include quasi-differencing, the Cochrane-Orcutt Procedure, the Hildreth-Lu Procedure, and the Durbin Two-Step Method. These methods are iterative, require a good deal of tedious effort, and are beyond the scope of this post. But many college-level forecasting textbooks have sections on these procedures if you're interested in further reading on them.
Next Forecast Friday Topic: Detecting Heteroscedasticity
Next week, we’ll discuss the last of the regression violations, heteroscedasticity, which is the violation of the assumption that error terms have a constant variance. We will discuss why heteroscedasticity exists and how to diagnose it. The week after that, we’ll discuss remedying heteroscedasticity. Once we have completed our discussions on the regression violations, we will spend a couple of weeks discussing regression modeling techniques like transforming independent variables, using categorical variables, adjusting for seasonality, and other regression techniques. These topics will be far less theoretical and more practical in terms of forecasting.