
Forecast Friday Topic: Correcting Autocorrelation

August 5, 2010

(Sixteenth in a series)

Last week, we discussed how to detect autocorrelation – the violation of the regression assumption that the error terms are not correlated with one another – in your forecasting model. Models exhibiting autocorrelation have parameter estimates that are inefficient, and R2s and t-ratios that seem overly inflated. As a result, your model generates forecasts that are too good to be true and has a tendency to miss turning points in your time series. In last week’s Forecast Friday post, we showed you how to diagnose autocorrelation: examining the model’s parameter estimates, visually inspecting the data, and computing the Durbin-Watson statistic. Today, we’re going to discuss how to correct it.

Revisiting our Data Set

Recall our data set: average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987, as reported in the Survey of Current Business (September issues from 1986 and 1987), and reprinted in Data Analysis Using Microsoft® Excel, by Michael R. Middleton, page 219:

Month     t     Wage
Jan-86    1     5.82
Feb-86    2     5.79
Mar-86    3     5.80
Apr-86    4     5.81
May-86    5     5.78
Jun-86    6     5.79
Jul-86    7     5.79
Aug-86    8     5.83
Sep-86    9     5.91
Oct-86    10    5.87
Nov-86    11    5.87
Dec-86    12    5.90
Jan-87    13    5.94
Feb-87    14    5.93
Mar-87    15    5.93
Apr-87    16    5.94
May-87    17    5.89
Jun-87    18    5.91

We generated the following regression model:

Ŷ = 5.7709 + 0.0095t

Our model had an R2 of .728, and t-ratios of about 368 for the intercept term and 6.55 for the parameter estimate, t. The Durbin-Watson statistic was 1.05, indicating positive autocorrelation. How do we correct for autocorrelation?
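
If you want to reproduce these figures outside of Excel, here is a minimal sketch in Python using pandas and statsmodels (a tooling choice of ours, not something the original post prescribes): it fits the trend model and computes the Durbin-Watson statistic.

    # Sketch: fit Wage on the time index t and compute the Durbin-Watson statistic.
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
             5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
    df = pd.DataFrame({"t": range(1, 19), "Wage": wages})

    X = sm.add_constant(df["t"])            # adds the intercept column
    model = sm.OLS(df["Wage"], X).fit()

    print(model.params)                     # intercept ~ 5.77, slope ~ 0.0095
    print(model.rsquared)                   # ~ 0.73
    print(durbin_watson(model.resid))       # ~ 1.05: positive autocorrelation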

Lagging the Dependent Variable

One of the most common remedies for autocorrelation is to lag the dependent variable one or more periods and then make the lagged dependent variable the independent variable. So, in our data set above, you would take the first value of the dependent variable, $5.82, and make it the independent variable for period 2, with $5.79 being the dependent variable; in like manner, $5.79 will also become the independent variable for the next period, whose dependent variable has a value of $5.80, and so on. Since the error terms from one period to another exhibit correlation, by using the previous value of the dependent variable to predict the next one, you reduce that correlation of errors.

You can lag by as many periods as you need to; however, note that you lose the first observation when you lag one period (unless you know the value for the period before the start of the data set, you have nothing with which to predict the first observation). You'll lose two observations if you lag two periods, and so on. If you have a very small data set, the resulting loss of degrees of freedom can lead to Type II error – failing to identify a parameter estimate as significant when, in fact, it is. So you must be careful here.

In this case, by lagging our data by one period, we have the following data set:

Month     Wage      Lag1 Wage
Feb-86    $5.79     $5.82
Mar-86    $5.80     $5.79
Apr-86    $5.81     $5.80
May-86    $5.78     $5.81
Jun-86    $5.79     $5.78
Jul-86    $5.79     $5.79
Aug-86    $5.83     $5.79
Sep-86    $5.91     $5.83
Oct-86    $5.87     $5.91
Nov-86    $5.87     $5.87
Dec-86    $5.90     $5.87
Jan-87    $5.94     $5.90
Feb-87    $5.93     $5.94
Mar-87    $5.93     $5.93
Apr-87    $5.94     $5.93
May-87    $5.89     $5.94
Jun-87    $5.91     $5.89

So, we have created a new independent variable, Lag1_Wage. Notice that we are not including the time period, t, as an independent variable this time. That doesn't mean we should or shouldn't; in this case, we're only trying to demonstrate the effect of the lagging.
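
For readers working in Python rather than Excel, the lagged variable takes one line with pandas' shift(); this is a sketch of the transformation only, not the original post's workflow.

    import pandas as pd

    wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
             5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
    df = pd.DataFrame({"Wage": wages})

    df["Lag1_Wage"] = df["Wage"].shift(1)   # previous month's wage
    df = df.dropna()                        # the first row has no lag, so it is lost
    print(df)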

Rerunning the Regression

Now we do our regression analysis. We come up with the following equation:

Ŷ = 0.8253 + 0.8600*Lag1_Wage

According to this model, each $1 change in the previous month's hourly wage is associated with an average $0.86 change in the current month's hourly wage. The R2 for this model is virtually unchanged, 0.730. However, the Durbin-Watson statistic is now 2.01 – the autocorrelation has been all but eradicated. Unfortunately, the intercept has a t-ratio of 1.04, indicating it is not significant. The t-ratio for Lag1_Wage is about 6.37, not much different from the t-ratio for t in our previous model. Still, we did get rid of the autocorrelation.
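
The figures above are easy to check with the same tooling as before; again, this is a sketch under our own choice of libraries, not the post's original Excel steps.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
             5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
    df = pd.DataFrame({"Wage": wages})
    df["Lag1_Wage"] = df["Wage"].shift(1)
    df = df.dropna()

    X = sm.add_constant(df["Lag1_Wage"])
    model = sm.OLS(df["Wage"], X).fit()

    print(model.params)                     # intercept ~ 0.83, Lag1_Wage ~ 0.86
    print(model.tvalues)                    # intercept t ~ 1.0, Lag1_Wage t ~ 6.4
    print(durbin_watson(model.resid))       # ~ 2.0: autocorrelation essentially gone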

The statistically insignificant intercept term is likely a case of the Type II error described above: lagging costs us a degree of freedom in an already small sample. Perhaps with several more months of data, we would have obtained a significant intercept estimate.

Other Approaches to Correcting Autocorrelation

There are other approaches to correcting autocorrelation. One important approach is to identify relevant independent variables that have been omitted from the model. Perhaps if we had month-to-month data on the average years of work experience of the textile and apparel labor force, that variable might have increased our R2 and reduced the correlation in the error terms. Another option is to difference the data. Differencing works like lagging, except that we subtract the first observation's values of the dependent and independent variables from the second observation's values, then subtract the second observation's original values from the third's, and so on. We then run a regression on the differenced observations. The drawbacks are that, again, your data set is reduced by one observation, and the transformed model will not have an intercept term, which can cause issues in some studies.
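
Here is a rough sketch of differencing in the same Python setup (our assumption, not the post's). Because the differenced time index is just a column of ones, the no-intercept regression of the differenced wage on the differenced time index simply estimates the average month-to-month change in wages.

    import pandas as pd
    import statsmodels.api as sm

    wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
             5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
    df = pd.DataFrame({"t": range(1, 19), "Wage": wages})

    diffs = df.diff().dropna()              # first differences; one observation is lost

    # No intercept term: the slope on the differenced time index (all ones)
    # is just the average month-to-month change in wages.
    model = sm.OLS(diffs["Wage"], diffs["t"]).fit()
    print(model.params)                     # roughly 0.005 per month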

Other approaches to correcting autocorrelation include quasi-differencing, the Cochrane-Orcutt Procedure, the Hildreth-Lu Procedure, and the Durbin Two-Step Method. These methods are iterative, require a lot of tedious effort, and are beyond the scope of our post. But many college-level forecasting textbooks have sections on these procedures if you're interested in further reading on them.

Next Forecast Friday Topic: Detecting Heteroscedasticity

Next week, we’ll discuss the last of the regression violations, heteroscedasticity, which is the violation of the assumption that error terms have a constant variance. We will discuss why heteroscedasticity exists and how to diagnose it. The week after that, we’ll discuss remedying heteroscedasticity. Once we have completed our discussions on the regression violations, we will spend a couple of weeks discussing regression modeling techniques like transforming independent variables, using categorical variables, adjusting for seasonality, and other regression techniques. These topics will be far less theoretical and more practical in terms of forecasting.


Forecast Friday Topic: Prelude to Multiple Regression Analysis – Regression Assumptions

June 10, 2010

(Eighth in a series)

In last week’s Forecast Friday post, we continued our discussion of simple linear regression analysis, discussing how to check both the slope and intercept coefficients for significance. We then discussed how to create a prediction interval for our forecasts. I had intended this week’s Forecast Friday post to delve straight into multiple regression analysis, but have decided instead to spend some time talking about the assumptions that go into building a regression model.  These assumptions apply to both simple and multiple regression analysis, but their importance is especially noticeable with multiple regression, and I feel it is best to make you aware of them, so that when we discuss multiple regression both as a time series and as a causal/econometric forecasting tool, you’ll know how to detect and correct regression models that violate these assumptions. We will formally begin our discussion of multiple regression methods next week.

Five Key Assumptions for Ordinary Least Squares (OLS) Regression

When we develop our parameter estimates for our regression model, we want to make sure that all of our estimators have the smallest variance. Recall that when you were computing the value of your estimate, b, for the parameter β, you used the equation below:

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

You were subtracting the independent variable's mean from each of its actual values, and doing likewise for the dependent variable. You then multiplied those two deviations together for each observation and summed them up to get the numerator. To get the denominator, you again subtracted the independent variable's mean from each of its actual values, squared those deviations, and summed them. The value you get for your estimate of β is the one that minimizes the sum of squared errors for your model; hence the term least squares. Incidentally, if you took the denominator of the equation above and divided it by your sample size less one (n − 1), you would get the sample variance of your independent variable, X. What we really want to be small, though, is the variance of the estimator b itself: when your parameter estimates have the smallest possible variance, they are efficient, and you can make stronger statistical statements about them.
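
As a quick illustration of that calculation (with made-up numbers, not data from any of the earlier posts), here is the least-squares slope computed directly from the deviations described above:

    import numpy as np

    # Hypothetical data purely for illustration.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # b = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar) ** 2)
    numerator = np.sum((x - x.mean()) * (y - y.mean()))
    denominator = np.sum((x - x.mean()) ** 2)
    b = numerator / denominator
    a = y.mean() - b * x.mean()             # intercept follows from the sample means

    print(b, a)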

We also want to be sure that our estimators are free of bias; that is, we want our sample estimate, b, to be, on average, equal to the true population parameter, β. If we calculated several estimates of β, the average of our b's should equal β.

Essentially, there are five assumptions that must be made to ensure our estimators are unbiased and efficient:

Assumption #1: The regression equation correctly specifies the true model.

In order to correctly specify the true model, the relationship between the dependent and independent variable must be linear. Also, we must neither exclude relevant independent variables from nor include irrelevant independent variables in our regression equation. If any of these conditions are not met – that is, Assumption #1 is violated – then our parameter estimates will exhibit bias, particularly specification bias.

In addition, our independent and dependent variables must be measured accurately. For example, if we are trying to estimate salary based on years of schooling, we want to make sure our model is measuring years of schooling as actual years of schooling, and not desired years of schooling.

Assumption #2: The independent variables are fixed numbers and not correlated with error terms.

I warned you at the start of our discussion of linear regression that the error terms were going to be important. Let's start with the notion of fixed numbers. When you are running a regression analysis, the values of each independent variable should not change every time you test the equation; that is, the values of your independent variables are known and controlled by you. In addition, the independent variables should not be correlated with the error term. If an independent variable is correlated with the error term, then it is very possible that a relevant independent variable was excluded from the equation. If Assumption #2 is violated, your parameter estimates will be biased.

Assumption #3: The error terms, ε, have a mean, or expected value, of zero.

As you saw in a past blog post, when we developed our regression equation for Sue Stone's monthly sales, we went back and plugged each observation's independent variable into our model to generate an estimate of sales for that month. We then subtracted the estimated sales from the actual sales. Some of our estimates were higher than the actual values, some were lower. Summed up, all of these errors should equal zero. If they don't, they will result in a biased estimate of the intercept, a (which we use to estimate α). This assumption is not of serious concern, however, since the intercept is often of secondary importance to the slope estimate. We also assume that the error terms are normally distributed.
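
If you want to see this property for yourself, here is a small sketch (with hypothetical numbers, not Sue Stone's sales data) showing that OLS residuals sum to essentially zero whenever an intercept is included:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data for illustration only.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.0, 4.8, 7.1, 9.0, 10.9, 13.2])

    model = sm.OLS(y, sm.add_constant(x)).fit()

    print(model.resid.sum())                # ~0, up to floating-point error
    print(model.resid.mean())               # likewise ~0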

Assumption #4: The error terms have a constant variance.

The variance of the error term for all values of Xi should be constant, that is, the error terms should be homoscedastic. Visually, if you were to plot the line generated by your regression equation, and then plot the error terms for each observation as points above or below the regression line, the points should cluster around the line in a band of equal width above and below the regression line. If, instead, the points began to move further and further away from the regression line as the value of X increased, then the error terms are heteroscedastic, and the constant variance assumption is violated. Heteroscedasticity does not bias parameter estimates, but makes them inefficient, or untrustworthy.
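
The visual check described above can be sketched like this (the simulated data and the use of matplotlib are our own assumptions): the residuals fan out as X grows, which is the heteroscedastic pattern to watch for.

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # Simulated data whose error spread grows with X, for illustration only.
    rng = np.random.default_rng(42)
    x = np.linspace(1, 50, 100)
    y = 2.0 + 0.5 * x + rng.normal(scale=0.1 * x)   # noise widens as x grows

    model = sm.OLS(y, sm.add_constant(x)).fit()

    # A band of residuals that widens as X increases suggests heteroscedasticity.
    plt.scatter(x, model.resid)
    plt.axhline(0, color="gray")
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()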

Why does heteroscedasticity occur? Sometimes, a data set has some observations whose values for the independent variable are vastly different from those of the other observations. These cases are known as outliers. For example, if you have five observations, and their X values were as follows:

{5, 6, 6, 7, 20}

The fifth observation would be the outlier, since its X value of 20 is so different from those of the four other observations. Regression equations place excessive weight on extreme values. Let’s assume that you were trying to construct a model to predict new car purchases based on income. You choose “household income” as your independent variable and “new car spending” as your dependent variable. You survey 10 people who bought a new car, recording both their income and the amount they paid for the car. You sort the respondents by income and look at their spending, as depicted in the table below:

Respondent    Annual Income    New Car Purchase Price
1             $30,000          $25,900
2             $32,500          $27,500
3             $35,000          $26,000
4             $37,500          $29,000
5             $40,000          $32,000
6             $42,500          $30,500
7             $45,000          $34,000
8             $47,500          $26,500
9             $50,000          $38,000
10            $52,500          $40,000

Do you notice that as income increases, the new car purchase price tends to move upward? For the most part, it does. But does it go up consistently? No. Notice how respondent #3 spent less on a car than the two respondents with lower incomes, and respondent #8 spent much less than lower-income respondents 4-7. Respondent #8 is an outlier. This happens because lower-income households are limited in their options for new cars, while higher-income households have more options. A low-income respondent may be limited to buying a Ford Focus or a Honda Civic; a higher-income respondent may be able to buy a Lexus or a BMW, yet still choose to buy the Civic or the Focus. Heteroscedasticity is very likely to occur with this data set. In case you haven’t guessed, heteroscedasticity is more likely to occur with cross-sectional data than with time series data.
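
A quick sketch of the regression on the ten respondents above (again in Python, our choice of tool) makes the point: respondent #8's residual dwarfs the others, and the residuals at the high-income end are noticeably more spread out than at the low end.

    import pandas as pd
    import statsmodels.api as sm

    income = [30000, 32500, 35000, 37500, 40000,
              42500, 45000, 47500, 50000, 52500]
    price = [25900, 27500, 26000, 29000, 32000,
             30500, 34000, 26500, 38000, 40000]

    df = pd.DataFrame({"income": income, "price": price})
    model = sm.OLS(df["price"], sm.add_constant(df["income"])).fit()

    # Residuals by respondent: #8 stands out, and the spread widens with income.
    print(pd.DataFrame({"income": df["income"], "residual": model.resid.round(0)}))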

Assumption #5: The error terms are not correlated with each other.

Knowing the error term for any of our observations should not allow us to predict the error term of any other observation; the errors must be truly random. If they aren’t, autocorrelation results and the parameter estimates are inefficient, though unbiased. Autocorrelation is much more common with time series data than with cross-sectional data, and occurs because past occurrences can influence future ones. A good example of this is when I was building a regression model to help a college forecast enrollment. I started by building a simple time series regression model, then examined the errors and detected autocorrelation. How did it happen? Because most students who are enrolled in the Fall term are also likely to be enrolled again in the consecutive Spring term. Hence, I needed to correct for that autocorrelation. Similarly, while a company’s advertising expenditures in April may impact its sales in April, they are also likely to have some impact on its sales in May. This too can cause autocorrelation.

When these assumptions hold, your regression equation is likely to produce parameter estimates that are the “best linear unbiased estimators,” or BLUE. Keep these assumptions in mind as we go through our upcoming discussions of multiple regression.

Next Forecast Friday Topic: Regression with Two or More Independent Variables

Next week, we will plunge into our discussion of multiple regression. I will give you an example of how multiple variables are used to forecast a single dependent variable, and how to check for validity. As we go through the next couple of discussions, I will show you how to analyze the error terms to find violations of the regression assumptions. I will also show you how to determine the validity of the model, and to identify whether all independent variables within your model are relevant.