Posts Tagged ‘simple regression’

Forecast Friday Topic: Selecting the Variables for a Regression

October 14, 2010

(Twenty-fifth in a series)

When it comes to building a regression model, many companies face both good news and bad news. The good news: there are plenty of independent variables from which to choose. The bad news: there are plenty of independent variables from which to choose! While it may be possible to run a regression with all possible independent variables, each variable you include reduces your degrees of freedom and increases the risk that the model will overfit the data on which it is built, resulting in less reliable forecasts when new data is introduced.

So how do you come up with your short list of independent variables?

Some analysts have tried plotting the dependent variable (Y) against each individual independent variable (Xi) and selecting those that show a noticeable relationship. Another common method is to produce a correlation matrix of all the independent variables and, if a large correlation between two of them is discovered, drop one from consideration (so as to avoid multicollinearity). Still another approach is to run a multiple linear regression on all possible explanatory variables and then drop those whose t-values are insignificant. These approaches are often chosen because they are quick and simple, but they are not reliable ways to come up with a decent regression model.

Stepwise Regression

Other approaches are a bit more complex, but more reliable. Perhaps the most common of these is stepwise regression. Stepwise regression works by first identifying the independent variable with the highest correlation with the dependent variable. Once that variable is identified, a one-variable regression model is run, and the residuals of that model are obtained. Recall from previous Forecast Friday posts that if an important variable is omitted from a regression model, its effect on the dependent variable gets factored into the residuals. Hence, the next step in a stepwise regression is to identify the unselected independent variable with the highest correlation with those residuals. Now you have your second independent variable, and you run a two-variable regression model. You then look at the residuals of that model, select the independent variable with the highest correlation with them, and so forth. Repeat the process until no more variables can be added to the model.

Many statistical analysis packages perform stepwise regression seamlessly. Keep in mind, however, that stepwise regression is not guaranteed to produce the optimal set of variables for your model.
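If you'd like to see the mechanics, below is a minimal Python sketch of the forward-selection logic just described. It is an illustration, not a production routine: the function name forward_stepwise, the min_corr cutoff, and the array names X and y are placeholders of my own, and real statistical packages typically use formal entry and exit tests (such as partial F-tests) rather than a simple correlation cutoff.

```python
# A minimal sketch of the stepwise (forward-selection) procedure described above.
# Hypothetical names: X is an n-by-p array of candidate variables, y is the target.
import numpy as np

def forward_stepwise(X, y, min_corr=0.1):
    """Return the indices of the selected columns, in the order they were added."""
    n, p = X.shape
    selected = []
    residuals = y - y.mean()                        # start from a mean-only model
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        # correlation of each unselected variable with the current residuals
        corrs = {j: abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_corr:                  # stop when nothing correlates meaningfully
            break
        selected.append(best)
        # refit the regression on all selected variables (with an intercept)
        design = np.column_stack([np.ones(n), X[:, selected]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coefs              # these residuals drive the next pick
    return selected
```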

Other Approaches

Other approaches to variable selection include best subsets regression, which involves taking various subsets of the available independent variables and running models with them, choosing the subset with the best R2. Many statistical software packages have the capability of helping determine the various subsets to choose from. Principal components analysis of all the variables is another approach, but it is beyond the scope of this discussion.

Despite systematic techniques like stepwise regression, variable selection in regression models is as much an art as a science. Whatever variables you select for your model should have a valid rationale for being there.

Next Forecast Friday Topic: I haven’t decided yet!

Let me surprise you. In the meantime, have a great weekend and be well!


Forecast Friday Topic: Forecasting with Seasonally-Adjusted Data

September 16, 2010

(Twenty-first in a series)

Last week, we introduced you to fictitious businesswoman Billie Burton, who puts together handmade gift baskets and care packages for customers. Billie is interested in forecasting gift basket orders so that she can get a better idea of how much time she will need to set aside to assemble the packages; how many supplies to have on hand; how much revenue – and cost – she can expect; and whether she will need assistance. Gift-giving is seasonal, and Billie's business is no exception. Christmas is Billie's busiest season, and a few other months are much busier than the rest, so she must adjust for these seasonal factors before doing any forecasting.

Why is it important to adjust data for seasonal factors? Imagine trying to do regression analysis on monthly retail data that hasn’t been adjusted. If sales during the holiday season are much greater than at all other times of the year, there will be significant forecast errors in the model because the holiday period’s sales will be outliers. And regression analysis places greater weight on extreme values when trying to determine the least-squares equation.

Billie’s Orders, Revisited

Recall from last week that Billie has five years of monthly gift basket orders, from January 2005 to December 2009. The orders are shown again in the table below:

TOTAL GIFT BASKET ORDERS

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|----|----|----|----|----|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |

Billie would like to forecast gift basket orders for the first four months of 2010, particularly February and April, for Valentine’s Day and Easter, two other busier-than-usual periods. Billie must first adjust her data.

Seasonal Adjustment

When we decomposed the time series, we computed the seasonal adjustment factors for each month. They were as follows:

| Month | Factor |
|-----------|------|
| January | 0.78 |
| February | 1.53 |
| March | 0.89 |
| April | 1.13 |
| May | 0.65 |
| June | 0.67 |
| July | 0.55 |
| August | 1.00 |
| September | 0.91 |
| October | 0.62 |
| November | 1.53 |
| December | 1.75 |

Knowing these monthly seasonal factors, Billie adjusts her monthly orders by dividing each month's orders by its respective seasonal factor (e.g., each January's orders are divided by 0.78, each February's by 1.53, and so on). Billie's seasonally-adjusted data looks like this:

| Month | Orders | Adjustment Factor | Seasonally Adjusted Orders | Time Period |
|--------|----|------|-------|----|
| Jan-05 | 15 | 0.78 | 19.28 | 1 |
| Feb-05 | 30 | 1.53 | 19.61 | 2 |
| Mar-05 | 25 | 0.89 | 28.15 | 3 |
| Apr-05 | 15 | 1.13 | 13.30 | 4 |
| May-05 | 13 | 0.65 | 19.93 | 5 |
| Jun-05 | 14 | 0.67 | 21.00 | 6 |
| Jul-05 | 12 | 0.55 | 21.81 | 7 |
| Aug-05 | 22 | 1.00 | 22.10 | 8 |
| Sep-05 | 20 | 0.91 | 21.93 | 9 |
| Oct-05 | 14 | 0.62 | 22.40 | 10 |
| Nov-05 | 35 | 1.53 | 22.89 | 11 |
| Dec-05 | 40 | 1.75 | 22.92 | 12 |
| Jan-06 | 18 | 0.78 | 23.13 | 13 |
| Feb-06 | 36 | 1.53 | 23.53 | 14 |
| Mar-06 | 18 | 0.89 | 20.27 | 15 |
| Apr-06 | 30 | 1.13 | 26.61 | 16 |
| May-06 | 16 | 0.65 | 24.53 | 17 |
| Jun-06 | 17 | 0.67 | 25.49 | 18 |
| Jul-06 | 14 | 0.55 | 25.44 | 19 |
| Aug-06 | 26 | 1.00 | 26.12 | 20 |
| Sep-06 | 24 | 0.91 | 26.32 | 21 |
| Oct-06 | 17 | 0.62 | 27.20 | 22 |
| Nov-06 | 42 | 1.53 | 27.47 | 23 |
| Dec-06 | 48 | 1.75 | 27.50 | 24 |
| Jan-07 | 22 | 0.78 | 28.27 | 25 |
| Feb-07 | 43 | 1.53 | 28.11 | 26 |
| Mar-07 | 22 | 0.89 | 24.77 | 27 |
| Apr-07 | 36 | 1.13 | 31.93 | 28 |
| May-07 | 19 | 0.65 | 29.13 | 29 |
| Jun-07 | 20 | 0.67 | 29.99 | 30 |
| Jul-07 | 17 | 0.55 | 30.90 | 31 |
| Aug-07 | 31 | 1.00 | 31.14 | 32 |
| Sep-07 | 29 | 0.91 | 31.80 | 33 |
| Oct-07 | 20 | 0.62 | 32.01 | 34 |
| Nov-07 | 50 | 1.53 | 32.70 | 35 |
| Dec-07 | 58 | 1.75 | 33.23 | 36 |
| Jan-08 | 26 | 0.78 | 33.42 | 37 |
| Feb-08 | 52 | 1.53 | 33.99 | 38 |
| Mar-08 | 43 | 0.89 | 48.41 | 39 |
| Apr-08 | 27 | 1.13 | 23.94 | 40 |
| May-08 | 23 | 0.65 | 35.26 | 41 |
| Jun-08 | 24 | 0.67 | 35.99 | 42 |
| Jul-08 | 20 | 0.55 | 36.35 | 43 |
| Aug-08 | 37 | 1.00 | 37.17 | 44 |
| Sep-08 | 35 | 0.91 | 38.38 | 45 |
| Oct-08 | 24 | 0.62 | 38.41 | 46 |
| Nov-08 | 60 | 1.53 | 39.24 | 47 |
| Dec-08 | 70 | 1.75 | 40.11 | 48 |
| Jan-09 | 31 | 0.78 | 39.84 | 49 |
| Feb-09 | 62 | 1.53 | 40.53 | 50 |
| Mar-09 | 32 | 0.89 | 36.03 | 51 |
| Apr-09 | 52 | 1.13 | 46.12 | 52 |
| May-09 | 28 | 0.65 | 42.93 | 53 |
| Jun-09 | 29 | 0.67 | 43.49 | 54 |
| Jul-09 | 24 | 0.55 | 43.62 | 55 |
| Aug-09 | 44 | 1.00 | 44.20 | 56 |
| Sep-09 | 42 | 0.91 | 46.06 | 57 |
| Oct-09 | 29 | 0.62 | 46.41 | 58 |
| Nov-09 | 72 | 1.53 | 47.09 | 59 |
| Dec-09 | 84 | 1.75 | 48.13 | 60 |

Notice the seasonally adjusted gift basket orders in the fourth column: each value is the quotient of the second and third columns. In months where the seasonal adjustment factor is greater than 1, the seasonally adjusted orders are lower than actual orders; in months where the factor is less than 1, the seasonally adjusted orders are greater than actual. This is intended to normalize the data set. (Note: August has a seasonal factor of 1.00, suggesting it is an average month; because of rounding, however, August 2008's actual orders of 37 baskets adjust to 37.17.) The final column is the sequential time period number for each month, from 1 to 60.
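For readers who like to automate this step, here is a small Python sketch of the adjustment Billie just made. The list names orders and factors are my own placeholders, assuming orders holds the 60 monthly order counts (January 2005 through December 2009) and factors holds the twelve monthly adjustment factors from the table above.

```python
# A minimal sketch of the seasonal adjustment step: divide each month's orders
# by that calendar month's seasonal factor. Names are illustrative placeholders.
factors = [0.78, 1.53, 0.89, 1.13, 0.65, 0.67, 0.55, 1.00, 0.91, 0.62, 1.53, 1.75]

def seasonally_adjust(orders, factors):
    # i % 12 maps period 0, 12, 24, ... back to January, and so on
    return [o / factors[i % 12] for i, o in enumerate(orders)]
```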

Perform Regression Analysis

Now Billie runs her regression analysis. She is going to do a simple regression, using the time period, t, in the last column as her independent variable and the seasonally adjusted orders as her dependent variable. Recall that last week, we ran a simple regression on the actual orders to isolate the trend component, and we identified an upward trend; however, because of the strong seasonal factors in the actual orders, the regression model didn't fit the data well. By factoring out these seasonal variations, we should expect a model that better fits the data.

Running her regression of the seasonally adjusted orders, Billie gets the following output:

Ŷ = 0.47t + 17.12

And as we expected, this model fits the data better, with an R2 of 0.872. In other words, on a seasonally adjusted basis, each passing month increases basket orders by about half an order.

Forecasting Orders

Now Billie needs to forecast orders for January through April 2010. January 2010 is period 61, so she plugs that into her regression equation:

Ŷ = 0.47(61) + 17.12

=45.81

Billie plugs in the data for the rest of the months and gets the following:

| Month | Period | Ŷ |
|--------|----|-------|
| Jan-10 | 61 | 45.81 |
| Feb-10 | 62 | 46.28 |
| Mar-10 | 63 | 46.76 |
| Apr-10 | 64 | 47.23 |

Remember, however, that this is seasonally-adjusted data. To get the forecasts for actual orders for each month, Billie now needs to convert them back. Since she divided each month’s orders by its seasonal adjustment factor, she must now multiply each of these months’ forecasts by those same factors. So Billie goes ahead and does that:

| Month | Period | Ŷ | Seasonal Factor | Forecast Orders |
|--------|----|-------|------|-------|
| Jan-10 | 61 | 45.81 | 0.78 | 35.65 |
| Feb-10 | 62 | 46.28 | 1.53 | 70.81 |
| Mar-10 | 63 | 46.76 | 0.89 | 41.53 |
| Apr-10 | 64 | 47.23 | 1.13 | 53.25 |

So Billie forecasts 36 gift basket orders in January, 71 in February, 42 in March, and 53 in April.
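Here is a hedged Python sketch of the whole workflow Billie just walked through: regress the seasonally adjusted orders on the time period, forecast the deseasonalized value for a future period, and then multiply by that month's seasonal factor to get back to actual orders. The function name trend_forecast and the variable names adjusted and factors are illustrative; np.polyfit is simply one convenient way to fit the straight-line trend.

```python
# Sketch of deseasonalize -> regress -> forecast -> reseasonalize, assuming
# `adjusted` is the list of 60 seasonally adjusted order values computed earlier
# and `factors` is the list of twelve monthly seasonal factors.
import numpy as np

def trend_forecast(adjusted, factors, periods):
    t = np.arange(1, len(adjusted) + 1)
    slope, intercept = np.polyfit(t, adjusted, deg=1)   # simple linear trend
    forecasts = {}
    for p in periods:
        deseasonalized = intercept + slope * p           # e.g., 17.12 + 0.47 * 61
        month_factor = factors[(p - 1) % 12]             # factor for that calendar month
        forecasts[p] = deseasonalized * month_factor     # convert back to actual orders
    return forecasts

# usage: trend_forecast(adjusted, factors, periods=[61, 62, 63, 64])
```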

Next Forecast Friday Topic: Qualitative Variables in Regression Modeling

You’ve just learned how to adjust for seasonality when forecasting. One thing you’ve noticed through all of these forecasts we have built is that all variables have been quantitative. Yet sometimes, we need to account for qualitative, or categorical factors in our explanation of events. The next two Forecast Friday posts will discuss a simple approach for introducing qualitative information into modeling: “dummy” variables. Dummy variables can be helpful in determining differences in predictive estimates by region, gender, race, political affiliation, etc. As you will also find, dummy variables can even be used for a faster, more simplified approach to gauging seasonality. You’ll find our discussion on dummy variables highly useful.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Forecast Friday Topic: Multiple Regression Analysis (continued)

June 24, 2010

(Tenth in a series)

Today we resume our discussion of multiple regression analysis. Last week, we built a model to determine the extent of any relationship between U.S. savings & loan associations’ percent profit margin and two independent variables, net revenues per deposit dollar and number of S&L offices. Today, we will compute the 95% confidence interval for each parameter estimate; determine whether the model is valid; check for autocorrelation; and use the model to forecast. Recall that our resulting model was:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

Where Yt is the percent profit margin for the S&L in Year t; X1t is the net revenues per deposit dollar in Year t; and X2t is the number of S&L offices in the U.S. in Year t. Recall that the R2 is .865, indicating that 86.5% of the change in percentage profit margin is explained by changes in net revenues per deposit dollar and number of S&L offices.

Determining the 95% Confidence Interval for the Partial Slope Coefficients

In multiple regression analysis, because there are multiple independent variables, each parameter estimate affects the slope of the regression line jointly with the others; hence the coefficients β1 and β2 are referred to as partial slope estimates. As with simple linear regression, we need to determine the 95% confidence interval for each parameter estimate, so that we can get an idea of where the true population parameter lies. Recall from our June 3 post that we did this by determining the equation for the standard error of the estimate, sε, and then the standard error of the regression slope, sb. That worked well for simple regression, but multiple regression is more complicated: deriving the standard errors of the partial regression coefficients requires linear algebra and is beyond the scope of this post. Fortunately, Excel and most statistical programs compute these values for us, so we will simply state the values of sb1 and sb2 and go from there.

sb1 = 0.05556

sb2 = 0.00003

Also, we need our critical-t value for 22 degrees of freedom, which is 2.074.

Hence, our 95% confidence interval for β1 is denoted as:

0.23720 ± 2.074 × 0.05556

=0.12197 to 0.35243

Hence, we are saying that we can be 95% confident that the true parameter β1 lies somewhere between the values of 0.12197 and 0.35243.

For β2, the procedure is the same:

-0.000249 ± 2.074 × 0.00003

=-0.00032 to -0.00018

Hence, we can be 95% confident that the true parameter β2 lies somewhere between the values of -0.00032 and -0.00018. Also, the confidence interval for the intercept, α, ranges from 1.40 to 1.73.

Note that in none of these cases does the confidence interval contain zero: the intervals for α and β1 are entirely positive, and the interval for β2 is entirely negative. If any parameter's confidence interval crossed zero, that parameter estimate would not be significant.
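If you would rather let software look up the critical t-value, here is a short Python sketch of the confidence-interval arithmetic above, using SciPy. The function name conf_interval is my own; the coefficients, standard errors, and 22 degrees of freedom are the figures quoted in this post.

```python
# A minimal sketch of the 95% confidence intervals for the partial slope coefficients.
from scipy import stats

def conf_interval(coef, std_err, df, level=0.95):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)   # two-tailed critical t (about 2.074 for df=22)
    return coef - t_crit * std_err, coef + t_crit * std_err

print(conf_interval(0.23720, 0.05556, df=22))       # roughly (0.122, 0.352)
print(conf_interval(-0.000249, 0.00003, df=22))     # roughly (-0.00031, -0.00019)
```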

Is Our Model Valid?

The next thing we want to do is determine whether our model is valid. In validating the model, we are trying to show that our independent variables explain the variation in the dependent variable. So we start with a hypothesis test:

H0: β1 = β2 = 0

HA: at least one β ≠ 0

Our null hypothesis says that our independent variables, net revenue per deposit dollar and number of S&L offices, explain none of the variation in an S&L's percentage profit margin, and hence that our model is not valid. Our alternative hypothesis says that at least one of our independent variables explains some of the variation in an S&L's percentage profit margin, and thus that the model is valid.

So how do we do it? Enter the F-test. Like the t-test, the F-test is a means of hypothesis testing. Let's start by calculating the F-statistic for our model, which is given by:

Fcalc = (RSS / k) / (ESS / (n – k – 1))

Remember that RSS is the regression sum of squares and ESS is the error sum of squares; the May 27th Forecast Friday post showed you how to calculate both. For this model, RSS = 0.4015 and ESS = 0.0625; k is the number of independent variables, and n is the sample size. Our equation reduces to:


Fcalc = (0.4015 / 2) / (0.0625 / 22) = 70.66

If our Fcalc is greater than the critical F value for the distribution, then we can reject our null hypothesis and conclude that there is strong evidence that at least one of our independent variables explains some of the variation in an S&L's percentage profit margin. How do we determine our critical F? There is yet another table in any statistics book or on any statistics Web site, called the "F Distribution" table. In it, you look up two sets of degrees of freedom – one for the numerator and one for the denominator of your Fcalc equation. In the numerator, we have two degrees of freedom; in the denominator, 22. In the F Distribution table, the columns represent numerator degrees of freedom and the rows denominator degrees of freedom. When we find column (2), row (22), we end up with a critical F-value of 5.72 (at the 1% significance level).

Our Fcalc is greater than that, so we can conclude that our model is valid.
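Here is a small Python sketch of that F-test, with SciPy's F distribution standing in for the printed table. The variable names are mine; RSS, ESS, k, and n are the values used above, and the critical value is looked up at the 1% level to match the 5.72 figure.

```python
# Sketch of the model validity F-test described above.
from scipy import stats

RSS, ESS = 0.4015, 0.0625          # regression and error sums of squares from the post
k, n = 2, 25                       # number of independent variables and sample size

f_calc = (RSS / k) / (ESS / (n - k - 1))           # about 70.7
f_crit = stats.f.ppf(0.99, dfn=k, dfd=n - k - 1)   # about 5.72 at the 1% level

print(f_calc > f_crit)             # True -> reject H0; the model is valid
```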

Is Our Model Free of Autocorrelation?

Recall from our assumptions that none of our error terms should be correlated with one another. If they are, autocorrelation results, rendering our parameter estimates inefficient. To check for autocorrelation, we need to look at the error terms we get when we compare our predicted percentage profit margin, Ŷ, with the actual, Y:

| Year | Actual Percentage Profit Margin (Yt) | Predicted by Model (Ŷt) | Error |
|----|------|------|----------|
| 1 | 0.75 | 0.68 | (0.0735) |
| 2 | 0.71 | 0.71 | 0.0033 |
| 3 | 0.66 | 0.70 | 0.0391 |
| 4 | 0.61 | 0.67 | 0.0622 |
| 5 | 0.70 | 0.68 | (0.0162) |
| 6 | 0.72 | 0.71 | (0.0124) |
| 7 | 0.77 | 0.74 | (0.0302) |
| 8 | 0.74 | 0.76 | 0.0186 |
| 9 | 0.90 | 0.79 | (0.1057) |
| 10 | 0.82 | 0.79 | (0.0264) |
| 11 | 0.75 | 0.80 | 0.0484 |
| 12 | 0.77 | 0.83 | 0.0573 |
| 13 | 0.78 | 0.80 | 0.0222 |
| 14 | 0.84 | 0.80 | (0.0408) |
| 15 | 0.79 | 0.75 | (0.0356) |
| 16 | 0.70 | 0.73 | 0.0340 |
| 17 | 0.68 | 0.70 | 0.0249 |
| 18 | 0.72 | 0.69 | (0.0270) |
| 19 | 0.55 | 0.64 | 0.0851 |
| 20 | 0.63 | 0.61 | (0.0173) |
| 21 | 0.56 | 0.57 | 0.0101 |
| 22 | 0.41 | 0.48 | 0.0696 |
| 23 | 0.51 | 0.44 | (0.0725) |
| 24 | 0.47 | 0.40 | (0.0746) |
| 25 | 0.32 | 0.38 | 0.0574 |

The next thing we need to do is subtract the previous period’s error from the current period’s error. After that, we square our result. Note that we will only have 24 observations (we can’t subtract anything from the first observation):

| Year | Error | Difference in Errors | Squared Difference in Errors |
|----|-----------|-----------|---------|
| 1 | (0.07347) | | |
| 2 | 0.00334 | 0.07681 | 0.00590 |
| 3 | 0.03910 | 0.03576 | 0.00128 |
| 4 | 0.06218 | 0.02308 | 0.00053 |
| 5 | (0.01624) | (0.07842) | 0.00615 |
| 6 | (0.01242) | 0.00382 | 0.00001 |
| 7 | (0.03024) | (0.01781) | 0.00032 |
| 8 | 0.01860 | 0.04883 | 0.00238 |
| 9 | (0.10569) | (0.12429) | 0.01545 |
| 10 | (0.02644) | 0.07925 | 0.00628 |
| 11 | 0.04843 | 0.07487 | 0.00561 |
| 12 | 0.05728 | 0.00884 | 0.00008 |
| 13 | 0.02217 | (0.03511) | 0.00123 |
| 14 | (0.04075) | (0.06292) | 0.00396 |
| 15 | (0.03557) | 0.00519 | 0.00003 |
| 16 | 0.03397 | 0.06954 | 0.00484 |
| 17 | 0.02489 | (0.00909) | 0.00008 |
| 18 | (0.02697) | (0.05185) | 0.00269 |
| 19 | 0.08509 | 0.11206 | 0.01256 |
| 20 | (0.01728) | (0.10237) | 0.01048 |
| 21 | 0.01012 | 0.02740 | 0.00075 |
| 22 | 0.06964 | 0.05952 | 0.00354 |
| 23 | (0.07252) | (0.14216) | 0.02021 |
| 24 | (0.07460) | (0.00208) | 0.00000 |
| 25 | 0.05738 | 0.13198 | 0.01742 |

If we sum up the last column, we get 0.1218; dividing that by our ESS of 0.0625 gives a value of 1.95. What does this mean?

We have just computed what is known as the Durbin-Watson Statistic, which is used to detect the presence of autocorrelation. The Durbin-Watson statistic, d, can be anywhere from zero to 4. Generally, when d is close to zero, it suggests the presence of positive autocorrelation; a value close to 2 indicates no autocorrelation; while a value close to 4 indicates negative autocorrelation. In any case, you want your Durbin-Watson statistic to be as close to two as possible, and ours is.

Hence, our model seems to be free of autocorrelation.
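For reference, here is a compact Python sketch of the Durbin-Watson calculation we just did by hand: square the differences between consecutive errors, sum them, and divide by the sum of the squared errors. The function name durbin_watson and the errors argument are illustrative placeholders.

```python
# A minimal sketch of the Durbin-Watson statistic.
import numpy as np

def durbin_watson(errors):
    errors = np.asarray(errors, dtype=float)
    diffs = np.diff(errors)                          # e_t minus e_(t-1); 24 values here
    return np.sum(diffs ** 2) / np.sum(errors ** 2)  # numerator 0.1218, denominator 0.0625

# Feeding in the 25 errors from the table above returns roughly 1.95 -
# close to 2, suggesting the residuals are free of autocorrelation.
```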

Now, Let’s Go Forecast!

Now that we have validated our model and seen that it is free of autocorrelation, we can be comfortable forecasting. Let's say that for years 26 and 27, we have the following forecasts for net revenues per deposit dollar, X1t, and number of S&L offices, X2t:

X1,26 = 4.70 and X2,26 = 9,350

X1,27 = 4.80 and X2,27 = 9,400

Plugging each of these into our equations, we generate the following forecasts:

Ŷ26 = 1.56450 + 0.23720 * 4.70 – 0.000249 * 9,350

=0.3504

Ŷ27 = 1.56450 + 0.23720 * 4.80 – 0.000249 * 9,400

=0.3617

Next Week’s Forecast Friday Topic: The Effect of Omitting an Important Variable

Now that we’ve walked you through this process, you know how to forecast and run multiple regression. Next week, we will discuss what happens when a key independent variable is omitted from a regression model and all the problems it causes when we violate the regression assumption that “all relevant and no irrelevant independent variables are included in the model.” Next week’s post will show a complete demonstration of such an impact. Stay tuned!

Forecast Friday Topic: Prelude to Multiple Regression Analysis – Regression Assumptions

June 10, 2010

(Eighth in a series)

In last week’s Forecast Friday post, we continued our discussion of simple linear regression analysis, discussing how to check both the slope and intercept coefficients for significance. We then discussed how to create a prediction interval for our forecasts. I had intended this week’s Forecast Friday post to delve straight into multiple regression analysis, but have decided instead to spend some time talking about the assumptions that go into building a regression model.  These assumptions apply to both simple and multiple regression analysis, but their importance is especially noticeable with multiple regression, and I feel it is best to make you aware of them, so that when we discuss multiple regression both as a time series and as a causal/econometric forecasting tool, you’ll know how to detect and correct regression models that violate these assumptions. We will formally begin our discussion of multiple regression methods next week.

Five Key Assumptions for Ordinary Least Squares (OLS) Regression

When we develop our parameter estimates for our regression model, we want to make sure that all of our estimators have the smallest variance. Recall that when you were computing the value of your estimate, b, for the parameter β, you used the equation below:

b = Σ(X – X̄)(Y – Ȳ) / Σ(X – X̄)²

You were subtracting your independent variable’s average from each of its actual values, and doing likewise for the dependent variable. You then multiplied those two quantities together (for each observation) and summed them up to get the numerator of that calculation. To get the denominator, you again subtracted the independent variable’s mean from each of its actual values and then squared them. Then you summed those up. The calculation of the denominator is the focal point here: the value you get for your estimate of β is the estimate that minimizes the squared error for your model. Hence, the term, least squares. If you were to take the denominator of the equation above and divide it by your sample size (less one: n-1), you would get the variance of your independent variable, X. This variance is something you also want to minimize, so that your estimate of β is efficient. When your parameter estimates are efficient, you can make stronger statistical statements about them.

We also want to be sure that our estimators are free of bias. That is, we want to be sure that our sample estimate, b, is on average, equal to our true population parameter, β. That is, if we calculated several estimates of β, the average of our b’s should equal β.

Essentially, there are five assumptions that must be made to ensure our estimators are unbiased and efficient:

Assumption #1: The regression equation correctly specifies the true model.

In order to correctly specify the true model, the relationship between the dependent and independent variable must be linear. Also, we must neither exclude relevant independent variables from nor include irrelevant independent variables in our regression equation. If any of these conditions are not met – that is, Assumption #1 is violated – then our parameter estimates will exhibit bias, particularly specification bias.

In addition, our independent and dependent variables must be measured accurately. For example, if we are trying to estimate salary based on years of schooling, we want to make sure our model is measuring years of schooling as actual years of schooling, and not desired years of schooling.

Assumption #2: The independent variables are fixed numbers and not correlated with error terms.

I warned you at the start of our discussion of linear regression that the error terms were going to be important. Let's start with the notion of fixed numbers. When you are running a regression analysis, the values of each independent variable should not change every time you test the equation; that is, the values of your independent variables are known and controlled by you. In addition, the independent variables should not be correlated with the error term. If an independent variable is correlated with the error term, then it is very possible that a relevant independent variable was excluded from the equation. If Assumption #2 is violated, then your parameter estimates will be biased.

Assumption #3: The error terms ε, have a mean, or expected value, of zero.

As you saw in a previous blog post, when we developed our regression equation for Sue Stone's monthly sales, we went back and plugged each observation's independent variable into our model to generate estimated sales for that month, and then subtracted the estimated sales from the actual. Some of those errors were positive, some were negative. When we sum up all of these errors, they should equal zero. If they don't, the result is a biased estimate of the intercept, a (which we use to estimate α). This assumption is not of serious concern, however, since the intercept is often of secondary importance to the slope estimate. We also assume that the error terms are normally distributed.

Assumption #4: The error terms have a constant variance.

The variance of the error term for all values of Xi should be constant, that is, the error terms should be homoscedastic. Visually, if you were to plot the line generated by your regression equation, and then plot the error terms for each observation as points above or below the regression line, the points should cluster around the line in a band of equal width above and below the regression line. If, instead, the points began to move further and further away from the regression line as the value of X increased, then the error terms are heteroscedastic, and the constant variance assumption is violated. Heteroscedasticity does not bias parameter estimates, but makes them inefficient, or untrustworthy.

Why does heteroscedasticity occur? Sometimes, a data set has some observations whose values for the independent variable are vastly different from those of the other observations. These cases are known as outliers. For example, if you have five observations, and their X values were as follows:

{ 5, 6, 6, 7, 20}

The fifth observation would be the outlier, since its X value of 20 is so different from those of the four other observations. Regression equations place excessive weight on extreme values. Let's assume you are trying to construct a model to predict new car purchases based on income. You choose "household income" as your independent variable and "new car spending" as your dependent variable. You survey 10 people who bought a new car, recording both their income and the amount they paid for the car. You sort the respondents by income and look at their spending, as depicted in the table below:

| Respondent | Annual Income | New Car Purchase Price |
|----|---------|---------|
| 1 | $30,000 | $25,900 |
| 2 | $32,500 | $27,500 |
| 3 | $35,000 | $26,000 |
| 4 | $37,500 | $29,000 |
| 5 | $40,000 | $32,000 |
| 6 | $42,500 | $30,500 |
| 7 | $45,000 | $34,000 |
| 8 | $47,500 | $26,500 |
| 9 | $50,000 | $38,000 |
| 10 | $52,500 | $40,000 |

Do you notice the pattern that as income increases, the new car purchase price tends to move upward? For the most part, it does. But does it go up consistently? No. Notice that respondent #3 spent less for a car than respondent #2, who has a lower income, and respondent #8 spent much less for a car than lower-income respondents 4-7. Respondent #8 is an outlier. This happens because lower-income households are limited in their options for new cars, while higher-income households have more options. A low-income respondent may be limited to buying a Ford Focus or a Honda Civic; a higher-income respondent may be able to buy a Lexus or BMW, yet still choose to buy the Civic or the Focus. Heteroscedasticity is very likely to occur with this data set. In case you haven't guessed, heteroscedasticity is more likely to occur with cross-sectional data than with time series data.
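If you want to eyeball heteroscedasticity the way described above, a quick residual plot does the job. Below is an illustrative Python sketch using the hypothetical survey data from the table; a fan of residuals that widens as income grows would be the warning sign. The plotting choices are mine, not a prescribed diagnostic.

```python
# Sketch of a visual heteroscedasticity check: fit the regression, then plot the
# residuals against the independent variable and look for a widening spread.
import numpy as np
import matplotlib.pyplot as plt

income = np.array([30000, 32500, 35000, 37500, 40000, 42500, 45000, 47500, 50000, 52500])
price = np.array([25900, 27500, 26000, 29000, 32000, 30500, 34000, 26500, 38000, 40000])

slope, intercept = np.polyfit(income, price, deg=1)   # simple least-squares line
residuals = price - (intercept + slope * income)

plt.scatter(income, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Annual income")
plt.ylabel("Residual")
plt.show()
```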

Assumption #5: The error terms are not correlated with each other.

Knowing the error term for any of our observations should not allow us to predict the error term of any other observation; the errors must be truly random. If they aren’t, autocorrelation results and the parameter estimates are inefficient, though unbiased. Autocorrelation is much more common with time series data than with cross-sectional data, and occurs because past occurrences can influence future ones. A good example of this is when I was building a regression model to help a college forecast enrollment. I started by building a simple time series regression model, then examined the errors and detected autocorrelation. How did it happen? Because most students who are enrolled in the Fall term are also likely to be enrolled again in the consecutive Spring term. Hence, I needed to correct for that autocorrelation. Similarly, while a company’s advertising expenditures in April may impact its sales in April, they are also likely to have some impact on its sales in May. This too can cause autocorrelation.

When these assumptions hold, your regression equation is likely to yield parameter estimates that are the "best linear unbiased estimators," or BLUE. Keep them in mind as we go through our upcoming discussions of multiple regression.

Next Forecast Friday Topic: Regression with Two or More Independent Variables

Next week, we will plunge into our discussion of multiple regression. I will give you an example of how multiple variables are used to forecast a single dependent variable, and how to check for validity. As we go through the next couple of discussions, I will show you how to analyze the error terms to find violations of the regression assumptions. I will also show you how to determine the validity of the model, and to identify whether all independent variables within your model are relevant.

Forecast Friday Topic: Simple Regression Analysis (Continued)

June 3, 2010

(Seventh in a series)

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm’s sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameter estimates, α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination – R2 – which indicates how much of the change in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then talk about a couple of other things so that we can move on to next week’s topic on multiple regression.

Revisiting Sue Stone

Last week, we helped Sue Stone develop a model using simple regression analysis, so that she could forecast sales. She had 12 months of sales data, which was her dependent variable, or Y, and each month (numbered from 1 to 12) was her independent variable, or X. Sue's regression equation was as follows:

Ŷi = 9,636.36 + 479.02Xi

Where i is the period number corresponding to the month. So, in June 2009, i would be equal to 6; in January 2010, i would be equal to 13. Of course, since X is the month number, X=i in this example. Recall that Sue’s equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue’s R2=.917, which says 91.7% of the change in Sue’s monthly sales is explained by changes in the passing months.

Are these claims valid? We need to do some further work here.

Are the Parameter Estimates Statistically Significant?

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to the population. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the actual parameters for the population’s true average and standard deviation. These are subject to sampling error. Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate population parameters (the population’s true α and β), we are frequently concerned that they might actually have values of zero. Even though we have derived values a=$9636.36 and b=$479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H0: β = 0

HA: β≠ 0

Our first step here is to compute the standard error of the estimate, that is, how spread out each value of the dependent variable (sales) is from the average value of sales. Since we are sampling from a population, we are looking for the estimator of the standard error of the estimate. That equation is:

sε = √(ESS / (n – k – 1))

Where ESS is the error sum of squares – or $2,937,062.94 – from Sue’s equation; n is the sample size, or 12; k is the number of independent variables in the model, in this case, just 1. When we plug those numbers into the above equation, we’re dividing the ESS by 10 and then taking the square root, so Sue’s estimator is:

sε = $541.95

Now that we know the estimator for the standard error of the estimate, we need to use it to find the estimator of the standard deviation of the regression slope (b). That equation is given by:

sb = sε / √(Σ(X – X̄)²)

Remember from last week’s blog post that the sum of all the (x-xbar) squared values was 143. Since we have the estimator for the standard error of the estimate, we divide $541.95 by the square root of 143 to get an Sb = 45.32. Next we need to compute the t-statistic. If Sue’s t-statistic is greater than her critical t-value, then she’ll know the parameter estimate of $479.02 is significant. In Sue’s regression, she has 12 observations, and thus 10 degrees of freedom: (n-k-1) = (12-1-1) = 10. Assuming a 95% confidence interval, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t value is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimates is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b1 coefficient ($479.02) by her sb ($45.32). She ends up with a t-statistic of 10.57, which is significant.
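Here is a tiny Python sketch of that significance test, with SciPy looking up the critical t instead of a printed table. The numbers are the ones Sue computed above; the variable names are mine.

```python
# Sketch of the slope significance test for Sue's regression.
from scipy import stats

b, s_b, df = 479.02, 45.32, 10        # slope estimate, its standard error, and n - k - 1
t_stat = b / s_b                      # about 10.57
t_crit = stats.t.ppf(0.975, df)       # about 2.228 for a 95% confidence level

print(abs(t_stat) > t_crit)           # True -> the slope is significantly different from zero
```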

Next Sue must do the same for her intercept value, a. To do this, Sue must compute the estimator of the standard deviation of the intercept. The equation for this estimate is:

sa = sε × √(1/n + X̄² / Σ(X – X̄)²)

All she needs to do is plug in her numbers from earlier: her sε = $541.95; n=12; she just takes her average x-bar of 6.5 and squares it, bringing it to 42.25; and the denominator is the same 143. Working that all in, Sue gets a standard error of 333.545. She divides her intercept value of $9636.36 by 333.545 and gets a t-statistic of 28.891, which exceeds the 2.228 critical t, so her intercept is also significant.

Prediction Intervals in Forecasting

Whew! Aren’t you glad those t-statistics calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it’s very important that you understand how they were derived and the theory behind them. Now, we move back to forecasting. In last week’s post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X, and got a point forecast for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a prediction interval. A prediction interval is simply the point forecast plus or minus the critical t value (2.228) for a desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). So, Sue’s prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

$14,656.18 to $17,071.10

So, since Sue had chosen a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10.
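And here is a short Python sketch of that prediction-interval arithmetic, mirroring the simplified point forecast ± critical t × sε approach used in this post (a fuller prediction interval would also widen as X moves away from its mean). The function name prediction_interval is my own.

```python
# Sketch of the simplified prediction interval used above.
from scipy import stats

def prediction_interval(point_forecast, s_e, df, level=0.95):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)   # about 2.228 for df=10 at 95%
    margin = t_crit * s_e
    return point_forecast - margin, point_forecast + margin

print(prediction_interval(15863.64, 541.95, df=10))  # roughly (14656, 17071)
```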

Recap and Plan for Next Week’s Post

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You also learned how to compute the estimator of the standard error of the estimate, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics you need to determine whether those parameter estimates are indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess the model against those assumptions, and generate projections.