Forecast Friday Topic: Multiple Regression Analysis (continued)

June 24, 2010

(Tenth in a series)

Today we resume our discussion of multiple regression analysis. Last week, we built a model to determine the extent of any relationship between U.S. savings & loan associations’ percent profit margin and two independent variables, net revenues per deposit dollar and number of S&L offices. Today, we will compute the 95% confidence interval for each parameter estimate; determine whether the model is valid; check for autocorrelation; and use the model to forecast. Recall that our resulting model was:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

Where Yt is the percent profit margin for the S&L in Year t; X1t is the net revenues per deposit dollar in Year t; and X2t is the number of S&L offices in the U.S. in Year t. Recall that the R2 is .865, indicating that 86.5% of the change in percentage profit margin is explained by changes in net revenues per deposit dollar and number of S&L offices.
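For readers who want to reproduce this kind of model in code rather than Excel, here is a minimal sketch using Python's statsmodels package. The yearly values of the two independent variables are not reproduced in this post, so the function below assumes you supply the 25 observations yourself; the function and argument names are ours for illustration, not part of the original analysis.

```python
# Minimal sketch: estimating a two-variable regression like the S&L model
# with statsmodels. The caller supplies the 25 yearly observations, since
# the raw X values are not listed in this post.
import numpy as np
import statsmodels.api as sm

def fit_profit_margin_model(y, x1, x2):
    """OLS regression of percent profit margin (y) on net revenue per
    deposit dollar (x1) and number of S&L offices (x2)."""
    X = sm.add_constant(np.column_stack([x1, x2]))  # prepend intercept column
    return sm.OLS(y, X).fit()

# Usage, once y, x1, x2 are loaded as length-25 arrays:
#   results = fit_profit_margin_model(y, x1, x2)
#   results.params     # intercept, b1, b2 -- roughly 1.5645, 0.2372, -0.000249
#   results.bse        # standard errors of the coefficients
#   results.rsquared   # roughly 0.865
```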

Determining the 95% Confidence Interval for the Partial Slope Coefficients

In multiple regression analysis, because there are multiple independent variables, the parameter estimate for each independent variable captures only part of the slope of the regression line; hence the coefficients β1 and β2 are referred to as partial slope estimates. As with simple linear regression, we need to determine the 95% confidence interval for each parameter estimate so that we can get an idea of where the true population parameter lies. Recall from our June 3 post that we did this by determining the equation for the standard error of the estimate, sε, and then the standard error of the regression slope, sb. That works well for simple regression, but multiple regression is more complicated: deriving the standard errors of the partial regression coefficients requires linear algebra and would be too involved to discuss here. Fortunately, Excel and most statistical programs compute these values for us. So, we will simply state the values of sb1 and sb2 and go from there.

sb1 = 0.05556

sb2 = 0.00003

Also, we need our critical t-value for n - k - 1 = 25 - 2 - 1 = 22 degrees of freedom, which, at the 95% confidence level, is 2.074.

Hence, our 95% confidence interval for β1 is computed as:

0.23720 ± 2.074 × 0.05556

=0.12197 to 0.35243

Hence, we are saying that we can be 95% confident that the true parameter β1 lies somewhere between the values of 0.12197 and 0.35243.

The procedure for β2 is similar:

-0.000249 ± 2.074 × 0.00003

=-0.00032 to -0.00018

Hence, we can be 95% confident that the true parameter β2 lies somewhere between the values of -0.00032 and -0.00018. Also, the confidence interval for the intercept, α, ranges from 1.40 to 1.73.

Note that in none of these cases does the confidence interval contain zero. The confidence intervals for α and β1 are entirely positive; that for β2 is entirely negative. If any parameter's confidence interval crossed zero, that parameter estimate would not be statistically significant.
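If you would rather compute these intervals in code than by table lookup, here is a short sketch in Python using scipy; the coefficient and standard-error values are the ones given above.

```python
# Sketch: 95% confidence intervals for the partial slope coefficients,
# using scipy to look up the critical t value (about 2.074 for 22 degrees of freedom).
from scipy import stats

def conf_interval(coef, se, df, level=0.95):
    """Two-sided confidence interval for a regression coefficient."""
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
    return coef - t_crit * se, coef + t_crit * se

print(conf_interval(0.23720, 0.05556, df=22))    # roughly (0.122, 0.352)
print(conf_interval(-0.000249, 0.00003, df=22))  # roughly (-0.00031, -0.00019)
```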

Is Our Model Valid?

The next thing we want to do is determine whether our model is valid. In validating the model, we are testing whether our independent variables explain the variation in the dependent variable. So we start with a hypothesis test:

H0: β1 = β2 = 0

HA: at least one β ≠ 0

Our null hypothesis says that our independent variables, net revenue per deposit dollar and number of S&L offices, explain none of the variation in an S&L's percentage profit margin, and hence that our model is not valid. Our alternative hypothesis says that at least one of our independent variables explains some of the variation in an S&L's percentage profit margin, and thus that the model is valid.

So how do we do it? Enter the F-test. Like the t-test, the F-test is a means of hypothesis testing. Let's start by calculating the F-statistic for our model. We do that with the following equation:

F = (RSS / k) / (ESS / (n - k - 1))

Remember that RSS is the regression sum of squares and ESS is the error sum of squares. The May 27th Forecast Friday post showed you how to calculate RSS and ESS. For this model, our RSS = 0.4015 and our ESS = 0.0625; k is the number of independent variables (2), and n is the sample size (25). Our equation reduces to:

F = (0.4015 / 2) / (0.0625 / 22)

= 70.66

If our Fcalc is greater than the critical F value for the distribution, then we can reject our null hypothesis and conclude that there is strong evidence that at least one of our independent variables explains some of the variation in an S&L's percentage profit margin. How do we determine our critical F? There is yet another table in any statistics book or statistics Web site called the "F Distribution" table. In it, you look up two sets of degrees of freedom – one for the numerator and one for the denominator of your Fcalc equation. In the numerator we have two degrees of freedom; in the denominator, 22. So we look at the F Distribution table and notice that the columns represent numerator degrees of freedom and the rows denominator degrees of freedom. When we find column (2), row (22), we end up with a critical F-value of 5.72 (the value for the 1% significance level).

Our Fcalc is greater than that, so we can conclude that our model is valid.
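As a quick check, the same F-test can be run in a few lines of Python; scipy stands in for the printed F table here.

```python
# Sketch: the F-test for overall model validity, using the RSS and ESS
# figures from this post.
from scipy import stats

RSS, ESS = 0.4015, 0.0625   # regression and error sums of squares
k, n = 2, 25                # number of independent variables, sample size

F_calc = (RSS / k) / (ESS / (n - k - 1))            # roughly 70.7
F_crit = stats.f.ppf(0.99, dfn=k, dfd=n - k - 1)    # roughly 5.72 (1% level)

print(F_calc > F_crit)      # True -- reject the null hypothesis
```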

Is Our Model Free of Autocorrelation?

Recall from our assumptions that none of our error terms should be correlated with one another. If they are, autocorrelation results, rendering our parameter estimates inefficient. To check for autocorrelation, we need to look at our error terms, which we obtain by comparing our predicted percentage profit margin, Ŷ, with the actual value, Y (in the table below, the error is Ŷt – Yt, with negative values shown in parentheses):

| Year | Actual Percentage Profit Margin (Yt) | Predicted by Model (Ŷt) | Error |
|------|--------------------------------------|-------------------------|-------|
| 1 | 0.75 | 0.68 | (0.0735) |
| 2 | 0.71 | 0.71 | 0.0033 |
| 3 | 0.66 | 0.70 | 0.0391 |
| 4 | 0.61 | 0.67 | 0.0622 |
| 5 | 0.70 | 0.68 | (0.0162) |
| 6 | 0.72 | 0.71 | (0.0124) |
| 7 | 0.77 | 0.74 | (0.0302) |
| 8 | 0.74 | 0.76 | 0.0186 |
| 9 | 0.90 | 0.79 | (0.1057) |
| 10 | 0.82 | 0.79 | (0.0264) |
| 11 | 0.75 | 0.80 | 0.0484 |
| 12 | 0.77 | 0.83 | 0.0573 |
| 13 | 0.78 | 0.80 | 0.0222 |
| 14 | 0.84 | 0.80 | (0.0408) |
| 15 | 0.79 | 0.75 | (0.0356) |
| 16 | 0.70 | 0.73 | 0.0340 |
| 17 | 0.68 | 0.70 | 0.0249 |
| 18 | 0.72 | 0.69 | (0.0270) |
| 19 | 0.55 | 0.64 | 0.0851 |
| 20 | 0.63 | 0.61 | (0.0173) |
| 21 | 0.56 | 0.57 | 0.0101 |
| 22 | 0.41 | 0.48 | 0.0696 |
| 23 | 0.51 | 0.44 | (0.0725) |
| 24 | 0.47 | 0.40 | (0.0746) |
| 25 | 0.32 | 0.38 | 0.0574 |

The next thing we need to do is subtract the previous period's error from the current period's error and then square the result. Note that we will only have 24 differences (there is nothing to subtract from the first observation):

| Year | Error | Difference in Errors | Squared Difference in Errors |
|------|-------|----------------------|------------------------------|
| 1 | (0.07347) | | |
| 2 | 0.00334 | 0.07681 | 0.00590 |
| 3 | 0.03910 | 0.03576 | 0.00128 |
| 4 | 0.06218 | 0.02308 | 0.00053 |
| 5 | (0.01624) | (0.07842) | 0.00615 |
| 6 | (0.01242) | 0.00382 | 0.00001 |
| 7 | (0.03024) | (0.01781) | 0.00032 |
| 8 | 0.01860 | 0.04883 | 0.00238 |
| 9 | (0.10569) | (0.12429) | 0.01545 |
| 10 | (0.02644) | 0.07925 | 0.00628 |
| 11 | 0.04843 | 0.07487 | 0.00561 |
| 12 | 0.05728 | 0.00884 | 0.00008 |
| 13 | 0.02217 | (0.03511) | 0.00123 |
| 14 | (0.04075) | (0.06292) | 0.00396 |
| 15 | (0.03557) | 0.00519 | 0.00003 |
| 16 | 0.03397 | 0.06954 | 0.00484 |
| 17 | 0.02489 | (0.00909) | 0.00008 |
| 18 | (0.02697) | (0.05185) | 0.00269 |
| 19 | 0.08509 | 0.11206 | 0.01256 |
| 20 | (0.01728) | (0.10237) | 0.01048 |
| 21 | 0.01012 | 0.02740 | 0.00075 |
| 22 | 0.06964 | 0.05952 | 0.00354 |
| 23 | (0.07252) | (0.14216) | 0.02021 |
| 24 | (0.07460) | (0.00208) | 0.00000 |
| 25 | 0.05738 | 0.13198 | 0.01742 |

If we sum up the last column, we get 0.1218. If we then divide that by our ESS of 0.0625, we get a value of 1.95. What does this mean?

We have just computed what is known as the Durbin-Watson statistic, which is used to detect the presence of autocorrelation. The Durbin-Watson statistic, d, can range from zero to 4. Generally, a value of d close to zero suggests the presence of positive autocorrelation; a value close to 2 indicates no autocorrelation; and a value close to 4 indicates negative autocorrelation. In any case, you want your Durbin-Watson statistic to be as close to 2 as possible, and ours, at 1.95, is.

Hence, our model seems to be free of autocorrelation.
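Here is a short sketch of the same Durbin-Watson computation in Python, using the 25 residuals from the table above (parenthesized values treated as negative). The same statistic is also available ready-made in statsmodels as statsmodels.stats.stattools.durbin_watson.

```python
# Sketch: Durbin-Watson statistic -- sum of squared differences of successive
# residuals divided by the sum of squared residuals (the ESS).
import numpy as np

residuals = np.array([
    -0.07347,  0.00334,  0.03910,  0.06218, -0.01624,
    -0.01242, -0.03024,  0.01860, -0.10569, -0.02644,
     0.04843,  0.05728,  0.02217, -0.04075, -0.03557,
     0.03397,  0.02489, -0.02697,  0.08509, -0.01728,
     0.01012,  0.06964, -0.07252, -0.07460,  0.05738,
])

d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(round(d, 2))   # roughly 1.95 -- close to 2, so little sign of autocorrelation
```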

Now, Let’s Go Forecast!

Now that we have validated our model and seen that it is free of autocorrelation, we can be comfortable forecasting. Let's say that for years 26 and 27, we have the following projected values for net revenues per deposit dollar (X1) and number of S&L offices (X2):

X1,26 = 4.70 and X2,26 = 9,350

X1,27 = 4.80 and X2,27 = 9,400

Plugging each of these into our equation, we generate the following forecasts:

Ŷ26 = 1.56450 + 0.23720 * 4.70 – 0.000249 * 9,350

=0.3504

Ŷ27 = 1.56450 + 0.23720 * 4.80 – 0.000249 * 9,400

=0.3617
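The same forecasts can be reproduced with a one-line function; the small differences from the figures above (0.3504 and 0.3617) presumably come from the coefficients being rounded here.

```python
# Sketch: point forecasts from the fitted equation for years 26 and 27.
def forecast(x1, x2, a=1.56450, b1=0.23720, b2=-0.000249):
    return a + b1 * x1 + b2 * x2

print(forecast(4.70, 9350))   # roughly 0.35
print(forecast(4.80, 9400))   # roughly 0.36
```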

Next Week’s Forecast Friday Topic: The Effect of Omitting an Important Variable

Now that we’ve walked you through this process, you know how to forecast and run multiple regression. Next week, we will discuss what happens when a key independent variable is omitted from a regression model and all the problems it causes when we violate the regression assumption that “all relevant and no irrelevant independent variables are included in the model.” Next week’s post will show a complete demonstration of such an impact. Stay tuned!

Forecast Friday Topic: Simple Regression Analysis (Continued)

June 3, 2010

(Seventh in a series)

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm's sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameters α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination, R2, which indicates how much of the change in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then cover a couple of other things so that we can move on to next week's topic, multiple regression.

Revisiting Sue Stone

Last week, we helped Sue Stone develop a model using simple regression analysis, so that she could forecast sales. She had 12 months of sales data, which was her dependent variable, or Y, and each month (numbered from 1 to 12) was her independent variable, or X. Sue's regression equation was as follows:

Ŷi = $9,636.36 + $479.02Xi

Where i is the period number corresponding to the month. So, in June 2009, i would be equal to 6; in January 2010, i would be equal to 13. Of course, since X is the month number, X = i in this example. Recall that Sue's equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue's R2 = .917, which says that 91.7% of the change in Sue's monthly sales is explained by the passing months.

Are these claims valid? We need to do some further work here.

Are the Parameter Estimates Statistically Significant?

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to it. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the population's true average and standard deviation, and they are subject to sampling error. Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate population parameters (the population's true α and β), we must be concerned that they might actually have values of zero. Even though we have derived the values a = $9,636.36 and b = $479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H0: β = 0

HA: β ≠ 0

Our first step here is to compute the standard error of the estimate, that is, a measure of how spread out the actual values of the dependent variable (sales) are around the values predicted by the regression line. Since we are working with a sample of the population, we are looking for the estimator of the standard error of the estimate. That equation is:

sε = √[ESS / (n - k - 1)]

Where ESS is the error sum of squares, or $2,937,062.94, from Sue's equation; n is the sample size, or 12; and k is the number of independent variables in the model, in this case just 1. When we plug those numbers into the equation, we divide the ESS by 10 and then take the square root, so Sue's estimator is:

sε = $541.95

Now that we know the estimator for the standard error of the estimate, we need to use it to find the estimator of the standard deviation of the regression slope, b. That equation is given by:

sb = sε / √[Σ(Xi - X̄)²]

Remember from last week's blog post that the sum of all the squared (X - X̄) values was 143. Since we have the estimator for the standard error of the estimate, we divide $541.95 by the square root of 143 to get sb = 45.32. Next we need to compute the t-statistic. If Sue's t-statistic exceeds her critical t-value in absolute value, then she'll know the parameter estimate of $479.02 is significant. In Sue's regression, she has 12 observations, and thus 10 degrees of freedom: (n - k - 1) = (12 - 1 - 1) = 10. At a 95% level of confidence, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t-statistic is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimate is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b1 coefficient ($479.02) by her sb ($45.32). She ends up with a t-statistic of 10.57, which is significant.

Next, Sue must do the same for her intercept value, a. To do this, Sue must compute the estimator of the standard deviation of the intercept. The equation for this estimator is:

sa = sε × √[1/n + X̄² / Σ(Xi - X̄)²]

All she needs to do is plug in her numbers from earlier: her sε = $541.95; n = 12; her average X̄ of 6.5, squared, is 42.25; and the denominator is the same 143. Working that through, Sue gets a standard error of 333.545. She divides her intercept value of $9,636.36 by 333.545 and gets a t-statistic of 28.891, which also exceeds the critical t of 2.228, so her intercept is significant as well.
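For readers following along in code, here is a sketch of the whole significance-testing sequence, using only the summary quantities quoted in this post (ESS, n, k, Σ(X - X̄)², and X̄) rather than re-fitting the regression.

```python
# Sketch: standard errors and t-statistics for Sue's slope and intercept.
import math
from scipy import stats

ESS, n, k = 2937062.94, 12, 1
ssx, xbar = 143.0, 6.5              # sum of squared deviations of X, and mean of X
a, b = 9636.36, 479.02              # intercept and slope estimates

s_e = math.sqrt(ESS / (n - k - 1))             # standard error of the estimate, ~541.95
s_b = s_e / math.sqrt(ssx)                     # std. error of the slope, ~45.32
s_a = s_e * math.sqrt(1 / n + xbar**2 / ssx)   # std. error of the intercept, ~333.5

t_b, t_a = b / s_b, a / s_a                    # ~10.57 and ~28.89
t_crit = stats.t.ppf(0.975, df=n - k - 1)      # ~2.228

print(t_b > t_crit, t_a > t_crit)              # both True -- both estimates significant
```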

Prediction Intervals in Forecasting

Whew! Aren’t you glad those t-statistics calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it’s very important that you understand how they were derived and the theory behind them. Now, we move back to forecasting. In last week’s post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X, and got a point forecast for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a prediction interval. A prediction interval is simply the point forecast plus or minus the critical t value (2.228) for a desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). So, Sue’s prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

= $14,656.18 to $17,071.10

So, since Sue had chosen a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10.
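Here is the same prediction interval in code. Note that it follows the simplified form used in this post (point forecast ± critical t × sε); some textbooks widen the interval slightly with extra terms for the distance of the forecast's X value from the mean of X.

```python
# Sketch: 95% prediction interval for January 2010 (month 13) sales.
from scipy import stats

point = 9636.36 + 479.02 * 13        # point forecast, ~$15,863.62
t_crit = stats.t.ppf(0.975, df=10)   # ~2.228
s_e = 541.95                         # standard error of the estimate

lower, upper = point - t_crit * s_e, point + t_crit * s_e
print(round(lower, 2), round(upper, 2))   # roughly 14,656 and 17,071
```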

Recap and Plan for Next Week’s Post

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You also learned how to compute the estimator of the standard error of the estimate, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics needed to determine whether those parameter estimates were indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess it against those assumptions, and generate projections.