*(Seventh in a series)*

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm’s sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameters, α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination, R^{2}, which indicates how much of the variation in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then talk about a couple of other things so that we can move on to next week’s topic on multiple regression.

**Revisiting Sue Stone**

Last week, we helped Sue Stone develop a model using simple regression analysis, so that she could forecast sales. She had 12 months of sales data, which was her dependent variable, or Y, and each month (numbered from 1 to 12) was her independent variable, or X. Sue’s regression equation was as follows:

Ŷ_{i} = 9636.36 + 479.02 X_{i}

Where *i* is the period number corresponding to the month. So, in June 2009, *i* would be equal to 6; in January 2010, *i* would be equal to 13. Of course, since X is the month number, X = i in this example. Recall that Sue’s equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue’s R^{2} = 0.917, which says 91.7% of the variation in Sue’s monthly sales is explained by the passing of the months.
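To make the equation concrete, here is a minimal Python sketch that plugs a month number into Sue’s fitted line. The function name `forecast_sales` is my own choice for illustration; the intercept and slope are the rounded values quoted in this post, so the result can differ from last week’s figure by a couple of cents.

```python
# Rounded coefficients quoted in the post (intercept a, slope b).
A = 9636.36
B = 479.02


def forecast_sales(month_number):
    """Return the fitted sales value for a given period number i."""
    return A + B * month_number


june_2009 = forecast_sales(6)      # i = 6
january_2010 = forecast_sales(13)  # i = 13, one month past the sample

print(f"June 2009 fitted sales:    ${june_2009:,.2f}")
print(f"January 2010 point forecast: ${january_2010:,.2f}")
```

With the rounded coefficients the January 2010 figure comes out two cents below the $15,863.64 reported last week, which was presumably computed from unrounded coefficients.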

Are these claims valid? We need to do some further work here.

**Are the Parameter Estimates Statistically Significant?**

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to the population. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the actual parameters for the population’s true average and standard deviation. These are subject to *sampling error.* Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate population parameters (the population’s true α and β), we are frequently concerned that they might actually have values of zero. Even though we have derived values a=$9636.36 and b=$479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H_{0}: β = 0

H_{A}: β≠ 0

Our first step here is to compute the *standard error of the estimate*, that is, how spread out the observed values of the dependent variable (sales) are around the values predicted by the regression line. Since we are sampling from a population, we are looking for the *estimator* for the standard error of the estimate. That equation is:

s_{ε} = \sqrt{ESS / (n - k - 1)}

Where ESS is the *error sum of squares* (the sum of the squared residuals), or $2,937,062.94, from Sue’s equation; *n* is the sample size, or 12; and *k* is the number of independent variables in the model, in this case just 1. When we plug those numbers into the above equation, we’re dividing the ESS by n - k - 1 = 12 - 1 - 1 = 10 and then taking the square root, so Sue’s estimator is:

s_{ε} = $541.95
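The arithmetic behind that figure can be checked with a few lines of Python. This is just a sketch using the summary numbers quoted above; the variable names are mine.

```python
import math

ess = 2_937_062.94  # error sum of squares from Sue's regression
n = 12              # number of monthly observations
k = 1               # number of independent variables

degrees_of_freedom = n - k - 1  # 10
s_eps = math.sqrt(ess / degrees_of_freedom)

print(f"Degrees of freedom: {degrees_of_freedom}")
print(f"Standard error of the estimate: ${s_eps:,.2f}")
```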

Now that we know the estimator for the standard error of the estimate, we need to use that to find the estimator for *the standard deviation of the regression slope (b).* That equation is given by:

s_{b} = s_{ε} / \sqrt{Σ(x - x̄)^{2}}

Remember from last week’s blog post that the sum of all the squared (x - x̄) values was 143. Since we have the estimator for the standard error of the estimate, we divide $541.95 by the square root of 143 to get s_{b} = 45.32.

Next we need to compute the t-statistic. If the absolute value of Sue’s t-statistic is greater than her critical t-value, then she’ll know the parameter estimate of $479.02 is significant. In Sue’s regression, she has 12 observations, and thus 10 degrees of freedom: (n - k - 1) = (12 - 1 - 1) = 10. Assuming a 95% confidence level and a two-tailed test, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t value is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimate is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b coefficient ($479.02) by her s_{b} ($45.32). She ends up with a t-statistic of 10.57, which is well above 2.228 and therefore *is significant.*
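Here is a short Python sketch of the slope test, assuming the rounded inputs quoted in this post. The critical t of 2.228 is hard-coded from the text rather than looked up from a statistics library, and the variable names are my own.

```python
import math

s_eps = 541.95      # estimator of the standard error of the estimate
sum_sq_dev = 143.0  # sum of squared (x - x_bar) values
b = 479.02          # slope estimate
critical_t = 2.228  # two-tailed, 95% confidence, 10 degrees of freedom

# Estimator of the standard deviation of the slope.
s_b = s_eps / math.sqrt(sum_sq_dev)

# t-statistic for H0: beta = 0.
t_stat = b / s_b

# Two-tailed test: compare the absolute value against the critical t.
is_significant = abs(t_stat) > critical_t

print(f"s_b = {s_b:.2f}")
print(f"t   = {t_stat:.2f}")
print(f"Reject H0 (slope differs from zero)? {is_significant}")
```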

Next Sue must do the same for her intercept value, a. To do this, Sue must compute the *estimator of the standard deviation of the intercept (a).* The equation for this estimate is:

s_{a} = s_{ε} \sqrt{1/n + x̄^{2} / Σ(x - x̄)^{2}}

All she needs to do is plug in her numbers from earlier: her s_{ε} = $541.95; n=12; she just takes her average x-bar of 6.5 and squares it, bringing it to 42.25; and the denominator is the same 143. Working that all in, Sue gets a standard error of 333.545. She divides her intercept value of $9636.36 by 333.545 and gets a t-statistic of 28.891, which exceeds the 2.228 critical t, *so her intercept is also significant.*
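The same check for the intercept, again as a rough Python sketch built from the rounded numbers in this post (so the last decimal place can differ slightly from Sue’s figures):

```python
import math

s_eps = 541.95      # estimator of the standard error of the estimate
n = 12              # number of observations
x_bar = 6.5         # average of the month numbers 1..12
sum_sq_dev = 143.0  # sum of squared (x - x_bar) values
a = 9636.36         # intercept estimate
critical_t = 2.228  # two-tailed, 95% confidence, 10 degrees of freedom

# Estimator of the standard deviation of the intercept.
s_a = s_eps * math.sqrt(1 / n + x_bar ** 2 / sum_sq_dev)

# t-statistic for the null hypothesis that the true intercept is zero.
t_stat_a = a / s_a

print(f"s_a = {s_a:.3f}")
print(f"t   = {t_stat_a:.3f}")
print(f"Intercept significant? {abs(t_stat_a) > critical_t}")
```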

**Prediction Intervals in Forecasting**

Whew! Aren’t you glad those t-statistic calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it’s very important that you understand how they were derived and the theory behind them.

Now, we move back to forecasting. In last week’s post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X, and got a *point forecast* for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a *prediction interval*. For our purposes, a prediction interval is simply the point forecast plus or minus the critical t value (2.228) for a desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). So, Sue’s prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

$14,656.18 to $17,071.10

So, since Sue chose a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10.
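The interval above takes only a few lines of Python to reproduce. The first part of this sketch follows the calculation exactly as described in this post. The second part is an addition of mine, not something from the post: the standard textbook prediction interval for simple regression multiplies the margin by a factor that also accounts for uncertainty in the fitted line, which widens the range for months far from the sample average.

```python
import math

point_forecast = 15_863.64  # January 2010 point forecast (X = 13)
critical_t = 2.228          # two-tailed, 95% confidence, 10 degrees of freedom
s_eps = 541.95              # estimator of the standard error of the estimate

# Interval as computed in the post: forecast +/- t * s_eps.
margin = critical_t * s_eps
lower = point_forecast - margin
upper = point_forecast + margin
print(f"Post's interval: ${lower:,.2f} to ${upper:,.2f}")

# Not from the post: the fuller textbook formula scales the margin by
# sqrt(1 + 1/n + (x_new - x_bar)^2 / sum of squared deviations).
n = 12
x_bar = 6.5
sum_sq_dev = 143.0
x_new = 13
widening = math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sum_sq_dev)
wider_margin = margin * widening
print(f"Fuller interval: ${point_forecast - wider_margin:,.2f} "
      f"to ${point_forecast + wider_margin:,.2f}")
```

The simpler version is easier to compute by hand; the fuller one is closer to what statistical packages typically report for a new observation.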

**Recap and Plan for Next Week’s Post**

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You also learned how to compute the estimator of the standard error of the estimate, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics you need to determine whether those parameter estimates are indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess the model against those assumptions, and generate projections.


June 4, 2010 at 5:42 pm |

Alex, this is perhaps the most clear and succinct summary of regression analysis that I have read–thank you for posting it! The equations are a little difficult to read on my browser–not sure if there is anything you can do to make those more legible.

June 4, 2010 at 6:17 pm |

Gary, Glad you like it! I have the same problem with some of the equations, mostly because the symbols are generated in Word and then transferred to the blog editor. I’m still working on that.
