Posts Tagged ‘simple linear regression’

Forecast Friday Topic: Prelude to Multiple Regression Analysis – Regression Assumptions

June 10, 2010

(Eighth in a series)

In last week’s Forecast Friday post, we continued our discussion of simple linear regression analysis, discussing how to check both the slope and intercept coefficients for significance. We then discussed how to create a prediction interval for our forecasts. I had intended this week’s Forecast Friday post to delve straight into multiple regression analysis, but have decided instead to spend some time talking about the assumptions that go into building a regression model.  These assumptions apply to both simple and multiple regression analysis, but their importance is especially noticeable with multiple regression, and I feel it is best to make you aware of them, so that when we discuss multiple regression both as a time series and as a causal/econometric forecasting tool, you’ll know how to detect and correct regression models that violate these assumptions. We will formally begin our discussion of multiple regression methods next week.

Five Key Assumptions for Ordinary Least Squares (OLS) Regression

When we develop our parameter estimates for our regression model, we want to make sure that all of our estimators have the smallest variance. Recall that when you were computing the value of your estimate, b, for the parameter β, in the equation below:

You were subtracting your independent variable’s average from each of its actual values, and doing likewise for the dependent variable. You then multiplied those two quantities together (for each observation) and summed them up to get the numerator of that calculation. To get the denominator, you again subtracted the independent variable’s mean from each of its actual values and then squared them. Then you summed those up. The calculation of the denominator is the focal point here: the value you get for your estimate of β is the estimate that minimizes the squared error for your model. Hence, the term, least squares. If you were to take the denominator of the equation above and divide it by your sample size (less one: n-1), you would get the variance of your independent variable, X. This variance is something you also want to minimize, so that your estimate of β is efficient. When your parameter estimates are efficient, you can make stronger statistical statements about them.

We also want to be sure that our estimators are free of bias. That is, we want to be sure that our sample estimate, b, is on average, equal to our true population parameter, β. That is, if we calculated several estimates of β, the average of our b’s should equal β.

Essentially, there are five assumptions that must be made to ensure our estimators are unbiased and efficient:

Assumption #1: The regression equation correctly specifies the true model.

In order to correctly specify the true model, the relationship between the dependent and independent variable must be linear. Also, we must neither exclude relevant independent variables from nor include irrelevant independent variables in our regression equation. If any of these conditions are not met – that is, Assumption #1 is violated – then our parameter estimates will exhibit bias, particularly specification bias.

In addition, our independent and dependent variables must be measured accurately. For example, if we are trying to estimate salary based on years of schooling, we want to make sure our model is measuring years of schooling as actual years of schooling, and not desired years of schooling.

Assumption #2: The independent variables are fixed numbers and not correlated with error terms.

I warned you at the start of our discussion of linear regression that the error terms were going to be important. Let’s start with the notion of fixed numbers. When you are running a regression analysis, the values of each independent variable should not change every time you test of the equation. That is, the values of your independent variables are known and controlled by you. In addition, the independent variables should not be correlated with the error term. If an independent variable is correlated with the error term, then it is very possible a relevant independent variable was excluded from the equation. If Assumption #2 is violated, then your parameter estimates will be biased.

Assumption #3: The error terms ε, have a mean, or expected value, of zero.

As you noticed in the past blog post, when we developed our regression equation for Sue Stone’s monthly sales, we went back in and plugged each observation’s independent variable into our model and generated estimates of sales for that month. We then subtracted the estimated sales from the actual. Some of our estimates were higher than average, some were lower. Summing up all these errors, they should equal zero. If they don’t, they will result in a biased estimate of the intercept, a (which we use to estimate α). This assumption is not of serious concern, however, since the intercept is often of secondary importance to the slope estimate. We also assume that the error terms are normally distributed.

Assumption #4: The error terms have a constant variance.

The variance of the error term for all values of Xi should be constant, that is, the error terms should be homoscedastic. Visually, if you were to plot the line generated by your regression equation, and then plot the error terms for each observation as points above or below the regression line, the points should cluster around the line in a band of equal width above and below the regression line. If, instead, the points began to move further and further away from the regression line as the value of X increased, then the error terms are heteroscedastic, and the constant variance assumption is violated. Heteroscedasticity does not bias parameter estimates, but makes them inefficient, or untrustworthy.

Why does heteroscedasticity occur? Sometimes, a data set has some observations whose values for the independent variable are vastly different from those of the other observations. These cases are known as outliers. For example, if you have five observations, and their X values were as follows:

{ 5, 6, 6, 7, 20}

The fifth observation would be the outlier, since its X value of 20 is so different from that of the four previous observations. Regression equations place excessive weight on extreme values. Let’s assume that you were trying to construct a model to predict new car purchases based on income. You choose “household income” as your dependent variable and “new car spending” as the dependent variable. You survey 10 people who bought a new car, and you record both their income and the amount they paid for the car. You sort each respondent in order by income and look at their spending, as depicted in the table below:

Respondent

Annual Income

New Car Purchase Price

1

$30,000

$25,900

2

$32,500

$27,500

3

$35,000

$26,000

4

$37,500

$29,000

5

$40,000

$32,000

6

$42,500

$30,500

7

$45,000

$34,000

8

$47,500

$26,500

9

$50,000

$38,000

10

$52,500

$40,000

 

Do you notice the pattern that as income increases, the new car purchase price tends to move upward? For the most part, it does. But, does it go up consistently? No. Notice how respondent #3 spent less for a car than the two respondents with lower incomes; respondent #8 spent much less for a car than lower-income respondents 4-7. Respondent #8 is an outlier. This happens because lower-income households are limited in their options for new cars, while higher-income households have more options. A low-income respondent may be limited to buying a Ford Focus or a Honda Civic; but a higher-income respondent may be able to buy a Lexus or BMW, yet still choose to buy the Civic or the Focus. Heteroscedasticity is very likely to occur with this data set. In case you haven’t guessed, heteroscedasticity is more likely to occur with cross-sectional data, rather than with time series data.

Assumption #5: The error terms are not correlated with each other.

Knowing the error term for any of our observations should not allow us to predict the error term of any other observation; the errors must be truly random. If they aren’t, autocorrelation results and the parameter estimates are inefficient, though unbiased. Autocorrelation is much more common with time series data than with cross-sectional data, and occurs because past occurrences can influence future ones. A good example of this is when I was building a regression model to help a college forecast enrollment. I started by building a simple time series regression model, then examined the errors and detected autocorrelation. How did it happen? Because most students who are enrolled in the Fall term are also likely to be enrolled again in the consecutive Spring term. Hence, I needed to correct for that autocorrelation. Similarly, while a company’s advertising expenditures in April may impact its sales in April, they are also likely to have some impact on its sales in May. This too can cause autocorrelation.

When these assumptions are kept, your regression equation is likely to contain parameter estimates that are the “best, linear, unbiased estimators” or BLUE. Keep these in mind as we go through our upcoming discussions on multiple regression.

Next Forecast Friday Topic: Regression with Two or More Independent Variables

Next week, we will plunge into our discussion of multiple regression. I will give you an example of how multiple variables are used to forecast a single dependent variable, and how to check for validity. As we go through the next couple of discussions, I will show you how to analyze the error terms to find violations of the regression assumptions. I will also show you how to determine the validity of the model, and to identify whether all independent variables within your model are relevant.

Advertisements

Forecast Friday Topic: Simple Regression Analysis (Continued)

June 3, 2010

(Seventh in a series)

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm’s sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameter estimates, α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination – R2 – which indicates how much of the change in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then talk about a couple of other things so that we can move on to next week’s topic on multiple regression.

Revisiting Sue Stone

Last week, we helped Sue Stone develop a model using simple regression analysis, so that she could forecast sales. She had 12 months of sales data, which was her dependent variable, or Y, and each month (numbered from 1 to 12), was her independent variable, or X. Sue’s regression equation was as follows:

Where i is the period number corresponding to the month. So, in June 2009, i would be equal to 6; in January 2010, i would be equal to 13. Of course, since X is the month number, X=i in this example. Recall that Sue’s equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue’s R2=.917, which says 91.7% of the change in Sue’s monthly sales is explained by changes in the passing months.

Are these claims valid? We need to do some further work here.

Are the Parameter Estimates Statistically Significant?

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to the population. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the actual parameters for the population’s true average and standard deviation. These are subject to sampling error. Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate population parameters (the population’s true α and β), we are frequently concerned that they might actually have values of zero. Even though we have derived values a=$9636.36 and b=$479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H0: β = 0

HA: β≠ 0

Our first step here is to compute the standard error of the estimate, that is, how spread out each value of the dependent variable (sales) is from the average value of sales. Since we are sampling from a population, we are looking for the estimator for the standard error of the estimate. That equation is:

Where ESS is the error sum of squares – or $2,937,062.94 – from Sue’s equation; n is the sample size, or 12; k is the number of independent variables in the model, in this case, just 1. When we plug those numbers into the above equation, we’re dividing the ESS by 10 and then taking the square root, so Sue’s estimator is:

sε = $541.95

Now that we know the estimator for the standard error of the estimate, we need to use that to find the estimator for the standard deviation of the regression slope (b). That equation is given by:

Remember from last week’s blog post that the sum of all the (x-xbar) squared values was 143. Since we have the estimator for the standard error of the estimate, we divide $541.95 by the square root of 143 to get an Sb = 45.32. Next we need to compute the t-statistic. If Sue’s t-statistic is greater than her critical t-value, then she’ll know the parameter estimate of $479.02 is significant. In Sue’s regression, she has 12 observations, and thus 10 degrees of freedom: (n-k-1) = (12-1-1) = 10. Assuming a 95% confidence interval, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t value is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimates is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b1 coefficient ($479.02) by her sb ($45.32). She ends up with a t-statistic of 10.57, which is significant.

Next Sue must do the same for her intercept value, a. To do this, Sue, must compute the estimator of the standard deviation of the intercept (a). The equation for this estimate is:

All she needs to do is plug in her numbers from earlier: her sε = $541.95; n=12; she just takes her average x-bar of 6.5 and squares it, bringing it to 42.25; and the denominator is the same 143. Working that all in, Sue gets a standard error of 333.545. She divides her intercept value of $9636.36 by 333.545 and gets a t-statistic of 28.891, which exceeds the 2.228 critical t, so her intercept is also significant.

Prediction Intervals in Forecasting

Whew! Aren’t you glad those t-statistics calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it’s very important that you understand how they were derived and the theory behind them. Now, we move back to forecasting. In last week’s post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X, and got a point forecast for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a prediction interval. A prediction interval is simply the point forecast plus or minus the critical t value (2.228) for a desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). So, Sue’s prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

$14,656.18_____$17,071.10

So, since Sue had chosen a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10

Recap and Plan for Next Week’s Post

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You also learned how to compute the estimates of the standard error of the estimates, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics you need to determine whether those parameter estimates were indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess the model against those assumptions, and generate projections.

Forecast Friday Topic: Simple Regression Analysis

May 27, 2010

(Sixth in a series)

Today, we begin our discussion of regression analysis as a time series forecasting tool. This discussion will take the next few weeks, as there is much behind it. As always, I will make sure everything is simplified and easy for you to digest. Regression is a powerful tool that can be very helpful for mid- and long-range forecasting. Quite often, the business decisions we make require us to consider relationships between two or more variables. Rarely can we make changes to our promotion, pricing, and/or product development strategies without them having an impact of some kind on our sales. Just how big an impact would that be? How do we measure the relationship between two or more variables? And does a real relationship even exist between those variables? Regression analysis helps us find out.

One thing I must point out: Remember the “deviations” we discussed in the posts on moving average and exponential smoothing techniques: The difference between the forecasted and actual values for each observation, of which we took the absolute value? Good. In regression analysis, we refer to the deviations as the “error terms” or “residuals.” In regression analysis, the residuals – which we will square, rather than take the absolute value – become very important in gauging the regression model’s accuracy, validity, efficiency, and “goodness of fit.”

Simple Linear Regression Analysis

Sue Stone, owner of Stone & Associates, looked at her CPA practice’s monthly receipts from January to December 2009. The sales were as follows:

Month 

Sales 

January 

$10,000 

February 

$11,000 

March 

$10,500 

April 

$11,500 

May 

$12,500 

June 

$12,000 

July 

$14,000 

August 

$13,000 

September 

$13,500 

October 

$15,000 

November

$14,500 

December 

$15,500 

Sue is trying to predict what sales will be for each month in the first quarter of 2010, but is unsure of how to go about it. Moving average and exponential smoothing techniques rarely go more than one period ahead. So, what is Sue to do?

When we are presented with a set of numbers, one of the ways we try to make sense of it is by taking its average. Perhaps Sue can average all 12 months’ sales – $12,750 – and use that her forecast for each of next three months. But how accurately would that measure each month of 2009? How spread out are each month’s sales from the average? Sue subtracts the average from each month’s sales and examines the difference:

Month 

Sales 

Sales Less Average Sales 
January 

$10,000 

-$2,750 

February

$11,000 

-$1,750 

March 

$10,500 

-$2,250 

April 

$11,500 

-$1,250 

May 

$12,500 

-$250 

June 

$12,000 

-$750 

July 

$14,000 

$1,250 

August 

$13,000 

$250 

September 

$13,500 

$750 

October 

$15,000 

$2,250 

November 

$14,500 

$1,750 

December 

$15,500 

$2,750 

 

Sue notices that the error between actual and average is quite high in both the first four months of 2009 and in the last three months of 2009. She wants to understand the overall error in using the average as a forecast of sales. However, when she sums up all the errors from month to month, Sue finds they sum to zero. That tells her nothing. So she squares each month’s error value and sums them:

Month 

Sales 

Error 

Error Squared 

January 

$10,000 

-$2,750 

$7,562,500 

February 

$11,000 

-$1,750 

$3,062,500 

March 

$10,500

-$2,250 

$5,062,500 

April 

$11,500 

-$1,250 

$1,562,500 

May 

$12,500 

-$250 

$62,500 

June 

$12,000 

-$750 

$562,500 

July 

$14,000 

$1,250 

$1,562,500 

August 

$13,000 

$250 

$62,500 

September 

$13,500 

$750 

$562,500 

October 

$15,000 

$2,250 

$5,062,500 

November 

$14,500

$1,750 

$3,062,500 

December 

$15,500 

$2,750 

$7,562,500 

   

Total Error: 

$35,750,000 

    

In totaling these squared errors, Sue derives the total sum of squares, or TSS error: 35,750,000. Is there any way she can improve upon that? Sue thinks for a while. She doesn’t know too much more about her 2009 sales except for the month in which they were generated. She plots the sales on a chart:

Sue notices that sales by month appear to be on an upward trend. Sue thinks for a moment. “All I know is the sales and the month,” she says to herself, “How can I develop a model to forecast accurately?” Sue reads about a statistical procedure called regression analysis and, seeing that each month’s sales is in sequential order, she wonders whether the mere passage of time simply causes sales to go higher. Sue numbers each month, with January assigned a 1 and December, a 12.

She also realizes that she is trying to predict sales with each passing month. Hence, she hypothesizes that the change in sales depends on the change in the month. Hence, sales is Sue’s dependent variable. Because the month number is used to estimate change in sales, it is her independent variable. In regression analysis, the relationship between an independent and a dependent value is expressed:

Y = α + βX + ε

    Where: Y is the value of the dependent variable

    X is the value of the independent variable

    α is a population parameter, called the intercept, which would be the value of Y when X=0

    β is also a population parameter – the slope of the regression line – representing the change in Y associated with each one-unit change in X.

    ε is the error term.

Sue further reads that the goal of regression analysis is to minimize the error sum of squares, which is why it is referred to as ordinary least squares (OLS) regression. She also notices that she is building her regression on a sample, so there is a sample regression equation used to estimate what the true regression is for the population:

Essentially, the equation is the same as the one above, however the terms indicate the sample. The Y-term (called “Y hat”) is the sample forecasted value of the dependent variable (sales) at period i; a is the sample estimate of α; b is the sample estimate of β; Xi is the value of the independent variable at period i; and ei is the error, or difference between Y hat (the forecasted value) and actual Y for period i. Sue needs to find the values for a and b – the estimates of the population parameters – that minimize the error sum of squares.

Sue reads that the equations for estimating a and b are derived from calculus, but expressed algebraically as:

Sue learns that the X and Y terms with lines above them, known as “X bar” and “Y bar,” respectively are the averages of all the X and Y values, respectively. She also reads that the Σ notation – the Greek letter sigma – represents a sum. Hence, Sue realizes a few things:

  1. She must estimate b before she can estimate a;
  2. To estimate b,she must take care of the numerator:
    1. first subtract each observation’s month number from the average month’s number (X minus X-bar),
    2. subtract each observation’s sales from the average sales (Y minus Y-bar),
    3. multiply those two together, and
    4. Add up (2c) for all observations.
  3. To get the denominator for calculating b, she must:
    1. Again subtract X-bar from X, but then square the difference, for each observation.
    2. Sum them up
  4. Calculating b is easy: She needs only to divide the result from (2) by the result from (3).
  5. Calculating a is also easy: She multiplies her b value by the average month (X-bar), and subtracts it from average sales (Y-bar).

Sue now goes to work to compute her regression equation. She goes into Excel and enters her monthly sales data in a table, and computes the averages for sales and month number:

 

Month (X) 

Sales (Y) 

 

1 

$10,000 

 

2 

$11,000 

 

3 

$10,500 

 

4 

$11,500 

 

5 

$12,500 

 

6 

$12,000 

 

7 

$14,000 

 

8 

$13,000 

 

9 

$13,500 

 

10 

$15,000 

 

11 

$14,500 

 

12 

$15,500 

Average 

6.5 

$12,750 

 

Sue goes ahead and subtracts the X and Y values from their respective averages, and computes the components she needs (the “Product” is the result of multiplying the values in the first two columns together):

X minus X-bar 

Y minus Y-bar 

Product 

(X minus X-bar) Squared 

-5.5 

-$2,750 

$15,125 

30.25 

-4.5 

-$1,750 

$7,875 

20.25 

-3.5 

-$2,250 

$7,875 

12.25 

-2.5 

-$1,250 

$3,125 

6.25 

-1.5 

-$250 

$375 

2.25 

-0.5 

-$750 

$375 

0.25 

0.5 

$1,250 

$625 

0.25 

1.5 

$250 

$375 

2.25 

2.5 

$750 

$1,875 

6.25 

3.5 

$2,250 

$7,875 

12.25 

4.5 

$1,750 

$7,875

20.25 

5.5 

$2,750 

$15,125 

30.25 

Total 

$68,500 

143 

 

Sue computes b:

b = $68,500/143

= $479.02

Now that Sue knows b, she calculates a:

a = $12,750 – $479.02(6.5)

= $12,750 – $3,113.64

= $9,636.36

Hence, assuming errors are zero, Sue’s least-squares regression equation is:

Y(hat) =$9,636.36 + $479.02X

Or, in business terminology:

Forecasted Sales = $9,636.36 + $479.02 * Month number.

This means that each passing month is associated with an average increase in sales of $479.02 for Sue’s CPA firm. How accurately does this regression model predict sales? Sue estimates the error by plugging each month’s number into the equation and then comparing her forecast for that month with the actual sales:

Month (X) 

Sales (Y) 

Forecasted Sales 

Error 

1 

$10,000 

$10,115.38

-$115.38 

2 

$11,000 

$10,594.41 

$405.59 

3 

$10,500 

$11,073.43 

-$573.43 

4 

$11,500 

$11,552.45 

-$52.45 

5 

$12,500 

$12,031.47 

$468.53 

6 

$12,000 

$12,510.49 

-$510.49 

7 

$14,000 

$12,989.51 

$1,010.49 

8 

$13,000 

$13,468.53 

-$468.53 

9 

$13,500 

$13,947.55 

-$447.55

10 

$15,000 

$14,426.57 

$573.43 

11 

$14,500 

$14,905.59 

-$405.59 

12 

$15,500 

$15,384.62 

$115.38 

 

Sue’s actual and forecasted sales appear to be pretty close, except for her July estimate, which is off by a little over $1,000. But does her model predict better than if she simply used average sales as her forecast for each month? To do that, she must compute the error sum of squares, ESS, error. Sue must square the error terms for each observation and sum them up to obtain ESS:

ESS = Σe2

Error 

Squared Error 

-$115.38 

$13,313.61 

$405.59 

$164,506.82 

-$573.43 

$328,818.04 

-$52.45 

$2,750.75 

$468.53 

$219,521.74 

-$510.49 

$260,599.54 

$1,010.49 

$1,021,089.05 

-$468.53 

$219,521.74 

-$447.55 

$200,303.19 

$573.43 

$328,818.04 

-$405.59 

$164,506.82 

$115.38 

$13,313.61 

ESS=

$2,937,062.94 

 

Notice Sue’s error sum of squares. This is the error, or unexplained, sum of squared deviations between the forecasted and actual sales. The difference between the total sum of squares (TSS) and the Error Sum of Squares (ESS) is the regression sum of squares, RSS, and that is the sum of squared deviations that are explained by the regression. RSS is also calculated as each forecasted value of sales less the average of sales:

Forecasted Sales 

Average Sales

Regression Error 

Reg. Error Squared 

$10,115.38 

$12,750 

-$2,634.62 

$6,941,198.22 

$10,594.41 

$12,750 

-$2,155.59 

$4,646,587.24 

$11,073.43 

$12,750 

-$1,676.57 

$2,810,898.45 

$11,552.45 

$12,750 

-$1,197.55 

$1,434,131.86 

$12,031.47 

$12,750 

-$718.53

$516,287.47 

$12,510.49 

$12,750 

-$239.51 

$57,365.27 

$12,989.51 

$12,750 

$239.51 

$57,365.27 

$13,468.53 

$12,750 

$718.53 

$516,287.47 

$13,947.55 

$12,750 

$1,197.55 

$1,434,131.86 

$14,426.57 

$12,750 

$1,676.57 

$2,810,898.45 

$14,905.59 

$12,750 

$2,155.59 

$4,646,587.24

$15,384.62 

$12,750 

$2,634.62 

$6,941,198.22 

   

RSS= 

$32,812,937.06 

 

Sue immediately adds the RSS and the ESS and sees they match the TSS: $35,750,000. She also knows that nearly 33 million of that TSS is explained by her regression model, so she divides her RSS by the TSS:

32,812,937.06 / 35,750,000

=.917 or 91.7%

This quotient, known as the coefficient of determination, and denoted as R2, tells Sue that each passing month explains 91.7% of the change in monthly sales that she experiences. What R2 means is that Sue improved her forecast accuracy by 91.7% by using this simple model instead of the simple average. As you will find out in subsequent blog posts, maximizing R2 isn’t the “be all and end all”. In fact, there is still much to do with this model, which will be discussed in next week’s Forecast Friday post. But for now, Sue’s model seems to have reduced a great deal of error.

It is important to note that while each month does seem to be related to sales, the passing months do not cause the increase in sales. Correlation does not mean causation. There could be something behind the scenes (e.g., Sue’s advertising, or the types of projects she works on, etc.) that is driving the upward trend in her sales.

Using the Regression Equation to Forecast Sales

Now Sue can use the same model to forecast sales for January 2010 and February 2010, etc. She has her equation, so since January 2010 is period 13, she plugs in 13 for X, and gets a forecast of $15,863.64; for February (period 14), she gets $16,342.66.

Recap and Plan for Next Week

You have now learned the basics of simple regression analysis. You have learned how to estimate the parameters for the regression equation, how to measure the improvement in accuracy from the regression model, and how to generate forecasts. Next week, we will be checking the validity of Sue’s equation, and discussing the important assumptions underlying regression analysis. Until then, you have a basic overview of what regression analysis is.