## Posts Tagged ‘error term’

### Forecast Friday Topic: Detecting Autocorrelation

July 29, 2010

(Fifteenth in a series)

We have spent the last few Forecast Friday posts discussing violations of different assumptions in regression analysis. So far, we have discussed the effects of specification bias and multicollinearity on parameter estimates, and their corresponding effect on your forecasts. Today, we will discuss another violation, autocorrelation, which occurs when sequential residual (error) terms are correlated with one another.

When working with time series data, autocorrelation is the most common problem forecasters face. When the assumption of uncorrelated residuals is violated, we end up with models that have inefficient parameter estimates and upwardly-biased t-ratios and R² values. These inflated values make our forecasting model appear better than it really is, and can cause our model to miss turning points. Hence, if your model is predicting an increase in sales and you, in actuality, see sales plunge, it may be due to autocorrelation.

What Does Autocorrelation Look Like?

Autocorrelation can take on two types: positive or negative. In positive autocorrelation, consecutive errors usually have the same sign: positive residuals are almost always followed by positive residuals, while negative residuals are almost always followed by negative residuals. In negative autocorrelation, consecutive errors typically have opposite signs: positive residuals are almost always followed by negative residuals and vice versa.

In addition, there are different orders of autocorrelation. The simplest, most common kind of autocorrelation, first-order autocorrelation, occurs when the consecutive errors are correlated. Second-order autocorrelation occurs when error terms two periods apart are correlated, and so forth. Here, we will concentrate solely on first-order autocorrelation.

You will see a visual depiction of positive autocorrelation later in this post.

What Causes Autocorrelation?

The two main culprits for autocorrelation are sluggishness in the business cycle (also known as inertia) and omitted variables from the model. At various turning points in a time series, inertia is very common. At the time when a time series turns upward (downward), its observations build (lose) momentum, and continue going up (down) until the series reaches its peak (trough). As a result, successive observations and the error terms associated with them depend on each other.

Another example of inertia happens when forecasting a time series where the same observations can be in multiple successive periods. For example, I once developed a model to forecast enrollment for a community college, and found autocorrelation to be present in my initial model. This happened because many of the students enrolled during the spring term were also enrolled in the previous fall term. As a result, I needed to correct for that.

The other main cause of autocorrelation is omitted variables from the model. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable has a positive correlation with the dependent variable, it is likely to cause error terms that are positively correlated.

How Do We Detect Autocorrelation?

To illustrate how we go about detecting autocorrelation, let’s start with a data set. I have pulled the average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987. The original source was the Survey of Current Business, September issues from 1986 and 1987, but this data set was reprinted in Data Analysis Using Microsoft® Excel, by Michael R. Middleton, page 219:

| Month | t | Wage |
|--------|----|------|
| Jan-86 | 1 | 5.82 |
| Feb-86 | 2 | 5.79 |
| Mar-86 | 3 | 5.80 |
| Apr-86 | 4 | 5.81 |
| May-86 | 5 | 5.78 |
| Jun-86 | 6 | 5.79 |
| Jul-86 | 7 | 5.79 |
| Aug-86 | 8 | 5.83 |
| Sep-86 | 9 | 5.91 |
| Oct-86 | 10 | 5.87 |
| Nov-86 | 11 | 5.87 |
| Dec-86 | 12 | 5.90 |
| Jan-87 | 13 | 5.94 |
| Feb-87 | 14 | 5.93 |
| Mar-87 | 15 | 5.93 |
| Apr-87 | 16 | 5.94 |
| May-87 | 17 | 5.89 |
| Jun-87 | 18 | 5.91 |

Now, let’s run a simple regression model, using time period t as the independent variable and Wage as the dependent variable. Using the data set above, we derive the following model:

Ŷ = 5.7709 + 0.0095t

Examine the Model Output

Notice also the following model diagnostic statistics:

R² = 0.728

| Variable | Coefficient | t-ratio |
|-----------|-------------|---------|
| Intercept | 5.7709 | 367.62 |
| t | 0.0095 | 6.55 |

You can see that the R² is high, with changes in t explaining nearly three-quarters of the variation in average hourly wage. Note also the t-ratios for both the intercept and the parameter estimate for t. Both are very high. Recall that a high R² and high t-ratios are symptoms of autocorrelation.

Visually Inspect Residuals

Just because a model has a high R² and parameters with high t-ratios doesn’t mean autocorrelation is present. More work must be done to detect autocorrelation. Another way to check for it is to visually inspect the residuals, best done by plotting the average hourly wage predicted by the model against the actual average hourly wage, as Middleton has done:

Notice the green line representing the Predicted Wage. It is a straight, upward line. This is to be expected, since the independent variable is sequential and shows an increasing trend. The red line depicts the actual wage in the time series. Notice that the model’s forecast is higher than actual for months 5 through 8, and for months 17 and 18. The model also underpredicts for months 12 through 16. This clearly illustrates the presence of positive, first-order autocorrelation.

The Durbin-Watson Statistic

Examining the model components and visually inspecting the residuals are intuitive, but not definitive ways to diagnose autocorrelation. To really be sure if autocorrelation exists, we must compute the Durbin-Watson statistic, often denoted as d.

In our June 24 Forecast Friday post, we demonstrated how to calculate the Durbin-Watson statistic. The actual formula is:

d = Σ (eₜ − eₜ₋₁)² / Σ eₜ²

where the numerator is summed from t = 2 to n and the denominator from t = 1 to n.

That is, beginning with the error term for the second observation, we subtract the immediate previous error term from it; then we square the difference. We do this for each observation from the second one onward. Then we sum all of those squared differences together. Next, we square the error terms for each observation, and sum those together. Then we divide the sum of squared differences by the sum of squared error terms, to get our Durbin-Watson statistic.

For our example, we have the following:

| t | Error | Squared Error | eₜ − eₜ₋₁ | Squared Difference |
|----|----------|--------|----------|--------|
| 1 | 0.0396 | 0.0016 | | |
| 2 | 0.0001 | 0.0000 | (0.0395) | 0.0016 |
| 3 | 0.0006 | 0.0000 | 0.0005 | 0.0000 |
| 4 | 0.0011 | 0.0000 | 0.0005 | 0.0000 |
| 5 | (0.0384) | 0.0015 | (0.0395) | 0.0016 |
| 6 | (0.0379) | 0.0014 | 0.0005 | 0.0000 |
| 7 | (0.0474) | 0.0022 | (0.0095) | 0.0001 |
| 8 | (0.0169) | 0.0003 | 0.0305 | 0.0009 |
| 9 | 0.0536 | 0.0029 | 0.0705 | 0.0050 |
| 10 | 0.0041 | 0.0000 | (0.0495) | 0.0024 |
| 11 | (0.0054) | 0.0000 | (0.0095) | 0.0001 |
| 12 | 0.0152 | 0.0002 | 0.0205 | 0.0004 |
| 13 | 0.0457 | 0.0021 | 0.0305 | 0.0009 |
| 14 | 0.0262 | 0.0007 | (0.0195) | 0.0004 |
| 15 | 0.0167 | 0.0003 | (0.0095) | 0.0001 |
| 16 | 0.0172 | 0.0003 | 0.0005 | 0.0000 |
| 17 | (0.0423) | 0.0018 | (0.0595) | 0.0035 |
| 18 | (0.0318) | 0.0010 | 0.0105 | 0.0001 |
| **Sum:** | | **0.0163** | | **0.0171** |

To obtain our Durbin-Watson statistic, we plug our sums into the formula:

d = 0.0171 / 0.0163 ≈ 1.050
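If you’d rather not build the worksheet by hand, the whole calculation can be sketched in a few lines of Python (numpy is assumed available; the wage figures are those from the table earlier in this post):

```python
import numpy as np

# Average hourly wages, Jan 1986 - Jun 1987 (from the table above)
wages = np.array([5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
                  5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91])
t = np.arange(1, len(wages) + 1)

# Fit the simple regression Wage = a + b*t by ordinary least squares
b, a = np.polyfit(t, wages, 1)          # slope, then intercept
residuals = wages - (a + b * t)

# Durbin-Watson: sum of squared consecutive differences
# over the sum of squared errors
d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(round(a, 4), round(b, 4), round(d, 2))  # approx 5.7709, 0.0095, 1.05
```

`np.polyfit` returns the slope and intercept of the least-squares line, and `np.diff` produces the consecutive differences eₜ − eₜ₋₁; the resulting d of about 1.05 matches the worksheet.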

What Does the Durbin-Watson Statistic Tell Us?

Our Durbin-Watson statistic is 1.050. What does that mean? The Durbin-Watson statistic is interpreted as follows:

• If d is close to zero (0), then positive autocorrelation is probably present;
• If d is close to two (2), then the model is likely free of autocorrelation; and
• If d is close to four (4), then negative autocorrelation is probably present.

As we saw from our visual examination of the residuals, we appear to have positive autocorrelation, and the fact that our Durbin-Watson statistic is about halfway between zero and two suggests the presence of positive autocorrelation.

Next Forecast Friday Topic: Correcting Autocorrelation

Today we went through the process of understanding the causes and effect of autocorrelation, and how to suspect and detect its presence. Next week, we will discuss how to correct for autocorrelation and eliminate it so that we can have more efficient parameter estimates.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R² = 0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contributions” variable. Our F-statistic is a little over 72. Essentially, the R² and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.
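As a sketch of the combination step (the data frame and column names here are made up for illustration; the actual donor file appeared in an earlier post):

```python
import pandas as pd

# Hypothetical donor history; the real data set is in an earlier post
donors = pd.DataFrame({
    "avg_contribution_12m": [25.0, 40.0, 10.0, 60.0],
    "times_donated_12m":    [4, 2, 6, 1],
    "months_since_last":    [1, 3, 2, 8],
})

# Replace the two collinear variables with their product
donors["total_contributions_12m"] = (
    donors["avg_contribution_12m"] * donors["times_donated_12m"]
)
print(donors["total_contributions_12m"].tolist())  # [100.0, 80.0, 60.0, 60.0]
```

The regression would then use `total_contributions_12m` and `months_since_last` in place of the original three columns.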

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. To do this, you compute the mean of the hours-of-sleep variable and the mean of the hours-of-studying variable. For each observation, you subtract each independent variable’s mean from its respective value for that observation, then multiply the two differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. By subtracting the means from the variables before multiplying, you are in effect centering the interaction term, which takes the central tendency of your data into account and keeps the interaction term from being too highly correlated with its component variables.
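The centering procedure above can be sketched in a few lines of numpy (the sleep and study figures are invented for illustration):

```python
import numpy as np

# Hypothetical observations; the post's example gives no actual numbers
hours_sleep = np.array([6.0, 7.0, 8.0, 5.0, 9.0])
hours_study = np.array([2.0, 4.0, 6.0, 3.0, 5.0])

# Center each variable at its mean, then multiply the centered values
sleep_c = hours_sleep - hours_sleep.mean()
study_c = hours_study - hours_study.mean()
sleep_study_interaction = sleep_c * study_c

print(sleep_study_interaction.tolist())  # [2.0, 0.0, 2.0, 2.0, 2.0]
```

The regression would then include `hours_sleep`, `hours_study`, and `sleep_study_interaction` as the three independent variables.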

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression on differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediate prior observation’s value from the current observation’s value. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing reduces multicollinearity by removing the trend component of the time series: if all the independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. Subtracting the prior period’s value is taking a “first difference”; differencing the already-differenced series gives a “second difference,” and so on. Note also that each round of differencing costs us an observation at the start of the data, so if you have a small data set, differencing reduces your degrees of freedom and increases your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
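A minimal sketch of first and second differencing, using numpy and an invented trending series:

```python
import numpy as np

# A short upward-trending series (invented values)
y = np.array([100.0, 104.0, 109.0, 115.0, 122.0])

# First difference: subtract each prior observation (one observation is lost)
first_diff = np.diff(y)
print(first_diff.tolist())   # [4.0, 5.0, 6.0, 7.0]

# Second difference: difference the first differences (another observation lost)
second_diff = np.diff(y, n=2)
print(second_diff.tolist())  # [1.0, 1.0, 1.0]
```

Notice how the steady trend in `y` becomes a nearly flat differenced series, which is exactly why differencing breaks up trend-driven correlation between variables.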

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.
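Each of these transformations is a one-liner in most statistical environments; a quick numpy sketch with invented values:

```python
import numpy as np

# An invented, sharply nonlinear independent variable
x = np.array([1.0, 10.0, 100.0, 1000.0])

log_x = np.log(x)       # natural-log transformation, for logarithmic relationships
sqrt_x = np.sqrt(x)     # square-root transformation
recip_x = -1.0 / x      # negative-reciprocal transformation

# Per-capita version of an aggregate figure (invented city numbers)
violent_crimes = np.array([1200.0, 450.0])
population = np.array([300000.0, 90000.0])
crimes_per_capita = violent_crimes / population
```

Whichever transformation you choose, remember that new data fed into the fitted model must be transformed the same way.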

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a few different models, some using variations of the remedies we’ve discussed here, and to compare their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more accurate forecasting model with a higher degree of multicollinearity.

The Takeaways:

1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
3. You should suspect multicollinearity if:
   1. You have a high R² but low t-statistics;
   2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and against the resulting impact on your forecast accuracy.

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.

*************************


### Forecast Friday Topic: Simple Regression Analysis

May 27, 2010

(Sixth in a series)

Today, we begin our discussion of regression analysis as a time series forecasting tool. This discussion will take the next few weeks, as there is much behind it. As always, I will make sure everything is simplified and easy for you to digest. Regression is a powerful tool that can be very helpful for mid- and long-range forecasting. Quite often, the business decisions we make require us to consider relationships between two or more variables. Rarely can we make changes to our promotion, pricing, and/or product development strategies without them having an impact of some kind on our sales. Just how big an impact would that be? How do we measure the relationship between two or more variables? And does a real relationship even exist between those variables? Regression analysis helps us find out.

One thing I must point out: Remember the “deviations” we discussed in the posts on moving average and exponential smoothing techniques: The difference between the forecasted and actual values for each observation, of which we took the absolute value? Good. In regression analysis, we refer to the deviations as the “error terms” or “residuals.” In regression analysis, the residuals – which we will square, rather than take the absolute value – become very important in gauging the regression model’s accuracy, validity, efficiency, and “goodness of fit.”

Simple Linear Regression Analysis

Sue Stone, owner of Stone & Associates, looked at her CPA practice’s monthly receipts from January to December 2009. The sales were as follows:

| Month | Sales |
|-----------|----------|
| January | \$10,000 |
| February | \$11,000 |
| March | \$10,500 |
| April | \$11,500 |
| May | \$12,500 |
| June | \$12,000 |
| July | \$14,000 |
| August | \$13,000 |
| September | \$13,500 |
| October | \$15,000 |
| November | \$14,500 |
| December | \$15,500 |

Sue is trying to predict what sales will be for each month in the first quarter of 2010, but is unsure of how to go about it. Moving average and exponential smoothing techniques rarely go more than one period ahead. So, what is Sue to do?

When we are presented with a set of numbers, one of the ways we try to make sense of them is by taking their average. Perhaps Sue can average all 12 months’ sales – \$12,750 – and use that as her forecast for each of the next three months. But how accurately would that measure each month of 2009? How spread out are each month’s sales from the average? Sue subtracts the average from each month’s sales and examines the differences:

| Month | Sales | Sales Less Average Sales |
|-----------|----------|----------|
| January | \$10,000 | -\$2,750 |
| February | \$11,000 | -\$1,750 |
| March | \$10,500 | -\$2,250 |
| April | \$11,500 | -\$1,250 |
| May | \$12,500 | -\$250 |
| June | \$12,000 | -\$750 |
| July | \$14,000 | \$1,250 |
| August | \$13,000 | \$250 |
| September | \$13,500 | \$750 |
| October | \$15,000 | \$2,250 |
| November | \$14,500 | \$1,750 |
| December | \$15,500 | \$2,750 |

Sue notices that the error between actual and average is quite high in both the first four months of 2009 and in the last three months of 2009. She wants to understand the overall error in using the average as a forecast of sales. However, when she sums up all the errors from month to month, Sue finds they sum to zero. That tells her nothing. So she squares each month’s error value and sums them:

| Month | Sales | Error | Error Squared |
|-----------|----------|----------|-------------|
| January | \$10,000 | -\$2,750 | \$7,562,500 |
| February | \$11,000 | -\$1,750 | \$3,062,500 |
| March | \$10,500 | -\$2,250 | \$5,062,500 |
| April | \$11,500 | -\$1,250 | \$1,562,500 |
| May | \$12,500 | -\$250 | \$62,500 |
| June | \$12,000 | -\$750 | \$562,500 |
| July | \$14,000 | \$1,250 | \$1,562,500 |
| August | \$13,000 | \$250 | \$62,500 |
| September | \$13,500 | \$750 | \$562,500 |
| October | \$15,000 | \$2,250 | \$5,062,500 |
| November | \$14,500 | \$1,750 | \$3,062,500 |
| December | \$15,500 | \$2,750 | \$7,562,500 |
| **Total Error:** | | | **\$35,750,000** |

In totaling these squared errors, Sue derives the total sum of squares (TSS): 35,750,000. Is there any way she can improve upon that? Sue thinks for a while. She doesn’t know too much more about her 2009 sales except for the month in which they were generated. She plots the sales on a chart:

Sue notices that sales by month appear to be on an upward trend. Sue thinks for a moment. “All I know is the sales and the month,” she says to herself, “How can I develop a model to forecast accurately?” Sue reads about a statistical procedure called regression analysis and, seeing that each month’s sales is in sequential order, she wonders whether the mere passage of time simply causes sales to go higher. Sue numbers each month, with January assigned a 1 and December, a 12.

She also realizes that she is trying to predict sales with each passing month. Hence, she hypothesizes that the change in sales depends on the change in the month: sales is Sue’s dependent variable and, because the month number is used to estimate the change in sales, it is her independent variable. In regression analysis, the relationship between an independent and a dependent variable is expressed as:

Y = α + βX + ε

Where: Y is the value of the dependent variable

X is the value of the independent variable

α is a population parameter, called the intercept, which would be the value of Y when X=0

β is also a population parameter – the slope of the regression line – representing the change in Y associated with each one-unit change in X.

ε is the error term.

Sue further reads that the goal of regression analysis is to minimize the error sum of squares, which is why it is referred to as ordinary least squares (OLS) regression. She also notices that she is building her regression on a sample, so there is a sample regression equation used to estimate the true regression for the population:

Yᵢ = a + bXᵢ + eᵢ

Essentially, the equation is the same as the one above, but the terms refer to the sample. The Ŷ term (called “Y hat”) is the sample forecasted value of the dependent variable (sales) at period i; a is the sample estimate of α; b is the sample estimate of β; Xᵢ is the value of the independent variable at period i; and eᵢ is the error, the difference between Ŷ (the forecasted value) and actual Y for period i. Sue needs to find the values for a and b – the estimates of the population parameters – that minimize the error sum of squares.

Sue reads that the equations for estimating a and b are derived from calculus, but expressed algebraically as:

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

a = Ȳ − bX̄

Sue learns that the X and Y terms with lines above them, known as “X bar” and “Y bar,” respectively are the averages of all the X and Y values, respectively. She also reads that the Σ notation – the Greek letter sigma – represents a sum. Hence, Sue realizes a few things:

1. She must estimate b before she can estimate a;
2. To estimate b, she must take care of the numerator:
   1. first subtract each observation’s month number from the average month number (X minus X-bar),
   2. then subtract each observation’s sales from average sales (Y minus Y-bar),
   3. multiply those two differences together, and
   4. add up those products across all observations.
3. To get the denominator for calculating b, she must:
   1. again subtract X-bar from X, but this time square the difference, for each observation, and
   2. sum those squared differences up.
4. Calculating b is easy: she needs only to divide the result from (2) by the result from (3).
5. Calculating a is also easy: she multiplies her b value by the average month (X-bar), and subtracts the product from average sales (Y-bar).

Sue now goes to work to compute her regression equation. She goes into Excel and enters her monthly sales data in a table, and computes the averages for sales and month number:

| Month (X) | Sales (Y) |
|-----------|-----------|
| 1 | \$10,000 |
| 2 | \$11,000 |
| 3 | \$10,500 |
| 4 | \$11,500 |
| 5 | \$12,500 |
| 6 | \$12,000 |
| 7 | \$14,000 |
| 8 | \$13,000 |
| 9 | \$13,500 |
| 10 | \$15,000 |
| 11 | \$14,500 |
| 12 | \$15,500 |
| **Average: 6.5** | **\$12,750** |

Sue goes ahead and subtracts the X and Y values from their respective averages, and computes the components she needs (the “Product” is the result of multiplying the values in the first two columns together):

| X minus X-bar | Y minus Y-bar | Product | (X minus X-bar) Squared |
|-------|----------|----------|-------|
| -5.5 | -\$2,750 | \$15,125 | 30.25 |
| -4.5 | -\$1,750 | \$7,875 | 20.25 |
| -3.5 | -\$2,250 | \$7,875 | 12.25 |
| -2.5 | -\$1,250 | \$3,125 | 6.25 |
| -1.5 | -\$250 | \$375 | 2.25 |
| -0.5 | -\$750 | \$375 | 0.25 |
| 0.5 | \$1,250 | \$625 | 0.25 |
| 1.5 | \$250 | \$375 | 2.25 |
| 2.5 | \$750 | \$1,875 | 6.25 |
| 3.5 | \$2,250 | \$7,875 | 12.25 |
| 4.5 | \$1,750 | \$7,875 | 20.25 |
| 5.5 | \$2,750 | \$15,125 | 30.25 |
| **Total** | | **\$68,500** | **143** |

Sue computes b:

b = \$68,500/143

= \$479.02

Now that Sue knows b, she calculates a:

a = \$12,750 – \$479.02(6.5)

= \$12,750 – \$3,113.64

= \$9,636.36
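Sue’s worksheet arithmetic can be double-checked with a short Python sketch (numpy for the sums; the sales figures are from the table above):

```python
import numpy as np

# Sue's 2009 monthly sales, January (1) through December (12)
months = np.arange(1, 13)
sales = np.array([10000, 11000, 10500, 11500, 12500, 12000,
                  14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

x_bar, y_bar = months.mean(), sales.mean()   # 6.5 and 12,750

# Numerator and denominator for b, exactly as in steps (2) and (3)
numerator = np.sum((months - x_bar) * (sales - y_bar))   # 68,500
denominator = np.sum((months - x_bar) ** 2)              # 143

b = numerator / denominator   # step (4): about 479.02
a = y_bar - b * x_bar         # step (5): about 9,636.36

print(round(b, 2), round(a, 2))
```

The same slope and intercept drop out of any OLS routine (for example `np.polyfit(months, sales, 1)`); the explicit sums are shown here to mirror Sue’s steps.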

Hence, assuming errors are zero, Sue’s least-squares regression equation is:

Y(hat) =\$9,636.36 + \$479.02X

Or, in business terminology:

Forecasted Sales = \$9,636.36 + \$479.02 * Month number.

This means that each passing month is associated with an average increase in sales of \$479.02 for Sue’s CPA firm. How accurately does this regression model predict sales? Sue estimates the error by plugging each month’s number into the equation and then comparing her forecast for that month with the actual sales:

| Month (X) | Sales (Y) | Forecasted Sales | Error |
|-----------|-----------|-------------|------------|
| 1 | \$10,000 | \$10,115.38 | -\$115.38 |
| 2 | \$11,000 | \$10,594.41 | \$405.59 |
| 3 | \$10,500 | \$11,073.43 | -\$573.43 |
| 4 | \$11,500 | \$11,552.45 | -\$52.45 |
| 5 | \$12,500 | \$12,031.47 | \$468.53 |
| 6 | \$12,000 | \$12,510.49 | -\$510.49 |
| 7 | \$14,000 | \$12,989.51 | \$1,010.49 |
| 8 | \$13,000 | \$13,468.53 | -\$468.53 |
| 9 | \$13,500 | \$13,947.55 | -\$447.55 |
| 10 | \$15,000 | \$14,426.57 | \$573.43 |
| 11 | \$14,500 | \$14,905.59 | -\$405.59 |
| 12 | \$15,500 | \$15,384.62 | \$115.38 |

Sue’s actual and forecasted sales appear to be pretty close, except for her July estimate, which is off by a little over \$1,000. But does her model predict better than if she simply used average sales as her forecast for each month? To answer that, she must compute the error sum of squares (ESS). Sue squares the error term for each observation and sums the squares to obtain ESS:

ESS = Σe2

| Error | Squared Error |
|------------|----------------|
| -\$115.38 | \$13,313.61 |
| \$405.59 | \$164,506.82 |
| -\$573.43 | \$328,818.04 |
| -\$52.45 | \$2,750.75 |
| \$468.53 | \$219,521.74 |
| -\$510.49 | \$260,599.54 |
| \$1,010.49 | \$1,021,089.05 |
| -\$468.53 | \$219,521.74 |
| -\$447.55 | \$200,303.19 |
| \$573.43 | \$328,818.04 |
| -\$405.59 | \$164,506.82 |
| \$115.38 | \$13,313.61 |
| **ESS =** | **\$2,937,062.94** |

Notice Sue’s error sum of squares. This is the error, or unexplained, sum of squared deviations between the forecasted and actual sales. The difference between the total sum of squares (TSS) and the error sum of squares (ESS) is the regression sum of squares (RSS) – the sum of squared deviations that are explained by the regression. RSS can also be calculated directly, by summing the squared differences between each forecasted value of sales and average sales:

| Forecasted Sales | Average Sales | Regression Error | Reg. Error Squared |
|-------------|----------|-------------|------------------|
| \$10,115.38 | \$12,750 | -\$2,634.62 | \$6,941,198.22 |
| \$10,594.41 | \$12,750 | -\$2,155.59 | \$4,646,587.24 |
| \$11,073.43 | \$12,750 | -\$1,676.57 | \$2,810,898.45 |
| \$11,552.45 | \$12,750 | -\$1,197.55 | \$1,434,131.86 |
| \$12,031.47 | \$12,750 | -\$718.53 | \$516,287.47 |
| \$12,510.49 | \$12,750 | -\$239.51 | \$57,365.27 |
| \$12,989.51 | \$12,750 | \$239.51 | \$57,365.27 |
| \$13,468.53 | \$12,750 | \$718.53 | \$516,287.47 |
| \$13,947.55 | \$12,750 | \$1,197.55 | \$1,434,131.86 |
| \$14,426.57 | \$12,750 | \$1,676.57 | \$2,810,898.45 |
| \$14,905.59 | \$12,750 | \$2,155.59 | \$4,646,587.24 |
| \$15,384.62 | \$12,750 | \$2,634.62 | \$6,941,198.22 |
| **RSS =** | | | **\$32,812,937.06** |

Sue immediately adds the RSS and the ESS and sees they match the TSS: \$35,750,000. She also knows that nearly 33 million of that TSS is explained by her regression model, so she divides her RSS by the TSS:

32,812,937.06 / 35,750,000

= 0.917, or 91.7%

This quotient, known as the coefficient of determination and denoted R², tells Sue that the passage of months explains 91.7% of the variation in her monthly sales. Put another way, by using this simple model instead of the simple average, Sue reduced her forecast error, measured in squared deviations, by 91.7%. As you will find out in subsequent blog posts, maximizing R² isn’t the be-all and end-all. In fact, there is still much to do with this model, which will be discussed in next week’s Forecast Friday post. But for now, Sue’s model seems to have reduced a great deal of error.
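The sums-of-squares arithmetic above can be verified with a short Python sketch (sales figures from Sue’s table; note that full-precision arithmetic gives R² of roughly 0.918, which the post truncates to .917):

```python
import numpy as np

months = np.arange(1, 13)
sales = np.array([10000, 11000, 10500, 11500, 12500, 12000,
                  14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

# Refit the line at full precision, then form the fitted (forecasted) values
b, a = np.polyfit(months, sales, 1)
forecast = a + b * months

tss = np.sum((sales - sales.mean()) ** 2)      # total sum of squares
ess = np.sum((sales - forecast) ** 2)          # error (unexplained) sum of squares
rss = np.sum((forecast - sales.mean()) ** 2)   # regression (explained) sum of squares

r_squared = rss / tss
print(int(round(tss)), round(r_squared, 3))
```

By construction TSS = ESS + RSS, so R² can equivalently be computed as 1 − ESS/TSS.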

It is important to note that while each month does seem to be related to sales, the passing months do not cause the increase in sales. Correlation does not mean causation. There could be something behind the scenes (e.g., Sue’s advertising, or the types of projects she works on, etc.) that is driving the upward trend in her sales.

Using the Regression Equation to Forecast Sales

Now Sue can use the same model to forecast sales for January 2010 and February 2010, etc. She has her equation, so since January 2010 is period 13, she plugs in 13 for X, and gets a forecast of \$15,863.64; for February (period 14), she gets \$16,342.66.
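Plugging periods 13 and 14 into the fitted equation reproduces those forecasts (recomputing a and b at full precision, since the rounded coefficients shift the results by a few cents):

```python
import numpy as np

months = np.arange(1, 13)
sales = np.array([10000, 11000, 10500, 11500, 12500, 12000,
                  14000, 13000, 13500, 15000, 14500, 15500], dtype=float)

# Recover a and b from the least-squares sums
b = np.sum((months - months.mean()) * (sales - sales.mean())) \
    / np.sum((months - months.mean()) ** 2)
a = sales.mean() - b * months.mean()

# January 2010 is period 13, February 2010 is period 14
for period in (13, 14):
    print(period, round(a + b * period, 2))
# 13 15863.64
# 14 16342.66
```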

Recap and Plan for Next Week

You have now learned the basics of simple regression analysis. You have learned how to estimate the parameters for the regression equation, how to measure the improvement in accuracy from the regression model, and how to generate forecasts. Next week, we will be checking the validity of Sue’s equation, and discussing the important assumptions underlying regression analysis. Until then, you have a basic overview of what regression analysis is.