Posts Tagged ‘t-value’

Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R2=0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contributions” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two remaining independent variables, suggesting that we still haven’t eliminated multicollinearity.
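Here’s a minimal sketch of that transformation in Python with pandas (an illustration of mine – the post itself works in a spreadsheet, and the column names are made up):

```python
import pandas as pd

# A few illustrative donor records; in practice, load the full donor history.
donors = pd.DataFrame({
    "months_since_last": [10, 8, 10, 8, 9],
    "times_12mo":        [1, 2, 2, 3, 1],
    "avg_contrib_12mo":  [85, 95, 90, 75, 95],
})

# Replace the two collinear variables with their product:
# total contributions over the last 12 months.
donors["total_contrib_12mo"] = donors["times_12mo"] * donors["avg_contrib_12mo"]

# Check whether the two remaining predictors are still strongly correlated.
print(donors[["months_since_last", "total_contrib_12mo"]].corr())
```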

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying is correlated with hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. To do this, compute the average value of the hours-of-sleep and hours-of-studying variables. For each observation, subtract each independent variable’s mean from its respective value for that observation, then multiply the two differences together. This is your interaction term, Sleep_Study_Interaction. Now run the regression with the original two variables and the interaction term. By subtracting the means from the variables in question, you are in effect centering the interaction term, which means you’re taking the central tendency of your data into account.
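A minimal sketch of building that centered interaction term, using made-up study and sleep figures:

```python
import pandas as pd

# Hypothetical data for the test-performance example.
df = pd.DataFrame({
    "hours_study": [2.0, 4.5, 3.0, 6.0, 5.0, 1.5],
    "hours_sleep": [8.0, 6.5, 7.0, 5.5, 6.0, 8.5],
})

# Center each variable by subtracting its mean, then multiply the
# centered values to form the interaction term.
study_centered = df["hours_study"] - df["hours_study"].mean()
sleep_centered = df["hours_sleep"] - df["hours_sleep"].mean()
df["sleep_study_interaction"] = study_centered * sleep_centered
```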

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediately prior observation’s values from the current observation’s. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing reduces multicollinearity by removing the trend component of the time series: if all the independent variables followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. Subtracting the immediately prior period gives a “first difference”; differencing those first differences again gives a “second difference,” and so on. Note also that with differencing we lose observations at the start of the data – one for each round of differencing – so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
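In pandas, differencing is a single call; a small sketch with made-up sales figures:

```python
import pandas as pd

sales = pd.Series([102.0, 108.0, 115.0, 119.0, 128.0, 135.0, 141.0])

first_diff = sales.diff()          # period-over-period change; the first value is NaN
second_diff = first_diff.diff()    # difference of the first differences

# Each round of differencing costs one observation at the start of the series.
first_diff = first_diff.dropna()
second_diff = second_diff.dropna()
```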

Other Transformations

Sometimes it makes sense to look at a scatter plot of each independent variable against the dependent variable to see whether the relationship is fairly linear. If it is not, that’s a cue to transform the independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.
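A quick sketch of those common transformations with numpy, on an illustrative series:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

x_log      = np.log(x)    # for a logarithmic relationship
x_sqrt     = np.sqrt(x)   # square root
x_squared  = x ** 2       # square
x_negrecip = -1.0 / x     # negative reciprocal
```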

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

  1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
  2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable (a quick sketch of this check appears after this list).
  3. You should suspect multicollinearity if:
    1. You have a high R2 but low t-statistics;
    2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
  4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
  5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy introduces its own set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and the resulting impact on your forecast accuracy.
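As promised in takeaway 2, here is a minimal sketch (pandas assumed, with made-up donor-style data) of inspecting a correlation matrix:

```python
import pandas as pd

df = pd.DataFrame({
    "contribution":      [95, 110, 100, 115, 100, 120],   # dependent variable
    "months_since_last": [10, 8, 10, 8, 9, 6],
    "times_12mo":        [1, 2, 2, 3, 1, 2],
})

# Worry when the predictors correlate more strongly with each other
# than either does with the dependent variable.
print(df.corr().round(2))
```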

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.


Forecast Friday Topic: Building Regression Models With Excel

July 8, 2010
(Twelfth in a series)

We've spent the last six weeks discussing regression analysis as a forecasting method. As you have seen, simple regression is a bit tedious to work out by hand, but for multiple regression analysis you almost always need the aid of a computerized software package. Today I will demonstrate how to use the Regression Analysis feature of Microsoft Excel's1 Analysis ToolPak Add-In. Excel 2007 comes with the Analysis ToolPak Add-In, which you can choose to activate. One way to know whether the Analysis ToolPak is activated in your version of Excel is to click the Data tab of your workspace and look for a "Data Analysis" icon. The following thumbnail illustrates:

 

[Screenshot: Checking whether the Data Analysis ToolPak Add-In is activated]

Notice towards the upper right corner of the image that "Data Analysis" is highlighted in orange. The presence of the Data Analysis icon means that the Analysis ToolPak Add-In is activated. If it isn't there, you need to activate the Add-In, which you can do easily by clicking the "Office" button in the top left-hand corner and then clicking the "Excel Options" button, which will take you through the process of activating your add-in.

Setting up a Regression – Our Data Set

Generally in direct mail marketing, three components that often determine how much one spends – or whether he/she buys at all – are recency, frequency, and monetary value – known in short as RFM. Generally, the longer it has been since one’s last purchase (recency), the less he/she is likely to spend. Hence, we would expect a minus sign by the coefficient for recency. Also, RFM theorizes that the more frequently one buys, the greater his/her purchase. So, we would expect a plus sign by the coefficient for frequency. Finally, the higher the customer’s average purchases (monetary value), the greater his/her spending, so we would also expect a plus sign here. RFM is also used heavily by nonprofits in their capital and contributor campaigns, since they are often heavily reliant upon direct mail.

In our example here, a local nonprofit decided to test whether each RFM component had a relationship to a donor’s contribution, so it randomly selected 20 donors who contributed to its last appeal. Naturally, our dependent variable, Y, was the Contribution amount. The nonprofit also looked at three independent variables: months since last contribution (X1), times donated in last 12 months (X2), and average contribution over the last 12 months (X3). These independent variables represent recency, frequency, and monetary value, respectively. The table below shows our dataset:

Giving Patterns of 20 Donors

Donor | Contribution | Months Since Last Donation | Times Donated in Last 12 Months | Average Contribution in Last 12 Months
1 | 95 | 10 | 1 | 85
2 | 110 | 8 | 2 | 95
3 | 100 | 10 | 2 | 90
4 | 115 | 8 | 3 | 75
5 | 100 | 9 | 1 | 95
6 | 120 | 6 | 2 | 100
7 | 105 | 9 | 1 | 90
8 | 125 | 10 | 1 | 125
9 | 105 | 9 | 2 | 100
10 | 130 | 4 | 3 | 150
11 | 135 | 7 | 4 | 125
12 | 150 | 2 | 8 | 150
13 | 140 | 4 | 3 | 125
14 | 155 | 2 | 9 | 140
15 | 140 | 2 | 8 | 130
16 | 160 | 2 | 10 | 150
17 | 145 | 3 | 6 | 135
18 | 165 | 1 | 12 | 150
19 | 150 | 3 | 4 | 160
20 | 170 | 1 | 12 | 140

The thumbnail below shows what the data set looks like in Excel:

[Screenshot: Regression data set in Excel]

Running the Regression

To run the regression, we need to select the regression tool from the Analysis ToolPak. We do this by clicking the Data Analysis icon. The next thumbnail shows what we need to do:

[Screenshot: Selecting the Regression option from the Data Analysis ToolPak]

After selecting the regression tool, we need to select our independent variables and our dependent variable. It is best to make sure all columns containing your independent variables are adjacent to each other, as they are in columns D, E, and F. Notice that column C from rows 2 to 22 contains our Y-range values (including the column label). In columns D, E, and F, rows 2 through 22 contain their respective X-range values. Notice in the thumbnail how we indicate those column/row positions for the Y-range and X-range values.

[Screenshot: Regression options]

We also need to decide where to place the regression output and what data we want the output to contain. In the thumbnail below, we choose to have the output placed in a new worksheet, called "Regression Output", and we also check the box indicating that we want the residuals printed. Also notice that we checked the "Labels" box, so that row 2 – the header row – won't be inadvertently treated as data in the model.

[Screenshot: Regression options, continued]

Looking at the Output

Now we run the regression and get the following output:

[Screenshot: Regression output (residuals not shown)]

As you can see, cell B5 contains our R2, equal to .933, indicating that 93.3% of the variation in a donor's contribution amount is explained by changes in recency, frequency, and monetary value. Also notice the F-statistic in cell E12: a strong 73.90, and the Significance F in cell F12 to its right is 0.00, suggesting the model is significant. (Note: the Significance F in cell F12 and the P-values in cells E17-E20 for each parameter estimate are quick cues to significance. If you're using a 95% confidence level – as we are here – then you want those values to be no higher than 0.05.)

Now let’s look at each parameter estimate. Cells B17-B20 contain our regression coefficients. We have the following equation:

Contribution Estimate = 87.27 – 1.80 *Months_Since_Last_Donation + 2.45 *Times_Donated_Last_12_Months + 0.35 *Average_Contribution_Last_12_Months

Simplifying, we have:

Contribution Estimate = 87.27 – 1.80*RECENCY + 2.45*FREQUENCY + 0.35*MONETARY_VALUE

Ŷ = 87.27 – 1.80X1 + 2.45X2 + 0.35X3

Note that even though we opted to display the residuals for each observation, I chose not to show them here. They would have run below the fold and been difficult to see. Besides, for our analysis, we're not going to worry about residuals right now.
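If you'd rather script the same analysis, here is a minimal sketch using Python's pandas and statsmodels libraries (my own illustration – the post itself uses only Excel). It fits the 20 donors from the table above, and its summary should closely match the ToolPak's output:

```python
import pandas as pd
import statsmodels.api as sm

# The 20 donors from the table above.
donors = pd.DataFrame({
    "contribution":      [95, 110, 100, 115, 100, 120, 105, 125, 105, 130,
                          135, 150, 140, 155, 140, 160, 145, 165, 150, 170],
    "months_since_last": [10, 8, 10, 8, 9, 6, 9, 10, 9, 4,
                          7, 2, 4, 2, 2, 2, 3, 1, 3, 1],
    "times_12mo":        [1, 2, 2, 3, 1, 2, 1, 1, 2, 3,
                          4, 8, 3, 9, 8, 10, 6, 12, 4, 12],
    "avg_contrib_12mo":  [85, 95, 90, 75, 95, 100, 90, 125, 100, 150,
                          125, 150, 125, 140, 130, 150, 135, 150, 160, 140],
})

# Add the intercept term and fit ordinary least squares.
X = sm.add_constant(donors[["months_since_last", "times_12mo", "avg_contrib_12mo"]])
results = sm.OLS(donors["contribution"], X).fit()

# R-squared, F-statistic, coefficients, t-statistics, and p-values.
print(results.summary())
```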

Interpreting the Output

As we can see, each month since a donor’s last contribution reduces his contribution by an average of $1.80, when we hold frequency and monetary value constant. Likewise, for each time a donor has given in the last 12 months, the size of his contribution increases by an average of $2.45, holding the other two variables constant. In addition, each one-dollar increase in a donor’s average contribution increases his contribution by an average of 35 cents. Hence, all of our coefficients have the signs we expect.

T-Statistics and P-Values

Next, we need to look at the t-statistics and P-values. As mentioned above, for a 95% confidence interval, a parameter estimate must have a p-value no greater than 0.05 (or 0.10 for a 90% confidence interval, etc.), in order to be significant. In like manner, for a 95% confidence interval, t-statistics should be values of at least 1.96 (slightly higher for small samples, but 1.96 will work) or less than -1.96 if the coefficient is negative, to be significant:

Parameter | Coefficient | t-statistic | Significant?
Intercept | 87.27 | 4.32 | Yes
Months Since Last Donation | (1.80) | (1.44) | No
Times Donated | 2.45 | 2.87 | Yes
Average Contribution | 0.35 | 3.26 | Yes

Notice that the coefficient for Months Since Last Donation has a t-statistic of -1.44. It is not significant. Another way to tell whether the parameter estimates are significant is to look at the Lower 95% and Upper 95% values in columns F and G.  If the lower and upper 95% confidence interval values for a parameter estimate are both negative or both positive, they are significant.  However, if the lower 95% value is negative and the upper 95% is positive (as is the case with Months Since Last Donation), then the parameter estimate is not significant, since its confidence interval range crosses zero.  Hence Months Since Last Donation is not significant.  Yet, the model still has a 93.3% coefficient of determination. Does this mean we can drop this variable from our regression? Not so fast!

Regression Violation Present!

When an independent variable that we expect to be an important predictor of our dependent variable comes up statistically insignificant, it is sometimes a sign of multicollinearity. And that is definitely the case with the nonprofit's model. That will be our topic in next week's Forecast Friday post.

Forecasting with the Output

Since we’re going to take on multicollinearity next week, let’s pretend our model is A-OK, and generate some forecasts.

We’ll go to our regression output worksheet, select cells A17 through B20, which contain our regression variables and coefficients, and then click Copy (or do a CTRL-C):

[Screenshot: Selecting the coefficients]

Next, let’s paste those coefficients and transpose them in another worksheet.  Here’s how to select the “Transpose” option when pasting:

[Screenshot: Pasting data using the Transpose option]

Next, this is what the result of our transpose will be:

[Screenshot: Transposed data]

Now, the nonprofit organization looks at five prospective donors whom it is planning to solicit, examining their past giving history as shown in the next thumbnail:

[Screenshot: Prospective donors, before applying the model]

Knowing this information, we want to multiply those values by their respective coefficients. Take a look at the formula in cell F7 as we do just that, in the next thumbnail:

[Screenshot: Forecasting with the regression output]

Note how the cell references for the coefficients are anchored with '$' signs. The dollar signs tell Excel to keep referencing those same coefficient cells when we copy the formula down the next four rows. Otherwise, for each row down, Excel would multiply each blank cell below the coefficients by the next donor's information. Here are the forecasts generated:

[Screenshot: Forecasts made with the model]
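The same arithmetic is easy to script. A minimal sketch with made-up prospect values (the actual five prospects appear only in the screenshot):

```python
# Coefficients from the regression output above.
intercept = 87.27
b_recency, b_frequency, b_monetary = -1.80, 2.45, 0.35

# Hypothetical prospects: (months since last donation, times donated, avg contribution).
prospects = [(5, 2, 100.0), (1, 10, 140.0), (8, 1, 90.0)]

for recency, frequency, monetary in prospects:
    forecast = (intercept + b_recency * recency
                + b_frequency * frequency + b_monetary * monetary)
    print(f"Forecast contribution: ${forecast:.2f}")
```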

Next Forecast Friday Topic: Multicollinearity

Today you learned how to develop regression models using Excel and how to use Excel to interpret the output. You also found out that our model exhibited multicollinearity, a violation of one of the key regression assumptions. Next week and the week after, we will discuss multicollinearity in depth: how to detect it, how to correct it, and when to live with it. We will again be using the nonprofit’s model.  As I’ve said before, models are far from perfect and, as such, should only aid – not replace – the decision-making process.

 1 Note: Excel is a registered trademark of Microsoft Corporation. Use of Microsoft Excel in this post is intended only for a demonstration of how to use Excel for regression analysis and does not constitute an endorsement of Microsoft Excel or any other Microsoft product by Analysights, LLC.


Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the prices of competitors' products, the time of the year (if our product is seasonal), and the demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will be spending the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but one that is quite easy to forget if not used often.

Another thing to note: regression analysis is often used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or some statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you with the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.

Let's assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. The coefficients – your parameter estimates – will have subscripts corresponding to their respective independent variables. Hence, your model will take on the form:

Yi = α + β1X1i + β2X2i + εi

Now, how do you go about computing α, β1, and β2? The process is similar to that of a two-variable model, but a little more involved. Writing lowercase letters for deviations from the means (e.g., x1i = X1i − X̄1, yi = Yi − Ȳ), the least-squares estimates are:

β1 = (Σx1iyi · Σx2i² − Σx2iyi · Σx1ix2i) / (Σx1i² · Σx2i² − (Σx1ix2i)²)

β2 = (Σx2iyi · Σx1i² − Σx1iyi · Σx1ix2i) / (Σx1i² · Σx2i² − (Σx1ix2i)²)

α = Ȳ − β1X̄1 − β2X̄2

The subscript "i" represents the individual observation. In time series, the subscript can also be represented with a "t".
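As a check on that algebra, here is a small numpy sketch of those formulas (my own illustration, not from the original post):

```python
import numpy as np

def two_var_ols(y, x1, x2):
    """Closed-form least-squares estimates for Y = a + b1*X1 + b2*X2."""
    # Work in deviations from the means, per the formulas above.
    yd, x1d, x2d = y - y.mean(), x1 - x1.mean(), x2 - x2.mean()
    s11, s22 = (x1d ** 2).sum(), (x2d ** 2).sum()
    s12 = (x1d * x2d).sum()
    s1y, s2y = (x1d * yd).sum(), (x2d * yd).sum()
    denom = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / denom
    b2 = (s2y * s11 - s1y * s12) / denom
    a = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return a, b1, b2
```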

What do you notice about the formulas for computing β1 and β2? First, you notice that both independent variables, X1 and X2, are included in the calculation of each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are usually related to one another to some degree as well. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to the variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 will be incorrect because X1 has some relationship with X2. Moreover, X2‘s values are likely to be accounted for in the error terms, and because of its relationship with X1, X1 will be correlated with the error term, violating the second assumption above. Hence, you will end up with incorrect, biased estimators for your regression coefficient, β1.

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, which reduces your model's degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable – both dependent and independent – you are considering, you should have at least five data points. So, for a model with three independent variables, your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re doing a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables. Every time you use a particular value of Fahrenheit temperature, you will get the same value of Celsius temperature. In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. A relationship between independent variables need not be perfectly linear for multicollinearity to exist. Highly correlated variables can do the same thing. For example, independent variables such as “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage” are examples of independent variables that are highly correlated.

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.

Now Can We Get Into Multiple Regression????

Wasn't that an ordeal? Well, now the fun can begin! I'm going to use an example from one of my old graduate school textbooks, because it's good for several lessons in multiple regression. This data set consists of 25 annual observations used to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X1) and the number of offices (X2). The data are as follows:

Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t)
1 | 0.75 | 3.92 | 7,298
2 | 0.71 | 3.61 | 6,855
3 | 0.66 | 3.32 | 6,636
4 | 0.61 | 3.07 | 6,506
5 | 0.70 | 3.06 | 6,450
6 | 0.72 | 3.11 | 6,402
7 | 0.77 | 3.21 | 6,368
8 | 0.74 | 3.26 | 6,340
9 | 0.90 | 3.42 | 6,349
10 | 0.82 | 3.42 | 6,352
11 | 0.75 | 3.45 | 6,361
12 | 0.77 | 3.58 | 6,369
13 | 0.78 | 3.66 | 6,546
14 | 0.84 | 3.78 | 6,672
15 | 0.79 | 3.82 | 6,890
16 | 0.70 | 3.97 | 7,115
17 | 0.68 | 4.07 | 7,327
18 | 0.72 | 4.25 | 7,546
19 | 0.55 | 4.41 | 7,931
20 | 0.63 | 4.49 | 8,097
21 | 0.56 | 4.70 | 8,468
22 | 0.41 | 4.58 | 8,717
23 | 0.51 | 4.69 | 8,991
24 | 0.47 | 4.71 | 9,179
25 | 0.32 | 4.78 | 9,318

Data taken from Spellman, L. J., "Entry and profitability in a rate-free savings and loan market," Quarterly Review of Economics and Business 18, no. 2 (1978): 87-95. Reprinted in Newbold, P., and Bos, T., Introductory Business & Economic Forecasting, 2nd ed., Cincinnati (1994): 136-137.

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S & L is making more revenue on the same dollar, it suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up?  They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R2, and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.

Are the Parameter Estimates Statistically Significant?

We have 25 observations and three parameters – two coefficients for the independent variables and one intercept – hence we have 22 degrees of freedom (25-3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the resulting interval estimates would contain the true parameter approximately 95 times. To test significance, we need to look at the t-values for each parameter estimate. For a two-tailed 95% significance test with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, there is a statistically significant positive relationship between the independent variable and the dependent variable; if the t-statistic for the parameter estimate is less than -2.074, there is a statistically significant negative relationship. This is what we get:

Parameter | Value | t-statistic | Significant?
Intercept | 1.5645000 | 19.70 | Yes
B1t | 0.2372000 | 4.27 | Yes
B2t | (0.0002490) | (7.77) | Yes

So, yes, all our parameter estimates are significant.
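If you don't have a t-table handy, the critical value is a one-liner; a sketch using scipy (assumed available):

```python
from scipy import stats

# Two-tailed 95% critical t with 22 degrees of freedom (25 observations - 3 parameters).
t_crit = stats.t.ppf(0.975, df=22)
print(round(t_crit, 3))  # about 2.074
```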

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished. We’re going to stop here and continue with further analysis of this example next week. Next week, we will discuss computing the 95% confidence interval for the parameter estimates; determining whether the model is valid; and checking for autocorrelation. The following Forecast Friday (July 1) blog post will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.

Forecast Friday Topic: Simple Regression Analysis (Continued)

June 3, 2010

(Seventh in a series)

Last week I introduced the concept of simple linear regression and how it could be used in forecasting. I introduced the fictional businesswoman, Sue Stone, who runs her own CPA firm. Using the last 12 months of her firm’s sales, I walked you through the regression modeling process: determining the independent and dependent variables, estimating the parameter estimates, α and β, deriving the regression equation, calculating the residuals for each observation, and using those residuals to estimate the coefficient of determination – R2 – which indicates how much of the change in the dependent variable is explained by changes in the independent variable. Then I deliberately skipped a couple of steps to get straight to using the regression equation for forecasting. Today, I am going to fill in that gap, and then talk about a couple of other things so that we can move on to next week’s topic on multiple regression.

Revisiting Sue Stone

Last week, we helped Sue Stone develop a model using simple regression analysis so that she could forecast sales. She had 12 months of sales data; sales was her dependent variable, or Y, and the month (numbered from 1 to 12) was her independent variable, or X. Sue's regression equation was as follows:

Ŷi = $9,636.36 + $479.02Xi

where i is the period number corresponding to the month. So, in June 2009, i would be equal to 6; in January 2010, i would be equal to 13. Of course, since X is the month number, X = i in this example. Recall that Sue's equation states that each passing month is associated with an average sales increase of $479.02, suggesting her sales are on an upward trend. Also note that Sue's R2 = .917, which says 91.7% of the change in Sue's monthly sales is explained by the passing months.

Are these claims valid? We need to do some further work here.

Are the Parameter Estimates Statistically Significant?

Measuring an entire population is often impossible. Quite often, we must measure a sample of the population and generalize our findings to the population. When we take an average or standard deviation of a data set that is a subset of the population, our values are estimates of the actual parameters for the population’s true average and standard deviation. These are subject to sampling error. Likewise, when we perform regression analysis on a sample of the population, our coefficients (a and b) are also subject to sampling error. Whenever we estimate population parameters (the population’s true α and β), we are frequently concerned that they might actually have values of zero. Even though we have derived values a=$9636.36 and b=$479.02, we want to perform a statistical significance test to make sure their distance from zero is meaningful and not due to sampling error.

Recall from the May 25 blog post, Using Statistics to Evaluate a Promotion, that in order to do significance testing, we must set up a hypothesis test. In this case, our null hypothesis is that the true population coefficient for month – β – is equal to zero. Our alternative hypothesis is that β is not equal to zero:

H0: β = 0

HA: β≠ 0

Our first step here is to compute the standard error of the estimate – a measure of how spread out the actual values of the dependent variable (sales) are around the regression line. Since we are sampling from a population, we are looking for the estimator for the standard error of the estimate. That equation is:

sε = √(ESS / (n − k − 1))

where ESS is the error sum of squares – or $2,937,062.94 – from Sue's equation; n is the sample size, or 12; and k is the number of independent variables in the model – in this case, just 1. Plugging those numbers into the equation, we divide the ESS by 10 and take the square root, so Sue's estimator is:

sε = $541.95

Now that we know the estimator for the standard error of the estimate, we need to use it to find the estimator of the standard deviation of the regression slope (b). That equation is given by:

sb = sε / √(Σ(Xi − X̄)²)

Remember from last week's blog post that the sum of all the squared (x − x̄) values was 143. So we divide $541.95 by the square root of 143 to get sb = $45.32. Next we need to compute the t-statistic. If Sue's t-statistic is greater in absolute value than her critical t-value, she'll know the parameter estimate of $479.02 is significant. In Sue's regression, she has 12 observations, and thus 10 degrees of freedom: (n − k − 1) = (12 − 1 − 1) = 10. Assuming a 95% confidence level, her critical t is 2.228. Since parameter estimates can be positive or negative, if her t-value is less than -2.228 or greater than 2.228, Sue can reject her null hypothesis and conclude that her parameter estimate is meaningfully different from zero.

To compute the t-statistic, all Sue needs to do is divide her b1 coefficient ($479.02) by her sb ($45.32). She ends up with a t-statistic of 10.57, which is significant.

Next, Sue must do the same for her intercept value, a. To do this, she must compute the estimator of the standard deviation of the intercept. The equation for this estimator is:

sa = sε · √(1/n + X̄² / Σ(Xi − X̄)²)

All she needs to do is plug in her numbers from earlier: her sε = $541.95; n = 12; her average x̄ of 6.5, squared, is 42.25; and the denominator is the same 143. Working that all in, Sue gets a standard error of 333.545. She divides her intercept value of $9,636.36 by 333.545 and gets a t-statistic of 28.891, which exceeds the 2.228 critical t, so her intercept is also significant.
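Here is a short sketch that reproduces Sue's standard errors and t-statistics (my own illustration; the input numbers come from the post):

```python
import math

ess = 2_937_062.94   # error sum of squares from Sue's regression
n, k = 12, 1         # sample size and number of independent variables
sum_sq_x = 143.0     # sum of squared deviations of X from its mean
x_bar = 6.5          # mean of X (months 1 through 12)
a, b = 9636.36, 479.02

s_e = math.sqrt(ess / (n - k - 1))                  # standard error of the estimate, ~541.95
s_b = s_e / math.sqrt(sum_sq_x)                     # std. deviation of the slope, ~45.32
s_a = s_e * math.sqrt(1/n + x_bar**2 / sum_sq_x)    # std. deviation of the intercept, ~333.55

print(b / s_b)  # t for the slope, ~10.57 (beyond 2.228, so significant)
print(a / s_a)  # t for the intercept, ~28.89 (also significant)
```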

Prediction Intervals in Forecasting

Whew! Aren't you glad those t-statistic calculations are over? If you run regressions in Excel, these values will be calculated for you automatically, but it's very important that you understand how they were derived and the theory behind them. Now, we move back to forecasting. In last week's post, we predicted just a single point with the regression equation. For January 2010, we substituted the number 13 for X and got a point forecast for sales in that month: $15,863.64. But Sue needs a range, because she knows forecasts are not precise. Sue wants to develop a prediction interval: the point forecast plus or minus the critical t-value (2.228) for a desired level of confidence (95%, in this example) times the estimator of the standard error of the estimate ($541.95). (Strictly, the standard error of an individual forecast is slightly larger than sε, but this is a close and commonly used approximation.) So, Sue's prediction interval is:

$15,863.64 ± 2.228($541.95)

= $15,863.64 ± $1,207.46

$14,656.18 to $17,071.10

So, since Sue had chosen a 95% level of confidence, she can be 95% confident that January 2010 sales will fall somewhere between $14,656.18 and $17,071.10.
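The interval itself is quick to script; a sketch continuing from the values above:

```python
point_forecast = 15_863.64    # January 2010 point forecast (X = 13)
t_crit, s_e = 2.228, 541.95

half_width = t_crit * s_e     # about $1,207.46
print(point_forecast - half_width, point_forecast + half_width)  # ~14,656.18 to ~17,071.10
```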

Recap and Plan for Next Week’s Post

Today, you learned how to test the parameter estimates for significance to determine the validity of your regression model. You learned how to compute the estimator of the standard error of the estimate, as well as the estimators of the standard deviations of the slope and intercept. You then learned how to derive the t-statistics you need to determine whether those parameter estimates were indeed significant. And finally, you learned how to derive a prediction interval. Next week, we begin our discussion of multiple regression. We will begin by talking about the assumptions behind a regression model; then we will talk about adding a second independent variable into the model. From there, we will test the model for validity, assess the model against those assumptions, and generate projections.

Using Statistics to Evaluate a Promotion

May 25, 2010

Marketing – as much as cashflow – is the lifeblood of any business. No matter how good your product or service may be, it's worthless if you can't get it in front of your customers and get them to buy it. So all businesses, large and small, must engage in marketing. And we see countless types of marketing promotions or tactics being tried: radio and TV commercials, magazine and newspaper advertisements, public relations, coupons, email blasts, and so forth. But are our promotions working? The merchant John Wanamaker, often dubbed the father of modern advertising, is said to have remarked, "Half the money I spend on advertising is wasted; the trouble is I don't know which half."

Some basic statistics can help you evaluate the effectiveness of your marketing and take away much of the mystique Wanamaker complained about. When deciding whether to do a promotion, managers and business owners have no way of knowing whether it will succeed; moreover, in today's economy, budgets are still tight. The cost to roll out a full promotion can wipe out an entire marketing budget if it proves to be a fiasco. This is why many businesses run a test before doing a complete rollout. The testing helps reduce the amount of uncertainty involved in an all-out campaign.

Quite often, large companies need to choose between two or more competing campaigns for rollout. But how do they know which will be effective? Consider the example of Jenny Kaplan, owner of K-Jen, a New Orleans-style restaurant. K-Jen serves up a tasty jambalaya entrée, which is priced at $10.00. Jenny believes that the jambalaya is a draw to the restaurant and believes that by offering a discount, she can increase the average amount of the table check. Jenny decides to issue coupons via email to patrons who have opted-in to receive such promotions. She wants to knock a dollar off the price of the jambalaya as the offer, but doesn’t know whether customers would respond better to an offer worded as “$1.00 off” or as “10% off.” So, Jenny decides to test the two concepts.

Jenny goes to her database of nearly 1,000 patrons and randomly selects 200 patrons. She decides to send half of those a coupon for $1.00 off for jambalaya, and the other half a coupon for 10% off. When the coupon offer expires 10 days later, Jenny finds that 10 coupons were redeemed for each offer – a redemption rate of 10% each. Jenny observes that either wording will get the same number of people to respond. But she wonders which offer generated the largest table check. So she looks at the guest checks to which the coupons were stapled. She notices the following:

Guest Check Amounts

$1.00 Off | 10% Off
$38.85 | $50.16
$36.97 | $54.44
$35.94 | $32.20
$54.17 | $32.69
$68.18 | $51.09
$49.47 | $46.18
$51.39 | $57.72
$32.72 | $44.30
$22.59 | $59.29
$24.13 | $22.94

Jenny quickly computes the average for each offer. The “$1.00 off” coupon generated an average table check of $41.44; the “10% off” coupon generated an average of $45.10. At first glance, it appears that the 10% off promotion generated a higher guest check. But is that difference meaningful, or is it due to chance? Jenny needs to do further analysis.

Hypothesis Testing

How does Jenny determine if the 10% off coupon really did better than the $1.00 off coupon? She can use statistical hypothesis testing, a structured analytical method for comparing the difference between two groups – in this case, two promotions. Jenny starts her analysis by formulating two hypotheses: a null hypothesis, which states that there is no difference in the average check amount between the two offers; and an alternative hypothesis, which states that there is, in fact, a difference in the average check amount between the two offers. The null hypothesis is often denoted as H0, and the alternative hypothesis is denoted as HA. Jenny also refers to the $1.00 off offer as Offer #1, and the 10% off offer as Offer #2. She wants to compare the mean checks of the two offers, which are denoted as μ1 and μ2, respectively. Jenny writes down her two hypotheses:

H0: The average guest check amount for the two offers is equal.

HA: The average guest check amount for the two offers is not equal.

Or, more succinctly:

H0: μ1 = μ2

HA: μ1 ≠ μ2

 

Now, Jenny is ready to go to work. Note that the symbol μ denotes the population mean she wants to measure. Because Jenny did her test on a portion – a sample – of her database, the averages she computed were sample averages, denoted as x̄. As we stated earlier, the average table checks for the "$1.00 off" and "10% off" offers were x̄1 = $41.44 and x̄2 = $45.10, respectively. Jenny needs to approximate μ using x̄. She must also compute the sample standard deviation, or s, for each offer.

Computing the Sample Standard Deviation

To compute the sample standard deviation, Jenny must subtract the mean of a particular offer from each of its check amounts in the sample; square each difference; sum the squared differences; divide by the total number of observations minus 1 (that is, 9); and then take the square root:

$1.00 Off

Actual Table Check | Average Table Check | Difference | Difference Squared
$38.85 | $41.44 | -$2.59 | $6.71
$36.97 | $41.44 | -$4.47 | $19.99
$35.94 | $41.44 | -$5.50 | $30.26
$54.17 | $41.44 | $12.73 | $162.03
$68.18 | $41.44 | $26.74 | $714.97
$49.47 | $41.44 | $8.03 | $64.46
$51.39 | $41.44 | $9.95 | $98.98
$32.72 | $41.44 | -$8.72 | $76.06
$22.59 | $41.44 | -$18.85 | $355.36
$24.13 | $41.44 | -$17.31 | $299.67

Total of squared differences: $1,828.50
S²1 (the sample variance) = $203.17
S1 (the sample standard deviation) = $14.25

 

10% Off

Actual Table Check | Average Table Check | Difference | Difference Squared
$50.16 | $45.10 | $5.06 | $25.59
$54.44 | $45.10 | $9.34 | $87.22
$32.20 | $45.10 | -$12.90 | $166.44
$32.69 | $45.10 | -$12.41 | $154.03
$51.09 | $45.10 | $5.99 | $35.87
$46.18 | $45.10 | $1.08 | $1.16
$57.72 | $45.10 | $12.62 | $159.24
$44.30 | $45.10 | -$0.80 | $0.64
$59.29 | $45.10 | $14.19 | $201.33
$22.94 | $45.10 | -$22.16 | $491.11

Total of squared differences: $1,322.63
S²2 (the sample variance) = $146.96
S2 (the sample standard deviation) = $12.12

 

Notice the notation S² – that is known as the variance. The variance and the standard deviation are used to measure the average distance between each data point and the mean. When data are normally distributed, about 95% of all observations fall within two standard deviations of the mean (actually 1.96 standard deviations). Hence, approximately 95% of the guest checks for the $1.00 off offer should fall between $41.44 ± 1.96*($14.25), or between $13.51 and $69.37. All ten fall within this range. For the 10% off offer, about 95% will fall between $45.10 ± 1.96*($12.12), or between $21.34 and $68.86. All 10 observations also fall within this range.

Degrees of Freedom and Pooled Standard Deviation

Jenny noticed two things immediately: first, that the 10% off coupon has the higher sample average, and second, that each individual table check is closer to its mean than in the $1.00 off group. Also notice that when computing the sample standard deviation for each offer, Jenny divided by 9, not 10. Why? Because she was making estimates of the population standard deviation. Since samples are subject to error, we must account for that. Each observation gives us information about the population's actual values. However, Jenny had to make an estimate based on that sample, so she gives up one observation to account for the sampling error – that is, she loses a degree of freedom. In this example, Jenny has 20 total observations; since she estimated the population standard deviation for both offers, she lost two degrees of freedom, leaving her with 18 (10 + 10 – 2).

Knowing the remaining degrees of freedom, Jenny must pool the standard deviations, weighting them by their degrees of freedom. This would be especially evident if the sample sizes of the two offers were not equal. The pooled standard deviation is given by:

s²p = ((n1 – 1)s²1 + (n2 – 1)s²2) / (n1 + n2 – 2)

Here, n is simply the sample size of each offer. Jenny then computes the pooled standard deviation:

s²p = ((9 × $203.17) + (9 × $146.96)) / (10 + 10 – 2)

= ($1,828.53 + $1,322.64) / 18

= $3,151.17 / 18

= $175.07

Taking the square root gives sp = $13.23; hence, the pooled standard deviation is $13.23.

Computing the t-Test Statistic

Now the fun begins. Jenny knows the sample means of the two offers; she knows the hypothesized difference between the two population means (zero, since the null hypothesis says they are equal); she knows the pooled standard deviation; she knows the sample sizes; and she knows the degrees of freedom. Jenny must now calculate the t-Test statistic. The t-Test statistic, or t-value, represents the number of estimated standard errors by which the observed difference in sample averages departs from the hypothesized difference. The t-value is computed as follows:

 

t = ((x̄1 – x̄2) – (μ1 – μ2)) / (sp × √(1/n1 + 1/n2))

So Jenny sets to work computing her t-Test statistic:

t = (($41.44 – $45.10) – 0) / ($13.23 × √(1/10 + 1/10))

= -$3.66 / ($13.23 × √0.2)

= -$3.66 / ($13.23 × 0.447)

= -$3.66 / $5.92

= -0.62

This t-statistic gives Jenny a basis for testing her hypothesis. Jenny's t-statistic indicates that the difference in sample table checks between the two offers is 0.62 standard errors below the hypothesized difference of zero. We now need to determine the critical t – the value we get from a t-distribution table, available in most statistics textbooks and online. Since we are testing at a 95% confidence level, and since we must account for a small sample, our critical t-value is adjusted slightly upward from the 1.96 standard deviations we would use with a large sample. For 18 degrees of freedom, our critical t is 2.10. The larger the sample size, the closer to 1.96 the critical t would be.
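If you'd rather not do this arithmetic by hand, scipy's pooled two-sample t-test reproduces Jenny's result; a minimal sketch (scipy assumed available):

```python
from scipy import stats

dollar_off = [38.85, 36.97, 35.94, 54.17, 68.18, 49.47, 51.39, 32.72, 22.59, 24.13]
pct_off    = [50.16, 54.44, 32.20, 32.69, 51.09, 46.18, 57.72, 44.30, 59.29, 22.94]

# Pooled (equal-variance) two-sample t-test, matching the hand calculation above.
t_stat, p_value = stats.ttest_ind(dollar_off, pct_off, equal_var=True)
print(round(t_stat, 2))   # about -0.62
print(p_value)            # well above 0.05, so we fail to reject H0
```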

So, does Jenny Accept or Reject her Null Hypothesis (Translation: Is the “10% Off” Offer Better than the “$1.00 Off” Offer)?

Jenny now has all the information she needs to determine whether one offer worked better than the other. What does the critical t of 2.10 mean? If Jenny’s t-statistic is greater than 2.10, or (since one offer can be lower than the other), less than -2.10, then she would reject her null hypothesis, as there is sufficient evidence to suggest that the two means are not equal. Is that the case?

Jenny's t-statistic is -0.62, which is between -2.10 and 2.10, so it falls inside that range. Jenny should not reject H0, since there is not enough evidence to suggest that one offer was better than the other at generating higher table checks. In fact, there's nothing to say that the difference between the two offers is due to anything other than chance.

What Does Jenny Do Now?

Basically, Jenny can conclude that there’s not enough evidence that the “$1.00 off” coupon was worse/better than the “10% off” coupon in generating higher table check amounts, and vice-versa. This does not mean that our hypotheses were true or false, just that there was not enough statistical evidence to say so. In this case, we did not accept the null hypothesis, but rather, failed to reject it. Jenny can do a few things:

  1. She can run another test, and see if the same phenomenon holds.
  2. Jenny can accept the fact that both offers work equally well, and compare their overall average table checks to those of patrons who ordered jambalaya without a coupon during the time the offer ran; if the coupons generated average table checks that were higher (using the hypothesis-testing procedure outlined above) than those of full-price patrons, then she may choose to roll out a complete promotion using either or both of the offers described above.
  3. Jenny may decide that neither coupon offer raised average check amounts and choose not to do a full rollout after all.

So Why am I Telling You This?

The purpose of this blog post was to take you step by step through how you can use a simple concept like the t-test to judge the performance of two promotion concepts. Although a spreadsheet like Excel can run this test in seconds, I wanted to walk you through the theory in layman's terms, so that you can grasp the theory and then apply it to your business. Analysights is in the business of helping companies – large and small – succeed at marketing, and this blog post is one ingredient in the recipe for your marketing success. If you would like some assistance in setting up a promotion test or in evaluating the effectiveness of a campaign, feel free to contact us at www.analysights.com.