Forecast Friday Topic: Building Regression Models With Excel

 (Twelfth in a series)
We’ve spent the last six weeks discussing regression analysis as a forecasting method. As you have seen, simple regression is a bit tedious to work out by hand, but for multiple regression analysis, you almost always need the aid of a computerized software package. Today I will demonstrate for you how to use the Regression Analysis feature of Microsoft Excel’s1 Analysis ToolPak Add-In. Excel 2007 comes with the Analysis ToolPak Add-In, which you can choose to activate. One way to know if Analysis ToolPak is activated on your version of Excel is to click on the Data tab on your workspace and see if there is a “Data Analysis” icon. The following thumbnail will illustrate:

 

Seeing if the Data Analysis ToolPak Add-In is activated.

Checking for the Data Analysis Add-In

Notice towards the upper right corner of the image, “Data Analysis” is highlighted in orange. The presence of the Data Analysis icon means that we have activated the Analysis ToolPak Add-In. If it wasn’t there, you would need to activate the Add-In, which you could do very easily by clicking the “Office” button in the top left hand corner, then clicking the “Excel Options” button, which will take you through the process of activating your add-in.

Setting up a Regression – Our Data Set

Generally in direct mail marketing, three components that often determine how much one spends – or whether he/she buys at all – are recency, frequency, and monetary value – known in short as RFM. Generally, the longer it has been since one’s last purchase (recency), the less he/she is likely to spend. Hence, we would expect a minus sign by the coefficient for recency. Also, RFM theorizes that the more frequently one buys, the greater his/her purchase. So, we would expect a plus sign by the coefficient for frequency. Finally, the higher the customer’s average purchases (monetary value), the greater his/her spending, so we would also expect a plus sign here. RFM is also used heavily by nonprofits in their capital and contributor campaigns, since they are often heavily reliant upon direct mail.

In our example here, a local nonprofit decided to test whether each RFM component had a relationship to a donor’s contribution, so it randomly selected 20 donors who contributed to its last appeal. Naturally, our dependent variable, Y, was the Contribution amount. The nonprofit also looked at three independent variables: months since last contribution (X1), times donated in last 12 months (X2), and average contribution over the last 12 months (X3). These independent variables represent recency, frequency, and monetary value, respectively. The table below shows our dataset:

Giving Patterns of 20 Donors

Donor

Contribution

Months Since Last Donation

Times Donated in last 12 months

Average Contribution in last 12 months

1

95

10

1

85

2

110

8

2

95

3

100

10

2

90

4

115

8

3

75

5

100

9

1

95

6

120

6

2

100

7

105

9

1

90

8

125

10

1

125

9

105

9

2

100

10

130

4

3

150

11

135

7

4

125

12

150

2

8

150

13

140

4

3

125

14

155

2

9

140

15

140

2

8

130

16

160

2

10

150

17

145

3

6

135

18

165

1

12

150

19

150

3

4

160

20

170

1

12

140

The thumbnail below shows what the data set looks like in Excel:

Data Set in Excel

Regression Data Set in Excel

Running the Regression

To run the regression, we need to select the regression tool from the Analysis ToolPak. We do this by clicking on the Data Analysis Tab. The  next thumbnail shows us what we need to do:

Regression Option

Selecting the Regression option from the Data Analysis ToolPak

After selecting the regression tool, we need to select our independent variables and our dependent variables. It is best to make sure all columns containing your independent variables are adjacent to each other, as they are in columns D, E, and F. Notice that column C from rows 2 to 22 contains our Y-range values (including the column label). In columns D, E, and F, rows 2 through 22 contain their respective X-range values. Notice in the thumbnail how we indicate those column/row positions for Y-range and X-range values.

Regression Options

Regression Options

We also need to decide where to place the regression output and what data we want the output to contain. In the thumbnail below, we choose to have the output placed in a new worksheet, called “Regression Output”, and we also check the box indicating that we want the residuals printed. Also notice that we checked the box “Labels”, so that row 2 won’t be inadvertently added into the model.

Regression Options - Continued

Regression Options - Continued

Looking at the Output

Now we run the regression and get the following output:

Regression Output

Regression Output (residuals not shown)

As you can see, cell B5 contains our R2, equal to .933, indicating that 93.3% of the variation in a donor’s contribution amount is explained by changes in recency, frequency, and monetary value. Also, notice the F-statistic in cell E12. It’s a large, strong 73.90, and cell F12 to the right is 0.00, suggesting the model is significant. (Note, the Significance F in cell F12 and the P-Values in cells E17-E20 for each parameter estimate are quick cues to significance. If you’re using a 95% confidence interval – which we are here – then you want those values to be no higher than 0.05).

Now let’s look at each parameter estimate. Cells B17-B20 contain our regression coefficients. We have the following equation:

Contribution Estimate = 87.27 – 1.80 *Months_Since_Last_Donation + 2.45 *Times_Donated_Last_12_Months + 0.35 *Average_Contribution_Last_12_Months

Simplifying, we have:

Contribution Estimate = 87.27 – 1.80*RECENCY + 2.45*FREQUENCY + 0.35*MONETARY_VALUE

Ŷ = 87.27 – 1.80X1 + 2.45X2 + 0.35X3

Note that even though we opted to display the residuals for each observation, I chose not to show them here.  It would have run below the fold, and would have been difficult to see.  Besides, for our analysis, we’re not going to worry about residuals right now.

Interpreting the Output

As we can see, each month since a donor’s last contribution reduces his contribution by an average of $1.80, when we hold frequency and monetary value constant. Likewise, for each time a donor has given in the last 12 months, the size of his contribution increases by an average of $2.45, holding the other two variables constant. In addition, each one-dollar increase in a donor’s average contribution increases his contribution by an average of 35 cents. Hence, all of our coefficients have the signs we expect.

T-Statistics and P-Values

Next, we need to look at the t-statistics and P-values. As mentioned above, for a 95% confidence interval, a parameter estimate must have a p-value no greater than 0.05 (or 0.10 for a 90% confidence interval, etc.), in order to be significant. In like manner, for a 95% confidence interval, t-statistics should be values of at least 1.96 (slightly higher for small samples, but 1.96 will work) or less than -1.96 if the coefficient is negative, to be significant:

Parameter

Coefficient

T-statistic

Significant?

Intercept

87.27

4.32

Yes

Months since Last

(1.80)

(1.44)

No

Times Donated

2.45

2.87

Yes

Average Contribution

0.35

3.26

Yes

Notice that the coefficient for Months Since Last Donation has a t-statistic of -1.44. It is not significant. Another way to tell whether the parameter estimates are significant is to look at the Lower 95% and Upper 95% values in columns F and G.  If the lower and upper 95% confidence interval values for a parameter estimate are both negative or both positive, they are significant.  However, if the lower 95% value is negative and the upper 95% is positive (as is the case with Months Since Last Donation), then the parameter estimate is not significant, since its confidence interval range crosses zero.  Hence Months Since Last Donation is not significant.  Yet, the model still has a 93.3% coefficient of determination. Does this mean we can drop this variable from our regression? Not so fast!

Regression Violation Present!

Generally, when an independent variable we expect to be an important predictor of our dependent variable comes up as statistically insignificant, it is sometimes a sign of multicollinearity. And that is definitely the case with the nonprofit’s model. That will be our topic in next week’s Forecast Friday post.

Forecasting with the Output

Since we’re going to take on multicollinearity next week, let’s pretend our model is A-OK, and generate some forecasts.

We’ll go to our regression output worksheet, select cells A17 through B20, which contain our regression variables and coefficients, and then click Copy (or do a CTRL-C):

Selecting the Coefficients

Selecting the Coefficients

Next, let’s paste those coefficients and transpose them in another worksheet.  Here’s how to select the “Transpose” option when pasting:

Pasting Data Using the Transpose Option

Next, this is what the result of our transpose will be:

Transposed Data

Transposed Data

Now, the nonprofit organization looks at five prospective donors whom they are planning to solicit. They look at their past giving history as shown in the next thumbnail:

Prospective Donors - Before Applying Model

Knowing this information, we want to multiply those values by their respective coefficients. Take a look at the formula in cell F7 as we do just that, in the next thumbnail:

Forecasting with Regression Output

Note how the cell numbers containing the coefficients have their column letters enveloped in ‘$’. The dollar signs tell Excel that when we copy the formula down the next four rows, that it still reference those cells. Otherwise, for each row down, Excel would multiply each blank cell below the coefficients by the next donor’s information.  Here’s are the forecasts generated:

Forecasts made with Model

Next Forecast Friday Topic: Multicollinearity

Today you learned how to develop regression models using Excel and how to use Excel to interpret the output. You also found out that our model exhibited multicollinearity, a violation of one of the key regression assumptions. Next week and the week after, we will discuss multicollinearity in depth: how to detect it, how to correct it, and when to live with it. We will again be using the nonprofit’s model.  As I’ve said before, models are far from perfect and, as such, should only aid – not replace – the decision-making process.

 1 Note: Excel is a registered trademark of Microsoft Corporation. Use of Microsoft Excel in this post is intended only for a demonstration of how to use Excel for regression analysis and does not constitute an endorsement of Microsoft Excel or any other Microsoft product by Analysights, LLC.

*************************

If you Like Our Posts, Then “Like” Us on Facebook!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page!

Tags: , , , , , , , , , , , , , , , , , , , , ,

2 Responses to “Forecast Friday Topic: Building Regression Models With Excel”

  1. Forecast Friday Topic: Multicollinearity – How to Detect it; How to Correct it « Insight Central Says:

    […] last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history […]

  2. Thanks for a Great 2010! « Insight Central Says:

    […] Forecast Friday Topic: Building Regression Models With Excel July 2010 1 comment […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: