Notice towards the upper right corner of the image, “Data Analysis” is highlighted in orange. The presence of the Data Analysis icon means that we have activated the Analysis ToolPak Add-In. If it wasn’t there, you would need to activate the Add-In, which you could do very easily by clicking the “Office” button in the top left hand corner, then clicking the “Excel Options” button, which will take you through the process of activating your add-in.
Setting up a Regression – Our Data Set
Generally in direct mail marketing, three components that often determine how much one spends – or whether he/she buys at all – are recency, frequency, and monetary value – known in short as RFM. Generally, the longer it has been since one’s last purchase (recency), the less he/she is likely to spend. Hence, we would expect a minus sign by the coefficient for recency. Also, RFM theorizes that the more frequently one buys, the greater his/her purchase. So, we would expect a plus sign by the coefficient for frequency. Finally, the higher the customer’s average purchases (monetary value), the greater his/her spending, so we would also expect a plus sign here. RFM is also used heavily by nonprofits in their capital and contributor campaigns, since they are often heavily reliant upon direct mail.
In our example here, a local nonprofit decided to test whether each RFM component had a relationship to a donor’s contribution, so it randomly selected 20 donors who contributed to its last appeal. Naturally, our dependent variable, Y, was the Contribution amount. The nonprofit also looked at three independent variables: months since last contribution (X_{1}), times donated in last 12 months (X_{2}), and average contribution over the last 12 months (X_{3}). These independent variables represent recency, frequency, and monetary value, respectively. The table below shows our dataset:
Giving Patterns of 20 Donors |
||||
Donor |
Contribution |
Months Since Last Donation |
Times Donated in last 12 months |
Average Contribution in last 12 months |
1 |
95 |
10 |
1 |
85 |
2 |
110 |
8 |
2 |
95 |
3 |
100 |
10 |
2 |
90 |
4 |
115 |
8 |
3 |
75 |
5 |
100 |
9 |
1 |
95 |
6 |
120 |
6 |
2 |
100 |
7 |
105 |
9 |
1 |
90 |
8 |
125 |
10 |
1 |
125 |
9 |
105 |
9 |
2 |
100 |
10 |
130 |
4 |
3 |
150 |
11 |
135 |
7 |
4 |
125 |
12 |
150 |
2 |
8 |
150 |
13 |
140 |
4 |
3 |
125 |
14 |
155 |
2 |
9 |
140 |
15 |
140 |
2 |
8 |
130 |
16 |
160 |
2 |
10 |
150 |
17 |
145 |
3 |
6 |
135 |
18 |
165 |
1 |
12 |
150 |
19 |
150 |
3 |
4 |
160 |
20 |
170 |
1 |
12 |
140 |
The thumbnail below shows what the data set looks like in Excel:
Running the Regression
To run the regression, we need to select the regression tool from the Analysis ToolPak. We do this by clicking on the Data Analysis Tab. The next thumbnail shows us what we need to do:
After selecting the regression tool, we need to select our independent variables and our dependent variables. It is best to make sure all columns containing your independent variables are adjacent to each other, as they are in columns D, E, and F. Notice that column C from rows 2 to 22 contains our Y-range values (including the column label). In columns D, E, and F, rows 2 through 22 contain their respective X-range values. Notice in the thumbnail how we indicate those column/row positions for Y-range and X-range values.
We also need to decide where to place the regression output and what data we want the output to contain. In the thumbnail below, we choose to have the output placed in a new worksheet, called “Regression Output”, and we also check the box indicating that we want the residuals printed. Also notice that we checked the box “Labels”, so that row 2 won’t be inadvertently added into the model.
Looking at the Output
Now we run the regression and get the following output:
As you can see, cell B5 contains our R^{2}, equal to .933, indicating that 93.3% of the variation in a donor’s contribution amount is explained by changes in recency, frequency, and monetary value. Also, notice the F-statistic in cell E12. It’s a large, strong 73.90, and cell F12 to the right is 0.00, suggesting the model is significant. (Note, the Significance F in cell F12 and the P-Values in cells E17-E20 for each parameter estimate are quick cues to significance. If you’re using a 95% confidence interval – which we are here – then you want those values to be no higher than 0.05).
Now let’s look at each parameter estimate. Cells B17-B20 contain our regression coefficients. We have the following equation:
Contribution Estimate = 87.27 – 1.80 *Months_Since_Last_Donation + 2.45 *Times_Donated_Last_12_Months + 0.35 *Average_Contribution_Last_12_Months
Simplifying, we have:
Contribution Estimate = 87.27 – 1.80*RECENCY + 2.45*FREQUENCY + 0.35*MONETARY_VALUE
Ŷ = 87.27 – 1.80X_{1} + 2.45X_{2} + 0.35X_{3}
Note that even though we opted to display the residuals for each observation, I chose not to show them here. It would have run below the fold, and would have been difficult to see. Besides, for our analysis, we’re not going to worry about residuals right now.
Interpreting the Output
As we can see, each month since a donor’s last contribution reduces his contribution by an average of $1.80, when we hold frequency and monetary value constant. Likewise, for each time a donor has given in the last 12 months, the size of his contribution increases by an average of $2.45, holding the other two variables constant. In addition, each one-dollar increase in a donor’s average contribution increases his contribution by an average of 35 cents. Hence, all of our coefficients have the signs we expect.
T-Statistics and P-Values
Next, we need to look at the t-statistics and P-values. As mentioned above, for a 95% confidence interval, a parameter estimate must have a p-value no greater than 0.05 (or 0.10 for a 90% confidence interval, etc.), in order to be significant. In like manner, for a 95% confidence interval, t-statistics should be values of at least 1.96 (slightly higher for small samples, but 1.96 will work) or less than -1.96 if the coefficient is negative, to be significant:
Parameter |
Coefficient |
T-statistic |
Significant? |
Intercept |
87.27 |
4.32 |
Yes |
Months since Last |
(1.80) |
(1.44) |
No |
Times Donated |
2.45 |
2.87 |
Yes |
Average Contribution |
0.35 |
3.26 |
Yes |
Notice that the coefficient for Months Since Last Donation has a t-statistic of -1.44. It is not significant. Another way to tell whether the parameter estimates are significant is to look at the Lower 95% and Upper 95% values in columns F and G. If the lower and upper 95% confidence interval values for a parameter estimate are both negative or both positive, they are significant. However, if the lower 95% value is negative and the upper 95% is positive (as is the case with Months Since Last Donation), then the parameter estimate is not significant, since its confidence interval range crosses zero. Hence Months Since Last Donation is not significant. Yet, the model still has a 93.3% coefficient of determination. Does this mean we can drop this variable from our regression? Not so fast!
Regression Violation Present!
Generally, when an independent variable we expect to be an important predictor of our dependent variable comes up as statistically insignificant, it is sometimes a sign of multicollinearity. And that is definitely the case with the nonprofit’s model. That will be our topic in next week’s Forecast Friday post.
Forecasting with the Output
Since we’re going to take on multicollinearity next week, let’s pretend our model is A-OK, and generate some forecasts.
We’ll go to our regression output worksheet, select cells A17 through B20, which contain our regression variables and coefficients, and then click Copy (or do a CTRL-C):
Next, let’s paste those coefficients and transpose them in another worksheet. Here’s how to select the “Transpose” option when pasting:
Next, this is what the result of our transpose will be:
Now, the nonprofit organization looks at five prospective donors whom they are planning to solicit. They look at their past giving history as shown in the next thumbnail:
Knowing this information, we want to multiply those values by their respective coefficients. Take a look at the formula in cell F7 as we do just that, in the next thumbnail:
Note how the cell numbers containing the coefficients have their column letters enveloped in ‘$’. The dollar signs tell Excel that when we copy the formula down the next four rows, that it still reference those cells. Otherwise, for each row down, Excel would multiply each blank cell below the coefficients by the next donor’s information. Here’s are the forecasts generated:
Next Forecast Friday Topic: Multicollinearity
Today you learned how to develop regression models using Excel and how to use Excel to interpret the output. You also found out that our model exhibited multicollinearity, a violation of one of the key regression assumptions. Next week and the week after, we will discuss multicollinearity in depth: how to detect it, how to correct it, and when to live with it. We will again be using the nonprofit’s model. As I’ve said before, models are far from perfect and, as such, should only aid – not replace – the decision-making process.
^{1} Note: Excel is a registered trademark of Microsoft Corporation. Use of Microsoft Excel in this post is intended only for a demonstration of how to use Excel for regression analysis and does not constitute an endorsement of Microsoft Excel or any other Microsoft product by Analysights, LLC.
*************************
If you Like Our Posts, Then “Like” Us on Facebook!
Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page!