The data available to us has never been more voluminous. Thanks to technology, data about us and our environment are collected almost continuously. When we use a cell phone to call someone else’s cell phone, several pieces of information are collected: the two phone numbers involved in the call; the time the call started and ended; the cell phone towers closest to the two parties; the cell phone carriers; the distance of the call; the date; and many more. Cell phone companies use this information to determine where to increase capacity; refine, price, and promote their plans more effectively; and identify regions with inadequate coverage.
Multiply these different pieces of data by the number of calls in a year, a month, a day – even an hour – and you can easily see that we are dealing with enormous amounts of records and observations. While it’s good for decision makers to see what sales, school enrollment, cell phone usage, or any other pattern looks like in total, quite often they are even more interested in breaking down data into groups to see if certain groups behave differently. Quite often we hear decision makers asking questions like these:
 How do depositors under age 35 compare with those between 35-54 and 55 & over in their choice of banking products?
 How will voter support for Candidate A differ by race or ethnicity?
 How does cell phone usage differ between men and women?
 Does the length or severity of a prison sentence differ by race?
When we break data down into subgroups, we are trying to see whether knowing about these groups adds any additional meaningful information. This helps us customize marketing messages, product packages, pricing structures, and sales channels for different segments of our customers. There are many different ways we can break data down: by region, age, race, gender, income, spending levels; the list is limitless.
To give you an example of how data can be analyzed by groups, let’s revisit Jenny Kaplan, owner of KJen, the New Orleans-style restaurant. If you recall from the May 25 post, Jenny tested two coupon offers for her $10 jambalaya entrée: one offering 10% off and another offering $1 off. Even though the savings were the same, Jenny thought customers would respond differently. As Jenny found, neither offer was better than the other at increasing the average size of the table check. Now, Jenny wants to see if there is a preference for one offer over the other, based on customer age.
Jenny knows that of her 1,000-patron database, about 50% are between the ages of 18 and 35; the rest are older than 35. So Jenny decides to send out 1,000 coupons via email as follows:
Age Group | $1 off | 10% off | Total Coupons
18-35 | 250 | 250 | 500
Over 35 | 250 | 250 | 500
Total Coupons | 500 | 500 | 1,000
Half of Jenny’s customers received one coupon offer and half received the other. Looking carefully at the table above, half the people in each age group got one offer and the other half got the other offer. At the end of the promotion period, Jenny received back 200 coupons. She tracks the coupon codes back to her database and finds the following pattern:
Coupons Redeemed (Actual)
Age Group | $1 off | 10% off | Coupons Redeemed
18-35 | 35 | 65 | 100
Over 35 | 55 | 45 | 100
Coupons Redeemed | 90 | 110 | 200
Exactly 200 coupons were redeemed, 100 from each age group. But notice something else: of the 200 people redeeming the coupon, 110 redeemed the coupon offering 10% off; just 90 redeemed the $1 off coupon. Does this mean the 10% off coupon was the better offer? Not so fast!
What Else is the Table Telling Us?
Look at each age group. Of the 100 customers aged 18-35, 65 redeemed the 10% off coupon; but of the 100 customers over 35, just 45 did. Is that a meaningful difference or just a fluke? Do persons over 35 prefer an offer of $1 off to one of 10% off? There’s one way to tell: a chi-squared test for statistical significance.
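To make the comparison concrete, here is a minimal Python sketch that turns the redeemed counts into shares within each age group. The dictionary layout and the function name `share_by_offer` are illustrative assumptions; only the counts come from Jenny’s table.

```python
# Redeemed coupon counts from Jenny's "actual" table.
redeemed = {
    "18-35": {"$1 off": 35, "10% off": 65},
    "Over 35": {"$1 off": 55, "10% off": 45},
}


def share_by_offer(counts):
    """Return each offer's share of the coupons redeemed within one age group."""
    total = sum(counts.values())
    return {offer: n / total for offer, n in counts.items()}


# Hypothetical helper structure: shares for every age group.
shares = {group: share_by_offer(counts) for group, counts in redeemed.items()}

for group, group_shares in shares.items():
    for offer, share in group_shares.items():
        print(f"{group:8s} {offer:8s} {share:.0%}")
```

The shares alone show a 20-point gap on the 10% off coupon between the two groups; they do not say whether that gap is larger than chance would produce.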
The Chi-Squared Test
Generally, a chi-squared test is useful in determining associations between categories and observed results. The chi-squared – χ^{2} – statistic is the value needed to determine statistical significance. In order to compute χ^{2}, Jenny needs to know two things: the actual frequency distribution of the coupons redeemed (which is shown in the last table above), and the expected frequencies.
Expected frequencies are the counts you would expect to see in each cell, based on probability, if age group and coupon offer were unrelated. In this case, we have two equal-sized groups: customers age 18-35 and customers over 35. Knowing nothing else besides the fact that the same number of people in these groups redeemed coupons, and that 110 of them redeemed the 10% off coupon, and 90 redeemed the $1 off coupon, we would expect that 55 customers in each group would redeem the 10% off coupon and 45 in each group would redeem the $1 off coupon. Hence, in our expected frequencies, we still expect 55% of the customers in each group to redeem the 10% off offer. Jenny’s expected frequencies are:
Coupons Redeemed (Expected)
Age Group | $1 off | 10% off | Coupons Redeemed
18-35 | 45 | 55 | 100
Over 35 | 45 | 55 | 100
Coupons Redeemed | 90 | 110 | 200
As you can see, the totals for each row and column match those in the actual frequency table above. The mathematical way to compute the expected frequencies for each cell would be to multiply its corresponding column total by its corresponding row total and then divide it by the total number of observations. So, we would compute as follows:
Frequency of: | Formula: | Result
18-35 redeeming $1 off | =(100*90)/200 | =45
18-35 redeeming 10% off | =(100*110)/200 | =55
Over 35 redeeming $1 off | =(100*90)/200 | =45
Over 35 redeeming 10% off | =(100*110)/200 | =55
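The row-total times column-total rule can be sketched in a few lines of Python. This is an illustration rather than part of Jenny’s workflow; the function name `expected_frequencies` and the list-of-lists layout (rows are age groups, columns are offers) are assumptions made for the example.

```python
# Observed counts: rows are age groups (18-35, Over 35),
# columns are offers ($1 off, 10% off).
observed = [
    [35, 65],
    [55, 45],
]


def expected_frequencies(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [row_total * col_total / grand_total for col_total in col_totals]
        for row_total in row_totals
    ]


expected = expected_frequencies(observed)
print(expected)
```

Because both age groups redeemed 100 coupons, both rows of the expected table come out the same: 45 for $1 off and 55 for 10% off.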
Now that Jenny knows the expected frequencies, she must do two things: look up the critical χ^{2} statistic that marks the threshold for significance, and compute the χ^{2} statistic for her own data. If the latter χ^{2} is greater than the critical χ^{2} statistic, then Jenny knows that the customer’s age group is associated with the coupon offer redeemed.
Determining the Critical χ^{2} Statistic
To find out what her critical χ^{2} statistic is, Jenny must first determine the degrees of freedom in her data. For cross-tabulation tables, the number of degrees of freedom is a straightforward calculation:
Degrees of freedom = (# of rows – 1) * (# of columns – 1)
Jenny has two rows of data and two columns, so she has (2-1)*(2-1) = 1 degree of freedom. With this information, Jenny grabs her old college statistics book and looks at the χ^{2} distribution table in the appendix. For a 95% confidence level with one degree of freedom, her critical χ^{2} statistic is 3.84. When Jenny calculates the χ^{2} statistic from her frequencies, she will compare it with the critical χ^{2} statistic. If Jenny’s χ^{2} statistic is greater than the critical, she will conclude that the difference is statistically significant and that age does relate to which coupon offer is redeemed.
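For readers without a statistics book handy, the lookup can be approximated in plain Python. The sketch below is an assumption-laden illustration, not Jenny’s method: it handles only the one-degree-of-freedom case, where the chi-squared cumulative distribution equals erf(sqrt(x/2)), and finds the critical value by bisection. The function names are made up for the example.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a cross-tabulation table."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_cdf_1df(x):
    """Chi-squared CDF for exactly one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))


def critical_value_1df(confidence=0.95, lo=0.0, hi=50.0, tol=1e-9):
    """Bisection search for the x where the 1-d.f. CDF reaches `confidence`."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_1df(mid) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


df = degrees_of_freedom(2, 2)
critical = critical_value_1df(0.95)
print(df, round(critical, 2))
```

For tables with more than one degree of freedom this shortcut does not apply, and a printed table or a statistics package is the safer route.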
Calculating the χ^{2} Value From Observed Frequencies
Now, Jenny needs to compare the actual number of coupons redeemed for each group to their expected number. Essentially, to compute her χ^{2} value, Jenny follows a particular formula. For each cell, she subtracts the expected frequency of that cell from the actual frequency, squares the difference, and then divides it by the expected frequency. She does this for each cell. Then she sums up her results to get her χ^{2} value:
Age Group | $1 off | 10% off
18-35 | =(35-45)^2/45 = 2.22 | =(65-55)^2/55 = 1.82
Over 35 | =(55-45)^2/45 = 2.22 | =(45-55)^2/55 = 1.82
χ^{2} = 2.22 + 1.82 + 2.22 + 1.82 = 8.08
Jenny’s χ^{2} value is 8.08, much higher than the critical 3.84, indicating that there is indeed an association between age and coupon redemption.
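The cell-by-cell arithmetic can be checked with a short Python sketch. The function name `chi_squared_statistic` is an assumption for illustration; the observed and expected counts and the 3.84 critical value are the ones used in this post.

```python
# Rows are age groups (18-35, Over 35); columns are offers ($1 off, 10% off).
observed = [[35, 65], [55, 45]]
expected = [[45, 55], [45, 55]]

CRITICAL_95_1DF = 3.84  # critical value for 95% confidence, 1 degree of freedom


def chi_squared_statistic(obs, exp):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


chi2 = chi_squared_statistic(observed, expected)
is_significant = chi2 > CRITICAL_95_1DF
print(round(chi2, 2), is_significant)
```

Summing the unrounded cell values gives about 8.08, the same figure Jenny gets from adding the rounded ones.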
Interpreting the Results
Jenny concludes that patrons over the age of 35 are more inclined than patrons age 18-35 to take advantage of a coupon stating $1 off; patrons age 18-35 are more inclined to prefer the 10% off coupon. The way Jenny uses this information depends on the objectives of her business. If Jenny feels that KJen needs to attract more middle-aged and senior citizens, she should use the $1 off coupon when targeting them. If Jenny feels KJen isn’t selling enough jambalaya, then she might try to stimulate demand by couponing, sending the $1 off coupon to patrons over the age of 35 and the 10% off coupon to those 18-35.
Jenny might even have a counterintuitive use for the information. If most of KJen’s regular patrons are over age 35, they may already be loyal customers. Jenny might still send them coupons, but give the 10% off coupon instead. Why? These customers are likely to buy the jambalaya anyway, so why not give them the coupon they are not as likely to redeem? After all, why give someone a discount if they’re going to buy anyway? Giving the 10% off coupon to these customers does two things. First, it shows them that KJen still cares about their business and keeps them aware of KJen as a dining option. Second, by using the lower-redeeming coupon, Jenny can reduce her exposure to subsidizing loyal customers. In this instance, Jenny uses the coupons for advertising and promoting awareness, rather than moving orders of jambalaya.
There are several more ways to analyze data by subgroup, some of which will be discussed in future posts. It is important to remember that your research objectives dictate the information you collect, which in turn dictates the appropriate analysis to conduct.
Tags: actual frequencies, Analysights, categorical data, chi-square, chi-squared, coupons, crosstabs, cross-tabulations, data analysis, degrees of freedom, expected frequencies, observed frequencies, statistical analysis, statistical significance, Statistics, subgroup analysis
July 21, 2010 at 11:13 pm 
Great article. How would a small business owner leverage this information? It is very technical, yet necessary. Can it be processed into user-friendly terms for application?
August 21, 2010 at 11:41 am 
Thanks Bridget.
Without getting into too much math, the three things a business owner needs to know would be 1) the actual number of respondents in each group for each treatment; 2) the expected probability of response to each treatment; and 3) the critical value for significance.
Say, for instance, you flipped a coin 50 times, and 30 times it came up heads. Your actual occurrence of heads is 60%; but your expected occurrence would be 25 times – or 50%, since there are just two possible outcomes. In like manner, when the business owner gets his/her actual results, he/she should assess the expected probability of each group responding, and then compare the two. He/she would do this by subtracting the expected occurrences from the actual for each response group to each treatment, squaring each result, dividing it by the expected occurrences, and then adding the results up.
The key to avoiding a trip back to the statistics book is that most business cases require only a 95% confidence level, and a lot of small businesses are not going to be analyzing more than just a few treatments against a few subgroups. If, for instance, Jenny Kaplan in our example were testing three types of coupons and five age groups, that’s only 8 degrees of freedom: (5-1)*(3-1). Rarely are small businesses going to need even that many.
Your critical chi-squared for 95% confidence, as a rule of thumb, will be: 1 degree of freedom, around 4; 2 d.f., 6; 3 d.f., 7.5; 4 d.f., 9.5; and then add about 1.5 to your critical chi-squared statistic for each additional degree of freedom. By the time you reach 10 d.f., the actual value is 18.3, but the rule of thumb would give you 18.5.
All the business owner needs to do then is compare his/her computed chi-squared value with the critical chi-squared, and if it’s higher, then the differences are significant. If the two are close, the business owner should retest.
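As a rough illustration of the reply above, the coin example and the rule of thumb can be put into a few lines of Python. The function names are hypothetical, and the use of one degree of freedom for the coin (two outcomes minus one) is the standard convention for this kind of test rather than something stated in the reply.

```python
def rule_of_thumb_critical(df):
    """Approximate 95% critical chi-squared, following the rule of thumb above."""
    base = {1: 4.0, 2: 6.0, 3: 7.5, 4: 9.5}
    if df in base:
        return base[df]
    # Beyond 4 d.f., add about 1.5 per additional degree of freedom.
    return 9.5 + 1.5 * (df - 4)


def chi_squared(observed, expected):
    """Sum of (actual - expected)^2 / expected across all outcomes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# 50 coin flips: 30 heads and 20 tails observed, 25 of each expected.
coin_chi2 = chi_squared([30, 20], [25, 25])
coin_significant = coin_chi2 > rule_of_thumb_critical(1)

print(coin_chi2, coin_significant)
print(rule_of_thumb_critical(10))
```

Under these assumptions, 30 heads in 50 flips falls short of the threshold, so the coin would not be judged unfair on this evidence.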