Analyzing Subgroups of Data

The data available to us has never been more voluminous. Thanks to technology, data about us and our environment are collected almost continuously. When we use a cell phone to call someone else’s cell phone, several pieces of information are collected: the two phone numbers involved in the call; the time the call started and ended; the cell phone towers closest to the two parties; the cell phone carriers; the distance of the call; the date; and many more. Cell phone companies use this information to determine where to increase capacity; refine, price, and promote their plans more effectively; and identify regions with inadequate coverage.

Multiply these different pieces of data by the number of calls in a year, a month, a day – even an hour – and you can easily see that we are dealing with an enormous number of records and observations. While it’s good for decision makers to see what sales, school enrollment, cell phone usage, or any other pattern looks like in total, they are often even more interested in breaking data down into groups to see whether certain groups behave differently. We frequently hear decision makers asking questions like these:

  • How do depositors under age 35 compare with those aged 35-54 and those 55 and over in their choice of banking products?
  • How will voter support for Candidate A differ by race or ethnicity?
  • How does cell phone usage differ between men and women?
  • Does the length or severity of a prison sentence differ by race?

When we break data down into subgroups, we are trying to see whether knowing about these groups adds meaningful information. This helps us customize marketing messages, product packages, pricing structures, and sales channels for different segments of our customers. There are many different ways we can break data down: by region, age, race, gender, income, spending levels; the list is limitless.

To give you an example of how data can be analyzed by groups, let’s revisit Jenny Kaplan, owner of K-Jen, the New Orleans-style restaurant. If you recall from the May 25 post, Jenny tested two coupon offers for her $10 jambalaya entrée: one offering 10% off and another offering $1 off. Even though the savings were the same (10% of $10 is $1), Jenny thought customers would respond differently. As Jenny found, neither offer was better than the other at increasing the average size of the table check. Now, Jenny wants to see if there is a preference for one offer over the other, based on customer age.

Jenny knows that of the 1,000 patrons in her database, about 50% are between the ages of 18 and 35; the rest are older than 35. So Jenny decides to send out 1,000 coupons via email as follows:


                 $1 off    10% off    Total Coupons
18-35               250        250              500
Over 35             250        250              500
Total Coupons       500        500            1,000

Half of Jenny’s customers received one coupon offer and half received the other. As the table above shows, half the people in each age group got one offer and the other half got the other offer. At the end of the promotion period, Jenny received back 200 coupons. She traces the coupon codes back to her database and finds the following pattern:

Coupons Redeemed (Actual)


                    $1 off    10% off    Coupons Redeemed
18-35                   35         65                 100
Over 35                 55         45                 100
Coupons Redeemed        90        110                 200

Exactly 200 coupons were redeemed, 100 from each age group. But notice something else: of the 200 people redeeming the coupon, 110 redeemed the coupon offering 10% off; just 90 redeemed the $1 off coupon. Does this mean the 10% off coupon was the better offer? Not so fast!

What Else is the Table Telling Us?

Look at each age group. Of the 100 customers aged 18-35, 65 redeemed the 10% off coupon; but of the 100 customers over 35, just 45 did. Is that a meaningful difference or just a fluke? Do customers over 35 prefer an offer of $1 off to one of 10% off? There’s one way to tell: a chi-squared test for statistical significance.
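
As a quick check before the formal test, the within-group shares quoted here can be tabulated in a few lines. This is only a minimal sketch, assuming Python; the `observed` dictionary and its labels are illustrative names that hold the redemption counts quoted in the text, not anything K-Jen's database actually exposes.

```python
# Redeemed coupons by age group and offer (counts quoted in the text).
observed = {
    "18-35":   {"$1 off": 35, "10% off": 65},
    "Over 35": {"$1 off": 55, "10% off": 45},
}

# Share of each age group's redemptions that used the 10% off coupon.
share_10pct = {}
for group, counts in observed.items():
    group_total = sum(counts.values())
    share_10pct[group] = counts["10% off"] / group_total
    print(f"{group}: {share_10pct[group]:.0%} redeemed 10% off")

# Column totals across both age groups.
offer_totals = {
    offer: sum(counts[offer] for counts in observed.values())
    for offer in ["$1 off", "10% off"]
}
print(offer_totals)
```

The shares differ by 20 percentage points between the two groups, but the shares alone cannot say whether a gap that size could arise by chance with only 100 redemptions per group.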

The Chi-Squared Test

Generally, a chi-squared test is useful for determining whether categories are associated with observed results. The chi-squared (χ²) statistic is the value needed to determine statistical significance. To compute χ², Jenny needs to know two things: the actual frequency distribution of the coupons redeemed (shown in the last table above), and the expected frequencies.

Expected frequencies are the counts you would expect in each cell if the two categories were unrelated, that is, if age group had no bearing on which coupon was redeemed. In this case, we have two equal-sized groups: customers age 18-35 and customers over 35. Knowing nothing besides the fact that the same number of people in each group redeemed coupons, that 110 of them redeemed the 10% off coupon, and that 90 redeemed the $1 off coupon, we would expect 55 customers in each group to redeem the 10% off coupon and 45 in each group to redeem the $1 off coupon. Hence, in our expected frequencies, 55% of the customers in each group redeem the 10% off offer, the same share as in the total. Jenny’s expected frequencies are:

Coupons Redeemed (Expected)


                    $1 off    10% off    Coupons Redeemed
18-35                   45         55                 100
Over 35                 45         55                 100
Coupons Redeemed        90        110                 200

As you can see, the totals for each row and column match those in the actual frequency table above. The mathematical way to compute the expected frequency for each cell is to multiply its column total by its row total and then divide the product by the total number of observations. So, we would compute as follows:

Frequency of:

18-35 redeeming $1 off: =(100*90)/200 = 45
18-35 redeeming 10% off: =(100*110)/200 = 55
Over 35 redeeming $1 off: =(100*90)/200 = 45
Over 35 redeeming 10% off: =(100*110)/200 = 55
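
The same row-total times column-total rule can be applied to every cell at once. The following is a rough sketch in Python, not Jenny's actual method; the variable names (`observed`, `row_totals`, `col_totals`, `expected`) are illustrative, and the counts are the redemption figures from the article.

```python
# Observed redemptions: rows are age groups, columns are offers.
row_labels = ["18-35", "Over 35"]
col_labels = ["$1 off", "10% off"]
observed = [
    [35, 65],  # 18-35
    [55, 45],  # Over 35
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for a cell = row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(row_labels, expected):
    print(label, dict(zip(col_labels, row)))
```

Because the expected counts are built only from the row and column totals, the same few lines would work unchanged for a table with more age groups or more coupon types.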



Now that Jenny knows the expected frequencies, she must determine the critical χ² statistic for significance, and then compute the χ² statistic for her data. If the latter is greater than the critical χ² statistic, Jenny can conclude that the customer’s age group is associated with the coupon offer redeemed.

Determining the Critical χ² Statistic

To find out what her critical χ² statistic is, Jenny must first determine the degrees of freedom in her data. For cross-tabulation tables, the number of degrees of freedom is a straightforward calculation:

Degrees of freedom = (# of rows – 1) * (# of columns – 1)

Jenny has two rows of data and two columns, so she has (2-1)*(2-1) = 1 degree of freedom. With this information, Jenny grabs her old college statistics book and looks at the χ² distribution table in the appendix. At the 95% confidence level (a 5% significance level) with one degree of freedom, her critical χ² statistic is 3.84. When Jenny calculates the χ² statistic from her frequencies, she will compare it with this critical value. If Jenny’s χ² statistic is greater than the critical value, she will conclude that the difference is statistically significant and that age does relate to which coupon offer is redeemed.
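
For readers without a statistics book at hand, the table lookup can be approximated numerically. The sketch below is one possible approach using only Python's standard library: it evaluates the chi-squared cumulative distribution with a series expansion and then searches for the value that leaves 5% in the upper tail. The function names `chi2_cdf` and `chi2_critical` are hypothetical helpers written for this illustration, not part of any library.

```python
import math

def chi2_cdf(x, df):
    """Chi-squared CDF, computed with the series expansion of the
    regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_critical(df, confidence=0.95):
    """Find the value whose CDF equals `confidence`, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < confidence:
        hi *= 2.0  # widen the bracket until it contains the answer
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rows, cols = 2, 2
df = (rows - 1) * (cols - 1)
critical = chi2_critical(df)
print(df, round(critical, 2))  # 1 3.84
```

Statistical packages generally provide this lookup directly, so a hand-rolled version like this is mainly useful for seeing where the 3.84 in the printed table comes from.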

Calculating the χ² Value From Observed Frequencies

Now, Jenny needs to compare the actual number of coupons redeemed for each group to the expected number. To compute her χ² value, Jenny follows a simple formula: for each cell, she subtracts the expected frequency from the actual frequency, squares the difference, and divides the result by the expected frequency. Then she sums the results across all cells to get her χ² value:


           $1 off                    10% off
18-35      =(35-45)^2/45 = 2.22      =(65-55)^2/55 = 1.82
Over 35    =(55-45)^2/45 = 2.22      =(45-55)^2/55 = 1.82

Summing the four cells (2.22 + 1.82 + 2.22 + 1.82) gives Jenny a χ² value of 8.08, much higher than the critical value of 3.84, indicating that there is indeed an association between age and coupon redemption.
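
The cell-by-cell arithmetic can be reproduced with a short script. This is a minimal sketch, assuming Python; the names `observed`, `expected`, `contributions`, and `CRITICAL_95_DF1` are illustrative, and the numbers are the actual and expected frequencies and the 3.84 critical value given in the article.

```python
# Actual and expected redemptions: rows are 18-35 and Over 35,
# columns are $1 off and 10% off.
observed = [[35, 65], [55, 45]]
expected = [[45.0, 55.0], [45.0, 55.0]]

# Critical value at 95% confidence with 1 degree of freedom (from the article).
CRITICAL_95_DF1 = 3.84

# Per-cell contribution: (actual - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi2 = sum(sum(row) for row in contributions)
significant = chi2 > CRITICAL_95_DF1
print(round(chi2, 2), significant)  # 8.08 True
```

Note that some statistical libraries apply a continuity correction to 2x2 tables by default, so a packaged chi-squared routine may report a somewhat smaller statistic than the uncorrected 8.08 computed here.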

Interpreting the Results

Jenny concludes that patrons over the age of 35 are more inclined than patrons age 18-35 to take advantage of a coupon stating $1 off; patrons age 18-35 are more inclined to prefer the 10% off coupon. The way Jenny uses this information depends on the objectives of her business. If Jenny feels that K-Jen needs to attract more middle-aged and senior patrons, she should use the $1 off coupon when targeting them. If Jenny feels K-Jen isn’t selling enough jambalaya, then she might try to stimulate demand by couponing, sending the $1 off coupon to patrons over the age of 35 and the 10% off coupon to those 18-35.

Jenny might even have a counterintuitive use for the information. If most of K-Jen’s regular patrons are over age 35, they may already be loyal customers. Jenny might still send them coupons, but give the 10% off coupon instead. Why? These customers are likely to buy the jambalaya anyway, so why not give them the coupon they are less likely to redeem? After all, why give someone a discount if they’re going to buy anyway? Giving the 10% off coupon to these customers does two things. First, it shows them that K-Jen still cares about their business and keeps them aware of K-Jen as a dining option. Second, by using the lower-redeeming coupon, Jenny can reduce her exposure to subsidizing loyal customers. In this instance, Jenny uses the coupons for advertising and promoting awareness, rather than moving orders of jambalaya.

There are several more ways to analyze data by subgroup, some of which will be discussed in future posts. It is important to remember that your research objectives dictate the information you collect, which in turn dictates the appropriate analysis to conduct.


2 Responses to “Analyzing Subgroups of Data”

  1. Bridget Says:

    Great article. How would a small business owner leverage this information? It is very technical, yet necessary. Can it be processed into user friendly terms for application?

    • analysights Says:

      Thanks Bridget.

      Without getting into too much math, the three things a business owner needs to know would be 1) the actual number of respondents in each group for each treatment; 2) the expected number of respondents in each group for each treatment; and 3) the critical chi-squared value that marks significance.

      Say, for instance, you flipped a coin 50 times, and 30 times it came up heads. Your actual occurrence of heads is 60%; but your expected occurrence would be 25 times – or 50%, since there are just two equally likely outcomes. In like manner, when the business owner gets his/her actual results, he/she should assess the expected number of responses for each group, and then compare the two. He/she would do this by subtracting the expected occurrences from the actual for each response group to each treatment, squaring each result, dividing it by the expected occurrences, and then adding them all up.

      The key to avoiding a trip back to the statistics book is that most business cases require only a 95% confidence level, and a lot of small businesses are not going to be analyzing more than just a few treatments against a few subgroups. If, for instance, Jenny Kaplan in our example were testing three types of coupons and five age groups, that’s only 8 degrees of freedom: (5-1)*(3-1). Rarely are small businesses going to even need that many.

      Your critical chi-squared for 95% confidence, as a rule of thumb, will be: 1 d.f. = around 4; 2 d.f. = 6; 3 d.f. = 7.8; 4 d.f. = 9.5; and then add about 1.5 to your critical chi-squared statistic for each additional degree of freedom. By the time you reach 10 d.f., the rule gives 18.5, against an actual value of 18.3.

      All the business owner needs to do then is compare his/her calculated chi-squared value with the critical chi-squared, and if the calculated value is higher, then the differences are significant. If the two are close, the business owner should re-test.
