
Identifying Outliers in a Data Set

September 14, 2010

Last week, we talked about what to do when your data set has records with missing or null values. Another problem that crops up in data sets is extreme values, commonly known as outliers. Like missing data, outliers can wreak havoc with your statistical models and analyses, especially in regression analysis, which places greater weight on extreme values. Today, we’re going to talk about diagnosing outliers in your data and, tomorrow, we will discuss what to do about them.

Outliers occur in two ways – naturally and erroneously. Naturally, because not everybody or every phenomenon is typical. There are a small number of people who are much taller than most other persons and a small number who are much shorter; one or two gamblers at a casino may have a much larger roulette win than most other players; a few light bulbs may last many more (or far fewer) hours than most other bulbs of the same brand. These natural examples are rare, but can happen.

Outliers also occur because of error. Sometimes when entering data, we misplace a decimal point, or enter an extra zero at the end of a number, or transpose numbers. It is important to verify that all information is collected and recorded properly.

Diagnosing Outliers

There are a couple of ways to check data for outliers. These include:

Visually Inspect Data

Plot your data on a chart or graph. Do some points stand out from the “crowd”? If so, what are those records? Can you verify that they were entered correctly?

Automatically Minimize Exposure to Outliers

One way to check for outliers is to assume you’ll have some and adjust your data accordingly. You may say that a set percentage (say 1% to 5%) of your data on both ends is an outlier and then either remove those observations, or set a floor or ceiling based on the remaining data. For example, if you have 1,000 records in your data set and you assume that 1% on both ends is an outlier, you can either remove the bottom and top 10 observations from your analysis, or you can change the values of the bottom 10 to the value of the 11th lowest and those of the top 10 to that of the 11th highest value.

The problem here is that this approach is arbitrary and doesn’t take into account the uniqueness of each individual data set. Moreover, if you decided to delete those bottom and top records, you lose information. I don’t particularly recommend this approach, but in the interest of expediency it may be helpful.
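For readers who want to see this in code, here is a minimal Python/NumPy sketch of the approach just described; the 1% cutoff, the function name, and the choice between capping and removing are all illustrative assumptions, not a standard routine.

```python
# A minimal sketch of the "set percentage" approach: treat the lowest
# and highest pct of observations as outliers, then either cap them at
# the nearest remaining value (winsorize) or drop them (trim).
import numpy as np

def trim_or_winsorize(values, pct=0.01, winsorize=True):
    values = np.asarray(values, dtype=float)
    lower, upper = np.quantile(values, [pct, 1 - pct])
    if winsorize:
        return np.clip(values, lower, upper)              # floor/ceiling the extremes
    return values[(values >= lower) & (values <= upper)]  # remove the extremes
```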

Parametric Extremity

In parametric extremity, we use the data set’s parameters to determine how far a particular value diverges from the center of the data set’s distribution. The natural center of the distribution is the mean; the measure of divergence is the standard deviation. When data are normally distributed, virtually all observations lie within three standard deviations of the mean (in either direction). Hence, we may set a rule that an outlier is any value at least three standard deviations above or below the mean.

This approach also has some drawbacks. The mean and standard deviation are computed from all values, including the outliers themselves. Outliers therefore pull the mean toward them and inflate the standard deviation, biasing the very criteria used to judge whether a value is an outlier. In effect, outliers make extreme values look less extreme and more likely to be accepted as normal.
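A comparable sketch of the three-standard-deviation rule is below; the function name and the multiplier k are my own, and note that the mean and standard deviation it relies on are computed from data that may already contain the outliers.

```python
# Flag values more than k sample standard deviations from the mean.
import numpy as np

def flag_outliers_sd(values, k=3.0):
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sd = values.std(ddof=1)                 # ddof=1: sample standard deviation
    return np.abs(values - mean) > k * sd   # True where the value is extreme
```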

Non-Parametric Extremity

Another approach to measuring divergence is through non-parametric methods. The concept is the same, but the center is the median, and divergence is measured by the inter-quartile range (IQR). You order your data set and break it into four equal parts: the lowest 25% is your first quartile; the next 25% is your second quartile (whose upper bound is the median); and so on. Anything above the top of the third quartile or below the top of the first quartile is reviewed as a possible outlier.

If done haphazardly, non-parametric extremity gives you the same problem as declaring a set percentage on both ends to be outliers. To avoid this drawback, again inspect the points that fall outside the middle two quartiles (the IQR). Those closest to the bounds of the IQR can remain in your data set; those far away should be checked for accuracy and, even if accurate, may need to be adjusted or removed.
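A sketch of the IQR-based check might look like this; the fence parameter is an assumption of mine – set to 0 it flags everything outside the middle two quartiles for review, as described above, while 1.5 gives the widely used Tukey fences.

```python
# Flag values outside the quartile-based fences for review.
import numpy as np

def flag_outliers_iqr(values, fence=0.0):
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - fence * iqr, q3 + fence * iqr
    return (values < lower) | (values > upper)   # candidates to inspect
```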

These are just a few of the ways you can identify outliers in your data set. Frequently, classifying a value as an outlier is a judgment call, and diagnosis and correction are two separate events. How you diagnose outliers is just as important to the integrity of your analysis as how you deal with those outliers.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.


Using Statistics to Evaluate a Promotion

May 25, 2010

Marketing – as much as cash flow – is the lifeblood of any business. No matter how good your product or service may be, it’s worthless if you can’t get it in front of your customers and get them to buy it. So all businesses, large and small, must engage in marketing. And we see countless types of marketing promotions or tactics being tried: radio and TV commercials, magazine and newspaper advertisements, public relations, coupons, email blasts, and so forth. But are our promotions working? The merchant John Wanamaker, often dubbed the father of modern advertising, is said to have remarked, “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”

Some basic statistics can help you evaluate the effectiveness of your marketing and take away much of the mystique Wanamaker complained about. When deciding whether to run a promotion, managers and business owners have no way of knowing in advance whether it will succeed – and in today’s economy, budgets are tight. The cost to roll out a full promotion can wipe out an entire marketing budget if it proves to be a fiasco. This is why many businesses run a test before doing a complete rollout. Testing helps reduce the amount of uncertainty involved in an all-out campaign.

Quite often, large companies need to choose between two or more competing campaigns for rollout. But how do they know which will be effective? Consider the example of Jenny Kaplan, owner of K-Jen, a New Orleans-style restaurant. K-Jen serves up a tasty jambalaya entrée, which is priced at $10.00. Jenny believes that the jambalaya is a draw to the restaurant and believes that by offering a discount, she can increase the average amount of the table check. Jenny decides to issue coupons via email to patrons who have opted-in to receive such promotions. She wants to knock a dollar off the price of the jambalaya as the offer, but doesn’t know whether customers would respond better to an offer worded as “$1.00 off” or as “10% off.” So, Jenny decides to test the two concepts.

Jenny goes to her database of nearly 1,000 patrons and randomly selects 200 patrons. She decides to send half of those a coupon for $1.00 off for jambalaya, and the other half a coupon for 10% off. When the coupon offer expires 10 days later, Jenny finds that 10 coupons were redeemed for each offer – a redemption rate of 10% each. Jenny observes that either wording will get the same number of people to respond. But she wonders which offer generated the largest table check. So she looks at the guest checks to which the coupons were stapled. She notices the following:

Guest Check Amounts

$1.00 Off    10% Off
$38.85       $50.16
$36.97       $54.44
$35.94       $32.20
$54.17       $32.69
$68.18       $51.09
$49.47       $46.18
$51.39       $57.72
$32.72       $44.30
$22.59       $59.29
$24.13       $22.94

Jenny quickly computes the average for each offer. The “$1.00 off” coupon generated an average table check of $41.44; the “10% off” coupon generated an average of $45.10. At first glance, it appears that the 10% off promotion generated a higher guest check. But is that difference meaningful, or is it due to chance? Jenny needs to do further analysis.
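If you would like to follow along in Python, the two samples and their means are easy to reproduce (the variable names are mine):

```python
# The guest-check amounts from the table above, and their sample means.
dollar_off = [38.85, 36.97, 35.94, 54.17, 68.18,
              49.47, 51.39, 32.72, 22.59, 24.13]   # "$1.00 off" checks
pct_off = [50.16, 54.44, 32.20, 32.69, 51.09,
           46.18, 57.72, 44.30, 59.29, 22.94]      # "10% off" checks

mean_1 = sum(dollar_off) / len(dollar_off)   # ≈ 41.44
mean_2 = sum(pct_off) / len(pct_off)         # ≈ 45.10
```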

Hypothesis Testing

How does Jenny determine whether the 10% off coupon really did better than the $1.00 off coupon? She can use statistical hypothesis testing, a structured analytical method for comparing the difference between two groups – in this case, two promotions. Jenny starts her analysis by formulating two hypotheses: a null hypothesis, which states that there is no difference in the average check amount between the two offers, and an alternative hypothesis, which states that there is, in fact, a difference. The null hypothesis is often denoted H0, and the alternative hypothesis HA. Jenny refers to the $1.00 off offer as Offer #1 and the 10% off offer as Offer #2, and she wants to compare their mean table checks, denoted μ1 and μ2, respectively. Jenny writes down her two hypotheses:

H0: The average guest check amount for the two offers is equal.

HA: The average guest check amount for the two offers is not equal.

Or, more succinctly:

H0: μ1 = μ2

HA: μ1≠μ2

 

Now, Jenny is ready to go to work. Note that the symbol μ denotes the population mean she wants to measure. Because Jenny ran her test on a portion – a sample – of her database, the averages she computed are sample averages, denoted x̄. As we stated earlier, the average table checks for the “$1.00 off” and “10% off” offers were x̄1 = $41.44 and x̄2 = $45.10, respectively. Jenny must approximate μ using x̄. She must also compute the sample standard deviation, or s, for each offer.

Computing the Sample Standard Deviation

To compute the sample standard deviation, Jenny must subtract the offer’s mean from each of its check amounts in the sample; square each difference; sum the squares; divide by the number of observations minus one (here, 10 – 1 = 9); and then take the square root – that is, s = √[ Σ(x – x̄)² / (n – 1) ]:

$1.00 Off

Actual Table Check   Average Table Check   Difference   Difference Squared
$38.85               $41.44                -$2.59       $6.71
$36.97               $41.44                -$4.47       $19.99
$35.94               $41.44                -$5.50       $30.26
$54.17               $41.44                $12.73       $162.03
$68.18               $41.44                $26.74       $714.97
$49.47               $41.44                $8.03        $64.46
$51.39               $41.44                $9.95        $98.98
$32.72               $41.44                -$8.72       $76.06
$22.59               $41.44                -$18.85      $355.36
$24.13               $41.44                -$17.31      $299.67

Total of squared differences: $1,828.50
S²1 = $1,828.50 / 9 = $203.17
S1 = √$203.17 = $14.25
 

10% Off

Actual Table Check   Average Table Check   Difference   Difference Squared
$50.16               $45.10                $5.06        $25.59
$54.44               $45.10                $9.34        $87.22
$32.20               $45.10                -$12.90      $166.44
$32.69               $45.10                -$12.41      $154.03
$51.09               $45.10                $5.99        $35.87
$46.18               $45.10                $1.08        $1.16
$57.72               $45.10                $12.62       $159.24
$44.30               $45.10                -$0.80       $0.64
$59.29               $45.10                $14.19       $201.33
$22.94               $45.10                -$22.16      $491.11

Total of squared differences: $1,322.63
S²2 = $1,322.63 / 9 = $146.96
S2 = √$146.96 = $12.12

 

Notice the notation S². That quantity is known as the variance; the variance and the standard deviation measure how far, on average, each data point lies from the mean. When data are normally distributed, about 95% of all observations fall within two standard deviations of the mean (more precisely, 1.96 standard deviations). Hence, approximately 95% of the guest checks for the $1.00 off offer should fall within $41.44 ± 1.96 × $14.25, or between $13.51 and $69.37. All ten fall within this range. For the 10% off offer, about 95% should fall within $45.10 ± 1.96 × $12.12, or between $21.34 and $68.86. All ten of those observations also fall within this range.
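Continuing the Python example from above, the sample standard deviations and the rough 95% ranges can be reproduced as follows (a sketch reusing the lists and means from the earlier snippet):

```python
# Reproducing the sample standard deviations and the rough 95% ranges
# (mean ± 1.96 standard deviations). statistics.stdev divides by n - 1,
# just as Jenny did.
import statistics

s1 = statistics.stdev(dollar_off)   # ≈ 14.25
s2 = statistics.stdev(pct_off)      # ≈ 12.12

range_1 = (mean_1 - 1.96 * s1, mean_1 + 1.96 * s1)   # ≈ (13.51, 69.37)
range_2 = (mean_2 - 1.96 * s2, mean_2 + 1.96 * s2)   # ≈ (21.34, 68.86)
```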

Degrees of Freedom and Pooled Standard Deviation

Jenny noticed two things immediately: first, that the 10% off coupon has the higher sample average, and second, that each individual table check is closer to its mean than is the case for the $1.00 off coupon. Also notice that when computing the sample standard deviation for each offer, Jenny divided by 9, not 10. Why? Because she was estimating the population standard deviation. Since samples are subject to error, we must account for that. Each observation gives us information about the population’s actual values; however, Jenny had to base her estimate on the sample itself, so she gives up one observation’s worth of information – that is, she loses a degree of freedom. In this example, Jenny has 20 total observations; since she estimated the population standard deviation for both offers, she loses two degrees of freedom, leaving her with 18 (10 + 10 – 2).

Knowing the remaining degrees of freedom, Jenny must pool the two samples’ variability, weighting each offer’s variance by its degrees of freedom – a weighting that matters most when the sample sizes of the two offers are not equal. The pooled variance (whose square root is the pooled standard deviation) is given by:

S²p = [ (n1 – 1) × S²1 + (n2 – 1) × S²2 ] / (n1 + n2 – 2)

FYI – n is simply each offer’s sample size (10 in both cases here). Jenny then computes the pooled variance:

S²p = ((9 × $203.17) + (9 × $146.96)) / (10 + 10 – 2)

= ($1,828.53 + $1,322.64)/18

= $3,151.17/18

= $175.07

Now take the square root: √$175.07 ≈ $13.23.

Hence, the pooled standard deviation is $13.23.
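In Python, the pooling step looks like this (a sketch reusing s1, s2, and the sample lists from the earlier snippets):

```python
# Pooling the two sample variances, each weighted by its degrees of
# freedom (n - 1 = 9 per offer), as in the calculation above.
n1, n2 = len(dollar_off), len(pct_off)
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # ≈ 175.07
pooled_sd = pooled_var ** 0.5                                       # ≈ 13.23
```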

Computing the t-Test Statistic

Now the fun begins. Jenny knows the sample mean of each offer; she knows the hypothesized difference between the two population means (zero, since the null hypothesis says they are equal); she knows the pooled standard deviation; she knows the sample sizes; and she knows the degrees of freedom. Jenny must now calculate the t-Test statistic. The t-Test statistic, or t-value, represents the number of estimated standard errors by which the observed difference in sample averages departs from the hypothesized difference in population means. The t-value is computed as follows:

t = [ (x̄1 – x̄2) – (μ1 – μ2) ] / [ Sp × √(1/n1 + 1/n2) ]

So Jenny sets to work computing her t-Test Statistic:

t = (($41.44 – $45.10) – 0) / ($13.23 × √(1/10 + 1/10))

= -$3.66 / ($13.23 × √0.2)

= -$3.66 / ($13.23 × 0.447)

= -$3.66 / $5.92

= -0.62

This t-statistic gives Jenny a basis for testing her hypothesis. Jenny’s t-statistic indicates that the difference in sample table checks between the two offers is 0.62 estimated standard errors below the hypothesized difference of zero. We now need the critical t – the value we look up in a t-distribution table, available in most statistics textbooks and online. Since we are testing at a 95% confidence level and must account for a small sample, the critical t-value is somewhat larger than the 1.96 we would use with a large sample. For 18 degrees of freedom, our critical t is 2.10. The larger the sample size, the closer to 1.96 the critical t would be.

So, does Jenny Accept or Reject her Null Hypothesis (Translation: Is the “10% Off” Offer Better than the “$1.00 Off” Offer)?

Jenny now has all the information she needs to determine whether one offer worked better than the other. What does the critical t of 2.10 mean? If Jenny’s t-statistic is greater than 2.10, or (since one offer can be lower than the other), less than -2.10, then she would reject her null hypothesis, as there is sufficient evidence to suggest that the two means are not equal. Is that the case?

Jenny’s t-statistic is -0.62, which lies between -2.10 and 2.10 – well within the critical bounds. Jenny should not reject H0, since there is not enough evidence to suggest that one offer was better than the other at generating higher table checks. In fact, there is nothing to say that the difference between the two offers is due to anything other than chance.
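To tie the Python example together, here is a sketch that reproduces Jenny’s t-statistic and critical t, then cross-checks them with SciPy’s pooled-variance (equal-variance) two-sample t-test; it reuses the values computed in the earlier snippets.

```python
# Manual t-statistic and critical t, plus SciPy's equal-variance
# two-sample t-test as a cross-check.
from math import sqrt
from scipy import stats

t_manual = (mean_1 - mean_2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))  # ≈ -0.62
t_critical = stats.t.ppf(0.975, df=n1 + n2 - 2)                     # ≈ 2.10

t_scipy, p_value = stats.ttest_ind(dollar_off, pct_off, equal_var=True)
# |t_manual| < t_critical and p_value > 0.05, so we fail to reject H0:
# the data do not show a real difference between the two offers.
```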

What Does Jenny Do Now?

Basically, Jenny can conclude that there is not enough evidence that the “$1.00 off” coupon was any better or worse than the “10% off” coupon at generating higher table check amounts. This does not mean that either hypothesis was proven true or false, just that there was not enough statistical evidence to decide. In this case, we did not accept the null hypothesis, but rather, failed to reject it. Jenny can do a few things:

  1. She can run another test, and see if the same phenomenon holds.
  2. Jenny can accept that both offers work equally well and compare their overall average table checks to those of patrons who ordered jambalaya without a coupon during the time the offer ran; if the coupons generated higher average table checks than full-price orders (using the hypothesis-testing procedure outlined above), then she may choose to roll out a complete promotion using either or both of the offers.
  3. Jenny may decide that neither coupon offer raised average check amounts and choose not to do a full rollout after all.

So Why am I Telling You This?

The purpose of this blog post was to walk you step by step through how you can use a simple tool like the t-test to judge the performance of two promotion concepts. Although a spreadsheet like Excel can run this test in seconds, I wanted to explain the theory in layman’s terms so that you can grasp it and then apply it to your business. Analysights is in the business of helping companies – large and small – succeed at marketing, and this blog post is one ingredient in the recipe for your marketing success. If you would like some assistance in setting up a promotion test or in evaluating the effectiveness of a campaign, feel free to contact us at www.analysights.com.