
Identifying Outliers in a Data Set

September 14, 2010

Last week, we talked about what to do when your data set has records with missing or null values. Another problem that crops up in data sets is that of extreme values, commonly known as outliers. Like missing data, outliers can wreak havoc with your statistical models and analyses, especially in regression analysis, which places greater weight on extreme values. Today, we’re going to talk about diagnosing outliers in your data; tomorrow, we will discuss what to do about them.

Outliers occur in two ways – naturally and erroneously. Naturally, because not everybody or every phenomenon is typical. A small number of people are much taller than most others and a small number are much shorter; one or two gamblers at a casino may have a much larger roulette win than most other players; a few light bulbs may last many more (or far fewer) hours than most other bulbs of the same brand. These natural outliers are rare, but they do happen.

Outliers also occur because of error. Sometimes when entering data, we misplace a decimal point, or enter an extra zero at the end of a number, or transpose numbers. It is important to verify that all information is collected and recorded properly.

Diagnosing Outliers

There are several ways to check data for outliers. These include:

Visually Inspect Data

Plot your data on a chart or graph. Do some points stand out from the “crowd”? If so, look up the underlying records. Can you verify that they were entered correctly?
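If you work in a scripting language, a quick plot takes only a few lines. Here is a minimal sketch in Python, assuming the pandas and matplotlib libraries are installed; the numbers are made up purely to show one value standing apart from the rest.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Illustrative data only: one value (120) clearly stands apart from the rest
    values = pd.Series([12, 15, 14, 13, 16, 15, 14, 120])

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.boxplot(values)             # points beyond the whiskers deserve a second look
    ax1.set_title("Box plot")
    ax2.plot(values.values, "o")    # a simple index plot also exposes stragglers
    ax2.set_title("Index plot")
    plt.tight_layout()
    plt.show()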

Automatically Minimize Exposure to Outliers

One way to check for outliers is to assume you’ll have some and adjust your data accordingly. You may say that a set percentage (say 1% to 5%) of your data on both ends is an outlier and then either remove those observations, or set a floor or ceiling based on the remaining data. For example, if you have 1,000 records in your data set and you assume that 1% on both ends is an outlier, you can either remove the bottom and top 10 observations from your analysis, or you can change the values of the bottom 10 to the value of the 11th lowest and those of the top 10 to that of the 11th highest value.

The problem here is that this approach is arbitrary and doesn’t take into account the uniqueness of each individual data set. Moreover, if you decide to delete those bottom and top records, you lose information. I don’t particularly recommend this approach, but in the interest of expediency it may be helpful.
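For reference, statisticians call the two variants of this approach trimming (dropping the extreme records) and winsorizing (capping them). Here is a minimal sketch in Python, assuming pandas and numpy; it caps at the 1st and 99th percentiles rather than at the exact 11th-lowest and 11th-highest values described above, but the idea is the same.

    import numpy as np
    import pandas as pd

    # Illustrative skewed data with 1,000 records
    values = pd.Series(np.random.lognormal(mean=3, sigma=1, size=1000))

    lower, upper = values.quantile(0.01), values.quantile(0.99)

    trimmed = values[(values >= lower) & (values <= upper)]   # drop the extreme 1% on each end
    winsorized = values.clip(lower=lower, upper=upper)        # cap the extremes instead

    print(len(values), len(trimmed))        # roughly 1,000 vs. 980
    print(values.max(), winsorized.max())   # the cap replaces the extreme maximum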

Parametric Extremity

In parametric extremity, we use the data set’s parameters to determine how far a particular value diverges from the center of the data set’s distribution. The obvious center of the distribution is the mean; the measure of divergence is the standard deviation. When data is normally distributed, virtually all observations (about 99.7%) fall within three standard deviations of the mean. Hence, we may set a rule that an outlier is any value lying more than three standard deviations from the mean, in either direction.

This approach also has some drawbacks. The mean and standard deviation are computed from all values, including the outliers themselves. Outliers therefore pull the mean towards them and inflate the standard deviation, biasing the very criteria used for judging whether a value is an outlier: the widened band can mask extreme values and let them pass as acceptable.
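Here is a minimal sketch of the three-standard-deviation rule in Python, assuming pandas; the tiny made-up sample also happens to illustrate the masking problem just described.

    import pandas as pd

    values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

    mean, std = values.mean(), values.std()
    z_scores = (values - mean) / std
    outliers = values[z_scores.abs() > 3]

    # Prints an empty Series: the extreme 95 inflates the standard deviation
    # so much that its own z-score is only about 2.5, so the rule misses it.
    print(outliers)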

Non-Parametric Extremity

Another approach to measuring divergence is non-parametric. The concept is the same, but the center is now the median, and divergence is measured by the inter-quartile range (IQR). You order your data set and break it into four equal parts: the lowest 25% is your first quartile; the next 25% is your second quartile (whose upper bound is the median); and so on. Anything higher than the top of the third quartile or lower than the bottom of the second quartile is reviewed as a potential outlier.

If done haphazardly, non-parametric extremity gives you the same problem as declaring a set percentage on both ends to be outliers. To avoid this drawback, inspect the points that fall outside the IQR rather than adjusting them automatically. Those closest to its outer bounds can remain in your data set; those far away should be checked for accuracy and, even if accurate, may be adjusted or removed.
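Here is a minimal sketch of the non-parametric approach in Python, again assuming pandas. It flags everything outside the IQR for review; a common refinement, not required by the approach above, widens the fences to 1.5 times the IQR beyond the first and third quartiles.

    import pandas as pd

    values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1

    # Everything outside the IQR gets a second look
    review = values[(values < q1) | (values > q3)]

    # The wider "fence" rule flags only the truly extreme values
    extreme = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

    print(review)    # a few mild values plus the 95
    print(extreme)   # only the 95, which the three-standard-deviation rule above missed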

These are just a few of the ways you can identify outliers in your data set. Frequently, classifying a value as an outlier is a judgment call, and diagnosis and correction are two separate events. How you diagnose outliers is just as important to the integrity of your analysis as how you deal with those outliers.



The Challenges of Data Mining

September 7, 2010

For the last five weeks, I have been onsite at a client – a regional commercial bank – building statistical models that will help them predict which noncustomers have a good chance of becoming customers; the best first product – checking account, home equity line, savings account, or CD – to offer them; and the revenue this noncustomer may bring to the bank in his/her first year. Although the project is still ongoing, it’s been one of the most fascinating I’ve worked on in a long time, not only because I love the challenge of building predictive models, but also because it reminds me of the challenges and careful considerations one must take into account when using data for strategic advantage. If you’re exploring data mining for your company’s marketing, finance, or operations efforts, you’ll appreciate these principles I’m about to share with you.

Know What You’re Trying to Learn From the Data

Back in the early 1990s, there was a statistic that the amount of information available to the human race doubled every seven years. With the evolution of mobile technology, the Web, social media, and other “smart” technology, I wouldn’t be surprised if the amount of data now doubles every seven weeks! The constantly increasing amount of data creates new possibilities for analysis, but we must be careful not to overwhelm ourselves with volume just for the sake of analysis. Plenty of data is available to us, but not all of it is relevant to our business purposes.

This is why it is so important to define what you’re trying to learn from the data before you attempt any analysis. One of the challenges we faced at the bank was defining a “new customer” and a “noncustomer”. We had to decide the timeframe within which a “new” customer opened his or her account. If we picked too short a timeframe (say, March through June 2010), we would have too few data points to work with; if we picked too long a timeframe (say, all the way back to 2008), many of our “new” customers wouldn’t be so new, and their banking habits and behaviors might be very different from those of customers who opened their accounts more recently, which could have undesirable results for the model.

Know the Unit of Analysis for Your Project

Many large companies store data at many different levels. Banks have data at the transaction, account, individual, household, ZIP code, census tract, territory, state, and regional levels. What is the appropriate unit of analysis? It depends on your definition of the business problem and the levels of data you have access to. If the bank were trying to demonstrate compliance with lending requirements set by the Community Reinvestment Act (CRA), it would need to analyze data at the census tract level. However, the bank is trying to acquire new adult customers. Since the bank is interested in acquiring any adult within a household, the household was a suitable unit of analysis. Besides, the only data available about prospects was overlay data from a third-party data vendor, which is at the household level.

Sometimes you need data at a level of analysis that you don’t have. Let’s say that you need to analyze at the ZIP code level but only have household data. In that case, you need to roll up or summarize your data to that level – that is, aggregate it to the appropriate level of granularity. But what if you needed data at a lower level of granularity than what you currently have, such as having ZIP code data but needing a household level of analysis? Unless you have a way of segmenting your ZIP code data down to the household level, you either cannot perform the analysis, or you must start from scratch collecting the household-level data.
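As an illustration, here is a minimal sketch of rolling household records up to the ZIP code level in Python, assuming pandas; the column names and figures are hypothetical, not the bank’s actual data.

    import pandas as pd

    # Hypothetical household-level records
    households = pd.DataFrame({
        "zip_code": ["60601", "60601", "60602", "60602", "60602"],
        "income":   [55000, 72000, 48000, 61000, 90000],
        "deposits": [1200, 3400, 800, 2100, 5000],
    })

    # Roll up (aggregate) to the ZIP code level
    by_zip = households.groupby("zip_code").agg(
        households=("income", "size"),
        median_income=("income", "median"),
        total_deposits=("deposits", "sum"),
    )
    print(by_zip)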

Understand the Business Rules Governing Use of Data

Many businesses are subject to laws, regulations, and internal business policies that restrict the activities they perform and the manner in which they perform them. Banks are no exception. In fact, I’d argue that banks are the rule! In running one of the models, I found that a household’s likelihood of acquiring a checking account was much greater if it was a married-couple household. But the bank’s policies forbid it from marketing products on the basis of marital status! So, we had to go back to the drawing board. Fortunately, we had a way of estimating the number of adults in the household, which we used as a proxy for marital status. What was nice about this approach was that it took into account different household dynamics: grown children residing with their parents; an aged parent living in an adult child’s home; domestic partnerships; and cohabiting couples. As a result, we had a predictive model that did not discriminate against particular demographics.

Before you work with any data, it is vital that you talk to the compliance officers who oversee the restrictions on marketing, privacy, and other uses of customer data.

Understand the History of the Data

The history of a business and its practices often shapes the data you will have to analyze. Even if a business no longer engages in a particular practice, those past activities can affect the results you get from your model. A few months before this project with the bank, I was working on a mailing list for another client. I noticed that many of the customers on the list – as many as five percent – were age 98, which was very bizarre. So I looked at the birthdates of these nonagenarians, and almost all of them shared the same one: November 11, 1911. Their birthdates had been populated as 11/11/1111! What happened was that this client had previously required a birthdate to be entered for each of its customers; when data was being entered, many past employees bypassed the requirement by keying eight 1s into the birthdate field. Although the practice of requiring birthdates had long been rescinded, the client’s data still reflected it. Without knowledge of this previous practice, a model based on age could have caused the client to market to a much older group of people than it should have.

In defining “new customers” for the bank, we chose the beginning of 2009 as our earliest point. Why? Because the banking and financial crisis hit in the latter part of 2008. Had we included new customers from early 2008, we would have seen dramatic variances in their banking behaviors, balances, and transactions, which would have had adverse consequences for both our propensity and revenue models.

Accept that Data is Always Dirty

Just because volumes and volumes of data are available doesn’t mean the data is ready to use. The example of the birthdates is one case in point; the example of data granularity is another. Still other problems arise. Some customers and noncustomers might not have any data recorded in a particular field: for some of the bank’s customers and prospects, the third-party overlay data did not contain age, income, or gender information. Across the entire data set, the values in the fields for income, wealth, and home market value were quite spread out – some records had extremely high values in those fields, others extremely low ones. As a result of these extreme and missing values, we needed to make adjustments so that the models would not produce undesirable results or suspect predictions.

For the missing values, we computed the median of all observations in the data set and substituted it. For the extreme values, we did a couple of things. For some fields, we set up bins, such as “$0 – $5,000”, “$5,001 – $10,000”, and so on. For others, we took the natural log. For still others, we computed ratios, such as a household’s savings-to-income ratio. These approaches helped reduce the variation in the data.
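Here is a minimal sketch of these adjustments in Python, assuming pandas and numpy; the column names and figures are hypothetical stand-ins for the bank’s fields.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "income":  [42000, np.nan, 61000, 250000, 38000],
        "savings": [5000, 12000, np.nan, 400000, 2500],
    })

    # 1. Substitute the median for missing values
    df["income"] = df["income"].fillna(df["income"].median())
    df["savings"] = df["savings"].fillna(df["savings"].median())

    # 2. Bin a spread-out field into ranges
    df["savings_bin"] = pd.cut(
        df["savings"],
        bins=[0, 5000, 10000, 50000, np.inf],
        labels=["$0-$5,000", "$5,001-$10,000", "$10,001-$50,000", "Over $50,000"],
    )

    # 3. Take the natural log to pull in extreme values
    df["log_income"] = np.log(df["income"])

    # 4. Compute a ratio, such as savings-to-income
    df["savings_to_income"] = df["savings"] / df["income"]

    print(df)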

Realize that Your Results are Only as Good as Your Assumptions

Anytime you analyze data to make decisions, you are making assumptions. Whenever we use data to construct a model, we assume that the patterns we discover in our data mining effort will hold up for the new customers we seek to acquire. When we analyze time series data, we assume that patterns of the past and present will hold up in the future. When we imputed the median values for missing data, we assumed that those customers and prospects whose data was missing were like the “typical” customer. Making assumptions is a double-edged sword: we need them in order to direct our analyses and planning, yet if our assumptions are mistaken – and to some degree, every assumption is – our models will be useless. That’s why we must be very careful in our presuppositions about data and why we should test the results of our models before fully deploying them.

Analyzing Subgroups of Data

July 21, 2010

The data available to us has never been more voluminous. Thanks to technology, data about us and our environment are collected almost continuously. When we use a cell phone to call someone else’s cell phone, several pieces of information are collected: the two phone numbers involved in the call; the time the call started and ended; the cell phone towers closest to the two parties; the cell phone carriers; the distance of the call; the date; and many more. Cell phone companies use this information to determine where to increase capacity; refine, price, and promote their plans more effectively; and identify regions with inadequate coverage.

Multiply these different pieces of data by the number of calls in a year, a month, a day – even an hour – and you can easily see that we are dealing with enormous numbers of records and observations. While it’s good for decision makers to see what sales, school enrollment, cell phone usage, or any other pattern looks like in total, quite often they are even more interested in breaking the data down into groups to see whether certain groups behave differently. We frequently hear decision makers asking questions like these:

  • How do depositors under age 35 compare with those aged 35-54 and those 55 and over in their choice of banking products?
  • How will voter support for Candidate A differ by race or ethnicity?
  • How does cell phone usage differ between men and women?
  • Does the length or severity of a prison sentence differ by race?

When we break data down into subgroups, we are trying to see whether knowing about these groups adds any additional meaningful information. This helps us customize marketing messages, product packages, pricing structures, and sales channels for different segments of our customers. There are many different ways we can break data down: by region, age, race, gender, income, spending levels; the list is limitless.

To give you an example of how data can be analyzed by groups, let’s revisit Jenny Kaplan, owner of K-Jen, the New Orleans-style restaurant. If you recall from the May 25 post, Jenny tested two coupon offers for her $10 jambalaya entrée: one offering 10% off and another offering $1 off. Even though the savings was the same, Jenny thought customers would respond differently. As Jenny found, neither offer was better than the other at increasing the average size of the table check. Now, Jenny wants to see if there is a preference for one offer over the other, based on customer age.

Jenny knows that of her 1,000-patron database, about 50% are between the ages of 18 and 35; the rest are older than 35. So Jenny decides to send out 1,000 coupons via email as follows:

                   $1 off   10% off   Total Coupons
18-35                 250       250             500
Over 35               250       250             500
Total Coupons         500       500           1,000

Half of Jenny’s customers received one coupon offer and half received the other; as the table above shows, within each age group half the people got one offer and half got the other. At the end of the promotion period, Jenny received back 200 coupons. She tracks the coupon codes back to her database and finds the following pattern:

Coupons Redeemed (Actual)

                      $1 off   10% off   Coupons Redeemed
18-35                     35        65                100
Over 35                   55        45                100
Coupons Redeemed          90       110                200

Exactly 200 coupons were redeemed, 100 from each age group. But notice something else: of the 200 people redeeming the coupon, 110 redeemed the coupon offering 10% off; just 90 redeemed the $1 off coupon. Does this mean the 10% off coupon was the better offer? Not so fast!

What Else is the Table Telling Us?

Look at each age group. Of the 100 customers aged 18-35, 65 redeemed the 10% off coupon; but of the 100 customers over 35, just 45 did. Is that a meaningful difference or just a fluke? Do persons over 35 prefer an offer of $1 off to one of 10% off? There’s one way to tell: a chi-squared test for statistical significance.

The Chi-Squared Test

Generally, a chi-squared test is useful for determining associations between categories and observed results. The chi-squared (χ2) statistic is the value needed to determine statistical significance. In order to compute χ2, Jenny needs to know two things: the actual frequency distribution of the coupons redeemed (shown in the last table above), and the expected frequencies.

Expected frequencies are the frequencies you would expect to see in each cell, based purely on probability, if age had nothing to do with which coupon was redeemed. In this case, we have two equal-sized groups: customers aged 18-35 and customers over 35. Knowing nothing besides the fact that the same number of people in each group redeemed coupons, and that 110 people in total redeemed the 10% off coupon while 90 redeemed the $1 off coupon, we would expect 55 customers in each group to redeem the 10% off coupon and 45 in each group to redeem the $1 off coupon. Hence, in our expected frequencies, we still expect 55% of the total customers to redeem the 10% off offer. Jenny’s expected frequencies are:

Coupons Redeemed (Expected)

                      $1 off   10% off   Coupons Redeemed
18-35                     45        55                100
Over 35                   45        55                100
Coupons Redeemed          90       110                200

As you can see, the totals for each row and column match those in the actual frequency table above. The mathematical way to compute the expected frequencies for each cell would be to multiply its corresponding column total by its corresponding row total and then divide it by the total number of observations. So, we would compute as follows:

Frequency of:                    Formula:            Result
18-35 redeeming $1 off           (100 * 90)/200           45
18-35 redeeming 10% off          (100 * 110)/200          55
Over 35 redeeming $1 off         (100 * 90)/200           45
Over 35 redeeming 10% off        (100 * 110)/200          55

Now that Jenny knows the expected frequencies, she must find the critical χ2 statistic for her desired level of significance, and then compute the χ2 statistic from her data. If the latter is greater than the critical χ2 statistic, then Jenny knows that a customer’s age group is associated with the coupon offer redeemed.

Determining the Critical χ2 Statistic

To find out what her critical χ2 statistic is, Jenny must first determine the degrees of freedom in her data. For cross-tabulation tables, the number of degrees of freedom is a straightforward calculation:

Degrees of freedom = (# of rows – 1) * (# of columns -1)

Jenny has two rows of data and two columns, so she has (2-1)*(2-1) = 1 degree of freedom. With this information, Jenny grabs her old college statistics book and looks at the χ2 distribution table in the appendix. At a 95% confidence level with one degree of freedom, her critical χ2 statistic is 3.84. When Jenny calculates the χ2 statistic from her frequencies, she will compare it with this critical value. If Jenny’s χ2 statistic is greater than the critical one, she will conclude that the difference is statistically significant and that age does relate to which coupon offer is redeemed.
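If a statistics book isn’t handy, the same critical value can be looked up in code. Here is a minimal sketch in Python, assuming scipy is installed.

    from scipy.stats import chi2

    critical = chi2.ppf(0.95, df=1)   # 95% confidence level, 1 degree of freedom
    print(round(critical, 2))         # 3.84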

Calculating the χ2 Value From Observed Frequencies

Now, Jenny needs to compare the actual number of coupons redeemed in each cell to the expected number. To compute her χ2 value, for each cell she subtracts the expected frequency from the actual frequency, squares the difference, and divides it by the expected frequency. She then sums the results across all four cells to get her χ2 value:

              $1 off                   10% off
18-35         (35-45)^2/45 = 2.22      (65-55)^2/55 = 1.82
Over 35       (55-45)^2/45 = 2.22      (45-55)^2/55 = 1.82

χ2 = 2.22 + 1.82 + 2.22 + 1.82 = 8.08

Jenny’s χ2 value is 8.08, much higher than the critical 3.84, indicating that there is indeed an association between age and coupon redemption.
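As a cross-check on the hand calculation, here is a minimal sketch using scipy’s chi2_contingency function, assuming scipy and numpy are installed. It also reproduces the expected frequencies from the row-total × column-total / grand-total rule, and passing correction=False turns off Yates’ continuity correction so the statistic matches the plain formula Jenny used.

    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[35, 65],    # 18-35:   $1 off, 10% off
                         [55, 45]])   # Over 35: $1 off, 10% off

    chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

    print(round(chi2_stat, 2))   # about 8.08
    print(round(p_value, 4))     # well below 0.05, so the association is significant
    print(dof)                   # 1
    print(expected)              # [[45. 55.], [45. 55.]]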

Interpreting the Results

Jenny concludes that patrons over the age of 35 are more inclined than patrons aged 18-35 to take advantage of a coupon stating $1 off, while patrons aged 18-35 are more inclined to prefer the 10% off coupon. How Jenny uses this information depends on the objectives of her business. If Jenny feels that K-Jen needs to attract more middle-aged and senior citizens, she should use the $1 off coupon when targeting them. If Jenny feels K-Jen isn’t selling enough jambalaya, she might try to stimulate demand by couponing, sending the $1 off coupon to patrons over the age of 35 and the 10% off coupon to those 18-35.

Jenny might even have a counterintuitive use for the information. If most of K-Jen’s regular patrons are over age 35, they may already be loyal customers. Jenny might still send them coupons, but give them the 10% off coupon instead. Why? These customers are likely to buy the jambalaya anyway, so why not give them the coupon they are less likely to redeem? After all, why give someone a discount if they’re going to buy anyway! Giving the 10% off coupon to these customers does two things: first, it shows them that K-Jen still cares about their business and keeps K-Jen on their minds as a dining option. Second, by using the coupon with the lower redemption rate, Jenny reduces her exposure to subsidizing loyal customers. In this instance, Jenny uses the coupons for advertising and promoting awareness, rather than for moving orders of jambalaya.

There are several more ways to analyze data by subgroup, some of which will be discussed in future posts. It is important to remember that your research objectives dictate the information you collect, which in turn dictates the appropriate analysis to conduct.
