Posts Tagged ‘outliers’

Company Practices Can Cause “Dirty” Data

April 28, 2014

As technical people, we often use a not-so-technical phrase to describe the use of bad data in our analyses: “Garbage in, garbage out.” Anytime we build a model or perform an analysis based on data that is dirty or incorrect, we will get results that are undesirable. Data has many opportunities to get murky, with a major cause being the way the business collects and stores it. And dirty data isn’t always incorrect data; the way the company enters data can be correct for operational purposes, but not useful for a particular analysis being done for the marketing department. Here are some examples:

The Return That Wasn’t a Return

I was recently at an outlet store buying some shirts for my sons. After walking out, I realized the sales clerk rang up the full, not sale, price. I went back to the store to have the difference refunded. The clerk re-scanned the receipt, cancelled the previous sale and re-rang the shirts at the sale price. Although I ended up with the same outcome – a refund – I thought of the problems this process could cause.

What if the retailer wanted to predict the likelihood of merchandise returns? My transaction, which was actually a price adjustment, would be treated as a return. Depending on how often this happens, a particular store can be flagged as having above-average returns relative to comparable stores, and be required to implement more stringent return policies that weren’t necessary to begin with.

And consider the flipside of this process: by treating the erroneous ring-up as a return, the retailer won’t be alerted to the possibility that clerks at this store may be making mistakes in ringing up information; perhaps sale prices aren’t being entered into the store’s system; or perhaps equipment storing price updates isn’t functioning properly.

And processing the price adjustment the way the clerk did actually creates even more data that needs to be stored: the initial transaction, the return transaction, and the corrected transaction.

The Company With Very Old Customers

Some years ago, I worked for a company that did direct mailings. I needed to conduct an analysis of its customers and identify the variables that predicted those most likely to respond to a solicitation. The company collected the birthdates of its customers. From that field, I calculated the age of each individual customer. And I found that nearly ten percent of their customers were quite old – much older than the market segments this company targeted. A deeper dive on the birthdate field revealed that virtually all of them had the same birthdate: November 11, 1911. (This was back around the turn of the millennium when companies still recorded dates with two-digit years).

How did this happen? Well, as discussed in the prior post on problem definition, I consulted the company’s “data experts.” I learned that the birthdate field was a required field for first-time customers. The call center representative could not move from the birthdate field to the next field unless values were entered into the birthdate field. Hence, many representatives in the call center simply entered “11-11-11” to bypass the field when a first-time customer refused to give his or her birthdate.

In this case, the company’s requirement to collect birthdate information met sharp resistance from customers, causing the call center to enter dummy data to get around the operational constraints. Incidentally, the company later made the birthdate field optional.

Customers Who Hadn’t Purchased in Almost a Century

Back in the late 1990s, I went to work for a catalog retailer, building response models. The cataloger was concerned that its models were generating undesirable results. I tried running the models with its data and confirmed the models to be untrustworthy. So I started running frequency distributions on all its data fields. To my surprise, I found a field, “Months since last purchase,” in which many customers had the value “999.” Wow – many customers hadn’t purchased since 1916 – almost 83 years earlier!

I knew immediately what happened. In the past, when data was often read into systems using magnetic tape, the way the data systems were programmed required all fields to be populated; if a value for a particular field was missing, the value for the next field would get read into its place, and so forth; and when the program read to the end of the record, it often went to the next record and then read values from there until all fields for the previous record were populated. This was a data nightmare.

To get around this, fields whose data was missing or unknown were filled with a series of 9s, so that all the other data would be entered into the system correctly. This process was fine and dandy, as long as the company’s analysts accounted for this practice during their analysis. The cataloger, however, would run its regressions using those ‘999s,’ resulting in serious outliers, and regressions of little value.

In this case, the cataloger’s attempt to rectify one data malady resulted in a new data malady. I corrected this by recoding the values, breaking those whose last purchase date was known into intervals, and using ranking values: a 1 for the most recent customers, a 2 for the next most recent, a 3 for the next most recent, and so forth, and gave the lowest rank to those whose last purchase was unknown.

The Moral of the Story

Company policy is a major cause of dirty data. These examples – which are far from comprehensive – illustrate how the way data is entered can cause problems. Often, a data fix proves shortsighted, as it causes new problems down the road. This is why it is so important for analysts to consult the company’s data experts before undertaking any major data mining effort. Knowing how a company collects and stores data and making allowances for it will increase the likelihood of a successful data mining effort.

Correcting for Outliers

September 15, 2010

Yesterday, we discussed approaches for discerning outliers in your data set. Today we’re going to discuss what to do about them. Most of the remedies for dealing with outliers are similar to those of dealing with missing data: doing nothing, deleting observations, ignoring the variable, and imputing values. We will discuss the remedies below.

Doing nothing

As with missing data, you may choose to do nothing about the outliers, especially if you rank numeric values, which essentially negates the effect of outliers. This is true of many decision tree algorithms. Neural networks, however, may be seriously disrupted by a few outlying values.

Delete the observations with outlying values

This is another approach that, like with missing data, I do not recommend because of the selection bias it introduces in the model. However, in cases of truly extreme outliers, eliminating one or two that are way off the charts may improve results.

Ignoring the variable

Sometimes we can exclude a variable with outliers. Perhaps we can replace it with information referring to it, or use proxy information. For example, if a food manufacturer was trying to measure coupon redemption by certain metropolitan areas, there might be sharp outliers within each metro area. Instead of the metro area, the food manufacturer may substitute information about the metro area – number of supermarkets, newspaper circulation (assuming its coupons appear in the Sunday paper), average shopping basket amount, etc. Much of this information is available through third party vendors or from sources like the U.S. Census Bureau.

Imputing the values

As with missing values, you would simply try to predict the “right” value to substitute for an outlying variable. You might even cap the outliers at the bottom or top. For example, you might look at the 5th and 95th percentiles, and set the lowest values to the 5th percentile and the top values to the 95% percentile. You may even choose to eliminate those falling outside the 5th through 95th percentiles. However, as I mentioned yesterday, such capping ignores the uniqueness of each data set. You need to treat each data set differently when identifying and correcting its outliers.

If an observation has an outlier, you might also look to see what values other similar observations tend to have for that variable, and substitute the mean or median for the extreme value. For instance, an ice cream parlor chain might see that sales of mint chocolate chip ice cream in one store might be much higher than that of other stores in the area. The sales director might look at stores of similar size (e.g., square footage, sales volume, full-time equivalent employees, etc.), or similar territory (e.g., all ice cream parlors in the greater Bismarck, ND area), and check the average or median sales of mint chocolate chip ice cream and substitute that for the outlying store.

It is important to remember however that outliers can be caused because of external factors. Before blindly imputing values for mint chocolate chip ice cream sales in that particular store, the sales director should find out if customers near that store have preferences for mint, or whether a few customers buy the mint chocolate chip a lot more than others. It might even be that the other parlors could have severe stock-outs of the flavor, suggesting distribution problems. In this case, the outlying parlor could be normal and all other parlors could be selling too little mint chocolate chip ice cream!

Binning values

Sometimes, the best way to deal with outliers is to collapse the values into a few equal-sized categories. You might order your values from high to low and then break them into equal groups. This process is called binning. Low, Medium, and High are common bins. Others might be Outstanding, Above Average, Average, Below Average, and Poor. Outliers fall into appropriate ranges with binning.

Transforming Data

Sometimes you can eliminate outliers by transforming data. Binning is one form of transformation. Taking the natural log of a value can also reduce the variation caused by extreme values. Another way to eradicate outliers might be ratios. For example, if the ice cream parlor chain wanted to measure store sales, some stores may have much higher sales than others. However, the chain can reduce outliers and normalize data by computing a “sales per square foot” value.

It is important to note that transforming data also transforms your analysis and models, and that once you’ve done your analysis on the transformed data, you must convert your results back to the original form in order for them to make sense.

As you can see, correcting for outliers isn’t much different from correcting for missing data. However, you must be careful in your approach to correcting either outliers or missing data. Outliers by themselves can still alert you to valuable information, such as data collection problems. There’s no “best” way to correct for outliers in general; quite often the best approach for correcting outliers depends on the nature of the data, the business objective, and the impact the correction will have on the results of the analysis that is supporting that business objective. How you correct an outlier is just as critical as how you define it.


If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Identifying Outliers in a Data Set

September 14, 2010

Last week, we talked about what to do when your data set has records with missing or null values. Another problem that crops in data sets is that of extreme values, commonly known as outliers. Like missing data, outliers can wreak havoc with your statistical models and analyses, especially in regression analysis, which places greater weight on extreme values. Today, we’re going to talk about diagnosing outliers in your data and, tomorrow, we will discuss what to do about them.

Outliers occur in two ways – naturally and erroneously. Naturally, because not everybody or every phenomenon is typical. There are a small number of people who are much taller than most other persons and a small number who are much shorter; one or two gamblers at a casino may have a much larger roulette win than most other players; a few light bulbs may last many more (or far fewer) hours than most other bulbs of the same brand. These natural examples are rare, but can happen.

Outliers also occur because of error. Sometimes when entering data, we misplace a decimal point, or enter an extra zero at the end of a number, or transpose numbers. It is important to verify that all information is collected and recorded properly.

Diagnosing Outliers

There are a couple of ways to check data for outliers. These include:

Visually Inspect Data

Plot your data on a chart or graph. Do some points stand out from the “crowd?” If so, what is the record? Can you verify that it was entered correctly?

Automatically Minimize Exposure to Outliers

One way to check for outliers is to assume you’ll have some and adjust your data accordingly. You may say that a set percentage (say 1% to 5%) of your data on both ends is an outlier and then either remove those observations, or set a floor or ceiling based on the remaining data. For example, if you have 1,000 records in your data set and you assume that 1% on both ends is an outlier, you can either remove the bottom and top 10 observations from your analysis, or you can change the values of the bottom 10 to the value of the 11th lowest and those of the top 10 to that of the 11th highest value.

The problem here is that this approach is arbitrary and doesn’t take into account the uniqueness of each individual data set. Moreover, if you decided to delete those bottom and top records, you lose information. I don’t particularly recommend this approach, but in the interest of expediency it may be helpful.

Parametric Extremity

In parametric extremity, we use the data set’s parameters to determine how a particular value diverges from the center of the data set’s distribution. The obvious center of the distribution is the mean; the measure of divergence is the standard deviation. When data is normally distributed, virtually all observations are located within three standard deviations from the mean (in each direction). Hence, we may set a rule that an outlier is any value that is at least +/- 3 standard deviations from the mean.

This approach also has some drawbacks. The mean and standard deviation are computed from all values, including outliers. Hence, outliers tend to pull the mean towards them and inflate the standard deviation. As a result, they tend to bias the criteria used for judging whether a value is an outlier. Indeed, outliers introduce bias towards including extreme values.

Non-Parametric Extremity

Another approach to measuring divergence is through non-parametric methods. Essentially, the concept is the same, and the mean is still the center; however the divergence is measured by the inter-quartile range (IQR). Essentially, you order your data set and then break it into four equal parts. The lowest 25% is your first quartile; the next 25% is your second quartile (whose upper bound is the median); and so on. Essentially, anything higher than the top of the third quartile or lower than the bottom of the second quartile is reviewed for outliers.

If done haphazardly, non-parametric extremity will give you the same problem as establishing a set percentage on both ends as outliers. To avoid this drawback, again inspect the points that fall outside the second and third quartiles. Those closest to the outer bounds of the IQR can remain in your data set; those far away should be measured for accuracy, and if accurate can be adjusted or removed.

These are just a few of the ways you can identify outliers in your data set. Frequently, classifying a value as an outlier is a judgment call, and diagnosis and correction are two separate events. How you diagnose outliers is just as important to the integrity of your analysis as how you deal with those outliers.


If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Dealing With Missing Data

September 8, 2010

As I mentioned in yesterday’s post, data is fraught with problems. There are records that have missing values, outliers, or other characteristics that require you to work with the data in various ways. Today, I am going to discuss what to do about missing data.

Records have missing data for lots of reasons. Perhaps a customer doesn’t provide a home phone number because he/she uses a cell phone most of the time; or he/she may not want to be bothered with sales calls, so he doesn’t provide your business with that phone number. Sometimes customers are just too new to have any data generated as yet. If your cell phone provider wanted to examine the past 12 months’ usage of cell phone plans, but you only switched to them two months ago, you may not have generated a lot of data for them to work with. Data may be incomplete. This is true of third-party overlay data like those found from Experian or Acxiom. Some customers have demographic and purchase pattern information more readily available than others, so overlay data will not be complete. And, sometimes, data just isn’t collected at all – it may not have been of interest until now.

Missing data causes problems in data mining and predictive modeling. If a significant percent of the records have null or missing values for a particular record – say date of birth – then it might be difficult for a business to build a statistical model using age as a predictor or classification variable. Some data mining packages omit entire observations from regression models because one or more of the observation’s predictor variables has missing values. Depending on the number of observations in your model and the number that are kicked out, you can see your model’s degrees of freedom greatly reduced.

There’s no one approach to correcting for missing data. As I said yesterday, it depends a lot on your business problem and your timeframe. There are a handful of approaches that are frequently used when dealing with missing data. Among them:

Just ignoring it. If the number of observations with missing values is small, then you might be able to get by without making any changes. The lack of data probably won’t impact your results very much.

Deleting the observations with missing data. This is similar to the automatic approach of those data mining packages mentioned above. I almost never recommend this approach, as it introduces selection bias into your model. The observations that remain may not be representative of your relevant population.

Ignoring the variable. If a large number of observations have null or missing values for a given variable, it may be best to exclude it from your analysis.

Imputing the values. I do a lot of imputations with missing values. However, this approach has drawbacks of its own. Basically, when you impute a value, you are predicting what that observation’s value is. For example, if you were working with lending data by census tract and discovered that a handful of loan applicants’ incomes were not forwarded to you, you might try to make educated guesses at what the incomes were. So, you might look at the census tracts in which the applicants with missing data live. You might then look for the median income of the census tract from the U.S. Census, as well as any other demographics, and then substitute that value. It won’t be perfect, but it might be close.

Imputation has problems, however. The lending example I gave was just that – an example. If you’re a lender doing a credit risk analysis of applicants, such imputation would be unacceptable – and even illegal. For marketing, such imputation may be acceptable. Also, if a large number of observations are missing values for a given variable, then imputation may only make the problem worse.

Building Separate Models. Another approach would be to separate the observations based on the data they have available and conducting separate analyses or building separate models.

Waiting Until You Can Collect the Data You Need. This is a problem if you need results right away, but sometimes, it’s all you can do.

Missing data is one of the challenges of data mining and predictive modeling. Because the absence of data reduces the information available to us, we often need to do something to make up for it. However, we must realize that our remedies for missing data create problems of their own and can actually cause even more harm if we do not deploy these remedies with careful thought.

Forecast Friday Topic: Heteroscedasticity

August 12, 2010

(Seventeenth in a series)

Recall that one of the important assumptions in regression analysis is that a regression equation exhibit homoscedasticity: the condition that the error terms have a constant variance. Today we discuss heteroscedasticity, the violation of that assumption.

Heteroscedasticity, like autocorrelation and multicollinearity, results in inefficient parameter estimates. The standard errors of the parameter estimates tend to be biased, which means that the t-ratios and confidence intervals calculated around the suspect independent variable will not be valid, and will generate dubious predictions.

Heteroscedasticity occurs mostly in cross-sectional, as opposed to time series, data and mostly in large data sets. When data sets are large, the range of values for an independent variable can be quite wide. This is especially the case in data where income or other measures of wealth are used as independent variables. Persons with low income have few options about how to spend their money while persons with high incomes have many options. If you were trying to predict that the conviction rate for crimes was different in low income counties vs. high income counties, your model may exhibit heteroscedasticity because a low-income person may not have the funds for an adequate defense, and may be restricted to a public defender, or other inexpensive attorney. A wealthy individual, on the other hand, can hire the very best defense lawyer money could buy; or he could choose an inexpensive lawyer, or even the public defender. The wealthy individual may even be able to make restitution in lieu of a conviction.

How does this disparity affect your model? Recall from our earlier discussions on regression analysis that the least-squares method places more weight on extreme values. When outliers exist in data, they generate large residuals that get scattered out from those of the remaining observations. While heteroscedastic error terms will still have a mean of zero, their variance is greatly out of whack, resulting in inefficient parameter estimates.

In today’s Forecast Friday post, we will look at a data set for a regional housing market, perform a regression, and show how to detect heteroscedasticity visually.

Heteroscedasticity in the Housing Market

The best depiction of heteroscedasticity comes from my college econometrics textbook, Introducing Econometrics, by William S. Brown. In the chapter on heteroscedasticity, Brown provides a data set of housing statistics from the 1980 Census for Pierce County, Washington, which I am going to use for our model. The housing market is certainly one market where heteroscedasticity is deeply entrenched, since there is a dramatic range for both incomes and home market values. In our data set, we have 59 census tracts within Pierce County. Our independent variable is the median family income for the census tract; our dependent variable is the OwnRatio – the ratio of the number of families who own their homes to the number of families who rent. Our data set is as follows:

Housing Data





















































































































































































Data taken from U.S. Bureau of Census 1980 Pierce County, WA; Reprinted in Brown, W.S., Introducing Econometrics, St. Paul (1991): 198-200.

When we run our regression, we get the following equation:

Ŷ= 0.000297*Income – 2.221

Both the intercept and independent variable’s parameter estimates are significant, with the intercept parameter having a t-ratio of -4.094 and the income estimate having one of 9.182. R2 is 0.597, and the F-statistic is a strong 84.31. The model seems to be pretty good – strong t-ratios and F-statistic, a high coefficient of determination, and the sign on the parameter estimate for Income is positive, as we would expect. Generally, the higher the income, the greater the Own-to-rent ratio. So far so good.

The problem comes when we do a visual inspection of our data: first the independent variable against the dependent variable and the independent variable against the regression residuals. First, let’s take a look at the scatter plot of Income and OwnRatio:

Without even looking at the residuals, we can see that as median family income increases, the data points begin to spread out. Look at what happens to the distance between data points above and below the line when median family incomes reach $20,000: OwnRatios vary drastically.

Now let’s plot Income against the regression’s residuals:

This scatter plot shows essentially the same phenomenon as the previous graph, but from a different perspective. We can clearly see the error terms fanning out as Income increases. In fact, we can see the residuals diverging at increasing rates once Income starts moving from $10,000 to $15,000, and just compounding as incomes go higher. Roughly half the residuals fall on both the positive and the negative side, allowing us to meet the regression assumption of our residuals having a mean of zero, hence our parameter estimates are not biased. However, because we violated the constant variance assumption, the standard error of our regression is biased, so our parameter estimates are suspect.

Visual Inspection Only Gets You So Far

By visually inspecting our residuals, we can clearly see that our error terms are not homoscedastic. When you have a regression model, especially for cross-sectional data sets like this, you should visually inspect every independent variable against the dependent variable and against the error terms in order to get a priori indication of heteroscedasticity. However, visual inspection alone is not a guarantee that heteroscedasticity exists. There are three particularly simple methods to detecting heteroscedasticity which we will discuss in next week’s Forecast Friday post: the Park Test, the Goldfeld-Quandt Test, and the Breusch-Pagan Test.


Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has 150 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights!  By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter.   Thanks for your help!