Correcting for Outliers

September 15, 2010

Yesterday, we discussed approaches for detecting outliers in your data set. Today we’re going to discuss what to do about them. Most of the remedies for outliers are similar to those for missing data: doing nothing, deleting observations, ignoring the variable, and imputing values. We will discuss each remedy below.

Doing nothing

As with missing data, you may choose to do nothing about the outliers, especially if your technique ranks numeric values, which essentially negates their effect. This is true of many decision tree algorithms. Neural networks, however, can be seriously disrupted by a few outlying values.
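To make the ranking idea concrete, here is a minimal sketch in Python, assuming a pandas Series of numeric values (the figures are made up):

```python
import pandas as pd

# Hypothetical numeric feature with one extreme value
spend = pd.Series([120, 85, 97, 110, 25000], name="annual_spend")

# Replacing raw values with their ranks preserves the ordering but
# removes the outlier's leverage: 25000 simply becomes rank 5
spend_ranked = spend.rank(method="average")

print(spend_ranked.tolist())  # [4.0, 1.0, 2.0, 3.0, 5.0]
```

A method that depends only on the order of the values produces the same result either way, which is why it shrugs off the 25,000; a model fed the raw values may not.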

Delete the observations with outlying values

This is another approach that, as with missing data, I do not recommend because of the selection bias it introduces into the model. However, in cases of truly extreme outliers, eliminating one or two values that are way off the charts may improve results.

Ignoring the variable

Sometimes we can exclude a variable with outliers altogether. Perhaps we can replace it with related information, or use proxy variables. For example, if a food manufacturer were trying to measure coupon redemption by metropolitan area, there might be sharp outliers within each metro area. Instead of the metro area itself, the food manufacturer might substitute information about the metro area – number of supermarkets, newspaper circulation (assuming its coupons appear in the Sunday paper), average shopping basket amount, etc. Much of this information is available through third-party vendors or from sources like the U.S. Census Bureau.
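A rough sketch of the proxy idea in Python; the metro areas, figures, and column names are all hypothetical:

```python
import pandas as pd

# Hypothetical coupon-redemption records keyed by metro area
redemptions = pd.DataFrame({
    "metro": ["Chicago", "Des Moines", "Omaha"],
    "coupons_redeemed": [184_000, 9_500, 11_200],
})

# Hypothetical metro-level proxy data (e.g., purchased from a third-party
# vendor or pulled from U.S. Census Bureau tables)
metro_info = pd.DataFrame({
    "metro": ["Chicago", "Des Moines", "Omaha"],
    "supermarkets": [1_450, 110, 130],
    "sunday_circulation": [640_000, 85_000, 92_000],
})

# Swap the raw metro identifier for descriptive proxy variables
modeling_data = redemptions.merge(metro_info, on="metro").drop(columns="metro")
```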

Imputing the values

As with missing values, you would simply try to predict the “right” value to substitute for an outlying value. You might also cap the outliers at the bottom or the top. For example, you might look at the 5th and 95th percentiles, and set the lowest values to the 5th percentile and the highest values to the 95th percentile. You may even choose to eliminate those falling outside the 5th through 95th percentiles. However, as I mentioned yesterday, such capping ignores the uniqueness of each data set. You need to treat each data set differently when identifying and correcting its outliers.
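In Python, that percentile capping might look like the sketch below; the numbers are invented, and the 5th/95th cut points are just the example above, not a universal rule:

```python
import numpy as np

# Hypothetical values with extremes at both ends
values = np.array([3, 48, 52, 55, 60, 61, 63, 70, 75, 900], dtype=float)

# Compute the 5th and 95th percentiles of this particular data set
low, high = np.percentile(values, [5, 95])

# Cap (winsorize) anything outside that range at the percentile values
values_capped = np.clip(values, low, high)
```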

If an observation has an outlier, you might also look to see what values other similar observations tend to have for that variable, and substitute the mean or median for the extreme value. For instance, an ice cream parlor chain might see that sales of mint chocolate chip ice cream in one store are much higher than those of other stores in the area. The sales director might look at stores of similar size (e.g., square footage, sales volume, full-time equivalent employees, etc.), or similar territory (e.g., all ice cream parlors in the greater Bismarck, ND area), check the average or median sales of mint chocolate chip ice cream, and substitute that figure for the outlying store’s value.

It is important to remember, however, that outliers can be caused by external factors. Before blindly imputing values for mint chocolate chip ice cream sales in that particular store, the sales director should find out whether customers near that store have a preference for mint, or whether a few customers buy mint chocolate chip far more often than others. It might even be that the other parlors have severe stock-outs of the flavor, suggesting distribution problems. In that case, the outlying parlor could be normal and all the other parlors could be selling too little mint chocolate chip ice cream!
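If, after those checks, the sales director still decides the value should be replaced, the substitution itself is simple. Here is a rough sketch, with made-up store names and sales figures:

```python
import pandas as pd

# Hypothetical weekly mint chocolate chip sales for parlors in one territory
sales = pd.DataFrame({
    "store":   ["Bismarck-1", "Bismarck-2", "Bismarck-3", "Bismarck-4"],
    "mint_cc": [410, 385, 402, 2150],   # Bismarck-4 looks like the outlier
})

# Median of the comparable (non-outlying) stores
peer_median = sales.loc[sales["store"] != "Bismarck-4", "mint_cc"].median()

# Substitute the peer median for the outlying store's value
sales.loc[sales["store"] == "Bismarck-4", "mint_cc"] = peer_median
```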

Binning values

Sometimes, the best way to deal with outliers is to collapse the values into a few equal-sized categories. You might order your values from high to low and then break them into equal groups. This process is called binning. Low, Medium, and High are common bins. Others might be Outstanding, Above Average, Average, Below Average, and Poor. Outliers fall into appropriate ranges with binning.
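A quick sketch of binning in Python, using quantiles to get roughly equal-sized groups (the spend figures are hypothetical):

```python
import pandas as pd

# Hypothetical customer spend values, including one extreme
spend = pd.Series([12, 30, 45, 55, 72, 90, 110, 250, 400, 9000])

# Break the ordered values into three roughly equal-sized groups; the
# extreme value simply lands in the "High" bin with everything else
# above the upper cut point
bins = pd.qcut(spend, q=3, labels=["Low", "Medium", "High"])
```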

Transforming Data

Sometimes you can eliminate outliers by transforming data. Binning is one form of transformation. Taking the natural log of a value can also reduce the variation caused by extreme values. Ratios are another way to tame outliers. For example, if the ice cream parlor chain wanted to measure store sales, some stores may have much higher sales than others. However, the chain can reduce outliers and normalize the data by computing a “sales per square foot” value.

It is important to note that transforming data also transforms your analysis and models, and that once you’ve done your analysis on the transformed data, you must convert your results back to the original form in order for them to make sense.
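As a rough illustration of both transformations, and of converting results back afterward, here is a Python sketch with invented store-level figures:

```python
import numpy as np
import pandas as pd

# Hypothetical store-level data; one store is far larger than the rest
stores = pd.DataFrame({
    "sales":   [250_000, 310_000, 275_000, 4_800_000],
    "sq_feet": [1_200,   1_500,   1_300,   22_000],
})

# The natural log compresses the extreme sales value
stores["log_sales"] = np.log(stores["sales"])

# A ratio such as sales per square foot also normalizes across store sizes
stores["sales_per_sq_ft"] = stores["sales"] / stores["sq_feet"]

# Any result computed on the log scale (say, a predicted log_sales value)
# must be converted back with the inverse transform to be interpretable
predicted_log_sales = 12.6          # illustrative value on the log scale
predicted_sales = np.exp(predicted_log_sales)
```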

As you can see, correcting for outliers isn’t much different from correcting for missing data. However, you must be careful in your approach to correcting either one. Outliers by themselves can still alert you to valuable information, such as data collection problems. There’s no single “best” way to correct for outliers; quite often the right approach depends on the nature of the data, the business objective, and the impact the correction will have on the analysis supporting that objective. How you correct an outlier is just as critical as how you define it.


Dealing With Missing Data

September 8, 2010

As I mentioned in yesterday’s post, data is fraught with problems. There are records that have missing values, outliers, or other characteristics that require you to work with the data in various ways. Today, I am going to discuss what to do about missing data.

Records have missing data for lots of reasons. Perhaps a customer doesn’t provide a home phone number because they use a cell phone most of the time, or because they don’t want to be bothered with sales calls. Sometimes customers are just too new to have generated any data yet. If your cell phone provider wanted to examine the past 12 months’ usage of cell phone plans, but you switched to them only two months ago, you may not have generated a lot of data for them to work with. Data may also be incomplete. This is true of third-party overlay data like that available from Experian or Acxiom. Some customers have demographic and purchase-pattern information more readily available than others, so overlay data will not be complete. And sometimes data just isn’t collected at all – it may not have been of interest until now.

Missing data causes problems in data mining and predictive modeling. If a significant percentage of the records have null or missing values for a particular field – say date of birth – then it might be difficult for a business to build a statistical model using age as a predictor or classification variable. Some data mining packages omit entire observations from regression models because one or more of the observation’s predictor variables has missing values. Depending on the number of observations in your model and the number that are kicked out, you can see your model’s degrees of freedom greatly reduced.
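To see how quickly this adds up, consider the toy example below, in which any row missing a single predictor is thrown out entirely (the data are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical modeling data with some missing ages and incomes
df = pd.DataFrame({
    "age":    [34, np.nan, 52, np.nan, 41, 29],
    "income": [48_000, 61_000, 75_000, 39_000, np.nan, 52_000],
    "spend":  [1_200, 950, 2_100, 640, 1_800, 1_100],
})

# Listwise deletion: any observation with a missing predictor is dropped
complete_cases = df.dropna()

print(len(df), "observations before,", len(complete_cases), "after")  # 6 before, 3 after
```

Half the observations are gone, even though only three of the eighteen individual values were missing.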

There’s no one approach to correcting for missing data. As I said yesterday, it depends a lot on your business problem and your timeframe. There are a handful of approaches that are frequently used when dealing with missing data. Among them:

Just ignoring it. If the number of observations with missing values is small, then you might be able to get by without making any changes. The lack of data probably won’t impact your results very much.

Deleting the observations with missing data. This is similar to the automatic approach of those data mining packages mentioned above. I almost never recommend this approach, as it introduces selection bias into your model. The observations that remain may not be representative of your relevant population.

Ignoring the variable. If a large number of observations have null or missing values for a given variable, it may be best to exclude it from your analysis.

Imputing the values. I do a lot of imputation of missing values. However, this approach has drawbacks of its own. Basically, when you impute a value, you are predicting what that observation’s value is. For example, if you were working with lending data by census tract and discovered that a handful of loan applicants’ incomes were not forwarded to you, you might try to make educated guesses at what those incomes were. You might look at the census tracts in which the applicants with missing data live, find the median income of each tract from the U.S. Census (along with any other useful demographics), and substitute that value. It won’t be perfect, but it might be close. (A sketch of this kind of group-level imputation appears after this list.)

Imputation has problems, however. The lending example I gave was just that – an example. If you’re a lender doing a credit risk analysis of applicants, such imputation would be unacceptable – and even illegal. For marketing, such imputation may be acceptable. Also, if a large number of observations are missing values for a given variable, then imputation may only make the problem worse.

Building Separate Models. Another approach is to separate the observations based on the data they have available and conduct separate analyses or build separate models.

Waiting Until You Can Collect the Data You Need. This is a problem if you need results right away, but sometimes, it’s all you can do.
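Here is the sketch of group-level imputation promised above, using the lending example; the tract numbers, incomes, and medians are all invented:

```python
import numpy as np
import pandas as pd

# Hypothetical loan applicants; a few incomes were never forwarded
applicants = pd.DataFrame({
    "census_tract": ["1001", "1001", "1002", "1002", "1002"],
    "income":       [54_000, np.nan, 72_000, 68_000, np.nan],
})

# Hypothetical tract-level median incomes (e.g., from U.S. Census tables)
tract_medians = {"1001": 51_500, "1002": 70_250}

# Fill each missing income with the median income of the applicant's tract
applicants["income"] = applicants["income"].fillna(
    applicants["census_tract"].map(tract_medians)
)
```

Again, this is only appropriate where imputation is appropriate at all – as noted above, it would not be acceptable for an actual credit decision.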

Missing data is one of the challenges of data mining and predictive modeling. Because the absence of data reduces the information available to us, we often need to do something to make up for it. However, we must realize that our remedies for missing data create problems of their own and can actually cause even more harm if we do not deploy these remedies with careful thought.