Yesterday, we discussed approaches for discerning outliers in your data set. Today we’re going to discuss what to do about them. Most of the remedies for outliers are similar to those for missing data: doing nothing, deleting observations, ignoring the variable, and imputing values. We will discuss each of these below, along with two further options: binning and transforming the data.

*Doing nothing*

As with missing data, you may choose to do nothing about the outliers, especially if you rank numeric values, which essentially negates the effect of outliers. Many decision tree algorithms behave this way, because they split on the order of the values rather than on their magnitude. Neural networks, however, may be seriously disrupted by a few outlying values.
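To see why ranking takes the sting out of an outlier, here is a minimal sketch in plain Python. The function name `rank_values` and the sample figures are made up for illustration; ties share their average rank, which is one common convention among several.

```python
def rank_values(values):
    """Return 1-based ranks; tied values share their average rank."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


# Hypothetical weekly sales; 9000 is the outlier.
sales = [120, 135, 128, 9000, 131]
print(rank_values(sales))  # [1.0, 4.0, 2.0, 5.0, 3.0]
```

After ranking, the extreme store is simply "5th of 5": it sits one step above its neighbor no matter how far off the charts the raw number was.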

*Delete the observations with outlying values*

This is another approach that, as with missing data, I do not recommend because of the selection bias it introduces into the model. However, in cases of truly extreme outliers, eliminating one or two that are way off the charts may improve results.

*Ignoring the variable*

Sometimes we can exclude a variable with outliers. Perhaps we can replace it with information that describes it, or use proxy information. For example, if a food manufacturer were trying to measure coupon redemption by metropolitan area, there might be sharp outliers within each metro area. Instead of the metro area itself, the food manufacturer may substitute information about the metro area: number of supermarkets, newspaper circulation (assuming its coupons appear in the Sunday paper), average shopping basket amount, etc. Much of this information is available through third-party vendors or from sources like the U.S. Census Bureau.

*Imputing the values*

As with missing values, you would simply try to predict the “right” value to substitute for an outlying value. You might even cap the outliers at the bottom or top. For example, you might look at the 5^{th} and 95^{th} percentiles, and set the lowest values to the 5^{th} percentile and the highest values to the 95^{th} percentile. You may even choose to eliminate the observations falling outside the 5^{th} through 95^{th} percentiles. However, as I mentioned yesterday, such capping ignores the uniqueness of each data set. You need to treat each data set differently when identifying and correcting its outliers.
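The percentile capping described above can be sketched in a few lines of plain Python. The function names, the linear-interpolation percentile rule, and the sample data are all my own assumptions for illustration; the 5 and 95 cutoffs are parameters precisely because no single pair of cutoffs suits every data set.

```python
def percentile(sorted_values, pct):
    """Linearly interpolated percentile (pct from 0 to 100) of pre-sorted data."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * pct / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def cap_outliers(values, low_pct=5, high_pct=95):
    """Pull values below/above the chosen percentiles in to those percentiles."""
    ordered = sorted(values)
    low = percentile(ordered, low_pct)
    high = percentile(ordered, high_pct)
    # Values inside [low, high] pass through unchanged.
    return [min(max(v, low), high) for v in values]


# Hypothetical daily sales with one very low and one very high day.
daily_sales = [3, 40, 42, 45, 47, 50, 52, 55, 58, 400]
print(cap_outliers(daily_sales))
```

Note that capping keeps every observation; only the extreme values change. Dropping the observations outside the cutoffs instead would shrink the data set.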

If an observation has an outlier, you might also look to see what values other similar observations tend to have for that variable, and substitute their mean or median for the extreme value. For instance, an ice cream parlor chain might see that sales of mint chocolate chip ice cream in one store are much higher than those of other stores in the area. The sales director might look at stores of similar size (e.g., square footage, sales volume, full-time equivalent employees, etc.), or similar territory (e.g., all ice cream parlors in the greater Bismarck, ND area), check the mean or median sales of mint chocolate chip ice cream, and substitute that for the outlying store.

It is important to remember, however, that outliers can be caused by external factors. Before blindly imputing values for mint chocolate chip ice cream sales in that particular store, the sales director should find out whether customers near that store have a preference for mint, or whether a few customers buy much more mint chocolate chip than others. It might even be that the other parlors have severe stock-outs of the flavor, suggesting distribution problems. In this case, the outlying parlor could be normal and all other parlors could be selling too little mint chocolate chip ice cream!
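Once you have satisfied yourself that the extreme value really is an error and not a genuine signal, the peer-group substitution described above might look like the sketch below. The record layout (`id`, `territory`, `mint_sales`), the function name, and the numbers are hypothetical; the median is used here, but the mean would work the same way.

```python
from statistics import median


def impute_from_peers(stores, target_id, group_key="territory", value_key="mint_sales"):
    """Return the median value among the target store's peers (same group, target excluded)."""
    target = next(s for s in stores if s["id"] == target_id)
    peers = [
        s[value_key]
        for s in stores
        if s[group_key] == target[group_key] and s["id"] != target_id
    ]
    if not peers:
        raise ValueError("no peer stores to impute from")
    return median(peers)


# Hypothetical stores; B3 is the outlying parlor.
stores = [
    {"id": "B1", "territory": "Bismarck", "mint_sales": 400},
    {"id": "B2", "territory": "Bismarck", "mint_sales": 420},
    {"id": "B3", "territory": "Bismarck", "mint_sales": 2500},
    {"id": "B4", "territory": "Bismarck", "mint_sales": 410},
    {"id": "F1", "territory": "Fargo", "mint_sales": 900},
]
print(impute_from_peers(stores, "B3"))  # 410
```

The outlying store is left out of its own peer group so that its extreme value cannot pull the substitute toward itself.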

*Binning values*

Sometimes, the best way to deal with outliers is to collapse the values into a few equal-sized categories. You might order your values from high to low and then break them into equal groups. This process is called *binning*. Low, Medium, and High are common bins. Others might be Outstanding, Above Average, Average, Below Average, and Poor. With binning, an outlier simply falls into the top or bottom bin alongside the other high or low values.
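Here is a rough sketch of equal-sized binning in plain Python. The function name and sample values are invented for illustration, and tied values are split by their original order, which is only one of several reasonable ways to handle ties.

```python
def bin_equal_groups(values, labels=("Low", "Medium", "High")):
    """Assign each value to one of len(labels) roughly equal-sized bins, lowest first."""
    n = len(values)
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [None] * n
    for rank, idx in enumerate(order):
        # Integer arithmetic maps rank 0..n-1 onto label 0..len(labels)-1.
        bins[idx] = labels[rank * len(labels) // n]
    return bins


# Hypothetical values; 500 is the outlier.
values = [10, 12, 11, 500, 14, 13]
print(bin_equal_groups(values))
# ['Low', 'Medium', 'Low', 'High', 'High', 'Medium']
```

The value 500 lands in "High" right next to 14: the bin records that it is large, but not how large.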

*Transforming Data*

Sometimes you can reduce or even eliminate outliers by transforming data. Binning is one form of transformation. Taking the natural log of a value can also reduce the variation caused by extreme values. Ratios are another option. For example, if the ice cream parlor chain wanted to measure store sales, some stores may have much higher sales than others simply because they are bigger. The chain can reduce outliers and normalize the data by computing a “sales per square foot” value.
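A quick sketch of both transformations, using made-up figures for four stores (the last one is a very large store). The variable names and numbers are assumptions for illustration only.

```python
import math

# Hypothetical annual sales and floor space for four parlors.
sales = [52_000, 61_000, 58_000, 940_000]
square_feet = [1_000, 1_200, 1_100, 18_000]

# Natural log: compresses large values far more than small ones.
# (Only defined for positive values.)
log_sales = [math.log(s) for s in sales]

# Ratio: sales per square foot puts big and small stores on the same footing.
sales_per_sqft = [s / f for s, f in zip(sales, square_feet)]

print("raw max/min:        ", round(max(sales) / min(sales), 2))
print("log range:          ", round(max(log_sales) - min(log_sales), 2))
print("per-sqft max/min:   ", round(max(sales_per_sqft) / min(sales_per_sqft), 2))
```

In this invented example the big store's raw sales are about 18 times the smallest store's, yet per square foot all four stores sit within a few percent of one another, so the "outlier" was really just a size effect.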

It is important to note that transforming data also transforms your analysis and models, and that once you’ve done your analysis on the transformed data, you must convert your results back to the original form in order for them to make sense.
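As a small illustration of converting back, suppose you average values on the log scale. The sketch below (function name and numbers are my own) exponentiates the result to return to the original units. One caveat worth knowing: the back-transformed mean of the logs is the geometric mean, which is not the same as the ordinary arithmetic mean of the raw values.

```python
import math


def log_scale_average(values):
    """Average positive values on the natural-log scale, then convert back."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    # exp() undoes the natural log and returns the result to original units.
    return mean_log, math.exp(mean_log)


# Hypothetical values spanning two orders of magnitude.
values = [10, 100, 1000]
mean_log, back_transformed = log_scale_average(values)
print(round(mean_log, 3))          # a log-scale number, hard to interpret directly
print(round(back_transformed, 3))  # back in original units
print(sum(values) / len(values))   # arithmetic mean of the raw values, for comparison
```

The log-scale average by itself means little to a business audience; only after the conversion does it read as a sales figure again.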

As you can see, correcting for outliers isn’t much different from correcting for missing data. However, you must be careful in your approach to correcting either outliers or missing data. Outliers by themselves can still alert you to valuable information, such as data collection problems. There’s no “best” way to correct for outliers in general; quite often the best approach for correcting outliers depends on the nature of the data, the business objective, and the impact the correction will have on the results of the analysis that is supporting that business objective. How you correct an outlier is just as critical as how you define it.

