Last week, we talked about what to do when your data set has records with missing or null values. Another problem that crops in data sets is that of extreme values, commonly known as outliers. Like missing data, outliers can wreak havoc with your statistical models and analyses, especially in regression analysis, which places greater weight on extreme values. Today, we’re going to talk about diagnosing outliers in your data and, tomorrow, we will discuss what to do about them.

Outliers occur in two ways – naturally and erroneously. Naturally, because not everybody or every phenomenon is typical. There are a small number of people who are much taller than most other persons and a small number who are much shorter; one or two gamblers at a casino may have a much larger roulette win than most other players; a few light bulbs may last many more (or far fewer) hours than most other bulbs of the same brand. These natural examples are rare, but can happen.

Outliers also occur because of error. Sometimes when entering data, we misplace a decimal point, or enter an extra zero at the end of a number, or transpose numbers. It is important to verify that all information is collected and recorded properly.

**Diagnosing Outliers**

There are a couple of ways to check data for outliers. These include:

*Visually Inspect Data*

Plot your data on a chart or graph. Do some points stand out from the “crowd?” If so, what is the record? Can you verify that it was entered correctly?

*Automatically Minimize Exposure to Outliers*

One way to check for outliers is to assume you’ll have some and adjust your data accordingly. You may say that a set percentage (say 1% to 5%) of your data on both ends is an outlier and then either remove those observations, or set a floor or ceiling based on the remaining data. For example, if you have 1,000 records in your data set and you assume that 1% on both ends is an outlier, you can either remove the bottom and top 10 observations from your analysis, or you can change the values of the bottom 10 to the value of the 11^{th} lowest and those of the top 10 to that of the 11^{th} highest value.

The problem here is that this approach is arbitrary and doesn’t take into account the uniqueness of each individual data set. Moreover, if you decided to delete those bottom and top records, you lose information. I don’t particularly recommend this approach, but in the interest of expediency it may be helpful.

*Parametric Extremity*

In parametric extremity, we use the data set’s parameters to determine how a particular value diverges from the center of the data set’s distribution. The obvious center of the distribution is the mean; the measure of divergence is the standard deviation. When data is normally distributed, virtually all observations are located within three standard deviations from the mean (in each direction). Hence, we may set a rule that an outlier is any value that is at least +/- 3 standard deviations from the mean.

This approach also has some drawbacks. The mean and standard deviation are computed from all values, including outliers. Hence, outliers tend to pull the mean towards them and inflate the standard deviation. As a result, they tend to bias the criteria used for judging whether a value is an outlier. Indeed, outliers introduce bias towards *including* extreme values.

*Non-Parametric Extremity*

Another approach to measuring divergence is through non-parametric methods. Essentially, the concept is the same, and the mean is still the center; however the divergence is measured by the inter-quartile range (IQR). Essentially, you order your data set and then break it into four equal parts. The lowest 25% is your first quartile; the next 25% is your second quartile (whose upper bound is the *median*); and so on. Essentially, anything higher than the top of the third quartile or lower than the bottom of the second quartile is reviewed for outliers.

If done haphazardly, non-parametric extremity will give you the same problem as establishing a set percentage on both ends as outliers. To avoid this drawback, again inspect the points that fall outside the second and third quartiles. Those closest to the outer bounds of the IQR can remain in your data set; those far away should be measured for accuracy, and if accurate can be adjusted or removed.

These are just a few of the ways you can identify outliers in your data set. Frequently, classifying a value as an outlier is a judgment call, and diagnosis and correction are two separate events. How you diagnose outliers is just as important to the integrity of your analysis as how you deal with those outliers.

*************************

**If you Like Our Posts, Then “Like” Us on Facebook and Twitter!
**

Analysights is now doing the social media thing! If you like *Forecast Friday* – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

Tags: central tendency, data analysis, data mining, inter-quartile range, mean, missing data, missing values, non-parametric extremity, outliers, parametric extremity, standard deviation, statistical analysis

April 2, 2011 at 3:42 pm |

You have treated missing values and outliers as the main problems of data analysis which you presented ways of handling them separately, I am thinking of imputing missing values in a data set which is noted to have serious outliers. is this a feasible research topic? Kindly assist with any suggestions.

Thanks

Nicholas Pindar Dibal

April 16, 2011 at 1:42 pm |

Nicholas,

There are no hard and fast rules for dealing with outliers and missing values. In fact, most data sets will contain both. Many of the processes are similar for treating both. The key thing you need to ask yourself before imputing data is, “will the values I impute be representative of the sample at large.” If you go about imputing data haphazardly, it could materially change your results.

Alex