As I mentioned in yesterday’s post, data is fraught with problems. There are records that have missing values, outliers, or other characteristics that require you to work with the data in various ways. Today, I am going to discuss what to do about missing data.
Records have missing data for lots of reasons. Perhaps a customer doesn’t provide a home phone number because he or she uses a cell phone most of the time, or doesn’t want to be bothered with sales calls and so withholds that number from your business. Sometimes customers are just too new to have generated any data yet. If your cell phone provider wanted to examine the past 12 months’ usage of cell phone plans, but you only switched to them two months ago, you will not have generated much data for them to work with. Data may also be incomplete at the source. This is true of third-party overlay data like that provided by Experian or Acxiom. Some customers have demographic and purchase pattern information more readily available than others, so overlay data will not be complete. And, sometimes, data just isn’t collected at all because it may not have been of interest until now.
Missing data causes problems in data mining and predictive modeling. If a significant percentage of the records have null or missing values for a particular field, say date of birth, then it might be difficult for a business to build a statistical model using age as a predictor or classification variable. Some data mining packages omit entire observations from regression models because one or more of the observation’s predictor variables has a missing value. Depending on the number of observations in your model and the number that are kicked out, you can see your model’s degrees of freedom greatly reduced.
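To make the cost of dropping incomplete observations concrete, here is a minimal sketch in plain Python. The records, field names, and helper functions are all hypothetical and made up for illustration; they are not taken from any particular data mining package. The sketch measures how much of each field is missing and then counts how many records survive when any record with a missing predictor is omitted.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "date_of_birth": "1975-03-02", "home_phone": None, "months_active": 14},
    {"id": 2, "date_of_birth": None, "home_phone": "555-0101", "months_active": 30},
    {"id": 3, "date_of_birth": "1988-11-20", "home_phone": "555-0102", "months_active": 2},
    {"id": 4, "date_of_birth": "1962-07-15", "home_phone": None, "months_active": 8},
    {"id": 5, "date_of_birth": None, "home_phone": "555-0103", "months_active": 21},
]


def missing_share(rows, field):
    """Fraction of rows whose value for `field` is missing (None)."""
    return sum(row[field] is None for row in rows) / len(rows)


def complete_cases(rows, fields):
    """Rows with no missing value in any of `fields` (listwise deletion)."""
    return [row for row in rows if all(row[f] is not None for f in fields)]


predictors = ["date_of_birth", "home_phone", "months_active"]

# Share of missing values per predictor.
shares = {field: missing_share(records, field) for field in predictors}

# Records a package would keep if it drops any observation with a gap.
kept = complete_cases(records, predictors)

print(shares)
print(f"{len(kept)} of {len(records)} records survive listwise deletion")
```

In this toy data no single field is missing in more than two of the five records, yet only one record is complete on all three predictors. Gaps spread across several variables can add up quickly, which is why it is worth checking both numbers before fitting anything.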
There’s no one approach to correcting for missing data. As I said yesterday, it depends a lot on your business problem and your timeframe. There are a handful of approaches that are frequently used when dealing with missing data. Among them:
Just ignoring it. If the number of observations with missing values is small, then you might be able to get by without making any changes. The lack of data probably won’t impact your results very much.
Deleting the observations with missing data. This is similar to the automatic approach of those data mining packages mentioned above. I almost never recommend this approach, as it can introduce selection bias into your model. The observations that remain may not be representative of your relevant population.
Ignoring the variable. If a large number of observations have null or missing values for a given variable, it may be best to exclude it from your analysis.
Imputing the values. I do a lot of imputations with missing values. However, this approach has drawbacks of its own. Basically, when you impute a value, you are predicting what that observation’s value is. For example, if you were working with lending data by census tract and discovered that a handful of loan applicants’ incomes were not forwarded to you, you might try to make educated guesses at what the incomes were. So, you might look at the census tracts in which the applicants with missing data live. You might then look for the median income of the census tract from the U.S. Census, as well as any other demographics, and then substitute that value. It won’t be perfect, but it might be close.
Imputation has problems, however. The lending example I gave was just that – an example. If you’re a lender doing a credit risk analysis of applicants, such imputation would be unacceptable – and even illegal. For marketing, such imputation may be acceptable. Also, if a large number of observations are missing values for a given variable, then imputation may only make the problem worse.
Building separate models. Another approach would be to separate the observations based on the data they have available and then conduct separate analyses or build separate models for each group.
Waiting until you can collect the data you need. This is a problem if you need results right away, but sometimes, it’s all you can do.
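Of the approaches above, imputation is the one that most benefits from a small illustration. The following is a minimal sketch of the census tract example, assuming a lookup of median income by tract is already on hand. The tract names, income figures, and the `impute_income` helper are all made up for illustration; a real lookup would come from published Census figures. The sketch also records which values were imputed, so that educated guesses are never mistaken for observed data later on.

```python
# Made-up median incomes by census tract, standing in for a U.S. Census lookup.
tract_median_income = {"tract_A": 52000, "tract_B": 78000}

# Hypothetical applicants; None marks an income that was not forwarded.
applicants = [
    {"id": 1, "tract": "tract_A", "income": 48000},
    {"id": 2, "tract": "tract_A", "income": None},
    {"id": 3, "tract": "tract_B", "income": None},
    {"id": 4, "tract": "tract_C", "income": None},  # tract not in the lookup
]


def impute_income(rows, medians):
    """Return copies of rows with missing income filled from the tract median.

    Each returned row gets an `income_imputed` flag. Rows whose tract has no
    median in the lookup are left missing rather than guessed at.
    """
    result = []
    for row in rows:
        filled = dict(row)  # copy, so the original records stay untouched
        filled["income_imputed"] = False
        if filled["income"] is None and filled["tract"] in medians:
            filled["income"] = medians[filled["tract"]]
            filled["income_imputed"] = True
        result.append(filled)
    return result


imputed = impute_income(applicants, tract_median_income)

for row in imputed:
    print(row)
```

Keeping the flag and leaving unmatched tracts empty are design choices, not requirements. They make it easy to rerun the analysis with and without the imputed records and to see how much the results lean on the substituted values.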
Missing data is one of the challenges of data mining and predictive modeling. Because the absence of data reduces the information available to us, we often need to do something to make up for it. However, we must realize that our remedies for missing data create problems of their own and can actually cause even more harm if we do not deploy these remedies with careful thought.