The Challenges of Data Mining

For the last five weeks, I have been onsite at a client – a regional commercial bank – building statistical models that will help them predict which noncustomers have a good chance of becoming customers; the best first product – checking account, home equity line, savings account, or CD – to offer them; and the revenue this noncustomer may bring to the bank in his/her first year. Although the project is still ongoing, it’s been one of the most fascinating I’ve worked on in a long time, not only because I love the challenge of building predictive models, but also because it reminds me of the challenges and careful considerations one must take into account when using data for strategic advantage. If you’re exploring data mining for your company’s marketing, finance, or operations efforts, you’ll appreciate these principles I’m about to share with you.

Know What You’re Trying to Learn From the Data

Back in the early 1990s, there was a statistic that the amount of information available to the human race doubled every seven years. With the evolution of mobile technology, the Web, social media, and other “smart” technology, I wouldn’t be surprised if now the amount of data doubles every seven weeks! The constantly increasing amount of data creates new possibilities for analysis, but we must be very careful not to overwhelm ourselves with the volume of data just for the sake of analysis. Lots of data are available to us, but not all of it is relevant to our business purposes.

This is why it is so important to define what you’re trying to learn from the data before you attempt any analysis. One of the challenges we faced at the bank was defining a “new customer” and a “noncustomer”. We had to decide on the timeframe in which a new customer opened the account. If we picked too short a timeframe (say, March through June 2010), we would have too few data points to work with. If we picked too long a timeframe (say, all the way back to 2008), many of our “new” customers wouldn’t be so new. Their banking habits and behaviors might be very different from those of customers who opened their accounts more recently, and that could distort the model.

Know the Unit of Analysis for Your Project

Many large companies have data at many levels. Banks have data at the transaction, account, individual, household, ZIP code, census tract, territory, state, and regional levels. What is the appropriate unit of analysis? It depends on your definition of the business problem and the levels of data you have access to. If the bank were trying to demonstrate compliance with lending requirements set by the Community Reinvestment Act (CRA), then it would need to analyze data at the census tract level. However, the bank is trying to acquire new adult customers. Since the bank is interested in acquiring any adult within a household, the household was a suitable unit of analysis. Besides, the only data available about prospects was overlay data from a third-party data vendor, which is at the household level.

Sometimes you need data at a level of analysis that you don’t have. Let’s say that you need to analyze at the ZIP code level, but only have household data. In that case, you need to roll up, or summarize, your data to that level – that is, you aggregate your data to the appropriate level of granularity. But what if you need data at a finer level of granularity than what you currently have, like having ZIP code data but needing a household-level analysis? Unless you have a way of breaking your ZIP code data down to the household level, you either cannot perform the analysis, or you must start from scratch collecting the household-level data.
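To make the roll-up idea concrete, here is a minimal sketch in Python. The records and field names (household_id, zip, income) are made up for illustration and are not the bank's actual data. The sketch summarizes household rows into one row per ZIP code.

```python
from collections import defaultdict
from statistics import median

# Hypothetical household-level records; the field names are illustrative only.
households = [
    {"household_id": 1, "zip": "60601", "income": 52000},
    {"household_id": 2, "zip": "60601", "income": 88000},
    {"household_id": 3, "zip": "60601", "income": 61000},
    {"household_id": 4, "zip": "60614", "income": 120000},
    {"household_id": 5, "zip": "60614", "income": 95000},
]

def roll_up_by_zip(records):
    """Summarize household records to one row per ZIP code."""
    incomes_by_zip = defaultdict(list)
    for rec in records:
        incomes_by_zip[rec["zip"]].append(rec["income"])
    # One summary row per ZIP: household count and median household income.
    return {
        zip_code: {"households": len(incomes), "median_income": median(incomes)}
        for zip_code, incomes in incomes_by_zip.items()
    }

zip_summary = roll_up_by_zip(households)
for zip_code, summary in zip_summary.items():
    print(zip_code, summary)
```

Notice that the roll-up only works in one direction. Once the data is summarized to the ZIP code level, the individual household values cannot be recovered from the summary.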

Understand the Business Rules Governing Use of Data

Many businesses are subject to laws, regulations, and internal business policies that restrict the activities they perform and the manner in which they perform those activities. Banks are no exception. In fact, I’d argue that banks are the rule! In running one of the models, I found that a household’s likelihood of acquiring a checking account was much greater if it was a married-couple household. But the bank’s policies forbid it from marketing products on the basis of marital status! So, we had to go back to the drawing board. Fortunately, we had a way of estimating the number of adults in the household, which we used as a proxy for marital status. What was nice about this approach was that it took into account different household dynamics: the presence of grown children residing with parents; an aged parent living in his/her adult child’s home; domestic partnerships; and cohabiting couples. As a result, we had a predictive model that would not discriminate against certain demographic groups.

Before you work with any data, it is vital that you talk to the compliance officers who oversee the restrictions on marketing, privacy, and other uses of customer data.

Understand the History of the Data

The history of a business and its practices often shapes the data you will have to analyze. Even if a business no longer engages in a particular practice, those past activities can still affect the results you get from your model. A few months before this project with the bank, I was working on a mailing list for another client. I noticed that many – as many as five percent – of the customers on the list were age 98, which was very bizarre. So I looked at the birthdates of these nonagenarians, and almost all of them had the same birthdate: November 11, 1911. Their birthdates had been populated as 11/11/1111! What happened was that this client had previously required a birthdate to be entered for each of its customers. However, when data was being entered, many past employees bypassed the requirement by entering eight 1s into the birthdate field! Although the requirement had long been rescinded, the client’s data still reflected that past practice. Without knowledge of it, a model based on age could have caused the client to market to a much older group of people than it should have.
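A quick frequency check can surface placeholder values like these before they reach a model. The following is a minimal sketch in Python, not the method used on the project: the sample data is invented, and the two percent cutoff is an arbitrary assumption you would tune for your own list.

```python
from collections import Counter
from datetime import date

def suspicious_birthdates(birthdates, threshold=0.02):
    """Return each birthdate shared by more than `threshold` of all records,
    mapped to its share of the list. The 2% default is an assumption."""
    counts = Counter(birthdates)
    total = len(birthdates)
    return {d: n / total for d, n in counts.items() if n / total > threshold}

# Hypothetical mailing list: 95 distinct birthdates plus 5 placeholder entries.
birthdates = [date(1950 + i % 40, 1 + i % 12, 1 + i % 28) for i in range(95)]
birthdates += [date(1911, 11, 11)] * 5

flagged = suspicious_birthdates(birthdates)
for birthdate, share in flagged.items():
    print(f"{birthdate} appears on {share:.0%} of records")
```

In a real customer list, a single birthdate shared by several percent of records is far more likely to be a data entry artifact than a coincidence, so flagged values deserve a conversation with the people who know how the data was captured.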

In defining “new customers” for the bank, we chose the beginning of 2009 as our earliest point. Why? Because the banking and financial crisis struck in the latter part of 2008. Had we included new customers from 2008, we would have seen dramatic variances in their banking behaviors, balances, and transactions, which would have had adverse consequences for both our propensity and revenue models.

Accept that Data is Always Dirty

Just because volumes and volumes of data are available doesn’t mean they are ready to use. The example of the birthdates is one case in point. The example of the data granularity is another. Still, there are other problems. Some customers and noncustomers might not have any data recorded in a particular field. For some of the bank’s customers and prospects, the third-party overlay data did not contain age, income, or gender information. For the entire data set, the values in the fields for income, wealth, and home market value were quite spread out. Some households had extremely high values in those fields; others had extremely low values. As a result of these extreme and missing values, we needed to make adjustments so that the models would not produce undesirable results or suspect predictions.

For the missing values, we computed the median of all observed values for that field in the data set, and then substituted it. For the extreme values, we did a couple of things. For some fields, we set up bins, such as “$0 – $5,000”, “$5,001 – $10,000”, and so on. For others, we took the natural log. For still others, we computed ratios, such as a household’s savings-to-income ratio. These approaches helped to reduce the variation in the data.
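Here is a minimal sketch of those four adjustments in Python. It is an illustration under stated assumptions rather than the code used on the project: the function names and sample figures are invented, dollar amounts are assumed to be whole and non-negative, and the natural log assumes strictly positive values.

```python
import math
from statistics import median

def impute_median(values):
    """Replace missing values (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def bin_label(amount, width=5000):
    """Assign a whole-dollar amount to a bin: $0 - $5,000, $5,001 - $10,000, ..."""
    if amount <= width:
        return f"$0 - ${width:,}"
    lower = ((amount - 1) // width) * width + 1
    return f"${lower:,} - ${lower + width - 1:,}"

def natural_log(value):
    """Compress a long right tail; only defined for values greater than zero."""
    return math.log(value)

def savings_to_income(savings, income):
    """Ratio of savings to income; None when income is zero or missing."""
    if not income:
        return None
    return savings / income

# Hypothetical household incomes, two of them missing.
incomes = [40000, None, 250000, 60000, None]
filled = impute_median(incomes)
print(filled)                                 # [40000, 60000, 250000, 60000, 60000]
print(bin_label(7500))                        # $5,001 - $10,000
print(round(natural_log(250000), 2))
print(savings_to_income(15000, 60000))        # 0.25
```

The median is used instead of the mean because one very wealthy household, like the 250,000 figure above, would pull the mean upward and make the substituted value less typical.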

Realize that Your Results are Only as Good as Your Assumptions

Anytime you analyze data to make decisions, you are making assumptions. Whenever we use data to construct a model, we assume that the patterns we discover in our data mining effort will hold for the new customers we seek to acquire. When we analyze time series data, we are assuming that patterns of the past and present will hold in the future. When we imputed the median values for missing data, we were assuming that those customers and prospects whose data was missing were like the “typical” customer. Making assumptions is a double-edged sword. We need to make them in order to direct our analyses and planning. Yet, if our assumptions are mistaken – and to some degree, every assumption is – our models will be useless. That’s why we must be very careful in our presuppositions about data and why we should test the results of our models before fully deploying them.



2 Responses to “The Challenges of Data Mining”

  1. V Says:

    I’m interested in learning what’s your approach in general when it comes to missing values. Sometimes missing values for certain fields, when classified at a user level separately, can generate insights and help differentiate users who enter all values vs those who do not. Imputation assumes that values are missing at random, whereas an analysis might show an interesting trend or purposeful omission of data by a particular user group.

    Keep posting, there’re hardly enough blogs out there that discuss analytics as well as this one does.

    • analysights Says:


      First, let me thank you for your compliment about Insight Central – we try very hard to demystify analytics!

      You are absolutely right about the useful insights that missing values can provide, as well as your point about imputation. There are no hard and fast rules about what to do about missing data. It really depends on the objectives of the project you’re working on. How we handle missing values for one project can be very different from how we handle them for another.
