The Challenges of Data Mining

September 7, 2010

For the last five weeks, I have been onsite at a client – a regional commercial bank – building statistical models that will help them predict which noncustomers have a good chance of becoming customers; the best first product – checking account, home equity line, savings account, or CD – to offer them; and the revenue this noncustomer may bring to the bank in his/her first year. Although the project is still ongoing, it’s been one of the most fascinating I’ve worked on in a long time, not only because I love the challenge of building predictive models, but also because it reminds me of the challenges and careful considerations one must take into account when using data for strategic advantage. If you’re exploring data mining for your company’s marketing, finance, or operations efforts, you’ll appreciate these principles I’m about to share with you.

Know What You’re Trying to Learn From the Data

Back in the early 1990s, it was estimated that the amount of information available to the human race doubled every seven years. With the evolution of mobile technology, the Web, social media, and other “smart” technology, I wouldn’t be surprised if the amount of data now doubles every seven weeks! The constantly increasing amount of data creates new possibilities for analysis, but we must be careful not to overwhelm ourselves with volume just for the sake of analysis. Lots of data is available to us, but not all of it is relevant to our business purposes.

This is why it is so important to define what you’re trying to learn from the data before you attempt any analysis. One of the challenges we faced at the bank was defining a “new customer” and a “noncustomer”. We had to decide on a timeframe for when the new customer opened the account. If we picked too short a timeframe (say, March through June 2010), we would have too few data points to work with; if we picked too long a timeframe (say, all the way back to 2008), many of our “new” customers wouldn’t be so new, and their banking habits and behaviors might be very different from those of customers who opened their accounts more recently, which could have undesirable results for the model.

Know the Unit of Analysis for Your Project

Many large companies store data at many different levels. Banks have data at the transaction, account, individual, household, ZIP code, census tract, territory, state, and regional levels. What is the appropriate unit of analysis? It depends on your definition of the business problem and the levels of data you have access to. If the bank were trying to demonstrate compliance with lending requirements set by the Community Reinvestment Act (CRA), then it would need to analyze data at the census tract level. This bank, however, was trying to acquire new adult customers. Since the bank was interested in acquiring any adult within a household, the household was a suitable unit of analysis. Besides, the only data available about prospects was overlay data from a third-party data vendor, which is at the household level.

Sometimes you need data at a level of analysis that you don’t have. Let’s say that you need to analyze at the ZIP code level, but only have household data. In that case, you need to roll up or summarize your data at that level – that is, you aggregate your data to the appropriate level of granularity. But what if you needed data at a lower level of granularity than what you currently have, like having ZIP code data but needing a household level of analysis? Unless you have a way of segmenting your ZIP code data down to the household level, you either cannot perform the analysis, or you must start from scratch collecting the household-level data.
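A roll-up like this is simple to sketch in code. The example below aggregates household-level records up to the ZIP code level; the ZIP codes and balances are made up for illustration, not the bank’s data:

```python
from collections import defaultdict

# Hypothetical household-level records: (zip_code, balance) pairs.
households = [
    ("60601", 5000.0), ("60601", 7000.0),
    ("60602", 3000.0), ("60602", 4000.0), ("60602", 2000.0),
]

def roll_up(records):
    """Summarize household-level balances at the ZIP code level."""
    summary = defaultdict(lambda: {"households": 0, "total_balance": 0.0})
    for zip_code, balance in records:
        summary[zip_code]["households"] += 1
        summary[zip_code]["total_balance"] += balance
    return dict(summary)

zip_summary = roll_up(households)
```

Going the other direction – splitting ZIP-level figures into households – is not possible without additional information, which is exactly the asymmetry described above.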

Understand the Business Rules Governing Use of Data

Many businesses are subject to laws, regulations, and internal policies that restrict the activities they perform and the manner in which they perform them. Banks are no exception. In fact, I’d argue that banks are the rule! In running one of the models, I found that a household’s likelihood of acquiring a checking account was much greater if it was a married-couple household. But the bank’s policies forbade it from marketing products on the basis of marital status! So, we had to go back to the drawing board. Fortunately, we had a way of estimating the number of adults in the household, which we used as a proxy for marital status. What was nice about this approach was that it took into account different household dynamics: the presence of grown children residing with parents; an aged parent living in his/her adult child’s home; domestic partnerships; and cohabiting couples. As a result, we had a predictive model that did not discriminate against particular demographics.

Before you work with any data, it is vital that you talk to the compliance officers who oversee the restrictions on marketing, privacy, and other uses of customer data.

Understand the History of the Data

The history of a business and its practices often shapes the data you will have to analyze. Even if a business no longer engages in a particular practice, those past activities can affect the results you get from your model. A few months before this project with the bank, I was working on a mailing list for another client. I noticed that many – as much as five percent – of the customers on the list were age 98, which was very bizarre. So I looked at the birthdates of these nonagenarians, and almost all of them had the same birthdate: November 11, 1911. Their birthdates had been populated as 11/11/1111! What had happened was that this client had previously required a birthdate to be entered for each of its customers. When entering data, however, many past employees bypassed the requirement by keying eight 1s into the birthdate field! Although the practice of requiring birthdates had long been rescinded, the client’s data still reflected it. Without knowledge of this previous practice, a model based on age could have caused the client to market to a much older group of people than it should have.
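A simple defensive check can catch sentinel dates like these before modeling. In this sketch, the cutoff age, the reference date, and the sentinel list are illustrative assumptions, not the client’s actual rules:

```python
from datetime import date

# Placeholder birthdates known to come from past data-entry workarounds.
SENTINEL_DATES = {date(1911, 11, 11)}

def is_suspect_birthdate(birthdate, today=date(2010, 9, 7), max_age=110):
    """Flag sentinel birthdates, or birthdates implying an implausible age."""
    age = today.year - birthdate.year - (
        (today.month, today.day) < (birthdate.month, birthdate.day)
    )
    return birthdate in SENTINEL_DATES or age > max_age or age < 0
```

Note that the 11/11/1911 placeholder yields exactly age 98 as of this writing, which is why so many “nonagenarians” appeared on that mailing list.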

In defining “new customers” for the bank, we chose the beginning of 2009 as our earliest point. Why? Because the banking and financial crisis hit in the latter part of 2008. Had we included new customers from early 2008, we would have seen dramatic variances in their banking behaviors, balances, and transactions, which would have had adverse consequences for both our propensity and revenue models.

Accept that Data is Always Dirty

Just because volumes and volumes of data are available doesn’t mean it is ready to use. The example of the birthdates is one case in point. The example of the data granularity is another. Still, there are other problems. Some customers and noncustomers might not have any data recorded in a particular field. For some of the bank’s customers and prospects, the third-party overlay data did not contain age, income, or gender information. For the entire data set, the values in the fields for income, wealth, and home market values were quite spread out. Some had extremely high values in those fields; others extremely low values. As a result of these extreme and missing data, we needed to make adjustments so that the models would not produce undesirable results or suspect predictions.

For the missing values, we computed the median values of all observations in the data set, and then substituted those. For the extreme values, we did a couple of things. For some values, we set up bins, such as “$0 – $5,000”; “$5,001-$10,000” and so on. For others, we took the natural log. Still, for others, we computed ratios, such as a household’s savings-to-income ratio. These approaches helped to reduce the variation in the data.
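These adjustments are easy to illustrate. The snippet below, using made-up income figures, sketches median imputation, $5,000-wide binning, and a natural-log transform; the values, bin width, and labels are assumptions for illustration, not the bank’s actual rules:

```python
import math
from statistics import median

# Hypothetical overlay incomes; None marks a missing value.
incomes = [42000, None, 310000, 58000, None, 61000, 39000]

# Median imputation: fill missing values with the median of observed ones.
med = median(v for v in incomes if v is not None)
filled = [v if v is not None else med for v in incomes]

def to_bin(value, width=5000):
    """Map a dollar amount to a width-sized bin label, e.g. "$5,000-$10,000"."""
    lo = (value // width) * width
    return f"${lo:,.0f}-${lo + width:,.0f}"

# Natural-log transform: compress the extreme high values.
logged = [math.log(v) for v in filled]
```

A ratio such as savings-to-income would be computed the same way, dividing one filled field by another, observation by observation.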

Realize that Your Results are Only as Good as Your Assumptions

Anytime you analyze data to make decisions, you are making assumptions. Whenever we use data to construct a model, we assume that the patterns we discover in our data mining effort will hold for the new customers we seek to acquire. When we analyze time series data, we are assuming that patterns of the past and present will hold in the future. When we imputed the median values for missing data, we were assuming that those customers and prospects whose data was missing were like the “typical” customer. Making assumptions is a double-edged sword. We need to make them in order to direct our analyses and planning. Yet, if our assumptions are mistaken – and to some degree, every assumption is – our models will be useless. That’s why we must be very careful in our presuppositions about data and why we should test the results of our models before fully deploying them.

Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, when we did that with our model, we ended up with a model (not shown here) that has an R2=0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we haven’t fully eliminated the multicollinearity.
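With hypothetical donor figures, the substitution looks like this (the numbers are invented for illustration; the product simply stands in for the two correlated predictors in the regression):

```python
# Hypothetical donor history over the last 12 months.
avg_contribution = [25.0, 10.0, 50.0, 15.0]   # Average Contribution in Last 12 Months
times_donated    = [4, 12, 2, 6]              # Times Donated in Last 12 Months

# Replace the two collinear predictors with their product:
# Total Contributions in Last 12 Months.
total_contribution = [a * t for a, t in zip(avg_contribution, times_donated)]
```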

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be correlated with hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the mean of the hours-of-sleep variable and the mean of the hours-of-studying variable. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply the differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering the interaction term, which means you’re taking the central tendency of your data into account.
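The centering steps above can be sketched as follows; the hours are invented for illustration:

```python
from statistics import mean

# Hypothetical observations: hours of sleep and hours of studying.
sleep = [7.0, 5.5, 8.0, 6.0, 7.5]
study = [2.0, 4.0, 1.5, 3.5, 2.5]

sleep_mean, study_mean = mean(sleep), mean(study)

# Subtract each variable's mean, then multiply the centered values
# observation by observation to form the interaction term.
sleep_study_interaction = [
    (s - sleep_mean) * (h - study_mean) for s, h in zip(sleep, study)
]
```

The regression would then include `sleep`, `study`, and `sleep_study_interaction` as the three independent variables.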

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression on differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediately prior observation’s value from the current observation’s. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing reduces multicollinearity by removing the trend component of the time series: if all the independent variables followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. Subtracting the immediately prior period gives a “first difference”; differencing the first differences gives a “second difference,” and so on. Note also that differencing costs you one observation per pass, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
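First and second differences can be computed as below; the series is invented, and note that each pass of differencing drops one observation:

```python
def difference(series, order=1):
    """Take the order-th difference of a series; each pass loses one observation."""
    for _ in range(order):
        series = [curr - prev for prev, curr in zip(series, series[1:])]
    return series

# A hypothetical trending series.
trend = [100, 104, 109, 115, 122]
first_diff = difference(trend)       # [4, 5, 6, 7]
second_diff = difference(trend, 2)   # [1, 1, 1]
```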

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.
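Both ideas – a per-capita rate in place of a raw count, and candidate transformations for nonlinear relationships – can be sketched with invented city figures:

```python
import math

# Hypothetical city-level figures (invented for illustration).
violent_crimes = [1600, 2500, 400, 900]
population     = [200000, 500000, 80000, 150000]

# Per-capita incidence (here, per 1,000 residents) instead of the raw count.
crimes_per_1000 = [1000 * c / p for c, p in zip(violent_crimes, population)]

# Candidate transformations when a scatter plot suggests a nonlinear relationship.
log_crimes  = [math.log(c) for c in violent_crimes]
sqrt_crimes = [math.sqrt(c) for c in violent_crimes]
```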

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

  1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
  2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
  3. You should suspect multicollinearity if:
    1. You have a high R2 but low t-statistics;
    2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
  4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
  5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy introduces its own set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and the resulting impact on your forecast accuracy.
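As a companion to takeaway 2, a basic correlation-matrix check only requires the pairwise Pearson correlation, computed here from scratch on made-up predictor series:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov  = sum((a - mx) * (b - my) for a, b in zip(x, y))
    varx = sum((a - mx) ** 2 for a in x)
    vary = sum((b - my) ** 2 for b in y)
    return cov / (varx * vary) ** 0.5

# Hypothetical predictors: x2 tracks x1 closely, so we expect a near-perfect correlation.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]

r12 = pearson(x1, x2)
```

A pairwise correlation among independent variables approaching 1 (or -1), especially one larger in magnitude than either variable’s correlation with the dependent variable, is the warning sign described above.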

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.


If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.