As technical people, we often use a not-so-technical phrase to describe the use of bad data in our analyses: “Garbage in, garbage out.” Anytime we build a model or perform an analysis on data that is dirty or incorrect, the results will be undesirable. Data has many opportunities to get murky, and a major cause is the way the business collects and stores it. And dirty data isn’t always incorrect data: the way a company enters data can be correct for operational purposes, yet useless for a particular analysis being done for, say, the marketing department. Here are some examples:
The Return That Wasn’t a Return
I was recently at an outlet store buying some shirts for my sons. After walking out, I realized the sales clerk had rung up the full price, not the sale price. I went back to the store to have the difference refunded. The clerk re-scanned the receipt, cancelled the previous sale, and re-rang the shirts at the sale price. Although I ended up with the same outcome – a refund – I thought about the problems this process could cause.
What if the retailer wanted to predict the likelihood of merchandise returns? My transaction, which was actually a price adjustment, would be treated as a return. If this happens often enough, a particular store could be flagged as having above-average returns relative to comparable stores, and be required to implement more stringent return policies that were never necessary to begin with.
And consider the flip side of this process: by treating the erroneous ring-up as a return, the retailer won’t be alerted to the possibility that clerks at this store are making mistakes at the register, that sale prices aren’t being entered into the store’s system, or that the equipment storing price updates isn’t functioning properly.
And processing the price adjustment the way the clerk did actually creates even more data that needs to be stored: the initial transaction, the return transaction, and the corrected transaction.
The Company With Very Old Customers
Some years ago, I worked for a company that did direct mailings. I needed to conduct an analysis of its customers and identify the variables that predicted those most likely to respond to a solicitation. The company collected the birthdates of its customers. From that field, I calculated the age of each individual customer. And I found that nearly ten percent of its customers were quite old – much older than the market segments the company targeted. A deeper dive on the birthdate field revealed that virtually all of them had the same birthdate: November 11, 1911. (This was back around the turn of the millennium, when companies still recorded dates with two-digit years.)
How did this happen? Well, as discussed in the prior post on problem definition, I consulted the company’s “data experts.” I learned that the birthdate field was a required field for first-time customers: the call center representative could not move past it to the next field unless a value was entered. Hence, many representatives simply entered “11-11-11” to bypass the field when a first-time customer refused to give his or her birthdate.
In this case, the company’s requirement to collect birthdate information met sharp resistance from customers, causing the call center to enter dummy data to get around the operational constraints. Incidentally, the company later made the birthdate field optional.
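A simple frequency count on the birthdate field is what surfaces a dummy value like this. Here is a minimal sketch of that kind of check; the sample data, threshold, and variable names are all hypothetical, not taken from the company’s actual system:

```python
from collections import Counter

# Hypothetical birthdate values as keyed in by the call center (MM-DD-YY).
birthdates = [
    "04-23-65", "11-11-11", "07-02-72", "11-11-11", "11-11-11",
    "01-15-58", "11-11-11", "09-30-80", "11-11-11", "03-08-69",
]

# Count how often each distinct birthdate appears.
counts = Counter(birthdates)

# Flag any single value that accounts for an implausibly large share of
# records -- a telltale sign of a dummy entry. The 5% cutoff is illustrative.
threshold = 0.05
suspicious = {
    value: n for value, n in counts.items()
    if n > 1 and n / len(birthdates) > threshold
}
print(suspicious)  # the dummy date stands out: {'11-11-11': 5}
```

In a real analysis the same idea scales up: sort the frequency distribution of any “identity” field and eyeball the top values before trusting anything derived from it.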
Customers Who Hadn’t Purchased in Almost a Century
Back in the late 1990s, I went to work for a catalog retailer, building response models. The cataloger was concerned that its models were generating undesirable results. I tried running the models with its data and confirmed the models to be untrustworthy. So I started running frequency distributions on all its data fields. To my surprise, I found a field, “Months since last purchase,” in which many customers had the value “999.” Wow – many customers hadn’t purchased since 1916 – almost 83 years earlier!
I knew immediately what had happened. In the past, when data was often read into systems from magnetic tape, programs required every field in a record to be populated. If the value for a particular field was missing, the value for the next field would be read into its place, and so on; when the program reached the end of the record, it would often continue into the next record, reading values from there until all fields for the previous record were filled. This was a data nightmare.
To get around this, fields whose data was missing or unknown were filled with a series of 9s, so that all the other data would be read into the system correctly. This practice was fine and dandy, as long as the company’s analysts accounted for it during their analysis. The cataloger, however, ran its regressions with those ‘999s’ left in place, producing serious outliers and regressions of little value.
In this case, the cataloger’s attempt to rectify one data malady created a new one. I corrected this by recoding the values: I broke the customers whose last purchase date was known into intervals and assigned ranking values – a 1 for the most recent customers, a 2 for the next most recent, a 3 after that, and so forth – giving the lowest rank to those whose last purchase was unknown.
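The recoding described above can be sketched as a small function. The exact bin edges I used back then aren’t recorded here, so the intervals below are illustrative; only the idea – ranked recency bins, with the 999 sentinel pushed to the lowest rank – follows the story:

```python
def recode_recency(months_since_last_purchase):
    """Map 'months since last purchase' to a rank: 1 = most recent.

    The sentinel 999 (missing/unknown) gets the lowest rank.
    Bin edges are illustrative, not the original model's.
    """
    if months_since_last_purchase == 999:  # missing-data sentinel
        return 5                           # lowest rank: unknown
    if months_since_last_purchase <= 6:
        return 1                           # purchased within 6 months
    if months_since_last_purchase <= 12:
        return 2
    if months_since_last_purchase <= 24:
        return 3
    return 4                               # older than two years

values = [3, 11, 999, 30, 18]
print([recode_recency(v) for v in values])  # [1, 2, 5, 4, 3]
```

The point of the rank encoding is that the regression now sees an ordered variable with no wild outliers: a customer with an unknown purchase date sits one step below the oldest known customers instead of 83 years away from everyone else.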
The Moral of the Story
Company policy is a major cause of dirty data. These examples – which are far from comprehensive – illustrate how the way data is entered can cause problems. Often, a data fix proves shortsighted, causing new problems down the road. This is why it is so important for analysts to consult the company’s data experts before undertaking any major data mining effort. Knowing how a company collects and stores its data, and making allowances for it, will increase the likelihood of a successful data mining effort.