Posts Tagged ‘data analysis’

“Big Data” Benefits Small Businesses Too

May 8, 2014

(This post appeared on our successor blog, The Analysights Data Mine, on Monday, May 5, 2014).

One misconception about “big data” is that it is only for large enterprises. On its face, that claim sounds logical; in reality, however, “big data” is just as vital to a small business as it is to a major corporation. While the amount of data a small business generates is nowhere near as large as what a large corporation might generate, a small business can still analyze that data to find insightful ways to run more efficiently.

Imagine a family restaurant in your local town. Such a restaurant may not have a loyalty card like a chain restaurant; it may not have any process by which to target customers; in fact, the restaurant may not even be computerized. But the restaurant still generates a LOT of useful data.

What is the richest source of the restaurant’s data? The check on which the server records each table’s orders. If the restaurant saves these checks, the owner can tally the entrées, appetizers, and side orders sold during a given period. From these tallies, the restaurateur can learn a lot of useful information, such as:

  • Which entrées are most commonly sold?
  • What side dishes are most commonly ordered with a particular entrée?
  • What is the most popular entrée sold on a Friday or Saturday night?
  • How many refills does a typical table order?
  • What is the average number of patrons per table?
  • What are the busiest and slowest nights/times of the week?
  • How many tables and/or patrons come in on a particular night of the week?

Information like this can help the restaurateur estimate how many of each entrée to prepare on a given day; order sufficient ingredients for those entrées and menu items; forecast business volume for various nights of the week; and staff accordingly.
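For illustration, here is a minimal sketch in Python (using the pandas library) of how those checks, once keyed into a simple spreadsheet, could be tallied to answer a few of the questions above. The column names and figures are hypothetical, not from any real point-of-sale system:

```python
import pandas as pd

# Toy line items keyed in from saved checks; columns and values are made up.
checks = pd.DataFrame({
    "check_id":   [1, 1, 2, 2, 3, 3],
    "date":       pd.to_datetime(["2014-05-02", "2014-05-02", "2014-05-02",
                                  "2014-05-02", "2014-05-03", "2014-05-03"]),
    "party_size": [2, 2, 4, 4, 3, 3],
    "category":   ["entree", "side", "entree", "entree", "entree", "side"],
    "item":       ["meatloaf", "fries", "meatloaf", "lasagna",
                   "lasagna", "salad"],
})

# Which entrées sell most?
entrees = checks[checks["category"] == "entree"]
print(entrees["item"].value_counts())

# Busiest nights of the week, by number of checks
print(checks.groupby(checks["date"].dt.day_name())["check_id"].nunique())

# Average number of patrons per table
print(checks.drop_duplicates("check_id")["party_size"].mean())
```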

In addition, such information can aid menu planning and upgrades. For example, the restaurant owner can use the above information to look for commonalities among the most popular items. Perhaps the most popular entrées each feature some prominent ingredient. In that case, the restaurant can direct its chef to test new entrées and menu items built around that ingredient. Moreover, if particular entrées are not selling well, the restaurant owner can try featuring or promoting them in some way, or discontinue them altogether.

Also, in the age of social media, sites like Yelp and TripAdvisor can provide the restaurateur with free market research. If customers are complaining about long waits for service, the restaurateur may use that feedback to increase staffing or provide extra training to the waitstaff. If reviewers are raving about specific menu items, the restaurateur can promote those items or create new entrées that are similar.

“Big Data” is a subjective and relative term. The data collected by a small family restaurant is usually not large enough to warrant statistical tools such as SAS or SPSS, but it is still rich enough to provide valuable insights that help a small business operate successfully.

 

******************************************************************************************************************************

Follow Analysights on Facebook and Twitter!

Now you can keep track of new posts on either Insight Central or our successor blog, The Analysights Data Mine, by simply “Liking” us on Facebook (look for Analysights) or by following @Analysights on Twitter.  Each time a new post is published, you will find out about it in your Facebook news feed or your Twitter feed.  Thank you for following our blog; we look forward to following you on Twitter as well!

Company Practices Can Cause “Dirty” Data

April 28, 2014

As technical people, we often use a not-so-technical phrase to describe the use of bad data in our analyses: “Garbage in, garbage out.” Anytime we build a model or perform an analysis on data that is dirty or incorrect, we get undesirable results. Data has many opportunities to get murky, and a major cause is the way the business collects and stores it. Dirty data isn’t always incorrect data, either: the way a company enters data can be correct for operational purposes, yet useless for a particular analysis being done for, say, the marketing department. Here are some examples:

The Return That Wasn’t a Return

I was recently at an outlet store buying some shirts for my sons. After walking out, I realized the sales clerk rang up the full, not sale, price. I went back to the store to have the difference refunded. The clerk re-scanned the receipt, cancelled the previous sale and re-rang the shirts at the sale price. Although I ended up with the same outcome – a refund – I thought of the problems this process could cause.

What if the retailer wanted to predict the likelihood of merchandise returns? My transaction, which was actually a price adjustment, would be treated as a return. Depending on how often this happens, a particular store could be flagged as having above-average returns relative to comparable stores, and be required to implement more stringent return policies that weren’t necessary to begin with.

And consider the flipside of this process: by treating the erroneous ring-up as a return, the retailer won’t be alerted to the possibility that clerks at this store may be making mistakes in ringing up information; perhaps sale prices aren’t being entered into the store’s system; or perhaps equipment storing price updates isn’t functioning properly.

And processing the price adjustment the way the clerk did actually creates even more data that needs to be stored: the initial transaction, the return transaction, and the corrected transaction.
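A rough sketch of how that flagging might work, and how price adjustments logged as returns distort it. The transaction log, stores, and threshold are all made up for illustration:

```python
import pandas as pd

# Hypothetical transaction log; the "return" row is really a price
# adjustment rung up as cancel-and-re-ring, as in the story above.
tx = pd.DataFrame({
    "store": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "type":  ["sale", "return", "sale", "sale",
              "sale", "sale", "sale", "sale"],
})

# Share of transactions recorded as returns, per store
return_rate = tx["type"].eq("return").groupby(tx["store"]).mean()

# Stores above the overall average get flagged -- store A looks like a
# high-return store even though its "return" was only a price adjustment
print(return_rate[return_rate > return_rate.mean()])
```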

The Company With Very Old Customers

Some years ago, I worked for a company that did direct mailings. I needed to analyze its customers and identify the variables that predicted which of them were most likely to respond to a solicitation. The company collected the birthdates of its customers, and from that field I calculated each customer’s age. I found that nearly ten percent of the company’s customers were quite old – much older than the market segments the company targeted. A deeper dive on the birthdate field revealed that virtually all of them had the same birthdate: November 11, 1911. (This was back around the turn of the millennium, when companies still recorded dates with two-digit years.)

How did this happen? Well, as discussed in the prior post on problem definition, I consulted the company’s “data experts.” I learned that the birthdate field was a required field for first-time customers. The call center representative could not move from the birthdate field to the next field unless values were entered into the birthdate field. Hence, many representatives in the call center simply entered “11-11-11” to bypass the field when a first-time customer refused to give his or her birthdate.

In this case, the company’s requirement to collect birthdate information met sharp resistance from customers, causing the call center to enter dummy data to get around the operational constraints. Incidentally, the company later made the birthdate field optional.
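A quick frequency distribution is all it takes to catch this kind of dummy value. Here is a minimal pandas sketch, with made-up data standing in for the company’s customer file:

```python
import pandas as pd

# Toy customer extract with a two-digit-year birthdate field; data invented.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "birthdate":   ["11-11-11", "03-22-64", "11-11-11",
                    "07-09-71", "11-11-11", "12-30-58"],
})

# One "birthday" shared by a large share of customers is a red flag,
# not a demographic fact
print(customers["birthdate"].value_counts())

# Treat the sentinel as missing rather than as a real 1911 birthdate
customers.loc[customers["birthdate"] == "11-11-11", "birthdate"] = pd.NA
```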

Customers Who Hadn’t Purchased in Almost a Century

Back in the late 1990s, I went to work for a catalog retailer, building response models. The cataloger was concerned that its models were generating undesirable results. I tried running the models with its data and confirmed the models to be untrustworthy. So I started running frequency distributions on all its data fields. To my surprise, I found a field, “Months since last purchase,” in which many customers had the value “999.” Wow – many customers hadn’t purchased since 1916 – almost 83 years earlier!

I knew immediately what had happened. In the past, when data was often read into systems from magnetic tape, programs required every field in a record to be populated. If the value for a particular field was missing, the value for the next field would be read into its place, and so on; when the program reached the end of the record, it would continue into the next record, pulling values from there until all the fields for the previous record were filled. This was a data nightmare.

To get around this, fields whose data was missing or unknown were filled with a series of 9s, so that all the other data would be read into the system correctly. This practice was fine and dandy, as long as the company’s analysts accounted for it during their analysis. The cataloger, however, would run its regressions using those 999s, resulting in serious outliers and regressions of little value.

In this case, the cataloger’s attempt to rectify one data malady created a new one. I corrected the problem by recoding the values: I broke the customers whose last purchase date was known into intervals and assigned rank values, a 1 for the most recent customers, a 2 for the next most recent, and so forth. Those whose last purchase date was unknown received the lowest rank.
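Here is a minimal sketch of that recoding in pandas, with made-up values; the four-interval split is illustrative, not the exact breaks I used:

```python
import pandas as pd

# "Months since last purchase", with 999 as the sentinel for "unknown"
months = pd.Series([2, 7, 14, 30, 999, 5, 999, 60])

known = months[months != 999]
# Break the known values into intervals and rank them: 1 = most recent
ranks = pd.qcut(known, q=4, labels=[1, 2, 3, 4]).astype(int)

# Give the lowest (worst) rank to customers whose last purchase is unknown
recoded = pd.Series(5, index=months.index)
recoded.loc[known.index] = ranks
print(recoded)
```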

The Moral of the Story

Company policy is a major cause of dirty data. These examples – which are far from comprehensive – illustrate how the way data is entered can cause problems. Often, a data fix proves shortsighted, as it causes new problems down the road. This is why it is so important for analysts to consult the company’s data experts before undertaking any major data mining effort. Knowing how a company collects and stores data and making allowances for it will increase the likelihood of a successful data mining effort.

Big Data Success Starts With Well-Defined Business Problem

April 18, 2014

(This post also appears on our successor blog, The Analysights Data Mine).

Lots of companies are jumping on the “Big Data” bandwagon; few of them, however, have given real thought to how they will use their data or what they want to achieve with the knowledge the data will give them.  Before reaping the benefits of data mining, companies need to decide what is really important to them.  In order to mine data for actionable insights, technical and business people within the organization need to discuss the business’ needs.

Data mining efforts and processes will vary, depending on a company’s priorities.  A company will use data very differently if its aim is to acquire new customers than if it wants to sell new products to existing customers, or find ways to reduce the cost of servicing customers.  Problem definition puts those priorities in focus.

Problem definition isn’t just about identifying the company’s priorities, however.  In order to help the business achieve its goals, analysts must understand the constraints (e.g., internal privacy policies, regulations, etc.) under which the company operates, whether the necessary data is available, whether data mining is even necessary to solve the problem, the audience at whom data mining is directed, and the experience and intuition of the business and technical sides.

What Does The Company Want to Solve?

Banks, cell phone companies, cable companies, and casinos collect lots of information on their customers.  But that data is of little value if they don’t know what they want to do with it.  In the banking industry, where acquiring new customers often means luring them away from another bank, a bank’s objective might be to cross-sell: to get its current depositors and borrowers to acquire more of its products, so that they will be less inclined to leave the bank.  If that’s the case, then the bank’s data mining effort will involve looking at the products its current customers have and the order and manner in which they acquired those products.

On the other hand, if the bank’s objective is to identify which customers are at risk of leaving, its data mining effort will examine the activity of departing households in the months leading up to their defection, and compare it to those households it retained.

If a casino’s goal is to decide on what new slot machines to install, its data mining effort will look at the slot machine themes its top patrons play most and use that in its choice of new slot machines.

Who is the Audience the Company is Targeting?

OK, so the bank wants to prevent customers from leaving.  But does it want to prevent all customers from leaving?  Usually, only a small percentage of households account for all of a bank’s profit; many banking customers are actually unprofitable.  If the bank wants to retain its most profitable customers, it needs to analyze only that subgroup of its customer base.  Predictions of its premier customers’ likelihood to leave would be highly inaccurate if based on a model developed on all its customers.  In this case, the bank would need to build a model on its most profitable customers alone.

Does the Problem Require Data Mining?

Data mining isn’t always needed.  Years ago, when I was working for a catalog company, I developed regression models to predict which customers were likely to order from a particular catalog.  When a model was requested for the company’s holiday catalog, I was told that it would go to 85 percent of the customer list.  When such a large proportion of the customer base – or the entire customer base for that matter – is to receive communication, then a model is not necessary.  More intuitive methods would have sufficed.

Is Data Available?

Before a data mining effort can be undertaken, the data necessary to solve the business problem must be available or obtainable.  If a bank wants to know the next best product to recommend to its existing customers, it needs to know the first product those customers acquired, how they acquired it, the length of time between their first and second products, then between their second and third, and so forth.  The bank also needs to know which products its customers acquired simultaneously (such as a checking account and a credit card), their current activity with those products, and the sequence of product acquisition (e.g., checking account first, savings account second, certificate of deposit third, etc.).
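As a sketch, the raw material for such an analysis could be as simple as one row per product opened, from which acquisition order and gaps can be derived. The table below is hypothetical:

```python
import pandas as pd

# Illustrative product-acquisition history; one row per product opened
acq = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2],
    "product":  ["checking", "savings", "cd", "checking", "credit card"],
    "opened":   pd.to_datetime(["2012-01-05", "2012-03-20", "2013-01-10",
                                "2012-06-01", "2012-06-01"]),
}).sort_values(["customer", "opened"])

# Sequence of acquisition and the gap (in days) between successive products
acq["gap_days"] = acq.groupby("customer")["opened"].diff().dt.days
print(acq)  # a gap of 0 days marks products acquired simultaneously
```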

It is extremely important that analysts consult both those on the business side and the IT department about the availability of data.  These internal experts often know what data is collected on customers, where it resides, and how it is stored.  In many cases, these experts may have access to data that doesn’t make it into the enterprise’s data warehouse.  And they may know what certain esoteric values for fields in the data warehouse mean.  Consulting these experts can save analysts a lot of time in understanding the data.

Under What Constraints Does the Business Operate?

Companies have internal policies regulating how they operate; are subject to regulations and laws governing the industries and localities in which they operate; and are also bound by ethical standards in those industries and locations.

Often, a company has access to data that, if used in making business decisions, can be illegal or viewed as unethical.  The company doesn’t acquire this data illegally; the data just cannot be used for certain business practices.

For example, I was building customer acquisition models for a bank a few years ago.  The bank’s data warehouse included summarized credit score statistics by block group, as defined by the U.S. Bureau of the Census.  However, banks are subject to the Community Reinvestment Act (CRA), a 1977 law passed to prevent banks from excluding low- to moderate-income neighborhoods in their communities from lending decisions.  Obviously, credit scores are going to be lower in lower-income areas.  Hence, under CRA guidelines, I could not use the summarized credit statistics to build a model for lending products.  I could, however, use those statistics for a model for deposit products; for post-campaign analysis, to see which types of customers responded to the campaign; and to demonstrate compliance with the CRA.

In addition, the bank’s internal policies did not allow the use of marital status in promoting products.  Hence, when using demographic data the bank had purchased, I had to ignore the “married” field when building my model.  In cases like these, less direct approaches can be used.  The purchased data also contained a field called “number of adults in the household.”  That field was entirely appropriate to use, since a household with two adults is not necessarily a married-couple household.

Again, the analyst must consult the company’s business experts in order to understand these operational constraints.

Are the Business Experts’ Opinions and Intuition Spot-On?

It’s often said that novices make mistakes out of ignorance and veterans make mistakes out of arrogance.  The business experts have a lot of experience in the company and a great deal of intuition, which can be very insightful.  However, they can be wrong too.  With every data mining effort, the data must be allowed to tell the story.  Does the data validate what the experts say?  For example, most checking accounts are automatically bundled with a debit card; a bank’s business experts know this; and the analysis will often bear this out.

However, if the business experts say that a typical progression in a customer’s banking relationship starts with demand deposit accounts (e.g., checking accounts) then consumer lending products (e.g., auto and personal loans), followed by time deposits (e.g., savings accounts and certificates of deposit), does the analysis confirm that?

 

Problem definition is the hardest, trickiest, yet most important prerequisite to getting the most out of “Big Data.”  Beyond knowing what the business needs to solve, analysts must also consider the audience the data mining effort is targeting; whether data mining is necessary; the availability of data and the conditions under which it may be used; and the experience and intuition of the business experts.  Effective problem definition begets data mining efforts that produce insights a company can act upon.

Data, Data Everywhere

September 29, 2010

Every time we use a cell phone, surf the Web, interact on Facebook, make a purchase, what have you, we create data that businesses, charities, and other organizations analyze to learn more about us.

While such a scenario sounds Orwellian, it is not necessarily terrible.  For example, if your local supermarket chain knows from your frequent shopper card that you buy Kashi Go-Lean cereal four or five packages at a time, you might appreciate them telling you when Kashi goes on sale.  It’s a win-win situation: you want to stock up on Kashi for the best price, and the store wants to bait you with the Kashi in the hopes you’ll buy more than the cereal.

But I digress.  The fact that we create data seamlessly and almost instantaneously with every one of life’s transactions has greatly increased the demand for tools, and for specialized professionals, to analyze that data and help companies turn it into actionable information.  In fact, IBM is banking its future growth on analytics, a market estimated to be worth $100 billion, as evidenced by its planned purchase of Netezza, announced last week.

Analytics is big business, and even if your job description doesn’t require you to analyze data, you should be aware of it.  Almost anything electronic can be tracked and/or monitored these days.  Anytime you get an email offer from an online retailer you’ve done business with, or direct mail from a charity or another retailer, you’ve been selected by analytical tools that examine your past purchasing and giving history.

If you run a business, you should be cognizant of all the data you accumulate and the ways in which you accumulate it.  What’s more, you should weigh the data you’re currently collecting against the decisions it helps you make, so that you can identify additional data you may need.  That data can be a goldmine, helping you better understand your customers’ needs and wants, identify new trends and changing patterns, and develop new products and services in response to those changing needs and wants.

Data and the need to analyze it are here to stay.

Correcting for Outliers

September 15, 2010

Yesterday, we discussed approaches for discerning outliers in your data set. Today, we discuss what to do about them. Most of the remedies for outliers are similar to those for missing data: doing nothing, deleting observations, ignoring the variable, and imputing values. We discuss each remedy below.

Doing nothing

As with missing data, you may choose to do nothing about the outliers, especially if you rank numeric values, which essentially negates the effect of outliers. This is true of many decision tree algorithms. Neural networks, however, may be seriously disrupted by a few outlying values.

Delete the observations with outlying values

This is another approach that, as with missing data, I do not recommend, because of the selection bias it introduces into the model. However, in cases of truly extreme outliers, eliminating one or two that are way off the charts may improve results.

Ignoring the variable

Sometimes we can exclude a variable with outliers. Perhaps we can replace it with information referring to it, or use proxy information. For example, if a food manufacturer was trying to measure coupon redemption by certain metropolitan areas, there might be sharp outliers within each metro area. Instead of the metro area, the food manufacturer may substitute information about the metro area – number of supermarkets, newspaper circulation (assuming its coupons appear in the Sunday paper), average shopping basket amount, etc. Much of this information is available through third party vendors or from sources like the U.S. Census Bureau.

Imputing the values

As with missing values, you would simply try to predict the “right” value to substitute for an outlying value. You might even cap the outliers at the bottom or top. For example, you might look at the 5th and 95th percentiles, set the lowest values to the 5th percentile, and set the highest values to the 95th percentile. You may even choose to eliminate observations falling outside the 5th through 95th percentiles. However, as I mentioned yesterday, such capping ignores the uniqueness of each data set. You need to treat each data set differently when identifying and correcting its outliers.
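A minimal capping sketch in pandas, using made-up sales figures; the 5th/95th cutoffs are the convention described above, not a universal rule:

```python
import pandas as pd

# Invented data with one extreme value
sales = pd.Series([3, 5, 6, 7, 8, 9, 11, 12, 14, 250])

# Cap (winsorize) at the 5th and 95th percentiles
low, high = sales.quantile([0.05, 0.95])
capped = sales.clip(lower=low, upper=high)
print(capped)  # the 250 is pulled down to the 95th-percentile value
```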

If an observation has an outlier, you might also look to see what values other similar observations tend to have for that variable, and substitute the mean or median for the extreme value. For instance, an ice cream parlor chain might see that sales of mint chocolate chip ice cream in one store might be much higher than that of other stores in the area. The sales director might look at stores of similar size (e.g., square footage, sales volume, full-time equivalent employees, etc.), or similar territory (e.g., all ice cream parlors in the greater Bismarck, ND area), and check the average or median sales of mint chocolate chip ice cream and substitute that for the outlying store.
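Sketched in pandas with invented numbers, that peer-median substitution might look like this; the cutoff separating the outlier from its peers is arbitrary:

```python
import pandas as pd

# Made-up mint chocolate chip sales for comparable stores in one territory
stores = pd.DataFrame({
    "store":      ["Bismarck 1", "Bismarck 2", "Bismarck 3", "Bismarck 4"],
    "mint_sales": [410, 395, 2600, 430],   # one store is far out of line
})

# Replace the outlying store's value with the median of its peers
peer_median = stores.loc[stores["mint_sales"] < 1000, "mint_sales"].median()
stores.loc[stores["mint_sales"] >= 1000, "mint_sales"] = peer_median
print(stores)
```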

It is important to remember, however, that outliers can be caused by external factors. Before blindly imputing values for mint chocolate chip ice cream sales in that particular store, the sales director should find out whether customers near that store have a preference for mint, or whether a few customers buy mint chocolate chip much more heavily than others. It might even be that the other parlors have severe stock-outs of the flavor, suggesting distribution problems. In that case, the outlying parlor could be the normal one, and all the other parlors could be selling too little mint chocolate chip ice cream!

Binning values

Sometimes, the best way to deal with outliers is to collapse the values into a few equal-sized categories. You might order your values from high to low and then break them into equal groups. This process is called binning. Low, Medium, and High are common bins; others might be Outstanding, Above Average, Average, Below Average, and Poor. With binning, outliers simply fall into the appropriate end ranges.
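A minimal binning sketch using pandas’ qcut, which breaks values into equal-sized groups; the data and the three-bin choice are illustrative:

```python
import pandas as pd

# Invented values with two extreme observations
values = pd.Series([1, 2, 2, 3, 4, 5, 7, 9, 40, 500])

# Break the ordered values into three equal-sized bins; the extremes
# simply land in the "High" bin rather than distorting the analysis
bins = pd.qcut(values, q=3, labels=["Low", "Medium", "High"])
print(pd.concat([values, bins.rename("bin")], axis=1))
```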

Transforming Data

Sometimes you can eliminate outliers by transforming the data. Binning is one form of transformation. Taking the natural log of a value can also reduce the variation caused by extreme values. Another way to tame outliers is to use ratios. For example, if the ice cream parlor chain wanted to compare store sales, some stores may have much higher sales than others simply because they are larger. The chain can reduce outliers and normalize the data by computing a “sales per square foot” value.

It is important to note that transforming data also transforms your analysis and models, and that once you’ve done your analysis on the transformed data, you must convert your results back to the original form in order for them to make sense.
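A small sketch of the log transform and the back-conversion described above, with invented sales figures; note that the back-converted mean is a geometric mean, not the arithmetic mean of the original values:

```python
import numpy as np
import pandas as pd

# Invented store sales with one extreme value
sales = pd.Series([120.0, 150.0, 160.0, 175.0, 9800.0])

log_sales = np.log(sales)      # compresses the extreme value
mean_log = log_sales.mean()    # analysis happens on the log scale

# Convert the result back to the original units so it makes sense
print(np.exp(mean_log))
```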

As you can see, correcting for outliers isn’t much different from correcting for missing data. However, you must be careful with either. Outliers can themselves alert you to valuable information, such as data collection problems. There’s no single “best” way to correct for outliers; quite often the best approach depends on the nature of the data, the business objective, and the impact the correction will have on the analysis supporting that objective. How you correct an outlier is just as critical as how you define it.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.