“Big Data” Success Starts With Good Data Governance

May 19, 2014

(This post appeared on our successor blog, The Analysights Data Mine, on Friday, May 9, 2014). 

As data continues to proliferate unabated in organizations, coming in faster and from more sources each day, decision makers find themselves perplexed.  Decision makers struggle with several questions: How much data do we have? How fast is it coming in? Where is it coming from? What form does it take? How reliable is it? Is it correct? How long will it be useful?  And this is before they even decide what they can and will do with the data!

Before a company can leverage big data successfully, it must define its objectives and balance those against the data it has, the regulations governing the use of that data, and the information needs of all its functional areas. And it must assess the risks both to the security of the data and to the company’s viability.  That is, the company must establish effective data governance.

What is Data Governance?

Data governance is a young and still-evolving set of practices designed to help organizations ensure that their data is managed properly and in the best interest of the organization and its stakeholders.  It encompasses an organization’s data infrastructure, the quality and management of its data, its policies for using its data, its business process needs, and its risk management needs.  An illustration of data governance is shown below:

[Illustration: data governance and its components]

Why Data Governance?

Data has many uses: it comes in many different forms; it takes up a lot of space; it can be siloed, subject to certain regulations, off-limits to some parties yet freely available to others; and it must be validated and safeguarded.  Just as important, data governance ensures that the business is actually using its data to solve its defined business problems.

The explosion of regulations such as Sarbanes-Oxley, Basel I, Basel II, Dodd-Frank, and HIPAA, along with a host of other rules regarding data privacy and security, is making the role of data governance all the more important.

Moreover, data comes in many different forms. Companies get sales data from the field or from a store location; they get information about their employees from job applications. Data of this nature is often structured.  Companies also get data from their web logs and from social media such as Facebook and Twitter, as well as data in the form of images, text, and so forth. These data are unstructured, but they must be managed regardless.  Through data governance, a company can decide what data to store and whether it has the infrastructure in place to store it.

The 6 Vs of Big Data

Many people who are aware of big data are familiar with its proverbial “3 Vs”: Volume, Variety, and Velocity.  But Kevin Normandeau, in a post for insideBIGDATA, suggests that three more Vs pose even greater issues: Veracity (the cleanliness and trustworthiness of the data), Validity (the correctness and accuracy of the data), and Volatility (how long the data remains valid and should be stored).  These additional Vs make data governance an even greater necessity.
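Those additional Vs lend themselves to simple mechanical checks. As a loose illustration (not from Normandeau’s post), here is how veracity, validity, and volatility might be screened on a customer file; the field names, sentinel value, and thresholds are all hypothetical:

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "A. Smith", "birthdate": date(1975, 3, 2),    "last_update": date(2014, 4, 1)},
    {"name": "B. Jones", "birthdate": None,                "last_update": date(2013, 1, 15)},
    {"name": "C. Lee",   "birthdate": date(1911, 11, 11),  "last_update": date(2014, 5, 1)},
]

today = date(2014, 5, 19)

# Veracity: share of records with a complete birthdate.
complete = sum(1 for r in records if r["birthdate"] is not None) / len(records)

# Validity: flag a known sentinel value masquerading as real data.
suspect = [r["name"] for r in records if r["birthdate"] == date(1911, 11, 11)]

# Volatility: flag records not refreshed within the last year.
stale = [r["name"] for r in records if (today - r["last_update"]).days > 365]

print(f"completeness: {complete:.0%}")   # 67%
print("suspect sentinel birthdates:", suspect)
print("stale records:", stale)
```

In a real governance program, thresholds like the one-year staleness cutoff would come from the policies the governance team documents, not from an analyst’s guess.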

What Does Effective Data Governance Look Like?

Effective data governance begins with designation of an owner for the governance effort – an individual or team who will be held accountable.

The person or team owning the data governance function must be able to communicate with all department heads to understand what data they have access to, what they use it for, where they store it, and what they need it for.  They must also be adept at working with third-party vendors and with external customers of their data.

The data governance team must understand both internal policies and external regulations governing the use of data and what specific data is subject to specific regulations and/or policies.

The data governance team must also assess the value of the data the company collects; estimate the risks involved if a company makes decisions based on invalid or incomplete data, or if the data infrastructure fails, or is hacked; and design systems to minimize these risks.

Once data has been inventoried and matched to its relevant constraints, and processes for data collection and storage have been developed, the team must draft, document, implement, and enforce its governance processes.  It must then train the organization’s employees in the proper collection and use of the data, so that they know what they can and cannot do.

Without effective data governance, companies will find themselves vulnerable to hackers, fines, and other business interruptions. They will be less efficient, as inaccurate data leads to rework and inadequate data leads to slower, less effective decision making. And they will be less profitable, as lost or incomplete data will often cause them to miss opportunities or take incorrect actions.  Good data governance ensures that companies get the most out of their data.

 

****************************************************

Follow Analysights on Facebook and Twitter!

Now you can find out about new posts to both Insight Central and our successor blog, The Analysights Data Mine, by “Liking” us on Facebook (just look for Analysights) or by following @Analysights on Twitter.  Each time a new post appears on Insight Central or The Analysights Data Mine, you will be notified in your Facebook news feed or your Twitter feed.  Thanks!

 

“Big Data” Benefits Small Businesses Too

May 8, 2014

(This post appeared on our successor blog, The Analysights Data Mine, on Monday, May 5, 2014).

One misconception about “big data” is that it is only for large enterprises. On its face, such a claim sounds logical; in reality, however, “big data” is just as vital to a small business as it is to a major corporation. While the amount of data a small business generates is nowhere near as large as what a large corporation might generate, a small business can still analyze that data to find insightful ways to run more efficiently.

Imagine a family restaurant in your local town. Such a restaurant may not have a loyalty card program like a chain restaurant’s; it may not have any process for targeting customers; in fact, it may not even be computerized. But the restaurant still generates a LOT of useful data.

What is the richest source of the restaurant’s data? It’s the check on which the server records the table’s orders. If a restaurant saves these checks, the owner can tally the entrees, appetizers, and side orders that were made during a given period of time. This can help the restaurateur learn a lot of useful information, such as:

  • What entrées are most commonly sold?
  • What side dishes are most commonly ordered with a particular entrée?
  • What is the most popular entrée sold on a Friday or Saturday night?
  • How many refills does a typical table order?
  • What is the average number of patrons per table?
  • What are the busiest and slowest nights/times of the week?
  • How many tables and/or patrons come in on a particular night of the week?

Information like this can help the restaurateur estimate how many of each entrée to prepare on a given day; order sufficient ingredients for those entrées and menu items; forecast business volume for various nights of the week; and staff adequately.
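A sketch of how such tallies might be computed once the checks are transcribed into simple records; the records and field layout here are invented for illustration:

```python
from collections import Counter

# Hypothetical transcribed checks: (night, entree, patrons at the table)
checks = [
    ("Fri", "lasagna",  4),
    ("Fri", "ribeye",   2),
    ("Sat", "lasagna",  3),
    ("Sat", "lasagna",  5),
    ("Sat", "ribeye",   2),
    ("Tue", "meatloaf", 2),
]

# Most commonly sold entree overall
entree_counts = Counter(entree for _, entree, _ in checks)
top_entree = entree_counts.most_common(1)[0][0]

# Busiest night, by number of checks (tables served)
night_counts = Counter(night for night, _, _ in checks)
busiest = night_counts.most_common(1)[0][0]

# Average number of patrons per table
avg_patrons = sum(p for _, _, p in checks) / len(checks)

print(top_entree, busiest, avg_patrons)  # lasagna Sat 3.0
```

Nothing here requires special software; the point is that even a shoebox of paper checks, once tallied, answers the questions above.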

In addition, such information can aid menu planning and upgrades. For example, the restaurant owner can use the above information to look for commonalities among the most popular items. Perhaps the most popular entrées each feature some prominent ingredient; in that case, the restaurant can direct its chef to test new entrées and menu items built around that ingredient. Moreover, if particular entrées are not selling well, the restaurant owner can try to feature or promote them in some way, or discontinue them altogether.

Also, in the age of social media, sites like Yelp and TripAdvisor can provide the restaurateur with free market research. If customers are complaining about long waits for service, the restaurateur may use that feedback to increase staffing or provide extra training to the waitstaff. If reviewers are raving about specific menu items, the restaurateur can promote those items or create new entrées that are similar.

“Big Data” is a subjective and relative term. The data collected by a small family restaurant is usually not large enough to warrant analytics tools such as SAS or SPSS, but it is still rich enough to provide valuable insights that help a small business operate successfully.

 


Big Data, Big Bucks

May 6, 2014

(This post appeared last week on our successor blog, the Analysights Data Mine)

In their 1996 bestselling book, The Millionaire Next Door, Thomas J. Stanley and William D. Danko constructed profiles of the typical American millionaire.  One common characteristic the authors observed was that these millionaires “chose the right occupation.”  When Stanley and Danko wrote Millionaire, I doubt many of their research subjects were data analysts, predictive modelers, data scientists, or other “Big Data” professionals; but if they were to write a new edition today, I’ll bet there would be a lot more on the list.  “Big Data” jobs seem to be “the right occupation” today.

In a recent interview with the Wall Street Journal, veteran analytics recruiter Linda Burtch of Burtch Works predicted that job candidates with little familiarity with “Big Data” will face a “permanent pink slip,” and observed that analytics professionals earn a median base salary of $90,000 per year. When classifying income levels, Ms. Burtch distinguishes between “analytics” professionals, who typically deal with structured data sets, and “data scientists,” who typically work with large, unstructured data sets.  Data scientists, Burtch Works found, earn a median base salary of $120,000.

Even more impressive are the median base salaries of entry-level professionals, those with three years’ experience or less: $65,000 for analytics professionals and $80,000 for data scientists.  At nine or more years’ experience, the median base salaries rise to $115,000 and $150,000, respectively.

Much of the reason for the hefty salaries is that companies often don’t understand what skill sets they need.  Ms. Burtch mentions this in her comments to the Wall Street Journal, and I indicated as much in a previous blog post.  Add to that the fact that the needed skill sets are highly specialized, and relatively few professionals possess them.  Because of that scarcity, candidates can command such high salaries.

For companies, this suggests that in order to get the most value out of a “Big Data” hire, a company must first decide on the typical projects it will expect the candidate to perform, and then define the required skill set and years of experience accordingly.  Only then should it budget the salary it is willing to pay.  This will ensure that the company isn’t hiring someone with 10 years’ experience in data analytics and paying that person $120,000 per year just to pull data for mailing lists, when it could have hired someone out of college for about one-third of that.

For candidates, the breadth of skill sets employers seek in “Big Data” professionals suggests they can maximize their salaries by continuing to broaden their skills and experience within the data realm.  For example, someone with years of SAS programming and SQL experience may branch out to other programming tools, such as R and Python, or expand his or her skill set by developing proficiency in data visualization tools such as Tableau or QlikView.

Working in “Big Data” may not make someone “the millionaire next door,” but it may bring him or her pretty close.

 


Company Practices Can Cause “Dirty” Data

April 28, 2014

As technical people, we often use a not-so-technical phrase to describe the use of bad data in our analyses: “Garbage in, garbage out.” Anytime we build a model or perform an analysis based on data that is dirty or incorrect, we will get undesirable results. Data has many opportunities to get murky, and a major cause is the way the business collects and stores it. And dirty data isn’t always incorrect data: the way a company enters data can be correct for operational purposes yet useless for a particular analysis being done for, say, the marketing department. Here are some examples:

The Return That Wasn’t a Return

I was recently at an outlet store buying some shirts for my sons. After walking out, I realized the sales clerk rang up the full, not sale, price. I went back to the store to have the difference refunded. The clerk re-scanned the receipt, cancelled the previous sale and re-rang the shirts at the sale price. Although I ended up with the same outcome – a refund – I thought of the problems this process could cause.

What if the retailer wanted to predict the likelihood of merchandise returns? My transaction, which was actually a price adjustment, would be treated as a return. Depending on how often this happens, a particular store could be flagged as having above-average returns relative to comparable stores and be required to implement more stringent return policies that were never necessary to begin with.

And consider the flip side of this process: by treating the erroneous ring-up as a return, the retailer won’t be alerted to the possibility that clerks at this store are making mistakes at the register, that sale prices aren’t being entered into the store’s system, or that the equipment storing price updates isn’t functioning properly.

And processing the price adjustment the way the clerk did actually creates even more data that needs to be stored: the initial transaction, the return transaction, and the corrected transaction.

The Company With Very Old Customers

Some years ago, I worked for a company that did direct mailings. I needed to analyze its customers and identify the variables that predicted those most likely to respond to a solicitation. The company collected the birthdates of its customers, and from that field I calculated each customer’s age. I found that nearly ten percent of its customers were quite old, much older than the market segments the company targeted. A deeper dive into the birthdate field revealed that virtually all of them had the same birthdate: November 11, 1911. (This was back around the turn of the millennium, when companies still recorded dates with two-digit years.)

How did this happen? Well, as discussed in the prior post on problem definition, I consulted the company’s “data experts.” I learned that the birthdate field was required for first-time customers: the call center representative could not advance to the next field unless a value was entered. Hence, many representatives simply entered “11-11-11” to bypass the field when a first-time customer refused to give his or her birthdate.

In this case, the company’s requirement to collect birthdate information met sharp resistance from customers, causing the call center to enter dummy data to get around the operational constraints. Incidentally, the company later made the birthdate field optional.
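A simple frequency check would have surfaced this dummy value quickly. Here is a minimal sketch, assuming the birthdates are stored as two-digit-year strings as described; the sample data and the 2% threshold are hypothetical:

```python
from collections import Counter

# Hypothetical birthdate field, stored as two-digit-year strings
birthdates = ["03-22-64", "11-11-11", "07-09-71", "11-11-11", "11-11-11",
              "12-01-58", "11-11-11", "05-30-66", "11-11-11", "02-14-70"]

counts = Counter(birthdates)
n = len(birthdates)

# In a real customer file, no single birthdate should dominate;
# flag any repeated value held by more than, say, 2% of records.
THRESHOLD = 0.02
suspicious = {v: c for v, c in counts.items() if c > 1 and c / n > THRESHOLD}

print(suspicious)  # {'11-11-11': 5}
```

The same check catches other common sentinels, such as January 1, 1900, without anyone having to know in advance which dummy value the call center chose.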

Customers Who Hadn’t Purchased in Almost a Century

Back in the late 1990s, I went to work for a catalog retailer, building response models. The cataloger was concerned that its models were generating undesirable results. I tried running the models with its data and confirmed the models to be untrustworthy. So I started running frequency distributions on all its data fields. To my surprise, I found a field, “Months since last purchase,” in which many customers had the value “999.” Wow – many customers hadn’t purchased since 1916 – almost 83 years earlier!

I knew immediately what had happened. In the past, when data was often read into systems from magnetic tape, the programs required every field to be populated. If a value for a particular field was missing, the value for the next field would be read into its place, and so on; when the program reached the end of the record, it would continue into the next record, reading values from there until all fields for the previous record were filled. This was a data nightmare.

To get around this, fields whose data was missing or unknown were filled with a series of 9s, so that all the other data would be entered into the system correctly. This process was fine and dandy, as long as the company’s analysts accounted for the practice in their analysis. The cataloger, however, would run its regressions using those ‘999s,’ resulting in serious outliers and regressions of little value.

In this case, the cataloger’s attempt to rectify one data malady created a new one. I corrected this by recoding the values: I grouped customers whose last purchase date was known into intervals and assigned ranking values (a 1 for the most recent customers, a 2 for the next most recent, and so on), and gave the lowest rank to those whose last purchase was unknown.
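The recoding described above might be sketched as follows; the actual interval breakpoints aren’t given in the original, so the ones below are assumptions:

```python
def recency_rank(months_since_purchase, missing_code=999):
    """Convert months-since-last-purchase into an ordinal rank.

    Lower rank = more recent customer; the missing-data sentinel
    gets the lowest (worst) rank instead of distorting a regression.
    The interval breakpoints here are illustrative, not the cataloger's.
    """
    if months_since_purchase == missing_code:
        return 5  # unknown last purchase: worst rank
    if months_since_purchase <= 6:
        return 1
    if months_since_purchase <= 12:
        return 2
    if months_since_purchase <= 24:
        return 3
    return 4

print([recency_rank(m) for m in [3, 9, 18, 60, 999]])  # [1, 2, 3, 4, 5]
```

The key design choice is that the sentinel is mapped to an ordinal category of its own rather than being fed to the regression as if it were a real number of months.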

The Moral of the Story

Company policy is a major cause of dirty data. These examples – which are far from comprehensive – illustrate how the way data is entered can cause problems. Often, a data fix proves shortsighted, as it causes new problems down the road. This is why it is so important for analysts to consult the company’s data experts before undertaking any major data mining effort. Knowing how a company collects and stores data and making allowances for it will increase the likelihood of a successful data mining effort.

How “Big Data” Can Improve Educational Outcomes

April 23, 2014

Our news media frequently inundates us with study upon study of how the American education system trails most other advanced countries in math and science, graduation rates, or some other metric of education performance.  I disagree strongly with most of these studies for reasons I won’t go into, except to say that many of their researchers cherry-pick data and then use the most alarming findings for media sound bites.  But, let us take these studies at face value for a moment and assume their findings are correct.  What then do we do about our “failing” education system?

Big Data to the Rescue

Education is a treasure trove of data; only recently have schools been making use of this data to improve outcomes in education, and much of their work to date is only scratching the surface.

Schools collect data on several attributes: a student’s progress in each subject over time; the teacher for each subject; the instruction styles for each teacher; the student’s likes and dislikes; whether students drop out or graduate; demographic, neighborhood, and socioeconomic characteristics of each student; teacher tenure and training; and so on.  Consider the ways schools might use such data to improve educational outcomes:

  • Identify factors that drive subject failure or school dropout, and predict which students are at highest risk of either event, and intervene;
  • Enhance professional development of teachers by identifying the areas of their teaching styles and methods that are most and least effective;
  • Identify the types of environments under which individual students perform best and tailor their curriculum accordingly;
  • Identify ineffective curricula and instruction and direct school resources to ones that are more effective;
  • Determine whether underperforming students are clustered within a particular classroom and drill down to determine whether the teacher needs additional training or resources, or if he/she has a larger number of students with special needs; and
  • Predict whether a student is more likely to succeed in a college-preparatory or vocational environment and tailor his or her curriculum accordingly.
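As a loose illustration of the first bullet, a school might start by tabulating dropout rates against a candidate factor before building any formal model. The student records and the absence threshold below are entirely hypothetical:

```python
from collections import defaultdict

# Hypothetical student records: (absences per term, dropped_out)
students = [
    (2, False), (25, True), (4, False), (30, True),
    (1, False), (22, False), (3, False), (28, True),
]

# Group students into simple absence bands and compare dropout rates.
bands = defaultdict(lambda: [0, 0])  # band -> [dropouts, total]
for absences, dropped in students:
    band = "high (>=20)" if absences >= 20 else "low (<20)"
    bands[band][0] += int(dropped)
    bands[band][1] += 1

for band, (drops, total) in sorted(bands.items()):
    print(f"{band}: {drops}/{total} dropped out ({drops / total:.0%})")
```

A large gap between bands would make absenteeism a candidate predictor worth carrying into a proper model, and a flag for early intervention.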

This list is far from comprehensive. Keep in mind, however, that just as in business settings, educational institutions must use “Big Data” judiciously when trying to enhance educational outcomes. The constraints under which schools operate, especially those governing the use of student and teacher data, must still be taken into account before a school undertakes a data mining effort, and again before it takes action based on the findings.  Getting buy-in from parents and other community stakeholders is essential to ensuring that a school’s data mining efforts succeed.

As I said earlier, I don’t believe a lot of the studies about the performance of U.S. schools.  If their findings are indeed true, then “Big Data” can be quite useful in identifying and rectifying problem areas; if the findings are not true, then the data mining effort can make the performance of our schools even better.  But as with any organization wishing to use data mining, school administrators must decide what problem or problems they want data mining to solve and follow the steps as described in my last blog post.  The rules, caveats, and benefits of “Big Data” apply just as much to public sector industries like education as they do to for-profit industries.

