Posts Tagged ‘Sampling’

Considerations for Selecting a Representative Sample

July 27, 2010

When trying to understand and make inferences about a population, it is rarely practical or cost effective to survey everyone who comprises that population. Therefore, analysts survey a reasonably sized sample of the population, whose results they can generalize to the entire population. Since such sampling is subject to error, it is vitally important that an analyst select a sample that adequately represents the population at large. Ensuring that a sample represents the population as accurately as possible requires that the sample be drawn using well-established, specific principles. In today’s post, we will discuss the considerations for selecting a representative sample.

What is the Unit of Analysis?

What is the population you are interested in measuring? Let’s assume you are a market research analyst for a life insurance company and you are trying to understand the degree of existing life insurance coverage of households in the greater Chicago area. Already, this is a challenging prospect. What constitutes “life insurance coverage”? “A household”? “The greater Chicago area”? As the analyst, you must define these before you can move forward. Does “coverage” mean having any life insurance policy, regardless of amount? Or does it mean having life insurance that covers the oft-recommended eight to ten times the principal breadwinner’s salary? Does it mean having individual life insurance, group life insurance, or either one?

Does “household” mean a unit with at least one adult and the presence of children? Can a household consist of one person for your analysis?

Does the “greater Chicago area” mean every household within the Chicago metropolitan statistical area (MSA), as defined by the U.S. Census Bureau, or does it mean the city of Chicago and its suburban collar counties (e.g., Cook, DuPage, Lake, Will, McHenry, Kane, Kendall)?

All of these are considerations you must decide on.

You talk through these issues with some of the relevant stakeholders – your company’s actuarial, marketing, and product development departments – and you learn some new information. You find out that your company wants to sell a highly specialized life insurance product to young (under 40), high-salaried (at least $200,000) male heads-of-household, a product that provides coverage of up to ten times income. You find that “male head-of-household” is construed to mean any man who has children under 18 present in his household and has either no spouse or a spouse earning less than $20,000 per year.

You also learn that this life insurance product is being pilot tested in the Chicago area, and that the insurance company’s captive agent force has offices only within the City and its seven collar counties, although agents may write policies for any qualifying person in Illinois. You can do one of two things here. Since all your company’s agents are in the City and collar counties, you might simply restrict your definition of “greater Chicago area” to this region. Or, you might select this area, and add to it nearby counties without agencies, where agents write a large number of policies. Whether you do the former or latter depends on the timeframe available to you. If you can easily and quickly obtain the information for determining the additional counties, you might select the latter definition. If not, you’ll likely go with the former. Let’s assume you choose only those in the City and its collar counties.

Another thing you find out through communicating with stakeholders is that the intent of this insurance product is to close gaps in, not replace, existing life insurance coverage. Hence, you now know your relevant population:

Men under the age of 40, living in the city of Chicago or its seven collar counties, with a salary income of at least $200,000 per year, heading a household with at least one child under 18 present, with either no spouse or a spouse earning less than $20,000 per year, and who have life insurance coverage that is less than ten times their annual salary income.

You can see that this is a very specific unit of analysis. For this type of insurance product, you do not want to survey the general population, as the product will be irrelevant to most people. Hence, the definition above is your working population. It is from this group that you want to draw your sample.

How Do You Reach This Working Population?

Now that you have identified your working population, you must find a master list of people from which to draw your sample. Such a list is known as the sample frame. As you’ve probably guessed, no single list will contain your working population precisely. Hence, you will spend some time searching for a list, or a combination of lists, that covers as much of your working population as possible. The degree to which your sample frame fails to account for all of your working population is known as its bias, or sample frame error, and such error can never be totally eradicated.

Sample frame error exists because some of these upscale households move out while others move in; some people die; some have unlisted phone numbers or don’t give out their email addresses; some lose their jobs, while others move into these high-paying jobs; and some turn 40, or their wives take higher-paying jobs. These changes are constant, and there’s nothing you can do about them except be aware of them.

To obtain your sample frame, you might start by asking yourself several questions about your working population: What ZIP codes are they likely to live in? What hobbies do they engage in? What magazines and newspapers do they subscribe to? Where do they take vacations? What clubs and civic organizations do they join? Do they use financial planners or CPAs?

Armed with this information, you might purchase mailing lists of such men from magazine publishers; you might search phone listings in upscale Chicago-area communities like Winnetka, Kenilworth, and Lake Forest. You might network with travel agents, real estate brokers, financial advisors, and charitable organizations. You may also purchase membership lists from clubs. You will then combine these lists to come up with your sample frame. The degree to which you can do this depends on your time and budget constraints, as well as any regulatory and ethical practices (e.g., privacy, Do Not Call lists, etc.) governing the collection of such lists.
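To make the list-building step concrete, here is a minimal sketch in Python (using pandas) of how several purchased or collected lists might be stacked and de-duplicated into a single sample frame. The file names, columns, and matching rule are all hypothetical; a real project would need better record matching and compliance scrubbing (e.g., against Do Not Call lists).

```python
import pandas as pd

# Hypothetical source lists: a purchased magazine subscriber list, phone
# listings from upscale communities, and a club membership roster.
magazine = pd.read_csv("magazine_subscribers.csv")      # columns: name, zip, phone
phone_book = pd.read_csv("upscale_phone_listings.csv")
club = pd.read_csv("club_members.csv")

# Stack the lists into one candidate frame, tagging each record's source.
frame = pd.concat(
    [
        magazine.assign(source="magazine"),
        phone_book.assign(source="phone_listing"),
        club.assign(source="club"),
    ],
    ignore_index=True,
)

# The same man may appear on several lists, so de-duplicate on a simple
# (and admittedly crude) name-plus-ZIP key.
key = frame["name"].str.lower().str.strip() + "|" + frame["zip"].astype(str)
frame = frame.loc[~key.duplicated()]

print(f"Sample frame contains {len(frame)} unique records")
```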

Many market research firms have made identifying the sample frame much easier in recent years, thanks to survey panels. Panels are groups of respondents who have agreed in advance to participate in surveys, and their existence has greatly reduced the time and cost involved in compiling one’s own sample frame. The drawback, however, is that panel respondents self-select to join, and they can be very different from members of the working population who are not on a panel.

Weeding Out the Irrelevant Population

Your sample frame will never include everyone who fits your working population, nor will it exclude everyone who does not. As a result, you will need to eliminate extraneous members of your sample frame. Unfortunately, there’s no fully proactive way to do this. Typically, you must ask screening questions at the beginning of your survey to determine whether a respondent qualifies, and then terminate the survey for any respondent who fails to meet the criteria.
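As an illustration, here is a minimal sketch of the screening logic for the working population defined in the life insurance example above. The field names, and the way “no spouse” is encoded, are hypothetical; in practice each field would come from a screener question.

```python
# Counties used in the hypothetical "greater Chicago area" definition.
CHICAGO_AREA_COUNTIES = {"Cook", "DuPage", "Lake", "Will", "McHenry", "Kane", "Kendall"}

def qualifies(resp: dict) -> bool:
    """Return True if a respondent fits the working population definition."""
    no_or_low_earning_spouse = (
        resp["spouse_income"] is None or resp["spouse_income"] < 20_000
    )
    return (
        resp["gender"] == "M"
        and resp["age"] < 40
        and resp["county"] in CHICAGO_AREA_COUNTIES
        and resp["salary"] >= 200_000
        and resp["children_under_18"] >= 1
        and no_or_low_earning_spouse
        and resp["life_coverage"] < 10 * resp["salary"]
    )

respondent = {
    "gender": "M", "age": 36, "county": "DuPage", "salary": 250_000,
    "children_under_18": 2, "spouse_income": None, "life_coverage": 500_000,
}

if qualifies(respondent):
    print("Proceed to the main questionnaire.")
else:
    print("Thank the respondent and terminate the survey.")
```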

Summary

Selecting a representative sample is an intricate process that requires serious thought and communication among stakeholders about the objectives of the survey; the definition of the relevant working population; the approach to finding and reaching members of that population; and the time, budget, and regulatory constraints involved. No sample will ever be completely representative of the population, but samples can and should be reasonably representative.

Forecast Friday Topic: Multicollinearity – Correcting and Accepting It

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes additional data just isn’t available; this is especially the case with time series data, which is inherently finite. If the additional data must be gathered through new collection efforts, doing so can be costly and time consuming. Also, the additional observations could be quite similar to your original data set, in which case enlarging the data set yields no benefit. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two of the variables together and use their product in place of them.

So, in our donor history example, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R2 of 0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contributions” variable. Our F statistic is a little over 72. Essentially, the R2 and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.
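If you want to try this yourself, a rough sketch in Python using pandas and statsmodels might look like the following. The file name, column names, and dependent variable are hypothetical stand-ins for the donor history data used earlier in this series.

```python
import pandas as pd
import statsmodels.api as sm

donors = pd.read_csv("donor_history.csv")    # hypothetical donor history file

# Composite variable: the product of the two collinear predictors.
donors["total_contrib_12mo"] = (
    donors["avg_contribution_12mo"] * donors["times_donated_12mo"]
)

X = sm.add_constant(donors[["total_contrib_12mo", "months_since_last_donation"]])
y = donors["next_gift_amount"]               # hypothetical dependent variable

model = sm.OLS(y, X).fit()
print(model.rsquared, model.fvalue)          # compare with the original model
print(model.pvalues)                         # check each coefficient's significance

# Check whether collinearity remains between the predictors.
print(donors[["total_contrib_12mo", "months_since_last_donation"]].corr())
```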

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. To build it, compute the mean of the hours-of-sleep variable and the mean of the hours-of-studying variable; for each observation, subtract each variable’s mean from that observation’s value; then multiply the two differences together. The result is your interaction term, Sleep_Study_Interaction. Now run the regression with the original two variables and the interaction term. Because you subtracted the means from the variables in question, you have in effect centered the interaction term, which means you have taken the central tendency of your data into account.
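Here is a brief sketch of the same steps in Python, using hypothetical column names for the study-and-sleep example.

```python
import pandas as pd
import statsmodels.api as sm

scores = pd.read_csv("test_scores.csv")   # hypothetical: hours_study, hours_sleep, score

# Center each predictor by subtracting its mean, then multiply the centered
# values to form the interaction term.
study_c = scores["hours_study"] - scores["hours_study"].mean()
sleep_c = scores["hours_sleep"] - scores["hours_sleep"].mean()
scores["sleep_study_interaction"] = study_c * sleep_c

X = sm.add_constant(scores[["hours_study", "hours_sleep", "sleep_study_interaction"]])
model = sm.OLS(scores["score"], X).fit()
print(model.summary())
```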

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression on differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediately prior observation’s value from the current observation’s value. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing reduces multicollinearity by removing the trend component of the time series: if all the independent variables followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends build on themselves for several periods, so multiple differencing may be required. Taking the change from one period to the next is a “first difference”; differencing those first differences again gives a “second difference,” and so on. Note also that with differencing we lose the first observation (or observations, depending on how many times we difference), so if you have a small data set, differencing reduces your degrees of freedom and increases your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
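A minimal differencing sketch with pandas follows; the file and series names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

ts = pd.read_csv("monthly_data.csv")   # hypothetical: sales, ad_spend, price_index

# First differences: each observation minus the one immediately before it.
# A second difference, if trends persist, would be .diff().diff().
diffed = ts[["sales", "ad_spend", "price_index"]].diff().dropna()

X = sm.add_constant(diffed[["ad_spend", "price_index"]])
model = sm.OLS(diffed["sales"], X).fit()
print(model.summary())
```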

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the raw number of violent crimes committed in the city, you might divide it by the city’s population to come up with a per-capita figure. That will give more useful insight into the incidence of crime in the city.
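Both ideas are simple to apply in code; here is a sketch using hypothetical city-level data.

```python
import numpy as np
import pandas as pd

cities = pd.read_csv("city_stats.csv")   # hypothetical: population, violent_crimes, ...

# Per-capita transformation: crimes per 1,000 residents instead of raw counts.
cities["crimes_per_1000"] = 1_000 * cities["violent_crimes"] / cities["population"]

# Log transformation for a predictor whose scatter plot suggests a
# logarithmic relationship with the dependent variable.
cities["log_population"] = np.log(cities["population"])
```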

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a few different models, some using variations of the remedies we’ve discussed here, and compare their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more accurate forecasting model with a higher degree of multicollinearity.

The Takeaways:

  1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
  2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
  3. You should suspect multicollinearity if:
    1. You have a high R2 but low t-statistics;
    2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
  4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
  5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy contributes its own set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and against the resulting impact on your forecast accuracy.

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.


Free Online Survey Tools Can Yield Costly, Useless Results if Not Used Carefully

June 15, 2010

Thanks to online survey tools like Zoomerang, SurveyMonkey, and SurveyPirate, the ability to conduct surveys has been greatly democratized. Small businesses, non-profits, and departments within larger firms can now conduct surveys they would never have been able to do before because of cost and lack of resources. Unfortunately, the greatest drawback of these free survey tools is the same as their greatest benefit: anyone can launch a survey. Launching an effective survey requires a clear definition of the business problem at hand; a carefully thought-out discussion of the information needed to address that problem, the audience of the survey, and how to reach it; determination of the sample size and how to select respondents; design, testing, and implementation of the questionnaire; and analysis of the results. Free online survey tools do not change this process.

Recently, a business owner from one of my networking groups sent me an online survey that he designed with one of these free tools. It was a questionnaire about children’s toys – which was the business he was in. He wasn’t sending me the survey to look at and give advice; he sent it to me as if I were a prospective customer. Unfortunately, I’m not married and don’t have kids; and all my nieces and nephews are past the age of toys. The survey was irrelevant to me. The toy purveyor needed to think about who his likely buyers were – and he should have good knowledge, based on his past sales, of who his typical buyers are. Then he could have purchased a list of people to whom he could send the survey. Even if that meant using a mail or phone survey, which could be costly, the owner could get more meaningful results. Imagine how many other irrelevant or uninterested recipients received the business owner’s survey. Most probably didn’t respond; but others might have responded untruthfully, giving the owner bogus results.

The “toy-preneur’s” survey questions were also poorly designed. One was a double-barreled question: “Does your child like educational or action toys?” What if a respondent’s child liked both educational and action toys? The owner should have asked two separate questions: “Does your child like educational toys?” and “Does your child like action toys?” Or he could have asked a multi-part question like, “Check the box next to each type of toy your child likes to play with,” followed by a list of the different types of toys.

The survey gets worse. It included questions like “How much does your child’s happiness mean to you?” How many people are going to answer that question negatively? Hello? Another question asked the respondent to rank-order various features of a toy for which no prototype was pictured – and, if that wasn’t bad enough, there were at least nine items to rank. Most people can’t rank more than five items, especially for an object they cannot visualize.

We also don’t know how the toy manufacturer selected his sample. My guess is that he sent the survey to everyone whose business card he had collected. Hence, most of the people he was surveying were the wrong people. In addition to producing unusable results, another danger of these online survey tools is that people get bombarded with surveys so frequently that they stop participating in surveys altogether. Imagine if you were to receive five or more of these surveys in less than two weeks. How much time are you willing to give to answering them? Then, when a truly legitimate survey comes along, how likely are you to participate?

I think it’s great that most companies now have the ability to conduct surveys on the cheap. However, the savings can be greatly offset by the uselessness of the results if the survey is designed poorly or sent to the wrong sample. There is nothing wrong with reading up on how to do a survey and then executing it yourself, as long as the problem is well defined, the relevant population is identified, and the sampling, execution, and analysis plans are in place. “Free” surveying isn’t a bargain if it costs you money and time in rework, or in faulty actions taken based on your findings.

Do you have trouble deciding whether you need to do a survey? Do you spend a lot of time trying to find out what you’re trying to learn from a survey? Or how many people to survey? Or the questions you need to ask? Or which people to survey? Let Analysights help. We have nearly 20 years of survey research experience and a strong background in data analysis. We can help you determine whether a survey is the best approach for your research needs, the best questions to ask to get the information you need, and help you understand what the findings mean. Feel free to call us at (847) 895-2565.

Radio Commercial Statistic: Another Example of Lies, Damn Lies, and then Statistics

May 10, 2010

Each morning, I awake to my favorite radio station, and the last few days I’ve awakened to a commercial about a partnership between Feeding America and the reality show The Biggest Loser to support food banks. While I think that’s a laudable joint venture, I have been somewhat puzzled by, if not leery of, a claim made in the commercial: that “49 million Americans struggled to put food on the table.” Forty-nine million? That’s one out of every six Americans!

Lots of questions popped into my head: Where did this number come from? How was it determined? How did the study define “struggling”? Why were the respondents struggling? How did the researchers define the implied “enough food”? For what length of time did these 49 million people “struggle” for enough food? And, most importantly, what was the motive behind the study?

The Biggest Loser/Feeding America commercial is a good reminder of why we should never take numbers or statistics at face value.  Several things are fishy here.  Does “enough food” mean the standard daily calorie intake (which, incidentally, is another statistic)?  Or, given that two-thirds of Americans are either overweight or obese (another statistic I have trouble believing), is “enough food” defined as the average number of calories a person actually eats each day?

I also want to know how the people who conducted the study came up with 49 million people. Surely they could not have surveyed that many people. Most likely, they surveyed a sample of people and then made statistical estimations – extrapolations – based on the size of the population. For those extrapolations to be valid, the sample needed to be selected randomly: that is, every American had to have an equal chance of being selected for the survey. That’s the best way to ensure the results are representative of the entire population.

Next, who completed the survey, and how many? The issue of hunger is political in nature, and hence likely to be very polarizing. Generally, people who respond to surveys on such political issues have a vested interest in the subject matter, which introduces sample bias. Also, having an adequate sample size (neither too small nor too large) is important. There’s no way to know whether the study that came up with the “49 million” statistic accounted for these issues.
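To see how a sample figure turns into a headline number like “49 million,” here is a rough sketch of the arithmetic, using entirely made-up inputs and assuming a simple random sample, which real studies of this kind rarely achieve.

```python
import math

# Entirely hypothetical inputs, for illustration only.
n = 2_000                  # completed interviews
p = 0.16                   # share of respondents classified as "struggling"
population = 307_000_000   # approximate U.S. population at the time

estimate = p * population                 # the extrapolated headline number
z = 1.96                                  # 95% confidence level
moe = z * math.sqrt(p * (1 - p) / n)      # margin of error for the proportion

print(f"Point estimate: about {estimate / 1e6:.0f} million people")
print(f"Margin of error: +/- {moe:.1%}, or about +/- {moe * population / 1e6:.0f} million people")
```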

We also don’t know how long these 49 million had to struggle in order to be counted. Was it at any one time during a certain year, or did the struggle have to last at least two consecutive weeks before a household could be counted? We’re not told.

As you can see, the commercial’s claim of 49 million “struggling to put food on the table” just doesn’t sit right with me. Whenever you must rely on statistics, remember to:

  1. Consider the source of the statistic and its purpose in conducting the research;
  2. Ask how the sample was selected and the study executed, and how many responded;
  3. Understand the researcher’s definition of the variables being measured;
  4. Not look at just the survey’s margin of error, but also at the confidence level and the diversity within the population being sampled. 

The Feeding America/Biggest Loser team-up is great, but that radio claim is a sobering example of how statistics can mislead as well as inform.

Using Marketing Research to Lead You Out of the Recession

June 8, 2009

Some economic indicators are starting to turn positive and suggest that the worst of the recession may be over. Even so, companies continue to cut their marketing budgets and this is perhaps the very worst time to do so. Cutting marketing expenditures at this time in the economic cycle is akin to stopping contributions to one’s investment portfolio during a bear market – in each case, one stands to miss out on the rebound.

Right now, marketing research is more critical than ever. Yes, business is still slow. But marketing research can be used ever more strategically right now. Your margins might still be tight, and you may still have some cuts to make. Many travel-related industries, especially hotels, are making use of marketing research to identify which amenities they can either eliminate or charge extra for, without negatively impacting customer satisfaction and/or loyalty. You should consider doing the same.

Marketing research can also be helpful in gauging the optimism of your customers and prospects, so that you can plan ahead for the future. Conducting marketing research right now can also inform you of what your target customers are substituting for your product or service to cope with these hard times. This information can help you accommodate them and/or find other ways to fulfill their needs.

You can also do marketing research relatively inexpensively with survey tools such as SurveyMonkey, Zoomerang, Survey Gizmo, etc. As long as you understand survey theory and sampling, you should be able to use these tools without compromising research integrity. You may even be able to reduce the size of your typical samples without sacrificing much accuracy. And you may be able to rely more heavily on secondary research. You can even track your competition with online tools like Compete.com.

Whatever the case, don’t abandon marketing research, especially now. Some carefully thought out, informal research is better than no research at all. Marketing research is the compass that will help you navigate out of these hard economic times.