## Posts Tagged ‘sample size’

### Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R² of 0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is the coefficient for our “Total Contributions” variable. Our F statistic is a little over 72. Essentially, the R² and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value for both the hours of sleep and hours of studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply the two differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering the interaction term, which means you’re taking into account central tendency in your data.
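The centering steps described above can be sketched in a few lines of Python. This is only an illustration: the six observations are made up, and the helper names `centered_interaction` and `pearson` are hypothetical, not part of any library. The sketch also compares how strongly hours of sleep correlates with an uncentered product versus the centered interaction term, which is one way to see why centering is used.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for six students (not taken from the post).
sleep = [8.0, 7.0, 6.0, 5.0, 7.0, 6.0]   # hours of sleep
study = [2.0, 3.0, 5.0, 6.0, 4.0, 4.0]   # hours spent studying


def centered_interaction(x, y):
    """Subtract each variable's mean, then multiply the differences."""
    x_bar, y_bar = mean(x), mean(y)
    return [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]


def pearson(x, y):
    """Plain Pearson correlation coefficient between two lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


sleep_study_interaction = centered_interaction(sleep, study)
raw_product = [s * h for s, h in zip(sleep, study)]  # uncentered, for comparison

print("centered interaction:", sleep_study_interaction)
print("|r| sleep vs raw product:     ", abs(pearson(sleep, raw_product)))
print("|r| sleep vs centered product:", abs(pearson(sleep, sleep_study_interaction)))
```

With this particular toy data the centered term is far less correlated with hours of sleep than the raw product is; real data will not always behave so neatly.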

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable, dependent and independent, and, beginning with the second observation, subtract the immediately prior observation’s value from the current observation’s value. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing reduces multicollinearity by removing the trend component of the time series; if all independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. Subtracting the period before, as described here, is taking a “first difference.” Differencing the first differences again gives a “second difference,” and so on. Note also that each round of differencing costs one observation at the start of the data, so if you have a small data set, differencing reduces your degrees of freedom and increases your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
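As a minimal sketch of differencing, the snippet below applies it to a made-up series. The `difference` helper and the `sales` figures are hypothetical, and “second difference” is taken in the usual sense of differencing the first differences again. Each pass shortens the series by one observation.

```python
def difference(series, order=1):
    """Difference a series `order` times; each pass drops one observation."""
    for _ in range(order):
        series = [curr - prev for prev, curr in zip(series, series[1:])]
    return series


# Hypothetical series whose period-to-period growth itself keeps growing.
sales = [100, 110, 125, 145, 170]

first = difference(sales)            # change from one period to the next
second = difference(sales, order=2)  # change in the change

print(first)   # [10, 15, 20, 25]
print(second)  # [5, 5, 5]
```

In a regression, the same transformation would be applied to the dependent variable and to every independent variable before fitting.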

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
3. You should suspect multicollinearity if:
    1. You have a high R² but low t-statistics;
    2. The sign for a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy for multicollinearity contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and against the resulting impact on your forecast accuracy.
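As a rough illustration of the second takeaway, the sketch below flags any pair of independent variables that are more strongly correlated with each other than either is with the dependent variable. The data, the variable names `x1`, `x2`, `x3`, and the helper `flag_collinear_pairs` are all hypothetical; this is one possible reading of the rule of thumb, not a formal test.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_collinear_pairs(predictors, y):
    """Return predictor pairs whose mutual |r| exceeds both of their |r| with y."""
    names = list(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r_ab = abs(pearson(predictors[a], predictors[b]))
            r_ay = abs(pearson(predictors[a], y))
            r_by = abs(pearson(predictors[b], y))
            if r_ab > max(r_ay, r_by):
                flagged.append((a, b, round(r_ab, 3)))
    return flagged


# Hypothetical data: x2 moves almost in lockstep with x1.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],
    "x3": [5, 1, 4, 2, 6, 3],
}
y = [3, 1, 6, 4, 9, 5]

flagged = flag_collinear_pairs(predictors, y)
print(flagged)
```

Only the `x1`/`x2` pair is flagged for this toy data; `x3` is nearly uncorrelated with both of the other predictors.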

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.


### Don’t Scrimp on Marketing Research

October 12, 2009

There’s no question marketing research can be expensive.  Even if your company is doing it in-house, marketing research still requires time and staff.  Even if the research is entirely secondary, and can be found in libraries or on the Internet, someone has to collect the information.  There is considerable opportunity cost to conducting marketing research in-house, as the time an employee spends compiling, summarizing, analyzing, and presenting information cannot be dedicated to other projects.  Yet, budget constraints often dictate tradeoffs that must be made for marketing research projects, and often these tradeoffs are made carelessly.

Two obvious ways companies scrimp on their research budgets are selecting a vendor solely on the basis of cost and deciding to perform the research in-house.  Companies also make tradeoffs by using smaller sample sizes for surveys, choosing nonprobability over probability sampling methods, or opting for secondary over primary research, among others.

None of these tradeoffs is inherently bad.  Indeed, when budgets are scarce, you may need to make several in order to balance the scope of your project against your budget constraints.  A decision based on a little good marketing research is still more solid compared to totally unaided judgment.  But the key word here is good.  When cost becomes the overriding constraint for marketing research, companies run the risk of throwing the baby out with the bathwater.

### Pitfalls of in-house research using inexpensive survey tools

Generally, when conducting a survey, you want to choose a sample that adequately represents the population in which you’re interested. If your company is marketing a product or service to low-income Hispanics, an inexpensive online survey is likely to yield either many respondents who don’t fit that demographic or too few responses, as many low-income Hispanics are unlikely to have Internet access. Yet many companies use inexpensive online survey tools like SurveyMonkey precisely because they are inexpensive, and the results they obtain are useless because either the wrong people or too few of the right people respond.

### Pitfalls of using nonprobability samples over probability samples

Companies have also tried to cut marketing research costs by substituting nonprobability for probability samples. Respondents in a nonprobability sample are chosen on the basis of judgment or convenience, unlike respondents in a probability sample, who are chosen at random, with each member of the population having a known, nonzero chance of selection. When someone at a shopping mall or on the street asks you to take a survey, you’re being selected for a nonprobability sample.

There is nothing wrong with using nonprobability samples; surveys using such samples can be executed rather quickly if time is an issue. Furthermore, when a population is quite small and scattered (e.g., persons with a rare disease being recruited for a clinical trial), or when few published directories of its members (the sample frame) exist (e.g., medical coding professionals), probability sampling can be cost-prohibitive and infeasible. However, generalizing results from a survey administered to a nonprobability sample to the true population can be difficult and highly error-prone. As long as you understand this drawback and make allowances for it, you will be OK.

### Pitfalls of using smaller sample sizes

Sample size is another popular way companies attempt to cut marketing research costs.  Assume that a local pizzeria wants to explore the feasibility of offering delivery.  The pizzeria needs to estimate the amount a household within its ZIP code spends on a typical pizza delivery order, and randomly selects 100 households from within the ZIP code for a survey.  Further assume the pizzeria wants a five percent margin of error.  The survey is executed and the pizzeria finds that the average household surveyed spends \$15 on a pizza delivery.  Figure in the plus or minus five percent, and the estimated average is between \$14.25 and \$15.75.

The problem emerges when the pizzeria tries to generalize this average delivery order to all households within that ZIP code. A sample size of 100 with a five percent margin of error corresponds, assuming maximum variability, to only about a 68% confidence level. That means the pizzeria can be only about 68% confident that the true average pizza delivery order in that ZIP code is between \$14.25 and \$15.75. If the pizzeria concludes that that order size is too small to justify offering delivery, but the true average turns out to be well above that range, the pizzeria has left money on the table. On the other hand, if the pizzeria concludes that the order size is large enough to justify offering delivery, but the true average falls below that range, it risks adding an unprofitable service.
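To see where a confidence figure like this can come from, here is a small sketch that backs out the confidence level implied by a sample size and a margin of error. It assumes the margin is treated as a margin on a proportion with maximum variability (p = 0.5) and a large population, the same convention that produces the familiar sample size of 384; the function name `implied_confidence` is hypothetical. Under those assumptions, a sample of 100 with a five percent margin comes out at roughly two-thirds confidence, in line with the figure above.

```python
from math import sqrt
from statistics import NormalDist


def implied_confidence(n, margin, p=0.5):
    """Confidence level implied by sample size n and a +/- margin on a proportion.

    Assumes simple random sampling from a large population and a normal
    approximation; p=0.5 is the maximum-variability case.
    """
    standard_error = sqrt(p * (1 - p) / n)
    z = margin / standard_error
    return 2 * NormalDist().cdf(z) - 1


confidence_100 = implied_confidence(100, 0.05)
confidence_384 = implied_confidence(384, 0.05)

print(f"n=100: {confidence_100:.1%}")
print(f"n=384: {confidence_384:.1%}")
```

A margin on an average dollar amount would strictly depend on the spread of order sizes in the sample, which the example does not give, so this is only an approximation of the reasoning.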

The optimal sample size is the one that gives you the highest level of confidence and the most tolerable error margin required for you to make an objective marketing decision.

### How to keep marketing research costs low while keeping quality high

When it comes to getting the most bang for your research buck, use the Pareto Principle as a starting point.  The Pareto Principle, also known as the 80/20 rule, states that about 80% of your results will come from roughly 20% of your efforts.  So, 80% of the information you need to make your decision will come from just 20% of the research questions you ask.  Find out what those questions are and be sure they are asked.  This process alone will help optimize the scope and cost of your research project.

Also, decide how much confidence you need and how much error you’ll tolerate. Some decisions require more accuracy and/or confidence than others. But your sample size grows as these accuracy and confidence needs grow. The best way to start is by asking, “How much precision would I gain using a 95% vs. 90% confidence level, and would that precision justify the extra cost to get it?” A 95% confidence level with a five percent error margin requires a sample of 384 people, while a 90% confidence level with the same margin of error requires a sample of just 271. Surveying those additional 113 people can add a couple thousand dollars to the cost of your project. Is the gain in precision worth that additional cost?
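The two sample sizes quoted above are consistent with the standard formula for estimating a proportion, assuming maximum variability (p = 0.5) and a large population. The formula and the z-values of 1.96 and 1.645 are the usual textbook ones rather than something stated in the post, and e is the margin of error:

$$
n = \frac{z^2\,p(1-p)}{e^2}, \qquad
n_{95\%} = \frac{1.96^2 \times 0.25}{0.05^2} \approx 384, \qquad
n_{90\%} = \frac{1.645^2 \times 0.25}{0.05^2} \approx 271
$$

Because e is squared in the denominator, halving the margin of error roughly quadruples the required sample.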

Also, does your project really need a probability sample? Not every research project does. If the purpose of your research is exploratory, you generally need neither a large sample nor a randomly generated sample. If you sell healthy meal solutions and want to understand the issues busy moms face when trying to prepare healthy meals for their children, you might simply run an ad in a local paper to recruit maybe 10 or 15 of those moms to participate in either a focus group or an in-depth interview. This qualitative information can be sufficient on its own, or can lay the groundwork for a future larger, quantitative (not to mention probability sample) study.

Thoroughly understanding your business problem will give you the best idea of the scope your research requires, the precision you need, and the money you should budget. If, after determining the necessary scope and precision, you find that the project is going to be prohibitively expensive, you can do one of two things. One alternative is to look at all the information you are seeking to collect, then prioritize each piece in terms of its benefit to your marketing objectives. Those parts of the research project that will add the most value should be undertaken; the remainder can be delayed until funds or time are available. The other alternative is to do a cost-benefit analysis of the entire study. If you had only \$20,000 budgeted for the study, but you find it will cost you \$35,000, weigh that cost against the value of the insights. If, after the study, you would be able to make decisions that increase sales, or reduce marketing costs, by more than \$35,000, then it makes sense to make the case for more funding.

The expression, “you get what you pay for” rings especially true for marketing research.

### How Much Damage do Bad Respondents do to Survey Results?

May 11, 2009

Minimizing both the number of bad respondents who take a survey and their impact on the survey results can seem as futile as Sisyphus pushing the rock up the mountain.  Bad respondents come in all flavors: professional respondents, speeders, retakers (people who take the same survey multiple times),  and outright frauds (people who aren’t who or what they claim to be).

Researchers have tried different approaches to these problems, including increasing sample size, eliminating one or two biggest types of bad respondents, or even ignoring the problem altogether.   Unfortunately, the first two approaches can actually cause more damage than doing nothing at all.  Let’s look at these three approaches more closely.

Approach 1: Increase the sample size

When concerned about accuracy, the common prescription among researchers is to increase the size of their sample.  Indeed, this approach reduces sampling error,  margin of error, and the impact of multicollinearity, and increases the confidence level in the results.   However, larger sample sizes are a double-edged sword.  Because a larger sample size reduces the standard error in the data, it also increases the t-value.  As a result, a small difference between two or more respondent groups can greatly increase the chance of committing a Type I error (rejecting a true null hypothesis).

Similarly,  if a sample has bad respondents, a larger sample size can actually exacerbate their impact on survey results.  After all, bad respondents are likely to respond to survey questions differently than legitimate respondents.  A larger sample size (even if every additional new respondent is good), will simply reduce the degree of these differences needed for statistical significance, and inflate the chance of drawing an erroneous conclusion from the survey findings.

Approach 2: Tackle the biggest offender

When faced with multiple problems, it is human nature to focus on eradicating the one or two worst problems. While that might work in most situations, eliminating only one type of bad respondent can actually cause more problems.

Assume that a survey’s results include responses from both professional respondents and speeders. Assume also that the survey has some ratings questions. What if, compared to legitimate respondents, the former rate an item higher than average, and the latter lower than average?

By having both types of bad respondents in the survey, their overall impact on the mean may be negligible.  However, if you take out only one of them, the mean will become biased in favor of the type that was left alone, again exacerbating the impact of bad respondents.

Approach 3: Do Nothing

While doing nothing is preferable to the other two approaches, it has its own problems. Return to the example of two types of bad respondents. While leaving both of them alone will keep the mean close to what it would be in the absence of both types, it will also inflate the variance of the data, resulting in an estimate of the mean that is untrustworthy. Hence, removing one type of bad respondent causes biased results while doing nothing causes inefficient results, and neither outcome is pleasant.
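A tiny numerical sketch makes the bias-versus-variance tradeoff concrete. The ratings below are made up, and they are deliberately constructed so that the two kinds of bad respondents offset each other exactly; real data would rarely be this tidy.

```python
from statistics import mean, variance

# Hypothetical ratings on a 10-point scale.
legit = [5, 6, 5, 6, 5, 6, 5, 6]   # legitimate respondents
professionals = [9, 9]             # assumed to rate higher than average
speeders = [2, 2]                  # assumed to rate lower than average

everyone = legit + professionals + speeders
speeders_removed = legit + professionals   # "tackle the biggest offender"

mean_legit = mean(legit)
mean_everyone = mean(everyone)                   # do nothing
mean_speeders_removed = mean(speeders_removed)   # remove one type only

var_legit = variance(legit)
var_everyone = variance(everyone)

print(mean_legit, mean_everyone, mean_speeders_removed)  # 5.5 5.5 6.2
print(round(var_legit, 2), round(var_everyone, 2))       # 0.29 4.64
```

Leaving both groups in keeps the mean at 5.5 but multiplies the variance many times over, while removing only the speeders pushes the mean up to 6.2.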

1. Ask how your sample vendor screens people wishing to join its panel;
2. Find out how your vendor ensures that panelists who are on other panels are precluded from being sent the same survey;
3. Determine how your vendor tracks the survey-taking behavior of its panelists, assesses the legitimacy of each, and purges itself of suspected bad respondents; and
4. Determine how your vendor prevents a person with multiple e-mail addresses – if you’re doing online surveys – from trying to register each one as a separate panelist.

### How large should a survey sample be? Not as large as you think!

March 3, 2009

When conducting a survey, one of the key challenges a company faces is determining the appropriate sample size.  How many people should they survey?  Quite often, clients resort to subjective judgments, based on budget, past business processes, or corporate politics.

Some clients find safety in large sample sizes, and request 1,000 completed surveys.  But is a sample of 1,000 really necessary?  It depends on the level and sophistication of analysis the client needs to do.  If, for example, the client wants to compare market sizes for its product in five or six geographic regions within the country, a sample of 1,000 might be ideal for that level of analysis.  But what if the client is comparing at most two or three groups?  A sample size of 1,000 would be both overkill and a waste of money.

A scientific way to determine the ideal sample is to use a confidence interval approach. This approach requires the client to know just three things: the desired level of confidence (a 95% confidence level means that if samples were randomly drawn 100 times, we would expect the intervals from about 95 of them to contain the true population parameter); the variability (the degree to which respondents’ answers are similar or dissimilar to one another); and the desired level of error.

In most business cases, confidence levels of 95% and 99% are common. Insofar as variability is concerned, if you have no idea of the variability, you should assume maximum variability (50-50 chances). And a 5% margin of error is pretty standard.

So, for a 95% confidence level, with maximum variability, and a 5% margin of error, a client would need only a sample size of 384.  At a 99% confidence level, the required sample size would only be 663.  Both are well below 1,000, and provide high levels of accuracy.  And a lot cheaper!
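The figures above can be reproduced with a short sketch, assuming the standard large-population formula for a proportion, n = z²·p(1−p)/e². The function name `sample_size` is hypothetical, and the result is rounded to the nearest whole respondent to match the numbers quoted in this post; rounding up, which is the more conservative habit, would give 385 and 664 instead.

```python
from statistics import NormalDist


def sample_size(confidence, margin, p=0.5):
    """Approximate sample size for estimating a proportion.

    Assumes a large population and simple random sampling; p=0.5 is the
    maximum-variability case. Rounds to the nearest integer (an assumption
    made here to match the figures quoted in the text).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-value
    return round(z ** 2 * p * (1 - p) / margin ** 2)


n_95 = sample_size(0.95, 0.05)
n_99 = sample_size(0.99, 0.05)

print(n_95, n_99, n_99 - n_95)  # 384 663 279
```

Trying other inputs, such as a 3% margin of error, shows how quickly the required sample grows as the tolerated error shrinks.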

But notice one thing: when the client increased its confidence by just 4 percentage points, it needed to survey 279 more people! The sample size had to be increased dramatically for just a small increase in accuracy! Simply put, the accuracy gained diminishes with each one-unit increase in sample size. In this case, the client needs to decide whether the benefit of the additional confidence justifies the cost of surveying an additional 279 people. If millions of dollars are at stake, then the additional cost is justified. If relatively few dollars or resources are at stake, probably not.

Your ideal sample size is that which provides you with the level of accuracy you need for the value you expect to receive.  If each addition to your accuracy increases the value of the research benefit, by all means, increase the sample size until the benefit of that accuracy is maximized.  But in many cases, even that point will be reached somewhere below 1,000.