## Archive for July, 2010

### Forecast Friday Topic: Detecting Autocorrelation

July 29, 2010

(Fifteenth in a series)

We have spent the last few Forecast Friday posts discussing violations of different assumptions in regression analysis. So far, we have discussed the effects of specification bias and multicollinearity on parameter estimates, and their corresponding effect on your forecasts. Today, we will discuss another violation, autocorrelation, which occurs when sequential residual (error) terms are correlated with one another.

When working with time series data, autocorrelation is the most common problem forecasters face. When the assumption of uncorrelated residuals is violated, we end up with models that have inefficient parameter estimates and upwardly-biased t-ratios and R² values. These inflated values make our forecasting model appear better than it really is, and can cause our model to miss turning points. Hence, if your model predicts an increase in sales but sales actually plunge, autocorrelation may be to blame.

What Does Autocorrelation Look Like?

Autocorrelation can take on two types: positive or negative. In positive autocorrelation, consecutive errors usually have the same sign: positive residuals are almost always followed by positive residuals, while negative residuals are almost always followed by negative residuals. In negative autocorrelation, consecutive errors typically have opposite signs: positive residuals are almost always followed by negative residuals and vice versa.

In addition, there are different orders of autocorrelation. The simplest, most common kind of autocorrelation, first-order autocorrelation, occurs when the consecutive errors are correlated. Second-order autocorrelation occurs when error terms two periods apart are correlated, and so forth. Here, we will concentrate solely on first-order autocorrelation.

You will see a visual depiction of positive autocorrelation later in this post.

What Causes Autocorrelation?

The two main culprits for autocorrelation are sluggishness in the business cycle (also known as inertia) and omitted variables from the model. At various turning points in a time series, inertia is very common. At the time when a time series turns upward (downward), its observations build (lose) momentum, and continue going up (down) until the series reaches its peak (trough). As a result, successive observations and the error terms associated with them depend on each other.

Another example of inertia happens when forecasting a time series where the same observations can be in multiple successive periods. For example, I once developed a model to forecast enrollment for a community college, and found autocorrelation to be present in my initial model. This happened because many of the students enrolled during the spring term were also enrolled in the previous fall term. As a result, I needed to correct for that.

The other main cause of autocorrelation is omitted variables from the model. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable has a positive correlation with the dependent variable, it is likely to cause error terms that are positively correlated.

How Do We Detect Autocorrelation?

To illustrate how we go about detecting autocorrelation, let’s first start with a data set. I have pulled the average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987. The original source was the Survey of Current Business, September issues from 1986 and 1987, but this data set was reprinted in Data Analysis Using Microsoft® Excel, by Michael R. Middleton, page 219:

| Month  | t  | Wage |
|--------|----|------|
| Jan-86 | 1  | 5.82 |
| Feb-86 | 2  | 5.79 |
| Mar-86 | 3  | 5.80 |
| Apr-86 | 4  | 5.81 |
| May-86 | 5  | 5.78 |
| Jun-86 | 6  | 5.79 |
| Jul-86 | 7  | 5.79 |
| Aug-86 | 8  | 5.83 |
| Sep-86 | 9  | 5.91 |
| Oct-86 | 10 | 5.87 |
| Nov-86 | 11 | 5.87 |
| Dec-86 | 12 | 5.90 |
| Jan-87 | 13 | 5.94 |
| Feb-87 | 14 | 5.93 |
| Mar-87 | 15 | 5.93 |
| Apr-87 | 16 | 5.94 |
| May-87 | 17 | 5.89 |
| Jun-87 | 18 | 5.91 |

Now, let’s run a simple regression model, using time period t as the independent variable and Wage as the dependent variable. Using the data set above, we derive the following model:

Ŷ = 5.7709 + 0.0095t

Examine the Model Output

Notice also the following model diagnostic statistics:

R² = 0.728

| Variable  | Coefficient | t-ratio |
|-----------|-------------|---------|
| Intercept | 5.7709      | 367.62  |
| t         | 0.0095      | 6.55    |

You can see that the R² is a high number, with changes in t explaining nearly three-quarters of the variation in average hourly wage. Note also the t-ratios for both the intercept and the parameter estimate for t. Both are very high. Recall that a high R² and high t-ratios are symptoms of autocorrelation.
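As a check, the coefficients and R² above can be reproduced with a short sketch. NumPy is an assumption here (the post itself uses Excel); any ordinary least squares routine would do:

```python
import numpy as np

# Average hourly wages, Jan 1986 - Jun 1987, from the table above
wages = np.array([5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
                  5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91])
t = np.arange(1, 19)

# Fit the simple trend model Wage = b0 + b1*t by ordinary least squares
slope, intercept = np.polyfit(t, wages, 1)

# R-squared: 1 minus the ratio of residual to total sum of squares
residuals = wages - (intercept + slope * t)
r_squared = 1 - np.sum(residuals ** 2) / np.sum((wages - wages.mean()) ** 2)

print(round(intercept, 4), round(slope, 4), round(r_squared, 3))  # 5.7709 0.0095 0.728
```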

Visually Inspect Residuals

Just because a model has a high R² and parameters with high t-ratios doesn’t mean autocorrelation is present. More work must be done to detect autocorrelation. Another way to check for autocorrelation is to visually inspect the residuals. The best way to do this is to plot the average hourly wage predicted by the model alongside the actual average hourly wage, as Middleton has done:

Notice the green line representing the Predicted Wage. It is a straight, upward line. This is to be expected, since the independent variable is sequential and shows an increasing trend. The red line depicts the actual wage in the time series. Notice that the model’s forecast is higher than actual for months 5 through 8, and for months 17 and 18. The model also underpredicts for months 12 through 16. This clearly illustrates the presence of positive, first-order autocorrelation.

The Durbin-Watson Statistic

Examining the model components and visually inspecting the residuals are intuitive, but not definitive ways to diagnose autocorrelation. To really be sure if autocorrelation exists, we must compute the Durbin-Watson statistic, often denoted as d.

In our June 24 Forecast Friday post, we demonstrated how to calculate the Durbin-Watson statistic. The actual formula is:

d = Σ_{t=2..n} (e_t − e_{t−1})² / Σ_{t=1..n} e_t²

That is, beginning with the error term for the second observation, we subtract the immediate previous error term from it; then we square the difference. We do this for each observation from the second one onward. Then we sum all of those squared differences together. Next, we square the error terms for each observation, and sum those together. Then we divide the sum of squared differences by the sum of squared error terms, to get our Durbin-Watson statistic.

For our example, we have the following:

| t  | Error    | Squared Error | e_t − e_{t−1} | Squared Difference |
|----|----------|---------------|---------------|--------------------|
| 1  | 0.0396   | 0.0016        |               |                    |
| 2  | 0.0001   | 0.0000        | (0.0395)      | 0.0016             |
| 3  | 0.0006   | 0.0000        | 0.0005        | 0.0000             |
| 4  | 0.0011   | 0.0000        | 0.0005        | 0.0000             |
| 5  | (0.0384) | 0.0015        | (0.0395)      | 0.0016             |
| 6  | (0.0379) | 0.0014        | 0.0005        | 0.0000             |
| 7  | (0.0474) | 0.0022        | (0.0095)      | 0.0001             |
| 8  | (0.0169) | 0.0003        | 0.0305        | 0.0009             |
| 9  | 0.0536   | 0.0029        | 0.0705        | 0.0050             |
| 10 | 0.0041   | 0.0000        | (0.0495)      | 0.0024             |
| 11 | (0.0054) | 0.0000        | (0.0095)      | 0.0001             |
| 12 | 0.0152   | 0.0002        | 0.0205        | 0.0004             |
| 13 | 0.0457   | 0.0021        | 0.0305        | 0.0009             |
| 14 | 0.0262   | 0.0007        | (0.0195)      | 0.0004             |
| 15 | 0.0167   | 0.0003        | (0.0095)      | 0.0001             |
| 16 | 0.0172   | 0.0003        | 0.0005        | 0.0000             |
| 17 | (0.0423) | 0.0018        | (0.0595)      | 0.0035             |
| 18 | (0.0318) | 0.0010        | 0.0105        | 0.0001             |
| **Sum** |     | **0.0163**    |               | **0.0171**         |

To obtain our Durbin-Watson statistic, we plug our sums into the formula:

d = 0.0171 / 0.0163 ≈ 1.050
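The table’s sums and the resulting statistic can be reproduced end-to-end with a short sketch (NumPy assumed; the trend model is refit so the block stands alone):

```python
import numpy as np

# Average hourly wages, Jan 1986 - Jun 1987, as given in the post
wages = np.array([5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
                  5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91])
t = np.arange(1, 19)

# Refit the trend model and compute its residuals (error terms)
slope, intercept = np.polyfit(t, wages, 1)
e = wages - (intercept + slope * t)

# Durbin-Watson: sum of squared successive differences of the residuals,
# divided by the sum of the squared residuals
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 2))  # roughly 1.05, matching the hand calculation above
```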

What Does the Durbin-Watson Statistic Tell Us?

Our Durbin-Watson statistic is 1.050. What does that mean? The Durbin-Watson statistic is interpreted as follows:

• If d is close to zero (0), then positive autocorrelation is probably present;
• If d is close to two (2), then the model is likely free of autocorrelation; and
• If d is close to four (4), then negative autocorrelation is probably present.

As we saw from our visual examination of the residuals, we appear to have positive autocorrelation, and the fact that our Durbin-Watson statistic is about halfway between zero and two suggests the presence of positive autocorrelation.
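The rule of thumb above can be sketched as a small helper. The 1.5/2.5 cutoffs are illustrative assumptions only; in practice you would compare d against tabulated Durbin-Watson critical bounds (dL, dU) for your sample size and number of predictors:

```python
# Rough reading of a Durbin-Watson statistic, mirroring the bullets above.
# The cutoffs are arbitrary round numbers, not formal critical values.

def read_durbin_watson(d: float) -> str:
    if d < 1.5:
        return "positive autocorrelation suspected"
    if d > 2.5:
        return "negative autocorrelation suspected"
    return "little evidence of autocorrelation"

print(read_durbin_watson(1.05))  # positive autocorrelation suspected
```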

Next Forecast Friday Topic: Correcting Autocorrelation

Today we went through the process of understanding the causes and effect of autocorrelation, and how to suspect and detect its presence. Next week, we will discuss how to correct for autocorrelation and eliminate it so that we can have more efficient parameter estimates.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Help! Customer Satisfaction is High But Sales are Down!

July 28, 2010

Customer satisfaction measurement has been of great interest to service organizations for some years now. Nearly every industry that is both highly competitive and heavily customer-facing – like restaurants, hotels, and banks – knows that a poor customer experience can result in lost future sales to the competition. As a result, these service-oriented businesses make every effort to keep their ears open to the voice of the customer. Indeed, customer satisfaction surveys proliferate – I once received five in a single week – as company after company strives to hear that customer voice.

And the effort may be futile. This isn’t to say that measuring customer satisfaction isn’t important – most certainly it is. But many companies may be overdoing it. In fact, some companies are seeing negative correlations between customer satisfaction and repeat business! Is this happening to you?

Reasons Why Satisfaction Scores and Sales Don’t Sync

If your customers are praising you in satisfaction surveys but you’re seeing no improvement in sales and repeat business, it could be for one or more of the following reasons:

You’re not Asking the Question Right

Often, a disparity between survey results and actual business results can be attributed to the two measuring different things. If you simply ask, “Overall, how satisfied were you with your stay at XYZ Hotel,” it only tells you about their current experience. If 80 percent of your respondents indicate “Satisfied” or “Very Satisfied,” you only get information about their attitudes. Then you compare satisfaction scores to either total sales or repeat sales from quarter to quarter. And you find either no correlation or a negative correlation. Why? Because the survey question measured only their perceived satisfaction, while the business results measured sales.

On the other hand, if you were to ask the question: “How likely are you to return to XYZ Hotel,” or “How likely are you to recommend XYZ Hotel to a friend or relative,” you might get a better match between responses and business outcomes.

Only Your Happiest Customers Are Responding

Another reason satisfaction scores may be high while sales are declining is that only your most loyal customers are taking the time to complete your survey. These customers may have been conditioned to respond: their frequent patronage earns them special incentives and better treatment than most customers receive, so their experience is unrepresentatively good.

Another, more dangerous, reason your happiest customers may be the only respondents is because the distribution of the survey is “managed,” being sent only to the people most likely to give high scores. There is a great risk of this bias in organizations where top executives’ compensation is tied to customer satisfaction scores.

Respondents Aren’t Telling the Truth

As much as we hate to admit it, we’re not as honest as we claim to be. This is especially true in surveys. Entire books could be written on respondent honesty (or the lack thereof). There are several reasons respondents don’t give truthful answers about their satisfaction. One obvious reason is courtesy; some people just don’t like to give negative feedback. And even with the promise of confidentiality, respondents worry that a poor rating will earn them a phone call from the business’s representative, a call they aren’t comfortable taking.

Survey incentives – if not carefully structured – can also lead to untruthful respondents. If you offer respondents a chance to win a drawing in exchange for completing your customer satisfaction survey, they may lie and say positive things about their experience, in the hopes that it would increase their odds of winning the prize.

You’re Hearing Your Customer but Not Really Listening

In many cases, your customers might say one thing but mean another. A customer could be quite satisfied on the whole, yet one or two smaller issues, if left unchecked, can reduce the likelihood of repeat business. For example, suppose you sell clothing online but not shoes, and a customer doesn’t discover this until after loading everything else into the online shopping cart. Assuming the customer doesn’t abandon the cart, he or she completes the order for the clothes. On the survey, that customer might indicate being very satisfied with the order. But deep down, the same customer might not have liked that your online store doesn’t sell shoes. Whether or not the survey captures that issue, the next time that customer shops for clothes online, he or she may remember that you don’t sell shoes and choose to place the entire order with a competitor who does.

How Can We Remedy This Disparity?

There are a few ways we can remedy these situations. First, make sure the questions you ask reflect your business goals. If you want satisfied customers to return, be sure to ask how likely they are to return. Then measure the scores against actual repeat business. If you want satisfied customers to recommend your business to a friend, make sure you ask how likely they are to do so and then measure that against referrals. Compare apples to apples.

Second, reduce incentives for bias. Ideally, no executive’s compensation should be tied to survey ratings. Instead, tie compensation to actual results. If compensation must be tied to survey results, then by all means make sure the survey is administered by employees with no vested interest in the outcome of the survey. Also, make sure that your entire list of people to survey comes from similarly disinterested employees of the organization.

Third, encourage non-loyal customers to participate. You might create a separate survey for your most loyal customers. For the non-loyal customers, make sure you have ways to encourage them to respond. Whether it’s through an appropriate incentive (say a coupon for a future visit), or through friendly requests, let your non-loyal customers know you still care about their feedback.

Fourth, place reliability checks in your survey. Ask the same question in two ways (positive and negative) or phrase it slightly differently and compare the results. In the former example, you would expect the answers to be on opposite ends of the rating scale. In the latter, you would expect consistency of responses on the same end of the scale. This helps you determine whether respondents are being truthful.

Finally, be proactive. In the example of your online clothing store, you might have the foresight to realize that your decision not to sell shoes may impact satisfaction and future business. So you might be upfront about it, but at the same time, offer a link to a cooperating online retailer who does sell shoes, and allow the customer to order shoes from that retailer using the same shopping cart. That may keep the customer’s satisfaction high and increase his/her likelihood of future business.


### Considerations for Selecting a Representative Sample

July 27, 2010

When trying to understand and make inferences about a population, it is neither possible nor cost effective to survey everyone who comprises that population. Therefore, analysts choose to survey a reasonably-sized sample of the population, whose results they can generalize to the entire population. Since such sampling is subject to error, it is vitally important that an analyst select a sample that is adequately representative of the population at large. Ensuring that a sample represents the population as accurately as possible requires that the sample be drawn using well-established, specific principles. In today’s post, we will be discussing the considerations for selecting a representative sample.

What is the Unit of Analysis?

What is the population you are interested in measuring? Let’s assume you are a market research analyst for a life insurance company and you are trying to understand the degree of existing life insurance coverage among households in the greater Chicago area. Already, this is a challenging prospect. What constitutes “life insurance coverage,” “a household,” or “the greater Chicago area”? As the analyst, you must define these before you can move forward. Does “coverage” mean having any life insurance policy, regardless of amount? Or does it mean having life insurance that covers the oft-recommended eight to ten times the principal breadwinner’s salary? Does it mean having individual or group life insurance, or either one?

Does “household” mean a unit with at least one adult and the presence of children? Can a household consist of one person for your analysis?

Does the “greater Chicago area” mean every household within the Chicago metropolitan statistical area (MSA), as defined by the U.S. Census Bureau, or does it mean the city of Chicago and its suburban collar counties (e.g., Cook, DuPage, Lake, Will, McHenry, Kane, Kendall)?

All of these are considerations you must decide on.

You talk through these issues with some of the relevant stakeholders: your company’s actuarial department, the marketing department, and the product development department, and you learn some new information. You find out that your company wants to sell a highly-specialized life insurance product to young (under 40), high-salaried (at least \$200,000) male heads-of-household that provides up to ten times the income coverage. You find that “male head-of-household” is construed to mean any man who has children under 18 present in his household and has either no spouse or a spouse earning less than \$20,000 per year.

You also learn that this life insurance product is being pilot tested in the Chicago area, and that the insurance company’s captive agent force has offices only within the City and its seven collar counties, although agents may write policies for any qualifying person in Illinois. You can do one of two things here. Since all your company’s agents are in the City and collar counties, you might simply restrict your definition of “greater Chicago area” to this region. Or, you might select this area, and add to it nearby counties without agencies, where agents write a large number of policies. Whether you do the former or latter depends on the timeframe available to you. If you can easily and quickly obtain the information for determining the additional counties, you might select the latter definition. If not, you’ll likely go with the former. Let’s assume you choose only those in the City and its collar counties.

Another thing you find out through communicating with stakeholders is that the intent of this insurance product is to close gaps in, not replace, existing life insurance coverage. Hence, you now know your relevant population:

*Men under the age of 40, living in the city of Chicago or its seven collar counties, with a salary income of at least \$200,000 per year, heading a household with at least one child under 18 present, with either no spouse or a spouse earning less than \$20,000 per year, and who have life insurance coverage that is less than ten times their annual salary income.*

You can see that this is a very specific unit of analysis. For this type of insurance product, you do not want to survey the general population, as this product will be irrelevant for most. Hence, the above italicized definition is your working population. It is from this group that you want to draw your sample.

How Do You Reach This Working Population?

Now that you have identified your working population, you must find a master list of people from which to draw your sample. Such a list is known as the sample frame. As you’ve probably guessed, no single list will contain your working population precisely. Hence, you will spend some time searching for as comprehensive a list as possible – or some combination of lists – covering everyone in your working population. The degree to which your sample frame fails to account for all of your working population is known as its bias, or sample frame error, and such error cannot be totally eradicated.

Sample frame error exists because some of these upscale households move out while others move in; some die; some have unlisted phone numbers or don’t give out their email addresses; some will lose their jobs, while others move into these high-paying jobs; and some will hit age 40, or their wives will get higher-paying jobs. And these changes are dynamic. There’s nothing you can do, except be aware of them.

To obtain your sample frame, you might start by asking yourself several questions about your working population: What ZIP codes are they likely to live in? What types of hobbies do they engage in? What magazines and newspapers do they subscribe to? Where do they take vacations? What clubs and civic organizations do they join? Do they use financial planners or CPA’s?

Armed with this information, you might purchase mailing lists of such men from magazine subscriptions; you might search phone listings in upscale Chicago area communities like Winnetka, Kenilworth, and Lake Forest. You might network with travel agents, real estate brokers, financial advisors, and charitable organizations. You may also purchase membership lists from clubs. You will then combine these lists to come up with your sample frame. The degree to which you can do this depends on your time and budget constraints, as well as any regulatory and ethical practices (e.g., privacy, Do Not Call lists, etc.) governing collection of such lists.

Many market research firms have made identifying the sample frame much easier in recent years, thanks to survey panels. Panels are groups of respondents who have agreed in advance to participate in surveys. The existence of survey panels has greatly reduced the amount of time and cost involved in compiling one’s own sample frame. The drawback, however, is that respondents from a panel self-select to join the panel. And panel respondents can be very different from other members of the working population who are not on a panel.

Weeding Out the Irrelevant Population

Your sample frame will never include all those who fit your working population, nor will it exclude all those who do not fit your working population. As a result, you will need to eliminate extraneous members of your sample frame. Unfortunately, there’s no proactive way to do this. Typically, you must ask screening questions at the beginning of your survey to identify if a respondent qualifies to take the survey, and then terminate the survey if a respondent fails to meet the criteria.
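As a rough illustration, screening logic like this can be sketched as a simple filter. The field names and thresholds below are hypothetical, loosely mirroring the working-population definition above (the geography check is omitted for brevity):

```python
# Hypothetical survey screener. Every key and cutoff here is an illustrative
# assumption, not a prescription for a real questionnaire.

def qualifies(respondent: dict) -> bool:
    """Return True if the respondent fits the working population."""
    return (
        respondent["sex"] == "M"
        and respondent["age"] < 40
        and respondent["salary"] >= 200_000
        and respondent["children_under_18"] >= 1
        and respondent["spouse_income"] < 20_000
        and respondent["coverage_to_salary_ratio"] < 10
    )

respondents = [
    {"sex": "M", "age": 35, "salary": 250_000, "children_under_18": 2,
     "spouse_income": 0, "coverage_to_salary_ratio": 3},
    {"sex": "M", "age": 45, "salary": 300_000, "children_under_18": 1,
     "spouse_income": 0, "coverage_to_salary_ratio": 2},  # fails: over 40
]

# Terminate (drop) anyone who fails the screening criteria
qualified = [r for r in respondents if qualifies(r)]
print(len(qualified))  # 1
```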

Summary

Selecting a representative sample is an intricate process that requires serious thought and communication between stakeholders, about the objectives of the survey, the definition of the relevant working population, the approach to finding and reaching members of the working population, and the time, budget, and regulatory constraints involved. No sample will ever be completely representative of the population, but samples can and should be reasonably representative.

### Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R² of 0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F-statistic is a little over 72. Essentially, the R² and F statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.
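The transformation itself is just an element-wise product. A minimal sketch, using made-up donor figures (the post’s actual donor data set is not reproduced here):

```python
import numpy as np

# Hypothetical donor data for illustration only
avg_contribution = np.array([25.0, 40.0, 10.0, 60.0])  # Average Contribution in Last 12 Months
times_donated = np.array([4, 2, 8, 1])                 # Times Donated in Last 12 Months

# Replace the two collinear predictors with their product
total_contributions = avg_contribution * times_donated  # Total Contributions in Last 12 Months
print(total_contributions.tolist())  # [100.0, 80.0, 80.0, 60.0]
```

The regression would then use `total_contributions` in place of the two original variables.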

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value of both the hours-of-sleep and hours-of-studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply the differences together. This is your interaction term, Sleep_Study_Interaction. Run the regression now with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering the interaction term, which reduces its correlation with the two component variables.
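The steps above can be sketched as follows, with hypothetical study and sleep values:

```python
import numpy as np

# Hypothetical data: hours studied and hours slept per student
hours_study = np.array([1.0, 2.0, 3.0])
hours_sleep = np.array([4.0, 6.0, 8.0])

# Center each variable by subtracting its mean...
study_c = hours_study - hours_study.mean()  # [-1.0, 0.0, 1.0]
sleep_c = hours_sleep - hours_sleep.mean()  # [-2.0, 0.0, 2.0]

# ...then multiply the centered values to form the interaction term
sleep_study_interaction = study_c * sleep_c
print(sleep_study_interaction.tolist())  # [2.0, 0.0, 2.0]
```

The regression would then include `hours_study`, `hours_sleep`, and `sleep_study_interaction` as predictors.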

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediate prior observation’s values for those variables from the current observation. Now, instead of working with original data, you are working with the change in data from one period to the next. Differencing eliminates multicollinearity by removing the trend component of the time series. If all independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. In this case, subtracting the period before was taking a “first difference.” If we subtract two periods before, it’s a “second difference,” and so on. Note also that with differencing, we lose the first observations in the data, depending on how many periods we have to difference, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
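A minimal sketch of first and second differencing, using a made-up series:

```python
import numpy as np

# Hypothetical trending series
sales = np.array([100, 103, 109, 118, 130])

# First difference: change from one period to the next (loses one observation)
first_diff = np.diff(sales)
print(first_diff.tolist())  # [3, 6, 9, 12]

# Second difference: differencing the differences (loses another observation)
second_diff = np.diff(sales, n=2)
print(second_diff.tolist())  # [3, 3, 3]
```

Note how each round of differencing shortens the series by one observation, which is exactly the degrees-of-freedom cost described above.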

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a few different models, some using variations of the remedies we’ve discussed here, and to compare their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more accurate forecasting model with higher degrees of multicollinearity.

The Takeaways:

1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
3. You should suspect multicollinearity if:
   1. You have a high R2 but low t-statistics; or
   2. The sign of a coefficient is the opposite of what is normally expected (a relationship that should be positive is negative, and vice versa).
4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy contributes its own set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and against the resulting impact on your forecast accuracy.
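To illustrate the correlation-matrix check in takeaway #2, here is a minimal Python sketch using simulated (entirely hypothetical) data, where one predictor is built to be nearly a multiple of the other:

```python
import random
import statistics as stats

def corr(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = stats.mean(a), stats.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (len(a) * stats.pstdev(a) * stats.pstdev(b))

# Simulated data: x2 is nearly 2 * x1, so the predictors are collinear.
random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [2 * v + random.gauss(0, 0.1) for v in x1]
y  = [v + random.gauss(0, 1) for v in x1]

# Red flag for serious multicollinearity: the independent variables are
# more highly correlated with each other than either is with y.
print(corr(x1, x2) > max(corr(x1, y), corr(x2, y)))  # True for this data
```

In practice you would inspect the full correlation matrix of all predictors and the dependent variable, but the comparison above is the heart of the check.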

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Analyzing Subgroups of Data

July 21, 2010

The data available to us has never been more voluminous. Thanks to technology, data about us and our environment are collected almost continuously. When we use a cell phone to call someone else’s cell phone, several pieces of information are collected: the two phone numbers involved in the call; the time the call started and ended; the cell phone towers closest to the two parties; the cell phone carriers; the distance of the call; the date; and many more. Cell phone companies use this information to determine where to increase capacity; refine, price, and promote their plans more effectively; and identify regions with inadequate coverage.

Multiply these different pieces of data by the number of calls in a year, a month, a day – even an hour – and you can easily see that we are dealing with enormous amounts of records and observations. While it’s good for decision makers to see what sales, school enrollment, cell phone usage, or any other pattern looks like in total, quite often they are even more interested in breaking down data into groups to see if certain groups behave differently. Quite often we hear decision makers asking questions like these:

• How do depositors under age 35 compare with those aged 35-54 and those 55 and over in their choice of banking products?
• How will voter support for Candidate A differ by race or ethnicity?
• How does cell phone usage differ between men and women?
• Does the length or severity of a prison sentence differ by race?

When we break data down into subgroups, we are trying to see whether knowing about these groups adds any additional meaningful information. This helps us customize marketing messages, product packages, pricing structures, and sales channels for different segments of our customers. There are many different ways we can break data down: by region, age, race, gender, income, spending levels; the list is limitless.

To give you an example of how data can be analyzed by groups, let’s revisit Jenny Kaplan, owner of K-Jen, the New Orleans-style restaurant. If you recall from the May 25 post, Jenny tested two coupon offers for her \$10 jambalaya entrée: one offering 10% off and another offering \$1 off. Even though the savings was the same, Jenny thought customers would respond differently. As Jenny found, neither offer was better than the other at increasing the average size of the table check. Now, Jenny wants to see if there is a preference for one offer over the other, based on customer age.

Jenny knows that of her 1,000-patron database, about 50% are the ages of 18 to 35; the rest are older than 35. So Jenny decides to send out 1,000 coupons via email as follows:

| | \$1 off | 10% off | Total Coupons |
| --- | --- | --- | --- |
| 18-35 | 250 | 250 | 500 |
| Over 35 | 250 | 250 | 500 |
| Total Coupons | 500 | 500 | 1,000 |

Half of Jenny’s customers received one coupon offer and half received the other. Looking carefully at the table above, half the people in each age group got one offer and the other half got the other offer. At the end of the promotion period, Jenny received back 200 coupons. She tracks the coupon codes back to her database and finds the following pattern:

| Coupons Redeemed (Actual) | \$1 off | 10% off | Total |
| --- | --- | --- | --- |
| 18-35 | 35 | 65 | 100 |
| Over 35 | 55 | 45 | 100 |
| Total | 90 | 110 | 200 |

Exactly 200 coupons were redeemed, 100 from each age group. But notice something else: of the 200 people redeeming the coupon, 110 redeemed the coupon offering 10% off; just 90 redeemed the \$1 off coupon. Does this mean the 10% off coupon was the better offer? Not so fast!

What Else is the Table Telling Us?

Look at each age group. Of the 100 customers aged 18-35, 65 redeemed the 10% off coupon; but of the 100 customers over 35, just 45 did. Is that a meaningful difference or just a fluke? Do persons over 35 prefer an offer of \$1 off to one of 10% off? There’s one way to tell: a chi-squared test for statistical significance.

The Chi-Squared Test

Generally, a chi-squared test is useful in determining associations between categories and observed results. The chi-squared (χ2) statistic is the value needed to determine statistical significance. In order to compute χ2, Jenny needs to know two things: the actual frequency distribution of the coupons redeemed (shown in the table above), and the expected frequencies.

Expected frequencies are the frequencies you would expect each cell of the table to take on, based on probability alone. In this case, we have two equal-sized groups: customers age 18-35 and customers over 35. Knowing nothing besides the fact that the same number of people in each group redeemed coupons, and that 110 of them redeemed the 10% off coupon while 90 redeemed the \$1 off coupon, we would expect 55 customers in each group to redeem the 10% off coupon and 45 in each group to redeem the \$1 off coupon. That is, under the expected frequencies, 55% of the customers in each group redeem the 10% off offer. Jenny’s expected frequencies are:

| Coupons Redeemed (Expected) | \$1 off | 10% off | Total |
| --- | --- | --- | --- |
| 18-35 | 45 | 55 | 100 |
| Over 35 | 45 | 55 | 100 |
| Total | 90 | 110 | 200 |

As you can see, the totals for each row and column match those in the actual frequency table above. The mathematical way to compute the expected frequencies for each cell would be to multiply its corresponding column total by its corresponding row total and then divide it by the total number of observations. So, we would compute as follows:

| Frequency of: | Formula | Result |
| --- | --- | --- |
| 18-35 redeeming \$1 off | (100 × 90)/200 | 45 |
| 18-35 redeeming 10% off | (100 × 110)/200 | 55 |
| Over 35 redeeming \$1 off | (100 × 90)/200 | 45 |
| Over 35 redeeming 10% off | (100 × 110)/200 | 55 |
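The row-total × column-total ÷ grand-total calculation above can be sketched in a few lines of Python, using Jenny’s observed counts:

```python
# Sketch of the expected-frequency calculation for Jenny's 2x2 table.
# Rows: 18-35, Over 35; columns: $1 off, 10% off.
observed = [[35, 65],
            [55, 45]]

row_totals = [sum(row) for row in observed]        # [100, 100]
col_totals = [sum(col) for col in zip(*observed)]  # [90, 110]
grand_total = sum(row_totals)                      # 200

# Expected frequency for each cell = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)  # [[45.0, 55.0], [45.0, 55.0]]
```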

Now that Jenny knows the expected frequencies, she must determine the critical χ2 statistic for significance and then compute the χ2 statistic for her data. If the latter is greater than the critical χ2 statistic, then Jenny knows that the customer’s age group is associated with the coupon offer redeemed.

Determining the Critical χ2 Statistic

To find out what her critical χ2 statistic is, Jenny must first determine the degrees of freedom in her data. For cross-tabulation tables, the number of degrees of freedom is a straightforward calculation:

Degrees of freedom = (# of rows - 1) * (# of columns - 1)

Jenny has two rows and two columns of data, so she has (2-1)*(2-1) = 1 degree of freedom. With this information, Jenny grabs her old college statistics book and looks at the χ2 distribution table in the appendix. At the 95% confidence level with one degree of freedom, her critical χ2 statistic is 3.84. When Jenny calculates the χ2 statistic from her frequencies, she will compare it with the critical χ2 statistic. If Jenny’s χ2 statistic is greater than the critical value, she will conclude that the difference is statistically significant and that age does relate to which coupon offer is redeemed.
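If you don’t have a statistics book handy, the same critical value can be looked up in code. A short sketch, assuming SciPy is available:

```python
# Critical chi-squared value at the 95% confidence level with 1 degree
# of freedom, via SciPy's chi-squared distribution.
from scipy.stats import chi2

critical = chi2.ppf(0.95, df=1)
print(round(critical, 2))  # 3.84
```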

Calculating the χ2 Value From Observed Frequencies

Now, Jenny needs to compare the actual number of coupons redeemed in each cell to the expected number. To compute her χ2 value, Jenny applies the same formula to each cell: subtract the cell’s expected frequency from its actual frequency, square the difference, and divide by the expected frequency. She then sums the results across all four cells to get her χ2 value:

| | \$1 off | 10% off |
| --- | --- | --- |
| 18-35 | (35-45)^2/45 = 2.22 | (65-55)^2/55 = 1.82 |
| Over 35 | (55-45)^2/45 = 2.22 | (45-55)^2/55 = 1.82 |

χ2 = 2.22 + 1.82 + 2.22 + 1.82 = 8.08

Jenny’s χ2 value is 8.08, much higher than the critical 3.84, indicating that there is indeed an association between age and coupon redemption.
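The cell-by-cell calculation can be verified with a few lines of Python:

```python
# Chi-squared statistic from Jenny's observed and expected cell counts,
# listed in order: 18-35 $1 off, 18-35 10% off, Over 35 $1 off, Over 35 10% off.
observed = [35, 65, 55, 45]
expected = [45, 55, 45, 55]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 8.08
```

If SciPy is available, `scipy.stats.chi2_contingency` on the 2×2 table reproduces the same statistic when called with `correction=False` (by default it applies Yates’ continuity correction to 2×2 tables, which yields a smaller value).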

Interpreting the Results

Jenny concludes that patrons over the age of 35 are more inclined than patrons age 18-35 to take advantage of a coupon stating \$1 off; patrons age 18-35 are more inclined to prefer the 10% off coupon. The way Jenny uses this information depends on the objectives of her business. If Jenny feels that K-Jen needs to attract more middle-aged and senior citizens, she should use the \$1 off coupon when targeting them. If Jenny feels K-Jen isn’t selling enough Jambalaya, then she might try to stimulate demand by couponing, sending the \$1 off coupon to patrons over the age of 35 and the 10% off coupon to those 18-35.

Jenny might even have a counterintuitive use for the information. If most of K-Jen’s regular patrons are over age 35, they may already be loyal customers. Jenny might still send them coupons, but give them the 10% off coupon instead. Why? These customers are likely to buy the jambalaya anyway, so why not give them the coupon they are less likely to redeem? After all, why give someone a discount if they’re going to buy anyway! Giving the 10% off coupon to these customers does two things: first, it shows them that K-Jen still cares about their business and keeps them aware of K-Jen as a dining option. Second, by using the coupon with the lower redemption rate, Jenny reduces her exposure to subsidizing loyal customers. In this instance, Jenny uses the coupons for advertising and promoting awareness, rather than moving orders of jambalaya.

There are several more ways to analyze data by subgroup, some of which will be discussed in future posts. It is important to remember that your research objectives dictate the information you collect, which dictate the appropriate analysis to conduct.
