Archive for October, 2010

Forecast Friday Topic: The Linear Probability Model

October 28, 2010

(Twenty-seventh in a series)

Up to now, we have talked about how to build regression models with continuous dependent variables. Such models are intended to answer questions like, “How much can sales increase if we spend $5,000 more on advertising?”; “How much impact does each year of formal education have on salaries in a given occupation?”; or “How much does each $1 per bushel change in the price of wheat affect the number of boxes of cereal that Kashi produces?” Such business questions are quantitative and estimate the impact of independent variables on the dependent variable at a macro level. But what if you wanted to predict a scenario that has only two outcomes?

Consider these business questions: “Is a particular individual likely to respond to a direct marketing solicitation or not?” “Is a family more likely to vote Democrat or Republican?” “Is a particular subscriber likely to renew his/her subscription or let it lapse?” Notice that these business questions pertain to specific individuals, and each has only two possible outcomes. Moreover, these questions are qualitative and involve individual choice and preferences; hence they seek to understand customers at a micro level.

Only in the last 20-30 years has it become easier to develop qualitative choice models. The increased use of surveys, the growth of customer and transactional databases, and improvements in data collection processes have made it more feasible to develop models that predict phenomena with discrete outcomes. Essentially, a qualitative choice model works the same way as a regression model: you have your independent variables and your dependent variable. However, the dependent variable is a dummy variable – it has only two values: 1 if it is a “yes” for a particular outcome, and 0 if it is a “no.” Generally, the number of observations with a dependent variable of 1 is much smaller than the number with a dependent variable of 0. Think about a catalog mailing: the catalog might be sent to a million people, but only one or two percent – 10,000-20,000 – will actually respond.

The Linear Probability Model

The Linear Probability Model (LPM) was one of the first ways analysts developed qualitative choice models. An LPM is simply an OLS regression run with a dichotomous dependent variable. The resulting fitted values are used to assess the probability of an outcome: a score close to 0 means the outcome has a low probability of occurring, while a score close to 1 means the outcome is almost certain to occur. Consider the following example.

Mark Moretti, Circulation Director for the Baywood Bugle, a small town local newspaper, wants to determine how likely a subscriber is to renew his/her subscription to the Bugle. Mark has been concerned that delivery issues are causing subscribers to let their subscriptions lapse, and he also suspects that the Bugle is having a hard time trying to retain relatively new subscribers. So Mark randomly selected a sample of 30 subscribers whose subscriptions recently came in for renewal. Nine of these did let their subscriptions lapse, while the other 21 renewed. Mark also pulled the number of complaints each of these subscribers logged in the last 12 months, as well as their tenure (in years) at the time their subscription came up for renewal.

For his dependent variable, Mark used whether the subscriber renewed: 1 for yes, 0 for no. The number of complaints and the tenure served as the independent variables. Mark’s sample looked like this:

| Subscriber # | Complaints | Subscriber Tenure | Renewed |
|---|---|---|---|
| 1 | 16 | 1 | 0 |
| 2 | 13 | 10 | 1 |
| 3 | 5 | 14 | 1 |
| 4 | 8 | 10 | 1 |
| 5 | 0 | 8 | 1 |
| 6 | 5 | 7 | 1 |
| 7 | 5 | 7 | 1 |
| 8 | 13 | 15 | 1 |
| 9 | 9 | 10 | 1 |
| 10 | 14 | 11 | 1 |
| 11 | 6 | 10 | 1 |
| 12 | 4 | 14 | 1 |
| 13 | 16 | 10 | 0 |
| 14 | 12 | 2 | 0 |
| 15 | 9 | 9 | 1 |
| 16 | 12 | 7 | 1 |
| 17 | 20 | 4 | 0 |
| 18 | 17 | 1 | 0 |
| 19 | 2 | 11 | 1 |
| 20 | 13 | 14 | 1 |
| 21 | 5 | 13 | 1 |
| 22 | 7 | 2 | 0 |
| 23 | 9 | 12 | 1 |
| 24 | 10 | 8 | 0 |
| 25 | 0 | 10 | 1 |
| 26 | 2 | 13 | 1 |
| 27 | 19 | 4 | 0 |
| 28 | 12 | 3 | 0 |
| 29 | 10 | 9 | 1 |
| 30 | 4 | 9 | 1 |

 

Despite the fact that there are only two outcomes for renewed, Mark decides to run OLS regression on these 30 subscribers. The results suggest that each one-year increase in tenure increases a subscriber’s likelihood of renewal by just under seven percentage points, while each additional complaint reduces the subscriber’s likelihood of renewal by just over three percentage points. We would expect these variables to exhibit the relationships they do, since the former is a measure of customer loyalty, the latter of customer dissatisfaction.

Mark also gets an R2=0.689 and an F-statistic of 29.93, suggesting a very good fit.
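For readers who want to replicate Mark’s model, here is a minimal sketch using numpy’s least-squares solver, assuming the data are transcribed correctly from the table above (the variable names are my own, not from the post):

```python
# Mark's linear probability model: plain OLS on a 0/1 dependent variable.
import numpy as np

# (complaints, tenure, renewed) for the 30 sampled subscribers
data = np.array([
    (16, 1, 0), (13, 10, 1), (5, 14, 1), (8, 10, 1), (0, 8, 1),
    (5, 7, 1),  (5, 7, 1),   (13, 15, 1), (9, 10, 1), (14, 11, 1),
    (6, 10, 1), (4, 14, 1),  (16, 10, 0), (12, 2, 0), (9, 9, 1),
    (12, 7, 1), (20, 4, 0),  (17, 1, 0),  (2, 11, 1), (13, 14, 1),
    (5, 13, 1), (7, 2, 0),   (9, 12, 1),  (10, 8, 0), (0, 10, 1),
    (2, 13, 1), (19, 4, 0),  (12, 3, 0),  (10, 9, 1), (4, 9, 1),
], dtype=float)

# Design matrix: intercept, complaints, tenure
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
y = data[:, 2]  # renewed (0/1)

# Ordinary least squares -- this is all the LPM is
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coefs  # "probability" scores, one per subscriber

# R-squared, computed the usual way
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

Because OLS ignores the 0/1 nature of the dependent variable, the fitted values here are ordinary linear predictions that merely get interpreted as probabilities – which is exactly what causes the problems discussed next.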

However, Mark’s model exhibits serious flaws. Among them:

The Error Terms are Non-Normal

In the LPM, the fitted values of the equation represent the probability that Yi=1 for the given values of Xi. The error terms, however, are not normally distributed. Because Y can take only the values 0 and 1, for any given observation the error term can take only one of two values:

If Yi = 0, then 0 = α + β1X1i + β2X2i + εi

so that:

εi = -α – β1X1i – β2X2i

If Yi = 1, then 1 = α + β1X1i + β2X2i + εi

so that:

εi = 1 – α – β1X1i – β2X2i

 

The absence of normally distributed error terms, combined with Mark’s small sample, means that his parameter estimates cannot be trusted. Only if Mark’s sample were much larger would the errors approach a normal distribution.
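To see the two-point error distribution concretely, here is a small sketch with hypothetical coefficients (the values of alpha, b1, and b2 are illustrative, not Mark’s estimates):

```python
# For fixed X values, the LPM error can take exactly two values -- a
# two-point (Bernoulli-type) distribution, not a normal one.
# alpha, b1, b2 are hypothetical coefficients for illustration only.
alpha, b1, b2 = 0.4, -0.03, 0.07
x1, x2 = 5.0, 10.0                    # complaints, tenure for one subscriber

fitted_p = alpha + b1 * x1 + b2 * x2  # the fitted "probability"
error_if_no = 0 - fitted_p            # error when Y = 0
error_if_yes = 1 - fitted_p           # error when Y = 1

# These are the ONLY two error values possible at these X values
possible_errors = {error_if_no, error_if_yes}
```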

The Error Terms are Heteroscedastic!

The residuals do not have a constant variance. With a continuous dependent variable, if two or more observations have the same values for the Xs, their Y values are likely to be close together. When the dependent variable is discrete, however, observations with the same values for the Xs can have a Y value of either 0 or 1. Let’s look at how the residuals in Mark’s model compare to each independent variable:

Visual inspection suggests heteroscedasticity, which makes the parameter estimates in Mark’s model inefficient.
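The heteroscedasticity is not just visual: a known result for the LPM is that the error variance equals p(1 – p), where p is the fitted probability, so the variance necessarily changes with X. A small sketch:

```python
# In the LPM, the error variance at a given X is p*(1 - p), where p is the
# fitted probability -- so it varies with X rather than staying constant.
def lpm_error_variance(p: float) -> float:
    """Variance of a Bernoulli error whose success probability is p."""
    return p * (1 - p)

# The variance peaks at p = 0.5 and shrinks toward the extremes, so
# observations with different fitted probabilities have different variances.
variances = [lpm_error_variance(p) for p in (0.1, 0.3, 0.5)]
```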

Unacceptable Values for Ŷ!

The dependent variable can have only two outcomes: 0 or 1. Because the model is intended to deliver a probability score for each observation, its predicted values should fall between 0 and 1. However, look at the following predicted probabilities the LPM calculated for nine of the thirty subscribers:

| Subscriber # | Predicted Renewal |
|---|---|
| 1 | -0.034 |
| 3 | 1.204 |
| 8 | 1.031 |
| 12 | 1.234 |
| 18 | -0.064 |
| 19 | 1.086 |
| 21 | 1.135 |
| 25 | 1.077 |
| 26 | 1.225 |

As the table shows, subscribers #1 and #18 have predicted probabilities of less than 0, while the other seven have predicted probabilities in excess of 1. In actuality, subscribers 1 and 18 did not renew while the other seven did, so the predictions point in the right direction; the probabilities themselves, however, are unrealistic. In this case, 30% of the values fall outside the 0-to-1 region, so the model can probably be constrained by capping predicted values that fall outside the region to just barely within it.
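One simple way to implement the capping described above is to clip out-of-range scores to just inside the unit interval. The epsilon (and the use of numpy) below are illustrative choices, not prescribed by the post:

```python
# Cap LPM scores that fall outside [0, 1] to just inside the interval.
import numpy as np

# The nine out-of-range predicted renewals from the table above
scores = np.array([-0.034, 1.204, 1.031, 1.234, -0.064,
                   1.086, 1.135, 1.077, 1.225])

eps = 0.001  # an arbitrary small margin
capped = np.clip(scores, eps, 1 - eps)
```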

R2 is Useless

Another problem with Mark’s model is that R2, despite its high value, cannot be relied upon. Only a few data points lie close to the fitted regression line, as shown by the charts of the independent variables below:

 

This example is an exception: most LPMs generate very low R2 values for the very reason depicted in these charts. Hence R2 is generally disregarded in models with qualitative dependent variables.

So Why Do We Use Linear Probability Models?

Before statistical packages became widely available, the LPM was one of the only ways analysts could model qualitative dependent variables. Moreover, as an approximation, LPMs were not terribly far from the more appropriate qualitative choice modeling approaches like logistic regression. And despite both their misuse and their inferiority to those approaches, LPMs are easy to explain and conceptualize.

Next Forecast Friday Topic: Logistic Regression

Logistic regression is the more appropriate tool to use in situations like this one. Next week I will walk you through the concepts of logistic regression and illustrate a simple, one-variable model. You will see how the logistic – or logit – model is used to compute a more accurate estimate of the likelihood of an outcome occurring. You will also discover how the logistic regression model provides three values that are simply three different ways of expressing the same thing. That’s next week.
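As a preview, the key idea is that the logistic function maps any real-valued linear predictor into the open interval (0, 1), so out-of-range probabilities like the LPM’s cannot occur. A minimal sketch:

```python
# The logistic (logit) model passes the linear predictor through the
# logistic function, so every predicted probability lies strictly
# between 0 and 1 by construction.
import math

def logistic(z: float) -> float:
    """Map any real-valued linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Even extreme linear predictors stay inside (0, 1):
probs = [logistic(z) for z in (-10, -1, 0, 1, 10)]
```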

 

*************************

Help us Reach 200 Fans on Facebook by Tomorrow!

Thanks to all of you, Analysights now has 175 fans on Facebook! Can you help us get up to 200 fans by tomorrow? If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

Read All About It: Why Newspapers Need Marketing Analytics

October 26, 2010

After nearly 20 years, I decided to let my subscription to the Wall Street Journal lapse. A few months ago, I did likewise with my longtime subscription to the Chicago Tribune. I didn’t want to end my subscriptions, but as a customer, I felt my voice wasn’t being heard.

Some marketing research and predictive modeling might have enabled the Journal and the Tribune to keep me from defecting. From these efforts, both publications could have spotted my increasing frustration and dissatisfaction and intervened before I chose to vote with my feet.

Long story short, I let both subscriptions lapse for the same reason: chronic unreliable delivery, which was allowed to fester for many years despite numerous calls by me to their customer service numbers about missing and late deliveries.

Marketing Research

Both newspapers could have used marketing research to alert them to the likelihood that I would not renew my subscriptions. They each had lots of primary research readily available to them, without needing to do any surveys: my frequent calls to their customer service department, with the same complaint.

Imagine the wealth of insights both papers could have reaped from this data: they could determine the most common breaches of customer service; by looking at the number of times customers complained about the same issue, they could determine where problems were left unresolved; by breaking down the most frequent complaints by geography, they could determine whether additional delivery persons needed to be hired, or if more training was necessary; and most of all, both newspapers could have also found their most frequent complainers, and reached out to them to see what could be improved.

Both newspapers could have also conducted regular customer satisfaction surveys of their subscribers, asking about overall satisfaction and likelihood of renewing, followed by questions about subscribers’ perceptions about delivery service, quality of reporting, etc. The surveys could have helped the Journal and the Tribune grab the low-hanging fruit by identifying the key elements of service delivery that have the strongest impact on subscriber satisfaction and likelihood of renewal, and then coming up with a strategy to secure satisfaction with those elements.

Predictive Modeling

Another way both newspapers might have been able to intervene and retain my business would have been to predict my likelihood of lapse. This so-called attrition or “churn” modeling is common in industries whose customers are continuity-focused: newspapers and magazines, credit cards, membership associations, health clubs, banks, wireless communications, and broadband cable to name a few.

Attrition modeling (which, incidentally, will be discussed in the next two upcoming Forecast Friday posts) involves developing statistical models comparing attributes and characteristics of current customers with those of former, or churned, customers. The dependent variable being measured is whether a customer churned, so it would be a 1 if “yes” and a 0 if “no.”

Essentially, in building the model, the newspapers would look at several independent, or predictor, variables: customer demographics (e.g., age, income, gender), frequency of complaints, and geography, to name a few. The model would then identify the variables that are the strongest predictors of whether a subscriber will not renew, and generate a score between 0 and 1 indicating each subscriber’s probability of not renewing. For example, a probability score of .72 indicates a 72% chance that a subscriber will let his/her subscription lapse, and that the newspaper may want to intervene.
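As a sketch of how such scores might be acted on (the subscriber IDs, scores, and threshold below are all hypothetical, invented for illustration):

```python
# Flag subscribers whose predicted lapse probability crosses a chosen
# intervention threshold. IDs, scores, and threshold are hypothetical.
churn_scores = {
    "sub_001": 0.72,
    "sub_002": 0.15,
    "sub_003": 0.55,
    "sub_004": 0.81,
}
threshold = 0.50  # intervene when lapse probability exceeds 50%

# Subscribers worth a retention call, in a stable order
at_risk = sorted(sid for sid, p in churn_scores.items() if p > threshold)
```

In practice the threshold would be tuned to the cost of an intervention versus the value of a saved subscription, rather than fixed at 50%.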

In my case, both newspapers might have run such an attrition model to see if number of complaints in the last 12 months was a strong predictor of whether a subscriber would lapse. If that were the case, I would have a high probability of churn, and they could then call me; or, if they found that subscribers who churned were clustered in a particular area, they might be able to look for systemic breakdowns in customer service in that area. Either way, both papers could have found a way to salvage the subscriber relationship.


Forecast Friday: Where We Go From Here

October 21, 2010

(Twenty-sixth in a series)

Today’s Forecast Friday post won’t be a formal instruction. Rather, I sat down and looked at most of the posts I have written on forecasting since I began the Forecast Friday series six months ago, and realized that while I have covered many forecasting topics in depth, I haven’t really given you any insight into when the series will end. With that in mind, I have mapped out the topics the Forecast Friday posts will address over the next few months. Here is the schedule:

| Post # | Date | Forecast Friday Topic |
|---|---|---|
| | | Qualitative Choice Models |
| 27 | 10/28/2010 | Linear Probability Model |
| 28 | 11/04/2010 | Logistic Regression |
| | | Simultaneous Equations and Two-Stage Least Squares Regression |
| 29 | 11/11/2010 | The Identification Problem |
| 30 | 11/18/2010 | Structural and Reduced Forms |
| | 11/25/2010 | Thanksgiving Day – No Post |
| 31 | 12/02/2010 | Two-Stage Least Squares Regression |
| | | Leading Indicators and Expectations |
| 32 | 12/09/2010 | Leading Economic Indicators and Surveys of Expectations |
| 33 | 12/16/2010 | Calendar Effects in Forecasting |
| | | Holiday Break |
| | 12/23/2010 | No posts |
| | 12/30/2010 | No posts |
| | | ARIMA Models |
| 34 | 01/06/2011 | The Autocorrelation Function |
| 35 | 01/13/2011 | Stationarity of Time Series Data |
| 36 | 01/20/2011 | MA, AR, and ARMA Models |
| 37 | 01/27/2011 | ARIMA Models |
| 38 | 02/03/2011 | |
| 39 | 02/10/2011 | |
| | | Judgmental Methods in Forecasting |
| 40 | 02/17/2011 | Judgmental Extrapolation |
| 41 | 02/24/2011 | Expert Judgment |
| 42 | 03/03/2011 | Delphi Method |
| 43 | 03/10/2011 | Other Judgmental Forecasting Methods |
| 44 | 03/17/2011 | Judgmental Bias in Forecasting |
| | | Combining and Evaluating Forecasts |
| 45 | 03/24/2011 | Procedures for Combining Forecasts |
| 46 | 03/31/2011 | Effectiveness of Combining Forecasts – Empirical Evidence |
| 47 | 04/07/2011 | Evaluating Forecasts – Part I |
| 48 | 04/14/2011 | Evaluating Forecasts – Part II |

 

After post 48, Forecast Friday will continue, but the focus will shift to more practical and applied topics. In the near future, I will be soliciting your feedback on topics you’d like covered in Forecast Friday. As always, you are welcome to post comments on this blog about topics in which you’re interested. Insight Central is for your benefit, and is intended to help you use analytics for your company’s strategic advantage. Thank you for your loyal readership of Forecast Friday over these last six months!

C-Sat Surveys Can Cause Intra-Organizational Conflict

October 20, 2010

I’ve grown somewhat leery of customer satisfaction surveys in recent years.  While I still believe they can add highly useful information for a company to make improvements to the customer experience, I am also convinced that many companies aren’t doing said research properly.

My reservations aside, regardless of whether a company is doing C-Sat research properly, customer satisfaction surveys can also cause intra-organizational friction and conflict.  Because of the ways departments are incentivized and compensated, some will benefit more than others.  Moreover, because many companies either don’t  link their desired financial and operational outcomes – or don’t link them well enough – to the survey, many departments can claim that the research isn’t working.  C-Sat research is fraught with inter-departmental conflict because companies are conducting it with vague objectives and rewarding – or punishing – departments for their ability or inability to meet those vague objectives.

The key to reducing the conflict caused by C-Sat surveys is to have all affected departments share in framing the objectives.  Before the survey is even designed, all parties should have an idea of what is going to be measured – whether it is repeat business, reduced complaints, shorter customer waiting times – and what they will all be accountable for.  Stakeholders should also work together to see how – or if – they can link the survey’s results to financial and operational performance.  And the stakeholders should be provided information, training, and guidelines to aid their managerial actions in response to the survey’s results.

Survey Question Dos and Don’ts Redux

October 19, 2010

This past summer, I published a series of posts for Insight Central about effective questionnaire design.  It cannot be stressed enough that survey questions must be carefully thought out in order to obtain information you can act on.  In this month’s issue of Quirk’s Marketing Research Review, Brett Plummer of HSM Group, Ltd. reiterates many of the points made in my earlier posts.

Plummer’s article (you’ll need to enter the code 20101008 in the Article ID blank) provides a series of dos and don’ts when writing survey questions. I’ll summarize them here:

Do:

  1. Keep your research objectives in mind;
  2. Consider the best type of question to ask for each question;
  3. Think about how you’re going to analyze your data;
  4. Make sure all valid response options are included; and
  5. Consider where you place each question within your survey.

Don’t:

  1. Create confusing or vague questions;
  2. Forget to ensure that the response options to questions are appropriate, thorough, and not overlapping;
  3. Ask leading questions; and
  4. Ask redundant questions.

Plummer does a good job of reminding us of the importance of these guidelines, and points out that effective survey questions are the key to obtaining the highest quantity and quality of actionable information, and thus to maximizing an organization’s research investment.