(Twenty-seventh in a series)
Up to now, we have talked about how to build regression models with continuous dependent variables. Such models are intended to answer questions like, “How much can sales increase if we spend $5,000 more on advertising?”; or “How much impact does each year of formal education have on salaries in a given occupation?”; or “How much does each $1 per bushel change in the price of wheat affect the number of boxes of cereal that Kashi produces?” Such business questions are quantitative and estimate the impact of independent variables on the dependent variable at a macro level. But what if you wanted to predict a scenario that has only two outcomes?
Consider these business questions: “Is a particular individual more likely to respond to a direct marketing solicitation or not?” “Is a family more likely to vote Democrat or Republican?” “Is a particular subscriber likely to renew his/her subscription or let it lapse?” Notice that these business questions pertain to specific individuals, and each has only one of two outcomes. Moreover, these questions are qualitative and involve individual choice and preferences; hence they seek to understand customers at a micro level.
Only in the last 20-30 years has it become easier to develop qualitative choice models. The increased use of surveys and of customer and transactional databases, together with improvements in data collection processes, has made it more feasible to develop models that predict phenomena with discrete outcomes. Essentially, a qualitative choice model works the same way as a regression model. You have your independent variables and your dependent variable. However, the dependent variable is a dummy variable with only two outcomes: 1 if it is a “yes” for a particular outcome, and 0 if it is a “no.” Generally, the number of observations with a dependent variable of 1 is much smaller than the number whose dependent variable is 0. Think about a catalog mailing. The catalog might be sent to a million people, but only one or two percent (10,000-20,000) will actually respond.
The Linear Probability Model
The Linear Probability Model (LPM) was one of the first ways analysts began to develop qualitative choice models. LPM consisted of running OLS regression, only with a dichotomous dependent variable. Generally, the scores that would result would be used to assess the probability of an outcome. A score close to 0 would mean an outcome has a low probability of occurring, while a score close to 1 would mean the outcome is almost certain to occur. Consider the following example.
Mark Moretti, Circulation Director for the Baywood Bugle, a small town local newspaper, wants to determine how likely a subscriber is to renew his/her subscription to the Bugle. Mark has been concerned that delivery issues are causing subscribers to let their subscriptions lapse, and he also suspects that the Bugle is having a hard time trying to retain relatively new subscribers. So Mark randomly selected a sample of 30 subscribers whose subscriptions recently came in for renewal. Nine of these did let their subscriptions lapse, while the other 21 renewed. Mark also pulled the number of complaints each of these subscribers logged in the last 12 months, as well as their tenure (in years) at the time their subscription came up for renewal.
For his dependent variable, Mark used whether the subscriber renewed: 1 for yes, 0 for no. The number of complaints and the tenure served as the independent variables. Mark’s sample looked like this:
Subscriber # | Complaints | Subscriber Tenure (years) | Renewed
1 | 16 | 1 | 0
2 | 13 | 10 | 1
3 | 5 | 14 | 1
4 | 8 | 10 | 1
5 | 0 | 8 | 1
6 | 5 | 7 | 1
7 | 5 | 7 | 1
8 | 13 | 15 | 1
9 | 9 | 10 | 1
10 | 14 | 11 | 1
11 | 6 | 10 | 1
12 | 4 | 14 | 1
13 | 16 | 10 | 0
14 | 12 | 2 | 0
15 | 9 | 9 | 1
16 | 12 | 7 | 1
17 | 20 | 4 | 0
18 | 17 | 1 | 0
19 | 2 | 11 | 1
20 | 13 | 14 | 1
21 | 5 | 13 | 1
22 | 7 | 2 | 0
23 | 9 | 12 | 1
24 | 10 | 8 | 0
25 | 0 | 10 | 1
26 | 2 | 13 | 1
27 | 19 | 4 | 0
28 | 12 | 3 | 0
29 | 10 | 9 | 1
30 | 4 | 9 | 1
Despite the fact that there are only two outcomes for renewed, Mark decides to run OLS regression on these 30 subscribers. He gets the following results:
The estimated coefficients suggest that each one-year increase in tenure raises a subscriber’s probability of renewal by just under seven percentage points, while each additional complaint lowers it by just over three percentage points. We would expect these variables to exhibit the relationships they do, since the former is a measure of customer loyalty, the latter of customer dissatisfaction.
Mark also gets an R^{2}=0.689 and an F-statistic of 29.93, suggesting a very good fit.
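For readers who want to re-run Mark’s regression, here is a minimal sketch in plain Python that fits the LPM to the 30 subscribers in the table above. It solves the two-variable OLS normal equations directly. The function name `fit_lpm` and the variable names are illustrative choices, not anything Mark used. If the table has been transcribed correctly, the fit should land close to the R^{2} and F-statistic quoted above.

```python
# Mark's sample: (complaints, tenure in years, renewed) for subscribers 1-30
DATA = [
    (16, 1, 0), (13, 10, 1), (5, 14, 1), (8, 10, 1), (0, 8, 1),
    (5, 7, 1), (5, 7, 1), (13, 15, 1), (9, 10, 1), (14, 11, 1),
    (6, 10, 1), (4, 14, 1), (16, 10, 0), (12, 2, 0), (9, 9, 1),
    (12, 7, 1), (20, 4, 0), (17, 1, 0), (2, 11, 1), (13, 14, 1),
    (5, 13, 1), (7, 2, 0), (9, 12, 1), (10, 8, 0), (0, 10, 1),
    (2, 13, 1), (19, 4, 0), (12, 3, 0), (10, 9, 1), (4, 9, 1),
]


def fit_lpm(rows):
    """OLS of renewed on complaints and tenure (hypothetical helper).

    Solves the normal equations in deviation-from-mean form and returns
    (intercept, b_complaints, b_tenure, r_squared, f_statistic).
    """
    n = len(rows)
    mc = sum(c for c, _, _ in rows) / n
    mt = sum(t for _, t, _ in rows) / n
    my = sum(y for _, _, y in rows) / n

    # Sums of squares and cross-products around the means
    scc = sum((c - mc) ** 2 for c, _, _ in rows)
    stt = sum((t - mt) ** 2 for _, t, _ in rows)
    sct = sum((c - mc) * (t - mt) for c, t, _ in rows)
    scy = sum((c - mc) * (y - my) for c, _, y in rows)
    sty = sum((t - mt) * (y - my) for _, t, y in rows)

    det = scc * stt - sct ** 2
    b_c = (scy * stt - sct * sty) / det
    b_t = (scc * sty - sct * scy) / det
    a = my - b_c * mc - b_t * mt

    sst = sum((y - my) ** 2 for _, _, y in rows)   # total sum of squares
    ssr = b_c * scy + b_t * sty                    # explained sum of squares
    r2 = ssr / sst
    f_stat = (ssr / 2) / ((sst - ssr) / (n - 3))   # 2 regressors, n - 3 residual df
    return a, b_c, b_t, r2, f_stat


intercept, b_complaints, b_tenure, r2, f_stat = fit_lpm(DATA)
print(f"intercept    = {intercept:.3f}")
print(f"b_complaints = {b_complaints:.3f}")
print(f"b_tenure     = {b_tenure:.3f}")
print(f"R^2 = {r2:.3f}, F = {f_stat:.2f}")
```

The slope on tenure and the slope on complaints are the two effects described above; the intercept is the fitted score for a brand-new subscriber with no complaints.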
However, Mark’s model exhibits serious flaws. Among them:
The Error Terms are Non-Normal
In the LPM, the fitted values of the equation represent the probability that Y_{i}=1 for the given values of X_{i}. The error terms, however, are not normally distributed. Because Y can only be 0 or 1, each error term can take only two possible values for a given set of X values, so the errors follow a binomial (two-point) distribution:
If Y_{i} = 0, then 0 = α + β_{1}X_{1i} + β_{2}X_{2i} + ε_{i}
such that:
ε_{i} = -α - β_{1}X_{1i} - β_{2}X_{2i}
If Y_{i} = 1, then 1 = α + β_{1}X_{1i} + β_{2}X_{2i} + ε_{i}
such that:
ε_{i} = 1 - α - β_{1}X_{1i} - β_{2}X_{2i}
The absence of normally distributed error terms, combined with Mark’s small sample, means that the usual significance tests on his parameter estimates cannot be trusted. If Mark’s sample were much larger, the sampling distribution of the estimates would approach a normal distribution, and this problem would matter less.
The Error Terms are Heteroscedastic!
The residuals do not have a constant variance. With a continuous dependent variable, if two or more observations have the same value for X, it’s likely that their Y values won’t be too far apart. However, when the dependent variable is discrete, observations with the same value for an X can have a Y value of either 0 or 1. Let’s look at how the residuals from Mark’s model compare to each independent variable:
Visual inspection suggests heteroscedasticity, which makes the parameter estimates in Mark’s model inefficient.
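The same conclusion can be reached algebraically. As a sketch, write p_{i} = α + β_{1}X_{1i} + β_{2}X_{2i} for the probability that Y_{i}=1 (the symbol p_{i} is shorthand introduced here, and the derivation assumes the LPM is the correct model for that probability). From the two error values shown earlier, ε_{i} equals 1 - p_{i} with probability p_{i} and -p_{i} with probability 1 - p_{i}, so:

```latex
\begin{aligned}
E(\varepsilon_i) &= p_i(1 - p_i) + (1 - p_i)(-p_i) = 0, \\
\operatorname{Var}(\varepsilon_i) &= p_i(1 - p_i)^2 + (1 - p_i)\,p_i^2 = p_i(1 - p_i).
\end{aligned}
```

Because p_{i} depends on the X values, the error variance differs from one observation to the next: it is largest when p_{i} is near 0.5 and shrinks toward zero as p_{i} approaches 0 or 1.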
Unacceptable Values for Ŷ!
The dependent variable can only have two outcomes: 0 or 1. Because it is intended to deliver a probability score for each observation, values for a probability can only be between 0 and 1. However, look at the following predicted probabilities the LPM calculated for nine of the thirty subscribers:
Subscriber # | Predicted Renewal
1 | -0.034
3 | 1.204
8 | 1.031
12 | 1.234
18 | -0.064
19 | 1.086
21 | 1.135
25 | 1.077
26 | 1.225
As the table shows, subscribers #1 and #18 have predicted probabilities below 0, and the other seven have predicted probabilities above 1. In fact, subscribers 1 and 18 did not renew while the other seven did, so the predictions point in the right direction. As probabilities, however, these values are impossible. In this case, only 30% of the fitted values fall outside the 0-to-1 range, so the model can probably be constrained by capping predictions that fall outside the range at values just barely inside it.
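The capping idea can be sketched in a few lines of Python. The scores below are the nine out-of-range predictions from the table above. The margin `EPS = 0.001` and the helper name `cap` are assumptions made for illustration, since the post does not say how far inside the range the capped values should sit.

```python
# Out-of-range LPM scores from the table above: subscriber number -> predicted renewal
predicted = {
    1: -0.034, 3: 1.204, 8: 1.031, 12: 1.234, 18: -0.064,
    19: 1.086, 21: 1.135, 25: 1.077, 26: 1.225,
}

EPS = 0.001  # assumed margin: how far inside (0, 1) a capped score should sit


def cap(score, eps=EPS):
    """Pull a score into the interval [eps, 1 - eps] (hypothetical helper)."""
    return min(max(score, eps), 1.0 - eps)


capped = {subscriber: cap(score) for subscriber, score in predicted.items()}

for subscriber, score in predicted.items():
    print(f"subscriber {subscriber:>2}: {score:>7.3f} -> {capped[subscriber]:.3f}")
```

Capping makes every score a valid probability, but it does not fix the model itself: every subscriber whose score was above 1 ends up with the same capped value, so the ranking among them is lost.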
R^{2} is Useless
Another problem with Mark’s model is that R^{2}, despite its high value, cannot be relied upon. Only a few data points lie close to the fitted regression line, as shown by the charts of the independent variables below:
This example being an exception, most LPMs generate very low R^{2} values for the very reason depicted in these charts. Hence R^{2} is generally disregarded in models with qualitative dependent variables.
So Why Do We Use Linear Probability Models?
Before statistical packages were widely available, the LPM was one of the only ways analysts could model qualitative dependent variables. Moreover, as an approximation, LPMs were not terribly far from the more appropriate qualitative choice modeling approaches like logistic regression. And despite both their misuse and their inferiority to the more appropriate approaches, LPMs are easy to explain and conceptualize.
Next Forecast Friday Topic: Logistic Regression
Logistic regression is the more appropriate tool to use in situations like this one. Next week I will walk you through the concepts of logistic regression and illustrate a simple, one-variable model. You will understand how the logistic (or logit) model is used to compute a more accurate estimate of the likelihood of an outcome occurring. You will also discover how the logistic regression model provides three values that are simply three different ways of expressing the same thing. That’s next week.