Posts Tagged ‘regression analysis’

Forecast Friday Topic: Structural and Reduced Forms

November 18, 2010

(Thirtieth in a series)

Last week, we discussed the identification problem – a common occurrence in forecasting when consistent estimators of the parameters of the equation or model in which we are interested don’t exist. We also discussed how identifying variables unique to one equation but not to the other is the first step to alleviating the identification problem.

Today, we will briefly discuss the next step in solving the identification problem: structural and reduced forms. Because the math can get a little complicated, we won’t be focusing on it here. Like last week, this week’s – and next week’s – post will be theoretical in nature.

Structural and reduced forms get their origin from matrix algebra and involve systems of equations. Indeed, the equations contained within a system are called structural equations because, together, they are developed to explain the hypothesized structure of a given market. Structural equations are based on economic theory and are used to derive the reduced form equations for two-stage least squares regression.

To derive the reduced form equations, each endogenous variable is placed by itself on the left side of an equation, with only exogenous variables (and error terms) on the right. You must have one reduced form equation for each endogenous variable present in the system. So, if your system of equations has five endogenous variables, then you must have five reduced form equations.

The process for reducing the form of the structural equations follows that of solving for a system of linear equations:

1. Set one equation equal to another;
2. Subtract the endogenous parameter term (estimate times variable) and error term from each side of equation;
3. Factor both sides;
4. Divide to solve for the endogenous variable. This gives you the first reduced form equation.
5. Find the next reduced form equation by substituting the right side of the first reduced form equation into one of the original structural equations.
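The steps above can be sketched for a two-equation market. This is only an illustration: the demand and supply equations, the coefficient names (`a0`, `a1`, `a2`, `b0`, `b1`, `b2`), and the numeric values are assumptions made up for the example, not part of this series.

```python
def reduced_form(a0, a1, a2, b0, b1, b2, income, rainfall, u=0.0, v=0.0):
    """Solve a hypothetical two-equation market for its endogenous variables.

    Demand: Q = a0 + a1*P + a2*income   + u
    Supply: Q = b0 + b1*P + b2*rainfall + v
    P and Q are endogenous; income and rainfall are exogenous.
    """
    # Steps 1-4: set demand equal to supply, collect the P terms, then divide.
    price = (a0 - b0 + a2 * income - b2 * rainfall + u - v) / (b1 - a1)
    # Step 5: substitute the solved price into one structural equation (supply).
    quantity = b0 + b1 * price + b2 * rainfall + v
    return price, quantity


# Illustrative coefficients only; they are not estimates from real data.
p, q = reduced_form(a0=10.0, a1=-1.0, a2=0.5, b0=2.0, b1=1.0, b2=1.5,
                    income=4.0, rainfall=2.0)
print(f"price={p:.3f}, quantity={q:.3f}")

# The solved values should satisfy the demand equation as well.
demand_quantity = 10.0 + (-1.0) * p + 0.5 * 4.0
```

Both outputs depend only on the exogenous inputs, which is what makes the reduced form usable for conditional forecasts.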

Essentially, it’s best to think of endogenous variables as dependent variables and of exogenous variables as independent variables; this way, you get the result of the reduced form having precisely the same format as multiple regression models. Given assumptions about future values of exogenous variables, the reduced form can facilitate computation of conditional forecasts of future values of the endogenous variables.

Forecast Friday Resumes Two Weeks From Today

Forecast Friday will not be published next Thursday, in observance of Thanksgiving. We here at Analysights are very thankful for readers like you who check in every week, and look forward to your continued visits to Insight Central. Our Forecast Friday post will resume two weeks from today, December 2, when we will conclude our discussion of simultaneous equations with a post on Two-Stage Least Squares regression analysis. We here at Analysights wish you and your family a Happy Thanksgiving.

*************************

Be Sure to Follow us on Facebook and Twitter !

Thanks to all of you, Analysights now has nearly 200 fans on Facebook … and we’d love more! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you and they will be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

Forecast Friday Topic: The Identification Problem

November 11, 2010

(Twenty-ninth in a series)

When we work with regression analysis, it is assumed that outside factors determine each of the independent variables in the model; these factors are said to be exogenous to the system. This is especially of interest to economists, who have long used econometric models to forecast demand and supply for various goods. The price the market will bear for a good or service, for example, is not determined by a single equation, but by the interaction of the equations for both supply and demand. If price were what we were trying to forecast, then a single equation would do us little good. In fact, since price is part of a multi-equation system, performing regression analysis for just demand without supply, or vice versa, will result in biased parameter estimates.

This post begins our three-part “series within a series” on “Simultaneous Equations and Two-Stage Least Squares Regression”. Although this topic sounds intimidating, I will not be covering it in much technical detail. My purpose in discussing it is to make you aware of these concepts, so that you can determine when to look beyond a simple regression analysis.

Hence, we start with the most basic concept of simultaneous equations: the identification problem. Let’s assume that you are the supply chain manager for a beer company. You need to forecast the price of barley so that your company can budget how much money it needs to spend in order to have enough barley to produce its beer; determine whether the price is on an upward trend, so that it can purchase derivatives to hedge its risk; and determine the final price for its beer.

You have statistics for the price and traded quantity of barley for the last several years. You also remember three concepts from your college economics class:

1. The price and quantity supplied of a good have a direct relationship – producers supply more as the price goes up and less as the price goes down;
2. The price and quantity demanded of a good have an inverse relationship – consumers purchase less as the price goes up and vice-versa; and
3. The market price is determined by the interaction of the supply and demand equations.

Since price and quantity are positively related for supply and negatively related for demand, every observed price-quantity pair is a point where the two curves cross. With only the two variables of quantity and price, you cannot determine (that is, identify) the supply and demand equations using regression analysis; the information is insufficient. However, if you can find variables that are in one equation and not the other, you will be able to identify the individual relations.

In agriculture, the supply of a crop is greatly affected by weather. If you can obtain information on the amount of rainfall in barley producing regions during the years for which you have data, you might be able to identify the different equations. Moreover, production costs impact supply. So if you can obtain information on the costs of planting and harvesting the barley, that too would help. On the demand side, barley’s quantity can be influenced by changes in tastes. If beer demand goes up, so too will the demand for barley; if farm animal raising increases, farmers may need to purchase more barley for animal fodder; and various health fads may emerge, increasing the demands for barley breads and soups. If you can obtain these kinds of information, you are on your way to identifying the supply and demand curves.
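A small simulation can make the point concrete. The sketch below assumes a made-up barley market (all coefficients are invented for illustration) in which rainfall shifts supply but does not appear in the demand equation. Regressing quantity on price alone does not recover the demand slope, while using the excluded rainfall variable does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical structural model (coefficients invented for illustration):
#   Demand: Q = 10 - 1.0*P + u
#   Supply: Q =  2 + 1.0*P + 1.5*R + v   (R = rainfall, excluded from demand)
rain = rng.normal(size=n)
u = rng.normal(size=n)
v = rng.normal(size=n)

# Market-clearing price and quantity implied by the two equations.
price = (10.0 - 2.0 - 1.5 * rain + u - v) / 2.0
quantity = 10.0 - 1.0 * price + u


def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]


# Naive OLS slope of quantity on price mixes supply and demand movements.
ols_slope = cov(quantity, price) / np.var(price, ddof=1)
# Rainfall moves only the supply curve, so it traces out the demand curve.
rain_based_slope = cov(quantity, rain) / cov(price, rain)

print("true demand slope: -1.000")
print(f"OLS slope:         {ols_slope:.3f}")
print(f"rainfall-based:    {rain_based_slope:.3f}")
```

The rainfall-based ratio is a simple stand-in for the estimation techniques covered later in this miniseries; it is shown only to illustrate why an excluded variable helps.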

Exogenous and Endogenous Variables

Since rainfall affects the supply of barley, but the barley market does not influence the amount of rainfall, rainfall is said to be an exogenous variable, because its value is determined by factors outside of the equation system. Since the demand for beer helps derive the demand for barley, but not the other way around, beer demand is an exogenous variable.

Because price and quantity of barley are part of a demand and supply system, they are determined by the interaction of the two equations – that is by the equation system – so they are said to be endogenous variables.

Identifying an Equation

If you are trying to identify an equation that is part of a multi-equation system, the number of variables excluded from that equation (but included elsewhere in the system) must be at least one less than the number of equations in the system. Hence, in a two-equation system, at least one variable that appears in the other equation must be excluded from the equation you’re trying to identify; in a three-equation system, at least two variables must be excluded from it; and so on.

When the number of exogenous variables excluded from an equation, but present in the other equation(s), is exactly the number required, your equation is just identified. You can use several econometric techniques to estimate just identified systems; however, they are quite rare in practice. When no exogenous variables are unique to one equation in the system, your equations are under identified and cannot be estimated with any econometric technique. Most often, equations are over identified, because more exogenous variables are excluded from one equation than the number of equations in the system requires. When an equation is over identified, two-stage least squares (the topic of the third post of this miniseries) is required in order to tell which of the variables is causing your supply (or demand) curve to shift along the fixed demand (or supply) curve.
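The counting rule described above can be written as a short check. The function name and its arguments are hypothetical; it encodes only the rule as stated in this post (excluded variables compared with the number of equations minus one).

```python
def identification_status(n_equations: int, n_excluded: int) -> str:
    """Classify one equation of a system using the counting rule above.

    n_equations: number of equations in the system
    n_excluded:  number of the system's variables left out of this equation
    """
    required = n_equations - 1
    if n_excluded < required:
        return "under identified"
    if n_excluded == required:
        return "just identified"
    return "over identified"


# Two-equation supply/demand system, looking at the demand equation:
print(identification_status(2, 0))  # no supply-only variables are available
print(identification_status(2, 1))  # rainfall only
print(identification_status(2, 2))  # rainfall and production costs
```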

Next Forecast Friday Topic: Structural and Reduced Forms

Next week’s Forecast Friday topic builds on today’s topic with a discussion of structural and reduced forms of equations. These are the first steps in Two-Stage Least Squares Regression analysis, and are part of the effort to solve the identification problem.


Forecast Friday Topic: The Logistic Regression Model

November 4, 2010

(Twenty-eighth in a series)

Sometimes analysts need to forecast the likelihood of discrete outcomes: the probability that this outcome will occur or that outcome will occur. Last week, we discussed the linear probability model (LPM) as one solution. Essentially, the LPM looked at the two discrete outcomes: a 1 if the outcome occurred and a 0 if it did not. We treated that binary variable as the dependent variable and then ran it as if it were ordinary least squares. In our example, we got a pretty good result. However, LPM came up short in many different ways: values that fell outside the 0 to 1 range, dubious R² values, heteroscedasticity, and non-normal error terms.

One of the more popular improvements on the LPM is the logistic regression model, sometimes referred to as the “logit” model. Logit models are very useful in consumer choice modeling. While the theory is quite complex, today we will introduce you to basic concepts of the logit model, using a simple regression model.

Probabilities, Odds Ratios, and Logits, Oh My!

The first thing to understand about logistic regression is the mathematics of the model. There are three calculations that you need to know: the probability, the odds ratio, and the logits. While these are different values, they are three ways of expressing the very same thing: the likelihood of an outcome occurring.

Let’s start with the easiest of the three: probability. Probability is the likelihood that a particular outcome will happen. It is a number between 0 and 1. A probability of .75 means that an outcome has a 75% chance of occurring. In a logit model, the probability of an observation having a “success” outcome (Y=1) is denoted as $P_i$. Since $P_i$ is a number between 0 and 1, the probability of a “failure” outcome (Y=0) is $1 - P_i$. If $P_i(Y=1) = 0.40$, then $P_i(Y=0) = 1.00 - 0.40 = 0.60$.

The odds ratio, then, is the ratio of the probability of a success to the probability of failure:

$$\text{Odds}_i = \frac{P_i}{1 - P_i}$$

Hence, in the above example, the odds ratio is (0.40/0.60) = 0.667.

Logits, denoted as $L_i$, are the natural log of the odds ratio:

$$L_i = \ln\left(\frac{P_i}{1 - P_i}\right)$$

Hence the logit in our example here is ln(0.667) = -0.405.

A logistic regression model’s equation generates Ŷ values in the form of logits for each observation. The logits are equal to the terms of a regression equation:

$$L_i = \alpha + \beta X_i + \varepsilon_i$$

Once an observation’s logit is generated, you must take its antilog to derive its odds ratio, and then you must use simple algebra to compute the probability.
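As a quick check on the arithmetic, the three quantities can be converted back and forth in a few lines. The function names here are just illustrative helpers.

```python
import math


def odds(p: float) -> float:
    """Odds ratio as defined above: P / (1 - P)."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Natural log of the odds ratio."""
    return math.log(odds(p))


def probability_from_logit(l: float) -> float:
    """Antilog of the logit gives the odds; odds / (1 + odds) gives P."""
    o = math.exp(l)
    return o / (1.0 + o)


p = 0.40
print(f"odds  = {odds(p):.3f}")
print(f"logit = {logit(p):.3f}")
print(f"back to probability = {probability_from_logit(logit(p)):.2f}")
```

Running this with P = 0.40 reproduces the 0.667 and -0.405 values worked out above.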

Estimate the Logit Model Using Weighted Least Squares Regression

Because data with a logistic distribution are not linear, linear regression is often not appropriate for modeling; however, if data can be grouped, ordinary least squares regression can be used to estimate the logits. This example will use a simple one-variable model to approximate logits. Multiple logistic regression is beyond the scope of this post, and should only be used with a statistical package.

We use weighted least squares (WLS) techniques for approximating the logits. You will recall that we discussed WLS in our Forecast Friday post on heteroscedasticity. In a logistic distribution, error terms are heteroscedastic, making WLS an appropriate tool to employ. The steps in this process are:

1. Group the independent variable, each group being its own $X_i$.
2. Note the sample size, $N_i$, of each group, and count the number of successes, $n_i$, in each.
3. Compute the relative probabilities for each $X_i$: $P_i = n_i / N_i$.
4. Use WLS to transform the model with weights, $w_i = \sqrt{N_i P_i (1 - P_i)}$.
5. Perform OLS on the weighted, or transformed, model: $L_i^* = \alpha w_i + \beta X_i^* + \varepsilon_i^*$. Here $L_i^*$ is computed by multiplying $L_i$ by the weight, $w_i$. Likewise, $X_i^*$ is computed by multiplying the original $X_i$ value by the weight, and similarly for the error term.
6. Take the antilog of the logits to estimate probabilities for each group.

Predicting Churn in Wireless Telephony

Marissa Martinelli is director of customer retention for Cheapo Wireless, a low-end cell phone provider. Cheapo’s target market is subprime households, which generally have incomes below \$50,000 per year and no bank accounts. As its customers’ incomes rise, churn rates for low-end cell phones increase greatly. Cheapo has developed a new cell phone plan that caters to higher income customers, so that it can migrate its existing customers to the new plan as their incomes rise. In order to promote the new plan, Marissa must first identify the customers most at risk of churning.

Marissa takes a random sample of 1,365 current and former Cheapo cell phone customers and looks at their churn rates. She has their incomes from the applications and credit checks they completed when they first applied for wireless service. She breaks the customers into 19 groups by income, covering \$0 to \$50,000 in \$2,500 increments. For simplicity, Marissa expresses each group’s income level in units of \$10,000. The lowest income group, 0.50, is all customers whose incomes are \$5,000 or less; the next group, 0.75, is those with incomes between \$5,000 and \$7,500; and so on. Marissa notes the number of churned customers ($n_i$) and the total number of customers ($N_i$) for each income level:

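| # Churned ($n_i$) | Group Size ($N_i$) | Income Level in \$10Ks ($X_i$) |
|---|---|---|
| 1 | 20 | 0.50 |
| 2 | 30 | 0.75 |
| 3 | 30 | 1.00 |
| 5 | 40 | 1.25 |
| 6 | 40 | 1.50 |
| 8 | 50 | 1.75 |
| 9 | 50 | 2.00 |
| 12 | 60 | 2.25 |
| 17 | 80 | 2.50 |
| 22 | 80 | 2.75 |
| 35 | 100 | 3.00 |
| 40 | 100 | 3.25 |
| 75 | 150 | 3.50 |
| 70 | 125 | 3.75 |
| 62 | 100 | 4.00 |
| 62 | 90 | 4.25 |
| 64 | 90 | 4.50 |
| 51 | 70 | 4.75 |
| 50 | 60 | 5.00 |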

As the table shows, of the 60 customers whose income is between \$47,500 and \$50,000, 50 of them have churned. Knowing this information, Marissa can now compute the conditional probabilities of churn (Y=1) for each income group:

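| # Churned ($n_i$) | Group Size ($N_i$) | Income Level in \$10Ks ($X_i$) | Probability of Churn ($P_i$) | Probability of Retention ($1 - P_i$) |
|---|---|---|---|---|
| 1 | 20 | 0.50 | 0.050 | 0.950 |
| 2 | 30 | 0.75 | 0.067 | 0.933 |
| 3 | 30 | 1.00 | 0.100 | 0.900 |
| 5 | 40 | 1.25 | 0.125 | 0.875 |
| 6 | 40 | 1.50 | 0.150 | 0.850 |
| 8 | 50 | 1.75 | 0.160 | 0.840 |
| 9 | 50 | 2.00 | 0.180 | 0.820 |
| 12 | 60 | 2.25 | 0.200 | 0.800 |
| 17 | 80 | 2.50 | 0.213 | 0.788 |
| 22 | 80 | 2.75 | 0.275 | 0.725 |
| 35 | 100 | 3.00 | 0.350 | 0.650 |
| 40 | 100 | 3.25 | 0.400 | 0.600 |
| 75 | 150 | 3.50 | 0.500 | 0.500 |
| 70 | 125 | 3.75 | 0.560 | 0.440 |
| 62 | 100 | 4.00 | 0.620 | 0.380 |
| 62 | 90 | 4.25 | 0.689 | 0.311 |
| 64 | 90 | 4.50 | 0.711 | 0.289 |
| 51 | 70 | 4.75 | 0.729 | 0.271 |
| 50 | 60 | 5.00 | 0.833 | 0.167 |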

Marissa then goes on to derive the weights for each income level:

| $P_i(1 - P_i)$ | $P_i/(1 - P_i)$ | Logits ($L_i$) | $N_i P_i(1 - P_i)$ | Weights ($w_i$) |
|---|---|---|---|---|
| 0.048 | 0.053 | -2.944 | 0.950 | 0.975 |
| 0.062 | 0.071 | -2.639 | 1.867 | 1.366 |
| 0.090 | 0.111 | -2.197 | 2.700 | 1.643 |
| 0.109 | 0.143 | -1.946 | 4.375 | 2.092 |
| 0.128 | 0.176 | -1.735 | 5.100 | 2.258 |
| 0.134 | 0.190 | -1.658 | 6.720 | 2.592 |
| 0.148 | 0.220 | -1.516 | 7.380 | 2.717 |
| 0.160 | 0.250 | -1.386 | 9.600 | 3.098 |
| 0.167 | 0.270 | -1.310 | 13.388 | 3.659 |
| 0.199 | 0.379 | -0.969 | 15.950 | 3.994 |
| 0.228 | 0.538 | -0.619 | 22.750 | 4.770 |
| 0.240 | 0.667 | -0.405 | 24.000 | 4.899 |
| 0.250 | 1.000 | 0.000 | 37.500 | 6.124 |
| 0.246 | 1.273 | 0.241 | 30.800 | 5.550 |
| 0.236 | 1.632 | 0.490 | 23.560 | 4.854 |
| 0.214 | 2.214 | 0.795 | 19.289 | 4.392 |
| 0.205 | 2.462 | 0.901 | 18.489 | 4.300 |
| 0.198 | 2.684 | 0.987 | 13.843 | 3.721 |
| 0.139 | 5.000 | 1.609 | 8.333 | 2.887 |

Now, Marissa must transform the logits and the independent variable (Income level) by multiplying them by their respective weights:

| Income Level in \$10Ks ($X_i$) | Logits ($L_i$) | Weights ($w_i$) | Weighted Income ($X_i^*$) | Weighted Logits ($L_i^*$) |
|---|---|---|---|---|
| 0.50 | -2.944 | 0.975 | 0.487 | -2.870 |
| 0.75 | -2.639 | 1.366 | 1.025 | -3.606 |
| 1.00 | -2.197 | 1.643 | 1.643 | -3.610 |
| 1.25 | -1.946 | 2.092 | 2.615 | -4.070 |
| 1.50 | -1.735 | 2.258 | 3.387 | -3.917 |
| 1.75 | -1.658 | 2.592 | 4.537 | -4.299 |
| 2.00 | -1.516 | 2.717 | 5.433 | -4.119 |
| 2.25 | -1.386 | 3.098 | 6.971 | -4.295 |
| 2.50 | -1.310 | 3.659 | 9.147 | -4.793 |
| 2.75 | -0.969 | 3.994 | 10.983 | -3.872 |
| 3.00 | -0.619 | 4.770 | 14.309 | -2.953 |
| 3.25 | -0.405 | 4.899 | 15.922 | -1.986 |
| 3.50 | 0.000 | 6.124 | 21.433 | 0.000 |
| 3.75 | 0.241 | 5.550 | 20.812 | 1.338 |
| 4.00 | 0.490 | 4.854 | 19.415 | 2.376 |
| 4.25 | 0.795 | 4.392 | 18.666 | 3.491 |
| 4.50 | 0.901 | 4.300 | 19.349 | 3.873 |
| 4.75 | 0.987 | 3.721 | 17.673 | 3.674 |
| 5.00 | 1.609 | 2.887 | 14.434 | 4.646 |
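The columns in these tables can be reproduced directly from the churn counts and group sizes. The following is a minimal sketch using NumPy; the variable names are illustrative, and the weight formula is the square root of $N_i P_i (1 - P_i)$, which is what the weights column above works out to.

```python
import numpy as np

# Counts and group sizes from Marissa's sample, lowest income group first.
churned = np.array([1, 2, 3, 5, 6, 8, 9, 12, 17, 22,
                    35, 40, 75, 70, 62, 62, 64, 51, 50])
group_size = np.array([20, 30, 30, 40, 40, 50, 50, 60, 80, 80,
                       100, 100, 150, 125, 100, 90, 90, 70, 60])
income = np.linspace(0.50, 5.00, 19)  # income level in $10Ks

p = churned / group_size                    # probability of churn
logit = np.log(p / (1.0 - p))               # natural log of the odds ratio
weight = np.sqrt(group_size * p * (1 - p))  # WLS weight for each group
weighted_income = weight * income           # X*
weighted_logit = weight * logit             # L*

for row in zip(income, logit, weight, weighted_income, weighted_logit):
    print("{:5.2f} {:7.3f} {:6.3f} {:7.3f} {:7.3f}".format(*row))
```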

Now, Marissa can run OLS on the transformed model, using Weights (wi) and Weighted Income (X*) as independent variables and the Weighted Logits (L*) as the dependent variable.

Marissa derives the following regression equation:

$$\hat{L}_i^* = -3.487\,w_i + 0.979\,X_i^*$$

Interpreting the Model

As expected, weighted income has a positive relationship with the likelihood of churn. However, her sample is just 19 observations, so Marissa must be very careful about drawing too strong an inference from these results. While R² is a strong 0.981, it too must not be relied upon. In fact, it is pretty meaningless in a logit model. Also, notice that there is no intercept term in this model. You will recall that when using WLS to correct for heteroscedasticity, the intercept was lost in the transformed model and became the coefficient on the weight variable. The coefficient on weighted income is equivalent to the slope in an unadjusted regression model, since heteroscedasticity doesn’t bias parameter estimates.

Calculating Probabilities

Now Marissa needs to use this model to assess current customers’ likelihood of churning. Let’s say she sees a customer who makes \$9,500 a year. That customer would be in the income group 1.00. What is that customer’s probability of churning? Marissa takes the weight, 1.643, for her $w_i$ and the weighted $X_i^*$ (also 1.643), and plugs them into her equation, which gives a weighted logit of:

$$\hat{L}_i^* = -4.121$$

Because this fitted value is still multiplied by the weight, Marissa first divides by $w_i$ to recover the logit itself: $\hat{L}_i = -4.121 / 1.643 = -2.508$.

To get to the probability, Marissa must take the antilog of this logit, which will give her the odds ratio:

$$e^{-2.508} \approx 0.081$$

Now Marissa calculates this customer’s probability of churning:

$$\hat{P}_i = \frac{0.081}{1 + 0.081} \approx 0.075$$

So, a customer earning \$9,500 per year has about a 7.5 percent chance of churning. Had the customer been earning \$46,000 (income group 4.75), the same steps would give a logit of about 1.16 and roughly a 76 percent chance of churning!
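The whole calculation can be sketched end to end with NumPy. This is an illustration under stated assumptions rather than Marissa's actual workflow: it fits the no-intercept regression of the weighted logits on the weights and weighted income with `numpy.linalg.lstsq`, and it converts a fitted value to a probability on the unweighted logit scale, which is the same as dividing the fitted weighted logit by the weight. The helper name `churn_probability` is hypothetical.

```python
import numpy as np

churned = np.array([1, 2, 3, 5, 6, 8, 9, 12, 17, 22,
                    35, 40, 75, 70, 62, 62, 64, 51, 50])
group_size = np.array([20, 30, 30, 40, 40, 50, 50, 60, 80, 80,
                       100, 100, 150, 125, 100, 90, 90, 70, 60])
income = np.linspace(0.50, 5.00, 19)  # income level in $10Ks

p = churned / group_size
logit = np.log(p / (1.0 - p))
weight = np.sqrt(group_size * p * (1 - p))

# Transformed model: L* = alpha * w + beta * X*, with no separate intercept.
design = np.column_stack([weight, weight * income])
coef, *_ = np.linalg.lstsq(design, weight * logit, rcond=None)
alpha, beta = coef


def churn_probability(income_level: float) -> float:
    """Logit = alpha + beta * X, then antilog and convert odds to probability."""
    fitted_logit = alpha + beta * income_level
    odds = np.exp(fitted_logit)
    return odds / (1.0 + odds)


print(f"coefficient on weight (alpha):          {alpha:.3f}")
print(f"coefficient on weighted income (beta):  {beta:.3f}")
print(f"P(churn) for income group 1.00: {churn_probability(1.00):.3f}")
print(f"P(churn) for income group 4.75: {churn_probability(4.75):.3f}")
```

A useful sanity check is to compare these fitted probabilities with the observed churn rates for the same groups in the table of conditional probabilities (0.100 and 0.729).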

There are equivalents of R² that are used for logistic regression, but that discussion is beyond the scope of this post. Today’s post was to give you a primer on the theory of logistic regression.

Next Forecast Friday Topic: The Identification Problem

We have just concluded our discussions on qualitative choice models. Next week, we begin our three-part miniseries on Simultaneous Equations and Two-Stage Least Squares Regression. The first post will discuss the Identification problem.


Forecast Friday Topic: The Linear Probability Model

October 28, 2010

(Twenty-seventh in a series)

Up to now, we have talked about how to build regression models with continuous dependent variables. Such models are intended to answer questions like, “How much can sales increase if we spend \$5,000 more on advertising?”; or “How much impact does each year of formal education have on salaries in a given occupation?”; or “How much does each \$1 per bushel change in the price of wheat affect the number of boxes of cereal that Kashi produces?” Such business questions are quantitative and estimate impact of independent variables on the dependent variable at a macro level. But what if you wanted to predict a scenario that has only two outcomes?

Consider these business questions: “Is a particular individual more likely to respond to a direct marketing solicitation or not?” “Is a family more likely to vote Democrat or Republican?” “Is a particular subscriber likely to renew his/her subscription or let it lapse?” Notice that these business questions pertain to specific individuals, and there is only one of two outcomes. Moreover, these questions are qualitative and involve individual choice and preferences; hence they seek to understand customers at a micro level.

Only in the last 20-30 years has it become easier to develop qualitative choice models. The increased use of surveys as well as customer and transactional databases, and improvements in data collection processes, have made it more feasible to develop models to predict phenomena with discrete outcomes. Essentially, a qualitative choice model works the same way as a regression model. You have your independent variables and your dependent variable. However, the dependent variable is a dummy variable that has only two outcomes: 1 if it is a “yes” for a particular outcome, and 0 if it is a “no.” Generally, the number of observations with a dependent variable of 1 is much smaller than the number whose dependent variable is 0. Think about a catalog mailing. The catalog might be sent to a million people, but only one or two percent (10,000-20,000) will actually respond.

The Linear Probability Model

The Linear Probability Model (LPM) was one of the first ways analysts began to develop qualitative choice models. LPM consisted of running OLS regression, only with a dichotomous dependent variable. Generally, the scores that would result would be used to assess the probability of an outcome. A score close to 0 would mean an outcome has a low probability of occurring, while a score close to 1 would mean the outcome is almost certain to occur. Consider the following example.

Mark Moretti, Circulation Director for the Baywood Bugle, a small town local newspaper, wants to determine how likely a subscriber is to renew his/her subscription to the Bugle. Mark has been concerned that delivery issues are causing subscribers to let their subscriptions lapse, and he also suspects that the Bugle is having a hard time trying to retain relatively new subscribers. So Mark randomly selected a sample of 30 subscribers whose subscriptions recently came in for renewal. Nine of these did let their subscriptions lapse, while the other 21 renewed. Mark also pulled the number of complaints each of these subscribers logged in the last 12 months, as well as their tenure (in years) at the time their subscription came up for renewal.

For his dependent variable, Mark used whether the subscriber renewed: 1 for yes, 0 for no. The number of complaints and the tenure served as the independent variables. Mark’s sample looked like this:

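| Subscriber # | Complaints | Subscriber Tenure | Renewed |
|---|---|---|---|
| 1 | 16 | 1 | 0 |
| 2 | 13 | 10 | 1 |
| 3 | 5 | 14 | 1 |
| 4 | 8 | 10 | 1 |
| 5 | 0 | 8 | 1 |
| 6 | 5 | 7 | 1 |
| 7 | 5 | 7 | 1 |
| 8 | 13 | 15 | 1 |
| 9 | 9 | 10 | 1 |
| 10 | 14 | 11 | 1 |
| 11 | 6 | 10 | 1 |
| 12 | 4 | 14 | 1 |
| 13 | 16 | 10 | 0 |
| 14 | 12 | 2 | 0 |
| 15 | 9 | 9 | 1 |
| 16 | 12 | 7 | 1 |
| 17 | 20 | 4 | 0 |
| 18 | 17 | 1 | 0 |
| 19 | 2 | 11 | 1 |
| 20 | 13 | 14 | 1 |
| 21 | 5 | 13 | 1 |
| 22 | 7 | 2 | 0 |
| 23 | 9 | 12 | 1 |
| 24 | 10 | 8 | 0 |
| 25 | 0 | 10 | 1 |
| 26 | 2 | 13 | 1 |
| 27 | 19 | 4 | 0 |
| 28 | 12 | 3 | 0 |
| 29 | 10 | 9 | 1 |
| 30 | 4 | 9 | 1 |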

Despite the fact that there are only two outcomes for renewed, Mark decides to run OLS regression on these 30 subscribers. He gets the following results:

$$\widehat{\text{Renewed}} = 0.381 - 0.0303\,(\text{Complaints}) + 0.0696\,(\text{Tenure})$$

This suggests that each one-year increase in tenure increases a subscriber’s likelihood of renewal by just under seven percentage points, while each additional complaint reduces the subscriber’s likelihood of renewal by just over three percentage points. We would expect these variables to exhibit the relationships they do, since the former is a measure of customer loyalty, the latter of customer dissatisfaction.

Mark also gets an R² of 0.689 and an F-statistic of 29.93, suggesting a very good fit.
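Mark's regression can be reproduced from the 30 rows in the table. The sketch below uses NumPy's least-squares solver as a stand-in for whatever package Mark used; the variable names are illustrative.

```python
import numpy as np

complaints = np.array([16, 13, 5, 8, 0, 5, 5, 13, 9, 14, 6, 4, 16, 12, 9,
                       12, 20, 17, 2, 13, 5, 7, 9, 10, 0, 2, 19, 12, 10, 4])
tenure = np.array([1, 10, 14, 10, 8, 7, 7, 15, 10, 11, 10, 14, 10, 2, 9,
                   7, 4, 1, 11, 14, 13, 2, 12, 8, 10, 13, 4, 3, 9, 9])
renewed = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
                    1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1])

# OLS with an intercept: renewed = a + b1*complaints + b2*tenure + error.
X = np.column_stack([np.ones(len(renewed)), complaints, tenure])
coef, *_ = np.linalg.lstsq(X, renewed, rcond=None)
fitted = X @ coef

# R-squared from the residual and total sums of squares.
ss_res = np.sum((renewed - fitted) ** 2)
ss_tot = np.sum((renewed - renewed.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Subscriber numbers (1-based) whose fitted "probability" leaves [0, 1].
outside = (np.flatnonzero((fitted < 0) | (fitted > 1)) + 1).tolist()

print("intercept, complaints, tenure:", np.round(coef, 4))
print(f"R-squared: {r_squared:.3f}")
print("fitted values outside [0, 1] for subscribers:", outside)
```

The last line previews one of the model's flaws: some fitted values are not valid probabilities.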

However, Mark’s model exhibits serious flaws. Among them:

The Error Terms are Non-Normal

The LPM shows that the fitted values of the equation represent the probability that Yi=1 for the given values Xi. The error terms, however, are not normally distributed. Because there are only two possible outcomes, the error terms are binomially distributed, because Y can only be 0 and 1:

If Yi=0, then 0=α + β1X1i + β2X2i + εi

such that :

εi = -α – β1X1i – β2X2i

If Yi=1, then 1=α + β1X1i + β2X2i + εi

such that :

εi = 1 -α – β1X1i – β2X2i

The absence of normally distributed error terms, combined with Mark’s small sample, means that his parameter estimates cannot be trusted. If Mark’s sample were much larger, then the error would approach a normal distribution.

The Error Terms are Heteroscedastic!

The residuals do not have a constant variance. With a continuous dependent variable, if two or more observations have the same value for X, it’s likely that their Y values won’t be too far apart. However, when the dependent variable is discrete, we will find that observations with the same values for an X can have a Y value of either 0 or 1. Let’s look at how the residuals in Mark’s model compare to each independent variable. Visual inspection of the residuals plotted against complaints and against tenure suggests heteroscedasticity, which makes the parameter estimates in Mark’s model inefficient.

Unacceptable Values for Ŷ!

The dependent variable can only have two outcomes: 0 or 1. Because it is intended to deliver a probability score for each observation, values for a probability can only be between 0 and 1. However, look at the following predicted probabilities the LPM calculated for nine of the thirty subscribers:

| Subscriber # | Predicted Renewal |
|---|---|
| 1 | -0.034 |
| 3 | 1.204 |
| 8 | 1.031 |
| 12 | 1.234 |
| 18 | -0.064 |
| 19 | 1.086 |
| 21 | 1.135 |
| 25 | 1.077 |
| 26 | 1.225 |

As the table shows, subscribers #1 and #18 have predicted probabilities of less than 0 and the other seven have predicted probabilities in excess of 1. In actuality, subscribers 1 and 18 did not renew while the other seven did, so these results were not inaccurate. However, their probabilities are unrealistic. In this case, only 30% of the predicted values fall outside of the 0 to 1 range, so the model can probably be constrained by capping predictions that fall outside the range to values just barely within it.
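Capping is a one-line operation in most tools. Here is a minimal sketch with NumPy; the bounds of 0.001 and 0.999 are an arbitrary choice for "just barely within" the range, not values taken from Mark's analysis.

```python
import numpy as np

# Out-of-range predictions for subscribers 1, 3, 8, 12, 18, 19, 21, 25, 26.
predicted = np.array([-0.034, 1.204, 1.031, 1.234, -0.064,
                      1.086, 1.135, 1.077, 1.225])

# Assumed bounds: pull every prediction to just inside the 0 to 1 range.
lower, upper = 0.001, 0.999
capped = np.clip(predicted, lower, upper)

print(capped)
```

Capping keeps the scores usable as probabilities, but it does nothing about the other flaws discussed in this post.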

R² is Useless

Another problem with Mark’s model is that R², despite its high value, cannot be relied upon. When renewal is charted against each independent variable, only a few data points lie close to the fitted regression line. This example being an exception, most LPMs generate very low R² values for this very reason. Hence R² is generally disregarded in models with qualitative dependent variables.

So Why Do We Use Linear Probability Models?

Before many statistical packages were used, LPM was one of the only ways analysts could model qualitative dependent variables. Moreover, from an approximation standpoint, LPMs were not terribly far away from the more appropriate qualitative choice modeling approaches like logistic regression. And despite both their misuse and their inferiority to the more appropriate approaches, LPMs are easy to explain and conceptualize.

Next Forecast Friday Topic: Logistic Regression

Logistic regression is the more appropriate tool to use in situations like this one. Next week I will walk you through the concepts of logistic regression, and illustrate a simple, one-variable model. You will understand how the logistic (or logit) model is used to compute a more accurate estimate of the likelihood of an outcome occurring. You will also discover how the logistic regression model provides three values that are simply three different ways of expressing the same thing. That’s next week.


Forecast Friday Topic: Selecting the Variables for a Regression

October 14, 2010

(Twenty-fifth in a series)

When it comes to building a regression model, for many companies there’s good news and bad news. The good news: there’s plenty of independent variables from which to choose. The bad news: there’s plenty of independent variables from which to choose! While it may be possible to run a regression with all possible independent variables, each one included in your model reduces your degrees of freedom and causes the model to overfit the data on which the model is built, resulting in less reliable forecasts when new data is introduced.

So how do you come up with your short list of independent variables?

Some analysts have tried plotting the dependent variable (Y) against individual independent variables (Xi) and selecting a variable if there’s some noticeable relationship. Another tried method is to produce a correlation matrix of all the independent variables and, if a large correlation between two of them is discovered, drop one from consideration (so as to avoid multicollinearity). Still another approach has been to perform a multiple linear regression on all possible explanatory variables and then drop those whose t values are insignificant. These approaches are often selected because they are quick and simple, but they are not reliable for coming up with a decent regression model.

Stepwise Regression

Other approaches are a bit more complex, but more reliable. Perhaps the most common of these approaches is stepwise regression. Stepwise regression works by first identifying the independent variable with the highest correlation with the dependent variable. Once that variable is identified, a one-variable regression model is run. The residuals of that model are then obtained. Recall from previous Forecast Friday posts that if an important variable is omitted from a regression model, its effect on the dependent variable gets factored into the residuals. Hence, the next step in a stepwise regression is to identify the one unselected independent variable with the highest correlation with the residuals. Now you have your second independent variable, and you run a two-variable regression model. You then look at the residuals to that model and select the independent variable with the highest correlation to them, and so forth. Repeat the process until no more variables can be added into the model.
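The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not how any particular statistical package implements stepwise regression: the function name is hypothetical, the data are synthetic, and the stopping rule (a minimum absolute correlation of 0.2 with the residuals) is an assumption, since the description above does not specify one.

```python
import numpy as np


def residual_stepwise(X, y, min_abs_corr=0.2):
    """Add variables one at a time by their correlation with the residuals."""
    n, k = X.shape
    selected = []
    # Before any variable is chosen, the residuals are just y minus its mean,
    # so the first pick is the variable most correlated with y itself.
    residuals = y - y.mean()
    while len(selected) < k:
        remaining = [j for j in range(k) if j not in selected]
        corrs = {j: abs(np.corrcoef(X[:, j], residuals)[0, 1]) for j in remaining}
        best = max(corrs, key=corrs.get)
        if corrs[best] < min_abs_corr:  # assumed stopping rule
            break
        selected.append(best)
        # Refit with all selected variables and recompute the residuals.
        design = np.column_stack([np.ones(n), X[:, selected]])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
    return selected


# Synthetic example: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=500)

chosen = residual_stepwise(X, y)
print("selected columns, in order:", chosen)
```

Commercial packages typically use significance tests to decide when to add or remove a variable, so their results can differ from this sketch.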

Many statistical analysis packages do stepwise regression seamlessly. However, stepwise regression is not guaranteed to produce the optimal set of variables for your model.

Other Approaches

Other approaches to variable selection include best subsets regression, which involves taking various subsets of the available independent variables and running models with them, choosing the subset with the best R². Many statistical software packages have the capability of helping determine the various subsets to choose from. Principal components analysis of all the variables is another approach, but it is beyond the scope of this discussion.

Despite systematic techniques like stepwise regression, variable selection in regression models is as much an art as a science. Whatever variables you select for your model should have a valid rationale for being there.

Next Forecast Friday Topic: I haven’t decided yet!

Let me surprise you. In the meantime, have a great weekend and be well!