## Archive for the ‘Statistics’ Category

### Forecast Friday Topic: Procedures for Combining Forecasts

March 24, 2011

(Forty-first in a series)

We have gone through a series of different forecasting approaches over the last several months. Many times, companies will have multiple forecasts generated for the same item, usually generated by different people across the enterprise, often using different methodologies, assumptions, and data collection processes, and typically for different business problems. Rarely is one forecasting method or forecast superior to another, especially over time. Hence, many companies will opt to combine the forecasts they generate into a composite forecast.

Considerable empirical evidence suggests that combining forecasts works very well in practice. If all the forecasts generated by the alternative approaches are unbiased, then that lack of bias carries over into the composite forecast, a desirable outcome to have.

Two common procedures for combining forecasts are simple averaging and assigning weights in inverse proportion to each method's sum of squared errors. We will discuss both procedures in this post.

Simple Average

The quickest, easiest way to combine forecasts is to simply take the forecasts generated by each method and average them. With a simple average, each forecasting method is given equal weight. So, if you are presented with the following five forecasts:

You’ll get the average of \$83,000 as your composite forecast.
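As a quick sketch, suppose the five forecasts were \$50,000, \$60,000, \$85,000, \$100,000, and \$120,000 (hypothetical values, since the original table is not reproduced here; they are chosen only to match the range and average being described):

```python
# Five hypothetical forecasts for the same item (illustrative values only)
forecasts = [50_000, 60_000, 85_000, 100_000, 120_000]

# Simple average: every forecasting method gets equal weight
composite = sum(forecasts) / len(forecasts)
print(composite)  # 83000.0
```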

The simplicity and speed of this procedure are its main advantages. Its chief drawback is that any knowledge that individual methods consistently perform better or worse than others is disregarded in the combination. Moreover, look at the wide variation in the forecasts above: they range from \$50,000 to \$120,000. Clearly, one or more of these methods' forecasts will be way off. While combining forecasts can dampen the impact of forecast error, outliers can easily skew the composite forecast. If you suspect that one or more forecasts may be inferior to the others, you may simply choose to exclude them and apply simple averaging only to the forecasts in which you have a reasonable degree of confidence.

Assigning Weights in (Inverse) Proportion to Sum of Squared Errors

If you know the past performance of the individual forecasting methods available to you, and you need to combine multiple forecasts, you will likely want to assign greater weights to the methods that have performed best. You will also want the weighting scheme to adapt over time, since the relative performance of forecasting methods can change. One way to do that is to assign a weight to each forecast in inverse proportion to its sum of squared forecast errors.

Let’s assume you have 12 months of actual sales data (Xt) and three forecasting methods, each generating a forecast for each month (f1t, f2t, and f3t). Each of those three methods has also generated a forecast for month 13, which you are trying to predict. The table below shows the 12 months of actuals and forecasts, along with each method’s forecast for month 13:

How much weight do you give each forecast? Calculate the sum of squared errors for each method:

To get the weight for one forecast method, divide the sum of the other two methods’ squared errors by the total sum of squared errors for all three methods, and then divide by 2 (the three methods minus 1). You must then do the same for the other two methods so that the weights sum to 1. Hence, the weights are as follows:

Notice that the higher weights are given to the forecast methods with the lowest sum of squared error. So, since each method generated a forecast for month 13, our composite forecast would be:

Hence, we would estimate approximately 795 as our composite forecast for month 13. When we obtain month 13’s actual sales, we would repeat this process using the sums of squared errors for months 1-13 for each individual forecast, reassign the weights, and then apply them to each method’s forecast for month 14. Also, notice the fraction ½ at the beginning of each weight equation. Its denominator depends on the number of weights being generated; here we are generating three weights, so the denominator is (3-1)=2. Had we used four methods, the fraction in each weight equation would have been one-third; and with only two methods there would be no fraction at all, since (2-1)=1.
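This weighting scheme can be sketched in a few lines of Python. The sums of squared errors and month-13 forecasts below are hypothetical placeholders, not the figures from the table:

```python
# Hypothetical sums of squared errors for three forecasting methods
sse = [1_000.0, 2_000.0, 4_000.0]
m = len(sse)
total = sum(sse)

# Weight for each method: the other methods' share of total SSE,
# scaled by 1/(m - 1) so that the weights sum to 1
weights = [(total - s) / ((m - 1) * total) for s in sse]

# Hypothetical month-13 forecasts from the three methods
f13 = [800.0, 790.0, 810.0]
composite = sum(w * f for w, f in zip(weights, f13))
```

The method with the smallest sum of squared errors receives the largest weight, and once a new month's actual is observed, the `sse` values (and hence the weights) are simply recomputed.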

Regression-Based Weights – Another Procedure

Another way to assign weights would be with regression, but that’s beyond the scope of this post. While the weighting approach above is simple, it’s also ad hoc. Regression-based weights can be much more theoretically correct. However, in most cases, you will not have many months of forecasts for estimating regression parameters. Also, you run the risk of autocorrelated errors, most certainly for forecasts beyond one step ahead. More information on regression-based weights can be found in Newbold & Bos, Introductory Business & Economic Forecasting, Second Edition, pp. 504-508.

Next Forecast Friday Topic: Effectiveness of Combining Forecasts

Next week, we’ll take a look at the effectiveness of combining forecasts, with a look at the empirical evidence that has been accumulated.

********************************************************

For the latest insights on marketing research, predictive modeling, and forecasting, be sure to check out Analysights on Facebook and Twitter! “Like-ing” us on Facebook and following us on Twitter will allow you to stay informed of each new Insight Central post published, new information about analytics, discussions Analysights will be hosting, and other opportunities for feedback. So check us out on Facebook and Twitter!

### Forecast Friday Topic: Calendar Effects in Forecasting

December 16, 2010

(Thirty-third in a series)

It is a common practice to compare a particular point in time to its equivalent one or two years ago. Companies often report their earnings and revenues for the first quarter of this year with respect to the first quarter of last year to see if there’s been any improvement or deterioration since then. Retailers want to know if December 2010 sales were higher or lower than December 2009 and even December 2008 sales. Sometimes, businesses want to see how sales compared for October, November, and December. While these approaches seem straightforward, the way the calendar falls can create misleading comparisons and faulty forecasts.

Every four years, February has 29 days instead of the usual 28. That extra day can cause problems in forecasting February sales. In some years Easter falls in April, in others in March, which can cause forecasting nightmares for confectioners, greeting card manufacturers, and retailers alike. And in some years a given month might have five Fridays and/or Saturdays, rather than four. If your business’s sales are much higher on weekends, these differences can generate significant forecast error.

Some months have as many as 31 days, others 30, while February has 28 or 29. Because this variation in the calendar can cause variation in the time series, it is necessary to make adjustments. If you do not adjust for variation in the length of the month, the effect can show up as a seasonal effect, which may not cause serious forecast errors but will certainly make it difficult to interpret any seasonal patterns. You can easily adjust for month length:

Wt = Xt × (365.25 ÷ 12) ÷ (number of days in month t)

Where Wt is the weighted value of your dependent variable for that month. Hence, if you had sales of \$100,000 in February and \$110,000 in March, you would first compute the numerator: there are 365.25 days in an average year (the .25 accounts for leap years), and dividing by 12 gives 30.44. Divide that by the number of days in each month to get each month’s adjustment factor. For February, 30.44 divided by 28 gives an adjustment factor of about 1.09; for March, 30.44 divided by 31 gives about .98. Then multiply each month’s sales by its factor. Hence, your weighted sales would be approximately \$109,000 for February and \$108,000 for March. Although sales appear higher in March than in February, once you adjust for month length you find that the two months were actually about the same in terms of volume.
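A minimal sketch of this month-length adjustment (the function and variable names here are my own):

```python
AVG_MONTH_LENGTH = 365.25 / 12  # about 30.44 days

def adjust_for_month_length(sales, days_in_month):
    """Scale a month's sales to what an average-length month would show."""
    return sales * AVG_MONTH_LENGTH / days_in_month

feb = adjust_for_month_length(100_000, 28)  # roughly 108,700
mar = adjust_for_month_length(110_000, 31)  # roughly 108,000
```

Using the unrounded factor gives about \$108,700 for February rather than the \$109,000 obtained with the rounded 1.09 factor; either way, the two adjusted months come out nearly equal.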

As described earlier, a month can have four or five occurrences of the same day. As a result, a month may have more trading days in one year than in the next. This can cause problems in retail sales and banking. If a month has five Sundays, and Sunday is a non-trading day (as is the case in banking), you must account for it. Unlike month-length adjustments, where differences from one month to the next are obvious, trading day adjustments aren’t always precise, since their variation is not as predictable.

In the simplest cases, your approach can be similar to that of the formula above, only you’re dividing the number of trading days in an average month by the number of trading days in a given month. However, that can be misleading.

Many analysts also rely on other approaches to adjust for trading days in regression analysis: seasonal dummy variables (which we discussed earlier this year); creating independent variables that denote the number of times each day of the week occurred in that month; and a dummy variable for Easter (having a value of 1 in either March or April, depending on when it fell, and 0 in the non-Easter month).
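For the day-of-week count variables, Python's standard `calendar` module makes the bookkeeping straightforward; this sketch counts how many times each weekday occurs in a given month:

```python
import calendar

def weekday_counts(year, month):
    """Number of occurrences of each weekday (Mon=0 .. Sun=6) in a month."""
    counts = [0] * 7
    n_days = calendar.monthrange(year, month)[1]
    for day in range(1, n_days + 1):
        counts[calendar.weekday(year, month, day)] += 1
    return counts

# December 2010 had five Wednesdays, Thursdays, and Fridays
print(weekday_counts(2010, 12))  # [4, 4, 5, 5, 5, 4, 4]
```

Each month's seven counts can then serve as independent variables in the regression.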

Adjusting for calendar and trading day effects is crucial to effective forecasting and discernment of seasonal patterns.

Forecast Friday Resumes January 6, 2011

Forecast Friday will not be published on December 23 and December 30, in observance of Christmas and New Year’s, but will resume on January 6, 2011. When we resume on that day, we will begin a six-week miniseries on autoregressive integrated moving average (ARIMA) models in forecasting. This six-week series will round out all of our discussions on quantitative forecasting techniques, after which we will begin discussing judgmental forecasts for five weeks, followed by a four week capstone tying together everything we’ve discussed. There’s much to look forward to in the New Year.

*************************

Thanks to all of you, Analysights now has over 200 fans on Facebook … and we’d love more! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! And if you like us that much, please also pass these posts on to your friends who like forecasting and invite them to “Like” Analysights! By “Like-ing” us on Facebook, you and they will be informed every time a new blog post has been published, or when new information comes out. Check out our Facebook page! You can also follow us on Twitter. Thanks for your help!

### Forecast Friday Topic: Two-Stage Least Squares

December 2, 2010

(Thirty-first in a series)

In the last two Forecast Friday posts, we set up the discussion of how to solve systems of simultaneous equations when forecasting by first mentioning the identification problem and then breaking equation systems down from structural into reduced forms. Today we finalize that discussion with a talk about performing two-stage least squares (2SLS) regression. Quite often, when we forecast, we build one regression model using one or more variables as our explanatory variables and another as our dependent variable. However, many of our explanatory variables may in fact be dependent variables in another regression model that is highly related to the model we develop, particularly if one or more of our explanatory variables is contained in those other regressions. Because of this, our results may suffer from simultaneous equation bias. To minimize this bias, 2SLS regression is necessary.

Two-Stage Least Squares

As mentioned in the last two posts, I will not be going into too much mathematical detail in our discussions of simultaneous equations and 2SLS. This is a brief theoretical discussion to help you recognize situations that warrant a 2SLS approach. In our last post, we talked about how to generate the reduced form of the equations in our system. Hence the first stage of 2SLS is:

Perform OLS on the Reduced Form Equations

Recall our discussion of endogenous and exogenous variables. For each endogenous variable in your system, you must have one reduced form equation, and each reduced form equation must have all of the system’s exogenous variables on its right side. This process can be tedious, but often less so than it first appears: if your system has seven endogenous variables, you do not need to run OLS on all seven reduced form equations, only on those for endogenous variables that appear on the right side of the structural equations you want to estimate. Hence, if you’re trying to forecast consumption (an endogenous variable), and disposable income (another endogenous variable) is one of its independent variables, then you need only perform OLS on the reduced form equation for disposable income, since it is the only endogenous variable appearing on the right side of the consumption equation. You do not need to run OLS on both variables’ reduced forms.

Performing OLS on the reduced form equations gives you the fitted values to use in the second-stage regressions. Only the R2 and fitted values of each equation are the important pieces of information provided by the first stage. T-ratios are of no value, since the likelihood of significant multicollinearity will be strong. But the R2 statistic is important. A low R2 suggests little or no correlation between the fitted values and the endogenous variables. The fitted values, after all, are intended to replace the endogenous variables, so you want a high correlation, via a high R2. Furthermore, a low R2 can lead to biased standard errors of the parameter estimates in the second stage, resulting in coefficients that are inefficient. Hence, a correction factor for the standard errors is necessary.

Once you have performed OLS on the reduced form equations, the next stage of 2SLS is:

Perform OLS on the Structural Equations

If you performed OLS on the reduced form equation for disposable income, you would then substitute the fitted values from that regression for disposable income in the structural consumption equation. At this stage, everything works like regular OLS. However, you must be on heightened alert for autocorrelation, both because you are using time series data and because you are using several time series equations, which increases the likelihood of autocorrelation.
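As an illustration, here is a bare-bones 2SLS sketch in pure Python for the simplest case: one endogenous regressor (say, disposable income) and one exogenous variable used to form its fitted values. This is a toy implementation under those assumptions, not production econometrics code (which would also apply the standard-error correction mentioned above):

```python
def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope

def two_stage_least_squares(y, x_endog, z_exog):
    # Stage 1: regress the endogenous regressor on the exogenous variable
    a, b = ols(z_exog, x_endog)
    fitted = [a + b * z for z in z_exog]
    # Stage 2: regress y on the fitted values instead of the raw regressor
    return ols(fitted, y)
```

With several exogenous variables or regressors, the same two stages apply with multiple regression in place of `ols`.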

Next Forecast Friday Topic: Leading Economic Indicators

In next week’s Forecast Friday post, we will discuss some of the more interesting aspects of economic forecasting, leading economic indicators. We often hear about leading economic indicators in the news, and we will be discussing the theory behind them, and the role that expectations play. We will see how leading economic indicators impact forecasting as well. Don’t miss it!

*************************


### Forecast Friday Topic: Structural and Reduced Forms

November 18, 2010

(Thirtieth in a series)

Last week, we discussed the identification problem – a common occurrence in forecasting when consistent estimators of the parameters of the equation or model in which we are interested don’t exist. We also discussed how identifying variables unique to one equation but not to the other is the first step to alleviating the identification problem.

Today, we will briefly discuss the next step in solving the identification problem: structural and reduced forms. Because the math can get a little complicated, we won’t be focusing on it here. Like last week, this week’s – and next week’s – post will be theoretical in nature.

Structural and reduced forms get their origin from matrix algebra and involve systems of equations. Indeed, the equations contained within a system are called structural equations because, together, they are developed to explain the hypothesized structure of a given market. Structural equations are based on economic theory and are used to derive the reduced form equations for two-stage least squares regression.

To derive the reduced form equations, one endogenous variable must be placed on the left side of the equation, while all exogenous variables must be placed on the right. You must have one reduced form equation for each endogenous variable present in the system. So, if your system of equations has five endogenous variables, then you must have five reduced form equations.

The process for reducing the form of the structural equations follows that of solving for a system of linear equations:

1. Set one equation equal to another;
2. Subtract the endogenous parameter term (estimate times variable) and the error term from one side of the equation;
3. Factor both sides;
4. Divide to solve for the endogenous variable. This gives you the first reduced form equation.
5. Find the next reduced form equation by substituting the right side of the first reduced form equation into one of the original structural equations.
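To make these steps concrete, consider a hypothetical two-equation system in which quantity Q and price P are endogenous and rainfall R is exogenous (the coefficients and error terms here are illustrative):

```latex
% Structural equations (Q, P endogenous; R exogenous)
\begin{aligned}
\text{Demand: } \; & Q = a_0 + a_1 P + u \\
\text{Supply: } \; & Q = b_0 + b_1 P + b_2 R + v \\[4pt]
% Steps 1-4: set the equations equal and solve for P
& a_0 + a_1 P + u = b_0 + b_1 P + b_2 R + v \\
& P = \frac{(b_0 - a_0) + b_2 R + (v - u)}{a_1 - b_1} \\[4pt]
% Step 5: substitute back into the demand equation to get Q
& Q = a_0 + a_1 \left[ \frac{(b_0 - a_0) + b_2 R + (v - u)}{a_1 - b_1} \right] + u
\end{aligned}
```

Both reduced form equations now express an endogenous variable purely in terms of the exogenous variable R and the error terms, exactly the format of a multiple regression model.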

Essentially, it’s best to think of endogenous variables as dependent variables and of exogenous variables as independent variables; this way, you get the result of the reduced form having precisely the same format as multiple regression models. Given assumptions about future values of exogenous variables, the reduced form can facilitate computation of conditional forecasts of future values of the endogenous variables.

Forecast Friday Resumes Two Weeks From Today

Forecast Friday will not be published next Thursday, in observance of Thanksgiving. We here at Analysights are very thankful for readers like you who check in every week, and we look forward to your continued visits to Insight Central. Forecast Friday will resume two weeks from today, December 2, when we will conclude our discussion of simultaneous equations with a post on Two-Stage Least Squares regression analysis. We at Analysights wish you and your family a Happy Thanksgiving.

*************************


### Forecast Friday Topic: The Identification Problem

November 11, 2010

(Twenty-ninth in a series)

When we work with regression analysis, it is assumed that outside factors determine each of the independent variables in the model; these factors are said to be exogenous to the system. This is especially of interest to economists, who have long used econometric models to forecast demand and supply for various goods. The price the market will bear for a good or service, for example, is not determined by a single equation, but by the interaction of the equations for both supply and demand. If price were what we were trying to forecast, then a single equation would do us little good. In fact, since price is part of a multi-equation system, performing regression analysis for just demand without supply, or vice-versa, will result in biased parameter estimates.

This post begins our three-part “series within a series” on “Simultaneous Equations and Two-Stage Least Squares Regression”. Although this topic sounds intimidating, I will not be covering it in much technical detail. My purpose in discussing it is to make you aware of these concepts, so that you can determine when to look beyond a simple regression analysis.

Hence, we start with the most basic concept of simultaneous equations: the identification problem. Let’s assume that you are the supply chain manager for a beer company. You need to forecast the price of barley, so your company can budget how much money it needs to spend in order to have enough barley to produce its beer; determine whether the price is on an upward trend, so that it could purchase derivatives to hedge its risk; and determine the final price for its beer.

You have statistics for the price and traded quantity of barley for the last several years. You also remember three concepts from your college economics class:

1. The price and quantity supplied of a good have a direct relationship – producers supply more as the price goes up and less as the price goes down;
2. The price and quantity demanded of a good have an inverse relationship – consumers purchase less as the price goes up and vice-versa; and
3. The market price is determined by the interaction of the supply and demand equations.

Since price and quantity are positively sloped for supply and negatively sloped for demand, with only the two variables of quantity and price, you cannot determine – that is identify – the supply and demand equations using regression analysis; the information is insufficient. However, if you can identify variables that are in one equation and not the other, you will be able to identify the individual relations.

In agriculture, the supply of a crop is greatly affected by weather. If you can obtain information on the amount of rainfall in barley-producing regions during the years for which you have data, you might be able to identify the different equations. Production costs also impact supply, so information on the costs of planting and harvesting barley would help too. On the demand side, the quantity of barley demanded can be influenced by changes in tastes: if beer demand goes up, so too will the demand for barley; if farm animal raising increases, farmers may need to purchase more barley for animal fodder; and various health fads may emerge, increasing the demand for barley breads and soups. If you can obtain these kinds of information, you are on your way to identifying the supply and demand curves.

Exogenous and Endogenous Variables

Since rainfall affects the supply of barley, but the barley market does not influence the amount of rainfall, rainfall is said to be an exogenous variable: its value is determined by factors outside the equation system. Similarly, since the demand for beer drives the demand for barley, but not the other way around, beer demand is also an exogenous variable.

Because price and quantity of barley are part of a demand and supply system, they are determined by the interaction of the two equations – that is by the equation system – so they are said to be endogenous variables.

Identifying an Equation

If you are trying to identify an equation that is part of a multi-equation system, that equation must exclude at least one fewer variable than the number of equations in the system. Hence, if you have a two-equation system, at least one variable that appears in the other equation must be excluded from the model you’re trying to identify; if your system has three equations, at least two variables must be excluded from the model you want to identify, and so on.

When an equation contains exactly as many unique exogenous variables as this rule requires, the equation is just identified. Several econometric techniques can estimate just identified systems; however, they are quite rare in practice. When no exogenous variables are unique to any one equation in the system, the equations are underidentified and cannot be estimated with any econometric technique. Most often, equations are overidentified: more exogenous variables are excluded from one equation than the number of equations in the system requires. When that is the case, two-stage least squares (the topic of the third post of this miniseries) is required in order to tell which of the variables is causing your supply (or demand) curve to shift along the fixed demand (or supply) curve.

Next Forecast Friday Topic: Structural and Reduced Forms

Next week’s Forecast Friday topic builds on today’s topic with a discussion of structural and reduced forms of equations. These are the first steps in Two-Stage Least Squares Regression analysis, and are part of the effort to solve the identification problem.

*************************