## Posts Tagged ‘time series analysis’

### Forecast Friday Topic: Forecasting with Seasonally-Adjusted Data

September 16, 2010

(Twenty-first in a series)

Last week, we introduced you to fictitious businesswoman Billie Burton, who puts together handmade gift baskets and care packages for customers. Billie is interested in forecasting gift basket orders so that she can get a better idea of how much time she’s going to need to set aside to assemble the packages; how many supplies to have on hand; how much revenue – and cost – she can expect; and whether she will need assistance. Gift-giving is seasonal, and Billie’s business is no exception. The Christmas season is Billie’s busiest, and a few other months are much busier than the rest, so she must adjust for these seasonal factors before doing any forecasts.

Why is it important to adjust data for seasonal factors? Imagine trying to do regression analysis on monthly retail data that hasn’t been adjusted. If sales during the holiday season are much greater than at all other times of the year, there will be significant forecast errors in the model because the holiday period’s sales will be outliers. And regression analysis places greater weight on extreme values when trying to determine the least-squares equation.

Billie’s Orders, Revisited

Recall from last week that Billie has five years of monthly gift basket orders, from January 2005 to December 2009. The orders are shown again in the table below:

**TOTAL GIFT BASKET ORDERS**

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |

Billie would like to forecast gift basket orders for the first four months of 2010, particularly February and April, for Valentine’s Day and Easter, two other busier-than-usual periods. Billie must first adjust her data.

When we decomposed the time series, we computed the seasonal adjustment factors for each month. They were as follows:

| Month | Factor |
|---|---|
| January | 0.78 |
| February | 1.53 |
| March | 0.89 |
| April | 1.13 |
| May | 0.65 |
| June | 0.67 |
| July | 0.55 |
| August | 1.00 |
| September | 0.91 |
| October | 0.62 |
| November | 1.53 |
| December | 1.75 |

Knowing these monthly seasonal factors, Billie adjusts her monthly orders by dividing each month’s orders by its respective seasonal factor (e.g., each January’s orders is divided by 0.78; each February’s orders by 1.53, and so on). Billie’s seasonally-adjusted data looks like this:

| Month | Orders | Adjustment Factor | Seasonally Adjusted Orders | Time Period |
|---|---|---|---|---|
| Jan-05 | 15 | 0.78 | 19.28 | 1 |
| Feb-05 | 30 | 1.53 | 19.61 | 2 |
| Mar-05 | 25 | 0.89 | 28.15 | 3 |
| Apr-05 | 15 | 1.13 | 13.30 | 4 |
| May-05 | 13 | 0.65 | 19.93 | 5 |
| Jun-05 | 14 | 0.67 | 21.00 | 6 |
| Jul-05 | 12 | 0.55 | 21.81 | 7 |
| Aug-05 | 22 | 1.00 | 22.10 | 8 |
| Sep-05 | 20 | 0.91 | 21.93 | 9 |
| Oct-05 | 14 | 0.62 | 22.40 | 10 |
| Nov-05 | 35 | 1.53 | 22.89 | 11 |
| Dec-05 | 40 | 1.75 | 22.92 | 12 |
| Jan-06 | 18 | 0.78 | 23.13 | 13 |
| Feb-06 | 36 | 1.53 | 23.53 | 14 |
| Mar-06 | 18 | 0.89 | 20.27 | 15 |
| Apr-06 | 30 | 1.13 | 26.61 | 16 |
| May-06 | 16 | 0.65 | 24.53 | 17 |
| Jun-06 | 17 | 0.67 | 25.49 | 18 |
| Jul-06 | 14 | 0.55 | 25.44 | 19 |
| Aug-06 | 26 | 1.00 | 26.12 | 20 |
| Sep-06 | 24 | 0.91 | 26.32 | 21 |
| Oct-06 | 17 | 0.62 | 27.20 | 22 |
| Nov-06 | 42 | 1.53 | 27.47 | 23 |
| Dec-06 | 48 | 1.75 | 27.50 | 24 |
| Jan-07 | 22 | 0.78 | 28.27 | 25 |
| Feb-07 | 43 | 1.53 | 28.11 | 26 |
| Mar-07 | 22 | 0.89 | 24.77 | 27 |
| Apr-07 | 36 | 1.13 | 31.93 | 28 |
| May-07 | 19 | 0.65 | 29.13 | 29 |
| Jun-07 | 20 | 0.67 | 29.99 | 30 |
| Jul-07 | 17 | 0.55 | 30.90 | 31 |
| Aug-07 | 31 | 1.00 | 31.14 | 32 |
| Sep-07 | 29 | 0.91 | 31.80 | 33 |
| Oct-07 | 20 | 0.62 | 32.01 | 34 |
| Nov-07 | 50 | 1.53 | 32.70 | 35 |
| Dec-07 | 58 | 1.75 | 33.23 | 36 |
| Jan-08 | 26 | 0.78 | 33.42 | 37 |
| Feb-08 | 52 | 1.53 | 33.99 | 38 |
| Mar-08 | 43 | 0.89 | 48.41 | 39 |
| Apr-08 | 27 | 1.13 | 23.94 | 40 |
| May-08 | 23 | 0.65 | 35.26 | 41 |
| Jun-08 | 24 | 0.67 | 35.99 | 42 |
| Jul-08 | 20 | 0.55 | 36.35 | 43 |
| Aug-08 | 37 | 1.00 | 37.17 | 44 |
| Sep-08 | 35 | 0.91 | 38.38 | 45 |
| Oct-08 | 24 | 0.62 | 38.41 | 46 |
| Nov-08 | 60 | 1.53 | 39.24 | 47 |
| Dec-08 | 70 | 1.75 | 40.11 | 48 |
| Jan-09 | 31 | 0.78 | 39.84 | 49 |
| Feb-09 | 62 | 1.53 | 40.53 | 50 |
| Mar-09 | 32 | 0.89 | 36.03 | 51 |
| Apr-09 | 52 | 1.13 | 46.12 | 52 |
| May-09 | 28 | 0.65 | 42.93 | 53 |
| Jun-09 | 29 | 0.67 | 43.49 | 54 |
| Jul-09 | 24 | 0.55 | 43.62 | 55 |
| Aug-09 | 44 | 1.00 | 44.20 | 56 |
| Sep-09 | 42 | 0.91 | 46.06 | 57 |
| Oct-09 | 29 | 0.62 | 46.41 | 58 |
| Nov-09 | 72 | 1.53 | 47.09 | 59 |
| Dec-09 | 84 | 1.75 | 48.13 | 60 |

Notice the seasonally adjusted gift basket orders in the fourth column: each value is the quotient of the second and third columns. In months where the seasonal adjustment factor is greater than 1, the seasonally adjusted orders are lower than actual orders; in months where the factor is less than 1, the seasonally adjusted orders are greater than actual. This is intended to normalize the data set. (Note: August shows a seasonal factor of 1.00, suggesting it is an average month, yet August 2008’s actual orders are 37 baskets while its adjusted orders are 37.17. Both quirks are due to rounding; the unrounded factor differs slightly from 1.) The final column is the sequential time period number for each month, from 1 to 60.
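The adjustment step can be sketched in a few lines of Python. This is a minimal sketch using the 2005 orders and the rounded factors from the tables above; because the factors shown are rounded to two decimals, the results differ slightly from the table’s values, which were computed with unrounded factors.

```python
# Deseasonalize a monthly series by dividing each month's orders by
# that month's seasonal adjustment factor (values from the post's tables).
orders_2005 = [15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40]
factors = [0.78, 1.53, 0.89, 1.13, 0.65, 0.67,
           0.55, 1.00, 0.91, 0.62, 1.53, 1.75]

adjusted = [round(o / f, 2) for o, f in zip(orders_2005, factors)]
print(adjusted)  # e.g. January: 15 / 0.78 ≈ 19.23 (table shows 19.28, unrounded factor)
```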

Perform Regression Analysis

Now Billie runs regression analysis. She is going to do a simple regression, using the time period, t, in the last column as her independent variable and the seasonally adjusted orders as her dependent variable. Recall that last week, we ran a simple regression on the actual sales to isolate the trend component, and we identified an upward trend; however, because of the strong seasonal factors in the actual orders, the regression model didn’t fit the data well. By factoring out these seasonal variations, we should expect a model that better fits the data.

Running her regression of the seasonally adjusted orders, Billie gets the following output:

Ŷ = 0.47t + 17.12

And as we expected, this model fits the data better, with an R² of 0.872. Basically, each passing month increases seasonally adjusted basket orders by about half an order.
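The regression itself is ordinary least squares of the adjusted orders on the time period. Below is a sketch of the standard slope/intercept formulas; in Billie’s case `y` would hold the 60 seasonally adjusted values from the table above, so the sanity check here uses an exactly linear stand-in series instead.

```python
# Simple least-squares fit of a series against t = 1..n.
def fit_line(y):
    n = len(y)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    # slope = covariance(t, y) / variance(t); intercept from the means
    slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
             / sum((ti - t_bar) ** 2 for ti in t))
    intercept = y_bar - slope * t_bar
    return slope, intercept

# Sanity check on an exactly linear series: y = 0.5*t + 17
demo = [0.5 * t + 17 for t in range(1, 61)]
slope, intercept = fit_line(demo)
print(round(slope, 2), round(intercept, 2))  # 0.5 17.0
```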

Forecasting Orders

Now Billie needs to forecast orders for January through April 2010. January 2010 is period 61, so she plugs that into her regression equation:

Ŷ = 0.47(61) + 17.12 = 45.81

Billie plugs in the data for the rest of the months and gets the following:

| Month | Period | Ŷ |
|---|---|---|
| Jan-10 | 61 | 45.81 |
| Feb-10 | 62 | 46.28 |
| Mar-10 | 63 | 46.76 |
| Apr-10 | 64 | 47.23 |

Remember, however, that this is seasonally-adjusted data. To get the forecasts for actual orders for each month, Billie now needs to convert them back. Since she divided each month’s orders by its seasonal adjustment factor, she must now multiply each of these months’ forecasts by those same factors. So Billie goes ahead and does that:

| Month | Period | Ŷ | Seasonal Factor | Forecast Orders |
|---|---|---|---|---|
| Jan-10 | 61 | 45.81 | 0.78 | 35.65 |
| Feb-10 | 62 | 46.28 | 1.53 | 70.81 |
| Mar-10 | 63 | 46.76 | 0.89 | 41.53 |
| Apr-10 | 64 | 47.23 | 1.13 | 53.25 |

So, Billie forecasts 36 gift basket orders in January, 71 in February, 42 in March, and 53 in April.
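The re-seasonalizing step is simply the reverse of the earlier division. A minimal sketch, using the figures from the tables above; because the factors here are rounded to two decimals, a couple of the products differ slightly from the table’s values, which used unrounded factors.

```python
# Convert seasonally adjusted forecasts back to actual-order forecasts
# by multiplying each by its month's seasonal factor.
adjusted_forecasts = {"Jan-10": 45.81, "Feb-10": 46.28,
                      "Mar-10": 46.76, "Apr-10": 47.23}
factors = {"Jan-10": 0.78, "Feb-10": 1.53, "Mar-10": 0.89, "Apr-10": 1.13}

forecasts = {m: round(yhat * factors[m], 2)
             for m, yhat in adjusted_forecasts.items()}
print(forecasts)  # e.g. Feb-10: 46.28 * 1.53 = 70.81
```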

Next Forecast Friday Topic: Qualitative Variables in Regression Modeling

You’ve just learned how to adjust for seasonality when forecasting. One thing you’ve noticed through all of these forecasts we have built is that all variables have been quantitative. Yet sometimes, we need to account for qualitative, or categorical factors in our explanation of events. The next two Forecast Friday posts will discuss a simple approach for introducing qualitative information into modeling: “dummy” variables. Dummy variables can be helpful in determining differences in predictive estimates by region, gender, race, political affiliation, etc. As you will also find, dummy variables can even be used for a faster, more simplified approach to gauging seasonality. You’ll find our discussion on dummy variables highly useful.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Forecast Friday Topic: Decomposing a Time Series

September 9, 2010

(Twentieth in a series)

Welcome to our 20th Forecast Friday post. The last four months have been quite a journey, as we went through the various time series methods like moving average models, exponential smoothing models, and regression analysis, followed by in-depth discussions of the assumptions behind regression analysis and the consequences and remedies of violating those assumptions. Today, we resume the more practical aspects of time series analysis, with a discussion of decomposing a time series. If you recall from our May 3 post, a time series consists of four components: a trend component; a seasonal component; a cyclical component; and an irregular, or random, component. Today, we will show you how to isolate and control for these components, using the fictitious example of Billie Burton, a self-employed gift basket maker.

So Billie pulls together her monthly orders for the years 2005-2009. They look like this:

**TOTAL GIFT BASKET ORDERS**

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |

Trend Component

When a variable exhibits a long-term increase or decrease over the course of time, it is said to have a trend. Billie’s gift basket orders for the past five years exhibit a long-term, upward trend, as shown by the time series plot below:

Although the graph looks pretty busy and bumpy, you can see that Billie’s monthly orders seem to be moving upward over the course of time. Notice that we fit a straight line across Billie’s time series. This is a linear trend line. Most times, we plot the data in a time series and then draw a straight line freehand to show whether a trend is increasing or decreasing. Another approach to fitting a trend line – like the one I used here – is to use simple regression analysis, using each time period, t, as the independent variable, and numbering each period in sequential order. Hence, January 2005 would be t=1 and December 2009 would be t=60. This is very similar to the approach we discussed in our May 27 blog post when we demonstrated how our other fictitious businesswoman, Sue Stone, could forecast her sales.

Using regression analysis to fit our trend line, we get the following equation:

Ŷ = 0.518t + 15.829

Since the slope of the trend line is positive, we know that the trend is upward. Billie’s orders seem to increase by slightly more than half an order each month, on average. However, when we look at the R², we get just 0.313, suggesting the trend line doesn’t fit the actual data well. But that is because of the drastic seasonality in the data set, which we will address shortly. For now, we at least know that the trend is increasing.
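As a check, the trend fit can be reproduced from the orders table above. This sketch recovers the post’s equation and R² (up to rounding) with the standard least-squares formulas:

```python
# Fit the trend line by simple regression on t = 1..60, using Billie's
# actual monthly orders (2005-2009, in time order) from the table above.
orders = [
    15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40,   # 2005
    18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48,   # 2006
    22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58,   # 2007
    26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70,   # 2008
    31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84,   # 2009
]
n = len(orders)
t = list(range(1, n + 1))
t_bar, y_bar = sum(t) / n, sum(orders) / n

s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, orders))
s_tt = sum((ti - t_bar) ** 2 for ti in t)
s_yy = sum((yi - y_bar) ** 2 for yi in orders)

slope = s_ty / s_tt                 # trend per month
intercept = y_bar - slope * t_bar
r_squared = slope * s_ty / s_yy     # explained / total sum of squares
print(round(slope, 3), round(intercept, 2), round(r_squared, 3))  # 0.518 15.83 0.313
```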

Seasonal Component

When a time series shows a repeating pattern over time, usually during the same time of the year, that pattern is known as the seasonal component in the time series. Some time series have more than one period in the year in which seasonality is strong; others have no seasonality. If you look at each of the January points, you’ll notice that it is greatly lower than the preceding December and the following February. Also, if you look at each December, you’ll see that it is the highest point of orders for each year. This strongly suggests seasonality in the data.

But what is the impact of the seasonality? We find out by isolating the seasonal component and creating a seasonal index, known as the ratio to moving average. Computing the ratio to moving average is a four-step process:

First, take the moving average of the series

Since our data is monthly, we will be taking a 12-month moving average. If our data was quarterly, we would do a 4-quarter moving average. We’ve essentially done this in the third column of the table below.

Second, center the moving averages

Next, we center the moving averages by taking the average of each successive pair of moving averages; the result is shown in the fourth column.

Third, compute the ratio to moving average

To obtain the ratio to moving average, divide the number of orders for a given month by the centered 12-month moving average that corresponds to that month. Notice that July 2005 is the first month to have a centered 12-month moving average. That is because we lose data points when we take a moving average. For July 2005, we divide its number of orders, 12, by its centered 12-month moving average, 21.38, and get 0.561 (the ratio is multiplied by 100 to express it as a percentage in this example).
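Steps 1 through 3 can be sketched together. This minimal sketch uses the first 14 months of Billie’s orders, which is just enough to reproduce the first two ratios in the table below:

```python
# Ratio to moving average: 12-month MA, centered MA, then actual/centered.
orders = [15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40, 18, 36]

# 12-month moving averages; window k covers months k..k+11
ma = [sum(orders[k:k + 12]) / 12 for k in range(len(orders) - 11)]

# center successive pairs of moving averages; pair k lines up with month k+6
centered = [(a + b) / 2 for a, b in zip(ma, ma[1:])]

# ratio to moving average (in percent) for month k+6
ratios = [round(100 * orders[k + 6] / c, 1) for k, c in enumerate(centered)]
print(ratios)  # first ratio belongs to Jul-05
```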

| Month | Orders | 12-Month Moving Average | Centered 12-Month Moving Average | Ratio to Moving Average (%) |
|---|---|---|---|---|
| Jan-05 | 15 | | | |
| Feb-05 | 30 | | | |
| Mar-05 | 25 | | | |
| Apr-05 | 15 | | | |
| May-05 | 13 | | | |
| Jun-05 | 14 | 21.25 | | |
| Jul-05 | 12 | 21.50 | 21.38 | 56.1 |
| Aug-05 | 22 | 22.00 | 21.75 | 101.1 |
| Sep-05 | 20 | 21.42 | 21.71 | 92.1 |
| Oct-05 | 14 | 22.67 | 22.04 | 63.5 |
| Nov-05 | 35 | 22.92 | 22.79 | 153.6 |
| Dec-05 | 40 | 23.17 | 23.04 | 173.6 |
| Jan-06 | 18 | 23.33 | 23.25 | 77.4 |
| Feb-06 | 36 | 23.67 | 23.50 | 153.2 |
| Mar-06 | 18 | 24.00 | 23.83 | 75.5 |
| Apr-06 | 30 | 24.25 | 24.13 | 124.4 |
| May-06 | 16 | 24.83 | 24.54 | 65.2 |
| Jun-06 | 17 | 25.50 | 25.17 | 67.5 |
| Jul-06 | 14 | 25.83 | 25.67 | 54.5 |
| Aug-06 | 26 | 26.42 | 26.13 | 99.5 |
| Sep-06 | 24 | 26.75 | 26.58 | 90.3 |
| Oct-06 | 17 | 27.25 | 27.00 | 63.0 |
| Nov-06 | 42 | 27.50 | 27.38 | 153.4 |
| Dec-06 | 48 | 27.75 | 27.63 | 173.8 |
| Jan-07 | 22 | 28.00 | 27.88 | 78.9 |
| Feb-07 | 43 | 28.42 | 28.21 | 152.4 |
| Mar-07 | 22 | 28.83 | 28.63 | 76.9 |
| Apr-07 | 36 | 29.08 | 28.96 | 124.3 |
| May-07 | 19 | 29.75 | 29.42 | 64.6 |
| Jun-07 | 20 | 30.58 | 30.17 | 66.3 |
| Jul-07 | 17 | 30.92 | 30.75 | 55.3 |
| Aug-07 | 31 | 31.67 | 31.29 | 99.1 |
| Sep-07 | 29 | 33.42 | 32.54 | 89.1 |
| Oct-07 | 20 | 32.67 | 33.04 | 60.5 |
| Nov-07 | 50 | 33.00 | 32.83 | 152.3 |
| Dec-07 | 58 | 33.33 | 33.17 | 174.9 |
| Jan-08 | 26 | 33.58 | 33.46 | 77.7 |
| Feb-08 | 52 | 34.08 | 33.83 | 153.7 |
| Mar-08 | 43 | 34.58 | 34.33 | 125.2 |
| Apr-08 | 27 | 34.92 | 34.75 | 77.7 |
| May-08 | 23 | 35.75 | 35.33 | 65.1 |
| Jun-08 | 24 | 36.75 | 36.25 | 66.2 |
| Jul-08 | 20 | 37.17 | 36.96 | 54.1 |
| Aug-08 | 37 | 38.00 | 37.58 | 98.4 |
| Sep-08 | 35 | 37.08 | 37.54 | 93.2 |
| Oct-08 | 24 | 39.17 | 38.13 | 63.0 |
| Nov-08 | 60 | 39.58 | 39.38 | 152.4 |
| Dec-08 | 70 | 40.00 | 39.79 | 175.9 |
| Jan-09 | 31 | 40.33 | 40.17 | 77.2 |
| Feb-09 | 62 | 40.92 | 40.63 | 152.6 |
| Mar-09 | 32 | 41.50 | 41.21 | 77.7 |
| Apr-09 | 52 | 41.92 | 41.71 | 124.7 |
| May-09 | 28 | 42.92 | 42.42 | 66.0 |
| Jun-09 | 29 | 44.08 | 43.50 | 66.7 |
| Jul-09 | 24 | | | |
| Aug-09 | 44 | | | |
| Sep-09 | 42 | | | |
| Oct-09 | 29 | | | |
| Nov-09 | 72 | | | |
| Dec-09 | 84 | | | |

We have exactly 48 months of ratios to examine. Let’s plot each year’s ratios on a graph:

At first glance, it appears that there are only two lines on the graph, those for years three and four. However, all four years are represented on this graph. It’s just that all the turning points are the same, and the ratios to moving average for each month are nearly identical. The only difference is in Year 3 (July 2007 to June 2008). Notice how the green line for Year 3 doesn’t follow the same pattern as the other years from February to April: Year 3’s ratio to moving average is actually higher for March than in all previous years, and lower for April. This is because Easter Sunday fell in late March 2008, so the Easter gift basket season came a couple of weeks earlier than in prior years.

Finally, compute the average seasonal index for each month

We now have the ratio to moving averages for each month. Let’s average them:

**RATIO TO MOVING AVERAGES**

| Month | Year 1 | Year 2 | Year 3 | Year 4 | Average |
|---|---|---|---|---|---|
| July | 0.56 | 0.55 | 0.55 | 0.54 | 0.55 |
| August | 1.01 | 1.00 | 0.99 | 0.98 | 1.00 |
| September | 0.92 | 0.90 | 0.89 | 0.93 | 0.91 |
| October | 0.64 | 0.63 | 0.61 | 0.63 | 0.62 |
| November | 1.54 | 1.53 | 1.52 | 1.52 | 1.53 |
| December | 1.74 | 1.74 | 1.75 | 1.76 | 1.75 |
| January | 0.77 | 0.79 | 0.78 | 0.77 | 0.78 |
| February | 1.53 | 1.52 | 1.54 | 1.53 | 1.53 |
| March | 0.76 | 0.77 | 1.25 | 0.78 | 0.89 |
| April | 1.24 | 1.24 | 0.78 | 1.25 | 1.13 |
| May | 0.65 | 0.65 | 0.65 | 0.66 | 0.65 |
| June | 0.68 | 0.66 | 0.66 | 0.67 | 0.67 |

Hence, we see that August is a normal month (the average seasonal index = 1). However, look at December. Its seasonal index is 1.75. That means that Billie’s orders in December are generally 175 percent of the monthly average – that is, 75 percent above it. Given the Christmas gift-giving season, that’s expected in Billie’s gift basket business. We also notice higher seasonal indices in November (when the Christmas shopping season kicks off), February (Valentine’s Day), and April (Easter). The other months tend to be below average.
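Step 4 is a plain average of each month’s four ratios. A quick sketch using the December and March ratios from the table above:

```python
# Average each month's ratios to moving average to get its seasonal index.
december = [1.74, 1.74, 1.75, 1.76]
march = [0.76, 0.77, 1.25, 0.78]   # Year 3 is the Easter outlier

dec_index = sum(december) / len(december)
mar_index = sum(march) / len(march)
print(round(dec_index, 4), round(mar_index, 2))  # 1.7475 0.89
```

Note that the table reports December as 1.75 because its indices are shown rounded to two decimals.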

Notice that April isn’t dramatically above the baseline and that March had one year where its index was 1.25 (when in other years it was under 0.80). That’s because Easter sometimes falls in late March. Events like this are important to keep track of, since they can dramatically impact planning. Also, if a given month has five weekends one year and only four the next, or if leap year adds one day to February every four years, then depending on your business, these events can make a big difference in the accuracy of your forecasts.

The Cyclical and Irregular Components

Now that we’ve isolated the trend and seasonal components, we know that Billie’s orders exhibit an increasing trend and that orders tend to be above average during November, December, February, and April. Now we need to isolate the cyclical and irregular components. Cyclical variations don’t repeat themselves in a regular pattern, but they are not random variations either. Cyclical patterns are recognizable, but they almost always vary in intensity (the height from peak to trough) and timing (the frequency with which the peaks and troughs occur). Since they cannot be accurately predicted, they are often analyzed together with the irregular component.

The way we isolate the cyclical and irregular components is by first isolating the trend and seasonal components as we did above. So we take our trend regression equation from above and plug in each month’s sequence number to get the trend value. Then we multiply it by that month’s average seasonal ratio to moving average to derive the statistical normal. To derive the cyclical-irregular component, we divide the actual orders for that month by the statistical normal (and multiply by 100 to express it as a percentage). The following table shows us how:

| Month | Orders (Y) | Time Period (t) | Trend Value (T) | Seasonal Index Ratio (S) | Statistical Normal (T×S) | Cyclical-Irregular Component (%) (100×Y/(T×S)) |
|---|---|---|---|---|---|---|
| Jan-05 | 15 | 1 | 16 | 0.78 | 12.72 | 117.92 |
| Feb-05 | 30 | 2 | 17 | 1.53 | 25.80 | 116.27 |
| Mar-05 | 25 | 3 | 17 | 0.89 | 15.44 | 161.91 |
| Apr-05 | 15 | 4 | 18 | 1.13 | 20.19 | 74.31 |
| May-05 | 13 | 5 | 18 | 0.65 | 12.01 | 108.20 |
| Jun-05 | 14 | 6 | 19 | 0.67 | 12.63 | 110.86 |
| Jul-05 | 12 | 7 | 19 | 0.55 | 10.71 | 112.09 |
| Aug-05 | 22 | 8 | 20 | 1.00 | 19.88 | 110.64 |
| Sep-05 | 20 | 9 | 20 | 0.91 | 18.69 | 107.02 |
| Oct-05 | 14 | 10 | 21 | 0.62 | 13.13 | 106.63 |
| Nov-05 | 35 | 11 | 22 | 1.53 | 32.92 | 106.31 |
| Dec-05 | 40 | 12 | 22 | 1.75 | 38.48 | 103.95 |
| Jan-06 | 18 | 13 | 23 | 0.78 | 17.56 | 102.52 |
| Feb-06 | 36 | 14 | 23 | 1.53 | 35.31 | 101.94 |
| Mar-06 | 18 | 15 | 24 | 0.89 | 20.96 | 85.86 |
| Apr-06 | 30 | 16 | 24 | 1.13 | 27.20 | 110.30 |
| May-06 | 16 | 17 | 25 | 0.65 | 16.07 | 99.57 |
| Jun-06 | 17 | 18 | 25 | 0.67 | 16.77 | 101.34 |
| Jul-06 | 14 | 19 | 26 | 0.55 | 14.13 | 99.10 |
| Aug-06 | 26 | 20 | 26 | 1.00 | 26.07 | 99.72 |
| Sep-06 | 24 | 21 | 27 | 0.91 | 24.36 | 98.53 |
| Oct-06 | 17 | 22 | 27 | 0.62 | 17.02 | 99.91 |
| Nov-06 | 42 | 23 | 28 | 1.53 | 42.43 | 98.99 |
| Dec-06 | 48 | 24 | 28 | 1.75 | 49.33 | 97.30 |
| Jan-07 | 22 | 25 | 29 | 0.78 | 22.40 | 98.23 |
| Feb-07 | 43 | 26 | 29 | 1.53 | 44.83 | 95.92 |
| Mar-07 | 22 | 27 | 30 | 0.89 | 26.49 | 83.06 |
| Apr-07 | 36 | 28 | 30 | 1.13 | 34.21 | 105.23 |
| May-07 | 19 | 29 | 31 | 0.65 | 20.13 | 94.41 |
| Jun-07 | 20 | 30 | 31 | 0.67 | 20.92 | 95.60 |
| Jul-07 | 17 | 31 | 32 | 0.55 | 17.55 | 96.88 |
| Aug-07 | 31 | 32 | 32 | 1.00 | 32.26 | 96.08 |
| Sep-07 | 29 | 33 | 33 | 0.91 | 30.03 | 96.58 |
| Oct-07 | 20 | 34 | 33 | 0.62 | 20.90 | 95.69 |
| Nov-07 | 50 | 35 | 34 | 1.53 | 51.94 | 96.27 |
| Dec-07 | 58 | 36 | 34 | 1.75 | 60.19 | 96.37 |
| Jan-08 | 26 | 37 | 35 | 0.78 | 27.23 | 95.47 |
| Feb-08 | 52 | 38 | 36 | 1.53 | 54.34 | 95.70 |
| Mar-08 | 43 | 39 | 36 | 0.89 | 32.01 | 134.34 |
| Apr-08 | 27 | 40 | 37 | 1.13 | 41.22 | 65.50 |
| May-08 | 23 | 41 | 37 | 0.65 | 24.18 | 95.12 |
| Jun-08 | 24 | 42 | 38 | 0.67 | 25.07 | 95.75 |
| Jul-08 | 20 | 43 | 38 | 0.55 | 20.97 | 95.38 |
| Aug-08 | 37 | 44 | 39 | 1.00 | 38.45 | 96.22 |
| Sep-08 | 35 | 45 | 39 | 0.91 | 35.70 | 98.05 |
| Oct-08 | 24 | 46 | 40 | 0.62 | 24.79 | 96.83 |
| Nov-08 | 60 | 47 | 40 | 1.53 | 61.44 | 97.65 |
| Dec-08 | 70 | 48 | 41 | 1.75 | 71.04 | 98.54 |
| Jan-09 | 31 | 49 | 41 | 0.78 | 32.07 | 96.66 |
| Feb-09 | 62 | 50 | 42 | 1.53 | 63.85 | 97.10 |
| Mar-09 | 32 | 51 | 42 | 0.89 | 37.53 | 85.26 |
| Apr-09 | 52 | 52 | 43 | 1.13 | 48.23 | 107.81 |
| May-09 | 28 | 53 | 43 | 0.65 | 28.24 | 99.16 |
| Jun-09 | 29 | 54 | 44 | 0.67 | 29.21 | 99.27 |
| Jul-09 | 24 | 55 | 44 | 0.55 | 24.39 | 98.40 |
| Aug-09 | 44 | 56 | 45 | 1.00 | 44.64 | 98.56 |
| Sep-09 | 42 | 57 | 45 | 0.91 | 41.37 | 101.53 |
| Oct-09 | 29 | 58 | 46 | 0.62 | 28.67 | 101.14 |
| Nov-09 | 72 | 59 | 46 | 1.53 | 70.95 | 101.48 |
| Dec-09 | 84 | 60 | 47 | 1.75 | 81.89 | 102.58 |
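The table’s computation can be sketched for a single month. This uses the trend equation and the rounded seasonal index from the post, so the result differs slightly from the table, which was computed with unrounded factors:

```python
# Cyclical-irregular component for one month:
# trend value T from the fitted line, statistical normal = T * S,
# and C-I = 100 * Y / (T * S).
def cyclical_irregular(y, t, seasonal_index):
    trend = 0.518 * t + 15.829       # trend equation from the post
    normal = trend * seasonal_index  # statistical normal, T * S
    return 100 * y / normal          # 100 * Y / (T * S)

# Jan-05: 15 orders, t = 1, January index 0.78
ci = cyclical_irregular(y=15, t=1, seasonal_index=0.78)
print(round(ci, 1))  # ≈ 117.6 (table shows 117.92, unrounded factors)
```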

For the most part, Billie’s orders don’t seem to exhibit much cyclical or irregular behavior; in most months, the cyclical-irregular component is pretty close to 100. Given her kind of business, we know this would be either untrue or a fluke, since the recession of 2008 through 2009 would likely have meant a reduction in orders. In many of those months, we would expect to see a ratio well below 100. We do see that in much of 2005, the cyclical-irregular component of Billie’s gift basket orders is well above 100. It is very likely that Billie’s business was seeing a positive cyclical pattern then. We also see irregular patterns in March and April of later years, where the cyclical-irregular component is likewise well above 100. That’s again the irregularity of when Easter falls. Not surprisingly, Easter has both a seasonal and an irregular component!

This does not mean that Billie can kick up her feet and rest assured knowing that her business doesn’t suffer much from cyclical or irregular patterns. A deepening of the recession can ultimately sink her orders; a war can cut off the materials that are used to produce her gift baskets; a shortage or drastic price increase in the materials she uses can also force her prices higher, which in turn lowers her orders; her workshop could be destroyed in a flood or fire; and so on. To handle some of these irregular patterns – which are almost impossible to plan for – Billie would purchase insurance.

**********************************

Knowing the composition of a time series is an important element of forecasting. Decomposing the time series helps decision makers know and explain the variability in their data and how much of it to attribute to the trend, seasonal, cyclical, and irregular components. In next week’s Forecast Friday post, we’ll discuss how to forecast using data that is seasonally adjusted.

### Forecast Friday Topic: Multicollinearity – Correcting and Accepting it

July 22, 2010

(Fourteenth in a series)

In last week’s Forecast Friday post, we discussed how to detect multicollinearity in a regression model and how dropping a suspect variable or variables from the model can be one approach to reducing or eliminating multicollinearity. However, removing variables can cause other problems – particularly specification bias – if the suspect variable is indeed an important predictor. Today we will discuss two additional approaches to correcting multicollinearity – obtaining more data and transforming variables – and will discuss when it’s best to just accept the multicollinearity.

Obtaining More Data

Multicollinearity is really an issue with the sample, not the population. Sometimes, sampling produces a data set that might be too homogeneous. One way to remedy this would be to add more observations to the data set. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both. Obtaining more data is often the best way to remedy multicollinearity.

Obtaining more data does have problems, however. Sometimes, additional data just isn’t available. This is especially the case with time series data, which can be limited or otherwise finite. If you need to obtain that additional information through great effort, it can be costly and time consuming. Also, the additional data you add to your sample could be quite similar to your original data set, so there would be no benefit to enlarging your data set. The new data could even make problems worse!

Transforming Variables

Another way statisticians and modelers go about eliminating multicollinearity is through data transformation. This can be done in a number of ways.

Combine Some Variables

The most obvious way would be to find a way to combine some of the variables. After all, multicollinearity suggests that two or more independent variables are strongly correlated. Perhaps you can multiply two variables together and use the product of those two variables in place of them.

So, in our example of the donor history, we had the two variables “Average Contribution in Last 12 Months” and “Times Donated in Last 12 Months.” We can multiply them to create a composite variable, “Total Contributions in Last 12 Months,” and then use that new variable, along with the variable “Months Since Last Donation,” to perform the regression. In fact, if we do that with our model, we end up with a model (not shown here) that has an R² of 0.895, and this time the coefficient for “Months Since Last Donation” is significant, as is our “Total Contribution” variable. Our F-statistic is a little over 72. Essentially, the R² and F-statistics are only slightly lower than in our original model, suggesting that the transformation was useful. However, looking at the correlation matrix, we still see a strong negative correlation between our two independent variables, suggesting that we still haven’t eliminated multicollinearity.
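The combination step is just an element-wise product. A minimal sketch with hypothetical donor records (the values below are illustrative, not the post’s actual donor data):

```python
# Replace two collinear predictors with their product: average
# contribution * times donated = total contributions in last 12 months.
donors = [(25.0, 4), (10.0, 12), (50.0, 2), (15.0, 6)]  # hypothetical

total_contributions = [avg * times for avg, times in donors]
print(total_contributions)  # [100.0, 120.0, 100.0, 90.0]
```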

Centered Interaction Terms

Sometimes we can reduce multicollinearity by creating an interaction term between the variables in question. In a model trying to predict performance on a test based on hours spent studying and hours of sleep, you might find that hours spent studying appears to be related to hours of sleep. So, you create a third independent variable, Sleep_Study_Interaction. You do this by computing the average value of both the hours-of-sleep and hours-of-studying variables. For each observation, you subtract each independent variable’s mean from its respective value for that observation. Once you’ve done that for each observation, multiply the differences together. This is your interaction term, Sleep_Study_Interaction. Now run the regression with the original two variables and the interaction term. When you subtract the means from the variables in question, you are in effect centering the interaction term, which means you’re taking into account the central tendency in your data.
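The centering-and-multiplying procedure described above can be sketched directly; the study/sleep values here are hypothetical:

```python
# Centered interaction term: subtract each variable's mean, then
# multiply the centered values observation by observation.
study = [2.0, 4.0, 6.0, 8.0]   # hypothetical hours studying
sleep = [8.0, 7.0, 6.0, 5.0]   # hypothetical hours of sleep

study_mean = sum(study) / len(study)   # 5.0
sleep_mean = sum(sleep) / len(sleep)   # 6.5
interaction = [(st - study_mean) * (sl - sleep_mean)
               for st, sl in zip(study, sleep)]
print(interaction)  # [-4.5, -0.5, -0.5, -4.5]
```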

Differencing Data

If you’re working with time series data, one way to reduce multicollinearity is to run your regression using differences. To do this, you take every variable – dependent and independent – and, beginning with the second observation, subtract the immediately prior observation’s values for those variables from the current observation’s. Now, instead of working with the original data, you are working with the change in the data from one period to the next. Differencing eliminates multicollinearity by removing the trend component of the time series. If all independent variables had followed more or less the same trend, they could end up highly correlated. Sometimes, however, trends can build on themselves for several periods, so multiple differencing may be required. In this case, subtracting the period before was taking a “first difference.” If we subtract two periods before, it’s a “second difference,” and so on. Note also that with differencing, we lose the first observations in the data, depending on how many periods we have to difference, so if you have a small data set, differencing can reduce your degrees of freedom and increase your risk of making a Type II error: concluding that an independent variable is not statistically significant when, in truth, it is.
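Differencing is a one-liner per series. This sketch uses an illustrative trending series and follows the post’s usage, where a “second difference” subtracts the observation two periods before:

```python
# First and second differences of a trending series (illustrative data).
series = [100, 104, 109, 115, 122]

first_diff = [series[k] - series[k - 1] for k in range(1, len(series))]
second_diff = [series[k] - series[k - 2] for k in range(2, len(series))]
print(first_diff)   # [4, 5, 6, 7]
print(second_diff)  # [9, 11, 13]
```

Note how each difference shortens the series, which is the degrees-of-freedom cost the paragraph above warns about.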

Other Transformations

Sometimes, it makes sense to take a look at a scatter plot of each independent variable’s values with that of the dependent variable to see if the relationship is fairly linear. If it is not, that’s a cue to transform an independent variable. If an independent variable appears to have a logarithmic relationship, you might substitute its natural log. Also, depending on the relationship, you can use other transformations: square root, square, negative reciprocal, etc.
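The transformations listed above are all simple element-wise operations; a sketch on an illustrative predictor:

```python
import math

# Common transformations of an independent variable:
# natural log, square root, square, negative reciprocal.
x = [1.0, 4.0, 9.0, 16.0]  # illustrative values (must be positive for log)

log_x = [math.log(v) for v in x]
sqrt_x = [math.sqrt(v) for v in x]       # [1.0, 2.0, 3.0, 4.0]
sq_x = [v ** 2 for v in x]
neg_recip_x = [-1.0 / v for v in x]
print(sqrt_x)
```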

Another consideration: if you’re predicting the impact of violent crime on a city’s median family income, instead of using the number of violent crimes committed in the city, you might instead divide it by the city’s population and come up with a per-capita figure. That will give more useful insights into the incidence of crime in the city.

Transforming data in these ways helps reduce multicollinearity by representing independent variables differently, so that they are less correlated with other independent variables.

Limits of Data Transformation

Transforming data has its own pitfalls. First, transforming data also transforms the model. A model that uses a per-capita crime figure for an independent variable has a very different interpretation than one using an aggregate crime figure. Also, interpretations of models and their results get more complicated as data is transformed. Ideally, models are supposed to be parsimonious – that is, they explain a great deal about the relationship as simply as possible. Typically, parsimony means as few independent variables as possible, but it also means as few transformations as possible. You also need to do more work. If you try to plug in new data to your resulting model for forecasting, you must remember to take the values for your data and transform them accordingly.

Living With Multicollinearity

Multicollinearity is par for the course when a model consists of two or more independent variables, so often the question isn’t whether multicollinearity exists, but rather how severe it is. Multicollinearity doesn’t bias your parameter estimates, but it inflates their variance, making them inefficient or untrustworthy. As you have seen from the remedies offered in this post, the cures can be worse than the disease. Correcting multicollinearity can also be an iterative process; the benefit of reducing multicollinearity may not justify the time and resources required to do so. Sometimes, any effort to reduce multicollinearity is futile. Generally, for the purposes of forecasting, it might be perfectly OK to disregard the multicollinearity. If, however, you’re using regression analysis to explain relationships, then you must try to reduce the multicollinearity.

A good approach is to run a couple of different models, some using variations of the remedies we’ve discussed here, and comparing their degree of multicollinearity with that of the original model. It is also important to compare the forecast accuracy of each. After all, if all you’re trying to do is forecast, then a model with slightly less multicollinearity but a higher degree of forecast error is probably not preferable to a more precise forecasting model with higher degrees of multicollinearity.

The Takeaways:

1. Where you have multiple regression, you almost always have multicollinearity, especially in time series data.
2. A correlation matrix is a good way to detect multicollinearity. Multicollinearity can be very serious if the correlation matrix shows that some of the independent variables are more highly correlated with each other than they are with the dependent variable.
3. You should suspect multicollinearity if:
   1. You have a high R² but low t-statistics;
   2. The sign of a coefficient is opposite of what is normally expected (a relationship that should be positive is negative, and vice-versa).
4. Multicollinearity doesn’t bias parameter estimates, but makes them untrustworthy by enlarging their variance.
5. There are several ways of remedying multicollinearity, with obtaining more data often being the best approach. Each remedy for multicollinearity contributes a new set of problems and limitations, so you must weigh the benefit of reduced multicollinearity against the time and resources needed to achieve it, and the resulting impact on your forecast accuracy.
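Takeaway 2’s correlation matrix check can be sketched in pure Python. The two predictors below are hypothetical; a pairwise correlation near ±1 between predictors is the warning sign:

```python
# Pearson correlation between two predictors, for building a
# correlation matrix to detect multicollinearity.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical predictors that move together almost perfectly:
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]
print(round(pearson(x1, x2), 3))  # close to 1, flagging collinearity
```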

Next Forecast Friday Topic: Autocorrelation

These past two weeks, we discussed the problem of multicollinearity. Next week, we will discuss the problem of autocorrelation – the phenomenon that occurs when we violate the assumption that the error terms are not correlated with each other. We will discuss how to detect autocorrelation, discuss in greater depth the Durbin-Watson statistic’s use as a measure of the presence of autocorrelation, and how to correct for autocorrelation.

*************************

If you Like Our Posts, Then “Like” Us on Facebook and Twitter!

Analysights is now doing the social media thing! If you like Forecast Friday – or any of our other posts – then we want you to “Like” us on Facebook! By “Like-ing” us on Facebook, you’ll be informed every time a new blog post has been published, or when other information comes out. Check out our Facebook page! You can also follow us on Twitter.

### Multiple Regression: Specification Bias

July 1, 2010

(Eleventh in a series)

In last week’s Forecast Friday post, we discussed several of the important checks you must do to ensure that your model is valid. You always want to be sure that your model does not violate the assumptions we discussed earlier. Today we are going to see what happens when we violate the specification assumption, which says that we do not omit relevant independent variables from our regression model. You will see that when we leave out an important independent variable from a regression model, quite misleading results can emerge. You will also see that violating one assumption can trigger violations of other assumptions.

Revisiting our Multiple Regression Example

Recall our data set of 25 annual observations of U.S. Savings and Loan profit margin data, shown in the table below:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|------|-------------------------------|---------------------------------------|-------------------------|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

Also, recall that we built a model that hypothesized that S&L percentage profit margin (our dependent variable, Yt) was positively related to net revenues per deposit dollar (one of our independent variables, X1t), and negatively related to the number of S&L offices (our other independent variable, X2t). When we ran our regression, we got the following model:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

We also checked to see if the model parameters were significant, and obtained the following information:

| Parameter | Value | T-Statistic | Significant? |
|-----------|-------------|-------------|--------------|
| Intercept | 1.5645000 | 19.70 | Yes |
| B1t | 0.2372000 | 4.27 | Yes |
| B2t | (0.0002490) | (7.77) | Yes |

We also had a coefficient of determination – R2 – of 0.865, indicating that the model explains about 86.5% of the variation in S&L percentage profit margin.

Welcome to the World of Specification Bias…

Let’s deliberately leave out the number of S&L offices (X2t) from our model, and do just a simple regression with the net revenues per deposit dollar. This is the model we get:

Yt = 1.32616 – 0.16913X1t

We also get an R2 of 0.495. The t-statistics for our intercept and parameter B1t are as follows:

| Parameter | Value | T-Statistic | Significant? |
|-----------|-----------|-------------|--------------|
| Intercept | 1.32616 | 9.57 | Yes |
| B1t | (0.16913) | (4.75) | Yes |

Compare these new results with our previous results and what do you notice? The results of our second regression are in sharp contrast to those of our first regression. Our new model has far less explanatory power – R2 dropped from 0.865 to 0.495 – and the sign of the parameter estimate for net revenue per deposit dollar has changed: The coefficient of X1t was significant and positive in the first model, and now it is significant and negative! As a result, we end up with a biased regression model.
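The sign flip is easy to reproduce. Below is a Python sketch (not code from the original post) that fits both regressions on the tabled S&L data with NumPy’s least-squares routine; the coefficient on net revenues per deposit dollar should come out positive in the full model and negative in the misspecified one:

```python
import numpy as np

# S&L data from the table above (25 annual observations)
Y  = np.array([0.75, 0.71, 0.66, 0.61, 0.70, 0.72, 0.77, 0.74, 0.90, 0.82,
               0.75, 0.77, 0.78, 0.84, 0.79, 0.70, 0.68, 0.72, 0.55, 0.63,
               0.56, 0.41, 0.51, 0.47, 0.32])
X1 = np.array([3.92, 3.61, 3.32, 3.07, 3.06, 3.11, 3.21, 3.26, 3.42, 3.42,
               3.45, 3.58, 3.66, 3.78, 3.82, 3.97, 4.07, 4.25, 4.41, 4.49,
               4.70, 4.58, 4.69, 4.71, 4.78])
X2 = np.array([7298, 6855, 6636, 6506, 6450, 6402, 6368, 6340, 6349, 6352,
               6361, 6369, 6546, 6672, 6890, 7115, 7327, 7546, 7931, 8097,
               8468, 8717, 8991, 9179, 9318], dtype=float)

# Full model: Y on X1 and X2
A_full = np.column_stack([np.ones_like(Y), X1, X2])
b_full, *_ = np.linalg.lstsq(A_full, Y, rcond=None)

# Misspecified model: Y on X1 only (X2 deliberately omitted)
A_small = np.column_stack([np.ones_like(Y), X1])
b_small, *_ = np.linalg.lstsq(A_small, Y, rcond=None)

print(b_full)   # intercept, b1 (positive), b2 (negative)
print(b_small)  # intercept, b1 now negative
```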

… and to the Land of Autocorrelation…

Recall another of the regression assumptions: that error terms should not be correlated with one another. When error terms are correlated with one another, we end up with autocorrelation, which renders our parameter estimates inefficient. Recall that last week, we computed the Durbin-Watson test statistic, d, which is an indicator of autocorrelation. It is bad to have either positive autocorrelation (d close to zero), or negative autocorrelation (d close to 4). Generally, we want d to be approximately 2. In our first model, d was 1.95, so autocorrelation was pretty much nonexistent. In our second model, d=0.85, suggesting the presence of significant positive autocorrelation!
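The Durbin-Watson statistic itself is simple to compute from a model’s residuals: it is the sum of squared successive residual differences divided by the sum of squared residuals. Here is a minimal Python sketch (my own helper, not code from the post):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive residual differences
    divided by the sum of squared residuals. d near 0 suggests positive
    autocorrelation, near 4 negative autocorrelation, near 2 none."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Slowly drifting residuals (each close to its predecessor): d well below 2
print(durbin_watson([1.0, 1.1, 0.9, 1.0, 1.2]))

# Residuals that alternate sign every period: d well above 2
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0]))
```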

How did this happen? Basically, when an important variable is omitted from regression, its impact on the dependent variable gets incorporated into the error term. If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.

…Which Leads to Yet Another Violation!

The presence of autocorrelation in our second regression reveals the presence of another violation – not in the incomplete regression, but in the full regression. Recall the sentence above: “If the omitted independent variable is correlated with any of the included independent variables, the error terms will also be correlated.” Remember the other assumption: no linear relationship between two or more independent variables? Essentially, the detection of autocorrelation in the incomplete regression reveals that the full regression violates this very assumption – it exhibits multicollinearity! Generally, a coefficient that flips sign (in either direction) when one or more variables is omitted is an indicator of multicollinearity.

So was the full regression wrong too? Not terribly. As you will find in upcoming posts, avoiding multicollinearity is nearly impossible, especially with time series data. That’s because multicollinearity is typically a data problem. The severity of multicollinearity can often be reduced by increasing the number of observations in the data set. This is often not a problem with cross-sectional data, where data sets can have thousands, if not millions of observations. However, with time series data, the number of observations available is limited to how many periods of data have been recorded.

Moreover, the longer your time series, the more you risk structural changes in your data over the course of your time series. For instance, if you were examining annual patterns in bank lending within a particular census tract between 1990 and 2010, you might have a reliable model to work with. But let’s say you widen your time series to go back as far as 1970. You will see dramatic shifts in patterns in your data set. That’s because prior to 1977, when Congress passed the Community Reinvestment Act, many banks engaged in a practice called “redlining,” where they literally drew red lines around some neighborhoods, usually where minorities and low-income households were, and did not lend there. In this case, increasing the size of the data set might reduce multicollinearity, but actually cause other modeling problems.

And as you’ve probably guessed, one way of reducing multicollinearity can be dropping variables from the regression. But look what happened when we dropped the number of S&L offices from our regression: we might have eliminated multicollinearity, but we gained autocorrelation and specification bias!

Bottom Line:

The lesson for us as forecasters and analysts, therefore, is that we must accept that models are far from perfect, and we must weigh the impact of various regression model specifications. Is the multicollinearity present in our model tolerable? Can we add more observations without causing new problems? Can we drop a variable from a regression without causing either specification bias or material differences in explanatory power, parameter estimates, model validity, or forecast accuracy? Building the model is easy – it’s these normative considerations that are challenging.

Next Forecast Friday Topic: Building Regression Models Using Excel

In next week’s Forecast Friday post, we will take a break from discussing the theory of regression analysis and look at a demonstration of how to use the “Regression Analysis” tool in Microsoft Excel. This demonstration is intended to show you how easy running a regression is, so that you can start applying the concepts and building forecasts for your business. Until then, thanks again for reading Forecast Friday, and I wish you and your family a great 4th of July weekend!

*************************


### Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the prices of competitors’ products, the time of the year (if our product is seasonal), and the demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will spend the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but it is quite easy to forget if not used often.

Another thing to note: regression analysis is used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables in static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or a statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.

Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. Also, the coefficients – your parameter estimates – will have subscripts corresponding to their respective independent variables. Hence, your model will take the form:

Yi = α + β1X1i + β2X2i + εi

Now, how do you go about computing α, β1, and β2? The process is similar to that of a two-variable model, but a little more involved. Working in deviations from the means (e.g., x1i = X1i – X̄1, yi = Yi – Ȳ), the estimates are:

β1 = (Σx2i² · Σx1iyi – Σx1ix2i · Σx2iyi) / (Σx1i² · Σx2i² – (Σx1ix2i)²)

β2 = (Σx1i² · Σx2iyi – Σx1ix2i · Σx1iyi) / (Σx1i² · Σx2i² – (Σx1ix2i)²)

α = Ȳ – β1X̄1 – β2X̄2

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.
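The deviation-form normal equations for the two-predictor case translate directly into code. Below is a Python sketch of my own (the post works these out by hand or in a statistics package, not in code):

```python
import numpy as np

def two_var_ols(y, x1, x2):
    """Estimate a, b1, b2 for Y = a + b1*X1 + b2*X2 using the
    deviation-from-mean normal equations for two predictors."""
    y, x1, x2 = (np.asarray(v, dtype=float) for v in (y, x1, x2))
    yd, d1, d2 = y - y.mean(), x1 - x1.mean(), x2 - x2.mean()

    # Sums of squares and cross-products of the deviations
    s11, s22, s12 = d1 @ d1, d2 @ d2, d1 @ d2
    s1y, s2y = d1 @ yd, d2 @ yd

    denom = s11 * s22 - s12 ** 2   # zero if X1 and X2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / denom
    b2 = (s11 * s2y - s12 * s1y) / denom
    a = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return a, b1, b2
```

Note that both cross-products s12 and the other variable’s sums appear in each coefficient, which is the point made just below.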

What do you notice about the formulas for computing β1 and β2? First, you notice that both independent variables, X1 and X2, enter the calculation for each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are likely to be related linearly as well – though, as we will see shortly, they must not be perfectly related. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to those variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 will be incorrect because X1 has some relationship with X2. Moreover, X2‘s values are likely to be accounted for in the error terms, and because of its relationship with X1, X1 will be correlated with the error term, violating the second assumption above. Hence, you will end up with incorrect, biased estimators for your regression coefficient, β1.

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, which reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable – both dependent and independent – you are considering, you should have at least five data points. So, for a model with three independent variables (plus the dependent variable), your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re doing a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables. Every time you use a particular value of Fahrenheit temperature, you will get the same value of Celsius temperature. In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. A relationship between independent variables need not be perfectly linear for multicollinearity to exist. Highly correlated variables can do the same thing. For example, independent variables such as “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage” are examples of independent variables that are highly correlated.
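You can see the Celsius/Fahrenheit problem numerically: with a perfectly linear relationship between two predictors, the design matrix loses a rank and the normal equations have no unique solution. A quick sketch (my own illustration, not from the post):

```python
import numpy as np

celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahrenheit = celsius * 9 / 5 + 32   # an exact linear function of celsius

# Design matrix with intercept, Celsius, and Fahrenheit columns
A = np.column_stack([np.ones(5), celsius, fahrenheit])

# The matrix has rank 2, not 3: one column is redundant, so no unique
# least-squares solution exists for three separate parameters
print(np.linalg.matrix_rank(A))  # 2
```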

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.

Now Can We Get Into Multiple Regression????

Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. This data set is 25 annual observations to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X1) and number of offices (X2). The data are as follows:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|------|-------------------------------|---------------------------------------|-------------------------|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S&L is making more revenue on the same dollar, it suggests greater efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up?  They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.
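To make the interpretation concrete, here is how the fitted equation would be used to score an S&L. The input values below (X1 = 4.00, X2 = 7,000) are my own hypothetical illustration, not from the data:

```python
def predict_margin(net_rev_per_dollar, offices):
    """Fitted model from the post: Yt = 1.56450 + 0.23720*X1t - 0.000249*X2t."""
    return 1.56450 + 0.23720 * net_rev_per_dollar - 0.000249 * offices

# Hypothetical scenario: $4.00 net revenue per deposit dollar, 7,000 offices
print(round(predict_margin(4.00, 7000), 4))  # 0.7703
```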

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R2, and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.
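R2 itself is straightforward to compute from a model’s fitted values: it is one minus the ratio of residual variation to total variation. A minimal sketch (my helper, not the post’s code):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: the share of variation in y
    explained by the fitted values y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)           # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total variation
    return 1.0 - ss_res / ss_tot
```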

Are the Parameter Estimates Statistically Significant?

We have 25 observations and three parameters – two coefficients for the independent variables and one intercept – hence we have 22 degrees of freedom (25-3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the resulting confidence intervals would contain the true parameters approximately 95 times. To judge significance, we need to look at the t-values for each parameter estimate. For a two-tailed test at the 95% level with 22 degrees of freedom, our critical t-value is 2.074. That means a parameter estimate is statistically significant if its t-statistic is greater than 2.074 or less than -2.074; the sign of the estimate then tells us whether the relationship is positive or negative. This is what we get:

| Parameter | Value | T-Statistic | Significant? |
|-----------|-------------|-------------|--------------|
| Intercept | 1.5645000 | 19.70 | Yes |
| B1t | 0.2372000 | 4.27 | Yes |
| B2t | (0.0002490) | (7.77) | Yes |

So, yes, all our parameter estimates are significant.
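The 2.074 critical value comes straight from the t distribution; if you would rather not look it up in a table, SciPy can produce it (SciPy is my choice of tool here, not the post’s):

```python
from scipy import stats

# Two-tailed 95% test: 2.5% in each tail, with 25 - 3 = 22 degrees of freedom
t_crit = stats.t.ppf(0.975, df=22)
print(round(t_crit, 3))  # 2.074
```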

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished: we will continue with further analysis of this example next week, discussing the 95% confidence intervals for the parameter estimates, determining whether the model is valid, and checking for autocorrelation. The following Forecast Friday post (July 1) will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.