## Posts Tagged ‘time series’

### Forecast Friday Topic: The Autocorrelation Function

January 6, 2011

(Thirty-fourth in a series)

Today, we begin a six-week discussion on the use of Autoregressive Integrated Moving Average (ARIMA) models in forecasting. ARIMA models were popularized by George Box and Gwilym Jenkins in the 1970s, and were traditionally known as Box-Jenkins analysis. The purpose of ARIMA methods is to fit a stochastic (randomly determined) model to a given set of time series data, such that the model can closely approximate the process that is actually generating the data.

There are three main steps in ARIMA methodology: identification, estimation and diagnostic checking, and then application. Before undertaking these steps, however, an analyst must be sure that the time series is stationary. That is, the covariance between any two values of the time series is dependent upon only the time interval between those particular values and not on their absolute location in time.

Determining whether a time series is stationary requires the use of an autocorrelation function (ACF), also called a correlogram, which is the topic of today’s post. Next Thursday, we will go into a full discussion on stationarity and how the ACF is used to determine whether a series is stationary.

Autocorrelation Revisited

Did someone say, “autocorrelation?” Yes! Remember our discussions about detecting and correcting autocorrelation in regression models in our July 29, 2010 and August 5, 2010 Forecast Friday posts? Recall that one of the ways we corrected for autocorrelation was by lagging the dependent variable by one period and then using the lagged variable as an independent variable. Anytime we lag a regression model’s dependent variable and then use it as an independent variable to predict a subsequent period’s dependent variable value, our regression model becomes an autoregressive model.

In regression analysis, we used autoregressive models to correct for autocorrelation. Yet we can use – and have used – the autoregressive model to represent the behavior of the time series we’re observing.

When we lag a dependent variable by one period, our model is said to be a first-order autoregressive model. A first-order autoregressive model is denoted as:

$$X_t = C + \phi_1 X_{t-1} + a_t$$

Where $\phi_1$ is the parameter for the autoregressive term lagged by one period; $a_t$ is a random variable with a mean of zero and constant variance at time period t; and C is a value that allows for the fact that the time series $X_t$ can have a nonzero mean. In fact, you can easily see that this formula mimics a regression equation, with $a_t$ playing the role of the residuals, $X_t$ the dependent variable, C the intercept (alpha), and $X_{t-1}$ the independent variable with coefficient $\phi_1$. In essence, a first-order autoregressive model forecasts the next period’s value from the most recent value.

What if you want to base next period’s forecast on the two most recent values? Then you lag by two periods, and have a second-order autoregressive model, which is denoted by:

$$X_t = C + \phi_1 X_{t-1} + \phi_2 X_{t-2} + a_t$$

In fact, you can use any number of past periods to predict the next period. The formula below shows an autoregressive model of order p, where p is the number of past periods whose values you use to predict the next period’s value:

$$X_t = C + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + a_t$$

This review of autocorrelation will help you out in the next section, when we begin to discuss the ACF.
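To make the first-order autoregressive idea concrete, here is a minimal sketch that estimates C and φ1 by ordinary least squares and then forecasts one period ahead. The function names `fit_ar1` and `forecast_next` and the noise-free demo series are assumptions made for illustration; they are not from the original post, and a real analysis would normally use a statistics package.

```python
def fit_ar1(series):
    """Estimate C and phi_1 in X_t = C + phi_1 * X_{t-1} + a_t by least squares."""
    lagged = series[:-1]    # X_{t-1}: the independent variable
    current = series[1:]    # X_t: the dependent variable
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_y = sum(current) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current))
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    phi1 = sxy / sxx
    c = mean_y - phi1 * mean_x
    return c, phi1


def forecast_next(series, c, phi1):
    """One-step-ahead forecast from the most recent observation."""
    return c + phi1 * series[-1]


# Hypothetical noise-free series generated by X_t = 2 + 0.5 * X_{t-1}, starting at 10
demo = [10.0]
for _ in range(7):
    demo.append(2 + 0.5 * demo[-1])

c, phi1 = fit_ar1(demo)
next_value = forecast_next(demo, c, phi1)
print(round(c, 4), round(phi1, 4), round(next_value, 4))
```

Because the demo series has no random term, the fit recovers the generating parameters; with real data the estimates would carry sampling error.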

The Autocorrelation Function (ACF)

The ACF is a plot of the autocorrelations between the data points in a time series, and is the key statistic in time series analysis. Each autocorrelation is the correlation of the time series with itself, lagged by a certain number of periods. The formula for each lag of an ACF is given by:

$$r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2}$$

Where $r_k$ is the autocorrelation at lag k, n is the number of observations, and $\bar{Y}$ is the mean of the series. If k=1, $r_1$ shows the correlation between successive values of Y; if k=2, then $r_2$ would denote the correlation between Y values two periods apart, and so on. Plotting the autocorrelation at each lag gives us our ACF.

Let’s assume we have 48 months of data, as shown in the following table:

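| Month (Year 1) | Value | Month (Year 2) | Value | Month (Year 3) | Value | Month (Year 4) | Value |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 13 | 41 | 25 | 18 | 37 | 51 |
| 2 | 20 | 14 | 63 | 26 | 93 | 38 | 20 |
| 3 | 31 | 15 | 17 | 27 | 80 | 39 | 65 |
| 4 | 8 | 16 | 96 | 28 | 36 | 40 | 45 |
| 5 | 40 | 17 | 68 | 29 | 4 | 41 | 87 |
| 6 | 41 | 18 | 27 | 30 | 23 | 42 | 68 |
| 7 | 46 | 19 | 41 | 31 | 81 | 43 | 36 |
| 8 | 89 | 20 | 17 | 32 | 47 | 44 | 31 |
| 9 | 72 | 21 | 26 | 33 | 61 | 45 | 79 |
| 10 | 45 | 22 | 75 | 34 | 27 | 46 | 7 |
| 11 | 81 | 23 | 63 | 35 | 13 | 47 | 95 |
| 12 | 93 | 24 | 93 | 36 | 25 | 48 | 37 |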
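As a rough sketch of how the autocorrelations for this table could be computed, the snippet below applies the usual sample autocorrelation estimator (lagged cross-products of deviations from the mean, divided by the total sum of squared deviations) to the 48 monthly values. The helper name `autocorrelation` is an assumption for illustration, and the post does not say which software produced its figures, so the printed values may differ somewhat from the ones it reports.

```python
# The 48 monthly values from the table above, in time order (months 1 to 48)
values = [
    1, 20, 31, 8, 40, 41, 46, 89, 72, 45, 81, 93,     # Year 1
    41, 63, 17, 96, 68, 27, 41, 17, 26, 75, 63, 93,   # Year 2
    18, 93, 80, 36, 4, 23, 81, 47, 61, 27, 13, 25,    # Year 3
    51, 20, 65, 45, 87, 68, 36, 31, 79, 7, 95, 37,    # Year 4
]


def autocorrelation(series, k):
    """Sample autocorrelation r_k of a series at lag k."""
    n = len(series)
    mean = sum(series) / n
    # Cross-products of deviations k periods apart
    numerator = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    # Total sum of squared deviations
    denominator = sum((y - mean) ** 2 for y in series)
    return numerator / denominator


# Autocorrelations for lags 1 through 24
acf = [autocorrelation(values, k) for k in range(1, 25)]
for k, r in enumerate(acf, start=1):
    print(f"r{k:<2} = {r:+.3f}")
```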

As decision makers, we want to know whether this data series exhibits a pattern, and the ACF is the means to this end. If no pattern is discerned in this data series, then the series is said to be “white noise.” As you know from our regression analysis discussions our residuals must not exhibit a pattern. Hence, our residuals in regression analysis needed to be white noise. And as you will see in our later discussions on ARIMA methods, the residuals become very important in the estimation and diagnostic checking phase of the ARIMA methodology.

Sampling Distribution of Autocorrelations

Autocorrelations of a white noise series tend to have sampling distributions that are normally distributed, with a mean of zero and a standard error of 1/√n; that is, the standard error is simply the reciprocal of the square root of the sample size. If the series is white noise, approximately 95% of the autocorrelation coefficients will fall within two (actually, 1.96) standard errors of zero; if they don’t, then the series is not white noise and a pattern does indeed exist. With our 48 observations, the standard error is 1/√48 ≈ 0.144, so the limits are about ±0.283.

To see if our ACF exhibits a pattern, we look at our individual rk values separately and develop a standard error formula to test whether each value for rk is statistically different from zero. We do this by plotting our ACF:

The plot shows the autocorrelations (in blue) for the first 24 lags of the series. The dashed red lines mark ±1.96 standard errors. If one or more autocorrelations pierce those dashed lines, then they are significantly different from zero and the series is not white noise. As you can see, this series is white noise.

Specifically the values for the first six lags are:

| Lag | Value |
|---|---|
| $r_1$ | 0.022 |
| $r_2$ | 0.098 |
| $r_3$ | -0.049 |
| $r_4$ | -0.036 |
| $r_5$ | 0.015 |
| $r_6$ | -0.068 |
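The significance check described above can be sketched in a few lines. This example takes the six reported autocorrelations and the sample size of 48, computes the ±1.96/√n limits, and lists any lags that fall outside them. The names `white_noise_limit` and `reported` are illustrative assumptions.

```python
import math


def white_noise_limit(n, z=1.96):
    """Approximate 95% limit for autocorrelations of a white noise series of length n."""
    return z / math.sqrt(n)


# First six autocorrelations as reported in the table above
reported = {1: 0.022, 2: 0.098, 3: -0.049, 4: -0.036, 5: 0.015, 6: -0.068}

limit = white_noise_limit(48)
# Lags whose autocorrelation falls outside the limits
significant = [k for k, r in reported.items() if abs(r) > limit]
print(f"limit = +/-{limit:.3f}; significant lags: {significant}")
```

None of the six reported values comes close to the limit of roughly 0.283, which agrees with the white noise reading of the plot.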

Apparently, there is no discernible pattern in the data: successive values are only minimally correlated ($r_1$ = 0.022); in fact, there’s a higher, though still negligible, correlation between values two periods apart ($r_2$ = 0.098).

Portmanteau Tests

In the example above, we looked at each individual lag. An alternative to this would be to examine a whole set of rk values, say the first 10 of them (r1 to r10) all at once and then test to see whether the set is significantly different from a zero set. Such a test is known as a portmanteau test, and the two most common are the Box-Pierce test and the Ljung-Box Q* statistic. We will discuss both of them here.

The Box-Pierce Test

Here is the Box-Pierce formula:

$$Q = n \sum_{k=1}^{h} r_k^2$$

Q is the Box-Pierce test statistic, which we will compare against the χ² distribution; n is the total number of observations; h is the maximum lag we are considering (24 in the ACF plot).

Essentially, the Box-Pierce test indicates that if residuals are white noise, the Q-statistic follows a χ² distribution with (h – m) degrees of freedom. If a model is fitted, then m is the number of parameters. However, no model is fitted here, so our m=0. If each $r_k$ value is close to zero, then Q will be very small; otherwise, if some $r_k$ values are large – either negatively or positively – then Q will be relatively large. We will compare Q to the χ² distribution, just like any other significance test.

Since we plotted 24 lags, we are interested in only the $r_k^2$ values for the first 24 lags (not shown). Our Q statistic is:

We have 24 degrees of freedom, and so we compare our Q statistic to the χ² distribution. Our critical χ² value for a 1% significance level is 42.98, well above our Q statistic, leading us to conclude that our chosen set of $r_k$ values is not significantly different from a zero set.
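Here is a small sketch of the Box-Pierce calculation. Since the post does not list all 24 autocorrelations, this example uses only the six reported values (so h = 6 rather than 24), and the result is therefore not the post's Q statistic. The function name `box_pierce` is an assumption for illustration.

```python
def box_pierce(acf_values, n):
    """Box-Pierce Q: n times the sum of squared autocorrelations for lags 1..h."""
    return n * sum(r ** 2 for r in acf_values)


# Only the first six reported autocorrelations (h = 6), with n = 48 observations
first_six = [0.022, 0.098, -0.049, -0.036, 0.015, -0.068]
q = box_pierce(first_six, n=48)
print(f"Q (h=6) = {q:.3f}")
```

The statistic would then be compared with a χ² critical value for h – m degrees of freedom, taken from a table or a statistics library.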

The Ljung-Box Q* Statistic

In 1978, Ljung and Box proposed an alternative, the Q* statistic, whose distribution they believed to be more closely approximated by the χ² distribution than that of the Box-Pierce Q statistic. The formula for the Ljung-Box Q* statistic is:

$$Q^* = n(n+2) \sum_{k=1}^{h} \frac{r_k^2}{n-k}$$

Applying this to our $r_k^2$ values, we get Q* = 24.92. Comparing this to the same critical χ² value, our result is still not significant. If the data are white noise, then the Q* and Q statistics will both have the same distribution. It’s important to note, however, that portmanteau tests tend to fail to reject poorly fitting models, so you shouldn’t rely solely on them for accepting models.
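The Ljung-Box statistic can be sketched the same way. As before, only the six reported autocorrelations are available, so this is an illustration with h = 6 and does not reproduce the post's value of 24.92 for 24 lags. The function name `ljung_box` is an assumption.

```python
def ljung_box(acf_values, n):
    """Ljung-Box Q*: n(n+2) times the sum of r_k^2 / (n - k) for lags 1..h."""
    return n * (n + 2) * sum(
        r ** 2 / (n - k) for k, r in enumerate(acf_values, start=1)
    )


# Only the first six reported autocorrelations (h = 6), with n = 48 observations
first_six = [0.022, 0.098, -0.049, -0.036, 0.015, -0.068]
q_star = ljung_box(first_six, n=48)
# Box-Pierce Q on the same values, for comparison
q = 48 * sum(r ** 2 for r in first_six)
print(f"Q* (h=6) = {q_star:.3f}, Q (h=6) = {q:.3f}")
```

Q* comes out larger than Q for the same inputs, since each squared autocorrelation is weighted by (n+2)/(n–k), which is greater than 1.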

The Partial Autocorrelation Coefficient

When we do multiple regression analysis, we are sometimes interested in finding out how much explanatory power one variable has by itself. To do this, we partial out the effects of the other independent variables, isolating the variable whose explanatory power we are interested in. We can do something similar in time series analysis, with the use of partial autocorrelations.

Partial autocorrelations measure the degree of association between various lags when the effects of other lags are removed. If the autocorrelation between $Y_t$ and $Y_{t-1}$ is significant, then we will also see a similar significant autocorrelation between $Y_{t-1}$ and $Y_{t-2}$, as they are also just one period apart. Since $Y_t$ and $Y_{t-2}$ are both correlated with $Y_{t-1}$, they are also correlated with each other; so, by removing the effect of $Y_{t-1}$, we can measure the true correlation between $Y_t$ and $Y_{t-2}$.

A partial autocorrelation coefficient of order k, which is denoted by $\alpha_k$, is determined by regressing the current time series value on its lagged values:

$$Y_t = b_0 + b_1 Y_{t-1} + b_2 Y_{t-2} + \dots + b_k Y_{t-k} + e_t$$

As I mentioned earlier, this form of equation is an autoregressive (AR) one, since its independent variables are time-lagged values of the dependent variable. We use this multiple regression to find the partial autocorrelation $\alpha_k$, which is the coefficient $b_k$ on the longest lag. If we regress $Y_t$ only against $Y_{t-1}$, then the coefficient on $Y_{t-1}$ gives us $\alpha_1$. If we regress $Y_t$ against both $Y_{t-1}$ and $Y_{t-2}$, then the coefficient on $Y_{t-2}$ gives us $\alpha_2$, and so on.

Then, as we did for the autocorrelation coefficients, we plot our partial autocorrelation coefficients. This plot is called, not surprisingly, a partial autocorrelation function (PACF).

Let’s assume we wanted to measure the partial autocorrelations for the first 12 lags of our data series. We generate the following PACF:

Since all the partial autocorrelations fall within ±1.96 standard errors, our PACF is also indicative of a white noise series. Also, note that $\alpha_1$ in the PACF is always equal to $r_1$ in the ACF.
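As a sketch of how partial autocorrelations can be obtained, the snippet below uses the Durbin-Levinson recursion, which derives them from the autocorrelations instead of running the successive regressions described above. This is a swapped-in technique chosen to keep the example dependency-free; estimates from the regression approach can differ slightly in finite samples. The function name `pacf_from_acf` and the geometric test input are assumptions for illustration.

```python
def pacf_from_acf(acf_values):
    """Partial autocorrelations from autocorrelations via the Durbin-Levinson recursion.

    acf_values[k - 1] holds r_k; the result holds alpha_1 .. alpha_K.
    """
    r = [1.0] + list(acf_values)   # r[0] = 1 by definition
    pacf = []
    prev = []                      # coefficients phi_{k-1, 1..k-1}
    for k in range(1, len(r)):
        if k == 1:
            phi_kk = r[1]          # alpha_1 equals r_1
            current = [phi_kk]
        else:
            num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
            den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
            phi_kk = num / den
            # Update the lower-order coefficients, then append the new one
            current = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf


# Hypothetical autocorrelations that decay geometrically (r_k = 0.5 ** k)
geometric = pacf_from_acf([0.5, 0.25, 0.125])

# The six autocorrelations reported earlier in the post
reported = pacf_from_acf([0.022, 0.098, -0.049, -0.036, 0.015, -0.068])
print([round(a, 3) for a in geometric])
print([round(a, 3) for a in reported])
```

For the geometric input, only the first partial autocorrelation is nonzero, which is the signature of a first-order autoregressive process.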

Seasonality

Our data series exhibited no pattern, despite its monthly nature. This is unusual for monthly time series, especially when you consider retail sales data. Monthly retail sales will exhibit a strong seasonal component, which will show up in your ACF at the seasonal lag. The $r_k$ value at that lag will break through the critical value line, not only at that lag but also at multiples of it. So, if sales are busiest in month 12, you can expect to see ACFs with significant autocorrelations at lags 12, 24, 36, and so on. You’ll see examples of this in subsequent posts on ARIMA.

Next Forecast Friday Topic: Stationarity of Time Series Data

As mentioned earlier, a time series must be stationary for forecasting. Next week, you’ll see how the ACF and PACF are used to determine whether a time series exhibits stationarity, as we move on towards our discussion of ARIMA methodology.

*************************

For the latest insights on marketing research, predictive modeling, and forecasting, be sure to check out Analysights on Facebook and Twitter! “Like-ing” us on Facebook and following us on Twitter will allow you to stay informed of each new Insight Central post published, new information about analytics, discussions Analysights will be hosting, and other opportunities for feedback. So get this New Year off right and check us out on Facebook and Twitter!

### Forecast Friday Topic: Leading Indicators and Surveys of Expectations

December 9, 2010

(Thirty-second in a series)

Most of the forecasting methods we have discussed so far deal with generating forecasts for a steady-state scenario. Yet the nature of the business cycle is such that there are long periods of growth, long periods of declines, and periods of plateau. Many managers and planners would love to know how to spot the moment when things are about to change for better or worse. Spotting these turning points can be difficult given standard forecasting procedures; yet being able to identify when business activity is going to enter a prolonged period of expansion or a protracted decline can greatly enhance managerial and organizational planning. Two of the most common ways managers anticipate turning points in a time series include leading economic indicators and surveys of expectations. This post discusses both.

Nobody has a crystal ball. Yet, some time series exhibit patterns that foreshadow economic activity to come. Quite often, when activity turns positive in one time series, months later it triggers a corresponding response in the broader economy. When movements in a time series seem to anticipate coming economic activity, the time series is said to be a leading economic indicator. When a time series moves in tandem with economic activity, the time series is said to be a coincident economic indicator; and when movements within a particular time series trail economic activity, the time series is said to be a lagging indicator. Economic indicators are nothing new. The ancient Phoenicians, whose empire was built on trading, often used the number of ships arriving in port as an indicator of trading and economic activity.

Economic indicators can be procyclic – that is, they increase as economic activity increases and decrease when economic activity decreases; or countercyclic – meaning they decline when the economy is improving and increase when the economy is declining; or they can be acyclic, having little or no correlation at all with the broader economy. Acyclic indicators are rare, and usually are relegated to subsectors of the economy, to which they are either procyclic or countercyclic.

Since 1961, the U.S. Department of Commerce has published the Survey of Current Business, which details monthly changes in leading indicators. The Conference Board publishes a composite index of 10 leading economic indicators, whose activity suggests changes in economic activity six to nine months into the future. Those 10 components include (reprinted from Investopedia.com):

1. the average weekly hours worked by manufacturing workers;
2. the average number of initial applications for unemployment insurance;
3. the amount of manufacturers’ new orders for consumer goods and materials;
4. the speed of delivery of new merchandise to vendors from suppliers;
5. the amount of new orders for capital goods unrelated to defense;
6. the amount of new building permits for residential buildings;
7. the S&P 500 stock index;
8. the inflation-adjusted monetary supply (M2);
9. the spread between long and short interest rates; and
10. consumer sentiment

These indicators are used to measure changes in the broader economy. Each industry or organization may have its own indicators of business activity. For your business, the choice of the time series to use as leading indicators and the weight each receives depend on several factors, including:

1. How well it tends to lead activity in your firm and industry;
2. How easy the time series is to measure accurately;
3. How well it conforms to the business cycle;
4. The time series’ overall performance, not just turning points;
5. Smoothness – no random blips that give misleading economic cues; and
6. Availability of data.

Over time, the use of specific indicators and their significance in forecasting do in fact change. You need to keep an eye on how well the indicators you select continue to foreshadow business activity in your industry.

Surveys of Expectations

Sometimes time series are not available for economic indicators. Changes in technology or social structure may not be readily picked up in the existing time series. Other times, consumer sentiment isn’t totally represented in the economic indicators. As a result, surveys are used to measure business optimism, or expectations of the future. Economists and business leaders are often surveyed for their opinions. Sometimes, it’s helpful to know if business leaders anticipate spending more money on equipment purchases in the coming year; whether they plan to hire or lay off workers; or whether they intend to expand. While what respondents to these surveys say and what they really do can be quite different, overall, the surveys can provide some direction as to which way the economy is heading.

Next Forecast Friday Topic: Calendar Effects in Forecasting

Easter can fall in March or April; every four years, February has an extra day; in some years, a given month has four weekends; in other years, five. These nuances can generate huge forecast errors. Next week’s Forecast Friday post discusses these calendar effects in forecasting and what you can do to adjust for them.


### Forecast Friday Topic: Forecasting with Seasonally-Adjusted Data

September 16, 2010

(Twenty-first in a series)

Last week, we introduced you to fictitious businesswoman Billie Burton, who puts together handmade gift baskets and care packages for customers. Billie is interested in forecasting gift basket orders so that she can get a better idea of how much time she’s going to need to set aside to assemble the packages; how many supplies to have on hand; how much revenue – and cost – she can expect; and whether she will need assistance. Gift-giving is seasonal, and Billie’s business is no exception. The Christmas season is Billie’s busiest season, and a few other months are much busier than others, so she must adjust for these seasonal factors before doing any forecasts.

Why is it important to adjust data for seasonal factors? Imagine trying to do regression analysis on monthly retail data that hasn’t been adjusted. If sales during the holiday season are much greater than at all other times of the year, there will be significant forecast errors in the model because the holiday period’s sales will be outliers. And regression analysis places greater weight on extreme values when trying to determine the least-squares equation.

Billie’s Orders, Revisited

Recall from last week that Billie has five years of monthly gift basket orders, from January 2005 to December 2009. The orders are shown again in the table below:

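Total gift basket orders:

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |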

Billie would like to forecast gift basket orders for the first four months of 2010, particularly February and April, for Valentine’s Day and Easter, two other busier-than-usual periods. Billie must first adjust her data.

When we decomposed the time series, we computed the seasonal adjustment factors for each month. They were as follows:

| Month | Factor |
|---|---|
| January | 0.78 |
| February | 1.53 |
| March | 0.89 |
| April | 1.13 |
| May | 0.65 |
| June | 0.67 |
| July | 0.55 |
| August | 1.00 |
| September | 0.91 |
| October | 0.62 |
| November | 1.53 |
| December | 1.75 |

Knowing these monthly seasonal factors, Billie adjusts her monthly orders by dividing each month’s orders by its respective seasonal factor (e.g., each January’s orders are divided by 0.78; each February’s orders by 1.53, and so on). Billie’s seasonally-adjusted data looks like this:

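| Month | Orders | Adjustment Factor | Seasonally Adjusted Orders | Time Period |
|---|---|---|---|---|
| Jan-05 | 15 | 0.78 | 19.28 | 1 |
| Feb-05 | 30 | 1.53 | 19.61 | 2 |
| Mar-05 | 25 | 0.89 | 28.15 | 3 |
| Apr-05 | 15 | 1.13 | 13.30 | 4 |
| May-05 | 13 | 0.65 | 19.93 | 5 |
| Jun-05 | 14 | 0.67 | 21.00 | 6 |
| Jul-05 | 12 | 0.55 | 21.81 | 7 |
| Aug-05 | 22 | 1.00 | 22.10 | 8 |
| Sep-05 | 20 | 0.91 | 21.93 | 9 |
| Oct-05 | 14 | 0.62 | 22.40 | 10 |
| Nov-05 | 35 | 1.53 | 22.89 | 11 |
| Dec-05 | 40 | 1.75 | 22.92 | 12 |
| Jan-06 | 18 | 0.78 | 23.13 | 13 |
| Feb-06 | 36 | 1.53 | 23.53 | 14 |
| Mar-06 | 18 | 0.89 | 20.27 | 15 |
| Apr-06 | 30 | 1.13 | 26.61 | 16 |
| May-06 | 16 | 0.65 | 24.53 | 17 |
| Jun-06 | 17 | 0.67 | 25.49 | 18 |
| Jul-06 | 14 | 0.55 | 25.44 | 19 |
| Aug-06 | 26 | 1.00 | 26.12 | 20 |
| Sep-06 | 24 | 0.91 | 26.32 | 21 |
| Oct-06 | 17 | 0.62 | 27.20 | 22 |
| Nov-06 | 42 | 1.53 | 27.47 | 23 |
| Dec-06 | 48 | 1.75 | 27.50 | 24 |
| Jan-07 | 22 | 0.78 | 28.27 | 25 |
| Feb-07 | 43 | 1.53 | 28.11 | 26 |
| Mar-07 | 22 | 0.89 | 24.77 | 27 |
| Apr-07 | 36 | 1.13 | 31.93 | 28 |
| May-07 | 19 | 0.65 | 29.13 | 29 |
| Jun-07 | 20 | 0.67 | 29.99 | 30 |
| Jul-07 | 17 | 0.55 | 30.90 | 31 |
| Aug-07 | 31 | 1.00 | 31.14 | 32 |
| Sep-07 | 29 | 0.91 | 31.80 | 33 |
| Oct-07 | 20 | 0.62 | 32.01 | 34 |
| Nov-07 | 50 | 1.53 | 32.70 | 35 |
| Dec-07 | 58 | 1.75 | 33.23 | 36 |
| Jan-08 | 26 | 0.78 | 33.42 | 37 |
| Feb-08 | 52 | 1.53 | 33.99 | 38 |
| Mar-08 | 43 | 0.89 | 48.41 | 39 |
| Apr-08 | 27 | 1.13 | 23.94 | 40 |
| May-08 | 23 | 0.65 | 35.26 | 41 |
| Jun-08 | 24 | 0.67 | 35.99 | 42 |
| Jul-08 | 20 | 0.55 | 36.35 | 43 |
| Aug-08 | 37 | 1.00 | 37.17 | 44 |
| Sep-08 | 35 | 0.91 | 38.38 | 45 |
| Oct-08 | 24 | 0.62 | 38.41 | 46 |
| Nov-08 | 60 | 1.53 | 39.24 | 47 |
| Dec-08 | 70 | 1.75 | 40.11 | 48 |
| Jan-09 | 31 | 0.78 | 39.84 | 49 |
| Feb-09 | 62 | 1.53 | 40.53 | 50 |
| Mar-09 | 32 | 0.89 | 36.03 | 51 |
| Apr-09 | 52 | 1.13 | 46.12 | 52 |
| May-09 | 28 | 0.65 | 42.93 | 53 |
| Jun-09 | 29 | 0.67 | 43.49 | 54 |
| Jul-09 | 24 | 0.55 | 43.62 | 55 |
| Aug-09 | 44 | 1.00 | 44.20 | 56 |
| Sep-09 | 42 | 0.91 | 46.06 | 57 |
| Oct-09 | 29 | 0.62 | 46.41 | 58 |
| Nov-09 | 72 | 1.53 | 47.09 | 59 |
| Dec-09 | 84 | 1.75 | 48.13 | 60 |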

Notice the seasonally adjusted gift basket orders in the fourth column, which is the quotient of the second and third columns. In the months where the seasonal adjustment factor is greater than 1, the seasonally adjusted orders are lower than actual orders; in months where the factor is less than 1, the seasonally adjusted orders are greater than actual. This is intended to normalize the data set. (Note: August has a seasonal factor of 1.00, suggesting it is an average month, but that is due to rounding: August 2008’s actual orders are 37 baskets, while its adjusted orders are 37.17.) The final column is the sequential time period number for each month, from 1 to 60.
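The adjustment itself is a single division per month. The sketch below applies it to the 2005 orders using the two-decimal factors listed earlier; because those factors are rounded, some results differ in the second decimal from the table (for example, January comes out near 19.23 rather than 19.28). The variable names are assumptions for illustration.

```python
# Seasonal factors as published (rounded to two decimals)
factors = {
    "Jan": 0.78, "Feb": 1.53, "Mar": 0.89, "Apr": 1.13, "May": 0.65, "Jun": 0.67,
    "Jul": 0.55, "Aug": 1.00, "Sep": 0.91, "Oct": 0.62, "Nov": 1.53, "Dec": 1.75,
}

# Billie's actual orders for 2005
orders_2005 = {
    "Jan": 15, "Feb": 30, "Mar": 25, "Apr": 15, "May": 13, "Jun": 14,
    "Jul": 12, "Aug": 22, "Sep": 20, "Oct": 14, "Nov": 35, "Dec": 40,
}

# Seasonally adjusted orders = actual orders / seasonal factor
adjusted_2005 = {month: orders_2005[month] / factors[month] for month in orders_2005}

for month, value in adjusted_2005.items():
    print(f"{month}-05: {orders_2005[month]:>2} / {factors[month]:.2f} = {value:.2f}")
```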

Perform Regression Analysis

Now Billie runs regression analysis. She is going to do a simple regression, using the time period, t, in the last column as her independent variable and the seasonally adjusted orders as her dependent variable. Recall that last week, we ran a simple regression on the actual sales to isolate the trend component, and we identified an upward trend; however, because of the strong seasonal factors in the actual orders, the regression model didn’t fit the data well. By factoring out these seasonal variations, we should expect a model that better fits the data.

Running her regression of the seasonally adjusted orders, Billie gets the following output:

Ŷ = 0.47t +17.12

And as we expected, this model fits the data better, with an R² of 0.872. Basically, seasonally adjusted orders grow by about half an order with each passing month.
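For readers who want to reproduce the regression, here is a minimal sketch that seasonally adjusts all 60 months with the published two-decimal factors and fits a least-squares line against the time period. Because the factors are rounded, the coefficients should land close to, but not exactly on, the 0.47 and 17.12 reported above. The helper `simple_regression` is an assumption for illustration.

```python
# Monthly orders, January 2005 through December 2009
orders = [
    15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40,   # 2005
    18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48,   # 2006
    22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58,   # 2007
    26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70,   # 2008
    31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84,   # 2009
]
# Seasonal factors for January through December (rounded to two decimals)
factors = [0.78, 1.53, 0.89, 1.13, 0.65, 0.67, 0.55, 1.00, 0.91, 0.62, 1.53, 1.75]

adjusted = [o / factors[i % 12] for i, o in enumerate(orders)]
periods = list(range(1, len(orders) + 1))   # t = 1 .. 60


def simple_regression(x, y):
    """Least-squares slope, intercept and R-squared for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = simple_regression(periods, adjusted)
print(f"Y-hat = {slope:.2f}t + {intercept:.2f}, R-squared = {r_squared:.3f}")
```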

Forecasting Orders

Now Billie needs to forecast orders for January through April 2010. January 2010 is period 61, so she plugs that into her regression equation:

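Ŷ = 0.47(61) + 17.12 = 45.81

(With the rounded coefficients shown, the arithmetic gives 45.79; the small difference comes from rounding.)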

Billie plugs in the data for the rest of the months and gets the following:

| Month | Period | Ŷ |
|---|---|---|
| Jan-10 | 61 | 45.81 |
| Feb-10 | 62 | 46.28 |
| Mar-10 | 63 | 46.76 |
| Apr-10 | 64 | 47.23 |

Remember, however, that this is seasonally-adjusted data. To get the forecasts for actual orders for each month, Billie now needs to convert them back. Since she divided each month’s orders by its seasonal adjustment factor, she must now multiply each of these months’ forecasts by those same factors. So Billie goes ahead and does that:

| Month | Period | Ŷ | Seasonal Factor | Forecast Orders |
|---|---|---|---|---|
| Jan-10 | 61 | 45.81 | 0.78 | 35.65 |
| Feb-10 | 62 | 46.28 | 1.53 | 70.81 |
| Mar-10 | 63 | 46.76 | 0.89 | 41.53 |
| Apr-10 | 64 | 47.23 | 1.13 | 53.25 |

So, Billie forecasts 36 gift basket orders in January; 71 in February, 42 in March, and 53 in April.
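The two forecasting steps, projecting the trend and then multiplying the seasonality back in, can be sketched as follows. The snippet uses the rounded regression equation and rounded seasonal factors exactly as printed, so the unrounded forecasts differ slightly from the table, but they round to the same whole numbers of baskets. The names `trend`, `periods` and `factors` are assumptions for illustration.

```python
def trend(t):
    """Seasonally adjusted trend forecast, using the rounded coefficients from the post."""
    return 0.47 * t + 17.12


# Time period and seasonal factor for each month to be forecast
periods = {"Jan-10": 61, "Feb-10": 62, "Mar-10": 63, "Apr-10": 64}
factors = {"Jan-10": 0.78, "Feb-10": 1.53, "Mar-10": 0.89, "Apr-10": 1.13}

# Re-seasonalize: multiply each trend forecast by that month's seasonal factor
forecasts = {month: trend(t) * factors[month] for month, t in periods.items()}
rounded = [round(forecasts[month]) for month in periods]

for month in periods:
    print(f"{month}: {forecasts[month]:.2f} -> {round(forecasts[month])} baskets")
```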

Next Forecast Friday Topic: Qualitative Variables in Regression Modeling

You’ve just learned how to adjust for seasonality when forecasting. One thing you’ve noticed through all of these forecasts we have built is that all variables have been quantitative. Yet sometimes, we need to account for qualitative, or categorical factors in our explanation of events. The next two Forecast Friday posts will discuss a simple approach for introducing qualitative information into modeling: “dummy” variables. Dummy variables can be helpful in determining differences in predictive estimates by region, gender, race, political affiliation, etc. As you will also find, dummy variables can even be used for a faster, more simplified approach to gauging seasonality. You’ll find our discussion on dummy variables highly useful.


### Forecast Friday Topic: Decomposing a Time Series

September 9, 2010

(Twentieth in a series)

Welcome to our 20th Forecast Friday post. The last four months have been quite a journey, as we went through the various time series methods like moving average models, exponential smoothing models, and regression analysis, followed by in-depth discussions of the assumptions behind regression analysis and the consequences and remedies of violating those assumptions. Today, we resume the more practical aspects of time series analysis, with a discussion of decomposing a time series. If you recall from our May 3 post, a time series consists of four components: a trend component; a seasonal component; a cyclical component; and an irregular, or random, component. Today, we will show you how to isolate and control for these components, using the fictitious example of Billie Burton, a self-employed gift basket maker.

So Billie pulls together her monthly orders for the years 2005-2009. They look like this:

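Total gift basket orders:

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|---|---|---|---|---|---|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |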

Trend Component

When a variable exhibits a long-term increase or decrease over the course of time, it is said to have a trend. Billie’s gift basket orders for the past five years exhibit a long-term, upward trend, as shown by the time series plot below:

Although the graph looks pretty busy and bumpy, you can see that Billie’s monthly orders seem to be moving upward over the course of time. Notice that we fit a straight line across Billie’s time series. This is a linear trend line. Most times, we plot the data in a time series and then draw a straight line freehand to show whether a trend is increasing or decreasing. Another approach to fitting a trend line – like the one I used here – is to use simple regression analysis, using each time period, t, as the independent variable, and numbering each period in sequential order. Hence, January 2005 would be t=1 and December 2009 would be t=60. This is very similar to the approach we discussed in our May 27 blog post when we demonstrated how our other fictitious businesswoman, Sue Stone, could forecast her sales.

In using regression analysis, to fit our trend line, we would get the following equation:

Ŷ= 0.518t +15.829

Since the slope of the trend line is positive, we know that the trend is upward. Billie’s orders seem to increase by slightly more than half an order each month, on average. However, when we look at the R², we get just 0.313, suggesting the trend line doesn’t fit the actual data well. But that is because of the drastic seasonality in the data set, which we will address shortly. For now, we at least know that the trend is increasing.
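The trend line above can be reproduced with a short least-squares calculation on the raw orders, numbering the months 1 through 60. This is a minimal sketch; the helper name `simple_regression` is an assumption for illustration.

```python
# Monthly orders, January 2005 through December 2009
orders = [
    15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40,   # 2005
    18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48,   # 2006
    22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58,   # 2007
    26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70,   # 2008
    31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84,   # 2009
]
periods = list(range(1, len(orders) + 1))   # January 2005 is t=1, December 2009 is t=60


def simple_regression(x, y):
    """Least-squares slope, intercept and R-squared for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = simple_regression(periods, orders)
print(f"Y-hat = {slope:.3f}t + {intercept:.3f}, R-squared = {r_squared:.3f}")
```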

Seasonal Component

When a time series shows a repeating pattern over time, usually during the same time of the year, that pattern is known as the seasonal component in the time series. Some time series have more than one period in the year in which seasonality is strong; others have no seasonality. If you look at each of the January points, you’ll notice that it is greatly lower than the preceding December and the following February. Also, if you look at each December, you’ll see that it is the highest point of orders for each year. This strongly suggests seasonality in the data.

But what is the impact of the seasonality? We find out by isolating the seasonal component and creating a seasonal index, known as the ratio to moving average. Computing the ratio to moving average is a four-step process:

First, take the moving average of the series

Since our data is monthly, we will take a 12-month moving average. If our data were quarterly, we would take a 4-quarter moving average. The result is shown in the third column of the table below.

Second, center the moving averages

Next, we center the moving averages by taking the average of each successive pair of moving averages; the result is shown in the fourth column.

Third, compute the ratio to moving average

To obtain the ratio to moving average, divide the number of orders for a given month by the centered 12-month moving average that corresponds to that month. Notice that July 2005 is the first month to have a centered 12-month moving average. That is because we lose data points when we take a moving average. For July 2005, we divide its number of orders, 12, by its centered 12-month moving average, 21.38, and get 0.561 (shown in the table multiplied by 100, as a percentage).

| Month | Orders | 12-Month Moving Average | Centered 12-Month Moving Average | Ratio to Moving Average (%) |
|---|---|---|---|---|
| Jan-05 | 15 | | | |
| Feb-05 | 30 | | | |
| Mar-05 | 25 | | | |
| Apr-05 | 15 | | | |
| May-05 | 13 | | | |
| Jun-05 | 14 | 21.25 | | |
| Jul-05 | 12 | 21.50 | 21.38 | 56.1 |
| Aug-05 | 22 | 22.00 | 21.75 | 101.1 |
| Sep-05 | 20 | 21.42 | 21.71 | 92.1 |
| Oct-05 | 14 | 22.67 | 22.04 | 63.5 |
| Nov-05 | 35 | 22.92 | 22.79 | 153.6 |
| Dec-05 | 40 | 23.17 | 23.04 | 173.6 |
| Jan-06 | 18 | 23.33 | 23.25 | 77.4 |
| Feb-06 | 36 | 23.67 | 23.50 | 153.2 |
| Mar-06 | 18 | 24.00 | 23.83 | 75.5 |
| Apr-06 | 30 | 24.25 | 24.13 | 124.4 |
| May-06 | 16 | 24.83 | 24.54 | 65.2 |
| Jun-06 | 17 | 25.50 | 25.17 | 67.5 |
| Jul-06 | 14 | 25.83 | 25.67 | 54.5 |
| Aug-06 | 26 | 26.42 | 26.13 | 99.5 |
| Sep-06 | 24 | 26.75 | 26.58 | 90.3 |
| Oct-06 | 17 | 27.25 | 27.00 | 63.0 |
| Nov-06 | 42 | 27.50 | 27.38 | 153.4 |
| Dec-06 | 48 | 27.75 | 27.63 | 173.8 |
| Jan-07 | 22 | 28.00 | 27.88 | 78.9 |
| Feb-07 | 43 | 28.42 | 28.21 | 152.4 |
| Mar-07 | 22 | 28.83 | 28.63 | 76.9 |
| Apr-07 | 36 | 29.08 | 28.96 | 124.3 |
| May-07 | 19 | 29.75 | 29.42 | 64.6 |
| Jun-07 | 20 | 30.58 | 30.17 | 66.3 |
| Jul-07 | 17 | 30.92 | 30.75 | 55.3 |
| Aug-07 | 31 | 31.67 | 31.29 | 99.1 |
| Sep-07 | 29 | 33.42 | 32.54 | 89.1 |
| Oct-07 | 20 | 32.67 | 33.04 | 60.5 |
| Nov-07 | 50 | 33.00 | 32.83 | 152.3 |
| Dec-07 | 58 | 33.33 | 33.17 | 174.9 |
| Jan-08 | 26 | 33.58 | 33.46 | 77.7 |
| Feb-08 | 52 | 34.08 | 33.83 | 153.7 |
| Mar-08 | 43 | 34.58 | 34.33 | 125.2 |
| Apr-08 | 27 | 34.92 | 34.75 | 77.7 |
| May-08 | 23 | 35.75 | 35.33 | 65.1 |
| Jun-08 | 24 | 36.75 | 36.25 | 66.2 |
| Jul-08 | 20 | 37.17 | 36.96 | 54.1 |
| Aug-08 | 37 | 38.00 | 37.58 | 98.4 |
| Sep-08 | 35 | 37.08 | 37.54 | 93.2 |
| Oct-08 | 24 | 39.17 | 38.13 | 63.0 |
| Nov-08 | 60 | 39.58 | 39.38 | 152.4 |
| Dec-08 | 70 | 40.00 | 39.79 | 175.9 |
| Jan-09 | 31 | 40.33 | 40.17 | 77.2 |
| Feb-09 | 62 | 40.92 | 40.63 | 152.6 |
| Mar-09 | 32 | 41.50 | 41.21 | 77.7 |
| Apr-09 | 52 | 41.92 | 41.71 | 124.7 |
| May-09 | 28 | 42.92 | 42.42 | 66.0 |
| Jun-09 | 29 | 44.08 | 43.50 | 66.7 |
| Jul-09 | 24 | | | |
| Aug-09 | 44 | | | |
| Sep-09 | 42 | | | |
| Oct-09 | 29 | | | |
| Nov-09 | 72 | | | |
| Dec-09 | 84 | | | |
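The first three steps can be sketched in code as follows: a 12-month moving average, the centered average of each successive pair, and the ratio of actual orders to the centered average. The variable names are assumptions for illustration. The first centered value lines up with July 2005 (index 6 in the list), matching the table.

```python
# Monthly orders, January 2005 through December 2009
orders = [
    15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40,   # 2005
    18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48,   # 2006
    22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58,   # 2007
    26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70,   # 2008
    31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84,   # 2009
]
window = 12

# Step 1: 12-month moving averages (one per complete 12-month window)
moving_avg = [
    sum(orders[i:i + window]) / window for i in range(len(orders) - window + 1)
]

# Step 2: center by averaging each successive pair of moving averages
centered = [(a + b) / 2 for a, b in zip(moving_avg, moving_avg[1:])]

# Step 3: ratio of actual orders to the centered moving average.
# centered[i] corresponds to orders[i + window // 2], so centered[0] is July 2005.
offset = window // 2
ratios = [orders[i + offset] / c for i, c in enumerate(centered)]

print(f"First centered average (Jul-05): {centered[0]:.2f}")
print(f"First ratio (Jul-05): {ratios[0]:.3f}")
print(f"Number of ratios: {len(ratios)}")
```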

We have exactly 48 months of ratios to examine. Let’s plot each year’s ratios on a graph:

At first glance, it appears that there are only two lines on the graph, those for years three and four. However, all four years are represented. The lines overlap because the turning points are the same and the ratios to moving average for each month are nearly identical from year to year. The only difference is in Year 3 (July 2007 to June 2008). Notice how the green line for Year 3 doesn’t follow the same pattern as the other years from February to April. Year 3’s ratio to moving average is higher for March than in any other year, and lower for April. This is because Easter Sunday fell in late March 2008, so the Easter gift basket season came a couple of weeks earlier than in the other years.

Finally, compute the average seasonal index for each month

We now have the ratio to moving averages for each month. Let’s average them:

RATIO TO MOVING AVERAGES

| Month | Year 1 | Year 2 | Year 3 | Year 4 | Average |
|---|---|---|---|---|---|
| July | 0.56 | 0.55 | 0.55 | 0.54 | 0.55 |
| August | 1.01 | 1.00 | 0.99 | 0.98 | 1.00 |
| September | 0.92 | 0.90 | 0.89 | 0.93 | 0.91 |
| October | 0.64 | 0.63 | 0.61 | 0.63 | 0.62 |
| November | 1.54 | 1.53 | 1.52 | 1.52 | 1.53 |
| December | 1.74 | 1.74 | 1.75 | 1.76 | 1.75 |
| January | 0.77 | 0.79 | 0.78 | 0.77 | 0.78 |
| February | 1.53 | 1.52 | 1.54 | 1.53 | 1.53 |
| March | 0.76 | 0.77 | 1.25 | 0.78 | 0.89 |
| April | 1.24 | 1.24 | 0.78 | 1.25 | 1.13 |
| May | 0.65 | 0.65 | 0.65 | 0.66 | 0.65 |
| June | 0.68 | 0.66 | 0.66 | 0.67 | 0.67 |
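The averaging step is simple enough to sketch in Python. The snippet below uses three months from the table; the helper name `seasonal_index` is just a label chosen for this example.

```python
# Ratios to moving average for a few months, copied from the table above
# (Year 1 through Year 4).
ratios = {
    "December": [1.74, 1.74, 1.75, 1.76],
    "March":    [0.76, 0.77, 1.25, 0.78],
    "April":    [1.24, 1.24, 0.78, 1.25],
}

def seasonal_index(yearly_ratios):
    """Average one month's ratios across the years (hypothetical helper)."""
    return sum(yearly_ratios) / len(yearly_ratios)

indices = {month: seasonal_index(values) for month, values in ratios.items()}
for month, index in indices.items():
    print(f"{month}: {index:.2f}")
```

March averages to about 0.89 and April to roughly 1.13. In both cases the single Year 3 value, not the typical year, pulls the average away from the other three years.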

Hence, we see that August is a normal month (its average seasonal index = 1). However, look at December. Its seasonal index is 1.75. That means that Billie’s December orders generally run at 175 percent of the monthly average, or 75 percent above it. Given the Christmas gift giving season, that’s expected in Billie’s gift basket business. We also notice higher seasonal indices in November (when the Christmas shopping season kicks off), February (Valentine’s Day), and April (Easter). The other months tend to be below average.

Notice that April isn’t very far above the baseline, and that March had one year where its index was 1.25 (when in other years it was under 0.80). That’s because Easter sometimes falls in late March. Details like this are important to keep track of, since they can dramatically impact planning. Also, if a given month has five weekends one year and only four the next, or if leap year adds a day to February, these events can, depending on your business, make a big difference in the accuracy of your forecasts.

The Cyclical and Irregular Components

Now that we’ve isolated the trend and seasonal components, we know that Billie’s orders exhibit an increasing trend and that orders tend to be above average during November, December, February, and April. Now we need to isolate the cyclical and irregular components. Cyclical variations don’t repeat themselves in a regular pattern, but they are not random variations either. Cyclical patterns are recognizable, but they almost always vary in intensity (the height from peak to trough) and timing (frequency with which the peaks and troughs occur). Since they cannot be accurately predicted, they are often analyzed with the irregular components.

The way we isolate the cyclical and irregular components is by first isolating the trend and seasonal components as we did above. We take our trend regression equation from above and plug in each month’s sequence number to get the trend value. Then we multiply the trend value by that month’s average seasonal index (its average ratio to moving average) to derive the statistical normal. To derive the cyclical-irregular component, we divide the actual orders for that month by the statistical normal. The following table shows us how:

| Month | Orders (Y) | Time Period (t) | Trend Value (T) | Seasonal Index Ratio (S) | Statistical Normal (T × S) | Cyclical-Irregular Component (%), 100 × Y / (T × S) |
|---|---|---|---|---|---|---|
| Jan-05 | 15 | 1 | 16 | 0.78 | 12.72 | 117.92 |
| Feb-05 | 30 | 2 | 17 | 1.53 | 25.80 | 116.27 |
| Mar-05 | 25 | 3 | 17 | 0.89 | 15.44 | 161.91 |
| Apr-05 | 15 | 4 | 18 | 1.13 | 20.19 | 74.31 |
| May-05 | 13 | 5 | 18 | 0.65 | 12.01 | 108.20 |
| Jun-05 | 14 | 6 | 19 | 0.67 | 12.63 | 110.86 |
| Jul-05 | 12 | 7 | 19 | 0.55 | 10.71 | 112.09 |
| Aug-05 | 22 | 8 | 20 | 1.00 | 19.88 | 110.64 |
| Sep-05 | 20 | 9 | 20 | 0.91 | 18.69 | 107.02 |
| Oct-05 | 14 | 10 | 21 | 0.62 | 13.13 | 106.63 |
| Nov-05 | 35 | 11 | 22 | 1.53 | 32.92 | 106.31 |
| Dec-05 | 40 | 12 | 22 | 1.75 | 38.48 | 103.95 |
| Jan-06 | 18 | 13 | 23 | 0.78 | 17.56 | 102.52 |
| Feb-06 | 36 | 14 | 23 | 1.53 | 35.31 | 101.94 |
| Mar-06 | 18 | 15 | 24 | 0.89 | 20.96 | 85.86 |
| Apr-06 | 30 | 16 | 24 | 1.13 | 27.20 | 110.30 |
| May-06 | 16 | 17 | 25 | 0.65 | 16.07 | 99.57 |
| Jun-06 | 17 | 18 | 25 | 0.67 | 16.77 | 101.34 |
| Jul-06 | 14 | 19 | 26 | 0.55 | 14.13 | 99.10 |
| Aug-06 | 26 | 20 | 26 | 1.00 | 26.07 | 99.72 |
| Sep-06 | 24 | 21 | 27 | 0.91 | 24.36 | 98.53 |
| Oct-06 | 17 | 22 | 27 | 0.62 | 17.02 | 99.91 |
| Nov-06 | 42 | 23 | 28 | 1.53 | 42.43 | 98.99 |
| Dec-06 | 48 | 24 | 28 | 1.75 | 49.33 | 97.30 |
| Jan-07 | 22 | 25 | 29 | 0.78 | 22.40 | 98.23 |
| Feb-07 | 43 | 26 | 29 | 1.53 | 44.83 | 95.92 |
| Mar-07 | 22 | 27 | 30 | 0.89 | 26.49 | 83.06 |
| Apr-07 | 36 | 28 | 30 | 1.13 | 34.21 | 105.23 |
| May-07 | 19 | 29 | 31 | 0.65 | 20.13 | 94.41 |
| Jun-07 | 20 | 30 | 31 | 0.67 | 20.92 | 95.60 |
| Jul-07 | 17 | 31 | 32 | 0.55 | 17.55 | 96.88 |
| Aug-07 | 31 | 32 | 32 | 1.00 | 32.26 | 96.08 |
| Sep-07 | 29 | 33 | 33 | 0.91 | 30.03 | 96.58 |
| Oct-07 | 20 | 34 | 33 | 0.62 | 20.90 | 95.69 |
| Nov-07 | 50 | 35 | 34 | 1.53 | 51.94 | 96.27 |
| Dec-07 | 58 | 36 | 34 | 1.75 | 60.19 | 96.37 |
| Jan-08 | 26 | 37 | 35 | 0.78 | 27.23 | 95.47 |
| Feb-08 | 52 | 38 | 36 | 1.53 | 54.34 | 95.70 |
| Mar-08 | 43 | 39 | 36 | 0.89 | 32.01 | 134.34 |
| Apr-08 | 27 | 40 | 37 | 1.13 | 41.22 | 65.50 |
| May-08 | 23 | 41 | 37 | 0.65 | 24.18 | 95.12 |
| Jun-08 | 24 | 42 | 38 | 0.67 | 25.07 | 95.75 |
| Jul-08 | 20 | 43 | 38 | 0.55 | 20.97 | 95.38 |
| Aug-08 | 37 | 44 | 39 | 1.00 | 38.45 | 96.22 |
| Sep-08 | 35 | 45 | 39 | 0.91 | 35.70 | 98.05 |
| Oct-08 | 24 | 46 | 40 | 0.62 | 24.79 | 96.83 |
| Nov-08 | 60 | 47 | 40 | 1.53 | 61.44 | 97.65 |
| Dec-08 | 70 | 48 | 41 | 1.75 | 71.04 | 98.54 |
| Jan-09 | 31 | 49 | 41 | 0.78 | 32.07 | 96.66 |
| Feb-09 | 62 | 50 | 42 | 1.53 | 63.85 | 97.10 |
| Mar-09 | 32 | 51 | 42 | 0.89 | 37.53 | 85.26 |
| Apr-09 | 52 | 52 | 43 | 1.13 | 48.23 | 107.81 |
| May-09 | 28 | 53 | 43 | 0.65 | 28.24 | 99.16 |
| Jun-09 | 29 | 54 | 44 | 0.67 | 29.21 | 99.27 |
| Jul-09 | 24 | 55 | 44 | 0.55 | 24.39 | 98.40 |
| Aug-09 | 44 | 56 | 45 | 1.00 | 44.64 | 98.56 |
| Sep-09 | 42 | 57 | 45 | 0.91 | 41.37 | 101.53 |
| Oct-09 | 29 | 58 | 46 | 0.62 | 28.67 | 101.14 |
| Nov-09 | 72 | 59 | 46 | 1.53 | 70.95 | 101.48 |
| Dec-09 | 84 | 60 | 47 | 1.75 | 81.89 | 102.58 |
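The arithmetic in the last two columns can be sketched as follows. The function names are invented for this example. Because the trend values in the table are shown rounded to whole numbers, the example plugs in the statistical normals exactly as printed rather than recomputing them from the rounded trend.

```python
def statistical_normal(trend, seasonal_index):
    """Orders expected from trend and seasonality alone (T x S)."""
    return trend * seasonal_index

def cyclical_irregular(actual, normal):
    """Actual orders as a percentage of the statistical normal."""
    return 100 * actual / normal

# Jan-05 and Dec-09, using the statistical normals printed in the table.
jan_05 = cyclical_irregular(15, 12.72)
dec_09 = cyclical_irregular(84, 81.89)
print(round(jan_05, 2), round(dec_09, 2))
```

A value above 100 means actual orders came in above what trend and seasonality alone would suggest; a value below 100 means they came in under.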

For the most part, Billie’s orders don’t seem to exhibit much cyclical or irregular behavior. In most months, the cyclical-irregular component is pretty close to 100. Given her kind of business, we would suspect this is either not true or a fluke, since the recession of 2008 through 2009 would likely have meant a reduction in orders. In many of those months, we would expect to see a ratio well below 100. We do see that in much of 2005, the cyclical-irregular component for Billie’s gift basket orders is well above 100. It is very likely that in that year, Billie’s business was seeing a positive cyclical pattern. We also see irregular patterns in March and April of later years, where the cyclical-irregular component swings well away from 100. That’s again the irregularity of when Easter falls. Not surprisingly, Easter has both a seasonal and an irregular component!

This does not mean that Billie can kick up her feet and rest assured knowing that her business doesn’t suffer much from cyclical or irregular patterns. A deepening of the recession can ultimately sink her orders; a war can cut off the materials that are used to produce her gift baskets; a shortage or drastic price increase in the materials she uses can also force her prices higher, which in turn lowers her orders; her workshop could be destroyed in a flood or fire; and so on. To handle some of these irregular patterns – which are almost impossible to plan for – Billie would purchase insurance.

**********************************

Knowing the composition of a time series is an important element of forecasting. Decomposing the time series helps decision makers understand and explain the variability in their data and how much of it to attribute to the trend, seasonal, cyclical and irregular components. In next week’s Forecast Friday post, we’ll discuss how to forecast using data that is seasonally adjusted.

### Forecast Friday Topic: Detecting Autocorrelation

July 29, 2010

(Fifteenth in a series)

We have spent the last few Forecast Friday posts discussing violations of different assumptions in regression analysis. So far, we have discussed the effects of specification bias and multicollinearity on parameter estimates, and their corresponding effect on your forecasts. Today, we will discuss another violation, autocorrelation, which occurs when sequential residual (error) terms are correlated with one another.

When working with time series data, autocorrelation is the most common problem forecasters face. When the assumption of uncorrelated residuals is violated, we end up with models that have inefficient parameter estimates and upwardly biased t-ratios and R² values. These inflated values make our forecasting model appear better than it really is, and can cause our model to miss turning points. Hence, if your model is predicting an increase in sales and you, in actuality, see sales plunge, it may be due to autocorrelation.

What Does Autocorrelation Look Like?

Autocorrelation can take on two types: positive or negative. In positive autocorrelation, consecutive errors usually have the same sign: positive residuals are almost always followed by positive residuals, while negative residuals are almost always followed by negative residuals. In negative autocorrelation, consecutive errors typically have opposite signs: positive residuals are almost always followed by negative residuals and vice versa.

In addition, there are different orders of autocorrelation. The simplest, most common kind of autocorrelation, first-order autocorrelation, occurs when the consecutive errors are correlated. Second-order autocorrelation occurs when error terms two periods apart are correlated, and so forth. Here, we will concentrate solely on first-order autocorrelation.

You will see a visual depiction of positive autocorrelation later in this post.

What Causes Autocorrelation?

The two main culprits for autocorrelation are sluggishness in the business cycle (also known as inertia) and omitted variables from the model. At various turning points in a time series, inertia is very common. At the time when a time series turns upward (downward), its observations build (lose) momentum, and continue going up (down) until the series reaches its peak (trough). As a result, successive observations and the error terms associated with them depend on each other.

Another example of inertia happens when forecasting a time series where the same observations can be in multiple successive periods. For example, I once developed a model to forecast enrollment for a community college, and found autocorrelation to be present in my initial model. This happened because many of the students enrolled during the spring term were also enrolled in the previous fall term. As a result, I needed to correct for that.

The other main cause of autocorrelation is omitted variables from the model. When an important independent variable is omitted from a model, its effect on the dependent variable becomes part of the error term. Hence, if the omitted variable has a positive correlation with the dependent variable, it is likely to cause error terms that are positively correlated.

How Do We Detect Autocorrelation?

To illustrate how we go about detecting autocorrelation, let’s first start with a data set. I have pulled the average hourly wages of textile and apparel workers for the 18 months from January 1986 through June 1987. The original source was the Survey of Current Business, September issues from 1986 and 1987, but this data set was reprinted in Data Analysis Using Microsoft® Excel, by Michael R. Middleton, page 219:

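| Month | t | Wage |
|---|---|---|
| Jan-86 | 1 | 5.82 |
| Feb-86 | 2 | 5.79 |
| Mar-86 | 3 | 5.80 |
| Apr-86 | 4 | 5.81 |
| May-86 | 5 | 5.78 |
| Jun-86 | 6 | 5.79 |
| Jul-86 | 7 | 5.79 |
| Aug-86 | 8 | 5.83 |
| Sep-86 | 9 | 5.91 |
| Oct-86 | 10 | 5.87 |
| Nov-86 | 11 | 5.87 |
| Dec-86 | 12 | 5.90 |
| Jan-87 | 13 | 5.94 |
| Feb-87 | 14 | 5.93 |
| Mar-87 | 15 | 5.93 |
| Apr-87 | 16 | 5.94 |
| May-87 | 17 | 5.89 |
| Jun-87 | 18 | 5.91 |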

Now, let’s run a simple regression model, using time period t as the independent variable and Wage as the dependent variable. Using the data set above, we derive the following model:

Ŷ = 5.7709 + 0.0095t

Examine the Model Output

Notice also the following model diagnostic statistics:

R² = 0.728

| Variable | Coefficient | t-ratio |
|---|---|---|
| Intercept | 5.7709 | 367.62 |
| t | 0.0095 | 6.55 |

You can see that the R² is high, with changes in t explaining nearly three-quarters of the variation in average hourly wage. Note also the t-ratios for both the intercept and the parameter estimate for t. Both are very high. Recall that a high R² and high t-ratios can be symptoms of autocorrelation.
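For readers who want to reproduce the fit outside of Excel, here is a minimal sketch of the same simple regression in plain Python. The function name `simple_regression` is invented for this illustration, and the formulas are the textbook least-squares ones rather than anything specific to Middleton’s workbook.

```python
import math

# Average hourly wages, Jan-86 through Jun-87, from the data set above.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]
periods = list(range(1, len(wages) + 1))

def simple_regression(x, y):
    """Ordinary least squares for y = a + b*x.

    Returns (intercept, slope, r_squared, t_ratio_of_slope).
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = syy - b * sxy                          # residual sum of squares
    r_squared = 1 - sse / syy
    t_slope = b / math.sqrt(sse / (n - 2) / sxx)  # slope / its standard error
    return a, b, r_squared, t_slope

intercept, slope, r_squared, t_slope = simple_regression(periods, wages)
print(round(intercept, 4), round(slope, 4), round(r_squared, 3), round(t_slope, 2))
```

The results should agree with the model output above to the precision shown there: an intercept near 5.7709, a slope near 0.0095, an R² near 0.728, and a t-ratio for the slope near 6.55.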

Visually Inspect Residuals

Just because a model has a high R² and parameters with high t-ratios doesn’t mean autocorrelation is present. More work must be done to detect autocorrelation. Another way to check for autocorrelation is to visually inspect the residuals. The best way to do this is by plotting the average hourly wage predicted by the model against the actual average hourly wage, as Middleton has done:

Notice the green line representing the Predicted Wage. It is a straight, upward line. This is to be expected, since the independent variable is sequential and shows an increasing trend. The red line depicts the actual wage in the time series. Notice that the model’s forecast is higher than actual for months 5 through 8, and for months 17 and 18. The model also underpredicts for months 12 through 16. This clearly illustrates the presence of positive, first-order autocorrelation.
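If you don’t have the chart handy, the same pattern shows up in the signs of the residuals. The sketch below assumes the fitted line reported earlier (5.7709 + 0.0095t, with its rounded coefficients) and simply counts how often consecutive residuals share a sign. That count is an informal eyeball check, not a formal test.

```python
# Average hourly wages, Jan-86 through Jun-87, from the data set above.
wages = [5.82, 5.79, 5.80, 5.81, 5.78, 5.79, 5.79, 5.83, 5.91,
         5.87, 5.87, 5.90, 5.94, 5.93, 5.93, 5.94, 5.89, 5.91]

# Residual = actual wage minus the wage predicted by 5.7709 + 0.0095 * t.
residuals = [w - (5.7709 + 0.0095 * t) for t, w in enumerate(wages, start=1)]

# "+" where the model underpredicts, "-" where it overpredicts.
signs = "".join("+" if e > 0 else "-" for e in residuals)
same_sign_pairs = sum(a == b for a, b in zip(signs, signs[1:]))

print(signs)
print(same_sign_pairs, "of", len(signs) - 1, "consecutive pairs share a sign")
```

Here 12 of the 17 consecutive pairs share a sign, and the residuals come in long runs of pluses and minuses, which is the signature of positive autocorrelation described above.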

The Durbin-Watson Statistic

Examining the model components and visually inspecting the residuals are intuitive, but not definitive ways to diagnose autocorrelation. To really be sure if autocorrelation exists, we must compute the Durbin-Watson statistic, often denoted as d.

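In our June 24 Forecast Friday post, we demonstrated how to calculate the Durbin-Watson statistic. The actual formula is:

$$
d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
$$

Here $e_t$ is the error term for observation $t$, and $n$ is the number of observations.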

That is, beginning with the error term for the second observation, we subtract the immediate previous error term from it; then we square the difference. We do this for each observation from the second one onward. Then we sum all of those squared differences together. Next, we square the error terms for each observation, and sum those together. Then we divide the sum of squared differences by the sum of squared error terms, to get our Durbin-Watson statistic.

For our example, we have the following:

| t | Error | Squared Error | $e_t - e_{t-1}$ | Squared Difference |
|---|---|---|---|---|
| 1 | 0.0396 | 0.0016 | | |
| 2 | 0.0001 | 0.0000 | (0.0395) | 0.0016 |
| 3 | 0.0006 | 0.0000 | 0.0005 | 0.0000 |
| 4 | 0.0011 | 0.0000 | 0.0005 | 0.0000 |
| 5 | (0.0384) | 0.0015 | (0.0395) | 0.0016 |
| 6 | (0.0379) | 0.0014 | 0.0005 | 0.0000 |
| 7 | (0.0474) | 0.0022 | (0.0095) | 0.0001 |
| 8 | (0.0169) | 0.0003 | 0.0305 | 0.0009 |
| 9 | 0.0536 | 0.0029 | 0.0705 | 0.0050 |
| 10 | 0.0041 | 0.0000 | (0.0495) | 0.0024 |
| 11 | (0.0054) | 0.0000 | (0.0095) | 0.0001 |
| 12 | 0.0152 | 0.0002 | 0.0205 | 0.0004 |
| 13 | 0.0457 | 0.0021 | 0.0305 | 0.0009 |
| 14 | 0.0262 | 0.0007 | (0.0195) | 0.0004 |
| 15 | 0.0167 | 0.0003 | (0.0095) | 0.0001 |
| 16 | 0.0172 | 0.0003 | 0.0005 | 0.0000 |
| 17 | (0.0423) | 0.0018 | (0.0595) | 0.0035 |
| 18 | (0.0318) | 0.0010 | 0.0105 | 0.0001 |
| Sum: | | 0.0163 | | 0.0171 |

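To obtain our Durbin-Watson statistic, we plug our sums into the formula:

$$
d = \frac{0.0171}{0.0163} = 1.050
$$

(The result is computed from the unrounded sums; the rounded sums shown here give 1.049.)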
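The same calculation can be sketched in Python. The function below is a direct translation of the steps described above, not a library routine, and the residuals are typed in from the table, so they carry its four-decimal rounding.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum((curr - prev) ** 2
                    for prev, curr in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals (errors) for t = 1..18, copied from the table above;
# values shown in parentheses there are negative.
errors = [0.0396, 0.0001, 0.0006, 0.0011, -0.0384, -0.0379, -0.0474,
          -0.0169, 0.0536, 0.0041, -0.0054, 0.0152, 0.0457, 0.0262,
          0.0167, 0.0172, -0.0423, -0.0318]

d = durbin_watson(errors)
print(round(d, 3))
```

Even with the rounded residuals, the result comes out at about 1.05, in line with the figure above.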

What Does the Durbin-Watson Statistic Tell Us?

Our Durbin-Watson statistic is 1.050. What does that mean? The Durbin-Watson statistic is interpreted as follows:

• If d is close to zero (0), then positive autocorrelation is probably present;
• If d is close to two (2), then the model is likely free of autocorrelation; and
• If d is close to four (4), then negative autocorrelation is probably present.
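One way to see where these three reference points come from is to expand the numerator of d. This is a standard approximation rather than something derived in this series. The symbols are shorthand for this note only: $e_t$ stands for the residual at time $t$, $n$ for the number of observations, and $r$ for the lag-one correlation of the residuals.

$$
d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}
  = \frac{\sum_{t=2}^{n} e_t^2 + \sum_{t=2}^{n} e_{t-1}^2 - 2\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}
  \approx 2(1 - r),
\qquad
r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2}
$$

When consecutive residuals move together (r near 1), d is near 0. When they are unrelated (r near 0), d is near 2. When they tend to alternate in sign (r near −1), d is near 4.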

As we saw from our visual examination of the residuals, we appear to have positive autocorrelation, and the fact that our Durbin-Watson statistic is about halfway between zero and two suggests the presence of positive autocorrelation.

Next Forecast Friday Topic: Correcting Autocorrelation

Today we went through the process of understanding the causes and effects of autocorrelation, and how to suspect and detect its presence. Next week, we will discuss how to correct for autocorrelation and eliminate it so that we can have more efficient parameter estimates.
