(Thirty-fourth in a series)
Today, we begin a six-week discussion on the use of Autoregressive Integrated Moving Average (ARIMA) models in forecasting. ARIMA models were popularized by George Box and Gwilym Jenkins in the 1970s, and were traditionally known as Box-Jenkins analysis. The purpose of ARIMA methods is to fit a stochastic (randomly determined) model to a given set of time series data, such that the model can closely approximate the process that is actually generating the data.
There are three main steps in ARIMA methodology: identification, estimation and diagnostic checking, and then application. Before undertaking these steps, however, an analyst must be sure that the time series is stationary. That is, the series' mean and variance are constant over time, and the covariance between any two values of the series depends only on the time interval between those values, not on their absolute location in time.
Determining whether a time series is stationary requires the use of an autocorrelation function (ACF), also called a correlogram, which is the topic of today’s post. Next Thursday, we will go into a full discussion on stationarity and how the ACF is used to determine whether a series is stationary.
Autocorrelation Revisited
Did someone say, “autocorrelation?” Yes! Remember our discussions about detecting and correcting autocorrelation in regression models in our July 29, 2010 and August 5, 2010 Forecast Friday posts? Recall that one of the ways we corrected for autocorrelation was by lagging the dependent variable by one period and then using the lagged variable as an independent variable. Anytime we lag a regression model’s dependent variable and then use it as an independent variable to predict a subsequent period’s dependent variable value, our regression model becomes an autoregressive model.
In regression analysis, we used autoregressive models to correct for autocorrelation. Yet we can also use an autoregressive model to represent the behavior of the time series we're observing.
When we lag a dependent variable by one period, our model is said to be a first-order autoregressive model. A first-order autoregressive model is denoted as:

X_{t} = C + φ_{1}X_{t-1} + a_{t}

Where φ_{1} is the parameter for the autoregressive term lagged by one period; a_{t} is a random variable with a mean of zero and constant variance at time period t; and C is a constant that allows for the fact that the time series X_{t} can have a nonzero mean. In fact, you can easily see that this formula mimics a regression equation, with a_{t} essentially becoming the residuals, X_{t} the dependent variable, C the alpha (or intercept), and φ_{1}X_{t-1} the independent variable. In essence, a first-order autoregressive model bases the next period's forecast on the most recent value.
What if you want to base next period's forecast on the two most recent values? Then you lag by two periods, and have a second-order autoregressive model, which is denoted by:

X_{t} = C + φ_{1}X_{t-1} + φ_{2}X_{t-2} + a_{t}
In fact, you can use any number of past periods to predict the next period. The formula below shows an autoregressive model of order p, where p is the number of past periods whose values you use to predict the next period's value:

X_{t} = C + φ_{1}X_{t-1} + φ_{2}X_{t-2} + … + φ_{p}X_{t-p} + a_{t}
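To make the autoregressive idea concrete, here is a minimal Python sketch that simulates a second-order autoregressive process. The coefficient values are my own illustrative assumptions, not estimates from the data in this post.

```python
import numpy as np

# Simulate X_t = C + phi1*X_{t-1} + phi2*X_{t-2} + a_t
# Coefficients are assumed for illustration and lie in the stationary region.
rng = np.random.default_rng(42)
C, phi1, phi2 = 5.0, 0.6, -0.3
n = 200
a = rng.normal(0.0, 1.0, n)   # white-noise shocks: mean 0, constant variance
x = np.zeros(n)
for t in range(2, n):
    x[t] = C + phi1 * x[t - 1] + phi2 * x[t - 2] + a[t]
# For a stationary AR(2), the long-run mean is C / (1 - phi1 - phi2)
```

Fitting, rather than simulating, such a model is the "estimation" step of the ARIMA methodology described above.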
This review of autocorrelation will help you out in the next section, as we begin to discuss the ACF.
The Autocorrelation Function (ACF)
The ACF is a plot of the autocorrelations between the data points in a time series, and is the key statistic in time series analysis. The ACF is the correlation of the time series with itself, lagged by a certain number of periods. The formula for each lag of an ACF is given by:

r_{k} = Σ_{t=k+1}^{n} (Y_{t} − Ȳ)(Y_{t-k} − Ȳ) / Σ_{t=1}^{n} (Y_{t} − Ȳ)²

Where r_{k} is the autocorrelation at lag k, n is the number of observations, and Ȳ is the mean of the series. If k=1, r_{1} shows the correlation between successive values of Y; if k=2, then r_{2} denotes the correlation between Y values two periods apart, and so on. Plotting each of these lags gives us our ACF.
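As a sketch, r_{k} can be computed directly from this formula. The 48 monthly values below are transcribed from the table that follows; my computed autocorrelations may differ slightly from the rounded figures quoted later in the post.

```python
import numpy as np

def acf(y, k):
    """Autocorrelation r_k of series y at lag k (k >= 0)."""
    y = np.asarray(y, dtype=float)
    ybar = y.mean()
    num = np.sum((y[k:] - ybar) * (y[:len(y) - k] - ybar))  # sum over t = k+1..n
    den = np.sum((y - ybar) ** 2)                           # sum over t = 1..n
    return num / den

# 48 months of data, read down the Year 1..Year 4 columns of the table
y = [1, 20, 31, 8, 40, 41, 46, 89, 72, 45, 81, 93,
     41, 63, 17, 96, 68, 27, 41, 17, 26, 75, 63, 93,
     18, 93, 80, 36, 4, 23, 81, 47, 61, 27, 13, 25,
     51, 20, 65, 45, 87, 68, 36, 31, 79, 7, 95, 37]
r = [acf(y, k) for k in range(1, 7)]   # r_1 through r_6
```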
Let’s assume we have 48 months of data, as shown in the following table:
Year 1         Year 2         Year 3         Year 4
Month  Value   Month  Value   Month  Value   Month  Value
  1      1      13     41      25     18      37     51
  2     20      14     63      26     93      38     20
  3     31      15     17      27     80      39     65
  4      8      16     96      28     36      40     45
  5     40      17     68      29      4      41     87
  6     41      18     27      30     23      42     68
  7     46      19     41      31     81      43     36
  8     89      20     17      32     47      44     31
  9     72      21     26      33     61      45     79
 10     45      22     75      34     27      46      7
 11     81      23     63      35     13      47     95
 12     93      24     93      36     25      48     37
As decision makers, we want to know whether this data series exhibits a pattern, and the ACF is the means to this end. If no pattern is discerned in the series, then it is said to be "white noise." As you know from our regression analysis discussions, residuals must not exhibit a pattern; hence, our residuals in regression analysis needed to be white noise. And as you will see in our later discussions on ARIMA methods, the residuals become very important in the estimation and diagnostic checking phases of the ARIMA methodology.
Sampling Distribution of Autocorrelations
Autocorrelations of a white noise series tend to have sampling distributions that are normally distributed, with a mean of zero and a standard error of 1/√n. The standard error is simply the reciprocal of the square root of the sample size. If the series is white noise, approximately 95% of the autocorrelation coefficients will fall within two (actually, 1.96) standard errors of the mean; if they don't, then the series is not white noise and a pattern does indeed exist.
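A quick sketch of this check in Python, using n = 48 and the six autocorrelation values reported below:

```python
import math

# Under white noise, each r_k has standard error 1/sqrt(n)
n = 48
se = 1 / math.sqrt(n)        # standard error of each r_k
bound = 1.96 * se            # ~95% band, roughly 0.283 for n = 48
r = [0.022, 0.098, 0.049, 0.036, 0.015, 0.068]   # first six lags, from the post
significant = [k for k, rk in enumerate(r, start=1) if abs(rk) > bound]
# 'significant' comes out empty: no lag contradicts white noise
```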
To see if our ACF exhibits a pattern, we look at our individual r_{k} values separately and develop a standard error formula to test whether each value for r_{k} is statistically different from zero. We do this by plotting our ACF:
The ACF is the plot of the lags (in blue) for the first 24 months of the series. The dashed red lines mark ±1.96 standard errors. If one or more lags pierce those dashed lines, then those lags are significantly different from zero and the series is not white noise. As you can see, none do, so this series is white noise.
Specifically, the values for the first six lags are:

Lag    Value
r_{1}  0.022
r_{2}  0.098
r_{3}  0.049
r_{4}  0.036
r_{5}  0.015
r_{6}  0.068
Apparently, there is no discernible pattern in the data: successive lags are only minimally correlated; in fact, there's a higher correlation between values two periods apart.
Portmanteau Tests
In the example above, we looked at each individual lag. An alternative would be to examine a whole set of r_{k} values, say the first 10 of them (r_{1} to r_{10}), all at once and then test whether the set is significantly different from a zero set. Such a test is known as a portmanteau test, and the two most common are the Box-Pierce test and the Ljung-Box Q^{*} statistic. We will discuss both of them here.
The Box-Pierce Test
Here is the Box-Pierce formula:

Q = n Σ_{k=1}^{h} r_{k}²
Q is the Box-Pierce test statistic, which we will compare against the χ^{2} distribution; n is the total number of observations; and h is the maximum lag we are considering (24 in the ACF plot).
Essentially, the Box-Pierce test indicates that if the residuals are white noise, the Q-statistic follows a χ^{2} distribution with (h – m) degrees of freedom, where m is the number of parameters in the fitted model. No model is fitted here, so m=0. If each r_{k} value is close to zero, then Q will be very small; if some r_{k} values are large – whether negative or positive – then Q will be relatively large. We compare Q to the χ^{2} distribution, just like any other significance test.
Since we plotted 24 lags, we are interested only in the r^{2}_{k} values for the first 24 lags (not shown). Our Q statistic is:
We have 24 degrees of freedom, and so we compare our Q statistic to the χ^{2} distribution. Our critical χ^{2} value for a 1% significance level is 42.98, well above our Q statistic, leading us to conclude that our chosen set of r^{2}_{k} values is not significantly different from a zero set.
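Here is a sketch of the computation in Python. Because only the first six r_{k} values are quoted in the post, I use h = 6 rather than the 24 lags used above, so this Q is smaller than the one implied in the text:

```python
# Box-Pierce Q = n * sum of r_k^2 over lags 1..h
n = 48
r = [0.022, 0.098, 0.049, 0.036, 0.015, 0.068]   # r_1..r_6 from the post
Q = n * sum(rk ** 2 for rk in r)                 # here h = 6, and m = 0
# Compare Q with the chi-square distribution on h - m = 6 degrees of freedom;
# the 1% critical value for 6 d.f. is about 16.81, far above this Q.
```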
The Ljung-Box Q^{*} Statistic
In 1978, Ljung and Box proposed a statistic they believed was a closer approximation to the χ^{2} distribution than the Box-Pierce Q statistic: the alternative Q^{*} statistic. The formula for the Ljung-Box Q^{*} statistic is:

Q^{*} = n(n+2) Σ_{k=1}^{h} r_{k}²/(n – k)
For our r^{2}_{k} values, that is reflected in:
We get Q^{*} = 24.92. Comparing this to the same critical χ^{2} value, our statistic is still not significant. If the data are white noise, the Q^{*} and Q statistics will have the same asymptotic distribution. It's important to note, however, that portmanteau tests have a tendency to fail to reject poorly fit models, so you shouldn't rely solely on them for accepting models.
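The same six quoted r_{k} values can be run through the Ljung-Box formula as a sketch. Note that each term is inflated by (n + 2)/(n – k) > 1, so Q^{*} always comes out a bit larger than the Box-Pierce Q computed on the same lags:

```python
n = 48
r = [0.022, 0.098, 0.049, 0.036, 0.015, 0.068]   # r_1..r_6 from the post
q_bp = n * sum(rk ** 2 for rk in r)              # Box-Pierce Q
q_lb = n * (n + 2) * sum(rk ** 2 / (n - k)       # Ljung-Box Q*
                         for k, rk in enumerate(r, start=1))
```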
The Partial Autocorrelation Coefficient
When we do multiple regression analysis, we are sometimes interested in finding out how much explanatory power one variable has by itself. To do this, we partial out the effects of the other independent variables. We can do something similar in time series analysis, with the use of partial autocorrelations.
Partial autocorrelations measure the degree of association between various lags when the effects of the intervening lags are removed. If the autocorrelation between Y_{t} and Y_{t-1} is significant, then we will also see a similar significant autocorrelation between Y_{t-1} and Y_{t-2}, as they too are one period apart. Since both Y_{t} and Y_{t-2} are correlated with Y_{t-1}, they are also correlated with each other; by removing the effect of Y_{t-1}, we can measure the true correlation between Y_{t} and Y_{t-2}.
A partial autocorrelation coefficient of order k, which is denoted by α_{k}, is determined by regressing the current time series value on its lagged values:

Y_{t} = b_{0} + b_{1}Y_{t-1} + b_{2}Y_{t-2} + … + b_{k}Y_{t-k} + e_{t}
As I mentioned earlier, this form of equation is an autoregressive (AR) one, since its independent variables are time-lagged values of the dependent variable. We use this multiple regression to find the partial autocorrelation α_{k}. If we regress Y_{t} only against Y_{t-1}, then we derive our value for α_{1}. If we regress Y_{t} against both Y_{t-1} and Y_{t-2}, then we'll derive values for both α_{1} and α_{2}.
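A minimal sketch of that procedure: regress Y_{t} on its first k lags with ordinary least squares and read off the last coefficient as α_{k}. The series is the first 24 months from the table above; the function name is my own.

```python
import numpy as np

def pacf_by_regression(y, k):
    """Estimate alpha_k: the coefficient on Y_{t-k} when Y_t is regressed
    on an intercept and its first k lags."""
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept column plus lagged columns Y_{t-1}..Y_{t-k}
    X = np.array([[1.0] + [y[t - j] for j in range(1, k + 1)]
                  for t in range(k, len(y))])
    coeffs, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return coeffs[k]   # the last coefficient is the partial autocorrelation

# First 24 months of the series from the table above
y = [1, 20, 31, 8, 40, 41, 46, 89, 72, 45, 81, 93,
     41, 63, 17, 96, 68, 27, 41, 17, 26, 75, 63, 93]
alphas = [pacf_by_regression(y, k) for k in (1, 2, 3)]
```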
Then, as we did for the autocorrelation coefficients, we plot our partial autocorrelation coefficients. This plot is called, not surprisingly, a partial autocorrelation function (PACF).
Let’s assume we wanted to measure the partial autocorrelations for the first 12 months of our data series. We generate the following PACF:
Since all the lags fall within their ±1.96 standard errors, our PACF is also indicative of a white noise series. Also, note that α_{1} in the PACF is always equal to r_{1} in the ACF.
Seasonality
Our data series exhibited no pattern, despite its monthly nature. This is unusual for many time series, especially when you consider retail sales data. Monthly retail sales will exhibit a strong seasonal component, which will show up in your ACF as an r_{k} value that breaks through the critical value line at the seasonal lag, and at multiples of that lag. So, if sales are busiest in month 12, you can expect to see ACFs with significant values at lags 12, 24, 36, and so on. You'll see examples of this in subsequent posts on ARIMA.
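To illustrate with made-up data: a series built from a deterministic period-12 seasonal swing produces large positive autocorrelations at lags 12 and 24, and a strongly negative one at lag 6 (half the period). This is only a sketch; real retail sales would also carry trend and noise.

```python
import numpy as np

def acf(y, k):
    """Autocorrelation r_k of series y at lag k."""
    ybar = y.mean()
    return np.sum((y[k:] - ybar) * (y[:len(y) - k] - ybar)) / np.sum((y - ybar) ** 2)

# Hypothetical "monthly sales" with a strong period-12 seasonal component
t = np.arange(120)
y = 100 + 30 * np.sin(2 * np.pi * t / 12)

r6, r12, r24 = acf(y, 6), acf(y, 12), acf(y, 24)
# r12 and r24 are large and positive; r6, half a period out of phase, is negative
```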
Next Forecast Friday Topic: Stationarity of Time Series Data
As mentioned earlier, a time series must be stationary for forecasting. Next week, you'll see how the ACF and PACF are used to determine whether a time series exhibits stationarity, as we move on towards our discussion of ARIMA methodology.
*************************
Start the New Year on the Right Foot: Follow us on Facebook and Twitter!
For the latest insights on marketing research, predictive modeling, and forecasting, be sure to check out Analysights on Facebook and Twitter! "Liking" us on Facebook and following us on Twitter will allow you to stay informed of each new Insight Central post published, new information about analytics, discussions Analysights will be hosting, and other opportunities for feedback. So get this New Year off right and check us out on Facebook and Twitter!