

a. Basic Definitions and Theorems about ARIMA models
First we define some important concepts. A stochastic process (c.q. probabilistic process) is defined by a T-dimensional distribution function.

Before analyzing the structure of a time series model one must make sure that the time series is stationary with
respect to the variance and with respect to the mean. First, we will assume statistical stationarity of all time series
(later on, this restriction will be relaxed).
Statistical stationarity of a time series implies that the marginal probability distribution is time-independent
which means that:

the expected values and variances are constant

E(X_t) = μ and var(X_t) = σ² < ∞ for all t = 1, 2, ..., T

where T is the number of observations in the time series;

the autocovariances (and autocorrelations) must be constant

cov(X_t, X_{t+k}) = γ_k for all t

where k is an integer time-lag;

the variables have a joint normal distribution f(X_1, X_2, ..., X_T) with a marginal normal distribution in each dimension

If only this last condition is not met, the process is said to be weakly stationary.
Now it is possible to define white noise as a stochastic process (which is statistically stationary) defined by a
marginal distribution function (V.I.1-1), where all X_t are independent variables (with zero covariances), with a
joint normal distribution f(X_1, X_2, ..., X_T), and with

E(X_t) = μ, var(X_t) = σ², and γ_k = 0 for all k ≠ 0.

It is obvious from this definition that for any white noise process the probability function can be written as

f(X_1, X_2, ..., X_T) = f(X_1) f(X_2) ... f(X_T).
Define the autocovariance as

γ_k = cov(X_t, X_{t+k}) = E[(X_t − μ)(X_{t+k} − μ)]

whereas the autocorrelation is defined as

ρ_k = γ_k / γ_0.

In practice, however, we only have the sample observations at our disposal. Therefore we use the sample estimates

c_k = (1/T) Σ_{t=1}^{T−k} (X_t − X̄)(X_{t+k} − X̄) and r_k = c_k / c_0

for any integer k.
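As an illustrative sketch (in Python; the function names are our own), the sample estimates defined above can be computed directly from the observations:

```python
def sample_acov(x, k):
    # c_k = (1/T) * sum_{t=1}^{T-k} (x_t - xbar)(x_{t+k} - xbar)
    T = len(x)
    xbar = sum(x) / T
    return sum((x[t] - xbar) * (x[t + k] - xbar) for t in range(T - k)) / T

def sample_acf(x, k):
    # r_k = c_k / c_0
    return sample_acov(x, k) / sample_acov(x, 0)

# r_0 equals 1 by construction
x = [2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
print(round(sample_acf(x, 0), 6))  # 1.0
```

Note that the divisor T (rather than T − k) guarantees a positive definite autocovariance sequence, in line with the positive definiteness result below.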
Remark that the autocovariance matrix and the autocorrelation matrix associated with a stationary stochastic
process

l_1 X_{t_1} + l_2 X_{t_2} + ... + l_n X_{t_n}

are always positive definite, which can be easily shown since a linear combination of the stochastic variables

has a variance of

Σ_i Σ_j l_i l_j γ_{|t_i − t_j|}

which is always positive.
This implies for instance for T = 3 that

−1 < ρ_1 < 1, −1 < ρ_2 < 1, and ρ_1² < (1 + ρ_2)/2.

Bartlett proved that the variance of the autocorrelation of a stationary normal stochastic process can be formulated as

var(r_k) ≈ (1/T) Σ_{v=−∞}^{+∞} (ρ_v² + ρ_{v+k} ρ_{v−k} − 4 ρ_k ρ_v ρ_{v−k} + 2 ρ_v² ρ_k²).

This expression can be shown to reduce to

var(r_k) ≈ (1/T) [(1 + φ²)(1 − φ^{2k}) / (1 − φ²) − 2k φ^{2k}]

if the autocorrelation coefficients decrease exponentially like ρ_k = φ^k with |φ| < 1.
Since the autocorrelations for i > q (a natural number) are equal to zero, expression (V.I.1-17) can be shown to
be reformulated as

var(r_k) ≈ (1/T) (1 + 2 Σ_{v=1}^{q} ρ_v²) for k > q

which is the so-called large-lag variance. Now it is possible to vary q from 1 to any desired integer number of
autocorrelations, replace the theoretical correlations by their sample estimates, and compute the square root of
(V.I.1-20) to find the standard deviation of the sample autocorrelation.
Note that the standard deviation of one autocorrelation coefficient is almost always approximated by

s.d.(r_k) ≈ 1/√T.
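A minimal sketch of this computation (Python; it assumes the sample autocorrelations are available in a list r with r[0] = 1, and that ρ_i = 0 for i > q):

```python
import math

def large_lag_se(r, q, T):
    # Large-lag standard error: sqrt((1/T) * (1 + 2 * sum_{i=1}^{q} r_i^2)),
    # valid for lags k > q when the autocorrelations vanish beyond lag q.
    return math.sqrt((1.0 + 2.0 * sum(ri * ri for ri in r[1:q + 1])) / T)

# With q = 0 this collapses to the usual 1/sqrt(T) approximation
print(round(large_lag_se([1.0], 0, 100), 12))  # 0.1
```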
The covariances between autocorrelation coefficients have also been deduced by Bartlett

cov(r_k, r_{k+s}) ≈ (1/T) Σ_{v=−∞}^{+∞} ρ_v ρ_{v+s}

which is a good indicator for dependencies between autocorrelations. Remember, therefore, that inter-correlated
autocorrelations can seriously distort the picture of the autocorrelation function (ACF c.q. autocorrelations as a
function of the time-lag).
It is however possible to remove the intervening correlations between X_t and X_{t+k} by defining a partial
autocorrelation function (PACF).
The partial autocorrelation coefficients are defined as the last coefficient of a partial autoregression equation of
order k

X_t = φ_{k1} X_{t−1} + φ_{k2} X_{t−2} + ... + φ_{kk} X_{t−k} + e_t.
It is obvious that there exists a relationship between the PACF and the ACF since (V.I.1-23) can be rewritten

X_{t+j} = φ_{k1} X_{t+j−1} + φ_{k2} X_{t+j−2} + ... + φ_{kk} X_{t+j−k} + e_{t+j}

or (on taking expectations and dividing by the variance)

ρ_j = φ_{k1} ρ_{j−1} + φ_{k2} ρ_{j−2} + ... + φ_{kk} ρ_{j−k} (j = 1, 2, ..., k).

Sometimes (V.I.1-25) is written in matrix formulation according to the Yule-Walker relations

or simply

P_k φ_k = ρ_k.

Solving (V.I.1-27) according to Cramer's Rule yields

φ_{kk} = |P_k*| / |P_k|.

Note that the determinant of the numerator contains the same elements as the determinant of the denominator,
except for the last column that has been replaced.
A practical numerical estimation algorithm for the PACF is given by Durbin

φ_{k+1,k+1} = (ρ_{k+1} − Σ_{j=1}^{k} φ_{kj} ρ_{k+1−j}) / (1 − Σ_{j=1}^{k} φ_{kj} ρ_j)
φ_{k+1,j} = φ_{kj} − φ_{k+1,k+1} φ_{k,k+1−j} (j = 1, 2, ..., k).

The standard error of a partial autocorrelation coefficient for k > p (where p is the order of the
autoregressive data generating process; see later) is given by

s.d.(φ̂_{kk}) ≈ 1/√T.
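Durbin's recursion can be sketched as follows (Python; rho is a list of theoretical or estimated autocorrelations with rho[0] = 1; the function name is our own):

```python
def pacf_from_acf(rho, K):
    # Returns [phi_11, phi_22, ..., phi_KK] via Durbin's recursion.
    phi = {}            # phi[(k, j)]: j-th coefficient of the order-k autoregression
    phi[(1, 1)] = rho[1]
    pacf = [rho[1]]
    for k in range(1, K):
        num = rho[k + 1] - sum(phi[(k, j)] * rho[k + 1 - j] for j in range(1, k + 1))
        den = 1.0 - sum(phi[(k, j)] * rho[j] for j in range(1, k + 1))
        nxt = num / den
        for j in range(1, k + 1):
            phi[(k + 1, j)] = phi[(k, j)] - nxt * phi[(k, k + 1 - j)]
        phi[(k + 1, k + 1)] = nxt
        pacf.append(nxt)
    return pacf

# For an AR(1) with rho_k = 0.5**k the PACF should cut off after lag 1
rho = [0.5 ** k for k in range(4)]
print([round(v, 6) for v in pacf_from_acf(rho, 3)])  # [0.5, 0.0, 0.0]
```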
Finally, we define the following polynomial lag-processes

φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p

where B is the backshift operator (c.q. B^k Y_t = Y_{t−k}) and where

θ(B) = 1 − θ_1 B − θ_2 B² − ... − θ_q B^q.
These polynomial expressions are used to define linear filters. By definition a linear filter

ψ(B) = 1 + ψ_1 B + ψ_2 B² + ...

generates a stochastic process

X_t = μ + ψ(B) a_t

where a_t is a white noise variable.
A simple special case is

X_t = X_{t−1} + a_t

for which the following is obvious

We call eq. (V.I.1-36) the random-walk model: a model that describes time series that are fluctuating around
the previous observation in the short and in the long run (since a_t is white noise).
It is interesting to note that a random-walk is normally distributed. This can be proved by using the definition of
white noise and computing the moment generating function of the random-walk


from which we deduce

A deterministic trend is generated by a random-walk model with an added constant

X_t = X_{t−1} + c + a_t.

The trend can be illustrated by re-expressing (V.I.1-41) as

X_t = X_0 + ct + Σ_{i=1}^{t} a_i

where ct is a linear deterministic trend (as a function of time).
The linear filter (V.I.1-35) is normally distributed with

due to the additivity property of eq. (I.III-33), (I.III-34), and (I.III-35) applied to a_t.
Now the autocorrelation of a linear filter can be quite easily computed as



Now it is quite evident that, if the linear filter (V.I.1-35) generates the variable X_t, then X_t is a stationary
stochastic process ((V.I.1-1) - (V.I.1-3)) defined by a normal distribution (V.I.1-4) (and therefore strongly
stationary), and an autocovariance function (V.I.1-45) which is only dependent on the time-lag k.
The set of equations resulting from a linear filter (V.I.1-35) with ACF (V.I.1-44) is sometimes called stochastic
difference equations. These stochastic difference equations can be used in practice to forecast (economic) time
series. The forecasting function is given by

On using (V.I.1-35), the density of the forecasting function (V.I.1-47) is


is known, and therefore equal to a constant term. Therefore it is obvious that


The concepts defined and described above are all time-related. This implies for instance that autocorrelations are
defined as a function of time. Historically, this time-domain viewpoint was preceded by the frequency-domain
viewpoint, where it is assumed that time series consist of sine and cosine waves at different frequencies.
In practice each viewpoint has its advantages and disadvantages; nevertheless, both should be seen as
complementary to each other.

The least squares estimates below are derived for the Fourier series model

X_t = a_0 + Σ_{i=1}^{q} [a_i cos(2π f_i t) + b_i sin(2π f_i t)] + e_t (with T = 2q + 1 observations).

In (V.I.1-53) we define the harmonic frequencies

f_i = i/T (i = 1, 2, ..., q).

The least squares estimates of the parameters in (V.I.1-52) are computed by

a_0 = X̄
a_i = (2/T) Σ_{t=1}^{T} X_t cos(2π f_i t)
b_i = (2/T) Σ_{t=1}^{T} X_t sin(2π f_i t).

In case of a time series with an even number of observations T = 2q the same definitions are applicable except

a_q = (1/T) Σ_{t=1}^{T} (−1)^t X_t and b_q = 0.
It can furthermore be shown that


such that



It is also possible to show that








which state the orthogonality properties of sinusoids and which can be proved. Remark that (V.I.1-67) is a
special case of (V.I.1-64) and (V.I.1-68) is a special case of (V.I.1-66). Particularly eq. (V.I.1-66) is interesting for
our discussion with regard to (V.I.1-60) and (V.I.1-53), since it states that sinusoids are independent.
If (V.I.1-52) is redefined as

I(f_i) = (T/2)(a_i² + b_i²)

then I(f) is called the sample spectrum.
The sample spectrum is in fact a Fourier cosine transformation of the autocovariance function estimate. Denote
the covariance-estimate of (V.I.1-7) by the sample-covariance c_k (c.q. the numerator of (V.I.1-10)), the complex
number by i, and the frequency by f; then

On using (V.I.1-55) and (V.I.1-70) it follows that

which can be substituted into (V.I.1-70) yielding

Now from (V.I.1-10) it follows

and if (t - t') is substituted by k then (V.I.1-72) becomes

which proves the link between the sample spectrum and the estimated autocovariance function.
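This link can be verified numerically. The sketch below (Python; our own function names, and one possible set of scaling conventions, since constant factors differ across texts) computes the spectrum once directly from the data and once as a cosine transform of the sample autocovariances:

```python
import cmath, math

def acov(x, k):
    # Sample autocovariance with divisor T
    T = len(x); m = sum(x) / T
    return sum((x[t] - m) * (x[t + k] - m) for t in range(T - k)) / T

def spectrum_direct(x, f):
    # (2/T) * |sum_t (x_t - xbar) * exp(-2 pi i f t)|^2
    T = len(x); m = sum(x) / T
    s = sum((x[t] - m) * cmath.exp(-2j * math.pi * f * t) for t in range(T))
    return 2.0 / T * abs(s) ** 2

def spectrum_from_acov(x, f):
    # 2 * (c_0 + 2 * sum_{k>=1} c_k * cos(2 pi f k))
    T = len(x)
    return 2.0 * (acov(x, 0) + 2.0 * sum(acov(x, k) * math.cos(2 * math.pi * f * k)
                                         for k in range(1, T)))

x = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 2.0, 3.0]
print(abs(spectrum_direct(x, 0.125) - spectrum_from_acov(x, 0.125)) < 1e-9)  # True
```

The identity holds at every frequency, not only at the harmonic frequencies, because the divisor-T autocovariances recover exactly the squared modulus of the finite Fourier transform.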
On taking expectations of the spectrum we obtain

for which it can be shown that

On combining (V.I.1-75) and (V.I.1-76) and on defining the power spectrum as p(f) we find

p(f) = 2 [γ_0 + 2 Σ_{k=1}^{∞} γ_k cos(2π f k)], 0 ≤ f ≤ 1/2.
It is quite obvious that

so that it follows that the power spectrum converges if the covariance decreases rather quickly. The power
spectrum is a Fourier cosine transformation of the (population) autocovariance function. This implies that for any
theoretical autocovariance function (cfr. the following sections) a respective theoretical power spectrum can be derived.
Of course the power spectrum can be reformulated with respect to autocorrelations instead of autocovariances

g(f) = p(f) / γ_0 = 2 [1 + 2 Σ_{k=1}^{∞} ρ_k cos(2π f k)]

which is the so-called spectral density function.
Since

∫_0^{1/2} g(f) df = 1

it follows that g(f) integrates to unity, and since g(f) > 0 the properties of g(f) are quite similar to those of a frequency distribution function.
Since it can be shown that the sample spectrum fluctuates wildly around the theoretical power spectrum, a
modified (c.q. smoothed) estimate of the power spectrum is suggested as

p̂(f) = 2 [c_0 + 2 Σ_{k=1}^{M} λ_k c_k cos(2π f k)]

where λ_k is a lag window (a weighting function) and M < T is the truncation point.
b. The AR(1) process
The AR(1) process is defined as

W_t = φ_1 W_{t−1} + e_t

where W_t is a stationary time series, e_t is a white noise error term, and F_t is called the forecasting function. Now
we derive the theoretical pattern of the ACF of an AR(1) process for identification purposes.
First, we note that (V.I.1-83) may be alternatively written in the form

(1 − φ_1 B) W_t = e_t.

Second, we multiply the AR(1) process in (V.I.1-83) by W_{t−k} in expectations form

E(W_{t−k} W_t) = φ_1 E(W_{t−k} W_{t−1}) + E(W_{t−k} e_t).
Since we know that for k = 0 the RHS of eq. (V.I.1-85) may be rewritten as

φ_1 γ_1 + σ_e²

and that for k > 0 the RHS of eq. (V.I.1-85) is

φ_1 γ_{k−1}

we may write the LHS of (V.I.1-85) as

γ_0 = φ_1 γ_1 + σ_e² and γ_k = φ_1 γ_{k−1} for k > 0.

From (V.I.1-88) we deduce

ρ_k = φ_1^k (k ≥ 0).
(figure V.I.1-1)
We can now easily observe what the theoretical ACF of an AR(1) process should look like. Note that we have
already added the theoretical PACF of the AR(1) process since the first partial autocorrelation coefficient is
exactly equivalent to the first autocorrelation coefficient.
In general, a linear filter process is stationary if the ψ(B) polynomial converges.
Remark that the AR(1) process is stationary if the solution of (1 − φ_1 B) = 0 is larger in absolute value than 1 (c.q.
the root of the AR polynomial lies outside the unit circle).
This solution is B = 1/φ_1. Hence, if the absolute value of the AR(1) parameter is less than 1, the model is stationary,
which can be illustrated by the fact that

W_t = e_t + φ_1 e_{t−1} + φ_1² e_{t−2} + ...

converges only if |φ_1| < 1.
For a general AR(p) model the solutions of

φ(B) = 1 − φ_1 B − φ_2 B² − ... − φ_p B^p = 0

must satisfy

|B_i| > 1 for all i = 1, ..., p (c.q. all roots lie outside the unit circle)

in order to obtain stationarity.
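This root condition is easy to check numerically. A sketch (Python with numpy; the function name is our own):

```python
import numpy as np

def is_stationary(phi):
    # phi = [phi_1, ..., phi_p]; the roots of 1 - phi_1 z - ... - phi_p z^p
    # must all lie outside the unit circle.
    poly = np.r_[-np.array(phi)[::-1], 1.0]   # coefficients, highest degree first
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.5]))        # True : AR(1) with |phi_1| < 1
print(is_stationary([1.2]))        # False
print(is_stationary([0.5, 0.3]))   # True
```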

c. The AR(2) process
The AR(2) process is defined as

W_t = φ_1 W_{t−1} + φ_2 W_{t−2} + e_t

where W_t is a stationary time series, e_t is a white noise error term, and F_t is the forecasting function.
The process defined in (V.I.1-94) can be written in the form

W_t = ψ(B) e_t

and therefore

(1 − φ_1 B − φ_2 B²) ψ(B) e_t = e_t.

Now, for (V.I.1-96) to be valid, it easily follows that

ψ_0 = 1

and that

ψ_1 = φ_1

and that

ψ_2 = φ_1 ψ_1 + φ_2

and finally that

ψ_j = φ_1 ψ_{j−1} + φ_2 ψ_{j−2} for j ≥ 2.
The model is stationary if the ψ-weights converge. This is the case when certain conditions on φ_1 and φ_2 are
imposed. These conditions can be found by using the solutions of the polynomial of the AR(2) model. The so-
called characteristic equation is used to find these solutions

z² − φ_1 z − φ_2 = 0.

The solutions z_1 and z_2 are

z = (φ_1 ± √(φ_1² + 4 φ_2)) / 2

which can be either real or complex. Notice that the roots are complex if

φ_1² + 4 φ_2 < 0.

When these solutions are, in absolute value, smaller than 1, the AR(2) model is stationary.
Later, it will be shown that these conditions are satisfied if φ_1 and φ_2 lie in a (Stralkowski) triangular
region restricted by

φ_2 + φ_1 < 1
φ_2 − φ_1 < 1
−1 < φ_2 < 1.
The derivation of the theoretical ACF and PACF for an AR(2) model is described below.
On multiplying the AR(2) model by W_{t−k}, and taking expectations we obtain

γ_k = φ_1 γ_{k−1} + φ_2 γ_{k−2} (k > 0).

From (V.I.1-97) and (V.I.1-98) it follows that

ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} (k > 0).

Now it is possible to combine (V.I.1-104) with (V.I.1-105) such that

ρ_1 = φ_1 + φ_2 ρ_1

from which it follows that

ρ_1 = φ_1 / (1 − φ_2) and ρ_2 = φ_2 + φ_1² / (1 − φ_2).

Eq. (V.I.1-106) can be rewritten as

φ_1 = ρ_1 (1 − φ_2)

such that on using (V.I.1-108) it is obvious that the parameters can be expressed in terms of the first two
autocorrelations.
According to (V.I.1-107) the ACF is a second order stochastic difference equation of the form

ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2}

where (due to (V.I.1-108))

ρ_0 = 1 and ρ_1 = φ_1 / (1 − φ_2)

are starting values of the difference equation.
In general, the solution to the difference equation is, according to Box and Jenkins (1976), given by


In particular, three different cases can be worked out for the solutions of the difference equation


of (V.I.1-102). The general solution of eq. (V.I.1-113) can be written in the form


Remark that for this case the following stationarity conditions must hold


has two solutions

due to (V.I.1-114) and

due to


Hence we find the general solution to the difference equation

In order to impose convergence the following must hold

Hence two conditions have to be satisfied

which describes a part of a parabola consisting of acceptable parameter values for

Remark that this parabola is the frontier between acceptable real-valued and acceptable complex roots (cf. the
Stralkowski triangle).

in trigonometric notation.
The general solution for the second-order difference equation can be found by

On defining

the ACF can be shown to be real-valued since

On using the property

eq. (V.I.1-126) becomes


In eq. (V.I.1-128) it is shown that the ACF is oscillating with period 2π/θ_0 and a variable amplitude of

(−φ_2)^{k/2}

as a function of k.
A useful equation can be found to compute the period of the pseudo-periodic behavior of the time series as

cos θ_0 = φ_1 / (2 √(−φ_2)), with period P = 2π/θ_0

which must satisfy the convergence condition (c.q. the amplitude is exponentially decreasing)

0 < √(−φ_2) < 1.

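The period computation can be sketched as follows (Python; it assumes the convention that the complex roots are written d e^{±iθ_0} with d = √(−φ_2), a standard result, though notation varies across texts):

```python
import math

def ar2_pseudo_period(phi1, phi2):
    # Valid only for complex roots, i.e. phi1^2 + 4*phi2 < 0;
    # damping factor of the ACF amplitude is sqrt(-phi2).
    assert phi1 * phi1 + 4.0 * phi2 < 0.0
    theta0 = math.acos(phi1 / (2.0 * math.sqrt(-phi2)))
    return 2.0 * math.pi / theta0

# phi1 = 1.0, phi2 = -0.5: cos(theta0) = 1/(2*sqrt(0.5)) -> theta0 = pi/4
print(round(ar2_pseudo_period(1.0, -0.5), 6))  # 8.0
```

A pseudo-period of 8 means the damped sine wave in the ACF completes one full cycle every 8 lags.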
The pattern of the theoretical PACF can be deduced from relations (V.I.1-25) - (V.I.1-28).
The theoretical ACF and PACF are illustrated below. Figure (V.I.1-2) contains two possible ACF and PACF patterns
for real roots while figure (V.I.1-3) shows the ACF and PACF patterns when the roots are complex.

(figure V.I.1-2)

(figure V.I.1-3)

d. The AR(p) process
An AR(p) process is defined by

W_t = φ_1 W_{t−1} + φ_2 W_{t−2} + ... + φ_p W_{t−p} + e_t

where W_t is a stationary time series, e_t is a white noise error component, and F_t is the forecasting function.
As described above, the AR(p) process can be written

W_t = ψ(B) e_t with ψ(B) = (1 − φ_1 B − ... − φ_p B^p)^{−1}.

The ψ-weights converge if the stationarity conditions on the roots of the characteristic equation

φ(B) = 0

are satisfied.

The variance can be shown to be

γ_0 = σ_e² / (1 − ρ_1 φ_1 − ρ_2 φ_2 − ... − ρ_p φ_p)

which can be used to study the behavior of the theoretical ACF pattern.
Remember that the Yule-Walker relations (V.I.1-26) and (V.I.1-27) hold for all AR(p) models. These can be
used (together with the application of Cramer's Rule (V.I.1-28)) to derive the theoretical PACF pattern from the
theoretical ACF function.

e. The MA(1) process
The definition of the MA(1) process is given by

W_t = e_t − θ_1 e_{t−1}

where W_t is a stationary time series, e_t is a white noise error component, and F_t is the forecasting function.
On substituting the MA(1) process into eq. (V.I.1-46) and (V.I.1-45) we obtain

γ_0 = σ_e² (1 + θ_1²) and γ_1 = −θ_1 σ_e².
Therefore the pattern of the theoretical ACF is

ρ_1 = −θ_1 / (1 + θ_1²), ρ_k = 0 for k > 1.

Note that from eq. (V.I.1-141) it follows that

ρ_1(θ_1) = ρ_1(1/θ_1).

This implies that there exist at least two MA(1) processes which generate the same theoretical ACF.
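This ambiguity is easy to verify numerically (a Python sketch):

```python
def ma1_rho1(theta):
    # rho_1 = -theta / (1 + theta**2); rho_k = 0 for k > 1
    return -theta / (1.0 + theta * theta)

# theta and 1/theta yield the same ACF; the invertibility restriction
# (|theta| < 1, see below) picks out one of the two processes.
print(abs(ma1_rho1(0.5) - ma1_rho1(2.0)) < 1e-12)  # True
print(round(ma1_rho1(0.5), 3))                     # -0.4
```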
Since an MA process consists of a finite number of ψ-weights it follows that the process is always stationary.
However, it is necessary to impose the so-called invertibility restrictions such that the MA(q) process can be
rewritten as an AR(∞) model.

On using the Yule-Walker equations and eq. (V.I.1-141) it can be shown that the theoretical PACF is

φ_kk = −θ_1^k (1 − θ_1²) / (1 − θ_1^{2(k+1)}).

Hence the theoretical PACF is dominated by an exponential function which decreases with the lag.

The theoretical ACF and PACF for the MA(1) are illustrated in figure (V.I.1-4).

(figure V.I.1-4)

f. The MA(2) process
By definition the MA(2) process is

W_t = e_t − θ_1 e_{t−1} − θ_2 e_{t−2}

which can be rewritten on using (V.I.1-139)

W_t = (1 − θ_1 B − θ_2 B²) e_t

where W_t is a stationary time series, e_t is a white noise error component, and F_t is the forecasting function.
On substituting the MA(2) process into eq. (V.I.1-46) and (V.I.1-45) we obtain

γ_0 = σ_e² (1 + θ_1² + θ_2²), γ_1 = σ_e² (−θ_1 + θ_1 θ_2), γ_2 = −θ_2 σ_e².
Hence the theoretical ACF can be deduced

ρ_1 = (−θ_1 + θ_1 θ_2) / (1 + θ_1² + θ_2²), ρ_2 = −θ_2 / (1 + θ_1² + θ_2²), ρ_k = 0 for k > 2.

The invertibility conditions can be shown to be

θ_2 + θ_1 < 1, θ_2 − θ_1 < 1, −1 < θ_2 < 1
(compare with the stationarity conditions of the AR(2) process).
The deduction of the theoretical PACF is rather complicated but can be shown to be dominated by the sum of
two exponentials (in case of real roots), or by decreasing sine waves (in case the roots are complex).
These two possible cases are shown in figures (V.I.1-5) and (V.I.1-6).

(figure V.I.1-5)

(figure V.I.1-6)

g. The MA(q) process
The MA(q) process is defined by

W_t = e_t − θ_1 e_{t−1} − θ_2 e_{t−2} − ... − θ_q e_{t−q}

where W_t is a stationary time series, e_t is a white noise error component, and F_t is the forecasting function.
Remark that this description of the MA(q) process is not straightforward to use for forecasting purposes due to its
recursive character.
On substituting the MA(q) process into (V.I.1-46) and (V.I.1-45), the following autocovariances can be deduced

γ_k = σ_e² (−θ_k + θ_1 θ_{k+1} + ... + θ_{q−k} θ_q) for k = 1, ..., q, and γ_k = 0 for k > q.

Hence the theoretical ACF is

ρ_k = (−θ_k + θ_1 θ_{k+1} + ... + θ_{q−k} θ_q) / (1 + θ_1² + ... + θ_q²) for k = 1, ..., q, and ρ_k = 0 for k > q.

The theoretical PACF for higher order MA(q) processes is extremely complicated and not extensively discussed
in the literature.
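The autocovariances above follow from the general linear-filter formula γ_k = σ² Σ_j ψ_j ψ_{j+k} with ψ_0 = 1 and ψ_j = −θ_j; a Python sketch (our own function name):

```python
def ma_acf(theta, k):
    # W_t = e_t - theta_1 e_{t-1} - ... - theta_q e_{t-q}
    # psi-weights: psi_0 = 1, psi_j = -theta_j for j = 1..q;
    # rho_k = (sum_j psi_j psi_{j+k}) / (sum_j psi_j^2), zero beyond lag q.
    psi = [1.0] + [-t for t in theta]
    q = len(theta)
    if k > q:
        return 0.0
    gk = sum(psi[j] * psi[j + k] for j in range(q + 1 - k))
    g0 = sum(p * p for p in psi)
    return gk / g0

# MA(1) cross-check against rho_1 = -theta / (1 + theta^2)
print(round(ma_acf([0.5], 1), 3))  # -0.4
print(ma_acf([0.5, 0.4], 3))       # 0.0
```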


h. The ARMA(1,1) process
On combining an AR(1) and a MA(1) process one obtains an ARMA(1,1) model which is defined as

W_t = φ_1 W_{t−1} + e_t − θ_1 e_{t−1}

where W_t is a stationary time series, e_t is a white noise error component, and F_t is the forecasting function.
Note that the model of (V.I.1-154) may alternatively be written as

(1 − φ_1 B) W_t = (1 − θ_1 B) e_t

such that

W_t = (1 − φ_1 B)^{−1} (1 − θ_1 B) e_t = ψ(B) e_t

in ψ-weight notation.
The ψ-weights can be related to the ARMA parameters on using

(1 − φ_1 B) ψ(B) = (1 − θ_1 B)

such that the following is obtained

ψ_j = (φ_1 − θ_1) φ_1^{j−1} for j ≥ 1 (with ψ_0 = 1).

Also the π-weights can be related to the ARMA parameters on using

(1 − θ_1 B) π(B) = (1 − φ_1 B)

such that the following is obtained

π_j = (φ_1 − θ_1) θ_1^{j−1} for j ≥ 1.

From (V.I.1-158) and (V.I.1-160) it can be clearly seen that an ARMA(1,1) is in fact a parsimonious description of
either an AR or a MA process with an infinite number of weights. This does not imply that all higher order AR(p) or
MA(q) processes may be written as an ARMA(1,1). Though, in practice an ARMA process (c.q. a mixed model) is,
quite frequently, capable of capturing higher order pure-AR π-weights or pure-MA ψ-weights.
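The closed-form ψ-weights of (V.I.1-158) can be cross-checked against the recursion ψ_1 = φ_1 − θ_1, ψ_j = φ_1 ψ_{j−1} implied by (1 − φ_1 B) ψ(B) = (1 − θ_1 B); a Python sketch with our own function names:

```python
def arma11_psi(phi1, theta1, J):
    # Closed form: psi_0 = 1, psi_j = (phi1 - theta1) * phi1**(j-1) for j >= 1
    return [1.0] + [(phi1 - theta1) * phi1 ** (j - 1) for j in range(1, J + 1)]

def arma11_psi_recursive(phi1, theta1, J):
    # Same weights from the recursion psi_1 = phi1 - theta1, psi_j = phi1 * psi_{j-1}
    psi = [1.0, phi1 - theta1]
    for _ in range(2, J + 1):
        psi.append(phi1 * psi[-1])
    return psi

a = arma11_psi(0.8, 0.3, 5)
b = arma11_psi_recursive(0.8, 0.3, 5)
print(all(abs(u - v) < 1e-12 for u, v in zip(a, b)))  # True
```

The geometric decay of the ψ-weights at rate φ_1 is exactly the parsimony argument made above: one AR and one MA parameter summarize an infinite sequence of weights.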
On writing the ARMA(1,1) process as

W_t − φ_1 W_{t−1} = e_t − θ_1 e_{t−1}

(which is a difference equation) we may multiply by W_{t−k} and take expectations. This gives

γ_k = φ_1 γ_{k−1} + E(W_{t−k} e_t) − θ_1 E(W_{t−k} e_{t−1}).

In case k > 1 the RHS of (V.I.1-162) is zero thus

γ_k = φ_1 γ_{k−1} for k > 1.

If k = 0 or if k = 1 then

γ_0 = φ_1 γ_1 + σ_e² [1 − θ_1 (φ_1 − θ_1)]
γ_1 = φ_1 γ_0 − θ_1 σ_e².

Hence we obtain

γ_0 = σ_e² (1 + θ_1² − 2 φ_1 θ_1) / (1 − φ_1²).

The theoretical ACF is therefore

ρ_1 = (1 − φ_1 θ_1)(φ_1 − θ_1) / (1 + θ_1² − 2 φ_1 θ_1), and ρ_k = φ_1 ρ_{k−1} for k > 1.
The theoretical ACF and PACF patterns for the ARMA(1,1) are illustrated in figures (V.I.1-7), (V.I.1-8), and (V.I.1-9).
(figure V.I.1-7)

(figure V.I.1-8)

(figure V.I.1-9)

i. The ARMA(p,q) process
The general ARMA(p,q) can be defined by

φ(B) W_t = θ(B) e_t

or alternatively in MA(∞) notation

W_t = ψ(B) e_t with ψ(B) = φ(B)^{−1} θ(B)

or in AR(∞) notation

π(B) W_t = e_t

where π(B) = 1/ψ(B).
The stationarity conditions depend on the AR part: the roots of φ(B) = 0 must be larger than 1 in absolute value.
The invertibility conditions only depend on the MA part: the roots of θ(B) = 0 must also be larger than 1 in absolute value.
The theoretical ACF and PACF patterns are deduced from the so-called difference equations


k. Non stationary time series
Most economic (and also many other) time series do not satisfy the stationarity conditions stated earlier, for which
ARMA models have been derived. These time series are called non stationary and should be re-expressed
such that they become stationary with respect to the variance and the mean.
It is not suggested that the description of the following re-expression tools is exhaustive! They rather form a set of
tools which have proven useful in practice. It is quite evident that many extensions are possible with respect
to re-expression tools: these are discussed in the literature, such as in JENKINS (1976 and 1978), MILLS (1990),
MCLEOD (1983), etc.
Transformation of time series
If we write a time series as the sum of a deterministic mean and a disturbance term

X_t = μ_t + u_t

then a transformed series can be approximated by a first-order Taylor expansion

h(X_t) ≈ h(μ_t) + (X_t − μ_t) h′(μ_t)

where h is an arbitrary (differentiable) function.
This can be used to obtain the variance of the transformed series

var[h(X_t)] ≈ [h′(μ_t)]² σ_t²

which implies that the variance can be stabilized by imposing

h′(μ_t) ∝ 1/σ_t.

Accordingly, if the standard deviation of the series is proportional to the mean level

σ_t = c μ_t

then h′(μ_t) ∝ 1/μ_t, from which it follows that

h(X_t) = log(X_t).

In case the variance of the series is proportional to the mean level, then

σ_t² = c μ_t

from which it follows that

h(X_t) = √X_t.
With the use of a Standard Deviation / Mean Procedure (SMP) we are able to detect heteroskedasticity in
the time series. Moreover, with the help of the SMP, it is quite often possible to find an appropriate
transformation which will ensure that the time series is homoskedastic. In fact, it is assumed that there exists a
relationship between the mean level of the time series and the variance or standard deviation as in

σ_t = c μ_t^{1−λ}

which is an explicitly assumed relationship, in contrast to (V.I.1-194).
The SMP is generated by a three step process:

the time series is split into equal (chronological) segments;

for each segment the arithmetic mean and standard deviation are computed;

the mean and S.D. of each segment are plotted or regressed against each other.

By selecting the length of the segments equal to the seasonal period one ensures that the S.D. and the mean are
independent from the seasonal period.
In practice one of the following patterns will be recognized (as summarized in the graph). Note that the lambda
parameter should take a value of zero when a linearly proportional association between S.D. and the mean is observed.
The value of lambda is in fact the transformation parameter, which implies the following:

W_t = X_t^λ for λ ≠ 0, and W_t = log(X_t) for λ = 0.
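A rough sketch of the SMP in Python (our own implementation; it assumes the relationship σ ≈ c μ^{1−λ} stated above, so that the slope of a regression of log(S.D.) on log(mean) across segments estimates 1 − λ):

```python
import math

def smp_lambda(x, seg_len):
    # Split the series into equal chronological segments, compute mean and
    # S.D. per segment, and regress log(S.D.) on log(mean);
    # under sd = c * mean**(1 - lambda), lambda = 1 - slope.
    segs = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)]
    pts = []
    for s in segs:
        m = sum(s) / len(s)
        sd = math.sqrt(sum((v - m) ** 2 for v in s) / (len(s) - 1))
        pts.append((math.log(m), math.log(sd)))
    n = len(pts)
    mx = sum(a for a, _ in pts) / n
    my = sum(b for _, b in pts) / n
    slope = (sum((a - mx) * (b - my) for a, b in pts)
             / sum((a - mx) ** 2 for a, _ in pts))
    return 1.0 - slope

# S.D. exactly proportional to the mean -> slope 1 -> lambda = 0 (log transform)
series = [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0, 8.0, 4.0, 8.0, 12.0, 16.0]
print(abs(smp_lambda(series, 4)) < 1e-9)  # True
```

In practice the fitted λ is rounded to an interpretable value (0, 1/2, 1, ...) before choosing the transformation.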
(Figure V.I.1-10)
Differencing of time series
With the use of the Autocorrelation Function (ACF) (with autocorrelations on the y axis and the different time
lags on the x axis) it is possible to detect non-stationarity of the time series with respect to the mean level.

(figure V.I.1-11)
When the ACF of the time series is only slowly decreasing, this is an indication that the mean is not stationary. An
example of such an ACF is given in figure (V.I.1-11).
The differencing operator (nabla) is used to make the time series stationary.

l. Differencing (Nabla and B operator)
We have already used the backshift operator in previous sections. As we know, the backshift operator (B-
operator) transforms an observation of a time series into the previous one

B X_t = X_{t−1}.

Also, it can easily be shown that

B^k X_t = X_{t−k}.

Note that if we backshift k times seasonally (with seasonal period s) we get

B^{ks} X_t = X_{t−ks}.

The backshift operator can be useful in defining the nabla operator, which is

∇ X_t = (1 − B) X_t = X_t − X_{t−1}

or in general

∇^d X_t = (1 − B)^d X_t

which is sometimes also called the differencing operator.
As stated before, a time series which is not stationary with respect to the mean can be made stationary by
differencing. How can this be interpreted?

(figure V.I.1-12)
In figure (V.I.1-12) a function is displayed with two points on a graph (a, f(a)), and (b, f(b)).
Assume that a time series is generated by the function f(x). Then the derivative of the function gives the slope of
a line tangent with respect to the graph in every point of the function's domain.
The derivative of a function is defined as

f′(x) = lim_{h→0} [f(x + h) − f(x)] / h.

If we compute the slope of the chord in (figure V.I.1-12), this is in fact the same as the derivative of f(x) between a
and b with a discrete step instead of an infinitesimally small step.
This results in computing

[f(b) − f(a)] / (b − a) = [f(a + h) − f(a)] / h with h = b − a.
Although we have assumed the time series to be generated by f(x), in practice we only observe sample values at
discrete time intervals. Therefore the best approximation of f(x) between two known points (a, f(a)) and (b, f(b))
is a straight line with slope given by (V.I.1-211).
If this approximation is to be optimal, the distance between a and b should be as small as possible. Since the
smallest difference between observations of an equally spaced time series is the time lag itself, the smallest value of
h in eq. (V.I.1-211) is in fact equal to 1.
Therefore (V.I.1-211) reduces to

f(t) − f(t − 1) = ∇ f(t)

which is nothing else but the differencing operator.
We conclude that by differencing a time series we 'derive' the function by which it is generated, and therefore
reduce the degree of that function by 1. If, e.g., we had a time series generated by a quadratic function, we could
make it stationary by differencing the series twice.
Furthermore it should be noted that if a time series is non stationary, and must therefore be 'derived' to induce
stationarity, the series is often said to be generated by an integrated process. Now the ARMA models which
have been described before can be extended to the class of ARIMA (c.q. Autoregressive Integrated Moving
Average) models.
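The quadratic-trend example can be verified in a few lines (Python sketch):

```python
def nabla(x):
    # First difference: (nabla x)_t = x_t - x_{t-1}
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# A quadratic trend needs d = 2 differences to become constant
quad = [t * t for t in range(8)]   # 0, 1, 4, 9, 16, 25, 36, 49
once = nabla(quad)                 # 1, 3, 5, 7, ... (still trending linearly)
twice = nabla(once)                # constant
print(twice)  # [2, 2, 2, 2, 2, 2]
```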

m. The behavior of non stationary time series
In the previous subsections, non stationarity has been discussed at a rather intuitive level. Now we will discuss
some more fundamental properties of the behavior of non stationary time series.
Consider a time series that is generated by

ϕ(B) Z_t = θ(B) a_t

with ϕ(B) an AR operator which is not stationary: ϕ(B) has d roots equal to 1; all other roots lie outside the unit
circle. Thus eq. (V.I.1-213) can be written by factoring out the unit roots

φ(B) (1 − B)^d Z_t = φ(B) ∇^d Z_t = θ(B) a_t

where φ(B) is stationary.
In general a univariate stochastic process as in (V.I.1-214) is denoted an ARIMA(p,d,q) model, where p is the
autoregressive order, d is the number of non-seasonal differences, and q is the order of the moving average part.
Quite evidently, time series exhibiting non stationarity in both variance and mean are first to be transformed in
order to induce a stable variance, and then to be differenced to achieve stationarity with respect to the mean level.
The reason for this order is that power and logarithmic transformations are not always defined for negative (real)
numbers.
The ARIMA(p,d,q) model can be expanded by introducing deterministic d-order polynomial trends.
This is simply achieved by adding a parameter-constant to (V.I.1-214), expressed in terms of a (non-seasonal)
non-stationary time series Z_t

φ(B) ∇^d Z_t = c + θ(B) a_t.

The same properties can be achieved by writing (V.I.1-215) as an invertible ARMA process

φ(B) W_t = c + θ(B) a_t with W_t = ∇^d Z_t

where c is a parameter-constant. This is because

E(W_t) = c / (1 − φ_1 − φ_2 − ... − φ_p).

Also remark that the p AR parameters must not sum to unity, since this would, according to (V.I.1-217), imply (in
the limit) an infinite mean level, an obvious nonsense!
An ARIMA model can be generally written as a difference equation. For instance, the ARIMA(1,1,1) can be
formulated as

(1 − φ_1 B)(1 − B) Z_t = (1 − θ_1 B) a_t, i.e. Z_t = (1 + φ_1) Z_{t−1} − φ_1 Z_{t−2} + a_t − θ_1 a_{t−1}

which illustrates the postulated fact. This form of the ARIMA model is used for recursive forecasting purposes.
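Recursive forecasting with the ARIMA(1,1,1) difference-equation form can be sketched as follows (Python; future shocks are set to their zero expectation, and the function name and interface are our own):

```python
def arima111_forecast(z, a_last, phi1, theta1, steps):
    # Z_t = (1 + phi1) Z_{t-1} - phi1 Z_{t-2} + a_t - theta1 a_{t-1}
    # Forecasts replace future shocks a_t by their expectation, zero;
    # z holds at least the last two observed levels, a_last the last shock.
    z1, z2, a_prev = z[-1], z[-2], a_last
    out = []
    for _ in range(steps):
        zn = (1.0 + phi1) * z1 - phi1 * z2 - theta1 * a_prev
        out.append(zn)
        z2, z1, a_prev = z1, zn, 0.0
    return out

# With phi1 = theta1 = 0 this is a pure random walk: flat forecasts
print(arima111_forecast([10.0, 12.0], 0.0, 0.0, 0.0, 3))  # [12.0, 12.0, 12.0]
```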
The ARIMA model can also be generally written as a random shock model (c.q. a model in terms of the ψ-weights
and the white noise error components): since

Z_t = ψ(B) a_t and ϕ(B) Z_t = θ(B) a_t

it follows that

ϕ(B) ψ(B) = θ(B).

Hence, if j is larger than the maximum of (p + d − 1, q), it follows that the ψ-weights satisfy

ϕ(B) ψ_j = 0 (where B operates on the index j)

which implies that large-lagged ψ-weights are composed of polynomials, exponentials (damped), and sinusoids
(damped) with respect to the index j.
This form of the ARIMA model (c.q. eq. (V.I.1-219)) is used to compute the forecast confidence intervals.
A third way of writing an ARIMA model is the truncated random shock model form.
The parameter k may be interpreted as the time origin of the observable data. First, we observe that if Y_t′ is a
particular solution of (V.I.1-213), thus if

ϕ(B) Y_t′ = θ(B) a_t

then it follows from (V.I.1-213) and (V.I.1-223) that

ϕ(B) (Y_t − Y_t′) = 0.

Hence, the general solution of (V.I.1-213) is the sum of
Y_t′′ (c.q. a complementary function which is the solution of (V.I.1-224)) and Y_t′ (c.q. a particular integral which is
a particular solution of (V.I.1-213)).

and that the general solution of the homogeneous difference equation with respect to time origin k < t is given by




see also (V.I.1-227).

The general complementary function for


with D
described in

From (V.I.1-231) it can be concluded that the complementary function involves a mixture of:

(with ψ-weights of the random shock model form) satisfying the ARIMA model structure (where B operates on t,
not on k)

which can be easily proved on noting that

such that


Hence, if t - k > q eq. (V.I.1-233) is the particular integral of (V.I.1-234).

If in an extreme case k = −∞ then the truncated form becomes

Z_t = Σ_{j=0}^{∞} ψ_j a_{t−j}

called the nontruncated random shock form of the ARIMA model.

(compare this result with (V.I.1-237)).
Also remark that it is evident that


This implies that when using the complementary function for forecasting purposes, it is advisable to update the
forecast as new observations become available.

o. Unit root tests
There are d unit roots in a non-stationary time series (with respect to the mean) if φ(B) is stationary and θ(B) is
invertible in

φ(B) (1 − B)^d X_t = θ(B) e_t.

The most frequently used test for unit roots is the augmented Dickey-Fuller (ADF) regression

∇X_t = α + β t + γ X_{t−1} + Σ_{i=1}^{k} δ_i ∇X_{t−i} + e_t

where the null hypothesis of a unit root corresponds to γ = 0.
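A bare-bones version of the ADF regression (Python with numpy; our own helper, omitting the trend term and returning only the t-statistic of γ, which must be compared with Dickey-Fuller rather than Student-t critical values):

```python
import numpy as np

def adf_stat(x, lags=1):
    # Regress dx_t on [1, x_{t-1}, dx_{t-1}, ..., dx_{t-lags}] by OLS and
    # return the t-statistic of the coefficient on x_{t-1} (gamma).
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    y = dx[lags:]
    cols = [np.ones_like(y), x[lags:-1]]          # intercept, lagged level
    for i in range(1, lags + 1):
        cols.append(dx[lags - i:-i])              # lagged differences
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
e = rng.standard_normal(200)
# White noise is strongly rejected as a unit root; its cumulative sum is not
print(adf_stat(e, 1) < adf_stat(np.cumsum(e), 1))  # True
```

In applied work one would normally rely on a packaged implementation that also supplies the Dickey-Fuller critical values; the sketch only shows the mechanics of the regression.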
An example of the use of the ADF is the following LR-test



Some critical 95% values for this LR-test (K > 1) are: 7.24 (for T > 24), 6.73 (for T > 50), 6.49 (for T > 100), and
6.25 (for T > 120). It is also possible to perform an Engle-Granger cointegration test between the variables X_t and Y_t.
This test estimates the cointegrating regression in a first step