Introduction
Autocovariance Functions
In modeling a finite number of random variables, a covariance matrix is usually computed to summarize the dependence between these variables. For a time series {X_t}_{t=1}^∞, we need to model the dependence over an infinite number of random variables. The autocovariance and autocorrelation functions provide us a tool for this purpose.
Definition 1 (Autocovariance function). The autocovariance function of a time series {X_t} with Var(X_t) < ∞ is defined by

γ_X(s, t) = Cov(X_s, X_t) = E[(X_s − EX_s)(X_t − EX_t)].
[Figure: Plots of GDP and detrended log(GDP), 1990–2002.]
When t = s,

γ_X(s, t) = E(X_t²) = 1.25;

when t = s + 1,

γ_X(t, t + 1) = E[(ε_t + 0.5 ε_{t−1})(ε_{t+1} + 0.5 ε_t)] = 0.5;

and when |t − s| > 1,

γ_X(s, t) = 0.
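As a quick check on these values, here is a small simulation sketch, assuming (as in Example 1) X_t = ε_t + 0.5 ε_{t−1} with ε_t i.i.d. N(0, 1); the normal draws are just one convenient choice of white noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
eps = rng.standard_normal(n + 1)     # white noise WN(0, 1)
x = eps[1:] + 0.5 * eps[:-1]         # X_t = eps_t + 0.5 eps_{t-1}

def acov(x, h):
    """Sample autocovariance Cov(X_t, X_{t+h}) for a zero-mean series."""
    return np.mean(x[:len(x) - h] * x[h:])

print(acov(x, 0))  # ≈ 1.25
print(acov(x, 1))  # ≈ 0.5
print(acov(x, 2))  # ≈ 0.0
```

With a million observations, the sample autocovariances land within about ±0.01 of the theoretical values.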
With autocovariance functions, we can define covariance stationarity, or weak stationarity. In the literature, "stationarity" usually means weak stationarity, unless otherwise specified.
Definition 2 (Stationarity or weak stationarity). The time series {X_t, t ∈ Z} (where Z is the set of integers) is said to be stationary if

(I) E(X_t²) < ∞ for all t ∈ Z;
(II) EX_t = μ for all t ∈ Z;
(III) γ_X(s, t) = γ_X(s + h, t + h) for all s, t, h ∈ Z.
In other words, a stationary time series {X_t} must have three features: finite variance, a constant first moment, and a second moment γ_X(s, t) that depends only on (t − s) and not on s or t separately. In light of the last point, we can rewrite the autocovariance function of a stationary process as

γ_X(h) = Cov(X_t, X_{t+h}) for t, h ∈ Z.
Also, when X_t is stationary, we must have

γ_X(h) = γ_X(−h) and |γ_X(h)| ≤ γ_X(0).
Example 1 (continued): In Example 1, we see that E(X_t) = 0, E(X_t²) = 1.25, and the autocovariance function does not depend on s or t. Actually we have γ_X(0) = 1.25, γ_X(1) = 0.5, and γ_X(h) = 0 for h > 1. Therefore, {X_t} is a stationary process.
Example 2 (Random walk): Let S_t be a random walk, S_t = Σ_{s=0}^t X_s with S_0 = 0, where the X_t are independent and identically distributed with mean zero and variance σ². Then for h > 0,

γ_S(t, t + h) = Cov(S_t, S_{t+h}) = Cov( Σ_{i=1}^t X_i, Σ_{j=1}^{t+h} X_j ) = Var( Σ_{i=1}^t X_i ) = t σ²,

since Cov(X_i, X_j) = 0 for i ≠ j. In this case, the autocovariance function depends on time t; therefore the random walk process S_t is not stationary.
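The non-stationarity shows up directly in simulation: across many independent paths, the sample variance of S_t grows linearly in t. A sketch with σ² = 1 (the path count and horizon are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, T = 20_000, 100
steps = rng.standard_normal((n_paths, T))   # X_t ~ iid(0, 1)
S = np.cumsum(steps, axis=1)                # S_t = X_1 + ... + X_t

# Var(S_t) = t * sigma^2 grows with t, so S_t is not stationary.
print(S[:, 9].var())   # ≈ 10  (t = 10)
print(S[:, 99].var())  # ≈ 100 (t = 100)
```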
Example 3 (Process with linear trend): Let ε_t ~ iid(0, σ²) and

X_t = δt + ε_t.

Then E(X_t) = δt, which depends on t; therefore a process with a linear trend is not stationary.
Among stationary processes, there is a simple type of process that is widely used in constructing more complicated processes.
Example 4 (White noise): The time series ε_t is said to be a white noise with mean zero and variance σ², written as

ε_t ~ WN(0, σ²).

It is clear that a white noise process is stationary. Note that the white noise assumption is weaker than the independently and identically distributed assumption.
To tell whether a process is covariance stationary, we compute the unconditional first two moments; therefore, processes with conditional heteroskedasticity may still be stationary.
Example 5 (ARCH model): Let X_t = ε_t with E(ε_t) = 0 and E(ε_t ε_s) = 0 for t ≠ s. Assume the following process for ε_t²:

ε_t² = c + ρ ε²_{t−1} + u_t,

where u_t is a white noise. Then the conditional variance,

E_{t−1}(X_t²) = E_{t−1}(ε_t²) = E_{t−1}(c + ρ ε²_{t−1} + u_t) = c + ρ ε²_{t−1},

changes over time, while the unconditional variance is constant, E(X_t²) = c/(1 − ρ). Therefore, this process is covariance stationary despite its conditional heteroskedasticity.
Definition 3 (Strict stationarity). The time series {X_t, t ∈ Z} is said to be strictly stationary if the joint distribution of (X_{t1}, X_{t2}, ..., X_{tk}) is the same as that of (X_{t1+h}, X_{t2+h}, ..., X_{tk+h}) for any h and any (t1, ..., tk). In other words, strict stationarity means that the joint distribution is invariant to the shift h and does not depend on the times (t1, ..., tk) themselves.
Remarks: First note that finite variance is not assumed in the definition of strict stationarity; therefore, strict stationarity does not necessarily imply weak stationarity. For example, an i.i.d. Cauchy process is strictly stationary but not weakly stationary. Second, a nonlinear function of a strictly stationary variable is still strictly stationary, but this is not true for weak stationarity; for example, the square of a covariance stationary process may not have finite variance. Finally, weak stationarity usually does not imply strict stationarity, as higher moments of the process may depend on time t. However, if the process {X_t} is a Gaussian time series, which means that the distribution functions of {X_t} are all multivariate Gaussian, i.e. the joint density of

f_{X_t, X_{t+j1}, ..., X_{t+jk}}(x_t, x_{t+j1}, ..., x_{t+jk})

is Gaussian for any j1, j2, ..., jk, then weak stationarity also implies strict stationarity. This is because a multivariate Gaussian distribution is fully characterized by its first two moments.

Figure 2: Plots of S&P index and returns in year 1999 and 2001
For example, a white noise is stationary but may not be strictly stationary, while a Gaussian white noise is strictly stationary. Also, a general white noise only implies uncorrelatedness, while Gaussian white noise also implies independence, because for a Gaussian process uncorrelatedness implies independence. Therefore, a Gaussian white noise is just i.i.d. N(0, σ²).
Stationary and nonstationary processes are very different in their properties, and they require different inference procedures. We will discuss this in much detail throughout this course. At this point, note that a simple and useful way to tell whether a process is stationary in empirical studies is to plot the data. Loosely speaking, if a series does not seem to have a constant mean or variance, then very likely it is not stationary. For example, Figure 2 plots the daily S&P 500 index in years 1999 and 2001. The upper left panel plots the index in 1999, the upper right panel the returns in 1999, the lower left panel the index in 2001, and the lower right panel the returns in 2001. Note that the index levels are very different in 1999 and 2001. In 1999 the index wanders at a higher level and the market rises; in 2001 the level is much lower and the market drops. In comparison, we do not see much difference between the returns in 1999 and 2001 (although the returns in 2001 seem to have thicker tails). Indeed, judging only from the return data, it is very hard to tell which figure plots the market in a boom and which in a crash. Therefore, people usually treat stock price data as nonstationary and stock return data as stationary.
Ergodicity
Recall that Kolmogorov's law of large numbers (LLN) tells us that if X_i ~ i.i.d.(μ, σ²) for i = 1, ..., n, then we have the following limit for the ensemble average:

X̄_n = n^{−1} Σ_{i=1}^n X_i → μ.
In time series, we have a time series average, not an ensemble average. To explain the difference between the two, consider the following experiment. Suppose we want to track the movements of some particles and draw inference about their expected position (suppose that these particles move on the real line). If we have a group of particles (group size n), then we could track the position of each particle and plot the distribution of their positions. The mean of this sample is called the ensemble average. If all these particles are i.i.d., the LLN tells us that this average converges to its expectation as n → ∞. However, as we remarked earlier, with time series observations we only have one history. That means, in this experiment, we only have one particle. Then instead of collecting n particles, we can only track this single particle and record its position, say x_t, for t = 1, 2, ..., T. The mean computed by averaging over time, T^{−1} Σ_{t=1}^T x_t, is called the time series average.
Does the time series average converge to the same limit as the ensemble average? The answer is yes if X_t is stationary and ergodic. If X_t is stationary and ergodic with E(X_t) = μ, then the time series average has the same limit as the ensemble average:

X̄_T = T^{−1} Σ_{t=1}^T X_t → μ.
This result is given as the ergodic theorem, and we will discuss it later in lecture 4 on asymptotic theory. Note that this result requires both stationarity and ergodicity. We have explained stationarity, and we have seen that stationarity allows time series dependence. Ergodicity requires, on average, asymptotic independence. Note that stationarity itself does not guarantee ergodicity (page 47 in Hamilton and lecture 4).
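A minimal sketch of the ergodic theorem at work: a single realization of a stationary, ergodic AR(1) with mean μ, whose time series average approaches μ. The AR(1) form and parameter values here are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, phi, T = 3.0, 0.8, 200_000
x = np.empty(T)
x[0] = mu
# Stationary, ergodic AR(1) around mean mu: (x_t - mu) = phi (x_{t-1} - mu) + eps_t
for t in range(1, T):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.standard_normal()

print(x.mean())  # time series average ≈ mu = 3.0
```

One long history is enough here; averaging over time recovers the same limit that an ensemble average across many independent copies would give.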
Readings:
Hamilton, Ch. 3.1
Brockwell and Davis, Page 1-29
Hayashi, Page 97-102
ARMA Process
As we have remarked, dependence is very common in time series observations. To model this time series dependence, we start with univariate ARMA models. To motivate the model, we can follow two lines of thinking. First, for a series x_t, we can model the level of its current observation as depending on the level of its lagged observations. For example, if we observe a high GDP realization this quarter, we would expect that GDP in the next few quarters is good as well. This line of thinking is represented by an AR model. The AR(1) (autoregressive of order one) model can be written as:
x_t = φ x_{t−1} + ε_t,

where ε_t ~ WN(0, σ²); we keep this assumption throughout this lecture. Similarly, the AR(p) (autoregressive of order p) model can be written as:

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + ε_t.
In a second line of thinking, we can model the observation of a random variable at time t as affected not only by the shock at time t, but also by shocks that took place before time t. For example, if we observe a negative shock to the economy, say a catastrophic earthquake, then we would expect this negative effect to affect the economy not only at the time it takes place, but also in the near future. This kind of thinking is represented by an MA model. The MA(1) (moving average of order one) and MA(q) (moving average of order q) models can be written as

x_t = ε_t + θ ε_{t−1}

and

x_t = ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}.

Combining the two, the ARMA(p, q) model can be written as

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ... + φ_p x_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}.
The ARMA model provides one of the basic tools in time series modeling. In the next few sections, we will discuss how to draw inferences using a univariate ARMA model.
Lag Operators
Lag operators enable us to represent an ARMA model much more concisely. Applying the lag operator (denoted L) once moves the index back one time unit, and applying it k times moves the index back k units:

L x_t = x_{t−1}
L² x_t = x_{t−2}
...
L^k x_t = x_{t−k}.

With lag operators, the models above become

AR(1): (1 − φL) x_t = ε_t
AR(p): (1 − φ_1 L − φ_2 L² − ... − φ_p L^p) x_t = ε_t
MA(1): x_t = (1 + θL) ε_t
MA(q): x_t = (1 + θ_1 L + θ_2 L² + ... + θ_q L^q) ε_t
Let

φ(L) = 1 − φ_1 L − φ_2 L² − ... − φ_p L^p,
θ(L) = 1 + θ_1 L + θ_2 L² + ... + θ_q L^q.

With lag polynomials, we can rewrite an ARMA process in a more compact way:

AR: φ(L) x_t = ε_t
MA: x_t = θ(L) ε_t
ARMA: φ(L) x_t = θ(L) ε_t
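The compact form φ(L) x_t = θ(L) ε_t also suggests a direct way to simulate an ARMA process: solve the recursion x_t = φ_1 x_{t−1} + ... + φ_p x_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}. A small sketch; the helper `arma_filter` and the coefficient values are illustrative, not from the text:

```python
import numpy as np

def arma_filter(phi, theta, eps):
    """Solve phi(L) x_t = theta(L) eps_t for x_t recursively,
    i.e. apply the rational lag polynomial theta(L)/phi(L) to eps.
    phi = [phi_1, ..., phi_p], theta = [theta_1, ..., theta_q]."""
    x = np.zeros_like(eps)
    for t in range(len(eps)):
        ar = sum(phi[j] * x[t - 1 - j] for j in range(len(phi)) if t - 1 - j >= 0)
        ma = eps[t] + sum(theta[j] * eps[t - 1 - j] for j in range(len(theta)) if t - 1 - j >= 0)
        x[t] = ar + ma
    return x

rng = np.random.default_rng(3)
eps = rng.standard_normal(5)
x = arma_filter([0.5], [0.4], eps)   # (1 - 0.5L) x_t = (1 + 0.4L) eps_t
assert np.isclose(x[1], 0.5 * x[0] + eps[1] + 0.4 * eps[0])
```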
Invertibility
Given a time series probability model, we can usually find multiple ways to represent it. Which representation to choose depends on our problem. For example, to study impulse-response functions (section 4), MA representations may be more convenient, while to estimate an ARMA model, AR representations may be more convenient, since usually x_t is observable while ε_t is not. However, not all ARMA processes can be inverted. In this section, we consider under what conditions we can invert an AR model to an MA model, and an MA model to an AR model. It turns out that invertibility, which means that the process can be inverted, is an important property of the model.
If we let 1 denote the identity operator, i.e., 1 y_t = y_t, then the inverse operator (1 − φL)^{−1} is defined to be the operator such that

(1 − φL)^{−1} (1 − φL) = 1.

Given the AR(1) process (1 − φL) x_t = ε_t, can we then write x_t = (1 − φL)^{−1} ε_t? Consider the candidate ψ(L) = 1 + φL + φ²L² + ... . We have

(1 − φL) ψ(L) = (1 − φL)(1 + φL + φ²L² + ...)
             = 1 + φL + φ²L² + ... − φL − φ²L² − φ³L³ − ...
             = 1 − lim_{k→∞} φ^k L^k
             = 1, for |φ| < 1.

Equivalently, substituting recursively,

x_t = φ x_{t−1} + ε_t
    = φ² x_{t−2} + φ ε_{t−1} + ε_t
    ...
    = φ^k x_{t−k} + Σ_{j=0}^{k−1} φ^j ε_{t−j}.

With |φ| < 1, we have lim_{k→∞} φ^k x_{t−k} = 0, so again we get the moving average representation with MA coefficients ψ_k = φ^k. So the condition |φ| < 1 enables us to invert an AR(1) process to an MA(∞) process,

AR(1): (1 − φL) x_t = ε_t
MA(∞): x_t = ψ(L) ε_t,

with ψ_k = φ^k.
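One way to see ψ_k = φ^k concretely is to trace a unit shock through the AR(1) recursion; φ = 0.7 here is just an illustrative value:

```python
# Impulse response of the AR(1) x_t = phi x_{t-1} + eps_t:
# feed in a unit shock at t = 0 and no further shocks.
phi = 0.7
psi = [1.0]                      # psi_0 = 1
for k in range(1, 6):
    psi.append(phi * psi[-1])    # x_t = phi x_{t-1} once shocks stop

# The MA(inf) coefficients equal phi^k:
print([round(p, 5) for p in psi])  # [1.0, 0.7, 0.49, 0.343, 0.2401, 0.16807]
```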
We have obtained some nice results in inverting an AR(1) process to an MA(∞) process. Then how do we invert a general AR(p) process? We need to factorize the lag polynomial and then make use of the result that (1 − λL)^{−1} = ψ(L) with ψ_k = λ^k. For example, let p = 2. We have

(1 − φ_1 L − φ_2 L²) x_t = ε_t    (1)

and we factorize

1 − φ_1 L − φ_2 L² = (1 − λ_1 L)(1 − λ_2 L),

where λ_1 and λ_2 are chosen such that λ_1 + λ_2 = φ_1 and λ_1 λ_2 = −φ_2.
Given that both |λ_1| < 1 and |λ_2| < 1 (or, when they are complex numbers, that they lie within the unit circle; keep this in mind, as I may not mention it again in the remainder of the lecture), we can write

(1 − λ_1 L)^{−1} = ψ_1(L), (1 − λ_2 L)^{−1} = ψ_2(L),

and therefore

x_t = (1 − λ_1 L)^{−1} (1 − λ_2 L)^{−1} ε_t = ψ_1(L) ψ_2(L) ε_t.
Solving for ψ_1(L) ψ_2(L) is straightforward:

ψ_1(L) ψ_2(L) = (1 + λ_1 L + λ_1² L² + ...)(1 + λ_2 L + λ_2² L² + ...)
             = 1 + (λ_1 + λ_2) L + (λ_1² + λ_1 λ_2 + λ_2²) L² + ...
             = Σ_{k=0}^∞ ( Σ_{j=0}^k λ_1^j λ_2^{k−j} ) L^k
             = ψ(L), say,

with ψ_k = Σ_{j=0}^k λ_1^j λ_2^{k−j}. Similarly, we can invert a general AR(p) process, given that all of the λ_i have absolute value less than one. An alternative way to represent this MA process (to express ψ) is to make use of partial fractions. Let c_1, c_2 be two constants whose values are determined by
1/[(1 − λ_1 L)(1 − λ_2 L)] = c_1/(1 − λ_1 L) + c_2/(1 − λ_2 L)
                           = [c_1 (1 − λ_2 L) + c_2 (1 − λ_1 L)] / [(1 − λ_1 L)(1 − λ_2 L)].

We must have

1 = c_1 (1 − λ_2 L) + c_2 (1 − λ_1 L) = (c_1 + c_2) − (c_1 λ_2 + c_2 λ_1) L,

which gives

c_1 + c_2 = 1 and c_1 λ_2 + c_2 λ_1 = 0.

Solving these two equations, we obtain

c_1 = λ_1/(λ_1 − λ_2), c_2 = −λ_2/(λ_1 − λ_2).

Therefore,

x_t = [(1 − λ_1 L)(1 − λ_2 L)]^{−1} ε_t
    = [c_1 (1 − λ_1 L)^{−1} + c_2 (1 − λ_2 L)^{−1}] ε_t
    = c_1 Σ_{k=0}^∞ λ_1^k ε_{t−k} + c_2 Σ_{k=0}^∞ λ_2^k ε_{t−k}
    = Σ_{k=0}^∞ ψ_k ε_{t−k},

where ψ_k = c_1 λ_1^k + c_2 λ_2^k.
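The two expressions for ψ_k, the convolution sum and the partial-fraction form, can be checked against each other numerically; λ_1 = 0.84 and λ_2 = −0.24 are the rounded roots used in the AR(2) example in this lecture:

```python
# Two expressions for the MA(inf) coefficients of an AR(2) with
# factorization (1 - l1 L)(1 - l2 L): a convolution sum and a
# partial-fraction form c1*l1^k + c2*l2^k. They should agree.
l1, l2 = 0.84, -0.24
c1 = l1 / (l1 - l2)
c2 = -l2 / (l1 - l2)

for k in range(8):
    conv = sum(l1**j * l2**(k - j) for j in range(k + 1))
    pf = c1 * l1**k + c2 * l2**k
    assert abs(conv - pf) < 1e-12
print("partial fractions match the convolution formula")
```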
Similarly, an MA process,

x_t = θ(L) ε_t,

is invertible if θ(L)^{−1} exists. An MA(1) process is invertible if |θ| < 1, and an MA(q) process is invertible if all roots of

1 + θ_1 z + θ_2 z² + ... + θ_q z^q = 0

lie outside the unit circle. Note that for any invertible MA process, we can find a noninvertible MA process that is the same as the invertible process up to the second moments; the converse is also true. We will give an example in section 5.
Finally, given an invertible ARMA(p, q) process,

φ(L) x_t = θ(L) ε_t,

we can invert it to

x_t = φ(L)^{−1} θ(L) ε_t = ψ(L) ε_t.

Then what is the series ψ_k? Matching coefficients in ψ(L) φ(L) ε_t = θ(L) ε_t gives the answer. Take the ARMA(1, 1) process (1 − φL) x_t = (1 + θL) ε_t as an example:

1 + θL = (1 − φL)(ψ_0 + ψ_1 L + ψ_2 L² + ...)
       = ψ_0 + (ψ_1 − φ ψ_0) L + (ψ_2 − φ ψ_1) L² + ...,

so matching coefficients on each power of L gives

ψ_0 = 1, ψ_1 = φ + θ, and ψ_j = φ ψ_{j−1} for j ≥ 2,

i.e., ψ_j = φ^{j−1}(φ + θ) for j ≥ 1.
Impulse-Response Functions

Given an ARMA model, φ(L) x_t = θ(L) ε_t, it is natural to ask: what is the effect on x_t of a unit shock at time s (for s < t)?
4.1 MA process

Consider first an MA(1) process, x_t = ε_t + θ ε_{t−1}. A unit shock at a single date, with no other shocks, produces

ε: 0 1 0 0 0
x: 0 1 θ 0 0

More generally, for an MA(q) process, x_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ... + θ_q ε_{t−q},

ε: 0 1 0 0 ... 0 0
x: 0 1 θ_1 θ_2 ... θ_q 0

The left panel in Figure 1 plots the impulse-response function of an MA(3) process. Similarly, we can write down the effects for an MA(∞) process. As you can see, we can read the impulse-response function immediately off an MA process.
4.2 AR process

As can be seen from above, the impulse-response dynamics are quite clear from an MA representation. For example, let t > s > 0; given a one unit increase in ε_s, the effect on x_t is ψ_{t−s}, if there are no other shocks. If there are shocks that take place at times other than s and have nonzero effects on x_t, then we can add these effects, since this is a linear model.

The dynamics are a bit more complicated for a higher order AR process, but applying our old trick of inverting it to an MA process makes the analysis straightforward. Take an AR(2) process as an example.
Example 2: Consider the AR(2) process

x_t = 0.6 x_{t−1} + 0.2 x_{t−2} + ε_t,

or

(1 − 0.6L − 0.2L²) x_t = ε_t.

The λ_i in the factorization can be found from the quadratic formula, λ = [−b ± sqrt(b² − 4ac)]/(2a), or equivalently as λ_i = 1/y_i, where the y_i are the roots of 1 − 0.6L − 0.2L² = 0. This gives

(1 − 0.6L − 0.2L²) x_t = (1 − 0.84L)(1 + 0.24L) x_t = ε_t,

so

x_t = (1 − 0.84L)^{−1} (1 + 0.24L)^{−1} ε_t = ψ(L) ε_t,

where ψ_k = Σ_{j=0}^k λ_1^j λ_2^{k−j} with λ_1 = 0.84 and λ_2 = −0.24. In this example, the effects of ε on x can be described as:

ε: 0 1 0 0 0 ...
x: 0 1 0.6 0.5616 0.4579 ...
The right panel in Figure 1 plots this impulse-response function. So after we invert an AR(p) process to an MA process, given t > s > 0, the effect of a one unit increase in ε_s on x_t is just ψ_{t−s}. We can see that, given a linear process, AR or ARMA, once we represent it as an MA process, we find the impulse-response dynamics immediately. In fact, the MA representation is the same thing as the impulse-response function.
[Figure 1: Impulse-response functions of the MA(3) process (left panel) and the AR(2) process (right panel).]

5 Autocovariance Functions of ARMA Processes

5.1 MA(1)
Consider the MA(1) process

x_t = ε_t + θ ε_{t−1}, where ε_t ~ WN(0, σ²).

Its mean is fixed, E(x_t) = 0, its variance is

E(x_t²) = (1 + θ²) σ²,

and its autocovariances are

γ_x(t, t + h) = E[(ε_t + θ ε_{t−1})(ε_{t+h} + θ ε_{t+h−1})] = θσ² for h = 1, and 0 for h > 1.

So for an MA(1) process we have a fixed mean and a covariance function that does not depend on time t: γ(0) = (1 + θ²)σ², γ(1) = θσ², and γ(h) = 0 for h > 1. So we know MA(1) is stationary for any finite value of θ.
The autocorrelation can be computed as ρ_x(h) = γ_x(h)/γ_x(0), so

ρ_x(0) = 1, ρ_x(1) = θ/(1 + θ²), ρ_x(h) = 0 for h > 1.
Now consider the two MA(1) processes

x_t = ε_t + θ ε_{t−1}, ε_t ~ WN(0, σ²), |θ| > 1,

and

x̃_t = ε̃_t + (1/θ) ε̃_{t−1}, ε̃_t ~ WN(0, θ²σ²).

Then we can compute that E(x_t) = E(x̃_t) = 0, E(x_t²) = E(x̃_t²) = (1 + θ²)σ², γ_x(1) = γ_x̃(1) = θσ², and γ_x(h) = γ_x̃(h) = 0 for h > 1. Therefore, these two processes are equivalent up to the second moments. To be more concrete, we plug in some numbers. Let θ = 2, so the process

x_t = ε_t + 2 ε_{t−1}, ε_t ~ WN(0, 1),

is equivalent up to the second moments to

x̃_t = ε̃_t + 0.5 ε̃_{t−1}, ε̃_t ~ WN(0, 4).

Note that E(x_t) = E(x̃_t) = 0, E(x_t²) = E(x̃_t²) = 5, γ_x(1) = γ_x̃(1) = 2, and γ_x(h) = γ_x̃(h) = 0 for h > 1.

Although these two representations, the noninvertible MA and the invertible MA, generate the same process up to the second moments, we prefer the invertible representation in practice: if we can invert an MA process to an AR process, we can recover the value of ε_t (non-observable) from all past values of x (observable). If a process is noninvertible, then, in order to recover ε_t, we would have to know all future values of x.
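A simulation sketch of this equivalence; Gaussian draws are just one convenient choice of white noise:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

eps = rng.standard_normal(n + 1)           # WN(0, 1)
x = eps[1:] + 2.0 * eps[:-1]               # noninvertible: theta = 2

eps_t = 2.0 * rng.standard_normal(n + 1)   # WN(0, 4)
x_inv = eps_t[1:] + 0.5 * eps_t[:-1]       # invertible: theta = 1/2

for s in (x, x_inv):
    g0 = np.mean(s * s)                    # ≈ 5
    g1 = np.mean(s[:-1] * s[1:])           # ≈ 2
    assert abs(g0 - 5.0) < 0.1 and abs(g1 - 2.0) < 0.1
print("both representations share gamma(0) = 5 and gamma(1) = 2")
```

Only the second moments match, of course; the two processes differ in how ε_t can be recovered from the observed x's.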
5.2 MA(q)

For an MA(q) process,

x_t = θ(L) ε_t = Σ_{k=0}^q θ_k L^k ε_t, with θ_0 = 1,

the autocovariances are

γ_x(0) = σ² Σ_{k=0}^q θ_k²

and

γ_x(h) = σ² Σ_{k=0}^{q−h} θ_k θ_{k+h} for h = 1, 2, ..., q, and γ_x(h) = 0 for h > q.
5.3 MA(∞)

For an MA(∞) process,

x_t = ψ(L) ε_t = Σ_{k=0}^∞ ψ_k L^k ε_t,

before we compute moments and discuss the stationarity of x_t, we should first make sure that {x_t} converges.

Proposition 1. If {ε_t} is a sequence of white noise with σ² < ∞, and if Σ_{k=0}^∞ ψ_k² < ∞, then the series

x_t = ψ(L) ε_t = Σ_{k=0}^∞ ψ_k ε_{t−k}

converges in mean square.
Proof (see Appendix 3.A in Hamilton): Recall the Cauchy criterion: a sequence {y_n} converges in mean square if and only if ||y_n − y_m|| → 0 as n, m → ∞. In this problem, for n > m > 0, we want to show that

E[ Σ_{k=1}^n ψ_k ε_{t−k} − Σ_{k=1}^m ψ_k ε_{t−k} ]² = σ² Σ_{k=m+1}^n ψ_k² = σ² [ Σ_{k=0}^n ψ_k² − Σ_{k=0}^m ψ_k² ] → 0 as m, n → ∞.

The result holds since {ψ_k} is square summable. It is often more convenient to work with a slightly stronger condition, absolute summability:

Σ_{k=0}^∞ |ψ_k| < ∞.
It is easy to show that absolute summability implies square summability. An MA(∞) process with absolutely summable coefficients is stationary with moments

E(x_t) = 0,
E(x_t²) = σ² Σ_{k=0}^∞ ψ_k²,
γ_x(h) = σ² Σ_{k=0}^∞ ψ_k ψ_{k+h}.
5.4 AR(1)

Consider the AR(1) process

(1 − φL) x_t = ε_t.    (2)

Recall that an AR(1) process with |φ| < 1 can be inverted to an MA(∞) process

x_t = ψ(L) ε_t, with ψ_k = φ^k,

whose coefficients are absolutely summable:

Σ_{k=0}^∞ |ψ_k| = Σ_{k=0}^∞ |φ|^k = 1/(1 − |φ|) < ∞.

Using the results for MA(∞), the moments for x_t in (2) can be computed:

E(x_t) = 0,
E(x_t²) = Σ_{k=0}^∞ φ^{2k} σ² = σ²/(1 − φ²),
γ_x(h) = Σ_{k=0}^∞ φ^k φ^{k+h} σ² = φ^h σ²/(1 − φ²).
5.5 AR(p)

For an AR(p) process,

(1 − φ_1 L − φ_2 L² − ... − φ_p L^p) x_t = ε_t,

we can invert to an MA(∞) process given that all of the λ_i in the factorization

1 − φ_1 L − φ_2 L² − ... − φ_p L^p = (1 − λ_1 L)(1 − λ_2 L) ... (1 − λ_p L)    (3)

have absolute value less than one. It also turns out that with |λ_i| < 1, the absolute summability Σ_{k=0}^∞ |ψ_k| < ∞ is also satisfied. (The proof can be found on page 770 of Hamilton; it uses the result that ψ_k = c_1 λ_1^k + c_2 λ_2^k.) Equivalently, writing the lag polynomial in terms of its roots,

(L − y_1)(L − y_2) ... (L − y_p) = 0,    (4)

the requirement that |λ_i| < 1 is equivalent to all roots in (4) lying outside the unit circle, i.e., |y_i| > 1 for all i.
First, the expectation is E(x_t) = 0. To compute the second moments, one method is to invert the process into an MA process and use the autocovariance formula for MA(∞). This method requires finding the moving average coefficients ψ, and an alternative method known as the Yule-Walker method may be more convenient for finding the autocovariance functions. To illustrate this method, take an AR(2) process as an example:

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + ε_t.

Multiply both sides of the equation by x_t, x_{t−1}, x_{t−2}, ..., take expectations, and then divide by γ(0) to get the following equations:

1 = φ_1 ρ(1) + φ_2 ρ(2) + σ²/γ(0)
ρ(1) = φ_1 + φ_2 ρ(1)
ρ(2) = φ_1 ρ(1) + φ_2
ρ(k) = φ_1 ρ(k − 1) + φ_2 ρ(k − 2) for k ≥ 3.

ρ(1) can first be solved from the second equation: ρ(1) = φ_1/(1 − φ_2); ρ(2) can then be solved from the third equation; ρ(k) can be solved recursively using ρ(1) and ρ(2); and finally, γ(0) can be solved from the first equation. Using γ(0) and ρ(k), γ(k) can be computed as γ(k) = ρ(k) γ(0). Figure 2 plots this autocorrelation for k = 0, ..., 50 with the parameters set to φ_1 = 0.5 and φ_2 = 0.3. As is clear from the graph, the autocorrelation is very close to zero when k > 40.
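The recursion is easy to carry out numerically; a sketch with the same parameters φ_1 = 0.5 and φ_2 = 0.3:

```python
# Yule-Walker recursions for the AR(2) x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + eps_t
phi1, phi2 = 0.5, 0.3
rho = [1.0, phi1 / (1 - phi2)]            # rho(0) = 1, rho(1) = phi1/(1 - phi2)
for k in range(2, 51):
    rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])

print(round(rho[1], 4))    # 0.7143
print(round(rho[2], 4))    # 0.6571
print(rho[45] < 0.01)      # autocorrelation nearly zero for large k
```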
[Figure 2: Autocorrelation ρ(k), k = 0, ..., 50, of the AR(2) process with φ_1 = 0.5 and φ_2 = 0.3.]
5.6 ARMA(p, q)

For an ARMA process, φ(L) x_t = θ(L) ε_t, we can compute the autocovariances by first inverting it to an MA process. Take the ARMA(1, 1) process

(1 − φL) x_t = (1 + θL) ε_t

as an example. We invert it to

x_t = ψ(L) ε_t = Σ_{j=0}^∞ ψ_j ε_{t−j},

where ψ_0 = 1 and ψ_j = φ^{j−1}(φ + θ) for j ≥ 1. Using the results for the MA(∞) process, we have

γ_x(0) = σ² Σ_{k=0}^∞ ψ_k² = σ² [ 1 + (φ + θ)² Σ_{k=1}^∞ φ^{2(k−1)} ] = σ² [ 1 + (φ + θ)²/(1 − φ²) ].
If we plug in some numbers, say φ = 0.5 and θ = 0.5, so that the original process is x_t = 0.5 x_{t−1} + ε_t + 0.5 ε_{t−1}, then γ_x(0) = (7/3) σ². For h ≥ 1,

γ_x(h) = σ² Σ_{k=0}^∞ ψ_k ψ_{k+h}
       = σ² [ φ^{h−1}(φ + θ) + Σ_{k=1}^∞ φ^{k−1}(φ + θ) φ^{k+h−1}(φ + θ) ]
       = σ² [ φ^{h−1}(φ + θ) + φ^h (φ + θ)²/(1 − φ²) ].

Plugging in φ = θ = 0.5 and h = 1, γ_x(1) = (5/3) σ².
An alternative method is to multiply both sides of (1 − φL) x_t = (1 + θL) ε_t by x_t, x_{t−1}, ..., and take expectations:

γ_x(0) − φ γ_x(1) = σ² [1 + θ(φ + θ)]
γ_x(1) − φ γ_x(0) = θ σ²
γ_x(2) − φ γ_x(1) = 0
...
γ_x(h) − φ γ_x(h − 1) = 0 for h > 2,

where we use x_t = ψ(L) ε_t in taking expectations on the right side; for instance, E(x_t ε_t) = E[(ε_t + ψ_1 ε_{t−1} + ...) ε_t] = σ². Plugging in φ = θ = 0.5 and solving these equations, we get γ_x(0) = (7/3) σ², γ_x(1) = (5/3) σ², and γ_x(h) = γ_x(h − 1)/2 for h ≥ 2. This is the same result as we got using the first method.
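Both methods can be checked by truncating the ψ-weight sums numerically; 200 terms is far more than needed, since ψ_j decays like 0.5^j:

```python
# ARMA(1,1): (1 - 0.5L) x_t = (1 + 0.5L) eps_t, sigma^2 = 1.
# MA(inf) weights: psi_0 = 1, psi_j = phi^(j-1) (phi + theta) for j >= 1.
phi, theta = 0.5, 0.5
psi = [1.0] + [phi ** (j - 1) * (phi + theta) for j in range(1, 200)]

g0 = sum(p * p for p in psi)                          # gamma(0)
g1 = sum(psi[k] * psi[k + 1] for k in range(199))     # gamma(1)
print(round(g0, 6))  # 2.333333  (= 7/3)
print(round(g1, 6))  # 1.666667  (= 5/3)
```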
Summary: An MA process is stationary if and only if the coefficients {ψ_k} are square summable (or, under the stronger condition, absolutely summable), i.e., Σ_{k=0}^∞ ψ_k² < ∞ or Σ_{k=0}^∞ |ψ_k| < ∞. Therefore, an MA process with a finite number of MA coefficients is always stationary. Note that stationarity does not require the MA process to be invertible. An AR process is stationary if it is invertible, i.e. |λ_i| < 1 or |y_i| > 1, as defined in (3) and (4) respectively. An ARMA(p, q) process is stationary if its autoregressive lag polynomial is invertible.
5.7 Autocovariance-Generating Functions

For a covariance stationary process, we have seen that the autocovariance function is very useful in describing the process. One way to summarize absolutely summable autocovariance functions (Σ_{h=−∞}^∞ |γ(h)| < ∞) is the autocovariance-generating function:

g_x(z) = Σ_{h=−∞}^∞ γ(h) z^h.
For an MA(1) process, for example,

g_x(z) = σ² [θ z^{−1} + (1 + θ²) + θ z] = σ² (1 + θz)(1 + θ z^{−1}).
For an MA(q) process, using γ_x(0) = σ² Σ_{k=0}^q θ_k² and γ_x(h) = σ² Σ_{k=0}^{q−h} θ_k θ_{k+h} for h = 1, ..., q,

g_x(z) = Σ_{h=−∞}^∞ γ(h) z^h
       = σ² [ Σ_{k=0}^q θ_k² + Σ_{h=1}^q Σ_{k=0}^{q−h} θ_k θ_{k+h} (z^h + z^{−h}) ]
       = σ² ( Σ_{k=0}^q θ_k z^k ) ( Σ_{k=0}^q θ_k z^{−k} )
       = σ² θ(z) θ(z^{−1}).
For an MA(∞) process x_t = ψ(L) ε_t with Σ_{k=0}^∞ |ψ_k| < ∞, we can naturally let q be replaced by ∞ in the AGF for MA(q) to get the AGF for MA(∞):

g_x(z) = σ² ( Σ_{k=0}^∞ ψ_k z^k ) ( Σ_{k=0}^∞ ψ_k z^{−k} ) = σ² ψ(z) ψ(z^{−1}).
Next, for a stationary AR or ARMA process, we can invert it to an MA process and apply this result. For instance, for an AR(1) process, (1 − φL) x_t = ε_t, invert it to

x_t = ψ(L) ε_t with ψ_k = φ^k,

so that

g_x(z) = σ² ψ(z) ψ(z^{−1}) = σ² / [(1 − φz)(1 − φz^{−1})].

More generally, for an ARMA(p, q) process φ(L) x_t = θ(L) ε_t,

g_x(z) = σ² θ(z) θ(z^{−1}) / [φ(z) φ(z^{−1})]
       = σ² (1 + θ_1 z + ... + θ_q z^q)(1 + θ_1 z^{−1} + ... + θ_q z^{−q}) / [(1 − φ_1 z − ... − φ_p z^p)(1 − φ_1 z^{−1} − ... − φ_p z^{−p})].
In this section, we plot a few simulated ARMA processes. In the simulations, the errors are Gaussian white noise, i.i.d. N(0, 1). As a comparison, we first plot a Gaussian white noise (or AR(1) with φ = 0) in Figure 3. Then we plot AR(1) processes with φ = 0.4 and φ = 0.9 in Figure 4 and Figure 5. As you can see, the white noise process is very choppy and patternless. When φ = 0.4 it becomes a bit smoother, and when φ = 0.9 the departures from the mean (zero) are very prolonged. Figure 6 plots an AR(2) process whose coefficients are set to the numbers in our example in this lecture. Finally, Figure 7 plots an MA(3) process. Comparing this MA(3) process with the white noise, we can see an increase in volatility (the volatility of the white noise is 1 and that of the MA(3) process is 1.77).
[Figures 3 through 7: Simulated sample paths of 200 observations each: Gaussian white noise; AR(1) with φ = 0.4; AR(1) with φ = 0.9; AR(2) with φ_1 = 0.6 and φ_2 = 0.2; and MA(3).]
7 Forecasting

7.1 The best forecast

If we are interested in forecasting a random variable y_{t+h} based on the observations of x up to time t (denoted by X), we can have different candidates, denoted by g(X). If our criterion for picking the best forecast is to minimize the mean squared error (MSE), then the best forecast is the conditional expectation, g(X) = E_X(y_{t+h}). The proof can be found on page 73 in Hamilton. In the following discussion, we assume that the data generating process is known (so the parameters are known), so we can compute the conditional moments.
7.2 AR models

Consider the AR(1) process

x_t = φ x_{t−1} + ε_t,

where we continue to assume that ε_t is a white noise with mean zero and variance σ². Then we can compute

E_t(x_{t+1}) = E_t(φ x_t + ε_{t+1}) = φ x_t
E_t(x_{t+2}) = E_t(φ² x_t + φ ε_{t+1} + ε_{t+2}) = φ² x_t
... = ...
E_t(x_{t+k}) = E_t(φ^k x_t + φ^{k−1} ε_{t+1} + ... + ε_{t+k}) = φ^k x_t

and

Var_t(x_{t+1}) = σ²
Var_t(x_{t+2}) = Var_t(φ² x_t + φ ε_{t+1} + ε_{t+2}) = (1 + φ²) σ²
... = ...
Var_t(x_{t+k}) = Var_t(φ^k x_t + φ^{k−1} ε_{t+1} + ... + ε_{t+k}) = Σ_{j=0}^{k−1} φ^{2j} σ².

Note that as k → ∞,

E_t(x_{t+k}) → 0 and Var_t(x_{t+k}) → σ²/(1 − φ²),

the unconditional mean and variance.
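A sketch of these forecast formulas in action; φ = 0.9, σ² = 1, and x_t = 2 are illustrative values:

```python
# k-step-ahead forecasts for an AR(1) x_t = phi x_{t-1} + eps_t:
# E_t(x_{t+k}) = phi^k x_t, Var_t(x_{t+k}) = sum_{j<k} phi^(2j) sigma^2.
phi, sigma2, x_t = 0.9, 1.0, 2.0

for k in (1, 5, 50):
    mean_k = phi ** k * x_t
    var_k = sum(phi ** (2 * j) for j in range(k)) * sigma2
    print(k, round(mean_k, 4), round(var_k, 4))

# As k grows, the forecast mean tends to 0 and the forecast variance to
# the unconditional variance sigma^2 / (1 - phi^2) = 1 / 0.19.
```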
7.3 MA models

Consider an MA(1) process, x_t = ε_t + θ ε_{t−1}. If we know ε_t, then

E_t(x_{t+1}) = E_t(ε_{t+1} + θ ε_t) = θ ε_t
E_t(x_{t+2}) = E_t(ε_{t+2} + θ ε_{t+1}) = 0
... = ...
E_t(x_{t+k}) = E_t(ε_{t+k} + θ ε_{t+k−1}) = 0 for k ≥ 2,

and

Var_t(x_{t+1}) = Var_t(ε_{t+1} + θ ε_t) = σ²
... = ...
Var_t(x_{t+k}) = Var_t(ε_{t+k} + θ ε_{t+k−1}) = (1 + θ²) σ² for k ≥ 2.
It is easy to see that for an MA(1) process, the conditional expectation and variance two steps ahead and beyond are the same as the unconditional ones. Next, for an MA(q) model,

x_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ... + θ_q ε_{t−q} = Σ_{j=0}^q θ_j ε_{t−j} (with θ_0 = 1),

if we know ε_t, ε_{t−1}, ..., ε_{t−q}, then

E_t(x_{t+1}) = E_t( Σ_{j=0}^q θ_j ε_{t+1−j} ) = Σ_{j=1}^q θ_j ε_{t+1−j}
E_t(x_{t+2}) = E_t( Σ_{j=0}^q θ_j ε_{t+2−j} ) = Σ_{j=2}^q θ_j ε_{t+2−j}
... = ...
E_t(x_{t+k}) = E_t( Σ_{j=0}^q θ_j ε_{t+k−j} ) = Σ_{j=k}^q θ_j ε_{t+k−j} for k ≤ q,

and E_t(x_{t+k}) = 0 for k > q. Likewise,

Var_t(x_{t+1}) = Var_t( Σ_{j=0}^q θ_j ε_{t+1−j} ) = σ²
Var_t(x_{t+2}) = Var_t( Σ_{j=0}^q θ_j ε_{t+2−j} ) = (1 + θ_1²) σ²
... = ...
Var_t(x_{t+k}) = Var_t( Σ_{j=0}^q θ_j ε_{t+k−j} ) = σ² Σ_{j=0}^{k−1} θ_j² for 0 < k ≤ q,

and Var_t(x_{t+k}) = σ² Σ_{j=0}^q θ_j² for k > q. We can see that for an MA(q) process, the conditional expectation and variance of the forecast at horizon q + 1 and beyond are the same as the unconditional expectation and variance.
Wold Decomposition

So far we have focused on ARMA models, which are linear time series models. Is there any relationship between a general covariance stationary process (which may be nonlinear) and linear representations? The answer is given by the Wold decomposition theorem.

Proposition 2 (Wold Decomposition). Any zero-mean covariance stationary process x_t can be represented in the form

x_t = Σ_{j=0}^∞ ψ_j ε_{t−j} + V_t,

where

(i) ψ_0 = 1 and Σ_{j=0}^∞ ψ_j² < ∞;
(ii) ε_t ~ WN(0, σ²);
(iii) E(ε_t V_s) = 0 for all s, t > 0;
(iv) ε_t is the error in forecasting x_t by a linear function of lagged x: ε_t = x_t − E(x_t | x_{t−1}, x_{t−2}, ...);
(v) V_t is a deterministic process, in that it can be predicted from a linear function of lagged x.

Remarks: The Wold decomposition says that any covariance stationary process has a linear representation: a linear deterministic component (V_t) and a linear indeterministic component (Σ_j ψ_j ε_{t−j}). If V_t = 0, then the process is said to be purely non-deterministic, and the process can be represented as an MA(∞) process. Basically, ε_t is the error from the projection of x_t on lagged x; therefore it is uniquely determined and is orthogonal to lagged x and lagged ε. Since this error is the residual from the projection, it need not be the true error in the DGP of x_t. Also note that the error term ε_t is a white noise process and does not need to be i.i.d.
Readings:
Hamilton Ch. 1-4
Brockwell and Davis Ch. 3
Hayashi Ch 6.1, 6.2
Any covariance stationary process has both a time domain representation and a frequency domain representation. So far, our analysis has been in the time domain, as we represent a time series {x_t} in terms of past innovations and investigate the dependence of x at distinct times. In some cases, a frequency-domain representation is more convenient for describing a process. To transform a time-domain representation into a frequency-domain representation, we use the Fourier transform.
Fourier Transforms
Let ω denote the frequency (−π < ω < π), and let T denote the period: the minimum time it takes the wave to go through a whole cycle, so that T = 2π/ω. Given any integer z, we have x(t) = x(t + zT). Finally, we let δ denote the phase: the amount by which a wave is shifted.

Given a time series {x_t}, its Fourier transform is

x(ω) = (1/2π) Σ_{t=−∞}^∞ e^{−itω} x(t),    (1)

and the inverse Fourier transform is

x(t) = ∫_{−π}^{π} e^{itω} x(ω) dω.    (2)
Spectrum
Recall that the autocovariance function for a zero-mean stationary process {x_t} is defined as

γ_x(h) = E(x_t x_{t−h}),

and it serves to characterize the time series {x_t}. The spectrum of {x_t} is defined to be the Fourier transform of γ_x(h):

S_x(ω) = (1/2π) Σ_{h=−∞}^∞ e^{−ihω} γ_x(h).    (3)

Recall that the autocovariance-generating function is g_x(z) = Σ_{h=−∞}^∞ γ_x(h) z^h; if we let z = e^{−iω}, then the spectrum is just the autocovariance-generating function divided by 2π. In (3), if we take ω = 0, we see that

Σ_{h=−∞}^∞ γ_x(h) = 2π S_x(0),

which tells us that the sum of the autocovariances equals the spectrum at zero multiplied by 2π. Using the identity

e^{iδ} = cos δ + i sin δ,

we can also write (3) as

S_x(ω) = (1/2π) [ γ_x(0) + 2 Σ_{h=1}^∞ γ_x(h) cos(hω) ].    (4)

Note that since cos(ω) = cos(−ω) and γ_x(h) = γ_x(−h), the spectrum is symmetric about zero. Also, the cosine function is periodic with period 2π; therefore, for spectral analysis, we only need to find the spectrum for ω ∈ [0, π]. Now if we know γ_x(h), we can compute the spectrum using (4), and if we know the spectrum S_x(ω), we can compute γ_x(h) using the inverse Fourier transform:

γ_x(h) = ∫_{−π}^{π} e^{iωh} S_x(ω) dω.    (5)

In particular, taking h = 0,

γ_x(0) = ∫_{−π}^{π} S_x(ω) dω,

so the variance of {x_t} is just the integral of the spectrum over all frequencies −π < ω < π. Therefore the spectrum S_x(ω) decomposes the variance into components contributed by each frequency. In other words, we can use the spectrum to find the importance of cycles of different frequencies.
If we normalize the spectrum S_x(ω) by dividing by γ_x(0), we get the Fourier transform of the autocorrelation function ρ_x(h):

f_x(ω) = (1/2π) Σ_{h=−∞}^∞ e^{−ihω} ρ_x(h).    (6)

The autocorrelations can be recovered from f_x(ω) using the inverse transform

ρ_x(h) = ∫_{−π}^{π} e^{iωh} f_x(ω) dω,    (7)

and in particular

1 = ρ_x(0) = ∫_{−π}^{π} f_x(ω) dω.

Note that f_x(ω) is positive and integrates to one, just like a probability density, so we call it the spectral density.
Example 1 (spectral density of white noise): Let ε_t ~ WN(0, σ²). We have γ_ε(0) = σ² and γ_ε(h) = 0 for h ≠ 0. Using (3) and (6), we can compute

S_ε(ω) = (1/2π) γ_ε(0) = σ²/(2π).

Dividing by γ_ε(0), we have

f_ε(ω) = 1/(2π).

So the spectral density is uniform over [−π, π], i.e., every frequency makes an equal contribution to the variance.
fx () =
Considering that the spectrum of a white noise process is so simple, we may want to know if we
could make use it for a more complicated process, say,
1
X
xt =
k t k = (L)t .
k= 1
We call this process a two-sided moving average process. Then what is the relationship between
Sx (!) and S (!)? The general solution is given in the following statement.
Proposition 1. If {x_t} is a zero-mean stationary process with spectrum S_x(ω), and {y_t} is the process

y_t = Σ_{k=−∞}^∞ ψ_k x_{t−k} = ψ(L) x_t,

then

S_y(ω) = | ψ(e^{−iω}) |² S_x(ω), where ψ(e^{−iω}) = Σ_{k=−∞}^∞ ψ_k e^{−ikω}.

Proof: First compute the autocovariances of y:

γ_y(h) = E(y_t y_{t−h})
       = E( Σ_{j=−∞}^∞ ψ_j x_{t−j} · Σ_{k=−∞}^∞ ψ_k x_{t−h−k} )
       = Σ_{j,k=−∞}^∞ ψ_j ψ_k E(x_{t−j} x_{t−h−k})
       = Σ_{j,k=−∞}^∞ ψ_j ψ_k γ_x(h + k − j).

Then

S_y(ω) = (1/2π) Σ_{h=−∞}^∞ e^{−ihω} γ_y(h)
       = (1/2π) Σ_{h=−∞}^∞ e^{−ihω} Σ_{j,k=−∞}^∞ ψ_j ψ_k γ_x(h + k − j)
       = Σ_{j,k=−∞}^∞ ψ_j e^{−ijω} ψ_k e^{ikω} · (1/2π) Σ_{l=−∞}^∞ e^{−ilω} γ_x(l)    (letting l = h + k − j)
       = ψ(e^{−iω}) ψ(e^{iω}) S_x(ω)
       = | ψ(e^{−iω}) |² S_x(ω).
Example 2. To apply this result, first consider the problem of computing the spectrum of an MA(1) process,

x_t = ε_t + θ ε_{t−1} = (1 + θL) ε_t.

In this problem,

ψ(e^{−iω}) = 1 + θ e^{−iω},

thus

|ψ(e^{−iω})|² = (1 + θ e^{−iω})(1 + θ e^{iω}) = 1 + θ² + θ (e^{−iω} + e^{iω}).

Therefore,

S_x(ω) = |ψ(e^{−iω})|² S_ε(ω) = (σ²/2π) [1 + θ² + θ (e^{−iω} + e^{iω})].

We can verify this result by using the spectrum to compute the autocovariance function, say γ_x(1), using (5):

γ_x(1) = ∫_{−π}^{π} e^{iω} S_x(ω) dω
       = (σ²/2π) ∫_{−π}^{π} e^{iω} [1 + θ² + θ (e^{−iω} + e^{iω})] dω
       = (σ²/2π) θ · 2π
       = θ σ²,

which is the same as what we got from working in the time domain. In the computation we use the fact that ∫_{−π}^{π} e^{ikω} dω = 0 for any nonzero integer k, as the integral of a sine or cosine function all the way around a circle is zero.
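The inversion can also be checked numerically by discretizing the integral in (5); θ = 0.5 and σ² = 1 are illustrative values:

```python
import numpy as np

# Recover gamma(h) by integrating e^{i omega h} S_x(omega) over [-pi, pi],
# where S_x(omega) = (sigma^2 / 2pi) (1 + theta^2 + 2 theta cos omega).
theta, sigma2 = 0.5, 1.0
n = 200_000
w = np.linspace(-np.pi, np.pi, n, endpoint=False)
dw = 2 * np.pi / n
S = sigma2 / (2 * np.pi) * (1 + theta**2 + 2 * theta * np.cos(w))

gamma0 = S.sum() * dw                            # total variance
gamma1 = (np.exp(1j * w) * S).sum().real * dw    # first autocovariance
print(round(gamma0, 6))  # 1.25  (= (1 + theta^2) sigma^2)
print(round(gamma1, 6))  # 0.5   (= theta sigma^2)
```

The rectangle rule over a full period is exact (up to floating-point error) for trigonometric polynomials like this spectrum, which is why the recovered values match the time-domain formulas so closely.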
Figure 1 plots the spectrum of MA(1) processes with positive and negative coefficients. When \theta > 0, we see that the spectrum is high for low frequencies and low for high frequencies. When \theta < 0, we observe the opposite. This is because when \theta is positive, we have positive one-lag correlation, which makes the series smooth, with only a small contribution from high-frequency (say, day-to-day) components. When \theta is negative, we have negative one-lag correlation, so the series fluctuates rapidly about its mean value.
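As a quick numerical sanity check of the MA(1) spectrum formula (a sketch of my own; NumPy and the parameter values \theta = 0.5, \sigma^2 = 1 are illustrative choices, not from the notes), integrating e^{i\omega} S_x(\omega) over [-\pi, \pi] should recover \gamma_x(1) = \theta\sigma^2:

```python
import numpy as np

theta, sigma2 = 0.5, 1.0

# MA(1) spectrum: S_x(w) = (sigma^2 / 2pi) * (1 + theta^2 + 2*theta*cos(w))
w = np.linspace(-np.pi, np.pi, 100001)
S = sigma2 / (2 * np.pi) * (1 + theta**2 + 2 * theta * np.cos(w))

# gamma_x(1) = int e^{iw} S_x(w) dw; the sine part vanishes by symmetry,
# so a Riemann sum of cos(w)*S(w) over one full period suffices.
dw = w[1] - w[0]
gamma1 = np.sum(np.cos(w[:-1]) * S[:-1]) * dw
print(gamma1)   # close to theta * sigma2 = 0.5
```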
Above we have considered the moving average process; the next proposition gives results for ARMA models with white noise errors:
Proposition 2 Let \{x_t\} be an ARMA(p, q) process satisfying

\phi(L)x_t = \theta(L)\varepsilon_t, \quad \varepsilon_t \sim WN(0, \sigma^2),

where all roots of \phi(L) lie outside the unit circle. Then the spectrum of x_t is

S_x(\omega) = \frac{\big|\theta(e^{-i\omega})\big|^2}{\big|\phi(e^{-i\omega})\big|^2} S_\varepsilon(\omega) = \frac{1}{2\pi} \frac{\big|\theta(e^{-i\omega})\big|^2}{\big|\phi(e^{-i\omega})\big|^2} \sigma^2.

[Figure 1: Plots of the spectrum of MA(1) processes (\theta = 0.5 for the left figure and \theta = -0.5 for the right figure); frequency on the horizontal axis, spectrum on the vertical axis.]
Example 3 Consider an AR(1) process,

x_t = \phi x_{t-1} + \varepsilon_t.

Here \phi(L) = 1 - \phi L and \theta(L) = 1, so by Proposition 2,

S_x(\omega) = \frac{\sigma^2}{2\pi} \frac{1}{\big|1 - \phi e^{-i\omega}\big|^2} = \frac{\sigma^2}{2\pi} \frac{1}{1 + \phi^2 - 2\phi\cos\omega}. \quad (8)
Figure 2 plots the spectrum of AR(1) processes with positive and negative coefficients. We have similar observations here as for the MA processes. However, note that when \phi \to 1, S_x(0) \to \infty, which means that a random walk process has an infinite spectrum at frequency zero. This matches what we know about summation and differencing. When we add up a white noise (say, \phi = 1 as in a random walk), the high frequencies are smoothed out (those spikes in the white noise disappear) and what is left is the long-term stochastic trend. On the contrary, when we difference (say, take the first difference of a random walk, so we are back to the white noise series), we get rid of the long-term trend, and what is left is the high frequencies (lots of spikes with mean zero, say).
Finally, we introduce a spectral representation theorem without proof. For a zero-mean stationary process with absolutely summable autocovariances, define random variables \alpha(\omega) and \delta(\omega); we can represent the series in the form

x_t = \int_0^{\pi} \big[ \alpha(\omega)\cos(\omega t) + \delta(\omega)\sin(\omega t) \big]\, d\omega,
[Figure 2: Plots of the spectrum of AR(1) processes (\phi = 0.5 for the left figure and \phi = -0.5 for the right figure); frequency on the horizontal axis, spectrum on the vertical axis.]
where \alpha(\omega) and \delta(\omega) have zero mean and are mutually and serially uncorrelated. The representation theorem tells us that a stationary process with absolutely summable autocovariances can be written as a weighted sum of periodic functions.
The spectrum is an autocovariance generating function, and we can use it to compute the autocovariances of a stationary process. Besides computing the autocovariances of a single time series, a spectrum function can also capture the covariance across two time series. We call such a spectrum function the cross spectrum.

For a single time series \{x_t\}, the spectrum is the Fourier transform of the autocovariance function \gamma_x(h) = E(x_t x_{t-h}). Similarly, for two time series \{x_t\} and \{y_t\}, the cross spectrum is the Fourier transform of the covariance of x_t and y_{t-h}, i.e.
S_{xy}(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(x_t y_{t-h}).

In general,

S_{xy}(\omega) \neq S_{yx}(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(y_t x_{t-h}).

But they have the following relationship:

S_{xy}(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(x_t y_{t-h})
= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(y_t x_{t+h}) \quad (\text{by stationarity})
= \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} e^{ik\omega} E(y_t x_{t-k}) \quad (\text{let } k = -h)
= S_{yx}(-\omega).
Note that if x_t and y_s are uncorrelated for all t, s, then E(x_t y_{t-h}) = 0 for all h; therefore, S_{xy}(\omega) = S_{yx}(\omega) = 0. Knowing the cross spectrum, next we can compute the spectrum of a sum. For a process z_t = x_t + y_t, the spectrum of z_t can be computed as follows:

S_z(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E(z_t z_{t-h})
= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} E\big[(x_t + y_t)(x_{t-h} + y_{t-h})\big]
= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} \big[ E(x_t x_{t-h}) + E(x_t y_{t-h}) + E(y_t x_{t-h}) + E(y_t y_{t-h}) \big]
= S_x(\omega) + S_{xy}(\omega) + S_{yx}(\omega) + S_y(\omega).
Estimation

Recall that the population spectrum is

S_x(\omega) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\omega} \gamma_x(h), \quad (3)

and the sample autocovariances are

\hat\gamma(h) = \frac{1}{T} \sum_{t=h+1}^{T} \big[(x_t - \bar{x})(x_{t-h} - \bar{x})\big].

To estimate the spectrum, we may compute the sample analog of (3), which is known as the sample periodogram

I_x(\omega) = \frac{1}{2\pi} \sum_{h=-T+1}^{T-1} e^{-ih\omega} \hat\gamma(h) = \frac{1}{2\pi} \Big[ \hat\gamma(0) + 2 \sum_{h=1}^{T-1} \hat\gamma(h)\cos(\omega h) \Big]. \quad (9)

Asymptotically, 2 I_x(\omega)/S_x(\omega) \to_d \chi^2(2). Since E(\chi^2(2)) = 2, the sample periodogram provides an asymptotically unbiased estimate of the spectrum, \lim_{T\to\infty} E I_x(\omega) = S_x(\omega). However, the variance of I_x(\omega) does not go to zero.

One approach is to estimate the spectrum parametrically. For example, if we model the data as an MA(1) process with \varepsilon_t \sim WN(0, 1), we can estimate \theta and plug the estimate into

S_x(\omega) = \frac{1}{2\pi} \big[ 1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega}) \big].
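A minimal sketch of the sample periodogram in (9), computed for simulated white noise (NumPy, the seed, and the sample size are my own illustrative choices). Single ordinates scatter widely, consistent with the non-vanishing variance noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
x = rng.standard_normal(T)            # white noise: S_x(w) = 1/(2*pi) for all w

def sample_acov(x, h):
    # Biased sample autocovariance: (1/T) * sum (x_t - xbar)(x_{t-h} - xbar)
    xbar = x.mean()
    return np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)

gammas = np.array([sample_acov(x, h) for h in range(T)])

def periodogram(w):
    # (9): I_x(w) = (1/2pi) [g(0) + 2 * sum_{h>=1} g(h) cos(w h)]
    return (gammas[0] + 2 * np.sum(gammas[1:] * np.cos(w * np.arange(1, T)))) / (2 * np.pi)

# Evaluate at the fundamental frequencies 2*pi*j/T
vals = np.array([periodogram(2 * np.pi * j / T) for j in range(1, 51)])
print(vals.mean())   # roughly 1/(2*pi), though individual ordinates are very noisy
```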
A potential problem with parametric estimation is that we have to specify a parametric model for the process, say ARMA(p, q), so we may have some errors due to misspecification. However, even if the model is incorrectly specified, if the autocovariances of the true process are close to those of our specification, then this procedure could still provide a useful estimate of the population spectrum.
An alternative approach is to estimate the spectrum nonparametrically, which saves us from specifying a model for the process. We still make use of the sample periodogram; however, to estimate the spectrum S_x(\omega), we use a weighted average of the sample periodogram over several neighboring \omega's. How much weight to put on each \omega in the neighborhood is determined by a function known as the kernel, or kernel function. This means that the spectrum is estimated by

\hat{S}_x(\omega_j) = \sum_{l=-m}^{m} k(l, m)\, I_x(\omega_{j+l}), \quad \text{with} \quad \sum_{l=-m}^{m} k(l, m) = 1. \quad (10)
Here m is the bandwidth or window, indicating how many different frequencies are viewed as useful in estimating S_x(\omega_j). Averaging I_x(\omega) over different frequencies can equivalently be represented as multiplying the hth autocovariance \hat\gamma(h) in (9) by a weight function w(h, q). A derivation can be found on page 166 of Hamilton.

These weight functions w(h, q) satisfy w(0, q) = 1, |w(h, q)| \le 1, and w(h, q) = 0 for h > q. The q in the weight function works in a similar way as the m in k(l, m), as it specifies the length of the window. Some commonly used weight functions are:
Truncated kernel: let x = h/q,

w(x) = 1 for |x| \le 1, and 0 otherwise.

Bartlett kernel: let x = h/q,

w(x) = 1 - |x| for |x| \le 1, and 0 otherwise;

equivalently, w(h, q) = 1 - h/(q+1) for h = 1, 2, \ldots, q, and 0 otherwise.
A typical problem in nonparametric estimation is the trade-off between variance and bias. Usually a large bandwidth reduces variance but induces bias. To reduce the variance without adding much bias, we need to choose a proper bandwidth. In practice, we may plot an estimate of the spectrum using several different bandwidths and use subjective judgment to choose among them. Basically, if the plot is too flat, then it is hard to extract information such as which frequencies are more important than others; on the other hand, if the plot is too choppy (too many peaks and valleys mixed together), then it is hard to make convincing comments.
Example 4 (Spectrum estimation of an AR(1) process). The data are generated from

x_t = \phi x_{t-1} + \varepsilon_t, \quad \phi = 0.5.

We simulated a sequence of length n = 200 using this DGP, and the OLS estimate of \phi is 0.59 (the OLS estimate is consistent in this problem). The upper-left figure in Figure 3 plots the population spectrum, i.e., using (8) with \phi = 0.5. The upper-right figure plots the estimated spectrum using (8) with the OLS estimate of \phi, 0.59. The lower-left figure plots the sample periodogram I_x(\omega), which is very volatile. Finally, the lower-right figure plots the smoothed estimate of the spectrum using the Bartlett kernel, i.e.

\hat{S}_x(\omega) = (2\pi)^{-1} \Big[ \hat\gamma_x(0) + 2 \sum_{j=1}^{q} \Big(1 - \frac{j}{q+1}\Big) \hat\gamma_x(j) \cos(\omega j) \Big],

where q is set to be 5.
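The Bartlett-smoothed estimator above can be sketched as follows (a hypothetical re-implementation; the seed and the standard-normal innovations are my own simulation choices, not taken from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n, q = 0.5, 200, 5

# Simulate the AR(1) DGP x_t = phi * x_{t-1} + eps_t
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

def acov(x, h):
    xbar = x.mean()
    return np.sum((x[h:] - xbar) * (x[:len(x) - h] - xbar)) / len(x)

def bartlett_spectrum(x, w, q):
    # (2*pi)^{-1} [g(0) + 2 * sum_{j<=q} (1 - j/(q+1)) g(j) cos(w j)]
    s = acov(x, 0)
    for j in range(1, q + 1):
        s += 2 * (1 - j / (q + 1)) * acov(x, j) * np.cos(w * j)
    return s / (2 * np.pi)

# Population value at w = 0 is 1/(2*pi*(1-phi)^2) ~ 0.64; the smoothed
# estimate should be of the same order of magnitude (Bartlett biases it down).
print(bartlett_spectrum(x, 0.0, q))
```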
[Figure 3: the population spectrum (upper left), the parametric estimate (upper right), the sample periodogram (lower left), and the Bartlett-smoothed estimate (lower right), each plotted against frequency.]
In empirical studies, Section 6.4 of Hamilton, on the spectrum of an industrial production series, provides a very good example. Without any detrending, the spectrum is concentrated in the low-frequency region, which means that the variance of the series largely comes from the long-term trend (here, economic growth). After detrending, we obtain the growth rate, which is stationary, and the variance now mostly comes from the business cycle and seasonal effects. After filtering out the seasonal effects, most of the variance is due to the business cycle.

Readings: Hamilton, Ch. 6; Brockwell and Davis, Ch. 4, Ch. 10
In time series analysis, we usually use asymptotic theory to derive joint distributions of the estimators for parameters in a model. An asymptotic distribution is a distribution we obtain by letting the time horizon (sample size) go to infinity. We can simplify the analysis by doing so (as we know that some terms converge to zero in the limit), but we may also have a finite-sample error. Hopefully, when the sample size is large enough, the error becomes small and we have a satisfactory approximation to the true or exact distribution. The reason that we use the asymptotic distribution instead of the exact distribution is that the exact finite-sample distribution is in many cases too complicated to derive, even for Gaussian processes. Therefore, we use asymptotic distributions as alternatives.
Review
I think that this lecture may contain more propositions and definitions than any other lecture in this course. In summary, we are interested in two types of asymptotic results. The first is about convergence to a constant: for example, whether the sample moments converge to the population moments; the law of large numbers (LLN) is the classical result here. The second type is about convergence to a random variable, say Z, and in many cases Z follows a standard normal distribution; the central limit theorem (CLT) provides a tool for establishing asymptotic normality.

The confusing part of this lecture might be that we have several versions of the LLN and CLT. The results may look similar, but the assumptions are different. We will start from the strongest assumption, i.i.d., and then show how to obtain similar results when i.i.d. is violated. Before we come to the major part on the LLN and CLT, we first review some basic concepts.
1.1 Convergence in Probability
Proposition 1 If X_n and Y_n are random variables defined on the same probability space and a_n > 0, b_n > 0, then:

(i) if X_n = o_p(a_n) and Y_n = o_p(b_n), we have

X_n + Y_n = o_p(\max(a_n, b_n)) \quad \text{and} \quad |X_n|^r = o_p(a_n^r) \ \text{for } r > 0;

(ii) X_n Y_n = o_p(a_n b_n).

Proof of (i): If |X_n + Y_n|/\max(a_n, b_n) > \epsilon, then either |X_n|/a_n > \epsilon/2 or |Y_n|/b_n > \epsilon/2. Hence

P\big(|X_n + Y_n|/\max(a_n, b_n) > \epsilon\big) \le P(|X_n|/a_n > \epsilon/2) + P(|Y_n|/b_n > \epsilon/2) \to 0.

Finally,

P\big(|X_n|^r/a_n^r > \epsilon\big) = P\big(|X_n|/a_n > \epsilon^{1/r}\big) \to 0.

Proof of (ii): If |X_n Y_n|/(a_n b_n) > \epsilon\delta, then either |Y_n|/b_n > \delta, or |Y_n|/b_n \le \delta and |X_n|/a_n > \epsilon. Hence

P\big(|X_n Y_n|/(a_n b_n) > \epsilon\delta\big) \le P(|Y_n|/b_n > \delta) + P(|X_n|/a_n > \epsilon) \to 0.
so that g(X_n) - g(X) = o_p(1), where A = \{|g(X_n) - g(X)| > \epsilon\}. Bound the probability of A by splitting on whether |X_n| and |X| exceed a truncation level M:

P\big(|g(X_n) - g(X)| > \epsilon\big) \le P(A, |X_n| \le M, |X| \le M) + P(|X_n| > M/2) + P(|X| > M/2) + P(|X_n - X| > M/2).

Given any \eta > 0, we can choose M to make the second and third terms each less than \eta/4. Since X_n \to_p X, the first and fourth terms will each be less than \eta/4 for n large. Therefore, we have

P\big(|g(X_n) - g(X)| > \epsilon\big) \le \eta.
Example 1 (Convergence in probability but not almost surely) Let the sample space S = [0, 1], a closed interval. Define the sequence \{X_n\} as

X_1(s) = s + 1_{[0,1]}(s), \quad X_2(s) = s + 1_{[0,1/2]}(s), \quad X_3(s) = s + 1_{[1/2,1]}(s),
X_4(s) = s + 1_{[0,1/3]}(s), \quad X_5(s) = s + 1_{[1/3,2/3]}(s), \quad X_6(s) = s + 1_{[2/3,1]}(s),

etc., where 1 is the indicator function, i.e., it equals 1 if the statement is true and 0 otherwise. Let X(s) = s. Then X_n \to_p X, since P(|X_n - X| \ge \epsilon) equals the length of the interval of s values on which the indicator is 1, and this length goes to zero as n \to \infty. However, X_n does not converge to X almost surely; in fact, there is no s \in S for which X_n(s) \to s = X(s). For every s, the value of X_n(s) alternates between s and s + 1 infinitely often.
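The mechanics of Example 1 can be checked numerically (the block enumeration below — block k consisting of the k intervals [i/k, (i+1)/k] — is my reading of the example):

```python
from fractions import Fraction

def block_hits(s, max_block):
    # For each block k: does some interval [i/k, (i+1)/k] contain s?
    # (Always yes, so the indicator fires in every block.)
    return [any(Fraction(i, k) <= s <= Fraction(i + 1, k) for i in range(k))
            for k in range(1, max_block + 1)]

s = Fraction(3, 10)
hits = block_hits(s, 50)
print(all(hits))               # True: X_n(s) = s + 1 occurs infinitely often, no a.s. convergence
print(float(Fraction(1, 50)))  # 0.02: but P(|X_n - X| >= 1) = 1/k -> 0, convergence in probability
```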
1.2
Convergence in Lp Norm
When E(|X_n|^p) < \infty with p > 0, X_n is said to be L_p-bounded. The L_p norm of X is defined as \|X\|_p = (E|X|^p)^{1/p}. Before we define L_p convergence, we first review some useful inequalities.
Proposition 5 (Markov's inequality) If E|X|^p < \infty and p \ge 0, then

P(|X| \ge \epsilon) = P(|X|^p \ge \epsilon^p) \le \frac{E|X|^p}{\epsilon^p}.

Proof:

P(|X| \ge \epsilon) = E\, 1_{[1,\infty)}\Big(\frac{|X|^p}{\epsilon^p}\Big) \le E\Big[\frac{|X|^p}{\epsilon^p}\, 1_{[1,\infty)}\Big(\frac{|X|^p}{\epsilon^p}\Big)\Big] \le \frac{E|X|^p}{\epsilon^p}.
In Markov's inequality, we can also replace |X| with |X - c|, where c can be any real number. When p = 2, the inequality is also known as Chebyshev's inequality. If X is L_p-bounded, then Markov's inequality tells us that the tail probabilities converge to zero at the rate \epsilon^{-p} as \epsilon \to \infty. Therefore, the order of L_p-boundedness measures the tendency of a distribution to generate outliers.
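A quick Monte Carlo illustration of the bound (my own sketch, using standard normal draws and p = 2, i.e., Chebyshev's case):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal(1_000_000)

for eps in (1.0, 2.0, 3.0):
    lhs = np.mean(np.abs(X) >= eps)              # empirical tail probability P(|X| >= eps)
    rhs = np.mean(np.abs(X) ** 2) / eps ** 2     # Markov bound with p = 2
    print(eps, lhs <= rhs)                       # the bound holds at every eps
```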
Proposition 6 (Hölder's inequality) For any p \ge 1,

E|XY| \le \|X\|_p \|Y\|_q,

where q = p/(p-1) if p > 1 and q = \infty if p = 1.

Proposition 7 (Liapunov's inequality) For p \ge q \ge 1, \|X\|_p \ge \|X\|_q.

Proof: Let Z = |X|^q, Y = 1, and s = p/q. Then by Hölder's inequality, E|ZY| \le \|Z\|_s \|Y\|_{s/(s-1)}, or E|X|^q \le (E|X|^p)^{q/p}, which gives \|X\|_p \ge \|X\|_q.
For any p > q > 0, L_p convergence implies L_q convergence by Liapunov's inequality. We can regard convergence in probability as an L_0 convergence; therefore, L_p convergence implies convergence in probability:
Proposition 8 (L_p convergence implies convergence in probability) If X_n \to_{L_p} X, then X_n \to_p X.

Proof: By Markov's inequality,

P(|X_n - X| > \epsilon) \le \frac{E|X_n - X|^p}{\epsilon^p} \to 0.
1.3
Convergence in Distribution
A sequence \{X_n\} converges in distribution to X if F_{X_n}(x) \to F_X(x) as n \to \infty at every continuity point x of F_X.

Again, we can naturally extend the definition and related results for a scalar random variable X to a vector-valued random variable X. To verify convergence in distribution of a k \times 1 vector: if the scalar (\lambda_1 X_{1n} + \lambda_2 X_{2n} + \ldots + \lambda_k X_{kn}) converges in distribution to (\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k) for any real values (\lambda_1, \lambda_2, \ldots, \lambda_k), then the vector (X_{1n}, X_{2n}, \ldots, X_{kn}) converges in distribution to the vector (X_1, X_2, \ldots, X_k).
We also have the continuous mapping theorem for convergence in distribution.
Proposition 9 If \{X_n\} is a sequence of random k-vectors with X_n \to_d X, and g: \mathbb{R}^k \to \mathbb{R}^m is a continuous function, then g(X_n) \to_d g(X).
In the special case that the limit is a constant scalar or vector, convergence in distribution implies convergence in probability.
Proposition 10 If Xn !d c where c is a constant, then Xn !p c.
Proof: If X_n \to_d c, then F_{X_n}(x) \to 1_{[c,\infty)}(x) for all x \neq c. For any \epsilon > 0,

P(|X_n - c| \le \epsilon) \ge P(c - \epsilon < X_n \le c + \epsilon) = F_{X_n}(c + \epsilon) - F_{X_n}(c - \epsilon) \to 1_{[c,\infty)}(c + \epsilon) - 1_{[c,\infty)}(c - \epsilon) = 1 - 0 = 1.

On the other side, for a sequence \{X_n\}, if the limit under convergence in probability or almost sure convergence is a random variable X, then the sequence also converges in distribution to X.
1.4 Law of Large Numbers

Theorem 1 (Chebyshev's weak LLN) Let \{X_t\} be a sequence of random variables with E(X_t) = \mu, and let \bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t. If \lim_{n\to\infty} Var(\bar{X}_n) = 0, then

\bar{X}_n \to_p \mu.

Proof: By Chebyshev's inequality,

P(|\bar{X}_n - \mu| > \epsilon) \le \frac{Var(\bar{X}_n)}{\epsilon^2} \to 0.
The WLLN tells us that the sample mean is a consistent estimate of the population mean, and its variance vanishes as n \to \infty. Since E(\bar{X}_n - \mu)^2 = Var(\bar{X}_n) \to 0, we also know that \bar{X}_n converges to the population mean in mean square.

Theorem 2 (Kolmogorov's strong LLN) Let X_t be i.i.d. with E|X_t| < \infty and E(X_t) = \mu. Then

\bar{X}_n \to_{a.s.} \mu.
Note that Kolmogorov's LLN does not require finite variance. Next we consider the LLN for a heterogeneous process without serial correlation, say, E(X_t) = \mu_t and Var(X_t) = \sigma_t^2, and assume that \bar\mu_n = n^{-1}\sum_{t=1}^{n} \mu_t \to \mu. Then we know that E(\bar{X}_n) = \bar\mu_n \to \mu, and

Var(\bar{X}_n) = E\Big( n^{-1} \sum_{t=1}^{n} (X_t - \mu_t) \Big)^2 = n^{-2} \sum_{t=1}^{n} \sigma_t^2.

If \sum_{t=1}^{\infty} \sigma_t^2 / t^2 < \infty, then taking b_t = t^2 in Kronecker's lemma gives Var(\bar{X}_n) = n^{-2} \sum_{t=1}^{n} \sigma_t^2 \to 0. Then we have \bar{X}_n \to_p \mu.
1.5 Central Limit Theorem
Finally, central limit theorem (CLT) provides a tool to establish asymptotic normality of an estimator.
Definition 6 (Asymptotic normality) A sequence of random variables \{X_n\} is said to be asymptotically normal with mean \mu_n and standard deviation \sigma_n if \sigma_n > 0 for n sufficiently large and

(X_n - \mu_n)/\sigma_n \to_d Z, \quad \text{where } Z \sim N(0, 1).

Theorem (Lindeberg–Levy CLT) Let X_t be i.i.d. with mean \mu and variance \sigma^2, and let \bar{X}_n = (X_1 + \ldots + X_n)/n. Then

\sqrt{n}(\bar{X}_n - \mu) \to_d N(0, \sigma^2).

The delta method extends this to smooth functions of \bar{X}_n: if g is differentiable at \mu, then

g(\bar{X}_n) = g(\mu) + g'(\mu)(\bar{X}_n - \mu) + o_p(n^{-1/2}),

so

\sqrt{n}\big[g(\bar{X}_n) - g(\mu)\big] = g'(\mu)\sqrt{n}(\bar{X}_n - \mu) + o_p(1) \to_d N\big(0, g'(\mu)^2 \sigma^2\big).

For example, let g(\bar{X}_n) = 1/\bar{X}_n. If \sqrt{n}(\bar{X}_n - \mu) \to_d N(0, \sigma^2), then \sqrt{n}(1/\bar{X}_n - 1/\mu) \to_d N(0, \sigma^2/\mu^4).
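A Monte Carlo sketch of the last delta-method example (NumPy; the parameter choices \mu = 2, \sigma = 1 and the sample sizes are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 2.0, 1.0, 2000, 2000

# reps independent sample means, each over n i.i.d. N(mu, sigma^2) draws
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# Delta method: sqrt(n)(1/xbar - 1/mu) should be close to N(0, sigma^2/mu^4)
z = np.sqrt(n) * (1.0 / xbar - 1.0 / mu)
print(z.var())    # close to sigma^2 / mu^4 = 1/16 = 0.0625
```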
The Lindeberg–Levy CLT assumes i.i.d. data, which is often too strong in practice. Now we retain the assumption of independence but allow heterogeneous distributions (i.n.i.d.), and in the next section we will show versions of the CLT for serially dependent sequences.
In the following analysis, it is more convenient to work with normalized variables. We also need to use triangular arrays. An array X_{nt} is a double-indexed collection of numbers, and each sample size n can be associated with a different sequence. We use \{\{X_{nt}\}_{t=0}^{n}\}_{n=1}^{\infty}, or just \{X_{nt}\}, to denote an array. Let \{Y_t\} be the raw sequence with E(Y_t) = \mu_t. Define s_n^2 = \sum_{t=1}^{n} E(Y_t - \mu_t)^2, \sigma_{nt}^2 = E(Y_t - \mu_t)^2 / s_n^2, and

X_{nt} = \frac{Y_t - \mu_t}{s_n}.

Define

S_n = \sum_{t=1}^{n} X_{nt}, \quad \sum_{t=1}^{n} \sigma_{nt}^2 = 1. \quad (1)
Definition 7 (Lindeberg CLT) Let the array \{X_{nt}\} be independent with zero mean and variance sequence \{\sigma_{nt}^2\} satisfying (1). If the following (Lindeberg) condition holds,

\lim_{n\to\infty} \sum_{t=1}^{n} \int_{\{|X_{nt}| > \epsilon\}} X_{nt}^2 \, dP = 0 \quad \text{for all } \epsilon > 0, \quad (2)

then S_n \to_d N(0, 1). A sufficient condition for (2) is

\lim_{n\to\infty} \sum_{t=1}^{n} E|X_{nt}|^{2+\delta} = 0 \quad \text{for some } \delta > 0. \quad (3)
(3) is known as the Liapunov condition. It is stronger than the Lindeberg condition but more easily checkable; therefore it is more frequently used in practice.
We have seen that if the data \{X_t\} are generated by an ARMA process, then the observations are not i.i.d. but serially correlated. In this section, we will discuss how to derive asymptotic theory for stationary and serially dependent processes.
2.1
Consider the variance of the sample mean of a covariance stationary process:

Var(\bar{X}_n) = \frac{1}{n^2} \sum_{i,j=1}^{n} \gamma_x(i - j) = \frac{1}{n} \sum_{|h| < n} \Big(1 - \frac{|h|}{n}\Big)\gamma(h),

or

n\, Var(\bar{X}_n) = \gamma(0) + 2\Big(1 - \frac{1}{n}\Big)\gamma(1) + 2\Big(1 - \frac{2}{n}\Big)\gamma(2) + \ldots + 2\Big(1 - \frac{n-1}{n}\Big)\gamma(n-1).

If the autocovariances are absolutely summable,

|\gamma(0)| + 2|\gamma(1)| + 2|\gamma(2)| + \ldots < \infty,

then

\lim_{n\to\infty} n\, E(\bar{X}_n - \mu)^2 = \sum_{h=-\infty}^{\infty} \gamma(h).
so n E(\bar{X}_n - \mu)^2 stays bounded. A covariance stationary process is said to be ergodic for the mean if the time series average converges to the population mean. Similarly, if the sample average provides a consistent estimate of the second moment, then the process is said to be ergodic for the second moment. In this section, we see that a sufficient condition for a covariance stationary process to be ergodic for the mean is that \sum_{h=0}^{\infty} |\gamma(h)| < \infty. Further, if the process is Gaussian, then absolutely summable autocovariances also ensure that the process is ergodic for all moments.
Recall that in spectrum analysis we have

\sum_{h=-\infty}^{\infty} \gamma_x(h) = 2\pi S_x(0),

so the limit above is the long-run variance, 2\pi S_x(0).
2.2
Ergodic Theorem*
The ergodic theorem is a law of large numbers for a strictly stationary and ergodic process. We need a few concepts to define ergodic stationarity; these concepts can be found in the appendix. Given a probability space (\Omega, \mathcal{F}, P), an event E \in \mathcal{F} is invariant under a transformation T if E = T^{-1}E. A measure-preserving transformation T is ergodic if for any invariant event E we have P(E) = 1 or P(E) = 0. In other words, events that are invariant under ergodic transformations either occur almost surely or do not occur almost surely. Let T be the shift operator; then a strictly stationary process \{X_t\} is said to be ergodic if X_t = T^{t-1}X_1 for any t, where T is measure-preserving and ergodic.
Below is an alternative way to define ergodicity,
Theorem 7 Let (\Omega, \mathcal{F}, P) be a probability space and let \{X_t\} be a strictly stationary process, X_t(\omega) = X_1(T^{t-1}\omega). Then this process is ergodic if and only if for any pair of events A, B \in \mathcal{F},

\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} P(T^k A \cap B) = P(A)P(B). \quad (4)
To understand this result: if event A is not invariant and T is measure-preserving, then TA \cap A^c is not empty. Therefore repeated iterations of the transformation generate a sequence of sets \{T^k A\} containing different mixtures of the elements of A and A^c. A positive dependence of B on A implies a negative dependence of B on A^c, i.e.,

P(A \cap B) > P(A)P(B) \implies P(A^c \cap B) = P(B) - P(A \cap B) < P(B) - P(A)P(B) = P(A^c)P(B).
Example 2 (A non-ergodic process) Consider a process of the form X_t = U_t + Z, where, say, \{U_t\} is i.i.d. with mean 1/2 and Z is a zero-mean random variable, independent of \{U_t\}, that is drawn once and shared by the whole path; thus Z is invariant under the shift operator. If we compute the autocovariance, \gamma_X(h) = Var(Z), which does not decay no matter how large h is. This means that the dependence is too persistent. Recall that in lecture one we proposed that the time series average of a stationary process converges to its population mean only when the process is ergodic. In this example, the series is not ergodic: the true expectation of the process is 1/2, while the sample average \bar{X}_n = (1/n)\sum_{t=1}^{n} U_t + Z does not converge to 1/2, but to Z + 1/2.

In Example 2 we can see that in order for X_t to be ergodic, Z has to be a constant almost surely. In practice, ergodicity is usually assumed theoretically; it is essentially impossible to test empirically.
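The failure of ergodicity in Example 2 is easy to see in simulation (a sketch; U_t \sim Uniform(0,1) and Z \sim N(0,1) are my illustrative distributional assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

Z = rng.standard_normal()              # one draw of Z, fixed along the whole path
U = rng.uniform(0.0, 1.0, size=n)      # i.i.d. with mean 1/2
xbar = (U + Z).mean()                  # time average of X_t = U_t + Z

print(abs(xbar - (Z + 0.5)) < 0.01)    # True: the time average locks onto Z + 1/2 ...
print(abs(xbar - 0.5) < 0.01)          # ... not onto E(X_t) = 1/2 (False unless Z happens to be ~0)
```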
If a process is stationary and ergodic, we have the following LLN:
Theorem 8 (Ergodic theorem) Let X_t be a strictly stationary and ergodic process with E(X_t) = \mu. Then

\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t \to_{a.s.} \mu.
Recall that when a process is strictly stationary, a measurable function of this process is also strictly stationary. A similar property holds for ergodicity. Also, if the process is ergodic stationary, then all of its moments, provided they exist and are finite, can be consistently estimated by the corresponding sample moments. For instance, if X_t is strictly stationary and ergodic with E(X_t^2) = \sigma^2, then

\frac{1}{n}\sum_{t=1}^{n} X_t^2 \to_{a.s.} \sigma^2.
2.3
Mixing Sequences*
Application of the ergodic theorem is restricted in practice since it requires strict stationarity, which is too strong an assumption in many cases. Now we introduce another condition on dependence: mixing.
A mixing transformation T implies that repeated application of T to an event A mixes up A and A^c, so that when k is large, T^k A provides no information about the original event A. A classical example about mixing is due to Halmos (1956) (draw a picture here).
Consider making a dry martini: we pour a layer of vermouth (10% of the volume) on top of the gin (90% of the volume). Let G denote the gin, and F an arbitrary small region of the fluid, so that F \cap G is the gin contained in F. If P(\cdot) denotes the volume of a set as a proportion of the whole, P(G) = 0.9. The proportion of gin in F, denoted by P(F \cap G)/P(F), is initially either 0 or 1. Let T denote the operation of stirring the martini with a swizzle stick, so that P(T^k F \cap G)/P(F) is the proportion of gin in F after k stirs. If the stirring mixes the martini, we would expect the proportion of gin in T^k F, which is P(T^k F \cap G)/P(F), to tend to P(G), so that each region F of the martini eventually contains 90% gin.
Let (\Omega, \mathcal{F}, P) be a probability space, and let \mathcal{G}, \mathcal{H} be sub-\sigma-fields of \mathcal{F}. Define

\alpha(\mathcal{G}, \mathcal{H}) = \sup_{G \in \mathcal{G}, H \in \mathcal{H}} |P(G \cap H) - P(G)P(H)|, \quad (5)

and

\phi(\mathcal{G}, \mathcal{H}) = \sup_{G \in \mathcal{G}, H \in \mathcal{H};\, P(G) > 0} |P(H \mid G) - P(H)|. \quad (6)
Clearly, \alpha(\mathcal{G}, \mathcal{H}) \le \phi(\mathcal{G}, \mathcal{H}). The events in \mathcal{G} and \mathcal{H} are independent iff \alpha and \phi are zero.

For a sequence \{X_t\}_{-\infty}^{\infty}, let \mathcal{F}_{-\infty}^{t} = \sigma(\ldots, X_{t-1}, X_t) and \mathcal{F}_{t+m}^{\infty} = \sigma(X_{t+m}, X_{t+m+1}, \ldots). Define the strong mixing coefficient \alpha_m = \sup_t \alpha(\mathcal{F}_{-\infty}^{t}, \mathcal{F}_{t+m}^{\infty}) and the uniform mixing coefficient \phi_m = \sup_t \phi(\mathcal{F}_{-\infty}^{t}, \mathcal{F}_{t+m}^{\infty}).

The sequence is said to be \alpha-mixing or strong mixing if \lim_{m\to\infty} \alpha_m = 0, and it is said to be \phi-mixing or uniform mixing if \lim_{m\to\infty} \phi_m = 0. Since \alpha_m \le \phi_m, \phi-mixing implies \alpha-mixing.
A mixing sequence is not necessarily stationary, and it can be heterogeneous. However, if a strictly stationary process is mixing, it must be ergodic. As you can see from (4), ergodicity implies average asymptotic independence. However, ergodicity does not imply that any two parts of the sequence eventually become independent, whereas a mixing sequence does have this property (asymptotic independence). Hence mixing is a stronger condition than ergodicity: a stationary and ergodic sequence need not be mixing.
We usually use a statistic called the size to characterize the rate of convergence of \alpha_m or \phi_m. A sequence is said to be \alpha-mixing of size -\varphi_0 if \alpha_m = O(m^{-\varphi}) for some \varphi > \varphi_0. If X_t is an \alpha-mixing sequence of size -\varphi_0, and if Y_t = g(X_t, X_{t-1}, \ldots, X_{t-k}) is a measurable function with k finite, then Y_t is also \alpha-mixing of size -\varphi_0. All of the above statements also apply to \phi-mixing.
When a sequence is stationary and mixing, Cov(X_1, X_m) \to 0 as m \to \infty. Consider the ARMA processes. If the process is MA(q), then it must be mixing, since any two events separated by a time interval larger than q are independent, i.e., \alpha_m = \phi_m = 0 for m > q. We will not discuss sufficient conditions for an MA(\infty) process to be strong or uniform mixing, but note that if the innovations are i.i.d. Gaussian, then absolute summability of the moving average coefficients is sufficient to ensure strong mixing.
The following LLN (McLeish (1975)) applies to heterogeneous and temporally dependent (mixing) sequences. We will only consider strong mixing.

Proposition 13 (LLN for heterogeneous mixing sequences) Let \{X_t\} be strong mixing of size -r/(r-1) for some r > 1, with finite means \mu_t = E(X_t). If for some \delta, 0 < \delta \le r,

\sum_{t=1}^{\infty} \Big( \frac{E|X_t - \mu_t|^{r+\delta}}{t^{r+\delta}} \Big)^{1/r} < \infty, \quad (7)

then \bar{X}_n - \bar\mu_n \to_{a.s.} 0.

2.4 Martingales
In time series observations, we know the past but not the future. Therefore, a very important device in time series modeling is to condition sequentially on past events. In a probability space (\Omega, \mathcal{F}, P), we characterize partial knowledge by specifying a \sigma-subfield of events from \mathcal{F}, for which it is known whether each of the events belonging to it has occurred or not. The accumulation of information over time is represented by an increasing sequence of \sigma-fields \{\mathcal{F}_t\}_{-\infty}^{\infty}, with \ldots \subseteq \mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}. Here the set \mathcal{F} has also been referred to as the universal information set. If X_t is known given \mathcal{F}_t for each t, then \{\mathcal{F}_t\}_{-\infty}^{\infty} is said to be adapted to the sequence \{X_t\}_{-\infty}^{\infty}, and the pair \{X_t, \mathcal{F}_t\}_{-\infty}^{\infty} is called an adapted sequence. Setting \mathcal{F}_t = \sigma(X_s, -\infty < s \le t), i.e., \mathcal{F}_t generated by all current and lagged observations of X, we obtain the minimal adapted sequence, and \mathcal{F}_t defined in this way is also known as the natural filtration.

Given an adapted sequence \{X_t, \mathcal{F}_t\}_{-\infty}^{\infty}, if we have

E|X_t| < \infty, \quad E(X_t \mid \mathcal{F}_{t-1}) = X_{t-1}

for all t, then the sequence is called a martingale. A simple example of a martingale is a random walk.
Let X_t = X_{t-1} + \varepsilon_t with X_0 = 0, \varepsilon_t \sim i.i.d.(0, \sigma^2), and \mathcal{F}_t = \sigma\{\varepsilon_t, \varepsilon_{t-1}, \ldots, \varepsilon_1\}. Then E|X_t| \le \sum_{k=1}^{t} E|\varepsilon_k| < \infty and

E(X_t \mid \mathcal{F}_{t-1}) = X_{t-1}.
Let \{X_t, \mathcal{F}_t\} be an adapted sequence. Two concepts related to martingales are submartingales, for which E(X_{t+1} \mid \mathcal{F}_t) \ge X_t, and supermartingales, for which E(X_{t+1} \mid \mathcal{F}_t) \le X_t.

A sequence \{Z_t\} is known as a martingale difference sequence (mds) if E(Z_t \mid \mathcal{F}_{t-1}) = 0. As you can see, an mds can be constructed from a martingale: for example, let Z_t = X_t - X_{t-1}, where \{X_t\} is a martingale; then \{Z_t\} is an mds. On the other hand, the sum of an mds is a martingale, i.e., \{X_t\} is a martingale if X_t = \sum_{i=1}^{t} Z_i where Z_i is an mds.
Proposition 14 If X_t is an mds, then E(X_t X_{t-h}) = 0 for all h \neq 0.

For example, if \varepsilon_t \sim i.i.d.(0, \sigma^2), then X_t = \varepsilon_t \varepsilon_{t-1} is an mds that is not an independent sequence. Another example is the GARCH model: in a GARCH model the error terms are mds, but the variance of the error depends on past values. Although the mds property is weaker than independence, an mds behaves in many ways just like an independent sequence. In cases where independence is violated, if the sequence is an mds, then many asymptotic results which hold for independent sequences also hold for mds.
One of the fundamental results in martingale theory is the martingale convergence theorem.
Theorem 9 (Martingale convergence theorem) If \{X_t, \mathcal{F}_t\} is an L_1-bounded submartingale, then X_n \to_{a.s.} X where E|X| < \infty. Further, let 1 < p < \infty. If \{X_t, \mathcal{F}_t\} is a martingale and \sup_t E|X_t|^p < \infty, then X_t converges in L_p as well as with probability one.

This is an existence theorem: it tells us that X_n converges to some X, but it does not tell us what X is. Still, the martingale convergence theorem (MGCT) is a very powerful result.
Example 5 (LLN for heterogeneous mds) Let \varepsilon_t \sim mds(0, \sigma_t^2) with \sup_t \sigma_t^2 = M < \infty. Define S_n = \sum_{t=1}^{n} \varepsilon_t / t; then S_n is a martingale with E(S_n^2) = \sum_{t=1}^{n} \sigma_t^2 / t^2. Verify that

\sup_n E(|S_n|^2) \le \sup_t \sigma_t^2 \Big( \sum_{t=1}^{\infty} \frac{1}{t^2} \Big) < \infty.

Therefore, S_n = \sum_{t=1}^{n} \varepsilon_t / t converges by the MGCT. Next, let b_n = n; then by Kronecker's lemma,

\frac{1}{n} \sum_{t=1}^{n} \varepsilon_t = \frac{1}{n} \sum_{t=1}^{n} t \cdot \frac{\varepsilon_t}{t} \to 0.
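A simulation sketch of Example 5 (using i.i.d. standard normal draws, which are in particular an mds; the seed and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
eps = rng.standard_normal(n)
t = np.arange(1, n + 1)

# S_n = sum_{t<=n} eps_t / t is a martingale with bounded second moment
S = np.cumsum(eps / t)
print(abs(S[-1] - S[n // 2]))   # S_n settles down (MGCT): the tail change is small
print(abs(np.mean(eps)))        # and, as Kronecker's lemma predicts, the sample mean is near 0
```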
Definition 8 (Mixingale) Let \{X_t, \mathcal{F}_t\} be an adapted sequence. \{X_t\} is an L_p-mixingale if there exist sequences of nonnegative constants \{c_t\} and \{\zeta_m\}, with \zeta_m \to 0 as m \to \infty, such that

\|E(X_t \mid \mathcal{F}_{t-m})\|_p \le c_t \zeta_m, \quad (8)

\|X_t - E(X_t \mid \mathcal{F}_{t+m})\|_p \le c_t \zeta_{m+1}, \quad (9)
for all t \ge 1 and m \ge 0. Intuitively, a mixingale captures the idea that the sequence \{\mathcal{F}_s\} contains progressively more information about X_t as s increases. In the remote past nothing is known, according to (8): any past event eventually becomes useless for predicting what happens today (at t). In the future, everything will eventually be known, according to (9). When X_t is \mathcal{F}_t-measurable, as in most of the cases we will be interested in, condition (9) always holds (since E(X_t \mid \mathcal{F}_{t+m}) = X_t). So to check whether a sequence is a mixingale, in many cases we only need to check condition (8). In what follows, we will mostly use L_1-mixingales. Condition (8) can then be written as

E\big|E(X_t \mid \mathcal{F}_{t-m})\big| \le c_t \zeta_m. \quad (10)

As you can see, mixingales are even more general than mds; in fact, an mds is a special kind of mixingale: set c_t = E|X_t|, \zeta_0 = 1, and \zeta_m = 0 for m \ge 1.
Example 6 Consider a two-sided MA(\infty) process,

X_t = \sum_{j=-\infty}^{\infty} \theta_j \varepsilon_{t-j},

where \varepsilon_t is an mds. Then

E(X_t \mid \mathcal{F}_{t-m}) = \sum_{j=m}^{\infty} \theta_j \varepsilon_{t-j}.

Take c_t = \sup_t E|\varepsilon_t| and \zeta_m = \sum_{j=m}^{\infty} |\theta_j|. Then if the moving average coefficients are absolutely summable, i.e., \sum_{j=-\infty}^{\infty} |\theta_j| < \infty, the tail sums go to zero, i.e., \zeta_m \to 0. Then condition (10) is satisfied and X_t is an L_1-mixingale.
In this example, first, we specify an MA process generated by mds errors, which is a more general class of stochastic processes than i.i.d. or white noise errors. Second, if E|\varepsilon_t| < \infty (which controls the tails of \varepsilon_t), then the condition of absolutely summable coefficients makes X_t an L_1-mixingale.
2.5 LLN for L_1-Mixingales

To derive a law of large numbers for L_1-mixingales, we need the notion of uniform integrability.

Definition 9 (Uniformly integrable sequence) A sequence \{X_t\} is said to be uniformly integrable if for every \epsilon > 0 there exists a number c > 0 such that for all t,

E\big(|X_t|\, 1_{[c,\infty)}(|X_t|)\big) < \epsilon.
We will see how to make use of this notion in a moment. First, we introduce the following two
conditions for uniform integrability.
Proposition 15 (Conditions for uniform integrability) (a) A sequence \{X_t\} is uniformly integrable if there exist an r > 1 and an M < \infty such that E(|X_t|^r) < M for all t. (b) Let \{X_t\} be a uniformly integrable sequence, and let Y_t = \sum_{k=-\infty}^{\infty} \theta_k X_{t-k} with \sum_{k=-\infty}^{\infty} |\theta_k| < \infty; then the sequence \{Y_t\} is also uniformly integrable.
With this notion in hand, we have the following LLN.

Proposition 16 (Law of large numbers for L_1-mixingales) Let \{X_t\} be an L_1-mixingale. If \{X_t\} is uniformly integrable and there exists a sequence \{c_t\} such that

\lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} c_t < \infty,

then \bar{X}_n = (1/n)\sum_{t=1}^{n} X_t \to_p 0.

Example 7 (LLN for mds with finite variance) Let \{X_t\} be an mds with E|X_t|^2 \le M < \infty. Then it is uniformly integrable, and we can take c_t = M; since (1/n)\sum_{t=1}^{n} c_t = M < \infty, by Proposition 16, \bar{X}_n \to_p 0.
We can naturally generalize mixingale sequences to mixingale arrays. An array \{X_{nt}\} is said to be an L_1-mixingale with respect to \{\mathcal{F}_{nt}\} if there exist nonnegative constants \{c_{nt}\} and a nonnegative sequence \{\zeta_m\} with \zeta_m \to 0 as m \to \infty such that

\|E(X_{nt} \mid \mathcal{F}_{n,t-m})\|_p \le c_{nt} \zeta_m, \quad (11)

\|X_{nt} - E(X_{nt} \mid \mathcal{F}_{n,t+m})\|_p \le c_{nt} \zeta_{m+1}, \quad (12)

for all t \ge 1 and m \ge 0. If the array is uniformly integrable with \lim_{n\to\infty}(1/n)\sum_{t=1}^{n} c_{nt} < \infty, then \bar{X}_n \to_p 0.

Example 8 Let \{\varepsilon_t\}_{t=1}^{\infty} be an mds with E|\varepsilon_t|^r < M for some r > 1 and M < \infty (i.e., \varepsilon_t is L_r-bounded). Let X_{nt} = (t/n)\varepsilon_t. Then \{X_{nt}\} is a uniformly integrable L_1-mixingale with c_{nt} = \sup_t E|\varepsilon_t|, \zeta_0 = 1, and \zeta_m = 0 for m > 0. Applying the LLN for L_1-mixingales, we have \bar{X}_n \to_p 0.
2.6 Consistency of Sample Second Moments

In this section, we show how to prove the consistency of estimates of second moments using the LLN for L_1-mixingales. There are two steps in the proof: first, we construct an L_1-mixingale; second, we verify that the conditions for applying the LLN are satisfied. This kind of methodology is very useful in many applications. The following proof can also be found on page 192 of Hamilton.

First, we want to construct a mixingale. Our problem is outlined as follows. Let X_t = \sum_{j=0}^{\infty} \theta_j \varepsilon_{t-j}, where \sum_{j=0}^{\infty} |\theta_j| < \infty and \varepsilon_t is i.i.d. with E|\varepsilon_t|^r < \infty for some r > 2. We want to prove that

\frac{1}{n} \sum_{t=1}^{n} X_t X_{t-k} \to_p E(X_t X_{t-k}).
Define X_{tk} = X_t X_{t-k} - E(X_t X_{t-k}). Since

X_t X_{t-k} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \theta_i \theta_j \varepsilon_{t-i} \varepsilon_{t-k-j}

and

E(X_t X_{t-k}) = E\Big( \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \theta_i \theta_j \varepsilon_{t-i} \varepsilon_{t-k-j} \Big) = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \theta_i \theta_j E(\varepsilon_{t-i} \varepsilon_{t-k-j}),

then

X_{tk} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \theta_i \theta_j \big( \varepsilon_{t-i} \varepsilon_{t-k-j} - E(\varepsilon_{t-i} \varepsilon_{t-k-j}) \big).

Let \mathcal{F}_t = \sigma\{\varepsilon_t, \varepsilon_{t-1}, \ldots\}; then

E(X_{tk} \mid \mathcal{F}_{t-m}) = \sum_{i=m}^{\infty} \sum_{j=m-k}^{\infty} \theta_i \theta_j \big( \varepsilon_{t-i} \varepsilon_{t-k-j} - E(\varepsilon_{t-i} \varepsilon_{t-k-j}) \big).

Hence

E\big|E(X_{tk} \mid \mathcal{F}_{t-m})\big| = E\Big| \sum_{i=m}^{\infty} \sum_{j=m-k}^{\infty} \theta_i \theta_j \big( \varepsilon_{t-i} \varepsilon_{t-k-j} - E(\varepsilon_{t-i} \varepsilon_{t-k-j}) \big) \Big|
\le \sum_{i=m}^{\infty} \sum_{j=m-k}^{\infty} |\theta_i \theta_j|\, E\big| \varepsilon_{t-i} \varepsilon_{t-k-j} - E(\varepsilon_{t-i} \varepsilon_{t-k-j}) \big|
\le M \sum_{i=m}^{\infty} \sum_{j=m-k}^{\infty} |\theta_i \theta_j| = M \sum_{i=m}^{\infty} |\theta_i| \sum_{j=m-k}^{\infty} |\theta_j|.

Since \theta_j is absolutely summable, its tails go to zero, i.e., \sum_{i=m}^{\infty} |\theta_i| \to 0 as m \to \infty; therefore \zeta_m \to 0.
Now we have shown that X_{tk} is an L_1-mixingale. Next, we want to show that it is uniformly integrable and that (1/n)\sum_{t=1}^{n} c_t < \infty. Since c_t = M < \infty, the latter condition holds. The uniform integrability can be verified using part (b) of Proposition 15. Therefore, applying the LLN, we have

\frac{1}{n} \sum_{t=1}^{n} X_{tk} = \frac{1}{n} \sum_{t=1}^{n} \big( X_t X_{t-k} - E(X_t X_{t-k}) \big) \to_p 0,

therefore

\frac{1}{n} \sum_{t=1}^{n} X_t X_{t-k} \to_p E(X_t X_{t-k}). \quad (13)

2.7 CLT for Martingale Difference Sequences
We have already learned several versions of CLT: (1) CLT for independently identically distributed
sequence (Lindeberg-Levy CLT), (2) CLT for independently non-identically distributed sequence
(Lindeberg CLT, Liapunov CLT). Now, we will consider the conditions for CLT to hold for a
martingale dierence sequence. Actually we can have CLT for any stationary ergodic mds with
finite variance:
Proposition 17 Let {Xt } be stationary and ergodic martingale dierence sequences with E(Xt2 ) =
2 < 1, then
n
1 X
p
Xt ! N (0, 2 ).
(14)
n t=1
Let Sn = Sn 1 + Xn with E(Sn ) = 0, which is a martingale with stationary and ergodic dierences,
then from the above proposition we can have n 1/2 Sn ! N (0, 2 ).
The conditions in the following version of CLT is usually easy to check in applications:
n = n 1 Pn Xt .
Proposition 18 (Central Limit Theorem for
mds) Let {Xt } be a mds with X
t=1
P
Suppose that (a) E(Xt2P
) = t2 > 0 with n 1 nt=1 t2 ! 2 > 0, (b) E|Xt |r < 1 for some r > 2
p
2 ).
and all t, and (c), n 1 nt=1 Xt2 !p 2 . Then nX
n ! N (0,
2 )=
Again, this proposition can be extended from sequence {Xt } to mds array {Xnt } with E(Xnt
In our last example in this lecture, we will use the next proposition, which is also a very useful
tool.
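As a quick numerical illustration of Proposition 17 (the example and parameter choices are mine, not from the notes), take $X_t = \varepsilon_t\varepsilon_{t-1}$ with $\varepsilon_t$ i.i.d. $N(0,1)$: this is a stationary ergodic mds with $E(X_t^2) = 1$, so the normalized sum should be approximately $N(0,1)$:

```python
import numpy as np

# Illustration of Proposition 17 (not from the notes): X_t = e_t * e_{t-1}
# with e_t i.i.d. N(0,1) is a stationary ergodic mds with E(X_t^2) = 1,
# so n^{-1/2} sum X_t should be approximately N(0, 1).
rng = np.random.default_rng(0)
reps, n = 2000, 500
stats = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n + 1)
    x = e[1:] * e[:-1]                 # mds w.r.t. the filtration of e
    stats[r] = x.sum() / np.sqrt(n)
print(stats.mean(), stats.var())       # should be near 0 and 1
```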
Proposition 19 Let $X_t$ be a strictly stationary process with $E(X_t^4) < \infty$. Let $Y_t = \sum_{j=0}^\infty \phi_j X_{t-j}$, where $\sum_{j=0}^\infty |\phi_j| < \infty$. Then $Y_t$ is a strictly stationary process with $E|Y_t Y_s Y_i Y_j| < \infty$ for all $t, s, i$ and $j$.

Example 9 (Example 7.15 in Hamilton) Let $Y_t = \sum_{j=0}^\infty \phi_j \varepsilon_{t-j}$ with $\sum_{j=0}^\infty |\phi_j| < \infty$, $\varepsilon_t \sim$ i.i.d.$(0, \sigma^2)$, and $E(\varepsilon_t^4) < \infty$. Then we see that $E(Y_t) = 0$ and $E(Y_t^2) = \sigma^2 \sum_{j=0}^\infty \phi_j^2$. Define $X_t = \varepsilon_t Y_{t-k}$ for $k > 0$; then $X_t$ is an mds with respect to $\{\varepsilon_t, \varepsilon_{t-1}, \ldots\}$, with $E(X_t^2) = \sigma^2 E(Y_t^2) = \sigma^4 \sum_{j=0}^\infty \phi_j^2$ (so condition (a) in Proposition 18 is satisfied), and $E(X_t^4) = E(\varepsilon_t^4 Y_{t-k}^4) = E(\varepsilon_t^4)E(Y_t^4) < \infty$. Here $E(\varepsilon_t^4) < \infty$ by assumption and $E(Y_t^4) < \infty$ by Proposition 19, so condition (b) in Proposition 18 is also satisfied. The remaining condition we need to verify in order to apply the CLT is condition (c):

$(1/n)\sum_{t=1}^n X_t^2 \to_p E(X_t^2)$.
Write

$(1/n)\sum_{t=1}^n X_t^2 = (1/n)\sum_{t=1}^n \varepsilon_t^2 Y_{t-k}^2 = (1/n)\sum_{t=1}^n (\varepsilon_t^2 - \sigma^2)Y_{t-k}^2 + \sigma^2 (1/n)\sum_{t=1}^n Y_{t-k}^2$.

The first term is a normed sum of an mds with finite variance. To see this,

$E_{t-1}[(\varepsilon_t^2 - \sigma^2)Y_{t-k}^2] = Y_{t-k}^2\, E_{t-1}(\varepsilon_t^2 - \sigma^2) = 0$

and

$E[(\varepsilon_t^2 - \sigma^2)^2 Y_{t-k}^4] = E[(\varepsilon_t^2 - \sigma^2)^2]\,E(Y_t^4) = \left(E(\varepsilon_t^4) - \sigma^4\right)E(Y_t^4) < \infty$.

Then $(1/n)\sum_{t=1}^n (\varepsilon_t^2 - \sigma^2)Y_{t-k}^2 \to_p 0$ (Example 7). By (13), we have

$(1/n)\sum_{t=1}^n Y_{t-k}^2 \to_p E(Y_t^2)$.

Therefore,

$(1/n)\sum_{t=1}^n X_t^2 \to_p \sigma^2 E(Y_t^2) = E(X_t^2)$,

so all three conditions of Proposition 18 hold, and

$\frac{1}{\sqrt n}\sum_{t=1}^n X_t \to_d N(0, E(X_t^2)) = N\!\left(0,\; \sigma^4\sum_{j=0}^\infty \phi_j^2\right)$.
2.8 Central Limit Theorem for Linear Processes

Consider a stationary linear process $X_t = \mu + \sum_{j=0}^\infty c_j\varepsilon_{t-j}$. We want to show that

$\sqrt n(\bar X_n - \mu) \to_d N\!\left(0, \sum_{h=-\infty}^\infty \gamma(h)\right)$.

To prove the result, we can use a tool known as the BN (Beveridge-Nelson) decomposition together with the Phillips-Solo device. Let

$u_t = C(L)\varepsilon_t = \sum_{j=0}^\infty c_j \varepsilon_{t-j}$,   (15)

where $C(L) = \sum_{j=0}^\infty c_j L^j$ and $C(1) = \sum_{j=0}^\infty c_j$. Define $\tilde C(L) = \sum_{j=0}^\infty \tilde c_j L^j$ with $\tilde c_j = \sum_{k=j+1}^\infty c_k$. Since we assume that $\sum_{j=0}^\infty j|c_j| < \infty$, we have $\sum_{j=0}^\infty |\tilde c_j| < \infty$. When $C(1) > 0$ (the assumption ensures that $C(1) < \infty$), we can rewrite $u_t$ as

$u_t = \left(C(1) + (L-1)\tilde C(L)\right)\varepsilon_t = C(1)\varepsilon_t - \tilde C(L)(\varepsilon_t - \varepsilon_{t-1}) = C(1)\varepsilon_t - (\tilde u_t - \tilde u_{t-1})$,

where $\tilde u_t = \tilde C(L)\varepsilon_t$.

For example, for the MA(1) process,

$u_t = \varepsilon_t + \theta\varepsilon_{t-1} = (1+\theta)\varepsilon_t - \theta(\varepsilon_t - \varepsilon_{t-1})$.

Hence for this process $C(1) = 1+\theta$, $\tilde c_0 = \theta$, and $\tilde c_j = 0$ for $j > 0$. Note that the variance of $u_t$ is $\gamma_0 = (1+\theta^2)\sigma^2$, while the long-run variance of $u_t$ is $\lambda^2 = \sigma^2 C(1)^2 = (1+\theta)^2\sigma^2$.

Note that since $\tilde c_j$ is absolutely summable, $\tilde u_n - \tilde u_0$ is bounded in probability; hence

$\frac{1}{\sqrt n}\sum_{t=1}^n u_t = C(1)\frac{1}{\sqrt n}\sum_{t=1}^n \varepsilon_t + o_p(1) \to_d N(0, C(1)^2\sigma^2)$.   (16)

You can verify that $\sum_{h=-\infty}^\infty \gamma_u(h) = \sigma^2\left(\sum_{j=0}^\infty c_j\right)^2 = \sigma^2 C(1)^2$.

This result also applies when $\varepsilon_t$ is a martingale difference sequence satisfying certain moment conditions (Phillips and Solo 1992).
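A small simulation can illustrate (16) for the MA(1) example above (the parameter value $\theta = 0.5$ is an assumption for the sketch): the variance of $n^{-1/2}\sum u_t$ should be near the long-run variance $(1+\theta)^2\sigma^2 = 2.25$, not the ordinary variance $(1+\theta^2)\sigma^2 = 1.25$:

```python
import numpy as np

# Illustration of (16) (assumed theta = 0.5, sigma = 1): for the MA(1)
# u_t = e_t + theta*e_{t-1}, the long-run variance is C(1)^2 sigma^2
# = (1+theta)^2, while the ordinary variance is (1+theta^2) sigma^2.
rng = np.random.default_rng(0)
theta, reps, n = 0.5, 2000, 1000
stats = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(n + 1)
    u = e[1:] + theta * e[:-1]
    stats[r] = np.sqrt(n) * u.mean()   # n^{-1/2} sum u_t
print(stats.var(), (1 + theta) ** 2)   # both should be near 2.25
```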
Readings: Hamilton (Ch. 7) Davidson (Part IV and Part V)
and if $A_n \in \mathcal{F}$ for all $n \in \mathbb{N}$, then $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$.

So a $\sigma$-field is closed under the operations of complementation and countable unions and intersections. The smallest $\sigma$-field for a set $X$ is $\{X, \emptyset\}$. Let $A$ be a subset of $X$; the smallest $\sigma$-field that contains $A$ is $\{X, A, A^c, \emptyset\}$. So given any set or collection of sets, we can write down the smallest $\sigma$-field that contains it. Let $\mathcal{C}$ denote a collection of sets; then the smallest $\sigma$-field containing $\mathcal{C}$ is called the $\sigma$-field generated by $\mathcal{C}$.

A measure is a nonnegative countably additive set function; it associates a real number with a set.

Definition 11 (Measure) Given a class $\mathcal{F}$ of subsets of a set $\Omega$, a measure $\mu: \mathcal{F} \to \mathbb{R}$ is a function satisfying

(a) $\mu(A) \ge 0$ for all $A \in \mathcal{F}$;
(b) $\mu(\emptyset) = 0$;
(c) for a countable collection $\{A_j \in \mathcal{F}, j \in \mathbb{N}\}$ with $A_j \cap A_l = \emptyset$ for $j \ne l$ and $\bigcup_j A_j \in \mathcal{F}$,

$\mu\!\left(\bigcup_j A_j\right) = \sum_j \mu(A_j)$.

A probability measure $P$ on $(\Omega, \mathcal{F})$ is a measure with total mass one:

(a) $P(A) \ge 0$ for all $A \in \mathcal{F}$;
(b) $P(\Omega) = 1$;
(c) countable additivity: for a disjoint collection $\{A_j \in \mathcal{F}, j \in \mathbb{N}\}$,

$P\!\left(\bigcup_j A_j\right) = \sum_j P(A_j)$.

We can define a random variable on a probability space. If the mapping $X: \Omega \to \mathbb{R}$ is $\mathcal{F}$-measurable, then $X$ is a real-valued random variable on $\Omega$. For example, if $\Omega$ is a discrete probability space, as in our example of tossing a coin, then any function $X: \Omega \to \mathbb{R}$ is a random variable.

Let $(\Omega, \mathcal{F}, P)$ be a probability space. The transformation $T: \Omega \to \Omega$ is measure-preserving if it is measurable and $P(A) = P(TA)$ for all $A \in \mathcal{F}$. A shift transformation $T$ for a sequence $\{X_t(\omega)\}$ is defined by $X_t(T\omega) = X_{t+1}(\omega)$, so a shift transformation works like a lag operator. If the shift transformation $T$ is measure-preserving, then the sequences $\{X_t\}_{t=1}^\infty$ and $\{X_{t+k}\}_{t=1}^\infty$ have the same joint distribution for every $k > 0$. Therefore, when the shift transformation $T$ is measure-preserving, the process is strictly stationary.
In lecture 2, we introduced stationary linear time series models. In that lecture, we discussed the data generating processes and their characteristics, assuming that all parameters (autoregressive or moving average coefficients) are known. However, in empirical studies we have to specify an econometric model, estimate it, and draw inferences based on the estimates. In this lecture, we provide an introduction to parametric estimation of a linear model with time series observations. Three commonly used estimation methods are least squares estimation (LS), maximum likelihood estimation (MLE), and the generalized method of moments (GMM). In this lecture, we discuss LS and MLE.

Least squares estimation is one of the first techniques we learn in econometrics. It is both intuitive and easy to implement, and the famous Gauss-Markov theorem tells us that under certain assumptions the ordinary least squares (OLS) estimator is the best linear unbiased estimator (BLUE). We start with a review of classical LS estimation and then consider estimation under progressively relaxed assumptions.
Below are our notations in this lecture and the basic algebra of LS estimation. Consider the regression

$y_t = x_t'\beta_0 + u_t, \quad t = 1, \ldots, n$   (1)

where $x_t$ is a $k \times 1$ vector and $\beta_0$ is the true parameter vector. The OLS estimator of $\beta_0$, denoted by $\hat\beta_n$, is

$\hat\beta_n = \left[\sum_{t=1}^n x_t x_t'\right]^{-1}\left[\sum_{t=1}^n x_t y_t\right]$,   (2)

with residuals $\hat u_t = y_t - x_t'\hat\beta_n$.

Sometimes it is more convenient to write the model in matrix form. Let $Y_n = (y_1, y_2, \ldots, y_n)'$, and stack $X_n$ and $U_n$ similarly; then

$Y_n = X_n\beta_0 + U_n$,   (3)

and

$\hat\beta_n = (X_n'X_n)^{-1}X_n'Y_n$.   (4)

Define

$M_X = I_n - X_n(X_n'X_n)^{-1}X_n'$.

Substituting (3) into (4),

$\hat\beta_n = (X_n'X_n)^{-1}X_n'(X_n\beta_0 + U_n) = \beta_0 + (X_n'X_n)^{-1}X_n'U_n$.   (5)

1.1 Case 1: Deterministic Regressors and i.i.d. Gaussian Errors

Assumption 1 (a) $x_t$ is deterministic; (b) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.

Under assumption 1, the OLS estimator is unbiased,

$E(\hat\beta_n) = \beta_0 + (X_n'X_n)^{-1}X_n'E(U_n) = \beta_0$,

and

$E[(\hat\beta_n - \beta_0)(\hat\beta_n - \beta_0)'] = E[(X_n'X_n)^{-1}X_n'U_nU_n'X_n(X_n'X_n)^{-1}] = \sigma^2(X_n'X_n)^{-1}$,

so $\hat\beta_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$.

Under these assumptions, the Gauss-Markov theorem tells us that the OLS estimator $\hat\beta_n$ is the best linear unbiased estimator of $\beta_0$. The OLS estimator of $\sigma^2$ is

$s_n^2 = \hat U_n'\hat U_n/(n-k) = U_n'M_X M_X U_n/(n-k) = U_n'M_X U_n/(n-k)$.   (6)
To derive the distribution of $s_n^2$, write $M_X = P\Lambda P'$ with $P'P = I_n$, where $\Lambda$ is an $n \times n$ matrix with the eigenvalues of $M_X$ along the principal diagonal and zeros elsewhere. From the properties of $M_X$ we can compute that $\Lambda$ contains $k$ zeros and $n-k$ ones along its principal diagonal. Then

$RSS = U_n'M_XU_n = U_n'P\Lambda P'U_n = (P'U_n)'\Lambda(P'U_n) = W_n'\Lambda W_n = \sum_{t=1}^n \lambda_t w_t^2$,

where $W_n = P'U_n \sim N(0, \sigma^2 I_n)$, so

$E(RSS) = \sum_{t=1}^n \lambda_t E(w_t^2) = (n-k)\sigma^2$.

Note that here $\hat\beta_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$ is exactly normal, while many of the estimators in our later discussion are only asymptotically normal. Actually, under assumption 1 the OLS estimator is optimal. Also, with the Gaussian assumption, $w_t$ is i.i.d. $N(0, \sigma^2)$. Therefore we have

$U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k)$.
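A minimal numerical sketch of case 1 (the data here are hypothetical: a constant and a deterministic trend as regressors), computing $\hat\beta_n$ from (4) and $s_n^2$ from (6):

```python
import numpy as np

# Sketch of case 1: deterministic regressors (constant and trend) and
# Gaussian errors; beta_hat from (4), s_n^2 from (6). All values assumed.
rng = np.random.default_rng(0)
n, sigma = 200, 1.0
X = np.column_stack([np.ones(n), np.arange(1, n + 1)])   # x_t = (1, t)'
beta0 = np.array([2.0, 0.1])
y = X @ beta0 + sigma * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)             # (X'X)^{-1} X'y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])                    # unbiased estimate of sigma^2
print(beta_hat, s2)
```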
1.2 Case 2: Stochastic Regressors Independent of the Errors

The assumption of deterministic regressors is very strong for empirical studies in economics. Examples of deterministic regressors are constants and deterministic trends (i.e., $x_t = (1, t, t^2, \ldots)'$). However, most data we have for econometric regressions are stochastic. Therefore, from this subsection on, we allow the regressors to be stochastic. However, in cases 2 and 3 we assume that $x_t$ is independent of the errors (all leads and lags). This is still too strong for time series, as it rules out many processes, including ARMA models.

Assumption 2 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d. $N(0, \sigma^2)$.

This assumption can be written equivalently as $U_n|X_n \sim N(0, \sigma^2 I_n)$. Under these assumptions, $\hat\beta_n$ is still unbiased:

$E(\hat\beta_n) = \beta_0 + E[(X_n'X_n)^{-1}X_n']E(U_n) = \beta_0$.

Conditional on $X_n$, $\hat\beta_n$ is normal: $\hat\beta_n|X_n \sim N(\beta_0, \sigma^2(X_n'X_n)^{-1})$. To get the unconditional distribution of $\hat\beta_n$, we have to integrate this conditional density over $X$; the unconditional distribution of $\hat\beta_n$ will therefore depend on the distribution of $X$. However, we still have the unconditional distribution of the variance estimate: $U_n'M_XU_n/\sigma^2 \sim \chi^2(n-k)$.
1.3 Case 3: Stochastic Regressors with Non-Gaussian i.i.d. Errors

Compared with case 2, in this section we let the error terms follow an arbitrary i.i.d. distribution with finite fourth moments. Since this is an arbitrary unknown distribution, it is very hard to obtain the exact (finite sample) distribution of $\hat\beta_n$; instead, we apply asymptotic theory to this problem.

Assumption 3 (a) $x_t$ is stochastic and independent of $u_s$ for all $t, s$; (b) $u_t \sim$ i.i.d.$(0, \sigma^2)$ and $E(u_t^4) = \mu_4 < \infty$; (c) $E(x_tx_t') = Q_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n Q_t \to Q$, a positive definite matrix; (d) $E(x_{it}x_{jt}x_{kt}x_{lt}) < \infty$ for all $i, j, k, l$ and $t$; (e) $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$.

From (5),

$\hat\beta_n - \beta_0 = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/n)\sum_{t=1}^n x_tu_t\right]$.

By assumption (e), $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$. Since $x_tu_t$ is a martingale difference sequence with finite variance, by the LLN for mixingales we have

$(1/n)\sum_{t=1}^n x_tu_t \to_p 0$,

so $\hat\beta_n \to_p \beta_0$. Therefore,

$\sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right] \to_d N(0, Q^{-1}(\sigma^2 Q)Q^{-1}) = N(0, \sigma^2 Q^{-1})$,

so $\hat\beta_n$ follows, approximately,

$\hat\beta_n \approx N(\beta_0, \sigma^2Q^{-1}/n)$.

Note that this distribution is not exact but approximate, so we should read it as "approximately distributed as normal."

Next consider estimation of $\sigma^2$. Write

$u_t^2 = [y_t - x_t'\hat\beta_n + x_t'(\hat\beta_n - \beta_0)]^2 = (y_t - x_t'\hat\beta_n)^2 + 2(y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) + [x_t'(\hat\beta_n - \beta_0)]^2$.

By the LLN, we have $(1/n)\sum_{t=1}^n u_t^2 \to_p \sigma^2$. There are three terms in the above equation. For the second term we have

$(1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)x_t'(\hat\beta_n - \beta_0) = 0$,

since the OLS residuals are orthogonal to the regressors by construction, $\sum_{t=1}^n x_t(y_t - x_t'\hat\beta_n) = 0$. Define

$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n (y_t - x_t'\hat\beta_n)^2$;

then

$\hat\sigma_n^2 = (1/n)\sum_{t=1}^n u_t^2 - (1/n)\sum_{t=1}^n [x_t'(\hat\beta_n - \beta_0)]^2$,

and

$\sqrt n(\hat\sigma_n^2 - \sigma^2) = (1/\sqrt n)\sum_{t=1}^n (u_t^2 - \sigma^2) - \sqrt n(\hat\beta_n - \beta_0)'\left[(1/n)\sum_{t=1}^n x_tx_t'\right](\hat\beta_n - \beta_0)$.

The second term goes to zero, as $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$, $\sqrt n(\hat\beta_n - \beta_0) = O_p(1)$, and $\hat\beta_n - \beta_0 \to_p 0$. Define $z_t = u_t^2 - \sigma^2$; then $z_t$ is i.i.d. with mean zero and variance $E(u_t^4) - \sigma^4 = \mu_4 - \sigma^4$. Applying the CLT, we have

$(1/\sqrt n)\sum_{t=1}^n z_t \to_d N(0, \mu_4 - \sigma^4)$,

therefore

$\sqrt n(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4)$.

The same limit distribution applies to $s_n^2 = \hat\sigma_n^2\,n/(n-k)$, since $(n-k)/n \to 1$ as $n \to \infty$ and the difference between $\hat\sigma_n^2$ and $s_n^2$ is $o_p(n^{-1/2})$.
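The limit $\sqrt n(\hat\sigma_n^2 - \sigma^2) \to_d N(0, \mu_4 - \sigma^4)$ can be checked by simulation (the error distribution here, a centered exponential with $\sigma^2 = 1$ and $\mu_4 = 9$, is an assumption for the sketch):

```python
import numpy as np

# Sketch of sqrt(n)(sigma_hat^2 - sigma^2) ->d N(0, mu4 - sigma^4), using
# centered Exp(1) errors (sigma^2 = 1, mu4 = 9, so the limit variance is 8).
rng = np.random.default_rng(0)
reps, n = 4000, 2000
stats = np.empty(reps)
for r in range(reps):
    u = rng.exponential(1.0, n) - 1.0
    stats[r] = np.sqrt(n) * (np.mean(u ** 2) - 1.0)
print(stats.var())    # should be near mu4 - sigma^4 = 8
```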
1.4 Case 4: Autoregression

In an autoregression, say $x_t = \phi_0 x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is i.i.d., the regressors are no longer independent of all leads and lags of the errors. In this case, the OLS estimator of $\phi_0$ is biased. However, we will show that under assumption 4 the estimator is consistent.

Assumption 4 The regression model is

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$,

with the roots of $(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p) = 0$ outside the unit circle (so $y_t$ is stationary), and $\varepsilon_t$ i.i.d. with mean zero, variance $\sigma^2$, and finite fourth moments.

Pages 215-216 in Hamilton present the general AR(p) case with constant. We use an AR(2) as an example: $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$. Let $x_t' = (y_{t-1}, y_{t-2})$, $u_t = \varepsilon_t$, and $y_t = x_t'\beta_0 + u_t$ (so $\beta_0 = (\phi_1, \phi_2)'$). Then

$\sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right]$,   (7)

where

$(1/n)\sum_{t=1}^n x_tx_t' = \begin{bmatrix} (1/n)\sum_t y_{t-1}^2 & (1/n)\sum_t y_{t-1}y_{t-2} \\ (1/n)\sum_t y_{t-1}y_{t-2} & (1/n)\sum_t y_{t-2}^2 \end{bmatrix}$.

By (13), the sample second moments $(1/n)\sum_t y_{t-j}^2$ and $(1/n)\sum_t y_{t-1}y_{t-2}$ converge to the corresponding population moments, so

$(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$.

Since $x_tu_t$ is an mds with finite variance, $(1/\sqrt n)\sum_{t=1}^n x_tu_t \to_d N(0, \sigma^2 Q)$; therefore

$\sqrt n(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2 Q^{-1})$.

So far we have considered four cases of OLS regression. The common assumption in all four cases is i.i.d. errors. From the next section on, we consider cases where the errors are not i.i.d.
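A sketch of case 4 (the AR(2) coefficients are assumptions for the illustration): OLS on a stationary autoregression is biased in finite samples but consistent:

```python
import numpy as np

# Sketch of case 4: OLS on a stationary AR(2) y_t = 0.5 y_{t-1} - 0.3 y_{t-2} + e_t
# (assumed coefficients) is consistent for (phi_1, phi_2).
rng = np.random.default_rng(0)
phi = np.array([0.5, -0.3])
n = 50_000
y = np.zeros(n)
e = rng.standard_normal(n)
for t in range(2, n):
    y[t] = phi[0] * y[t - 1] + phi[1] * y[t - 2] + e[t]

X = np.column_stack([y[1:-1], y[:-2]])   # x_t' = (y_{t-1}, y_{t-2})
phi_hat = np.linalg.solve(X.T @ X, X.T @ y[2:])
print(phi_hat)    # should be close to (0.5, -0.3)
```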
1.5 Heteroskedasticity and Autocorrelation

When the error $u_t$ is i.i.d., the variance-covariance matrix is $V = E(U_nU_n') = \sigma^2 I_n$. If $V$ is still diagonal but the elements are not equal — for example, the errors on some dates display larger variance and on other dates smaller variance — then the errors are said to exhibit heteroskedasticity. If $V$ is non-diagonal, then the errors are said to be autocorrelated. For example, let $u_t = \varepsilon_t - \theta\varepsilon_{t-1}$ where $\varepsilon_t$ is i.i.d.; then $u_t$ is a serially correlated error.

Case 5 in Hamilton assumes:

Assumption 5 (a) $x_t$ is stochastic; (b) conditional on the full matrix $X$, the vector $U \sim N(0, \sigma^2 V)$; (c) $V$ is a known positive definite matrix.

Under these assumptions, the exact distribution of $\hat\beta_n$ can be derived. However, this is a very strong assumption and it rules out autoregressive regressions. Also, the assumption that $V$ is known rarely holds in applications.

Case 6 in Hamilton assumes uncorrelated but heteroskedastic errors with an unknown covariance matrix. Under assumption 6, the OLS estimator is still consistent and asymptotically normal.

Assumption 6 (a) $x_t$ is stochastic, including perhaps lagged values of $y$; (b) $x_tu_t$ is a martingale difference sequence; (c) $E(u_t^2x_tx_t') = \Omega_t$, a positive definite matrix, with $(1/n)\sum_{t=1}^n \Omega_t \to_p \Omega$ and $(1/n)\sum_{t=1}^n u_t^2x_tx_t' \to_p \Omega$; (d) $E(u_t^4x_{it}x_{jt}x_{lt}x_{kt}) < \infty$ for all $i, j, k, l$ and $t$; (e) the plims of $(1/n)\sum_{t=1}^n u_tx_{it}x_tx_t'$ and $(1/n)\sum_{t=1}^n x_{it}x_{jt}x_tx_t'$ exist and are finite for all $i, j$, and $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$, a nonsingular matrix.
Again, write the OLS estimator as

$\sqrt n(\hat\beta_n - \beta_0) = \left[(1/n)\sum_{t=1}^n x_tx_t'\right]^{-1}\left[(1/\sqrt n)\sum_{t=1}^n x_tu_t\right]$,

where $(1/n)\sum_{t=1}^n x_tx_t' \to_p Q$ and $(1/\sqrt n)\sum_{t=1}^n x_tu_t \to_d N(0, \Omega)$; therefore

$\sqrt n(\hat\beta_n - \beta_0) \to_d N(0, Q^{-1}\Omega Q^{-1})$.

To use this result for inference, we need consistent estimates of $Q$ and $\Omega$: $\hat Q_n = (1/n)\sum_{t=1}^n x_tx_t'$ and $\hat\Omega_n = (1/n)\sum_{t=1}^n \hat u_t^2 x_tx_t'$, where $\hat u_t$ is the OLS residual $y_t - x_t'\hat\beta_n$.

Proposition 1 With heteroskedasticity of unknown form satisfying assumption 6, the asymptotic variance-covariance matrix of the OLS coefficient vector can be consistently estimated by $\hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1} \to_p Q^{-1}\Omega Q^{-1}$, where

$\hat\Omega_n = (1/n)\sum_{t=1}^n \hat u_t^2x_tx_t' \to_p \Omega$.   (8)

The trick here is to make use of the known fact that $\hat\beta_n - \beta_0 \to_p 0$. Writing $\hat u_t = u_t - (\hat\beta_n - \beta_0)'x_t$, we have

$\hat u_t^2 = u_t^2 - 2u_t(\hat\beta_n - \beta_0)'x_t + [(\hat\beta_n - \beta_0)'x_t]^2$.

Then

$\hat\Omega_n - (1/n)\sum_{t=1}^n u_t^2x_tx_t' = (-2/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t\,(x_tx_t') + (1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t')$.

For the first term,

$(1/n)\sum_{t=1}^n u_t(\hat\beta_n - \beta_0)'x_t\,(x_tx_t') = \sum_{i=1}^k (\hat\beta_{in} - \beta_{i0})\left[(1/n)\sum_{t=1}^n u_tx_{it}\,x_tx_t'\right]$.

The term in brackets has a finite plim by assumption 6(e), and $\hat\beta_{in} - \beta_{i0} \to_p 0$ for each $i$, so this term converges to zero. (If this looks messy, take $k = 1$; then you can simply move $(\hat\beta_n - \beta_0)$ out of the summation: $\hat\beta_n - \beta_0 \to_p 0$ and the sum has a finite plim, so the product goes to zero.)

Similarly, for the second term,

$(1/n)\sum_{t=1}^n [(\hat\beta_n - \beta_0)'x_t]^2(x_tx_t') = \sum_{i=1}^k\sum_{j=1}^k (\hat\beta_{in}-\beta_{i0})(\hat\beta_{jn}-\beta_{j0})\left[(1/n)\sum_{t=1}^n x_{it}x_{jt}\,x_tx_t'\right] \to_p 0$,

as the term in brackets has a finite plim. Therefore $\hat\Omega_n - (1/n)\sum_{t=1}^n u_t^2x_tx_t' \to_p 0$, and since $(1/n)\sum_{t=1}^n u_t^2x_tx_t' \to_p \Omega$ by assumption 6(c), $\hat\Omega_n \to_p \Omega$.

Define $\hat V_n = \hat Q_n^{-1}\hat\Omega_n\hat Q_n^{-1}$; then, approximately, $\hat\beta_n \approx N(\beta_0, \hat V_n/n)$ with

$\hat V_n/n = (X_n'X_n)^{-1}\left[\sum_{t=1}^n \hat u_t^2x_tx_t'\right](X_n'X_n)^{-1}$.

These are White's heteroskedasticity-consistent standard errors. When the errors are also serially correlated, a Newey-West type estimator with truncation lag $q$ can be used:

$\hat V_n/n = (X_n'X_n)^{-1}\left[\sum_{t=1}^n \hat u_t^2x_tx_t' + \sum_{k=1}^q\left(1 - \frac{k}{q+1}\right)\sum_{t=k+1}^n \left(x_t\hat u_t\hat u_{t-k}x_{t-k}' + x_{t-k}\hat u_{t-k}\hat u_tx_t'\right)\right](X_n'X_n)^{-1}$.
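A minimal sketch of Proposition 1 on simulated data (the design, with error variance depending on the regressor, is hypothetical): White's estimator $(X'X)^{-1}[\sum_t\hat u_t^2x_tx_t'](X'X)^{-1}$:

```python
import numpy as np

# Sketch of White's heteroskedasticity-consistent covariance estimate on
# hypothetical data where Var(u_t | x_t) = 0.5 + x_t^2 (an assumed design).
rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
u = rng.standard_normal(n) * np.sqrt(0.5 + x ** 2)   # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * u_hat[:, None] ** 2).T @ X               # sum of u_hat_t^2 x_t x_t'
V_white = XtX_inv @ meat @ XtX_inv                   # estimate of Var(beta_hat)
print(np.sqrt(np.diag(V_white)))                     # robust standard errors
```

Here the theoretical asymptotic variance of the slope is $Q^{-1}\Omega Q^{-1}/n$ with $Q = I_2$ and $\Omega_{22} = E[(0.5 + x^2)x^2] = 3.5$, so the robust slope standard error should be near $\sqrt{3.5/n}$.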
1.6 Generalized Least Squares

Generalized least squares (GLS) and feasible generalized least squares (FGLS) are preferred in least squares estimation when the errors are heteroskedastic and/or autocorrelated.

Let $x_t$ be stochastic and $U|X \sim N(0, \sigma^2 V)$ where $V$ is known (assumption 5). Since $V$ is symmetric and positive definite, there exists a matrix $L$ such that $V^{-1} = L'L$. Premultiplying the regression by $L$ gives

$LY = LX\beta_0 + LU$.

The new error $\tilde U = LU$ is i.i.d. conditional on $X$:

$E(\tilde U\tilde U'|X) = LE(UU'|X)L' = \sigma^2 LVL' = \sigma^2 I_n$.

Then the estimator

$\hat\beta_n = (X'L'LX)^{-1}X'L'LY = (X'V^{-1}X)^{-1}X'V^{-1}Y$

is the GLS estimator.

When $V$ is unknown but has a known structure, we can estimate it and apply FGLS. For example, suppose the errors follow an AR(1) process, $u_t = \rho_0 u_{t-1} + \varepsilon_t$. A natural estimator of $\rho_0$ regresses the OLS residual $\hat u_t$ on $\hat u_{t-1}$. Since $\hat u_t = y_t - \hat\beta_n'x_t = u_t + (\beta_0 - \hat\beta_n)'x_t$,

$\frac{1}{n}\sum_t \hat u_t\hat u_{t-1} = \frac{1}{n}\sum_t [u_t + (\beta_0-\hat\beta_n)'x_t][u_{t-1} + (\beta_0-\hat\beta_n)'x_{t-1}] = \frac{1}{n}\sum_t u_tu_{t-1} + o_p(1)$,

because $\hat\beta_n - \beta_0 \to_p 0$ and the cross-product sums have finite plims. Similarly,

$\frac{1}{n}\sum_t \hat u_{t-1}^2 = \frac{1}{n}\sum_t u_{t-1}^2 + o_p(1) \to_p \mathrm{var}(u_t)$,

and we can show that

$\frac{1}{\sqrt n}\sum_t \hat u_t\hat u_{t-1} = \frac{1}{\sqrt n}\sum_t u_tu_{t-1} + o_p(1)$.

Hence

$\sqrt n(\hat\rho_n - \rho_0) \to_d N(0, 1 - \rho_0^2)$.
1.7 Test Statistics

Some commonly used test statistics for LS estimators are the $t$ statistic and the $F$ statistic. The $t$ statistic is used to test a hypothesis about a single parameter, say $\beta_i = c$. For simplicity, we take $c = 0$, so the $t$ statistic tests whether a variable is significant. The $t$ statistic is the ratio of $\hat\beta_i$ to its standard error, the product of $s$ and the square root of the $i$th diagonal element $w_{ii}$ of $(X_n'X_n)^{-1}$:

$t = \frac{\hat\beta_i}{\sqrt{s^2}\sqrt{w_{ii}}}$.   (9)

Recall that if $X \sim N(0,1)$, $Y \sim \chi^2(m)$, and $X$ and $Y$ are independent, then

$t = \frac{X}{\sqrt{Y/m}}$

follows an exact Student $t$ distribution with $m$ degrees of freedom.

The $F$ statistic is used to test a hypothesis of $m$ different linear restrictions on $\beta$, say

$H_0: R\beta = r$,

where $R$ is an $m \times k$ matrix. The $F$ statistic is then built from

$W = (R\hat\beta - r)'[\mathrm{Var}(R\hat\beta - r)]^{-1}(R\hat\beta - r)$.   (10)

This is a Wald statistic. To derive the distribution of the statistic, we need the following result.

Proposition 2 If a $k \times 1$ vector $X \sim N(\mu, \Omega)$, then $(X-\mu)'\Omega^{-1}(X-\mu) \sim \chi^2(k)$.

Also recall that $F(m, n) = \dfrac{\chi^2(m)/m}{\chi^2(n)/n}$ with independent numerator and denominator.

With assumption 1, $\hat\beta_i - \beta_{i0} \sim N(0, \sigma^2 w_{ii})$. We can then write

$t = \frac{\hat\beta_i/(\sigma\sqrt{w_{ii}})}{\sqrt{s^2/\sigma^2}}$.

Since the numerator is $N(0,1)$, the denominator is the square root of a $\chi^2(n-k)$ variable divided by $n-k$ (since $RSS/\sigma^2 \sim \chi^2(n-k)$), and the numerator and denominator are independent, the $t$ statistic (9) under assumption 1 follows an exact $t$ distribution with $n-k$ degrees of freedom.

With assumption 1 and under the null hypothesis, we have

$R\hat\beta - r \sim N(0, \sigma^2 R(X_n'X_n)^{-1}R')$,

so

$(R\hat\beta - r)'[\sigma^2 R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r) \sim \chi^2(m)$.

If we replace $\sigma^2$ with $s^2$ and divide by the number of restrictions $m$, we get the OLS $F$ test of a linear hypothesis:

$F = (R\hat\beta - r)'[s^2R(X_n'X_n)^{-1}R']^{-1}(R\hat\beta - r)/m$,

which under assumption 1 follows an exact $F(m, n-k)$ distribution.

In the large-sample cases, $\sqrt n(R\hat\beta_n - r) = R\sqrt n(\hat\beta_n - \beta_0) \to_d N(0, \sigma^2RQ^{-1}R')$, and the Wald statistic converges to $\chi^2(m)$.

We can use similar methods to derive the distributions for the other cases. In general, if $\hat\beta \to_p \beta_0$ and is asymptotically normal, $s_n^2 \to_p \sigma^2$, and we have found a consistent estimate of the variance of $\hat\beta$, then the $t$ and $F$ statistics follow asymptotically normal and $\chi^2(m)$ distributions, respectively. Actually, under assumption 1 or 2, when the sample size is large we can also use the normal and $\chi^2$ distributions to approximate the exact $t$ and $F$ distributions. Further, since we are using the asymptotic distribution, the Wald test can also be used to test nonlinear restrictions.
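A minimal sketch of the Wald/$F$ statistic (10) on hypothetical data generated under the null $\beta_2 = 0$; with a single restriction it should equal the squared $t$ statistic:

```python
import numpy as np

# Sketch of the F statistic for H0: R beta = r with one restriction
# (beta_2 = 0), on hypothetical data generated with the null true.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.standard_normal(n)

k = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)
R, r = np.array([[0.0, 0.0, 1.0]]), np.array([0.0])   # test beta_2 = 0
m = R.shape[0]
cov_Rb = s2 * R @ np.linalg.inv(X.T @ X) @ R.T        # Var(R beta_hat - r)
F = (R @ beta_hat - r) @ np.linalg.solve(cov_Rb, R @ beta_hat - r) / m
print(F)
```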
2.1 The Maximum Likelihood Principle

The basic idea of the maximum likelihood principle is to choose the parameter estimates that maximize the probability of obtaining the observed sample. Suppose we observe a sample $X_n = (x_1, x_2, \ldots, x_n)$, assume the observations are i.i.d., and denote the associated parameters by $\theta$. Let $p(x_t;\theta)$ denote the pdf of the $t$th observation. For example, when $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, then $\theta = (\mu, \sigma^2)$ and

$p(x_t;\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_t-\mu)^2}{2\sigma^2}\right)$.

The likelihood function for the whole sample $X_n$ is

$L(X_n;\theta) = \prod_{t=1}^n p(x_t;\theta)$,

and the log likelihood is

$l(X_n;\theta) = \sum_{t=1}^n \log p(x_t;\theta)$.

The maximum likelihood estimates of $\theta$ are chosen so that $l(X_n;\theta)$ is maximized. Define the score function $S(\theta) = \partial l(\theta)/\partial\theta$ and the Hessian matrix $H(\theta) = \partial^2 l(\theta)/\partial\theta\partial\theta'$. The famous Cramer-Rao inequality tells us that the lower bound for the variance of an unbiased estimator of $\theta$ is the inverse of the information matrix $I(\theta_0) = E[S(\theta_0)S(\theta_0)']$, where $\theta_0$ denotes the true value of the parameter. An estimator whose variance equals this bound is known as efficient. Under some regularity conditions, which are satisfied for the Gaussian density, we have the equality

$I(\theta) = -E[H(\theta)] = -E\left[\frac{\partial^2 l(\theta)}{\partial\theta\partial\theta'}\right]$.

So if we find an unbiased estimator whose variance achieves the Cramer-Rao lower bound, then this estimator is efficient and no other unbiased estimator (linear or nonlinear) can have smaller variance. However, this lower bound is not always achievable. If an estimator does achieve this bound, then it is identical to the MLE. Note that the Cramer-Rao inequality holds for unbiased estimators, while ML estimators are sometimes biased. If an estimator is biased but consistent, and its variance approaches the Cramer-Rao bound asymptotically, it is known as asymptotically efficient.

Example 1 (MLE for the i.i.d. Gaussian distribution) Let $x_t \sim$ i.i.d. $N(\mu, \sigma^2)$, so the parameter is $\theta = (\mu, \sigma^2)$. Then we have

$p(x_t;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_t-\mu)^2}{2\sigma^2}\right)$,

$l(X_n;\theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^n (x_t-\mu)^2$.

The score functions are

$\frac{\partial l(X_n;\theta)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{t=1}^n (x_t-\mu)$,

$\frac{\partial l(X_n;\theta)}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^n (x_t-\mu)^2$.

Setting the score functions to zero, the MLE estimators are $\hat\mu = \bar X_n$ and $\hat\sigma^2 = \frac{1}{n}\sum_{t=1}^n (x_t-\bar X_n)^2$. Since $E(\bar X_n) = \mu$, $\hat\mu$ is unbiased. For $\hat\sigma^2$,

$E(x_t-\bar X_n)^2 = E[(x_t-\mu) - (\bar X_n-\mu)]^2 = \sigma^2 - \frac{2}{n}\sigma^2 + \frac{1}{n}\sigma^2 = \frac{n-1}{n}\sigma^2$,

so $\hat\sigma^2$ is biased; the unbiased estimator is $s^2 = \frac{1}{n-1}\sum_{t=1}^n (x_t-\bar X_n)^2$. The Hessian matrix is

$H(X_n;\theta) = \begin{bmatrix} \dfrac{\partial^2 l(X_n;\theta)}{\partial\mu^2} & \dfrac{\partial^2 l(X_n;\theta)}{\partial\mu\partial\sigma^2} \\ \dfrac{\partial^2 l(X_n;\theta)}{\partial\sigma^2\partial\mu} & \dfrac{\partial^2 l(X_n;\theta)}{\partial(\sigma^2)^2} \end{bmatrix}$,

where

$\frac{\partial^2 l(X_n;\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2}, \quad \frac{\partial^2 l(X_n;\theta)}{\partial\mu\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_{t=1}^n (x_t-\mu), \quad \frac{\partial^2 l(X_n;\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{t=1}^n (x_t-\mu)^2$.

Evaluated at the MLE, $\sum_{t=1}^n (x_t-\hat\mu) = 0$ and $\sum_{t=1}^n (x_t-\hat\mu)^2 = n\hat\sigma^2$, so

$|H(X_n;\hat\theta)| = \frac{n^2}{2\hat\sigma^6} > 0$,

and we know that we have found a maximum (not a minimum) of the likelihood function. Next, compute the information matrix, using

$E\left[\sum_{t=1}^n (x_t-\mu)\right] = 0, \quad E\left[\sum_{t=1}^n (x_t-\mu)^2\right] = n\sigma^2$,

which gives

$I(\theta) = -E[H(X_n;\theta)] = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}$.

So the MLE of $\mu$ has achieved the Cramer-Rao lower bound $\sigma^2/n$. Although $s^2$ does not attain the lower bound, it turns out that it is still the unbiased estimator of $\sigma^2$ with minimum variance.
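A quick numerical check of Example 1 (on simulated data with assumed population values): the closed-form estimates $\hat\mu = \bar X_n$ and $\hat\sigma^2 = (1/n)\sum_t(x_t - \bar X_n)^2$ do maximize the log likelihood locally:

```python
import numpy as np

# Check that (mean, (1/n)*sum((x-mean)^2)) maximizes the Gaussian log
# likelihood, as derived from the score equations above. Population values
# (mu = 2, sigma = 1.5) are assumptions for the sketch.
def loglik(x, mu, s2):
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)
mu_hat = x.mean()
s2_hat = np.mean((x - mu_hat) ** 2)   # biased by the factor (n-1)/n

l_max = loglik(x, mu_hat, s2_hat)
# perturbing either parameter lowers the likelihood
for dmu, ds2 in [(0.05, 0), (-0.05, 0), (0, 0.05), (0, -0.05)]:
    assert loglik(x, mu_hat + dmu, s2_hat + ds2) < l_max
print(mu_hat, s2_hat)
```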
2.2 Consistency and Asymptotic Normality of the MLE

There are a few regularity conditions that ensure the MLE is consistent. First, we assume that the data are strictly stationary and ergodic (for example, i.i.d.). Second, we assume that the parameter space $\Theta$ is convex and that neither the estimate $\hat\theta$ nor the true parameter $\theta_0$ lies on the boundary of $\Theta$. Third, we require that the likelihood function evaluated at $\theta$ differs from that at $\theta_0$ for any $\theta \ne \theta_0$ in $\Theta$; this is known as the identification condition. Finally, we assume that $E[\sup_{\theta\in\Theta} |l(X_n;\theta)|] < \infty$. With all these conditions satisfied, the MLE is consistent: $\hat\theta \to_p \theta_0$.

Next we discuss the asymptotic results on the score function $S(X_n;\theta)$, the Hessian matrix $H(X_n;\theta)$, and the MLE itself. Differentiating the identity $\int L(X_n,\theta_0)dX_n = 1$ with respect to $\theta_0$ gives $\int [\partial l(X_n;\theta_0)/\partial\theta]L(X_n,\theta_0)dX_n = 0$, so we know that $E[S(X_n,\theta_0)] = 0$. Next, differentiating this integral (which equals zero) again with respect to $\theta_0$,

$\int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\frac{\partial L(X_n,\theta_0)}{\partial\theta'}dX_n + \int \frac{\partial^2 l(X_n;\theta_0)}{\partial\theta\partial\theta'}L(X_n,\theta_0)dX_n = 0$.

The second term is just $E[H(X_n;\theta_0)]$. The first can be written as

$\int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\frac{1}{L(X_n,\theta_0)}\frac{\partial L(X_n,\theta_0)}{\partial\theta'}L(X_n,\theta_0)dX_n = \int \frac{\partial l(X_n;\theta_0)}{\partial\theta}\frac{\partial l(X_n;\theta_0)}{\partial\theta'}L(X_n,\theta_0)dX_n = E[S(X_n,\theta_0)S(X_n,\theta_0)']$,

which gives the information matrix equality $E[S(X_n,\theta_0)S(X_n,\theta_0)'] = -E[H(X_n,\theta_0)]$.

Then, by the ergodic LLN, we have

$\frac{1}{n}H(X_n;\theta_0) = \frac{1}{n}\sum_{t=1}^n H(x_t;\theta_0) \to_p E[H(x_t;\theta_0)] \equiv -V$,

and, since the scores $S(x_t;\theta_0)$ form an mds, by the CLT

$n^{-1/2}S(X_n;\theta_0) \to_d N(0, V)$.

Proposition 3 (Asymptotic normality of MLE) With all the conditions we have outlined above,

$\sqrt n(\hat\theta - \theta_0) \to_d N(0, V^{-1})$.

Proof: Do a Taylor expansion of $S(X_n;\hat\theta)$ around $\theta_0$,

$0 = S(X_n;\hat\theta) \approx S(X_n;\theta_0) + H(X_n;\theta_0)(\hat\theta - \theta_0)$.

Therefore, we have

$\sqrt n(\hat\theta - \theta_0) = -\left[\frac{1}{n}H(X_n;\theta_0)\right]^{-1}\frac{1}{\sqrt n}S(X_n;\theta_0) \to_d V^{-1}N(0, V) = N(0, V^{-1})$.

Note that $V = -E[\frac{1}{n}H(X_n;\theta_0)] = \frac{1}{n}I(\theta_0)$, so the result can also be written as $\hat\theta \approx N(\theta_0, I(\theta_0)^{-1})$.

However, $I(\theta_0)$ depends on $\theta_0$, which is unknown, so we need a consistent estimator of it, denoted by $\hat V$. There are two methods to compute this variance matrix of $\hat\theta$. One is to compute the Hessian matrix, evaluate it at $\theta = \hat\theta$, and use $\hat V = [-H(X_n;\hat\theta)]^{-1}$. The second is to use the outer product estimate,

$\hat V = \left[\sum_{t=1}^n S(x_t;\hat\theta)S(x_t;\hat\theta)'\right]^{-1}$.
2.3 Likelihood Ratio, Wald, and Lagrange Multiplier Tests

There are three asymptotically equivalent tests for MLE: the likelihood ratio (LR) test, the Wald test, and the Lagrange multiplier (LM) or score test. You can find discussion of these three tests in any graduate econometrics textbook, so we describe them only briefly here.

The likelihood ratio test is based on the difference between the likelihood computed (maximized) with and without the restriction. Let $l_u$ denote the likelihood without the restriction and $l_r$ the likelihood with the restriction (note that $l_r \le l_u$). If the restriction is valid, then we expect $l_r$ not to be too much lower than $l_u$. Therefore, to test whether the restriction is valid, the statistic we compute is $2(l_u - l_r)$, which follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.

To do the LR test, we have to compute the likelihood under both the restricted and the unrestricted models. In comparison, the other two tests use only either the unrestricted estimator (denoted by $\hat\theta$) or the restricted estimator (denoted by $\tilde\theta$).

Let the restriction be $H_0: R(\theta) = r$. The idea of the Wald test is that if this restriction is valid, then $R(\hat\theta)$ should be close to $r$. The Wald statistic is

$W = (R(\hat\theta) - r)'[\mathrm{Var}(R(\hat\theta) - r)]^{-1}(R(\hat\theta) - r)$,

which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.

To find the ML estimator, we set the score function equal to zero and solve for the estimator, i.e., $S(\hat\theta) = 0$. If the restriction is valid, and the estimator obtained under the restriction is $\tilde\theta$, then we expect $S(\tilde\theta)$ to be close to zero. This idea leads to the LM or score test. The LM statistic is

$LM = S(\tilde\theta)'I(\tilde\theta)^{-1}S(\tilde\theta)$,

which also follows a $\chi^2$ distribution with degrees of freedom equal to the number of restrictions imposed.
2.4 LS and MLE

In a regression $Y_n = X_n\beta + U_n$ with $U_n|X_n \sim N(0, \sigma^2 I_n)$, the density of $Y$ given $X$ is

$f(Y|X;\theta) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)\right)$,

so the log likelihood is

$l = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta)$.

Note that the $\hat\beta_n$ that maximizes $l$ is the vector that minimizes the sum of squares; therefore, under assumption 2, the OLS estimator is equivalent to the ML estimator of $\beta_0$. It can be shown that this estimator is unbiased and achieves the Cramer-Rao lower bound, so under assumption 2 the OLS/MLE estimator is efficient (compared with all unbiased estimators, linear or nonlinear). Recall that under assumption 1 we have the Gauss-Markov theorem to show that the OLS estimator is the best linear unbiased estimator; now the Cramer-Rao inequality establishes the optimality of the OLS estimator under assumption 2. The ML estimator of $\sigma^2$ is $(Y - X\hat\beta)'(Y - X\hat\beta)/n$. We introduced this estimator a moment ago and showed that the difference between $\hat\sigma_n^2$ and the OLS estimator $s_n^2$ becomes arbitrarily small as $n \to \infty$.

Next, consider assumption 5, where $U|X \sim N(0, \sigma^2 V)$ and $V$ is known. Then the log likelihood function, omitting the constant term, is

$l(Y|X,\beta) = -\frac{1}{2}\log|V| - \frac{1}{2}(Y - X\beta)'V^{-1}(Y - X\beta)$.

The MLE estimator is

$\hat\beta_n = (X'V^{-1}X)^{-1}X'V^{-1}Y$,

which is equivalent to the GLS estimator. The score vector is $S_n(\beta) = X'V^{-1}(Y - X\beta)$ and the Hessian matrix is $H_n(\beta) = -X'V^{-1}X$. Therefore the information matrix is $I(\beta) = X'V^{-1}X$, and the GLS/MLE estimator is efficient as it achieves the Cramer-Rao lower bound $(X'V^{-1}X)^{-1}$.

When $V$ is unknown, we can parameterize it as $V(\gamma)$, say, and maximize the likelihood

$l(Y|X,\beta,\gamma) = -\frac{1}{2}\log|V(\gamma)| - \frac{1}{2}(Y - X\beta)'V(\gamma)^{-1}(Y - X\beta)$

jointly over $(\beta, \gamma)$.
2.5 MLE for an AR(1) Model

In Hamilton's book, you can find many detailed discussions of MLE estimation for ARMA models in Chapter 5. We will take an AR(1) model as our example.

Consider an AR(1) model,

$x_t = c + \phi x_{t-1} + u_t$,

where $u_t \sim$ i.i.d. $N(0, \sigma^2)$. Let $\theta = (c, \phi, \sigma^2)$ and let the sample size be $n$. There are two ways to construct the likelihood function, and the difference lies in how to treat the initial observation $x_1$. If we let $x_1$ be random, we know that the unconditional distribution of $x_t$ is $N(c/(1-\phi), \sigma^2/(1-\phi^2))$, and this leads to an exact likelihood function. Alternatively, we can take $x_1$ as given (known), and this leads to a conditional likelihood function.

We first consider the exact likelihood function. We know that

$p(x_1;\theta) = \left(2\pi\frac{\sigma^2}{1-\phi^2}\right)^{-1/2}\exp\left(-\frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)}\right)$.

Conditional on $x_1$, the distribution of $x_2$ is $N(c + \phi x_1, \sigma^2)$, so the conditional probability density of the second observation is

$p(x_2|x_1;\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_2 - c - \phi x_1)^2}{2\sigma^2}\right)$.

So the joint probability density of $(x_1, x_2)$ is

$p(x_1, x_2;\theta) = p(x_2|x_1;\theta)\,p(x_1;\theta)$.

In general,

$p(x_t|x_{t-1};\theta) = (2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}\right)$,

and the likelihood of the full sample is $p(x_1;\theta)\prod_{t=2}^n p(x_t|x_{t-1};\theta)$. Taking logs, we get the exact log likelihood (omitting constant terms for simplicity):

$l(X_n;\theta) = -\frac{1}{2}\log\frac{\sigma^2}{1-\phi^2} - \frac{(x_1 - c/(1-\phi))^2}{2\sigma^2/(1-\phi^2)} - \frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}$.   (11)

Next, to construct the conditional likelihood, assume that $x_1$ is given; then the log likelihood (again, constant terms omitted) is

$l(X_n;\theta) = -\frac{n-1}{2}\log(\sigma^2) - \sum_{t=2}^n \frac{(x_t - c - \phi x_{t-1})^2}{2\sigma^2}$.   (12)

The maximum likelihood estimates $\hat c$ and $\hat\phi$ are obtained by maximizing (12), or by solving the score functions. Note that maximizing (12) with respect to $(c, \phi)$ is equivalent to minimizing

$\sum_{t=2}^n (x_t - c - \phi x_{t-1})^2$,

so the conditional MLE of $(c, \phi)$ coincides with the OLS estimator from regressing $x_t$ on a constant and $x_{t-1}$.
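A minimal sketch of the equivalence just noted (the AR(1) parameter values are assumptions for the illustration): the conditional MLE of $(c, \phi)$ is the OLS regression of $x_t$ on $(1, x_{t-1})$, and the MLE of $\sigma^2$ is the mean squared residual:

```python
import numpy as np

# Sketch: conditional MLE of (c, phi) in an AR(1) equals OLS of x_t on
# (1, x_{t-1}). Parameter values (c = 1, phi = 0.6) are assumed.
rng = np.random.default_rng(0)
c, phi, n = 1.0, 0.6, 10_000
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi * x[t - 1] + rng.standard_normal()

Z = np.column_stack([np.ones(n - 1), x[:-1]])
c_hat, phi_hat = np.linalg.lstsq(Z, x[1:], rcond=None)[0]
s2_hat = np.mean((x[1:] - Z @ np.array([c_hat, phi_hat])) ** 2)   # MLE of sigma^2
print(c_hat, phi_hat, s2_hat)
```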
Model Selection

In the discussion of estimation above, we assumed that the order of the lags is known. However, in empirical estimation we have to choose a proper order. A larger number of parameters will improve the fit of the model, so we need a criterion that balances goodness of fit against model parsimony. There are three commonly used criteria: the Akaike information criterion (AIC), Schwarz's Bayesian information criterion (BIC), and the posterior information criterion (PIC) developed by Phillips (1996).

In all these criteria, we specify a maximum order $k_{\max}$ and then choose $k$ to minimize a criterion function:

$AIC(k) = \log\frac{SSR_k}{n} + \frac{2k}{n}$,   (13)

where $n$ is the sample size, $k = 1, 2, \ldots, k_{\max}$ is the number of parameters in the model, and $SSR_k$ is the sum of squared residuals from the fitted model. When $k$ increases, the fit improves, so $SSR_k$ decreases, but the second term increases. This shows the trade-off between fit and parsimony. Since the model is estimated with different lag lengths, the usable sample size also varies: we can either use the varying sample size $n - k$, or use a fixed sample size $n - k_{\max}$. Ng and Perron (2000) have recommended using the fixed sample size, replacing $n$ in the criterion. However, the AIC rule is not consistent and tends to overfit the model by choosing a larger $k$.

With all other issues the same as in the AIC rule, the BIC rule imposes a larger penalty for increasing the number of parameters:

$BIC(k) = \log\frac{SSR_k}{n} + \frac{k\log(n)}{n}$.   (14)

BIC suggests a smaller $k$ than AIC, and the BIC rule is consistent for stationary data, i.e., $\lim_{n\to\infty}\hat k_{BIC} = k$. Further, Hannan and Deistler (1988) have shown that $\hat k_{BIC}$ is consistent when we set $k_{\max} = [c\log(n)]$ (the integer part of $c\log(n)$) for any $c > 0$. Therefore, we can estimate $k$ consistently without knowing an upper bound on $k$.
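A sketch of order selection by (13)-(14) (the data-generating AR(2) is an assumption for the illustration), using a fixed effective sample of $n - k_{\max}$ observations as recommended above:

```python
import numpy as np

# Sketch of AIC/BIC order selection for an AR model, using a fixed
# effective sample of n - kmax observations. The true model is an
# assumed AR(2): x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + e_t.
def order_select(x, kmax):
    n_eff = len(x) - kmax
    aic, bic = [], []
    for k in range(1, kmax + 1):
        Z = np.column_stack([x[kmax - j:len(x) - j] for j in range(1, k + 1)])
        y = x[kmax:]
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        ssr = np.sum((y - Z @ b) ** 2)
        aic.append(np.log(ssr / n_eff) + 2 * k / n_eff)
        bic.append(np.log(ssr / n_eff) + k * np.log(n_eff) / n_eff)
    return 1 + int(np.argmin(aic)), 1 + int(np.argmin(bic))

rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()
print(order_select(x, kmax=8))  # BIC should pick k = 2
```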
Finally, to present the PIC criterion, let $K = k_{\max}$, and let $X(K)$ and $X(k)$ denote the regressor matrices with $K$ and $k$ parameters respectively; similarly for the parameter vectors $\beta(K)$ and $\beta(k)$. Define

$A(K) = X(K)'X(K)$,
$A(k) = X(k)'X(k)$,
$A(K,k) = X(K)'X(k)$,
$\bar A(K) = A(K) - A(K,k)A(k)^{-1}A(k,K)$,
$\bar\beta(K) = [X(K)'X(K) - X(K)'X(k)(X(k)'X(k))^{-1}X(k)'X(K)]^{-1}[X(K)'Y - X(K)'X(k)(X(k)'X(k))^{-1}X(k)'Y]$,
$\hat\sigma^2_K = SSR_K/(n-K)$;

then

$PIC = \left|\bar A(K)/\hat\sigma^2_K\right|^{1/2}\exp\left(-\frac{1}{2\hat\sigma^2_K}\,\bar\beta(K)'\bar A(K)\bar\beta(K)\right)$.

PIC is asymptotically equivalent to the BIC criterion when the data are stationary, and when the data are nonstationary, PIC is still consistent.

Reading: Hamilton, Ch. 5, 8.
In this section, we will extend our discussion to vector-valued time series. We will be mostly interested in vector autoregressions (VARs), which are easy to estimate in applications. We first introduce the properties of and basic tools for analyzing stationary VAR processes, and then move on to estimation and inference in the VAR model.

1.1 Stationary VAR Processes

1.1.1 VAR processes

A VAR model applies when each variable in the system depends not only on its own lags, but also on the lags of other variables. A simple VAR example is:

$x_{1t} = \phi_{11}x_{1,t-1} + \phi_{12}x_{2,t-1} + \varepsilon_{1t}$
$x_{2t} = \phi_{21}x_{2,t-1} + \phi_{22}x_{2,t-2} + \varepsilon_{2t}$,

or in matrix form

$\begin{bmatrix} x_{1t} \\ x_{2t} \end{bmatrix} = \begin{bmatrix} \phi_{11} & \phi_{12} \\ 0 & \phi_{21} \end{bmatrix}\begin{bmatrix} x_{1,t-1} \\ x_{2,t-1} \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & \phi_{22} \end{bmatrix}\begin{bmatrix} x_{1,t-2} \\ x_{2,t-2} \end{bmatrix} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}$,

or just

$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \varepsilon_t$.   (1)

As you can see, in this example the vector-valued random variable $x_t$ follows a VAR(2) process. A general VAR(p) process with white noise errors can be written as

$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \cdots + \Phi_p x_{t-p} + \varepsilon_t = \sum_{j=1}^p \Phi_j x_{t-j} + \varepsilon_t$,

or $\Phi(L)x_t = \varepsilon_t$, where

$\Phi(L) = I_k - \Phi_1 L - \cdots - \Phi_p L^p$

and

$E(\varepsilon_t\varepsilon_s') = \begin{cases} \Omega & \text{for } t = s \\ 0 & \text{otherwise} \end{cases}$

with $\Omega$ a $(k \times k)$ symmetric positive definite matrix.

Recall that in studying the scalar AR(p) process,

$\phi(L)x_t = \varepsilon_t$,

we have the result that the process $\{x_t\}$ is covariance-stationary as long as all the roots of

$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = 0$   (2)

lie outside of the unit circle. Similarly, for the VAR(p) process to be stationary, all the roots of the equation

$|I_k - \Phi_1 z - \cdots - \Phi_p z^p| = 0$

must lie outside the unit circle.
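Numerically, stationarity is most easily checked through the companion matrix introduced below: the roots condition is equivalent to all eigenvalues of $F$ lying strictly inside the unit circle (the coefficient matrices here are assumptions for the illustration):

```python
import numpy as np

# Sketch: check VAR(2) stationarity via the companion matrix. The condition
# that all roots of |I - Phi_1 z - Phi_2 z^2| = 0 lie outside the unit circle
# is equivalent to all eigenvalues of F lying strictly inside it.
Phi1 = np.array([[0.5, 0.1],
                 [0.0, 0.4]])
Phi2 = np.array([[0.0, 0.0],
                 [0.0, 0.2]])
k = Phi1.shape[0]
F = np.block([[Phi1, Phi2],
              [np.eye(k), np.zeros((k, k))]])
eigs = np.linalg.eigvals(F)
print(np.max(np.abs(eigs)))   # < 1, so this VAR(2) is covariance-stationary
```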
1.1.2 Inverting a VAR(p) to a vector MA(∞)

Recall that we can invert a scalar stationary AR(p) process $\phi(L)x_t = \varepsilon_t$ into an MA(∞) process $x_t = \psi(L)\varepsilon_t$, where $\psi(L) = \phi(L)^{-1}$. The same is true for a covariance-stationary VAR(p) process $\Phi(L)x_t = \varepsilon_t$. We can invert it to

$x_t = \Psi(L)\varepsilon_t$,

where $\Psi(L) = \Phi(L)^{-1}$. The coefficients of $\Psi(L)$ can be computed recursively from $\Phi(L)\Psi(L) = I_k$:

$(I_k - \Phi_1 L - \Phi_2 L^2 - \cdots - \Phi_p L^p)(I_k + \Psi_1 L + \Psi_2 L^2 + \cdots) = I_k$.

Matching powers of $L$ gives $\Psi_1 = \Phi_1$, $\Psi_2 = \Phi_1\Psi_1 + \Phi_2$, and in general, we have

$\Psi_s = \Phi_1\Psi_{s-1} + \Phi_2\Psi_{s-2} + \cdots + \Phi_p\Psi_{s-p}, \quad s \ge p$,

with $\Psi_0 = I_k$ and $\Psi_j = 0$ for $j < 0$.
Sometimes it is more convenient to write a time series in stacked (vector) form. For example, a scalar AR(p) process
\[
x_t = \sum_{j=1}^{p}\phi_j x_{t-j} + \epsilon_t, \qquad \epsilon_t \sim N(0,\sigma^2),
\]
can be rewritten as a first-order system, and the same device works for the VAR(p)
\[
x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \ldots + \Phi_p x_{t-p} + \epsilon_t.
\]
Let
\[
\xi_t = \begin{pmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{pmatrix},
\qquad
F = \begin{pmatrix}
\Phi_1 & \Phi_2 & \ldots & \Phi_{p-1} & \Phi_p \\
I_k & 0 & \ldots & 0 & 0 \\
0 & I_k & \ldots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \ldots & I_k & 0
\end{pmatrix},
\qquad
v_t = \begin{pmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix};
\]
then
\[
\xi_t = F\xi_{t-1} + v_t,
\tag{3}
\]
with
\[
E(v_t v_s') = \begin{cases} Q & \text{for } t = s \\ 0 & \text{otherwise,} \end{cases}
\qquad
Q = \begin{pmatrix}
\Omega & 0 & \ldots & 0 \\
0 & 0 & \ldots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \ldots & 0
\end{pmatrix}.
\]
1.3 Autocovariances
1.3.1 Autocovariances of a VAR process
For a covariance-stationary $k$-dimensional vector process $\{x_t\}$, let $E(x_t) = \mu$; then the autocovariance is defined to be the following $k\times k$ matrix:
\[
\Gamma(h) = E[(x_t - \mu)(x_{t-h} - \mu)'].
\]
For simplicity, assume that $\mu = 0$. Then we have $\Gamma(h) = E(x_t x_{t-h}')$. Because of the lead-lag effect, we may not have $\Gamma(h) = \Gamma(-h)$, but we do have $\Gamma(h)' = \Gamma(-h)$. To show this,
\[
\Gamma(h) = E(x_{t+h}x_{t+h-h}') = E(x_{t+h}x_t');
\]
taking transposes,
\[
\Gamma(h)' = E(x_t x_{t+h}') = \Gamma(-h).
\]
Similarly to the scalar case, we define the autocovariance generating function of the process $x$ as
\[
G_x(z) = \sum_{h=-\infty}^{\infty}\Gamma(h)z^h.
\]
For the stacked process $\xi_t = (x_t', x_{t-1}', \ldots, x_{t-p+1}')'$ of the companion form (3), define
\[
\Sigma = E(\xi_t\xi_t') =
\begin{pmatrix}
\Gamma(0) & \Gamma(1) & \ldots & \Gamma(p-1) \\
\Gamma(1)' & \Gamma(0) & \ldots & \Gamma(p-2) \\
\vdots & & \ddots & \vdots \\
\Gamma(p-1)' & \Gamma(p-2)' & \ldots & \Gamma(0)
\end{pmatrix}.
\]
Using (3),
\[
\Sigma = E[(F\xi_{t-1} + v_t)(F\xi_{t-1} + v_t)'] = F\,E(\xi_{t-1}\xi_{t-1}')\,F' + E(v_t v_t'),
\]
or
\[
\Sigma = F\Sigma F' + Q.
\tag{4}
\]
To solve for $\Sigma$, we need to use the Kronecker product and the following result: let $A, B, C$ be matrices whose dimensions are such that the product $ABC$ exists. Then
\[
\mathrm{vec}(ABC) = (C'\otimes A)\,\mathrm{vec}(B),
\]
where vec is the operator that stacks the columns of a $(k\times k)$ matrix into a $k^2$-dimensional vector; for example,
\[
A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
\mathrm{vec}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{pmatrix}.
\]
Applying the vec operator to both sides of (4), we get
\[
\mathrm{vec}(\Sigma) = (F\otimes F)\,\mathrm{vec}(\Sigma) + \mathrm{vec}(Q),
\qquad\text{so}\qquad
\mathrm{vec}(\Sigma) = (I_m - F\otimes F)^{-1}\mathrm{vec}(Q),
\]
where $m = k^2p^2$. We can use this equation to solve for the first $p$ autocovariances of $x$, $\Gamma(0), \ldots, \Gamma(p-1)$. To derive the $h$th autocovariance of $\xi$, denoted by $\Sigma(h)$, we can postmultiply (3) by $\xi_{t-h}'$ and take expectations:
\[
E(\xi_t\xi_{t-h}') = F\,E(\xi_{t-1}\xi_{t-h}') + E(v_t\xi_{t-h}'),
\]
then
\[
\Sigma(h) = F\,\Sigma(h-1), \qquad\text{or}\qquad \Sigma(h) = F^h\,\Sigma.
\]
In terms of the original process, this gives
\[
\Gamma(h) = \Phi_1\Gamma(h-1) + \Phi_2\Gamma(h-2) + \ldots + \Phi_p\Gamma(h-p).
\]
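The vec equation and the recursion above can be implemented directly. A minimal sketch, assuming NumPy (`var_autocovariances` is a hypothetical helper name):

```python
import numpy as np

def var_autocovariances(phis, omega, hmax):
    """Autocovariances Gamma(0), ..., Gamma(hmax) of a stationary VAR(p)
    via the companion form: vec(Sigma) = (I - F kron F)^{-1} vec(Q),
    then Gamma(h) = Phi1 Gamma(h-1) + ... + Phip Gamma(h-p)."""
    k, p = phis[0].shape[0], len(phis)
    F = np.zeros((k * p, k * p))
    F[:k, :] = np.hstack(phis)
    F[k:, :-k] = np.eye(k * (p - 1))
    Q = np.zeros((k * p, k * p))
    Q[:k, :k] = omega
    m = (k * p) ** 2
    # vec stacks columns, which in row-major NumPy is A.T.ravel()
    vec_sigma = np.linalg.solve(np.eye(m) - np.kron(F, F), Q.T.ravel())
    Sigma = vec_sigma.reshape((k * p, k * p)).T      # E(xi_t xi_t')
    # first block row of Sigma holds Gamma(0), ..., Gamma(p-1)
    gammas = [Sigma[:k, i * k:(i + 1) * k] for i in range(p)]
    for h in range(p, hmax + 1):
        gammas.append(sum(phis[j] @ gammas[h - j - 1] for j in range(p)))
    return gammas[: hmax + 1]
```

For a scalar AR(1) with $\phi = 0.5$ and $\sigma^2 = 1$ this reproduces the textbook values $\Gamma(0) = 1/(1-\phi^2) = 4/3$ and $\Gamma(h) = \phi^h\,\Gamma(0)$.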
1.3.2 Vector MA processes
Consider a vector MA(q) process
\[
x_t = \epsilon_t + \Theta_1\epsilon_{t-1} + \Theta_2\epsilon_{t-2} + \ldots + \Theta_q\epsilon_{t-q},
\qquad
E(\epsilon_t\epsilon_t') = \Omega.
\]
Its autocovariances are
\[
\Gamma(0) = \Omega + \Theta_1\Omega\Theta_1' + \Theta_2\Omega\Theta_2' + \ldots + \Theta_q\Omega\Theta_q',
\]
\[
\Gamma(h) = \Theta_h\Omega + \Theta_{h+1}\Omega\Theta_1' + \Theta_{h+2}\Omega\Theta_2' + \ldots + \Theta_q\Omega\Theta_{q-h}'
\qquad\text{for } h = 1,\ldots,q,
\]
with $\Gamma(-h) = \Gamma(h)'$ for $h = 1,\ldots,q$, and $\Gamma(h) = 0$ for $|h| > q$.
As in the scalar case, any vector MA(q) process is stationary. Next consider the MA($\infty$) process
\[
x_t = \epsilon_t + \Psi_1\epsilon_{t-1} + \Psi_2\epsilon_{t-2} + \ldots = \Psi(L)\epsilon_t.
\]
A sequence of matrices $\{\Psi_s\}_{s=0}^{\infty}$ is absolutely summable if each of its elements forms an absolutely summable scalar sequence, i.e.
\[
\sum_{s=0}^{\infty}\left|\psi_{ij}^{(s)}\right| < \infty \qquad\text{for } i,j = 1,2,\ldots,k,
\]
where $\psi_{ij}^{(s)}$ is the row $i$, column $j$ element (ijth for short) of $\Psi_s$. The main result about the MA($\infty$) process $x_t = \sum_{j=0}^{\infty}\Psi_j\epsilon_{t-j}$, with $\{\Psi_j\}$ absolutely summable, is summarized as follows:
(a) The autocovariance between the $i$th variable at time $t$ and the $j$th variable $s$ periods earlier, $E(x_{it}x_{j,t-s})$, exists and is given by the ijth element of
\[
\Gamma(s) = \sum_{v=0}^{\infty}\Psi_{s+v}\,\Omega\,\Psi_v' \qquad\text{for } s = 0,1,2,\ldots;
\]
(b) $\{\Gamma(h)\}_{h=0}^{\infty}$ is absolutely summable.
If $\{\epsilon_t\}_{t=-\infty}^{\infty}$ is i.i.d. with $E|\epsilon_{i1,t}\epsilon_{i2,t}\epsilon_{i3,t}\epsilon_{i4,t}| < \infty$ for $i1,i2,i3,i4 = 1,2,\ldots,k$, then we also have
(c) $E|x_{i1,t1}x_{i2,t2}x_{i3,t3}x_{i4,t4}| < \infty$ for $i1,i2,i3,i4 = 1,2,\ldots,k$ and all $t1,t2,t3,t4$;
(d) $n^{-1}\sum_{t=1}^{n}x_{it}x_{j,t-s} \to_p E(x_{it}x_{j,t-s})$ for $i,j = 1,2,\ldots,k$ and for all $s$.
All of these results can be viewed as extensions from the scalar case to the vector case, and the proofs can be found on pages 286-288 of Hamilton's book.
1.4 The sample mean of a vector process
Consider the sample mean
\[
\bar x_n = \frac{1}{n}\sum_{t=1}^{n}x_t.
\]
Its second moment is
\[
E[\bar x_n\bar x_n'] = \frac{1}{n^2}E[(x_1 + \ldots + x_n)(x_1 + \ldots + x_n)']
= \frac{1}{n^2}\sum_{i,j}E(x_i x_j')
= \frac{1}{n}\sum_{h=-n+1}^{n-1}\left(1 - \frac{|h|}{n}\right)\Gamma(h),
\]
where $E(x_i x_j') = \Gamma(i-j)$ and $\Gamma(h)$ is absolutely summable. Then
\[
n\,E[\bar x_n\bar x_n'] = \sum_{h=-n+1}^{n-1}\left(1 - \frac{|h|}{n}\right)\Gamma(h)
= \Gamma(0) + \left(1 - \frac{1}{n}\right)(\Gamma(1) + \Gamma(-1)) + \left(1 - \frac{2}{n}\right)(\Gamma(2) + \Gamma(-2)) + \ldots
\to \sum_{h=-\infty}^{\infty}\Gamma(h).
\]
This is very similar to what we did in the scalar case. We then have the following proposition:

Proposition 2  Let $x_t$ be a zero-mean stationary process with $E(x_t) = 0$ and $E(x_t x_{t-h}') = \Gamma(h)$, where $\Gamma(h)$ is absolutely summable. Then the sample mean satisfies
(a) $\bar x_n \to_p 0$;
(b) $\lim_{n\to\infty} n\,E[\bar x_n\bar x_n'] = \sum_{h=-\infty}^{\infty}\Gamma(h)$.
A natural estimate of this long-run variance is
\[
S = \hat\Gamma(0) + \sum_{h=1}^{q}\left(\hat\Gamma(h) + \hat\Gamma(h)'\right),
\tag{5}
\]
where
\[
\hat\Gamma(h) = \frac{1}{n}\sum_{t=h+1}^{n}(x_t - \bar x_n)(x_{t-h} - \bar x_n)'.
\]
$S$ defined in (5) provides a consistent estimator of $n\,E[\bar x_n\bar x_n']$ for a large class of stationary processes. Even when the process has time-varying second moments, as long as
\[
\frac{1}{n}\sum_{t=h+1}^{n}(x_t - \bar x_n)(x_{t-h} - \bar x_n)'
\]
converges in probability to
\[
\frac{1}{n}\sum_{t=h+1}^{n}E(x_t x_{t-h}'),
\]
$S$ is a consistent estimate of $n\,E[\bar x_n\bar x_n']$. It is used not only for MA(q) processes: writing the autocovariance as $E(x_t x_s')$, even if it is nonzero for all $t$ and $s$, if this matrix goes to zero sufficiently fast as $|t-s| \to \infty$, and $q$ grows with the sample size $n$, then $S$ remains consistent.
However, a problem with $S$ is that it may not be positive semidefinite in small samples. Therefore, we can use the Newey-West estimate
\[
S = \hat\Gamma(0) + \sum_{h=1}^{q}\left(1 - \frac{h}{q+1}\right)\left(\hat\Gamma(h) + \hat\Gamma(h)'\right),
\]
which is positive semidefinite and has the same consistency properties as $S$ when $q, n \to \infty$ with $q/n^{1/4} \to 0$.
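A minimal sketch of the Newey-West estimate, assuming NumPy (`newey_west` is a hypothetical name):

```python
import numpy as np

def newey_west(x, q):
    """Newey-West estimate S = Gamma(0) + sum_{h=1}^q (1 - h/(q+1)) *
    (Gamma(h) + Gamma(h)'), with Gamma(h) the hth sample autocovariance
    of the (n x k) data matrix x. Positive semidefinite by construction."""
    n, k = x.shape
    xc = x - x.mean(axis=0)                # demean
    S = xc.T @ xc / n                      # Gamma(0)
    for h in range(1, q + 1):
        gamma_h = xc[h:].T @ xc[:-h] / n   # (1/n) sum_t (x_t)(x_{t-h})'
        S += (1 - h / (q + 1)) * (gamma_h + gamma_h.T)
    return S
```

The downweighting of higher-order autocovariances is what delivers positive semidefiniteness; setting the weights to one instead would give the estimate in (5).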
1.5 Impulse-response functions
1.5.1 MA representations and shocks
The impulse-response function describes how a time series variable is affected by a shock at time $t$. Recall that for a scalar time series process, say an AR(1) process $x_t = \phi x_{t-1} + \epsilon_t$ with $|\phi| < 1$, we can invert it to an MA process $x_t = (1 + \phi L + \phi^2 L^2 + \ldots)\epsilon_t$, and the effects of $\epsilon$ on $x$ are:
\[
\begin{array}{lccccc}
\epsilon: & 0 & 1 & 0 & 0 & \ldots \\
x: & 0 & 1 & \phi & \phi^2 & \ldots
\end{array}
\]
In other words, after we invert $\phi(L)x_t = \epsilon_t$ to $x_t = \psi(L)\epsilon_t$, the $\psi(L)$ function tells us how $x$ responds to a unit shock from $\epsilon_t$.
We can do a similar thing with a VAR process. In our earlier example, we have the VAR(2) system
\[
x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \epsilon_t = \Psi(L)\epsilon_t,
\tag{6}
\]
with $\epsilon_t \sim WN(0,\Sigma)$, where
\[
\Sigma = \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{21} & \sigma_2^2\end{pmatrix},
\qquad
\Psi(L) = (I - \Phi_1 L - \Phi_2 L^2)^{-1}.
\]
In this representation, the observations $x_t$ are linear combinations of the shocks $\epsilon_t$. However, suppose we are interested in another form of shocks, say
\[
u_t = Q\epsilon_t,
\]
where $Q$ is an arbitrary square matrix (in this example, $2\times 2$); then we have
\[
x_t = \Psi(L)Q^{-1}Q\epsilon_t = A(L)u_t,
\tag{7}
\]
where we let $A(L) = \Psi(L)Q^{-1}$. Since this $Q$ is arbitrary, we can form many linear combinations of shocks, and hence many response functions. Which combination shall we use?
1.5.2 Orthogonalized shocks
In economic modeling, we compute impulse-response dynamics because we are interested in how economic variables respond to a particular source of shocks. If the shocks are correlated, then it is hard to identify the response to any one of them. From that point of view, we may want to choose $Q$ to make $u_t = Q\epsilon_t$ orthonormal, i.e., uncorrelated across each other and with unit variance, $E(u_t u_t') = I$. To do so, we need a $Q$ such that
\[
Q^{-1}(Q^{-1})' = \Sigma,
\]
since then $E(u_t u_t') = E(Q\epsilon_t\epsilon_t'Q') = Q\Sigma Q' = I_k$. So we can use the Choleski decomposition of $\Sigma$ to find $Q$. However, $Q$ is still not unique, as we can form other valid $Q$s by multiplying by an orthogonal matrix. Sims (1980) proposes that we pin down the model by choosing a particular leading term $A_0$ in the coefficients. In (6) we have $\Psi_0 = I_k$; however, in (7), $A_0 = Q^{-1}$ cannot be the identity matrix unless $\Sigma$ is diagonal. In our example, we would choose the $Q$ which produces $A_0 = Q^{-1}$ as a lower triangular matrix. That means that, after this transformation, the shock $u_{2t}$ has no contemporaneous effect on $x_{1t}$. The nice thing is that the Choleski decomposition itself produces a triangular matrix.
Example 1  Consider an AR(1) process for a 2-dimensional vector,
\[
\begin{pmatrix}x_{1t}\\ x_{2t}\end{pmatrix}
= \begin{pmatrix}0.5 & 0.2\\ 0.3 & 0.4\end{pmatrix}
\begin{pmatrix}x_{1,t-1}\\ x_{2,t-1}\end{pmatrix}
+ \begin{pmatrix}\epsilon_{1t}\\ \epsilon_{2t}\end{pmatrix},
\qquad
\Sigma = E(\epsilon_t\epsilon_t') = \begin{pmatrix}2 & 1\\ 1 & 4\end{pmatrix}.
\]
Solving for the eigenvalues gives $\lambda_1 = 0.94$ and $\lambda_2 = -0.04$, both lying inside the unit circle. Invert it to a moving average process,
\[
x_t = \Psi(L)\epsilon_t.
\]
We need $Q$ such that $Q^{-1}(Q^{-1})' = \Sigma$; the Choleski decomposition gives
\[
Q^{-1} = \begin{pmatrix}1.41 & 0\\ 0.70 & 1.87\end{pmatrix},
\qquad
Q = \begin{pmatrix}0.70 & 0\\ -0.27 & 0.53\end{pmatrix}.
\]
Then
\[
x_t = \Psi(L)Q^{-1}Q\epsilon_t = A(L)u_t:
\]
\[
\begin{pmatrix}x_{1t}\\ x_{2t}\end{pmatrix}
= \begin{pmatrix}1.41 & 0\\ 0.70 & 1.87\end{pmatrix}
\begin{pmatrix}u_{1t}\\ u_{2t}\end{pmatrix}
+ \begin{pmatrix}0.85 & 0.37\\ 0.70 & 0.75\end{pmatrix}
\begin{pmatrix}u_{1,t-1}\\ u_{2,t-1}\end{pmatrix}
+ \ldots,
\]
where the lag-1 matrix is $\Psi_1 Q^{-1} = \Phi Q^{-1}$. In this example we have found a unique MA representation which is a linear combination of uncorrelated errors ($E(u_t u_t') = I_2$), and the second source of shocks has no instantaneous effect on $x_{1t}$. We can then use this representation to compute the impulse-responses.
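The numbers in Example 1 can be reproduced with a Choleski factorization. A sketch assuming NumPy (small differences from the matrices printed in the text are rounding):

```python
import numpy as np

# Sigma and Phi from Example 1
Sigma = np.array([[2.0, 1.0], [1.0, 4.0]])
Phi = np.array([[0.5, 0.2], [0.3, 0.4]])

P = np.linalg.cholesky(Sigma)  # lower triangular with P P' = Sigma
Q = np.linalg.inv(P)           # so that Q Sigma Q' = I

A0 = P        # A0 = Q^{-1}: contemporaneous responses to u_t
A1 = Phi @ P  # Psi_1 Q^{-1}, with Psi_1 = Phi for a VAR(1)

print(np.round(A0, 2))  # approx [[1.41, 0.], [0.71, 1.87]]
print(np.round(A1, 2))  # approx [[0.85, 0.37], [0.71, 0.75]]
```

Note that `A0` is lower triangular, so the orthogonalized shock $u_{2t}$ indeed has no contemporaneous effect on $x_{1t}$.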
There are also other ways to specify the representation, depending on the problem of interest. For example, Quah (1988) suggests finding a $Q$ so that the long-run response of one variable to another variable's shocks is zero.
1.5.3 Variance decomposition
Now let's consider how we can decompose the variance of the forecast errors. Write $x_t = \Psi(L)\epsilon_t = A(L)u_t$, where $A(L) = \Psi(L)Q^{-1}$, $u_t = Q\epsilon_t$ and $E(u_t u_t') = I$. For simplicity, let $x_t = (x_{1t}, x_{2t})'$. Suppose we do a one-period-ahead forecast, and let $y_{t+1}$ denote the forecast error,
\[
y_{t+1} = x_{t+1} - E_t(x_{t+1}) = A_0 u_{t+1} =
\begin{pmatrix}A_{11}^0 & A_{12}^0\\ A_{21}^0 & A_{22}^0\end{pmatrix}
\begin{pmatrix}u_{1,t+1}\\ u_{2,t+1}\end{pmatrix}.
\]
Since $E(u_{1t}u_{2t}) = 0$ and $E(u_{it}^2) = 1$, the variance of the forecast error is $E(y_{t+1}y_{t+1}') = A_0A_0'$. So the forecast error variance for $x_{1t}$ is $(A_{11}^0)^2 + (A_{12}^0)^2$. We can interpret $(A_{11}^0)^2$ as the amount of the one-step-ahead forecast error variance due to shock $u_1$, and $(A_{12}^0)^2$ as the amount due to shock $u_2$. Similarly, the forecast error variance of $x_{2t}$ is $(A_{21}^0)^2 + (A_{22}^0)^2$, and we can interpret the two terms as the amounts due to shocks $u_1$ and $u_2$ respectively. The variance of the $k$-period-ahead forecast error can be decomposed in a similar way.
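Continuing Example 1, the one-step shares follow from $A_0$ alone. A minimal sketch, assuming NumPy:

```python
import numpy as np

# A0 = Q^{-1} from the Choleski factorization of Sigma = [[2, 1], [1, 4]]
A0 = np.linalg.cholesky(np.array([[2.0, 1.0], [1.0, 4.0]]))

# one-step-ahead forecast error variance of variable i: sum_j A0[i, j]^2
# (the row sums of squares equal diag(Sigma), since A0 A0' = Sigma);
# the share attributed to orthogonalized shock j is A0[i, j]^2 / total
total = (A0 ** 2).sum(axis=1)
shares = (A0 ** 2) / total[:, None]
print(np.round(shares, 3))  # row 1 is [1, 0]: all of x1's variance is due to u1
```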
2.1 Maximum likelihood estimation
Usually we use the conditional likelihood in VAR estimation (recall that conditional likelihood functions are much easier to work with than unconditional likelihood functions). Given a $k$-vector VAR(p) process,
\[
y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \ldots + \Phi_p y_{t-p} + \epsilon_t,
\]
define
\[
\Pi = \begin{pmatrix} c' \\ \Phi_1' \\ \Phi_2' \\ \vdots \\ \Phi_p' \end{pmatrix},
\qquad
x_t = \begin{pmatrix} 1 \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{pmatrix},
\]
so that $y_t = \Pi'x_t + \epsilon_t$. If we assume that $\epsilon_t \sim$ i.i.d. $N(0,\Omega)$, then we can use MLE to estimate the parameters in $\theta = (c, \Phi, \Omega)$. Following the same approach as in the scalar case, assume that we have observed $(y_{-p+1},\ldots,y_0)$; then the likelihood function for $y_t$ is
\[
f(y_t \mid x_t;\theta) = (2\pi)^{-k/2}\,|\Omega^{-1}|^{1/2}\exp\left[-\tfrac{1}{2}(y_t - \Pi'x_t)'\,\Omega^{-1}(y_t - \Pi'x_t)\right],
\]
and the conditional log likelihood is
\[
\mathcal{L}(\theta) = -\frac{nk}{2}\log(2\pi) + \frac{n}{2}\log|\Omega^{-1}| - \frac{1}{2}\sum_{t=1}^{n}(y_t - \Pi'x_t)'\,\Omega^{-1}(y_t - \Pi'x_t).
\tag{8}
\]
Maximizing over $\Pi$ gives
\[
\hat\Pi_n' = \left[\sum_{t=1}^{n} y_t x_t'\right]\left[\sum_{t=1}^{n} x_t x_t'\right]^{-1}.
\]
The $j$th row of $\hat\Pi_n'$ is
\[
\hat\pi_j' = \left[\sum_{t=1}^{n} y_{jt} x_t'\right]\left[\sum_{t=1}^{n} x_t x_t'\right]^{-1},
\]
which is the estimated coefficient vector from an OLS regression of $y_{jt}$ on $x_t$. So the MLE estimates of the coefficients for the $j$th equation of a VAR are found by an OLS regression of $y_{jt}$ on a constant term and $p$ lags of all of the variables in the system.
The MLE estimate of $\Omega$ is
\[
\hat\Omega_n = \frac{1}{n}\sum_{t=1}^{n}\hat\epsilon_t\hat\epsilon_t',
\qquad\text{where}\quad
\hat\epsilon_t = y_t - \hat\Pi_n'x_t.
\]
The details of the derivations can be found on pages 292-296 of Hamilton's book. The MLE estimates $\hat\Pi_n$ and $\hat\Omega_n$ are consistent even if the true innovations are non-Gaussian. In the next subsection, we consider the regression with non-Gaussian errors, and we use the LS approach to derive the asymptotics.
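The equation-by-equation OLS characterization of the MLE can be sketched directly, assuming NumPy (`var_ols` is a hypothetical helper name):

```python
import numpy as np

def var_ols(y, p):
    """Equation-by-equation OLS for a VAR(p): each row j of Pi holds the
    regression coefficients of y_{jt} on a constant and p lags of all
    variables, which equals the conditional MLE. Omega is the MLE
    residual covariance (1/n) sum e_t e_t'."""
    T, k = y.shape
    X = np.array([np.concatenate([[1.0]] + [y[t - j] for j in range(1, p + 1)])
                  for t in range(p, T)])          # n x (1 + k p)
    Y = y[p:]                                     # n x k
    Pi = np.linalg.lstsq(X, Y, rcond=None)[0].T   # k x (1 + k p)
    resid = Y - X @ Pi.T
    Omega = resid.T @ resid / len(Y)              # MLE: divide by n
    return Pi, Omega
```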
2.2 Asymptotic distribution of the estimates
Proposition 3  Let
\[
y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \ldots + \Phi_p y_{t-p} + \epsilon_t,
\]
where $\epsilon_t$ is i.i.d. $(0,\Omega)$, $E(\epsilon_{it}\epsilon_{jt}\epsilon_{lt}\epsilon_{mt}) < \infty$ for all $i,j,l$ and $m$, and the roots of
\[
|I_k - \Phi_1 z - \ldots - \Phi_p z^p| = 0
\]
lie outside the unit circle. Define $x_t' = [1\;\; y_{t-1}'\;\; \ldots\;\; y_{t-p}']$.
Let
\[
\hat\pi_{i,n} = \left[\sum_{t=1}^{n}x_t x_t'\right]^{-1}\left[\sum_{t=1}^{n}x_t y_{it}\right],
\qquad
\hat\Omega_n = \frac{1}{n}\sum_{t=1}^{n}\hat\epsilon_t\hat\epsilon_t',
\]
where $\hat\epsilon_t' = [\hat\epsilon_{1t}\;\hat\epsilon_{2t}\;\ldots\;\hat\epsilon_{kt}]$ and $\hat\epsilon_{it} = y_{it} - x_t'\hat\pi_{i,n}$. Then
(a) $\frac{1}{n}\sum_{t=1}^{n}x_t x_t' \to_p Q$, where $Q = E(x_t x_t')$;
(b) $\hat\pi_n \to_p \pi$;
(c) $\hat\Omega_n \to_p \Omega$;
(d) $\sqrt{n}(\hat\pi_n - \pi) \to_d N(0,\ \Omega\otimes Q^{-1})$.
Result (a) is a vector version of the fact that sample second moments converge to the population moments; it follows from the coefficients being absolutely summable and the innovations having finite fourth moments. Results (b) and (c) are similar to the derivations for the single OLS regression in Case 3 of Lecture 5. To show result (d), let
\[
Q_n = \frac{1}{n}\sum_{t=1}^{n}x_t x_t',
\qquad\text{so that}\qquad
\sqrt{n}(\hat\pi_{i,n} - \pi_i) = Q_n^{-1}\left[n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{it}\right],
\]
and, stacking the $k$ equations,
\[
\sqrt{n}(\hat\pi_n - \pi) =
\begin{pmatrix}
Q_n^{-1}\,n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{1t}\\
Q_n^{-1}\,n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{2t}\\
\vdots\\
Q_n^{-1}\,n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{kt}
\end{pmatrix}.
\tag{9}
\]
Define $\xi_t$ to be the $km\times 1$ vector (with $m = kp + 1$ the dimension of $x_t$)
\[
\xi_t = \begin{pmatrix} x_t\epsilon_{1t}\\ x_t\epsilon_{2t}\\ \vdots\\ x_t\epsilon_{kt}\end{pmatrix}.
\]
Then
\[
\frac{1}{n}\sum_{t=1}^{n}\xi_t\xi_t' \to_p \Omega\otimes Q,
\]
and, by the CLT for martingale difference sequences,
\[
n^{-1/2}\sum_{t=1}^{n}\xi_t \to_d N(0,\ \Omega\otimes Q).
\tag{10}
\]
Rewriting (9),
\[
\sqrt{n}(\hat\pi_n - \pi) =
\begin{pmatrix}
Q_n^{-1} & 0 & \ldots & 0\\
0 & Q_n^{-1} & \ldots & 0\\
\vdots & & \ddots & \vdots\\
0 & 0 & \ldots & Q_n^{-1}
\end{pmatrix}
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{1t}\\
\vdots\\
n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{kt}
\end{pmatrix}
= (I_k\otimes Q_n^{-1})\,n^{-1/2}\sum_{t=1}^{n}\xi_t.
\]
Thus
\[
\sqrt{n}(\hat\pi_n - \pi) \to_d (I_k\otimes Q^{-1})\,n^{-1/2}\sum_{t=1}^{n}\xi_t,
\]
which, from (10), has a distribution that is Gaussian with mean $0$ and variance
\[
(I_k\otimes Q^{-1})(\Omega\otimes Q)(I_k\otimes Q^{-1}) = (I_k\Omega I_k)\otimes(Q^{-1}QQ^{-1}) = \Omega\otimes Q^{-1}.
\]
In particular, for the $i$th equation, $\sqrt{n}(\hat\pi_{i,n} - \pi_i) \to_d N(0,\ \sigma_i^2 Q^{-1})$, where $\sigma_i^2$ is the $i$th diagonal element of $\Omega$.
Given that the estimators are asymptotically normal, we can use them to test linear or nonlinear restrictions on the coefficients with Wald statistics.
We know that vec is an operator that stacks the columns of a $k\times k$ matrix into one $k^2\times 1$ vector. A similar operator, vech, stacks only the elements on and under the principal diagonal (so it transforms a $k\times k$ matrix into one $k(k+1)/2 \times 1$ vector). For example,
\[
A = \begin{pmatrix}a_{11} & a_{12}\\ a_{21} & a_{22}\end{pmatrix},
\qquad
\mathrm{vech}(A) = \begin{pmatrix}a_{11}\\ a_{21}\\ a_{22}\end{pmatrix}.
\]
We will apply this operator to the variance matrix $\Omega$, which is symmetric. The joint distribution of $\hat\Pi_n$ and $\hat\Omega_n$ is given in the following proposition.

Proposition 4  Let
\[
y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \ldots + \Phi_p y_{t-p} + \epsilon_t,
\]
where $\epsilon_t$ is i.i.d. $N(0,\Omega)$ and the roots of $|I_k - \Phi_1 z - \ldots - \Phi_p z^p| = 0$ lie outside the unit circle. Then
\[
\begin{pmatrix}
n^{1/2}(\hat\pi_n - \pi)\\
n^{1/2}[\mathrm{vech}(\hat\Omega_n) - \mathrm{vech}(\Omega)]
\end{pmatrix}
\to_d
N\left(
\begin{pmatrix}0\\ 0\end{pmatrix},\
\begin{pmatrix}\Omega\otimes Q^{-1} & 0\\ 0 & \Sigma_{22}\end{pmatrix}
\right).
\]
Let $\sigma_{ij}$ denote the ijth element of $\Omega$; then the element of $\Sigma_{22}$ corresponding to the covariance between $\hat\sigma_{ij}$ and $\hat\sigma_{lm}$ is given by $(\sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl})$ for all $i,j,l,m = 1,\ldots,k$.
The detailed proof can be found on pages 341-342 of Hamilton's book. Basically there are three steps. First, we show that $\hat\Omega_n = n^{-1}\sum_{t=1}^{n}\hat\epsilon_t\hat\epsilon_t'$ has the same asymptotic distribution as $n^{-1}\sum_{t=1}^{n}\epsilon_t\epsilon_t'$. In the second step, write
\[
\begin{pmatrix}
n^{1/2}(\hat\pi_n - \pi)\\
n^{1/2}[\mathrm{vech}(\hat\Omega_n) - \mathrm{vech}(\Omega)]
\end{pmatrix}
=
\begin{pmatrix}
(I_k\otimes Q^{-1})\,n^{-1/2}\sum_{t=1}^{n}\xi_t\\
n^{-1/2}\sum_{t=1}^{n}\lambda_t
\end{pmatrix}
+ o_p(1),
\]
where
\[
\lambda_t = \mathrm{vech}
\begin{pmatrix}
\epsilon_{1t}^2 - \sigma_{11} & \epsilon_{1t}\epsilon_{2t} - \sigma_{12} & \ldots & \epsilon_{1t}\epsilon_{kt} - \sigma_{1k}\\
\epsilon_{2t}\epsilon_{1t} - \sigma_{21} & \epsilon_{2t}^2 - \sigma_{22} & \ldots & \epsilon_{2t}\epsilon_{kt} - \sigma_{2k}\\
\vdots & & \ddots & \vdots\\
\epsilon_{kt}\epsilon_{1t} - \sigma_{k1} & \epsilon_{kt}\epsilon_{2t} - \sigma_{k2} & \ldots & \epsilon_{kt}^2 - \sigma_{kk}
\end{pmatrix}.
\]
Now $(\xi_t', \lambda_t')'$ is an mds, and we apply the CLT for mds to get (with a few more computations)
\[
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}\xi_t\\
n^{-1/2}\sum_{t=1}^{n}\lambda_t
\end{pmatrix}
\to_d
N\left(
\begin{pmatrix}0\\ 0\end{pmatrix},\
\begin{pmatrix}\Omega\otimes Q & 0\\ 0 & \Sigma_{22}\end{pmatrix}
\right).
\]
The final step in the proof is to show that $E(\lambda_t\lambda_t')$ is given by the matrix $\Sigma_{22}$ described in the proposition, which can be proved with a constructed error sequence that is uncorrelated Gaussian with zero mean and unit variance (see Hamilton's book for details).
With the asymptotic variance of $\hat\Omega_n$, we can then test whether two errors are correlated. For example, for $k = 2$,
\[
\sqrt{n}
\begin{pmatrix}
\hat\sigma_{11,n} - \sigma_{11}\\
\hat\sigma_{12,n} - \sigma_{12}\\
\hat\sigma_{22,n} - \sigma_{22}
\end{pmatrix}
\to_d
N\left(
\begin{pmatrix}0\\0\\0\end{pmatrix},\
\begin{pmatrix}
2\sigma_{11}^2 & 2\sigma_{11}\sigma_{12} & 2\sigma_{12}^2\\
2\sigma_{11}\sigma_{12} & \sigma_{11}\sigma_{22} + \sigma_{12}^2 & 2\sigma_{12}\sigma_{22}\\
2\sigma_{12}^2 & 2\sigma_{12}\sigma_{22} & 2\sigma_{22}^2
\end{pmatrix}
\right).
\]
Then a Wald test of the null hypothesis that there is no covariance between $\epsilon_{1t}$ and $\epsilon_{2t}$ is given by
\[
\frac{\sqrt{n}\,\hat\sigma_{12}}{(\hat\sigma_{11}\hat\sigma_{22} + \hat\sigma_{12}^2)^{1/2}} \to_d N(0,1).
\]
The matrix $\Sigma_{22}$ can be expressed more compactly using the duplication matrix. The duplication matrix $D_k$ is a matrix of size $k^2\times k(k+1)/2$ that transforms $\mathrm{vech}(\Omega)$ into $\mathrm{vec}(\Omega)$, i.e.
\[
D_k\,\mathrm{vech}(\Omega) = \mathrm{vec}(\Omega).
\]
For example, for $k = 2$,
\[
\begin{pmatrix}
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}\sigma_{11}\\ \sigma_{21}\\ \sigma_{22}\end{pmatrix}
=
\begin{pmatrix}\sigma_{11}\\ \sigma_{21}\\ \sigma_{12}\\ \sigma_{22}\end{pmatrix}.
\]
Define
\[
D_k^{+} = (D_k'D_k)^{-1}D_k'.
\]
Note that $D_k^{+}D_k = I_{k(k+1)/2}$. $D_k^{+}$ acts as the reverse of $D_k$, as it transforms $\mathrm{vec}(\Omega)$ into $\mathrm{vech}(\Omega)$:
\[
\mathrm{vech}(\Omega) = D_k^{+}\,\mathrm{vec}(\Omega).
\]
For example, when $k = 2$, we have
\[
\begin{pmatrix}\sigma_{11}\\ \sigma_{21}\\ \sigma_{22}\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0\\
0 & 1/2 & 1/2 & 0\\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}\sigma_{11}\\ \sigma_{21}\\ \sigma_{12}\\ \sigma_{22}\end{pmatrix}.
\]
With $D_k$ and $D_k^{+}$ we can write
\[
\Sigma_{22} = 2D_k^{+}(\Omega\otimes\Omega)(D_k^{+})'.
\]
Granger Causality
In most regressions in econometrics, it is very hard to discuss causality. For instance, the significance of the coefficient $\beta$ in the regression
\[
y_i = x_i\beta + \epsilon_i
\]
only tells us about the co-occurrence of $x$ and $y$, not that $x$ causes $y$. In other words, the regression usually only tells us that there is some relationship between $x$ and $y$; it does not tell us the nature of the relationship, such as whether $x$ causes $y$ or $y$ causes $x$.
One good thing about time series vector autoregressions is that we can test causality in a certain sense. This test was first proposed by Granger (1969), and we therefore refer to it as Granger causality. We restrict our discussion to a system of two variables, $x$ and $y$. $y$ is said to Granger-cause $x$ if current or lagged values of $y$ help to predict future values of $x$. On the other hand, $y$ fails to Granger-cause $x$ if, for all $s > 0$, the mean squared error of a forecast of $x_{t+s}$ based on $(x_t, x_{t-1}, \ldots)$ is the same as that based on $(y_t, y_{t-1}, \ldots)$ and $(x_t, x_{t-1}, \ldots)$. If we restrict ourselves to linear functions, $y$ fails to Granger-cause $x$ if
\[
\mathrm{MSE}[\hat E(x_{t+s}\mid x_t, x_{t-1}, \ldots)]
= \mathrm{MSE}[\hat E(x_{t+s}\mid x_t, x_{t-1}, \ldots, y_t, y_{t-1}, \ldots)].
\]
Equivalently, we can say that $x$ is exogenous in the time series sense with respect to $y$, or that $y$ is not linearly informative about future $x$.
In the VAR equations, the case described above implies a lower triangular coefficient matrix:
\[
\begin{pmatrix}x_t\\ y_t\end{pmatrix}
=
\begin{pmatrix}c_1\\ c_2\end{pmatrix}
+
\begin{pmatrix}\phi_{11}^{1} & 0\\ \phi_{21}^{1} & \phi_{22}^{1}\end{pmatrix}
\begin{pmatrix}x_{t-1}\\ y_{t-1}\end{pmatrix}
+ \ldots +
\begin{pmatrix}\phi_{11}^{p} & 0\\ \phi_{21}^{p} & \phi_{22}^{p}\end{pmatrix}
\begin{pmatrix}x_{t-p}\\ y_{t-p}\end{pmatrix}
+
\begin{pmatrix}\epsilon_{1t}\\ \epsilon_{2t}\end{pmatrix}.
\tag{11}
\]
Or, if we use MA representations,
\[
\begin{pmatrix}x_t\\ y_t\end{pmatrix}
=
\begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}
+
\begin{pmatrix}\psi_{11}(L) & 0\\ \psi_{21}(L) & \psi_{22}(L)\end{pmatrix}
\begin{pmatrix}\epsilon_{1t}\\ \epsilon_{2t}\end{pmatrix},
\tag{12}
\]
where
\[
\psi_{ij}(L) = \psi_{ij}^{0} + \psi_{ij}^{1}L + \psi_{ij}^{2}L^2 + \ldots,
\]
with $\psi_{11}^{0} = \psi_{22}^{0} = 1$ and $\psi_{21}^{0} = 0$. An alternative characterization is due to Sims (1972): consider a regression of $y$ on past, current, and future values of $x$,
\[
y_t = c + \sum_{j=0}^{\infty}b_j x_{t-j} + \sum_{j=1}^{\infty}d_j x_{t+j} + \eta_t;
\tag{13}
\]
then $y$ fails to Granger-cause $x$ if and only if $d_1 = d_2 = \ldots = 0$.
Note: we have to be aware that Granger causality is not the same as what we usually mean by causality. For instance, even if $x_1$ does not cause $x_2$, it may still help to predict $x_2$, and thus Granger-cause $x_2$, if changes in $x_1$ precede changes in $x_2$ for some reason. A simple example: we observe that a dragonfly flies much lower before a rain storm, due to the lower air pressure. We know that dragonflies do not cause rain storms, but their behavior does help to predict a rain storm, and thus Granger-causes it.
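In practice, the null that $y$ fails to Granger-cause $x$ is often checked by comparing a regression of $x_t$ on its own lags with one that adds lags of $y$. A minimal sketch, assuming NumPy (`granger_f_stat` and `_lags` are hypothetical helper names):

```python
import numpy as np

def _lags(z, p):
    """Matrix whose row t holds z_{t-1}, ..., z_{t-p}, for t = p, ..., T-1."""
    T = len(z)
    return np.column_stack([z[p - j:T - j] for j in range(1, p + 1)])

def granger_f_stat(x, y, p):
    """F-type statistic for H0: y fails to Granger-cause x. Compares OLS
    of x_t on a constant and p lags of x (restricted) against OLS on a
    constant, p lags of x and p lags of y (unrestricted). Under H0 it is
    approximately F(p, n - 2p - 1); large values reject H0."""
    n = len(x) - p
    const = np.ones((n, 1))
    Xr = np.hstack([const, _lags(x, p)])
    Xu = np.hstack([const, _lags(x, p), _lags(y, p)])
    target = x[p:]

    def ssr(X):
        beta = np.linalg.lstsq(X, target, rcond=None)[0]
        resid = target - X @ beta
        return resid @ resid

    ssr_r, ssr_u = ssr(Xr), ssr(Xu)
    return ((ssr_r - ssr_u) / p) / (ssr_u / (n - Xu.shape[1]))
```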
Reading: Hamilton Ch. 10, 11, 14.
Introduction
Recall that a process is covariance stationary if it has constant expectation, finite variance, and
its antocovariance functions do not depend on time. In this lecture, we will introduce one class of
processes that are nonstationary processes with deterministic trend. In the next lecture, we will
introduce another type processes with stochastic trend.
We are familiar with a stationary ARMA process,
\[
\tilde x_t = \psi(L)u_t.
\]
Now consider an ARMA process with a deterministic trend,
\[
x_t = \delta t + \tilde x_t = \delta t + \psi(L)u_t.
\tag{1}
\]
The $k$-step-ahead forecast is
\[
E_t(x_{t+k}) = \delta(t+k) + E_t(u_{t+k} + \psi_1 u_{t+k-1} + \ldots + \psi_k u_t + \psi_{k+1}u_{t-1} + \ldots)
= \delta(t+k) + \psi_k u_t + \psi_{k+1}u_{t-1} + \ldots,
\]
with forecast error variance
\[
E_t(x_{t+k} - E_t(x_{t+k}))^2 = E_t(u_{t+k} + \psi_1 u_{t+k-1} + \ldots + \psi_{k-1}u_{t+1})^2
= (1 + \psi_1^2 + \psi_2^2 + \ldots + \psi_{k-1}^2)\sigma_u^2.
\]
Note that since $\tilde x_t = \psi(L)u_t$ is a stationary process, as $k \to \infty$ the forecast error variance converges to the unconditional variance of $\tilde x_t$, which is bounded. This is a very important difference between processes with a deterministic trend and those with a stochastic trend. Another feature of a trend stationary process is that, given a shock at time $t$, its effect on the level of $\tilde x$, hence of $x$, eventually dies off, as in a stationary process. This is another difference compared to a unit root process. We will discuss this more in the next lecture.
Figure 1 plots a simulated path of (1), where $\delta = 1$, $u_t \sim N(0,1)$ and $\psi(L) = 1 + 0.5L$.
2 Estimation
2.1 The simple time trend model
Consider the simple time trend model
\[
y_t = \alpha + \delta t + u_t = x_t'\beta + u_t,
\tag{2}
\]
where $\beta' = [\alpha, \delta]$, $x_t' = [1, t]$, and $u_t \sim$ i.i.d. $N(0,\sigma^2)$. We can use MLE to estimate the parameters, and the MLE estimator is equivalent to the OLS estimator, so we will only discuss OLS estimation, which is applicable to a more general class of errors. In the following analysis, we assume $u_t \sim$ i.i.d.$(0,\sigma^2)$ and $E(u_t^4) < \infty$.
The OLS estimate of $\beta$ is
\[
b_n = \left[\sum_{t=1}^{n}x_t x_t'\right]^{-1}\left[\sum_{t=1}^{n}x_t y_t\right],
\tag{3}
\]
so that
\[
b_n - \beta = \left[\sum_{t=1}^{n}x_t x_t'\right]^{-1}\left[\sum_{t=1}^{n}x_t u_t\right],
\tag{4}
\]
where
\[
\sum_{t=1}^{n}x_t x_t' =
\begin{pmatrix}
n & \sum_t t\\
\sum_t t & \sum_t t^2
\end{pmatrix}.
\]
Recall that
\[
\frac{1}{n^2}\sum_{t=1}^{n}t \to \frac{1}{2},
\qquad
\frac{1}{n^3}\sum_{t=1}^{n}t^2 \to \frac{1}{3},
\]
or, in general,
\[
\frac{1}{n^{r+1}}\sum_{t=1}^{n}t^r \to \frac{1}{r+1}.
\]
Therefore
\[
\frac{1}{n^3}\sum_{t=1}^{n}x_t x_t' \to
\begin{pmatrix}0 & 0\\ 0 & \frac{1}{3}\end{pmatrix}.
\]
Unfortunately, this limiting matrix is singular and cannot be inverted. It turns out that, to obtain nondegenerate limiting distributions, $\hat\alpha_n$ needs to be rescaled by $n^{1/2}$ and $\hat\delta_n$ by $n^{3/2}$. Therefore, to get a proper limit for $b_n$, we normalize it with the matrix
\[
H_n = \begin{pmatrix}n^{1/2} & 0\\ 0 & n^{3/2}\end{pmatrix}.
\]
Now premultiply $(b_n - \beta)$ by $H_n$:
\[
H_n(b_n - \beta) =
\begin{pmatrix}
n^{1/2}(\hat\alpha_n - \alpha)\\
n^{3/2}(\hat\delta_n - \delta)
\end{pmatrix}
= \left[H_n^{-1}\sum_{t=1}^{n}x_t x_t'H_n^{-1}\right]^{-1}\left[H_n^{-1}\sum_{t=1}^{n}x_t u_t\right],
\]
where
\[
H_n^{-1}\sum_{t=1}^{n}x_t x_t'H_n^{-1} \to Q =
\begin{pmatrix}1 & \frac{1}{2}\\ \frac{1}{2} & \frac{1}{3}\end{pmatrix}.
\]
Next, we need to derive the asymptotic distribution of $H_n^{-1}\left(\sum_{t=1}^{n}x_t u_t\right)$:
\[
H_n^{-1}\sum_{t=1}^{n}x_t u_t =
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}u_t\\
n^{-3/2}\sum_{t=1}^{n}t\,u_t
\end{pmatrix}
=
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}u_t\\
n^{-1/2}\sum_{t=1}^{n}(t/n)u_t
\end{pmatrix}.
\]
The first element is asymptotically $N(0,\sigma^2)$ by the usual CLT. For the second element, $\{(t/n)u_t\}$ is a martingale difference sequence with variance $\sigma_t^2 = (t^2/n^2)\sigma^2$, and
\[
\frac{1}{n}\sum_{t=1}^{n}\sigma_t^2 = \sigma^2\,\frac{1}{n}\sum_{t=1}^{n}(t^2/n^2) \to \sigma^2/3,
\]
\[
\frac{1}{n}\sum_{t=1}^{n}\left\{[(t/n)u_t]^2 - \sigma_t^2\right\} \to_p 0.
\tag{5}
\]
Since $E[(t/n)^4 u_t^4] \le E(u_t^4) < \infty$, (5) holds by the law of large numbers for mds. Now all three conditions are satisfied, and we can apply the CLT for mds:
\[
n^{-1/2}\sum_{t=1}^{n}(t/n)u_t \to N(0,\ \sigma^2/3).
\]
The remaining task is to show that $\{n^{-1/2}\sum_{t=1}^{n}u_t\}$ and $\{n^{-1/2}\sum_{t=1}^{n}(t/n)u_t\}$ are asymptotically jointly normal. To show that they are jointly normal, it suffices to show that any linear combination of these two series is asymptotically normal, i.e., that
\[
n^{-1/2}\sum_{t=1}^{n}[\lambda_1 + \lambda_2(t/n)]u_t \to N(0,\ \sigma^2\lambda'Q\lambda)
\qquad\text{for } \lambda = (\lambda_1, \lambda_2)'.
\]
The summands $\{[\lambda_1 + \lambda_2(t/n)]u_t\}$ form an mds with variance $\sigma^2[\lambda_1^2 + 2\lambda_1\lambda_2(t/n) + \lambda_2^2(t/n)^2]$, and
\[
\frac{\sigma^2}{n}\sum_{t=1}^{n}\left[\lambda_1^2 + 2\lambda_1\lambda_2(t/n) + \lambda_2^2(t/n)^2\right]
\to \sigma^2\left[\lambda_1^2 + 2\lambda_1\lambda_2\,\tfrac{1}{2} + \lambda_2^2\,\tfrac{1}{3}\right]
= \sigma^2\lambda'Q\lambda.
\]
Furthermore,
\[
\frac{1}{n}\sum_{t=1}^{n}[\lambda_1 + \lambda_2(t/n)]^2 u_t^2 \to_p \sigma^2\lambda'Q\lambda.
\]
So we can apply the CLT and find that this linear combination of the two elements converges to a Gaussian distribution; this hence implies that the two elements are jointly Gaussian:
\[
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}u_t\\
n^{-1/2}\sum_{t=1}^{n}(t/n)u_t
\end{pmatrix}
\to N(0,\ \sigma^2 Q).
\]
Therefore, we get
\[
H_n^{-1}\sum_{t=1}^{n}x_t u_t =
\begin{pmatrix}
n^{-1/2}\sum_{t=1}^{n}u_t\\
n^{-3/2}\sum_{t=1}^{n}t\,u_t
\end{pmatrix}
\to N(0,\ \sigma^2 Q),
\]
and hence
\[
H_n(b_n - \beta) \to N(0,\ Q^{-1}(\sigma^2 Q)Q^{-1}) = N(0,\ \sigma^2 Q^{-1}).
\]
Proposition 1  Let $y_t$ be generated according to the simple deterministic time trend model (2), where $u_t \sim$ i.i.d.$(0,\sigma^2)$ with finite fourth moment. Then
\[
\begin{pmatrix}
n^{1/2}(\hat\alpha_n - \alpha)\\
n^{3/2}(\hat\delta_n - \delta)
\end{pmatrix}
\to N\left(
\begin{pmatrix}0\\ 0\end{pmatrix},\
\sigma^2
\begin{pmatrix}1 & \frac{1}{2}\\ \frac{1}{2} & \frac{1}{3}\end{pmatrix}^{-1}
\right).
\]
Note that for the estimate of $\delta$, we not only have $\hat\delta_n \to_p \delta$, we also have $n(\hat\delta_n - \delta) \to_p 0$. In this case, the estimate is said to be superconsistent.
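The two rates in Proposition 1 are easy to see in a small Monte Carlo. A sketch assuming NumPy (`mc_trend_err` is a hypothetical helper name):

```python
import numpy as np

def mc_trend_err(n, reps=200, delta=0.5, seed=0):
    """Average |delta_hat - delta| over Monte Carlo replications of
    y_t = 1 + delta*t + u_t, u_t ~ N(0,1), estimated by OLS."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1)
    X = np.column_stack([np.ones(n), t])
    errs = []
    for _ in range(reps):
        y = 1.0 + delta * t + rng.standard_normal(n)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(abs(b[1] - delta))
    return np.mean(errs)

# quadrupling n should shrink the trend-coefficient error by roughly
# 4^{3/2} = 8, reflecting the n^{3/2} rate (versus n^{1/2} for alpha)
e100, e400 = mc_trend_err(100), mc_trend_err(400)
```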
2.2 Hypothesis testing
When the innovation term $u_t$ is Gaussian, and since in the simple trend model the regressors are deterministic, the OLS estimates $\hat\alpha_n$ and $\hat\delta_n$ are Gaussian, and the usual OLS $t$ and $F$ tests have exact small-sample $t$ and $F$ distributions. In this section, we consider the case when $u_t$ is non-Gaussian.
We first consider a test of a null hypothesis on $\alpha$, say $\alpha = a$. Let $s_n^2$ be the OLS estimate of $\sigma^2$: $s_n^2 = \frac{1}{n-2}\sum_{t=1}^{n}\hat u_t^2$. Then the $t$ statistic is
\[
t_n = \frac{\hat\alpha_n - a}{\left\{s_n^2\,[1\ 0](X_n'X_n)^{-1}[1\ 0]'\right\}^{1/2}}
= \frac{\sqrt{n}(\hat\alpha_n - a)}{\left\{s_n^2\,[1\ 0]\,\sqrt{n}(X_n'X_n)^{-1}\sqrt{n}\,[1\ 0]'\right\}^{1/2}}
= \frac{\sqrt{n}(\hat\alpha_n - a)}{\left\{s_n^2\,[1\ 0]\,H_n(X_n'X_n)^{-1}H_n\,[1\ 0]'\right\}^{1/2}},
\]
using $\sqrt{n}\,[1\ 0] = [1\ 0]\,H_n$. Since $H_n(X_n'X_n)^{-1}H_n \to Q^{-1}$ and $s_n^2 \to_p \sigma^2$,
\[
t_n \to \frac{\sqrt{n}(\hat\alpha_n - a)}{\left\{\sigma^2\,[1\ 0]\,Q^{-1}\,[1\ 0]'\right\}^{1/2}}
= \frac{\sqrt{n}(\hat\alpha_n - a)}{\sqrt{\sigma^2 q^{11}}},
\]
where $q^{11}$ is the (1,1) element of $Q^{-1}$. This is an asymptotically Gaussian variable divided by the square root of its variance, so it has a $N(0,1)$ distribution.
Similarly, to test the null hypothesis $\delta = b$, write
\[
t_n = \frac{\hat\delta_n - b}{\left\{s_n^2\,[0\ 1](X_n'X_n)^{-1}[0\ 1]'\right\}^{1/2}}
= \frac{n^{3/2}(\hat\delta_n - b)}{\left\{s_n^2\,[0\ 1]\,H_n(X_n'X_n)^{-1}H_n\,[0\ 1]'\right\}^{1/2}}
\to \frac{n^{3/2}(\hat\delta_n - b)}{\sqrt{\sigma^2 q^{22}}} \sim N(0,1),
\]
where $q^{22}$ is the (2,2) element of $Q^{-1}$.
Next, consider a test of a hypothesis involving both coefficients, $H_0: r_1\alpha + r_2\delta = r$. The $t$ statistic is
\[
t_n = \frac{r_1\hat\alpha_n + r_2\hat\delta_n - r}{\left\{s_n^2\,[r_1\ r_2](X_n'X_n)^{-1}[r_1\ r_2]'\right\}^{1/2}}
= \frac{\sqrt{n}(r_1\hat\alpha_n + r_2\hat\delta_n - r)}{\left\{s_n^2\,n\,[r_1\ r_2](X_n'X_n)^{-1}[r_1\ r_2]'\right\}^{1/2}}.
\]
Write $\sqrt{n}\,[r_1\ r_2] = r_n'H_n$, where
\[
r_n = \sqrt{n}\,H_n^{-1}\begin{pmatrix}r_1\\ r_2\end{pmatrix}
= \begin{pmatrix}r_1\\ r_2/n\end{pmatrix}
\to \begin{pmatrix}r_1\\ 0\end{pmatrix}.
\]
Since $\hat\delta_n$ is superconsistent,
\[
\sqrt{n}(r_1\hat\alpha_n + r_2\hat\delta_n - r)
= r_1\sqrt{n}(\hat\alpha_n - \alpha) + r_2\,n^{-1}\cdot n^{3/2}(\hat\delta_n - \delta)
= r_1\sqrt{n}(\hat\alpha_n - \alpha) + o_p(1),
\]
so the numerator behaves like $r_1\sqrt{n}(\hat\alpha_n - \alpha)$, the denominator converges to $(\sigma^2 r_1^2 q^{11})^{1/2}$, and
\[
t_n \to \frac{r_1\sqrt{n}(\hat\alpha_n - \alpha)}{(\sigma^2 r_1^2 q^{11})^{1/2}} \sim N(0,1).
\]
Even though the test involves both $\hat\alpha_n$ and $\hat\delta_n$, the asymptotic distribution is dominated by the slower-converging estimate $\hat\alpha_n$.
Finally, consider the joint hypothesis
\[
H_0:\ \begin{pmatrix}\alpha\\ \delta\end{pmatrix} = \begin{pmatrix}a\\ b\end{pmatrix},
\]
or, in vector form, $\beta = \beta_0$. The Wald statistic is
\[
W_n = (b_n - \beta_0)'\,\left[s_n^2(X_n'X_n)^{-1}\right]^{-1}(b_n - \beta_0)
= [H_n(b_n - \beta_0)]'\,\left[s_n^2\,H_n(X_n'X_n)^{-1}H_n\right]^{-1}[H_n(b_n - \beta_0)].
\]
Then we have
\[
W_n \to \chi^2(2).
\]
2.3 An autoregression with a time trend
Consider the process
\[
y_t = \alpha + \delta t + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + u_t,
\]
or in matrix form,
\[
y_t = x_t'\beta + u_t,
\]
where $x_t' = [y_{t-1}, y_{t-2}, \ldots, y_{t-p}, 1, t]$ and $\beta' = [\phi_1, \ldots, \phi_p, \alpha, \delta]$. Sims, Stock and Watson (1990) suggest that we find a matrix $G$ and use it to transform this process to
\[
y_t = x_t'G'[G']^{-1}\beta + u_t = \tilde x_t'\tilde\beta + u_t,
\]
where $\tilde x_t = Gx_t = [\tilde y_{t-1}, \tilde y_{t-2}, \ldots, \tilde y_{t-p}, 1, t]'$ and $\tilde\beta = [G']^{-1}\beta = [\phi_1^*, \phi_2^*, \ldots, \phi_p^*, \alpha^*, \delta^*]'$.
The idea is that, after the transformation, we can write $y_t$ in terms of zero-mean covariance stationary processes ($\tilde y_{t-j}$), a constant, and a time trend. In doing this, we isolate the components of the OLS coefficient vector with different rates of convergence. After the transformation, $\hat\phi_{1,n}^*, \hat\phi_{2,n}^*, \ldots$ converge at the usual rate $\sqrt{n}$, while $\hat\alpha_n^*$ and $\hat\delta_n^*$ behave asymptotically like $\hat\alpha_n$ and $\hat\delta_n$ in the simple time trend model.
The matrix $G'$ is of dimension $(p+2)\times(p+2)$. With the detrended lags defined as $\tilde y_{t-j} = y_{t-j} - \alpha - \delta(t-j) = y_{t-j} - (\alpha - j\delta) - \delta t$, we can take
\[
G' =
\begin{pmatrix}
1 & 0 & \ldots & 0 & 0 & 0\\
0 & 1 & \ldots & 0 & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\
0 & 0 & \ldots & 1 & 0 & 0\\
-\alpha+\delta & -\alpha+2\delta & \ldots & -\alpha+p\delta & 1 & 0\\
-\delta & -\delta & \ldots & -\delta & 0 & 1
\end{pmatrix},
\]
\[
[G']^{-1} =
\begin{pmatrix}
1 & 0 & \ldots & 0 & 0 & 0\\
0 & 1 & \ldots & 0 & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\
0 & 0 & \ldots & 1 & 0 & 0\\
\alpha-\delta & \alpha-2\delta & \ldots & \alpha-p\delta & 1 & 0\\
\delta & \delta & \ldots & \delta & 0 & 1
\end{pmatrix}.
\]
The relation between the OLS coefficient estimates before and after the transformation is $\tilde b_n = [G']^{-1}b_n$ and $b_n = G'\tilde b_n$. A simple example to understand this transformation is the model
\[
y_t = \phi y_{t-1} + \alpha + u_t,
\tag{6}
\]
for which we know that $E(y_t) = \alpha/(1-\phi) \equiv \mu$. We can write
\[
y_t = \phi y_{t-1} + \alpha + u_t
= \phi\left[y_{t-1} - \frac{\alpha}{1-\phi}\right] + \frac{\phi\alpha}{1-\phi} + \alpha + u_t
= \phi\,\tilde y_{t-1} + \frac{\alpha}{1-\phi} + u_t,
\]
where $\tilde y_{t-1} = y_{t-1} - \mu$; here
\[
G = \begin{pmatrix}1 & -\mu\\ 0 & 1\end{pmatrix},
\qquad
[G']^{-1} = \begin{pmatrix}1 & 0\\ \mu & 1\end{pmatrix}.
\]
The advantage of this transformation is that $\tilde y_{t-1}$ is now a zero-mean process. When a time trend is included in the process, we will see a similar fact: $y_t$ is demeaned and detrended.
To derive the asymptotic distribution of $\tilde b_n$, define the $(p+2)\times(p+2)$ scaling matrix
\[
H_n =
\begin{pmatrix}
\sqrt{n} & 0 & \ldots & 0 & 0 & 0\\
0 & \sqrt{n} & \ldots & 0 & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\
0 & 0 & \ldots & \sqrt{n} & 0 & 0\\
0 & 0 & \ldots & 0 & \sqrt{n} & 0\\
0 & 0 & \ldots & 0 & 0 & n^{3/2}
\end{pmatrix};
then the OLS estimates satisfy
\[
H_n(\tilde b_n - \tilde\beta) = \left[H_n^{-1}\sum_{t=1}^{n}\tilde x_t\tilde x_t'H_n^{-1}\right]^{-1}\left[H_n^{-1}\sum_{t=1}^{n}\tilde x_t u_t\right].
\tag{7}
\]
Partition
\[
H_n^{-1}\sum_{t=1}^{n}\tilde x_t\tilde x_t'H_n^{-1} =
\begin{pmatrix}A_n & B_n\\ B_n' & C_n\end{pmatrix}.
\]
The elements of $A_n$ ($p\times p$) take the form $n^{-1}\sum_{t=1}^{n}\tilde y_{t-i}\tilde y_{t-j}$ for $i,j = 1,\ldots,p$, which converges to $\gamma_y(|i-j|)$. We let $Q_{11}$ denote the limiting matrix of $A_n$: $A_n \to_p Q_{11}$. Next, the elements of $B_n$ ($p\times 2$) take the form $n^{-1}\sum_{t=1}^{n}\tilde y_{t-i}$ and $n^{-1}\sum_{t=1}^{n}(t/n)\tilde y_{t-i}$, and we know that all of these elements converge to zero: $B_n \to_p 0$. Finally, the matrix $C_n$ ($2\times 2$) is
\[
\begin{pmatrix}
1 & n^{-2}\sum_t t\\
n^{-2}\sum_t t & n^{-3}\sum_t t^2
\end{pmatrix}
\to
\begin{pmatrix}1 & \frac{1}{2}\\ \frac{1}{2} & \frac{1}{3}\end{pmatrix}
\equiv Q_{22},
\]
which is just the $Q$ matrix in our simple time trend model. Thus we have
\[
H_n^{-1}\sum_{t=1}^{n}\tilde x_t\tilde x_t'H_n^{-1} \to Q^* =
\begin{pmatrix}Q_{11} & 0\\ 0 & Q_{22}\end{pmatrix}.
\]
For the second factor in (7),
\[
H_n^{-1}\sum_{t=1}^{n}\tilde x_t u_t = n^{-1/2}
\begin{pmatrix}
\sum_t\tilde y_{t-1}u_t\\
\sum_t\tilde y_{t-2}u_t\\
\vdots\\
\sum_t\tilde y_{t-p}u_t\\
\sum_t u_t\\
\sum_t(t/n)u_t
\end{pmatrix}
\equiv n^{-1/2}\sum_{t=1}^{n}\zeta_t,
\]
and we have
\[
\frac{1}{n}\sum_{t=1}^{n}E(\zeta_t\zeta_t') \to \sigma^2 Q^*,
\qquad
n^{-1/2}\sum_{t=1}^{n}\zeta_t \to N(0,\ \sigma^2 Q^*).
\]
Putting the two pieces together, we have
\[
H_n(\tilde b_n - \tilde\beta) \to N(0,\ [Q^*]^{-1}\sigma^2 Q^*[Q^*]^{-1}) = N(0,\ \sigma^2[Q^*]^{-1}),
\qquad
[Q^*]^{-1} =
\begin{pmatrix}[Q_{11}]^{-1} & 0\\ 0 & [Q_{22}]^{-1}\end{pmatrix}.
\]
Now, given the asymptotic distribution of the estimates $\tilde b_n$, what are the results for $b_n$, the estimates of the coefficients in the original model? We have $b_n = G'\tilde b_n$, or in matrix form
\[
\begin{pmatrix}
\hat\phi_1\\ \hat\phi_2\\ \vdots\\ \hat\phi_p\\ \hat\alpha\\ \hat\delta
\end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \ldots & 0 & 0 & 0\\
0 & 1 & \ldots & 0 & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots\\
0 & 0 & \ldots & 1 & 0 & 0\\
-\alpha+\delta & -\alpha+2\delta & \ldots & -\alpha+p\delta & 1 & 0\\
-\delta & -\delta & \ldots & -\delta & 0 & 1
\end{pmatrix}
\begin{pmatrix}
\hat\phi_1^*\\ \hat\phi_2^*\\ \vdots\\ \hat\phi_p^*\\ \hat\alpha^*\\ \hat\delta^*
\end{pmatrix}.
\]
Note that $\hat\phi_j$ is identical to $\hat\phi_j^*$, so for $\hat\phi_n = (\hat\phi_{1,n},\ldots,\hat\phi_{p,n})'$ we have
\[
\sqrt{n}(\hat\phi_n - \phi) \to N(0,\ \sigma^2[Q_{11}]^{-1}).
\]
Next, $\hat\alpha_n$ is a linear combination of variables that converge to a Gaussian distribution at rate $\sqrt{n}$, so $\hat\alpha_n$ behaves the same way. Let
\[
g_\alpha' = \begin{pmatrix}-\alpha+\delta & -\alpha+2\delta & \ldots & -\alpha+p\delta & 1 & 0\end{pmatrix};
\]
then $\hat\alpha_n = g_\alpha'\tilde b_n$ and
\[
\sqrt{n}(\hat\alpha_n - \alpha) \to N(0,\ \sigma^2 g_\alpha'[Q^*]^{-1}g_\alpha).
\]
For $\hat\delta_n$, let $g_\delta' = (-\delta, \ldots, -\delta, 0, 1)$, so that $\hat\delta_n = g_\delta'\tilde b_n$. Its asymptotic distribution is governed by the variables with the slowest rate of convergence: since $\sqrt{n}(\hat\delta_n^* - \delta^*) = n^{-1}\cdot n^{3/2}(\hat\delta_n^* - \delta^*) \to_p 0$,
\[
\sqrt{n}(\hat\delta_n - \delta) = -\delta\sum_{j=1}^{p}\sqrt{n}(\hat\phi_{j,n}^* - \phi_j^*) + o_p(1)
\to N(0,\ \sigma^2\,g^{*\prime}[Q^*]^{-1}g^*),
\]
where $g^{*\prime} = (-\delta, \ldots, -\delta, 0, 0)$.
Introduction
1.1 Unit root processes
In this lecture, we will discuss a very important type of process: unit root processes. For an AR(1) process
\[
x_t = \phi x_{t-1} + u_t
\tag{1}
\]
to be stationary, we require that $|\phi| < 1$; in an AR(p) process, we require that all the roots of
\[
1 - \phi_1 z - \ldots - \phi_p z^p = 0
\]
lie outside the unit circle. In the stationary AR(1) case, the unconditional moments are constant, e.g. $E(x_t^2) = \sigma^2/(1-\phi^2)$. When $\phi = 1$, we no longer have constant unconditional moments; the first two conditional moments are
\[
E(x_t\mid\mathcal{F}_{t-1}) = x_{t-1},
\qquad
\mathrm{Var}(x_t\mid\mathcal{F}_{t-1}) = \sigma^2,
\]
while (with $x_0 = 0$) $E(x_t^2) = t\sigma^2$ grows with $t$.
[Figure: simulated sample paths of an AR(1) process with $\phi = 0.9$ and of a random walk, $\phi = 1$.]
Consider forecasting. For $|\phi| < 1$, the $k$-step-ahead forecast is
\[
E(x_{t+k}\mid\mathcal{F}_t) = \phi^k x_t.
\]
Since $|\phi| < 1$, $\phi^k x_t \to 0 = E(x_t)$ as $k \to \infty$. So as the forecasting horizon increases, the current value of $x_t$ matters less and less, since the conditional expectation converges to the unconditional expectation. The variance of the forecast is
\[
\mathrm{Var}(x_{t+k}\mid\mathcal{F}_t) = (1 + \phi^2 + \ldots + \phi^{2k-2})\sigma^2,
\]
which converges to $\sigma^2/(1-\phi^2)$ as $k \to \infty$.
Next, consider the case when $\phi = 1$. Then
\[
E(x_{t+k}\mid\mathcal{F}_t) = x_t,
\]
which means that the current value does matter (actually it is the only thing that matters), even as $k \to \infty$! The variance of the forecast is
\[
\mathrm{Var}(x_{t+k}\mid\mathcal{F}_t) = k\sigma^2 \to \infty \quad\text{as } k \to \infty.
\]
[Figure: $k$-step-ahead forecasts and forecast standard errors for $\phi = 0.9$ and $\phi = 1$ ($x_0 = 1$, $\sigma = 1$).]
For the random walk, writing
\[
x_{t+h} = u_{t+h} + u_{t+h-1} + \ldots + u_1 + x_0,
\qquad\text{or, with } x_0 = 0,\quad
x_t = \sum_{k=1}^{t}u_k,
\]
the effect of $u_t$ on $x_{t+h}$ is one, which is independent of $h$. So if a process is a random walk, the effects of all shocks on the level of $\{x\}$ are permanent; in other words, the impulse-response function is flat at one.
Finally, we can compare the asymptotic distribution of the coefficient estimator for a stationary and for a nonstationary autoregressive process. For an AR(1) process, $x_t = \phi x_{t-1} + u_t$ where $|\phi| < 1$ and $u_t \sim$ i.i.d. $N(0,\sigma^2)$, we have shown in Lecture 6 that the MLE estimator of $\phi$ is asymptotically normal:
\[
n^{1/2}(\hat\phi_n - \phi_0) \sim N\left(0,\ \sigma^2\left(n^{-1}\sum_{t=1}^{n}x_t^2\right)^{-1}\right).
\]
However, if $\phi_0 = 1$, then $\left(n^{-1}\sum_{t=1}^{n}x_t^2\right)^{-1}$ goes to zero as $n \to \infty$. This implies that if $\phi_0 = 1$, then $\hat\phi_n$ converges at an order higher than $n^{1/2}$.
Above we have only considered the AR(1) process. In a general AR(p) process, if there is one unit root, then the process is a nonstationary unit root process. Consider an AR(2) example: let $\lambda_1 = 1$ and $\lambda_2 = 0.5$, so that
\[
(1 - L)(1 - 0.5L)x_t = \epsilon_t, \qquad \epsilon_t \sim \text{i.i.d.}(0,\sigma^2).
\]
Then
\[
(1 - L)x_t = (1 - 0.5L)^{-1}\epsilon_t = \psi(L)\epsilon_t \equiv u_t,
\]
so the difference of $x_t$, $\Delta x_t = x_t - x_{t-1} = u_t$, is a stationary process.
1.2 The spectrum of a unit root process

Recall that an AR(1) process with coefficient $\phi$ has spectrum proportional to
\[
\frac{1}{1 + \phi^2 - 2\phi\cos\omega}.
\]
When $\phi \to 1$, we have $S(\omega) = 1/[4(1-\cos\omega)]$ (up to scale); then, when $\omega \to 0$, $S(\omega) \to \infty$. So processes with a stochastic trend have an infinite spectrum at the origin. Recall that $S(\omega)$ decomposes the variance of a process into components contributed by each frequency, so the variance of a unit root process is largely contributed by the low frequencies. On the other hand, when we difference, we filter out the low frequencies, and what remains are the high frequencies.
In the previous lecture, we discussed processes with a deterministic trend. We can compare a process with a deterministic trend (DT) and a process with a stochastic trend (ST) from two perspectives. First, when we do $k$-period-ahead forecasting, as $k \to \infty$ the forecast error for DT converges to the variance of its stationary component, which is bounded; but as we saw in the previous section, the forecast error for ST diverges as $k \to \infty$. Second, the impulse-response function for DT is the same as in the stationary case: the effect of a shock dies out quickly. By contrast, the impulse-response function for ST is flat at one: the effects of all shocks on the level are permanent.
However, note that in Figure 1 we plotted a simulated random walk, and part of its path looks as if it has an upward time trend. This turns out to be a quite general problem: over a short time period, it is very hard to judge whether a process has a stochastic trend or a deterministic trend.
2.1 Brownian Motion
To derive statistical inference for a unit root process, we need to make use of a very important stochastic process: Brownian motion (also called the Wiener process). To understand Brownian motion, consider a random walk
\[
y_t = y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d. } N(0,1), \quad y_0 = 0,
\tag{3}
\]
so that
\[
y_t = \sum_{s=1}^{t}u_s \sim N(0, t),
\]
and the change between dates $s$ and $t$,
\[
y_t - y_s = u_{s+1} + u_{s+2} + \ldots + u_t = \sum_{i=s+1}^{t}u_i \sim N(0,\ t-s),
\]
is independent of the change between dates $r$ and $q$ for $s < t < r < q$.
and it is independent of the change between dates r and q for s < t < r < q.
Next, consider the change yt yt 1 = ut i.i.d.N (0, 1). If we view ut as the sum of two
independent Gaussian variables,
1
ut = 1t + 2t , it i.i.d.N 0,
.
2
Then we can associate 1t with the change between yt 1 and the value of y at some interim point
(say, yt (1/2) ), and 2t with the change between yt (1/2) and yt :
yt
yt
(1/2)
= 1t
yt
yt
(1/2)
= 2t .
(4)
Sampled at integer dates t = 1, 2, . . ., the process of (4) has the same properties as (3), since
yt
yt
In addition, the process of (4) is defined also at the non-integer dates and remains the property
for both integer and non-integer dates that yt ys N (0, t s) with yt ys independent of the
change over any other nonoverlapping interval. Using the same reasoning, we could partition the
change between t 1 and t into N separate subperiods:
yt
yt
n
X
i=1
it ,
it i.i.d.N
1
0,
n
When $n \to \infty$, the limiting process is known as Brownian motion. The value of this process at date $t$ is denoted by $W(t)$. A realization of this continuous-time process can be viewed as a stochastic function $W(\cdot)$. In particular, we will be interested in Brownian motion over the interval $t \in [0,1]$.

Definition 1 (Brownian Motion)  A standard Brownian motion $W(t)$, $t \in [0,1]$, is a continuous-time stochastic process such that
(a) $W(0) = 0$;
(b) for any times $0 \le s < t \le 1$, $W(t) - W(s) \sim N(0,\ t-s)$, and the differences $W(t_2) - W(t_1)$ and $W(t_4) - W(t_3)$, for any $0 \le t_1 < t_2 \le t_3 < t_4 \le 1$, are independent.
(Recall that a function $f$ is said to be of bounded variation if there exists an $M$ such that, for any partition $x_0 < x_1 < \ldots < x_n$, $\sum_i |f(x_i) - f(x_{i-1})| \le M$. Brownian sample paths are continuous but are not of bounded variation.)
As an exercise, consider the distribution of $\int_0^1 W(r)\,dr$. It is Gaussian with zero mean, and using
$E[W(r)W(s)] = \min(r,s)$, its variance is
$$E\left[\left(\int_0^1 W(r)dr\right)^2\right] = \int_0^1\!\!\int_0^1 \min(r,s)\,ds\,dr = 2\int_0^1\!\!\int_0^r s\,ds\,dr = \int_0^1 r^2\,dr = \frac{1}{3}.$$
Therefore, $\int_0^1 W(r)dr \sim N(0, 1/3)$. As another exercise, consider the distribution of $W(1) - \int_0^1 W(r)dr$.
Again, it is Gaussian with zero mean. To compute its variance,
$$E\left[W(1) - \int_0^1 W(r)dr\right]^2 = 1 + \frac{1}{3} - 2E\left[W(1)\int_0^1 W(r)dr\right] = \frac{4}{3} - 2\int_0^1 r\,dr = \frac{4}{3} - 1 = \frac{1}{3}.$$
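Both variance calculations can be checked by Monte Carlo, approximating each Brownian path on a grid and $\int_0^1 W(r)dr$ by the average of the grid values. A sketch (grid size, number of replications, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 4000
int_W = np.empty(reps)
W1 = np.empty(reps)
for i in range(reps):
    # Brownian path on a grid: W(k/n) = sum of k i.i.d. N(0, 1/n) increments
    W = np.cumsum(rng.normal(0.0, np.sqrt(1.0 / n), size=n))
    int_W[i] = W.mean()   # Riemann approximation of int_0^1 W(r) dr
    W1[i] = W[-1]

var_int = int_W.var()           # should be close to 1/3
var_diff = (W1 - int_W).var()   # should also be close to 1/3
```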
Next, we state Ito's lemma: for $Y(r) = g(r, X(r))$,
$$dY(r) = \frac{\partial g}{\partial r}(r, x)\,dr + \frac{\partial g}{\partial x}(r, x)\,dX(r) + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}(r, x)\,(dX(r))^2,$$
where $(dX(r))^2 = (dX(r))\cdot(dX(r))$ is computed according to the rules
$$dr\cdot dr = dr\cdot dW(r) = dW(r)\cdot dr = 0, \qquad dW(r)\cdot dW(r) = dr.$$
As an example, let $Y(r) = g(r, X(r)) = \frac{1}{2}W(r)^2$, where $X(r) = W(r)$ and $dX(r) = dW(r)$. By
Ito's lemma,
$$dY(r) = \frac{\partial g}{\partial r}dr + \frac{\partial g}{\partial x}dX(r) + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}(dX(r))^2 = W(r)dW(r) + \frac{1}{2}(dW(r))^2 = W(r)dW(r) + \frac{1}{2}dr,$$
where the first derivative of $g$ with respect to $x$ gives $W(r)$. Hence
$$d\left[\frac{1}{2}W(r)^2\right] = W(r)dW(r) + \frac{1}{2}dr.$$
Integrating from 0 to 1,
$$\frac{1}{2}W(1)^2 = \int_0^1 W(r)dW(r) + \frac{1}{2},$$
therefore
$$\int_0^1 W(r)dW(r) = \frac{1}{2}(W(1)^2 - 1).$$

2.2 Functional Central Limit Theorem

2.2.1 The FCLT
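The identity $\int_0^1 W(r)dW(r) = (W(1)^2 - 1)/2$ has an exact finite-sample analogue for a random walk: summing $y_t^2 - y_{t-1}^2 = 2y_{t-1}u_t + u_t^2$ over $t$ gives $\sum_t y_{t-1}u_t = (y_n^2 - \sum_t u_t^2)/2$. A quick numerical check of this algebra (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
u = rng.normal(size=n)
y = np.cumsum(u)                        # random walk with y_0 = 0
y_lag = np.concatenate([[0.0], y[:-1]])

lhs = np.sum(y_lag * u)                      # sum of y_{t-1} * Delta y_t
rhs = 0.5 * (y[-1] ** 2 - np.sum(u ** 2))    # (y_n^2 - sum u_t^2) / 2
```

The two sides agree up to floating-point error; dividing by $n$ and letting $n \to \infty$ yields the stochastic-integral result above.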
Recall that the central limit theorem (CLT) tells us that the sample mean of a stationary process is
asymptotically normal and centered at the population mean. However, if $x_t$ is a random walk,
there is no such thing as a population mean. Therefore, to draw inference for processes with a unit
root, we need a new tool called the functional central limit theorem (FCLT): a central limit
theorem defined on function spaces. The FCLT is as important to unit root limit theory as the CLT
is to stationary time series limit theory.
As usual, let $n$ denote the sample size, and let $r = t/n$, so $r \in [0, 1]$. We use the symbol
$[nr]$ to denote the largest integer that is less than or equal to $nr$.
Consider an estimator of the mean based only on the first half of the sample,
$$\bar{u}_{[n/2]} = \frac{1}{[n/2]}\sum_{t=1}^{[n/2]} u_t,$$
for which $\sqrt{[n/2]}\,\bar{u}_{[n/2]} \to N(0, \sigma^2)$.
Moreover, this estimator would be independent of an estimator that uses only the second half of the
sample. More generally, let's construct a new random variable $X_n(r)$ for $r \in [0, 1]$,
$$X_n(r) = (1/n)\sum_{t=1}^{[nr]} u_t,$$
or
$$X_n(r) = \begin{cases} 0 & r \in [0, 1/n) \\ u_1/n & r \in [1/n, 2/n) \\ (u_1 + u_2)/n & r \in [2/n, 3/n) \\ \vdots & \\ (u_1 + u_2 + \ldots + u_n)/n & r = 1. \end{cases}$$
By the CLT, $n^{1/2}X_n(1) = n^{-1/2}\sum_{t=1}^n u_t \to N(0, \sigma^2)$. But what about $n^{1/2}X_n(r)$ for $r < 1$? Write
$$n^{1/2}X_n(r) = \frac{1}{\sqrt{n}}\sum_{t=1}^{[nr]} u_t = \frac{\sqrt{[nr]}}{\sqrt{n}}\cdot\frac{1}{\sqrt{[nr]}}\sum_{t=1}^{[nr]} u_t.$$
Since
$$\frac{1}{\sqrt{[nr]}}\sum_{t=1}^{[nr]} u_t \to N(0, \sigma^2) \tag{5}$$
and $[nr]/n \to r$, we have $n^{1/2}X_n(r) \to N(0, r\sigma^2)$.
Next, if we consider the behavior of a sample mean based on observations $[nr_1]$ through $[nr_2]$
for $r_2 > r_1$, a similar approach shows it is also asymptotically normal,
$$n^{1/2}\left(X_n(r_2) - X_n(r_1)\right) \to N(0, (r_2 - r_1)\sigma^2),$$
and it is independent of the estimator in (5) for $r < r_1$. Therefore, the sequence of stochastic
functions $\{\sqrt{n}X_n(\cdot)/\sigma\}_{n=1}^\infty$ has an asymptotic probability law:
$$\sqrt{n}X_n(\cdot)/\sigma \to W(\cdot). \tag{6}$$
Note that here $X_n(\cdot)$ is a function, while in (5) $X_n(r)$ is a random variable. The asymptotic
result (6) is known as the functional central limit theorem (FCLT). Later on, we may also write
$n^{1/2}X_n(r)/\sigma \to W(r)$, but note this does not mean that the variable $X_n(r)$ converges to a variable
with a $N(0, r)$ distribution; rather, the function converges to a stochastic function: the standard
Brownian motion.
Evaluated at $r = 1$, the function $X_n(r)$ is just the sample mean. Thus when the function in (6)
is evaluated at $r = 1$, we get the conventional CLT:
$$\sqrt{n}X_n(1)/\sigma = \frac{1}{\sigma\sqrt{n}}\sum_{t=1}^n u_t \to W(1) \sim N(0, 1). \tag{7}$$

2.2.2 Convergence of Random Functions
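The FCLT results (6)-(7) can be illustrated by Monte Carlo: build $X_n(r)$ from non-Gaussian shocks and check that $\sqrt{n}X_n(r)/\sigma$ has variance approximately $r$. A sketch (the uniform shock distribution, $n$, and the evaluation points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2000, 4000
sigma = np.sqrt(1.0 / 3.0)   # a uniform(-1, 1) shock has variance 1/3

vals_half = np.empty(reps)
vals_one = np.empty(reps)
for i in range(reps):
    u = rng.uniform(-1.0, 1.0, size=n)   # non-Gaussian i.i.d. shocks
    S = np.cumsum(u)
    vals_half[i] = np.sqrt(n) * (S[n // 2 - 1] / n) / sigma  # sqrt(n) X_n(1/2) / sigma
    vals_one[i] = np.sqrt(n) * (S[-1] / n) / sigma           # sqrt(n) X_n(1) / sigma

var_half = vals_half.var()   # FCLT: close to 1/2
var_one = vals_one.var()     # CLT (7): close to 1
```

The shocks are deliberately non-Gaussian; the Gaussian limit is produced by the aggregation, which is the content of the FCLT.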
In Lecture 4, we discussed various convergence concepts and the continuous mapping theorem for random
variables. Now, let's define convergence for a random function, such as the $X_n(r)$ defined earlier.
We first define convergence in distribution for a random function. Let $S(\cdot)$ represent a continuous-time
stochastic process, with $S(r)$ its value at some date $r$ for $r \in [0, 1]$. Also suppose
that for any given realization, $S(\cdot)$ is a continuous function of $r$ with probability 1. For $\{S_n(\cdot)\}_{n=1}^\infty$
a sequence of such continuous functions, we say that $S_n(\cdot) \to_d S(\cdot)$ if all of the following hold:
(a) For any finite collection of $k$ particular dates, $0 \le r_1 < r_2 < \ldots < r_k \le 1$, the sequence of
$k$-dimensional random vectors $\{y_n\}_{n=1}^\infty$ converges in distribution to the vector $y$, where
$$y_n \equiv \begin{bmatrix} S_n(r_1) \\ S_n(r_2) \\ \vdots \\ S_n(r_k) \end{bmatrix}, \qquad y \equiv \begin{bmatrix} S(r_1) \\ S(r_2) \\ \vdots \\ S(r_k) \end{bmatrix};$$
(b) For each $\varepsilon > 0$, the probability that $S_n(r_1)$ differs from $S_n(r_2)$ by more than $\varepsilon$ for any dates
$r_1$ and $r_2$ within $\delta$ of each other goes to zero uniformly in $n$ as $\delta \to 0$;

(c) $P(|S_n(0)| > \lambda) \to 0$ uniformly in $n$ as $\lambda \to \infty$.
Next, we extend convergence in probability to random functions. Let $\{S_n(\cdot)\}_{n=1}^\infty$ and
$\{V_n(\cdot)\}_{n=1}^\infty$ denote sequences of random continuous functions with $S_n : r \in [0,1] \mapsto \mathbb{R}$ and
$V_n : r \in [0,1] \mapsto \mathbb{R}$. Define $Y_n$ as
$$Y_n = \sup_{r\in[0,1]} |S_n(r) - V_n(r)|.$$
We say that $S_n(\cdot) - V_n(\cdot) \to_p 0$ if $Y_n \to_p 0$. For example, let $S_n(r) = n^{-1/2}u_{[nr]+1}$, where
$u_t$ is i.i.d. with finite fourth moment $E(u_t^4) < \infty$. Then for any $\delta > 0$,
$$P\left(\sup_{r\in[0,1]}|S_n(r)| > \delta\right) = P\left\{[|n^{-1/2}u_1| > \delta] \text{ or } [|n^{-1/2}u_2| > \delta], \ldots, \text{ or } [|n^{-1/2}u_n| > \delta]\right\}$$
$$\le nP(|n^{-1/2}u_t| > \delta) \le n\frac{E(n^{-1/2}u_t)^4}{\delta^4} = \frac{E(u_t^4)}{n\delta^4} \to 0.$$
So we have $S_n(\cdot) \to_p 0$.
In Lecture 4, we also reviewed the continuous mapping theorem (CMT): if $x_n \to x$
and $g(\cdot)$ is a continuous function, then $g(x_n) \to g(x)$. We have a similar result for the
FCLT: if $S_n(\cdot) \to S(\cdot)$ and $g(\cdot)$ is a continuous functional, then $g(S_n(\cdot)) \to g(S(\cdot))$. For example,
$\sqrt{n}X_n(\cdot)/\sigma \to_d W(\cdot)$ implies that
$$\sqrt{n}X_n(\cdot) \to_d \sigma W(\cdot) \sim N(0, \sigma^2 r). \tag{8}$$
As another example, let
$$S_n(r) \equiv [\sqrt{n}X_n(r)]^2. \tag{9}$$
Since squaring is a continuous mapping, by the CMT,
$$S_n(\cdot) \to_d \sigma^2[W(\cdot)]^2. \tag{10}$$

2.3 Applying the FCLT to a Random Walk
The simplest case illustrating how to use the FCLT to compute asymptotics is a random
walk $y_t$ with $y_0 = 0$,
$$y_t = y_{t-1} + u_t = \sum_{i=1}^t u_i, \qquad u_t \sim i.i.d.\,N(0, \sigma^2).$$
Define $X_n(\cdot)$ as:
$$X_n(r) = \begin{cases} 0 & r \in [0, 1/n) \\ y_1/n & r \in [1/n, 2/n) \\ y_2/n & r \in [2/n, 3/n) \\ \vdots & \\ y_n/n & r = 1, \end{cases} \tag{11}$$
so that
$$\int_0^1 \sqrt{n}X_n(r)\,dr = n^{-3/2}\sum_{t=1}^n y_{t-1}.$$
By the FCLT and the CMT, $\int_0^1 \sqrt{n}X_n(r)\,dr \to_d \sigma\int_0^1 W(r)dr$, that is,
$$n^{-3/2}\sum_{t=1}^n y_{t-1} \to_d \sigma\int_0^1 W(r)dr. \tag{12}$$
Thus, when $y_t$ is a driftless random walk, its sample mean $\frac{1}{n}\sum_{t=1}^n y_t$ diverges, but $n^{-3/2}\sum_{t=1}^n y_t$
converges. An alternative way to find the limit distribution of $n^{-3/2}\sum_{t=1}^n y_t$ follows:
$$n^{-3/2}\sum_{t=1}^n y_{t-1} = n^{-3/2}\left[(n-1)u_1 + (n-2)u_2 + \ldots + u_{n-1}\right] = n^{-3/2}\sum_{t=1}^n (n-t)u_t = n^{-1/2}\sum_{t=1}^n u_t - n^{-3/2}\sum_{t=1}^n tu_t.$$
The two terms on the right are jointly Gaussian with
$$\begin{bmatrix} n^{-1/2}\sum_{t=1}^n u_t \\ n^{-3/2}\sum_{t=1}^n tu_t \end{bmatrix} \to_d N\left(\begin{bmatrix}0\\0\end{bmatrix}, \sigma^2\begin{bmatrix}1 & 1/2 \\ 1/2 & 1/3\end{bmatrix}\right). \tag{13}$$
Therefore $n^{-3/2}\sum_{t=1}^n y_{t-1} \to_d N(0, \sigma^2[1 - 2(1/2) + 1/3]) = N(0, \sigma^2/3)$, which agrees with (12)
since $\int_0^1 W(r)dr \sim N(0, 1/3)$. Rearranging also gives
$$n^{-3/2}\sum_{t=1}^n tu_t = n^{-1/2}\sum_{t=1}^n u_t - n^{-3/2}\sum_{t=1}^n y_{t-1} \to_d \sigma\left[W(1) - \int_0^1 W(r)dr\right]. \tag{14}$$
Using similar methods, we could compute the asymptotic distribution of the sum of squares of
a random walk. Define
$$S_n(r) = n[X_n(r)]^2, \tag{15}$$
and it can be written as
$$S_n(r) = \begin{cases} 0 & r \in [0, 1/n) \\ y_1^2/n & r \in [1/n, 2/n) \\ y_2^2/n & r \in [2/n, 3/n) \\ \vdots & \\ y_n^2/n & r = 1. \end{cases}$$
By the CMT, $S_n(\cdot) \to_d \sigma^2W(\cdot)^2$. Since $\int_0^1 S_n(r)dr = n^{-2}\sum_{t=1}^n y_{t-1}^2$, it follows that
$$n^{-2}\sum_{t=1}^n y_{t-1}^2 \to_d \sigma^2\int_0^1 [W(r)]^2\,dr. \tag{16}$$
If we weight the terms by $t/n$, we similarly obtain
$$n^{-5/2}\sum_{t=1}^n ty_{t-1} = n^{-3/2}\sum_{t=1}^n (t/n)y_{t-1} \to_d \sigma\int_0^1 rW(r)\,dr, \tag{17}$$
and
$$n^{-3}\sum_{t=1}^n ty_{t-1}^2 = n^{-2}\sum_{t=1}^n (t/n)y_{t-1}^2 \to_d \sigma^2\int_0^1 rW(r)^2\,dr. \tag{18}$$
Finally, consider $n^{-1}\sum_{t=1}^n y_{t-1}u_t$.
Proof: first,
$$y_t^2 = (y_{t-1} + u_t)^2 = y_{t-1}^2 + 2y_{t-1}u_t + u_t^2,$$
so
$$n^{-1}\sum_{t=1}^n y_{t-1}u_t = (1/2)n^{-1}\sum_{t=1}^n (y_t^2 - y_{t-1}^2) - (1/2)n^{-1}\sum_{t=1}^n u_t^2 = (1/2)n^{-1}y_n^2 - (1/2)n^{-1}\sum_{t=1}^n u_t^2.$$
By (6), $n^{-1/2}y_n \to_d \sigma W(1)$, so
$$(1/2)n^{-1}y_n^2 \to_d (1/2)\sigma^2W(1)^2.$$
By the LLN,
$$(1/2)n^{-1}\sum_{t=1}^n u_t^2 \to_p (1/2)\sigma^2.$$
Therefore,
$$n^{-1}\sum_{t=1}^n y_{t-1}u_t \to_d (1/2)\sigma^2[W(1)^2 - 1]. \tag{19}$$
3.1 Unit Root Asymptotics with i.i.d. Errors

The asymptotics of a random walk with i.i.d. shocks are summarized in the following proposition.
The number in brackets shows where each result was first introduced and proved.

Proposition 1 Suppose that $\xi_t$ follows a random walk without drift,
$$\xi_t = \xi_{t-1} + u_t, \qquad \xi_0 = 0, \qquad u_t \sim i.i.d.(0, \sigma^2).$$
Then

(a) $n^{-1/2}\sum_{t=1}^n u_t \to_d \sigma W(1)$ [7];

(b) $n^{-1}\sum_{t=1}^n \xi_{t-1}u_t \to_d \frac{1}{2}\sigma^2[W(1)^2 - 1]$ [19];

(c) $n^{-3/2}\sum_{t=1}^n tu_t \to_d \sigma\left[W(1) - \int_0^1 W(r)dr\right]$ [14];

(d) $n^{-3/2}\sum_{t=1}^n \xi_{t-1} \to_d \sigma\int_0^1 W(r)dr$ [12];

(e) $n^{-2}\sum_{t=1}^n \xi_{t-1}^2 \to_d \sigma^2\int_0^1 W(r)^2dr$ [16];

(f) $n^{-5/2}\sum_{t=1}^n t\xi_{t-1} \to_d \sigma\int_0^1 rW(r)dr$ [17];

(g) $n^{-3}\sum_{t=1}^n t\xi_{t-1}^2 \to_d \sigma^2\int_0^1 rW(r)^2dr$ [18];

(h) $n^{-(v+1)}\sum_{t=1}^n t^v \to 1/(v+1)$ for $v = 0, 1, 2, \ldots$ [lecture 7].
Note that all those $W(\cdot)$ refer to the same Brownian motion, so the results are correlated with each
other. If we are not interested in their correlations, we can find simpler marginal expressions. For
example, (a) is just $N(0, \sigma^2)$, (b) is $(1/2)\sigma^2[\chi^2(1) - 1]$, and (c) and (d) are $N(0, \sigma^2/3)$.
In general, the correspondence between the finite-sample sums and their limits works like $\sum_{t=1}^n \to \int_0^1$,
$(t/n) \to r$, $(1/n) \to dr$, $n^{-1/2}u_t \to \sigma\,dW$, etc. Take (h) as an example, and let $v = 2$. From a previous
lecture we know that $n^{-3}\sum_{t=1}^n t^2 \to 1/3$. Using the correspondence here, we have
$$n^{-3}\sum_{t=1}^n t^2 = n^{-1}\sum_{t=1}^n (t/n)^2 \to \int_0^1 r^2\,dr = 1/3.$$

3.1.1 Case 1
Suppose that the data generating process (DGP) is a random walk, and we estimate the
parameter by OLS in the regression
$$y_t = \rho y_{t-1} + u_t, \qquad u_t \sim i.i.d.(0, \sigma^2), \tag{20}$$
where the true value is $\rho = 1$, and we are interested in the asymptotic distribution of the OLS estimate $\hat{\rho}_n$:
$$\hat{\rho}_n = \frac{\sum_{t=1}^n y_{t-1}y_t}{\sum_{t=1}^n y_{t-1}^2} = \frac{\sum_{t=1}^n y_{t-1}(y_{t-1} + u_t)}{\sum_{t=1}^n y_{t-1}^2} = 1 + \frac{\sum_{t=1}^n y_{t-1}u_t}{\sum_{t=1}^n y_{t-1}^2}.$$
Then
$$n(\hat{\rho}_n - 1) = \frac{n^{-1}\sum_{t=1}^n y_{t-1}u_t}{n^{-2}\sum_{t=1}^n y_{t-1}^2}.$$
By results (b) and (e) of Proposition 1 and the CMT,
$$n(\hat{\rho}_n - 1) \to_d \frac{W(1)^2 - 1}{2\int_0^1 W(r)^2dr}. \tag{21}$$
This distribution is asymmetric and skewed to the left; note also that $\hat{\rho}_n$ converges at rate $n$ rather
than the usual $n^{1/2}$. We can also use a $t$ test,
$$t_n = \frac{\hat{\rho}_n - 1}{\hat{\sigma}_{\hat{\rho}_n}}, \tag{22}$$
where
$$\hat{\sigma}^2_{\hat{\rho}_n} = \frac{s_n^2}{\sum_{t=1}^n y_{t-1}^2}, \tag{23}$$
and
$$s_n^2 = \frac{1}{n-1}\sum_{t=1}^n (y_t - \hat{\rho}_ny_{t-1})^2.$$
Rewriting,
$$t_n = \frac{n^{-1}\sum_{t=1}^n y_{t-1}u_t}{\left(n^{-2}\sum_{t=1}^n y_{t-1}^2\right)^{1/2}(s_n^2)^{1/2}}.$$
If $\rho = 1$, which is true for the OLS estimator in the present problem, then $s_n^2 \to \sigma^2$ by the LLN.
And by (19) and (16), we have the limit for $t_n$,
$$t_n \to_d \frac{(1/2)\sigma^2[W(1)^2 - 1]}{\left\{\sigma^2\cdot\sigma^2\int_0^1 W(r)^2dr\right\}^{1/2}} = \frac{W(1)^2 - 1}{2\left[\int_0^1 W(r)^2dr\right]^{1/2}}. \tag{24}$$
For the same reason as in (21), this $t$ statistic is asymmetric and skewed to the left.
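The left skew of the case-1 distribution (21) is easy to see by simulation: generate driftless random walks and compute $n(\hat{\rho}_n - 1)$ for each. A sketch ($n$, replication count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 4000
stats = np.empty(reps)
for i in range(reps):
    y = np.cumsum(rng.normal(size=n))   # driftless random walk, sigma = 1
    y_lag, y_cur = y[:-1], y[1:]
    rho_hat = np.sum(y_lag * y_cur) / np.sum(y_lag ** 2)
    stats[i] = n * (rho_hat - 1.0)

mean_stat = stats.mean()        # centered below zero
frac_neg = np.mean(stats < 0)   # about P(W(1)^2 < 1) = P(|Z| < 1) ~ 0.68
```

The statistic is negative whenever $W(1)^2 < 1$ dominates, which happens with probability about 0.68, so the distribution places most of its mass to the left of zero.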
3.1.2 Case 2

Now suppose the DGP is still the driftless random walk $y_t = y_{t-1} + u_t$, $u_t \sim i.i.d.(0, \sigma^2)$, but
we estimate a regression that includes a constant,
$$y_t = \alpha + \rho y_{t-1} + u_t.$$
The OLS estimates are
$$\begin{bmatrix}\hat{\alpha}_n \\ \hat{\rho}_n\end{bmatrix} = \begin{bmatrix} n & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}\begin{bmatrix}\sum y_t \\ \sum y_{t-1}y_t\end{bmatrix}.$$
Under the null hypothesis $H_0: \alpha = 0, \rho = 1$, the deviations of the estimate vector from the
hypothesized values are
$$\begin{bmatrix}\hat{\alpha}_n \\ \hat{\rho}_n - 1\end{bmatrix} = \begin{bmatrix} n & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{bmatrix}^{-1}\begin{bmatrix}\sum u_t \\ \sum y_{t-1}u_t\end{bmatrix}. \tag{25}$$
Recall that in a regression with a constant and a time trend, the estimates have different convergence
rates. The situation is similar in this case. The orders in probability of the terms are
$$\begin{bmatrix} O_p(n) & O_p(n^{3/2}) \\ O_p(n^{3/2}) & O_p(n^2)\end{bmatrix}, \qquad \begin{bmatrix} O_p(n^{1/2}) \\ O_p(n)\end{bmatrix}. \tag{26}$$
Scaling with $H_n = \mathrm{diag}(n^{1/2}, n)$,
$$\begin{bmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1)\end{bmatrix} = \begin{bmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2\end{bmatrix}^{-1}\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{bmatrix}. \tag{27}$$
By Proposition 1,
$$\begin{bmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2\end{bmatrix} \to_d \begin{bmatrix} 1 & \sigma\int_0^1 W(r)dr \\ \sigma\int_0^1 W(r)dr & \sigma^2\int_0^1 W(r)^2dr\end{bmatrix} = \begin{bmatrix}1 & 0\\0 & \sigma\end{bmatrix}\begin{bmatrix}1 & \int_0^1 W(r)dr\\ \int_0^1 W(r)dr & \int_0^1 W(r)^2dr\end{bmatrix}\begin{bmatrix}1 & 0\\0 & \sigma\end{bmatrix},$$
and
$$\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t\end{bmatrix} \to_d \begin{bmatrix}\sigma W(1) \\ (1/2)\sigma^2[W(1)^2 - 1]\end{bmatrix} = \sigma\begin{bmatrix}1 & 0\\0 & \sigma\end{bmatrix}\begin{bmatrix}W(1)\\ (1/2)[W(1)^2 - 1]\end{bmatrix}.$$
Therefore,
$$\begin{bmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1)\end{bmatrix} \to_d \sigma\begin{bmatrix}1 & 0\\0 & \sigma\end{bmatrix}^{-1}\begin{bmatrix}1 & \int_0^1 W(r)dr\\ \int_0^1 W(r)dr & \int_0^1 W(r)^2dr\end{bmatrix}^{-1}\begin{bmatrix}W(1)\\ (1/2)[W(1)^2 - 1]\end{bmatrix}.$$
So the DF statistic to test the null hypothesis that $\rho = 1$ has the following limit distribution,
given by the second element of this vector:
$$n(\hat{\rho}_n - 1) \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int_0^1 W(r)dr}{\int_0^1 W(r)^2dr - \left[\int_0^1 W(r)dr\right]^2}. \tag{28}$$
As in case 1, we can also use a $t$ test,
$$t_n = \frac{\hat{\rho}_n - 1}{\hat{\sigma}_{\hat{\rho}_n}},$$
which converges to
$$\frac{(1/2)[W(1)^2 - 1] - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1 W(r)^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}^{1/2}}.$$
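Including the constant shifts the DF distribution (28) further to the left than the case-1 distribution (21), which is why the two cases need different critical-value tables. A Monte Carlo sketch (settings are arbitrary; demeaning both sides is equivalent to including the constant):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 500, 3000
stat1 = np.empty(reps)   # case 1: no constant
stat2 = np.empty(reps)   # case 2: constant included
for i in range(reps):
    y = np.cumsum(rng.normal(size=n))
    y_lag, y_cur = y[:-1], y[1:]
    # Case 1 estimate
    rho1 = np.sum(y_lag * y_cur) / np.sum(y_lag ** 2)
    # Case 2: demean regressor and regressand (Frisch-Waugh)
    x, z = y_lag - y_lag.mean(), y_cur - y_cur.mean()
    rho2 = np.sum(x * z) / np.sum(x ** 2)
    stat1[i] = n * (rho1 - 1.0)
    stat2[i] = n * (rho2 - 1.0)

m1, m2 = stat1.mean(), stat2.mean()   # both negative, case 2 more so
```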
3.1.3 Case 3

Now, suppose that the true process is a random walk with drift:
$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim i.i.d.(0, \sigma^2), \qquad \alpha \ne 0.$$
Without loss of generality, we could set $y_0 = 0$. We again estimate a linear regression with a
constant,
$$y_t = \alpha + \rho y_{t-1} + u_t.$$
Define
$$\xi_t \equiv u_1 + u_2 + \ldots + u_t;$$
then
$$y_t = \alpha t + \xi_t$$
and
$$\sum_{t=1}^n y_{t-1} = \alpha\sum_{t=1}^n (t-1) + \sum_{t=1}^n \xi_{t-1}.$$
Notice that these two terms have different divergence rates. We know that $\sum_{t=1}^n t = n(n+1)/2 = O(n^2)$,
while $\sum_{t=1}^n \xi_{t-1} = O_p(n^{3/2})$, since $n^{-3/2}\sum_{t=1}^n \xi_{t-1}$ converges to a normal distribution with finite
variance (result (d)). Therefore, normalizing by the fastest divergence rate,
$$n^{-2}\sum_{t=1}^n y_{t-1} = n^{-2}\alpha\sum_{t=1}^n (t-1) + n^{-2}\sum_{t=1}^n \xi_{t-1} \to_p \alpha/2. \tag{29}$$
Similarly, for $\sum_{t=1}^n y_{t-1}^2$,
$$\sum_{t=1}^n y_{t-1}^2 = \sum_{t=1}^n [\alpha(t-1) + \xi_{t-1}]^2 = \alpha^2\sum_{t=1}^n (t-1)^2 + 2\alpha\sum_{t=1}^n (t-1)\xi_{t-1} + \sum_{t=1}^n \xi_{t-1}^2,$$
where $\alpha^2\sum(t-1)^2 = O_p(n^3)$ (result (h)), $\sum \xi_{t-1}^2 = O_p(n^2)$ (result (e)), and $\sum(t-1)\xi_{t-1} = O_p(n^{5/2})$
(result (f)). Normalizing the sequence with the inverse of the fastest divergence rate $n^3$,
$$n^{-3}\sum_{t=1}^n y_{t-1}^2 \to_p \alpha^2/3. \tag{30}$$
Finally,
$$\sum_{t=1}^n y_{t-1}u_t = \sum_{t=1}^n [\alpha(t-1) + \xi_{t-1}]u_t = \alpha\sum_{t=1}^n (t-1)u_t + \sum_{t=1}^n \xi_{t-1}u_t,$$
where $\sum(t-1)u_t = O_p(n^{3/2})$ (result (c)) and $\sum \xi_{t-1}u_t = O_p(n)$ (result (b)). Normalizing with
the fastest divergence rate,
$$n^{-3/2}\sum_{t=1}^n y_{t-1}u_t = n^{-3/2}\alpha\sum_{t=1}^n (t-1)u_t + o_p(1). \tag{31}$$
Corresponding to the different rates, to derive a nondegenerate limit distribution for the estimates,
again we need a scaling matrix. In this case, we need
$$H_n = \begin{bmatrix} n^{1/2} & 0 \\ 0 & n^{3/2}\end{bmatrix}.$$
Premultiplying the OLS estimator vector (in deviations from the true values) by $H_n$, we get
$$\begin{bmatrix} n^{1/2}(\hat{\alpha}_n - \alpha) \\ n^{3/2}(\hat{\rho}_n - 1)\end{bmatrix} = \begin{bmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2\end{bmatrix}^{-1}\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t\end{bmatrix}.$$
From (29) and (30),
$$\begin{bmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2\end{bmatrix} \to_p \begin{bmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3\end{bmatrix} \equiv Q,$$
and from (13) and (31),
$$\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t\end{bmatrix} \to_d N\left(\begin{bmatrix}0\\0\end{bmatrix}, \sigma^2\begin{bmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3\end{bmatrix}\right) = N(0, \sigma^2Q).$$
Therefore we have the following limit distribution for the OLS estimates:
$$\begin{bmatrix} n^{1/2}(\hat{\alpha}_n - \alpha) \\ n^{3/2}(\hat{\rho}_n - 1)\end{bmatrix} \to_d N(0, Q^{-1}\sigma^2QQ^{-1}) = N(0, \sigma^2Q^{-1}). \tag{32}$$
Therefore we have the following limit distribution for the OLS estimates
1/2
n (
n )
!d N (0, Q 1 2 Q Q 1 ) = N (0, 2 Q 1 ).
3/2
n (
n 1)
Q).
(32)
So in case 3, both estimated coefficients are asymptotically Gaussian, and the asymptotic distribution is the same as
and in the regression with deterministic trends. This is because here yt
has two components: a deterministic time trend and random walk, and the time trend dominates
the random walk.
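Result (32) implies that the usual $t$ statistic for $\rho = 1$ is approximately standard normal in case 3, in contrast to cases 1-2. A Monte Carlo sketch (the drift value, $n$, and seed are my own choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 3000
alpha = 1.0                        # nonzero drift
tstats = np.empty(reps)
for i in range(reps):
    u = rng.normal(size=n + 1)
    y = np.cumsum(alpha + u)       # random walk with drift
    Y, ylag = y[1:], y[:-1]
    X = np.column_stack([np.ones(n), ylag])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta
    s2 = resid @ resid / (n - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    tstats[i] = (beta[1] - 1.0) / np.sqrt(cov[1, 1])  # t stat for rho = 1

mean_t, var_t = tstats.mean(), tstats.var()   # approximately 0 and 1
```

With the drift dominating, the simulated $t$ statistics look close to $N(0,1)$, unlike the left-skewed Dickey-Fuller shapes of the earlier cases.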
3.1.4 Case 4

Finally, we consider a true process that is a random walk with or without drift,
$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim i.i.d.(0, \sigma^2),$$
where $\alpha$ may or may not be zero, and we run the following regression:
$$y_t = \alpha + \rho y_{t-1} + \delta t + u_t. \tag{33}$$
Without loss of generality, we assume that $y_0 = 0$. Note that when $\alpha \ne 0$, $y_{t-1}$ also contains a time
trend; hence there will be an asymptotic collinearity problem between $y_{t-1}$ and $t$. Hence rewrite the
regression as
$$y_t = (1 - \rho)\alpha + \rho[y_{t-1} - \alpha(t-1)] + (\rho\alpha + \delta)t + u_t = \mu + \rho\xi_{t-1} + \eta t + u_t,$$
where $\mu = (1-\rho)\alpha$, $\eta = \rho\alpha + \delta$, and $\xi_t = y_t - \alpha t$. Under the
null hypothesis $\rho = 1$, $\delta = 0$, $\xi_t$ is a random walk:
$$\xi_t = u_1 + u_2 + \ldots + u_t.$$
Therefore, with this transformation, we regress $y_t$ on a constant, a driftless random walk, and a
deterministic time trend.
The OLS estimates in this regression are
$$\begin{bmatrix}\hat{\mu}_n \\ \hat{\rho}_n \\ \hat{\eta}_n\end{bmatrix} = \begin{bmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum t\xi_{t-1} \\ \sum t & \sum t\xi_{t-1} & \sum t^2\end{bmatrix}^{-1}\begin{bmatrix}\sum y_t \\ \sum\xi_{t-1}y_t \\ \sum ty_t\end{bmatrix},$$
and the deviations from the null values satisfy
$$\begin{bmatrix}\hat{\mu}_n - \mu \\ \hat{\rho}_n - 1 \\ \hat{\eta}_n - \eta\end{bmatrix} = \begin{bmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum t\xi_{t-1} \\ \sum t & \sum t\xi_{t-1} & \sum t^2\end{bmatrix}^{-1}\begin{bmatrix}\sum u_t \\ \sum\xi_{t-1}u_t \\ \sum tu_t\end{bmatrix}. \tag{34}$$
Note that these three estimates have different convergence rates (we are already familiar with
them!): $\hat{\mu}_n$ is $n^{1/2}$-convergent, $\hat{\rho}_n$ is $n$-convergent, and $\hat{\eta}_n$ is $n^{3/2}$-convergent. Therefore we need the
rescaling matrix
$$H_n = \begin{bmatrix} n^{1/2} & 0 & 0 \\ 0 & n & 0 \\ 0 & 0 & n^{3/2}\end{bmatrix}.$$
Premultiplying (34) by $H_n$, we have
$$\begin{bmatrix} n^{1/2}(\hat{\mu}_n - \mu) \\ n(\hat{\rho}_n - 1) \\ n^{3/2}(\hat{\eta}_n - \eta)\end{bmatrix} = \begin{bmatrix} 1 & n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum t \\ n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum\xi_{t-1}^2 & n^{-5/2}\sum t\xi_{t-1} \\ n^{-2}\sum t & n^{-5/2}\sum t\xi_{t-1} & n^{-3}\sum t^2\end{bmatrix}^{-1}\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum\xi_{t-1}u_t \\ n^{-3/2}\sum tu_t\end{bmatrix}.$$
The limit distribution of each term in the above equation can be found in Proposition 1. Plugging
them in, we get
$$\begin{bmatrix} n^{1/2}(\hat{\mu}_n - \mu) \\ n(\hat{\rho}_n - 1) \\ n^{3/2}(\hat{\eta}_n - \eta)\end{bmatrix} \to_d \begin{bmatrix}\sigma & 0 & 0\\0 & 1 & 0\\0 & 0 & \sigma\end{bmatrix}\begin{bmatrix}1 & \int_0^1 W(r)dr & \tfrac{1}{2} \\ \int_0^1 W(r)dr & \int_0^1 W(r)^2dr & \int_0^1 rW(r)dr \\ \tfrac{1}{2} & \int_0^1 rW(r)dr & \tfrac{1}{3}\end{bmatrix}^{-1}\begin{bmatrix}W(1) \\ \tfrac{1}{2}[W(1)^2 - 1] \\ W(1) - \int_0^1 W(r)dr\end{bmatrix}. \tag{35}$$
The DF unit root test in this case is given by the middle row of (35). Note that it does not
depend on either $\alpha$ or $\sigma$. The DF $t$ test can be derived in a similar way (see page 500 in Hamilton).
3.2 Unit Root Processes with Serially Correlated Errors

3.2.1 The Beveridge-Nelson Decomposition

Beveridge and Nelson (1981) proposed that any time series that displays some degree of nonstationarity
can be decomposed into two additive parts: a stationary (also called cyclical or transitory)
part and a nonstationary (also called long-run or permanent) part. Let
$$u_t = C(L)\varepsilon_t = \sum_{j=0}^\infty c_j\varepsilon_{t-j}, \tag{36}$$
where $\sum_{j=0}^\infty j|c_j| < \infty$. Write
$$u_t = [C(1) + (L-1)\tilde{C}(L)]\varepsilon_t = C(1)\varepsilon_t - \tilde{C}(L)(\varepsilon_t - \varepsilon_{t-1}),$$
where $\tilde{C}(L) = \sum_{j=0}^\infty \tilde{c}_jL^j$ with $\tilde{c}_j = \sum_{i=j+1}^\infty c_i$. Then, for $y_t = y_{t-1} + u_t$,
$$y_t = y_0 + \sum_{j=1}^t u_j = y_0 + C(1)\sum_{j=1}^t \varepsilon_j - \tilde{C}(L)\varepsilon_t + \tilde{C}(L)\varepsilon_0 = y_0 + \eta_0 - \eta_t + C(1)\sum_{j=1}^t \varepsilon_j,$$
where $y_0$ is the initial condition, $\eta_t = \tilde{C}(L)\varepsilon_t = \sum_{j=0}^\infty \tilde{c}_j\varepsilon_{t-j}$ is a stationary process (note
that $\tilde{c}_j$ is absolutely summable), and $C(1)\sum_{j=1}^t \varepsilon_j$ is a nonstationary random walk process.
Rewrite $y_t$ (with $y_0 = 0$) as
$$y_t = \sum_{s=1}^t u_s = C(1)\sum_{s=1}^t \varepsilon_s + \eta_0 - \eta_t.$$
Note that $\xi_t = \sum_{s=1}^t \varepsilon_s$ is a random walk with serially uncorrelated errors, and we have
$n^{-1/2}\xi_{[nr]} \to \sigma W(r)$, while $\eta_0 - \eta_t$ is bounded in probability; hence we would expect that
$n^{-1/2}y_{[nr]} = C(1)n^{-1/2}\xi_{[nr]} + o_p(1) \to \sigma C(1)W(r)$. The following proposition summarizes some important
limit theories for unit root processes with serially correlated errors.
Proposition 2 Let $u_t = C(L)\varepsilon_t = \sum_{j=0}^\infty c_j\varepsilon_{t-j}$, where $\sum_{j=0}^\infty j|c_j| < \infty$ and
$\varepsilon_t \sim i.i.d.(0, \sigma^2)$ with finite fourth moment $\mu_4$. Define
$$\gamma_h = E(u_tu_{t-h}) = \sigma^2\sum_{j=0}^\infty c_jc_{j+h},$$
$$\lambda = \sigma\sum_{j=0}^\infty c_j = \sigma C(1),$$
$$\xi_t = u_1 + u_2 + \ldots + u_t, \qquad \xi_0 = 0.$$
In the above notation, $\lambda^2$ is known as the long-run variance of $u_t$, which is in general different
from the variance of $u_t$, which is $\gamma_0$.
(a) $n^{-1/2}\sum_{t=1}^n u_t \to_d \lambda W(1)$;

(b) $n^{-1}\sum_{t=1}^n u_tu_{t-j} \to_p \gamma_j$ for $j = 1, 2, \ldots$;

(c) $n^{-1/2}\sum_{t=1}^n u_{t-j}\varepsilon_t \to_d N(0, \sigma^2\gamma_0)$ for $j = 1, 2, \ldots$;

(d) $n^{-1}\sum_{t=1}^n \xi_{t-1}\varepsilon_t \to_d (1/2)\sigma\lambda[W(1)^2 - 1]$;

(e) $n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-h} \to_d (1/2)[\lambda^2W(1)^2 - \gamma_0]$ for $h = 0$, and
$n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-h} \to_d (1/2)[\lambda^2W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j$ for $h = 1, 2, \ldots$;

(f) $n^{-3/2}\sum_{t=1}^n \xi_{t-1} \to_d \lambda\int_0^1 W(r)dr$;

(g) $n^{-3/2}\sum_{t=1}^n tu_t \to_d \lambda\left[W(1) - \int_0^1 W(r)dr\right]$;

(h) $n^{-2}\sum_{t=1}^n \xi_{t-1}^2 \to_d \lambda^2\int_0^1 W(r)^2dr$;

(i) $n^{-5/2}\sum_{t=1}^n t\xi_{t-1} \to_d \lambda\int_0^1 rW(r)dr$;

(j) $n^{-3}\sum_{t=1}^n t\xi_{t-1}^2 \to_d \lambda^2\int_0^1 rW(r)^2dr$;

(k) $n^{-(v+1)}\sum_{t=1}^n t^v \to 1/(v+1)$ for $v = 0, 1, 2, \ldots$.
The proofs of all these results can be found in the appendix of Chapter 17 in Hamilton. In
class, we will discuss (a), (e) and (f) as examples. First, to prove (a), use the BN decomposition:
$$n^{-1/2}\sum_{t=1}^{[nr]} u_t = n^{-1/2}C(1)\sum_{t=1}^{[nr]}\varepsilon_t + n^{-1/2}(\eta_0 - \eta_{[nr]}).$$
By the FCLT,
$$n^{-1/2}C(1)\sum_{t=1}^{[nr]}\varepsilon_t \to \sigma C(1)W(r) = \lambda W(r).$$
Since $\eta_0$ is the initial condition and $\eta_{[nr]}$ is a zero-mean stationary process, both are bounded in
probability, so
$$n^{-1/2}(\eta_0 - \eta_{[nr]}) \to 0.$$
Therefore
$$n^{-1/2}\sum_{t=1}^{[nr]} u_t \to \lambda W(r), \tag{37}$$
and when $r = 1$,
$$n^{-1/2}\sum_{t=1}^n u_t \to \lambda W(1).$$
Second, to prove
$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-h} \to_d \begin{cases} (1/2)[\lambda^2W(1)^2 - \gamma_0] & h = 0, \\ (1/2)[\lambda^2W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j & h = 1, 2, \ldots, \end{cases} \tag{38}$$
start with $h = 0$. As in the i.i.d. case, summing $\xi_t^2 = (\xi_{t-1} + u_t)^2$ over $t$ gives
$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_t = (1/2)n^{-1}\xi_n^2 - (1/2)n^{-1}\sum_{t=1}^n u_t^2.$$
We know that $n^{-1/2}\xi_n \to \lambda W(1)$, so $(1/2)n^{-1}\xi_n^2 \to (1/2)\lambda^2W(1)^2$; by the LLN, $n^{-1}\sum u_t^2 \to \gamma_0$.
Therefore,
$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_t \to (1/2)[\lambda^2W(1)^2 - \gamma_0].$$
For $h = 1$, write $\xi_{t-1}u_{t-1} = (\xi_{t-2} + u_{t-1})u_{t-1} = \xi_{t-2}u_{t-1} + u_{t-1}^2$. The first term behaves like the
$h = 0$ case, and $n^{-1}\sum u_{t-1}^2 \to \gamma_0$; therefore,
$$n^{-1}\sum_{t=1}^n \xi_{t-1}u_{t-1} \to (1/2)[\lambda^2W(1)^2 - \gamma_0] + \gamma_0.$$
The argument is similar for $h = 2, 3, \ldots$.
Thirdly, consider result (f). Define
$$S_n(r) = \begin{cases} 0 & r \in [0, 1/n) \\ n^{-1/2}\xi_t & r \in [t/n, (t+1)/n) \\ n^{-1/2}\xi_n & r = 1; \end{cases} \tag{39}$$
then we have $S_n(r) \to \lambda W(r)$ and
$$\int_0^1 S_n(r)dr = n^{-3/2}\sum_{t=1}^n \xi_{t-1}.$$
By the CMT,
$$\int_0^1 S_n(r)dr \to_d \lambda\int_0^1 W(r)dr,$$
that is,
$$n^{-3/2}\sum_{t=1}^n \xi_{t-1} \to_d \lambda\int_0^1 W(r)dr. \tag{40}$$
3.2.2 Phillips-Perron Tests

We will discuss case 2 only; the other cases can be derived similarly. Let the true DGP be a random
walk with serially correlated errors,
$$y_t = \alpha + y_{t-1} + u_t, \qquad \alpha = 0, \qquad u_t = C(L)\varepsilon_t,$$
where $C(L)$ and $\varepsilon_t$ satisfy the conditions in Proposition 2, and we estimate $y_t = \alpha + \rho y_{t-1} + u_t$
by OLS. When $|\rho| < 1$, the OLS estimate of $\rho$ is not consistent when the errors are serially
correlated. However, when $\rho = 1$, the OLS estimate $\hat{\rho}_n \to 1$. Therefore, Phillips and Perron (1988)
proposed estimating the regression with OLS and then correcting the statistics for serial correlation.
Under the null hypothesis $H_0: \alpha = 0, \rho = 1$, the deviations of the OLS estimates from the
hypothesized values satisfy
$$\begin{bmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1)\end{bmatrix} = \begin{bmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2\end{bmatrix}^{-1}\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{bmatrix}. \tag{41}$$
Using results (f) and (h) of Proposition 2,
$$\begin{bmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2\end{bmatrix} \to_d \begin{bmatrix} 1 & \lambda\int_0^1 W(r)dr \\ \lambda\int_0^1 W(r)dr & \lambda^2\int_0^1 W(r)^2dr\end{bmatrix} = \begin{bmatrix}1&0\\0&\lambda\end{bmatrix}\begin{bmatrix}1 & \int_0^1 W(r)dr\\ \int_0^1 W(r)dr & \int_0^1 W(r)^2dr\end{bmatrix}\begin{bmatrix}1&0\\0&\lambda\end{bmatrix},$$
and using results (a) and (e),
$$\begin{bmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t\end{bmatrix} \to_d \begin{bmatrix}\lambda W(1) \\ (1/2)[\lambda^2W(1)^2 - \gamma_0]\end{bmatrix} = \begin{bmatrix}\lambda W(1)\\ (1/2)\lambda^2[W(1)^2 - 1]\end{bmatrix} + \begin{bmatrix}0\\ (1/2)(\lambda^2 - \gamma_0)\end{bmatrix}.$$
Substituting these two results into (41),
$$\begin{bmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1)\end{bmatrix} \to_d \begin{bmatrix}1&0\\0&\lambda\end{bmatrix}^{-1}\begin{bmatrix}1 & \int_0^1 W(r)dr\\ \int_0^1 W(r)dr & \int_0^1 W(r)^2dr\end{bmatrix}^{-1}\begin{bmatrix}1&0\\0&\lambda\end{bmatrix}^{-1}\left\{\begin{bmatrix}\lambda W(1)\\ (1/2)\lambda^2[W(1)^2 - 1]\end{bmatrix} + \begin{bmatrix}0\\ (1/2)(\lambda^2 - \gamma_0)\end{bmatrix}\right\}.$$
To test $\rho = 1$, take the second row:
$$n(\hat{\rho}_n - 1) \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int_0^1 W(r)dr}{\int_0^1 W(r)^2dr - \left[\int_0^1 W(r)dr\right]^2} + \frac{(1/2)(\lambda^2 - \gamma_0)/\lambda^2}{\int_0^1 W(r)^2dr - \left[\int_0^1 W(r)dr\right]^2}.$$
The first term is the case-2 Dickey-Fuller distribution (28); the second term reflects the serial
correlation in $u_t$ and vanishes when $u_t$ is i.i.d. (so that $\lambda^2 = \gamma_0$). The Phillips-Perron statistic
subtracts a consistent estimate of this correction term, so that the corrected statistic has the limit
distribution (28).
An alternative unit root test with serially correlated errors is the augmented Dickey-Fuller (ADF)
test. Recall the example of an AR(2) process from earlier in this lecture,
$$(1 - \phi_1L - \phi_2L^2)y_t = \varepsilon_t,$$
with one unit root and the other root $|\lambda_2| < 1$, so that
$$y_t = y_{t-1} + u_t, \qquad u_t = (1 - \lambda_2L)^{-1}\varepsilon_t = \psi(L)\varepsilon_t.$$
So this is a unit root process with serially correlated errors. To correct for the serial correlation,
define
$$\rho = \phi_1 + \phi_2, \qquad \zeta = -\phi_2.$$
Then we have the equivalent polynomial factorization
$$1 - \phi_1L - \phi_2L^2 = (1 - \rho L) - \zeta L(1 - L),$$
so the process can be written
$$[(1 - \rho L) - \zeta L(1 - L)]y_t = \varepsilon_t,$$
or
$$y_t = \rho y_{t-1} + \zeta\Delta y_{t-1} + \varepsilon_t. \tag{42}$$
More generally, for an AR(p) process $(1 - \phi_1L - \ldots - \phi_pL^p)y_t = \varepsilon_t$, define
$$\rho = \phi_1 + \phi_2 + \ldots + \phi_p$$
and
$$\zeta_j = -(\phi_{j+1} + \phi_{j+2} + \ldots + \phi_p) \qquad \text{for } j = 1, 2, \ldots, p-1,$$
so that $y_t = \rho y_{t-1} + \zeta_1\Delta y_{t-1} + \ldots + \zeta_{p-1}\Delta y_{t-p+1} + \varepsilon_t$. Note that when the process
contains a unit root, which means one root of
$$1 - \phi_1z - \phi_2z^2 - \ldots - \phi_pz^p = 0$$
is unity,
$$1 - \phi_1 - \phi_2 - \ldots - \phi_p = 0,$$
which implies that $\rho = 1$. Therefore, to test whether a process contains a unit root is equivalent to
testing whether $\rho = 1$ in (42). Furthermore, (42) is a regression with serially uncorrelated errors. For simplicity,
in our following discussion, we work with an AR(2) process. Again, we only consider case 2. Our
regression is
$$y_t = \zeta\Delta y_{t-1} + \alpha + \rho y_{t-1} + \varepsilon_t \equiv x_t'\beta + \varepsilon_t,$$
where $x_t = (\Delta y_{t-1}, 1, y_{t-1})'$ and $\beta = (\zeta, \alpha, \rho)'$. Let $u_t = \Delta y_t = y_t - y_{t-1}$. The OLS estimate
satisfies
$$\hat{\beta}_n - \beta = \left[\sum_{t=1}^n x_tx_t'\right]^{-1}\sum_{t=1}^n x_t\varepsilon_t,$$
where
$$\sum_{t=1}^n x_tx_t' = \begin{bmatrix}\sum u_{t-1}^2 & \sum u_{t-1} & \sum u_{t-1}y_{t-1} \\ \sum u_{t-1} & n & \sum y_{t-1} \\ \sum y_{t-1}u_{t-1} & \sum y_{t-1} & \sum y_{t-1}^2\end{bmatrix}, \qquad \sum_{t=1}^n x_t\varepsilon_t = \begin{bmatrix}\sum u_{t-1}\varepsilon_t \\ \sum\varepsilon_t \\ \sum y_{t-1}\varepsilon_t\end{bmatrix}.$$
Define the scaling matrix $H_n = \mathrm{diag}(n^{1/2}, n^{1/2}, n)$. Then
$$H_n(\hat{\beta}_n - \beta) = \left[H_n^{-1}\sum_{t=1}^n x_tx_t'H_n^{-1}\right]^{-1}H_n^{-1}\sum_{t=1}^n x_t\varepsilon_t. \tag{43}$$
Here $V = E(u_t^2) = \gamma_0$ (it would be a matrix with elements $\gamma_j$ for a general AR(p) model), and
$\lambda = \sigma C(1) = \sigma/(1 - \zeta)$. The scaled moment matrix satisfies
$$H_n^{-1}\sum_{t=1}^n x_tx_t'H_n^{-1} \to_d \begin{bmatrix}V & 0 & 0 \\ 0 & 1 & \lambda\int_0^1 W(r)dr \\ 0 & \lambda\int_0^1 W(r)dr & \lambda^2\int_0^1 W(r)^2dr\end{bmatrix} \equiv \begin{bmatrix}V & 0 \\ 0 & Q\end{bmatrix},$$
where
$$Q = \begin{bmatrix}1 & \lambda\int_0^1 W(r)dr \\ \lambda\int_0^1 W(r)dr & \lambda^2\int_0^1 W(r)^2dr\end{bmatrix}.$$
The off-diagonal blocks vanish because $n^{-1}\sum u_{t-1}$ and $n^{-3/2}\sum u_{t-1}y_{t-1}$ converge to zero at these
scalings. For the score vector,
$$H_n^{-1}\sum_{t=1}^n x_t\varepsilon_t = \begin{bmatrix}n^{-1/2}\sum u_{t-1}\varepsilon_t \\ n^{-1/2}\sum\varepsilon_t \\ n^{-1}\sum y_{t-1}\varepsilon_t\end{bmatrix},$$
and $n^{-1/2}\sum u_{t-1}\varepsilon_t \to_d h_1 \sim N(0, \sigma^2V)$.
Applying results (a) and (d) of Proposition 2 to the other two terms,
$$\begin{bmatrix}n^{-1/2}\sum\varepsilon_t \\ n^{-1}\sum y_{t-1}\varepsilon_t\end{bmatrix} \to_d \begin{bmatrix}\sigma W(1) \\ (1/2)\sigma\lambda[W(1)^2 - 1]\end{bmatrix} \equiv h_2.$$
Substituting the above results into (43), we get
$$H_n(\hat{\beta}_n - \beta) \to_d \begin{bmatrix}V & 0 \\ 0 & Q\end{bmatrix}^{-1}\begin{bmatrix}h_1 \\ h_2\end{bmatrix} = \begin{bmatrix}V^{-1}h_1 \\ Q^{-1}h_2\end{bmatrix}. \tag{44}$$
Since the limit distribution is block diagonal, we can discuss the coefficients on the stationary
components and the nonstationary components separately. For the stationary component,
$$\sqrt{n}(\hat{\zeta}_n - \zeta) \to_d V^{-1}h_1 \sim N(0, \sigma^2V^{-1}).$$
In this AR(2) problem, the variance is simply $\sigma^2/\gamma_0$. The limit distribution for the constant and
the I(1) component is
$$\begin{bmatrix}n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1)\end{bmatrix} \to_d Q^{-1}h_2 = \begin{bmatrix}1 & \lambda\int_0^1 W(r)dr \\ \lambda\int_0^1 W(r)dr & \lambda^2\int_0^1 W(r)^2dr\end{bmatrix}^{-1}\begin{bmatrix}\sigma W(1) \\ (1/2)\sigma\lambda[W(1)^2 - 1]\end{bmatrix}.$$
This implies that $n(\lambda/\sigma)(\hat{\rho}_n - 1)$ has the same distribution as in (28). Since $\lambda/\sigma = C(1) = 1/(1-\zeta)$,
the ADF $\rho$-test is
$$\frac{n(\hat{\rho}_n - 1)}{1 - \hat{\zeta}_n} \to_d \frac{(1/2)[W(1)^2 - 1] - W(1)\int_0^1 W(r)dr}{\int_0^1 W(r)^2dr - \left[\int_0^1 W(r)dr\right]^2}. \tag{45}$$
For the general AR(p) process, simply replace $(1 - \hat{\zeta}_n)$ with $(1 - \hat{\zeta}_{1,n} - \ldots - \hat{\zeta}_{p-1,n})$. The ADF
$t$-test can be found in Hamilton's book.
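The AR(2) derivation can be checked numerically: simulate $(1-L)(1-\lambda_2L)y_t = \varepsilon_t$, run the regression (42) with a constant, and form the corrected statistic $n(\hat{\rho}_n - 1)/(1 - \hat{\zeta}_n)$. A sketch using hand-rolled OLS (the value of $\lambda_2$, sample size, and seed are my own choices, and this stands in for a canned ADF routine):

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs = 5000
lam2 = 0.5                    # stationary root; phi1 = 1 + lam2, phi2 = -lam2
eps = rng.normal(size=n_obs + 2)
y = np.zeros(n_obs + 2)
for t in range(2, n_obs + 2):
    y[t] = (1 + lam2) * y[t - 1] - lam2 * y[t - 2] + eps[t]

# Regression (42) with constant: y_t = alpha + rho*y_{t-1} + zeta*Dy_{t-1} + e_t
Y = y[2:]
ylag = y[1:-1]
dylag = y[1:-1] - y[:-2]
X = np.column_stack([np.ones_like(Y), ylag, dylag])
(alpha_hat, rho_hat, zeta_hat), *_ = np.linalg.lstsq(X, Y, rcond=None)

adf_stat = len(Y) * (rho_hat - 1.0) / (1.0 - zeta_hat)
```

Under the unit root, $\hat{\rho}_n$ is close to 1 (superconsistent), $\hat{\zeta}_n$ estimates $\lambda_2$, and `adf_stat` is a draw from the case-2 DF distribution.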
From univariate unit root processes to multivariate unit root processes, we need to extend the
scalar Brownian motion to the vector Brownian motion.
Definition 1 A $k$-dimensional standard Brownian motion $W(\cdot)$ is a continuous-time process associating each date $r \in [0, 1]$ with a $(k \times 1)$ vector $W(r)$ satisfying the following:
(a) W(0) = 0;
(b) For any dates $0 \le r_1 < r_2 < \ldots < r_k \le 1$, the changes $[W(r_2) - W(r_1)]$, $[W(r_3) - W(r_2)]$,
$\ldots$, $[W(r_k) - W(r_{k-1})]$ are independent multivariate Gaussian with $[W(s) - W(r)] \sim N(0, (s-r)I_k)$;
(c) For any given realization, W(r) is continuous in r with probability 1.
Let $v_t$ be a $k$-dimensional i.i.d. vector process with $E(v_t) = 0$ and $E(v_tv_t') = I_k$. Define
$X_n(r) = n^{-1}(v_1 + \ldots + v_{[nr]})$; then the vector version of the FCLT is given by
$$\sqrt{n}X_n(\cdot) \to_d W(\cdot). \tag{1}$$
Let $\varepsilon_t$ be a $k$-dimensional i.i.d. vector process with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t\varepsilon_t') = \Omega$. The Cholesky
decomposition of $\Omega$ gives
$$\Omega = PP',$$
so we can write $\varepsilon_t = Pv_t$. Define $X_n(r) = n^{-1}(\varepsilon_1 + \ldots + \varepsilon_{[nr]})$; then (1) and the CMT give
$$\sqrt{n}X_n(\cdot) \to_d P\,W(\cdot). \tag{2}$$
Finally, consider serially correlated errors $u_t = \sum_{s=0}^\infty C_s\varepsilon_{t-s}$, where, if $C_{ij}^s$ denotes the $ij$th
element of $C_s$,
$$\sum_{s=0}^\infty s|C_{ij}^s| < \infty.$$
By the multivariate BN decomposition,
$$\sum_{s=1}^t u_s = C(1)\sum_{s=1}^t \varepsilon_s + \eta_t - \eta_0,$$
where $C(1) = C_0 + C_1 + C_2 + \ldots$, $\eta_t = \sum_{s=0}^\infty \alpha_s\varepsilon_{t-s}$ with $\alpha_s = -(C_{s+1} + C_{s+2} + \ldots)$, and
$\alpha_s$ is absolutely summable. Now define $X_n(r) = (1/n)(u_1 + \ldots + u_{[nr]})$; then
$$\sqrt{n}X_n(\cdot) \to_d C(1)P\,W(\cdot). \tag{3}$$
Proposition 18.1 in Hamilton (p. 547) (we will use P18.1 for short in this lecture) summarizes many
useful asymptotic results for vector unit root processes. Most of them are analogous to the univariate
cases: scalar quantities such as $\sigma$ and $\lambda$ are simply replaced by their matrix counterparts.
2 Spurious Regression
Consider two independent I(1) variables, x1 and x2 . If we regress x1 on x2 , despite the fact that they
are actually independent, the OLS estimates of the coefficient may be significant. This phenomenon
is called spurious regression (Granger and Newbold (1974), Phillips (1986)). Proposition 18.2 in
Hamilton (1994) gives the results developed by Phillips (1986). We will reproduce
a two-variable version for (some degree of) simplicity of presentation. I think this will be easier
to read, but you are still encouraged to read the original propositions and proofs.
Let $y_t = (x_{1t}, x_{2t})'$, with differences generated by
$$\Delta y_t = \Psi(L)\varepsilon_t = \sum_{j=0}^\infty \Psi_j\varepsilon_{t-j},$$
where the error $\varepsilon_t$ satisfies our standard assumptions (mean zero and finite fourth moments) and $s\Psi_s$
is absolutely summable.
Consider the regression
$$x_{1t} = \alpha + \gamma x_{2t} + u_t. \tag{4}$$
The OLS coefficient estimates for a sample of size $n$ are given by
$$\begin{bmatrix}\hat{\alpha}_n \\ \hat{\gamma}_n\end{bmatrix} = \begin{bmatrix}n & \sum x_{2t} \\ \sum x_{2t} & \sum x_{2t}^2\end{bmatrix}^{-1}\begin{bmatrix}\sum x_{1t} \\ \sum x_{2t}x_{1t}\end{bmatrix},$$
and an OLS $F$ test of $m$ linear restrictions $H_0: R\beta = r$, with $\beta = (\alpha, \gamma)'$ and $x_t = (1, x_{2t})'$, is
computed as
$$F_n = (R\hat{\beta}_n - r)'\left\{s_n^2R\left[\sum_{t=1}^n x_tx_t'\right]^{-1}R'\right\}^{-1}(R\hat{\beta}_n - r)/m, \tag{5}$$
where $s_n^2 = (n-2)^{-1}\sum_{t=1}^n \hat{u}_t^2$. To derive the asymptotics for the estimates and test statistics, we
will do some transformations. Let $E(\varepsilon_t\varepsilon_t') = PP'$, and $\Lambda = \Psi(1)P$. Partition $\Lambda\Lambda'$ as
$$\Lambda\Lambda' = \begin{bmatrix}\lambda_1^2 & \lambda_{12} \\ \lambda_{21} & \lambda_2^2\end{bmatrix},$$
and define $\sigma_1^{*2} = \lambda_1^2 - \lambda_{12}^2/\lambda_2^2$.
To further simplify the problem, we assume that $\Delta x_{1t}$ and $\Delta x_{2t}$ are independent; then $\lambda_{12} = \lambda_{21} = 0$
and $\sigma_1^* = \lambda_1$, and the $L_{22}$ matrix in Proposition 18.2 is just $\lambda_2^{-1}$. The three parts of Proposition
18.2 in this problem become:
(a) The OLS estimates satisfy
$$\begin{bmatrix}n^{-1/2}\hat{\alpha}_n \\ \hat{\gamma}_n\end{bmatrix} \to_d \begin{bmatrix}\lambda_1h_1 \\ (\lambda_1/\lambda_2)h_2\end{bmatrix}, \tag{6}$$
where
$$\begin{bmatrix}h_1 \\ h_2\end{bmatrix} \equiv Q^{-1}F, \qquad Q \equiv \begin{bmatrix}1 & \int_0^1 W_2(r)dr \\ \int_0^1 W_2(r)dr & \int_0^1 W_2(r)^2dr\end{bmatrix}, \qquad F \equiv \begin{bmatrix}\int_0^1 W_1(r)dr \\ \int_0^1 W_1(r)W_2(r)dr\end{bmatrix},$$
and $W_1(r)$ and $W_2(r)$ are independent standard Brownian motions.
(b) The sum of squared residuals $RSS_n$ from the OLS estimation satisfies
$$n^{-2}RSS_n \to_d \lambda_1^2H, \qquad H \equiv \int_0^1 W_1(r)^2dr - F'Q^{-1}F.$$

(c) The OLS $F$ statistic (5) scaled by $n^{-1}$, $n^{-1}F_n$, converges to a nondegenerate limit that is a
functional of $h_1$, $h_2$ and $H$; the $F$ statistic itself therefore diverges at rate $n$. \hfill (7)
The proof of the above results for the general case can be found in Chapter 18 of Hamilton.
In our simple case, (6) tells us that neither of the estimates $(\hat{\alpha}_n, \hat{\gamma}_n)$ is consistent. Recall that if $\hat{b}$ is a
consistent estimate of $b$, then $\hat{b} - b \to_p 0$, and we have to scale it by $n^r$ with $r > 0$ to obtain
a nondegenerate limit distribution. Actually, in this problem the OLS estimate of $\alpha$ diverges with
the sample size $n$, since we have to scale it by $n^{-1/2}$ to obtain a limit distribution.
Result (b) then tells us that the OLS estimate of the variance of $u_t$ also diverges:
$$s_n^2 = (n - k)^{-1}RSS_n \to \infty.$$
To see why, note that
$$\hat{u}_t = x_{1t} - \hat{\alpha}_n - \hat{\gamma}_nx_{2t}, \quad \text{with} \quad [1 \;\; -\hat{\gamma}_n]\,y_t \to [1 \;\; -(\lambda_1/\lambda_2)h_2]\,y_t,$$
which is a random vector times an I(1) vector, so $\hat{u}_t$ behaves like an I(1) process rather than an I(0)
process. Hence $s_n^2 \approx n^{-1}\sum_{t=1}^n \hat{u}_t^2$ diverges, while $n^{-2}\sum_{t=1}^n \hat{u}_t^2$ converges.
Result (c) tells us that any OLS $t$ or $F$ statistic based on a spurious regression also diverges. The
usual $t$ statistic has to be divided by $n^{1/2}$ to converge, and the usual $F$ statistic has to be divided
by $n$ to converge. If we draw inference based on the usual test statistics, we will tend to conclude that
$x_{1t}$ and $x_{2t}$ are significantly related even when they are independent.
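The divergence of the $t$ statistic is easy to reproduce: regress one independent random walk on another and watch $|t|$ grow with the sample size. A sketch (sample sizes, replication count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)

def spurious_t(n, rng):
    """t statistic on gamma in x1 = alpha + gamma*x2 + u, where x1 and x2
    are two independent driftless random walks."""
    x1 = np.cumsum(rng.normal(size=n))
    x2 = np.cumsum(rng.normal(size=n))
    X = np.column_stack([np.ones(n), x2])
    beta, *_ = np.linalg.lstsq(X, x1, rcond=None)
    resid = x1 - X @ beta
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

reps = 400
t_small = np.array([spurious_t(100, rng) for _ in range(reps)])
t_large = np.array([spurious_t(2500, rng) for _ in range(reps)])

# Nominal 5% test rejects far too often, and |t| grows roughly like sqrt(n).
reject_small = np.mean(np.abs(t_small) > 1.96)
reject_large = np.mean(np.abs(t_large) > 1.96)
```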
There are three ways to cure a spurious regression. First, we may include lags of both the
dependent and independent variables, say
$$x_{1t} = \alpha + \phi x_{1,t-1} + \gamma_0x_{2t} + \gamma_1x_{2,t-1} + u_t.$$
Now the values $\phi = 1$, $\gamma_0 = \gamma_1 = 0$ make $u_t$ I(0); therefore most of the usual OLS inferences are valid,
although tests of some hypotheses will still involve non-standard distributions.
The second cure is to difference the data before the regression,
$$\Delta x_{1t} = \alpha + \gamma\Delta x_{2t} + u_t. \tag{8}$$
Now $u_t$ is again I(0) and all the usual OLS inferences are valid. However, in differencing the data, we
may lose some information in the data.
Finally, we could apply GLS to estimate the system. We first fit an AR(1) regression to the
residual $\hat{u}_t$, $\hat{u}_t = \rho\hat{u}_{t-1} + e_t$, then define $\tilde{x}_{1t} = x_{1t} - \hat{\rho}x_{1,t-1}$ and $\tilde{x}_{2t} = x_{2t} - \hat{\rho}x_{2,t-1}$, and regress
$\tilde{x}_{1t}$ on $\tilde{x}_{2t}$. Since $\hat{u}_t$ is a unit root process, $\hat{\rho} \to 1$. Therefore this Cochrane-Orcutt GLS regression
is asymptotically equivalent to running OLS on the differenced data (8).
3 Cointegration

3.1 Introduction
In the previous section, we showed that when we regress one I(1) variable on another I(1) variable
and the residuals of the regression are also I(1), the regression is spurious. Even when
the two variables are independent, the usual OLS inference may imply that they are significantly
related. Now you may wonder when it is valid to run an OLS regression between I(1) variables. It
turns out the regression is valid only when the residual is stationary, and in this case we say that
those I(1) variables are cointegrated.

There are two defining facts about cointegration. First, cointegration is a relationship that applies
only to I(1) series. Second, although each individual series $x_{1t}, x_{2t}, \ldots, x_{kt}$ is I(1), letting
$y_t = (x_{1t}, x_{2t}, \ldots, x_{kt})'$, there exists a nonzero $k \times 1$ vector $\beta$ such that the series $\beta'y_t$ is I(0).
There are many examples of cointegration in economic applications. For instance, both income and
consumption may be nonstationary, but they seem to keep a stable relation to each other. Or,
if we look at data on the short rate and the 3-month forward rate, they also tend to have a stable
relationship over time, although both wander around.
Example: consider the following system of processes,
$$x_{1t} = \gamma_1x_{2t} + \gamma_2x_{3t} + u_{1t},$$
$$x_{2t} = \gamma_3x_{3t} + u_{2t},$$
$$x_{3t} = x_{3,t-1} + u_{3t},$$
where the three error terms are uncorrelated white noise processes. Clearly, all three processes
are individually I(1). Let $y_t = (x_{1t}, x_{2t}, x_{3t})'$ and $\beta = (1, -\gamma_1, -\gamma_2)'$; then $\beta'y_t = u_{1t}$, which is an I(0)
process. Another cointegrating relationship is between $x_{2t}$ and $x_{3t}$: we can let $\alpha = (0, 1, -\gamma_3)'$, and
then $\alpha'y_t = u_{2t}$ is also I(0).
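The example can be simulated directly: the levels wander, but the cointegrating combinations recover the stationary errors, and OLS on the levels is superconsistent for the cointegrating coefficient. A sketch (the coefficient values and seed are hypothetical choices of mine):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20_000
g1, g2, g3 = 0.5, -0.3, 0.8        # hypothetical values of gamma_1..gamma_3
u1, u2, u3 = rng.normal(size=(3, n))

x3 = np.cumsum(u3)                 # x3 is a random walk
x2 = g3 * x3 + u2                  # I(1): scaled random walk plus noise
x1 = g1 * x2 + g2 * x3 + u1        # I(1) as well

z1 = x1 - g1 * x2 - g2 * x3        # beta = (1, -g1, -g2)': recovers u1, I(0)
z2 = x2 - g3 * x3                  # alpha = (0, 1, -g3)': recovers u2, I(0)

# OLS of x2 on x3 recovers g3 with Op(1/n) error (superconsistency)
gamma_hat = np.sum(x2 * x3) / np.sum(x3 ** 2)
```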
3.1.1 Cointegrating Matrix

In the above example, we see that the cointegrating vector is not unique. Also, note that $\beta$ and $\alpha$
are linearly independent. In general, if the cointegrated system has $k$ I(1) series, we can have $h$
linearly independent cointegrating vectors, with $h < k$. In the example, we have 3 I(1) series and
2 cointegrating relations. Let $\beta_i$, $i = 1, \ldots, h$, denote each of these vectors; then we could
construct an $h \times k$ matrix
$$A' = \begin{pmatrix}\beta_1' \\ \vdots \\ \beta_h'\end{pmatrix}.$$
Then the vector $A'y_t$ is an $h$-vector-valued stationary time series. In our above example,
$$A' = \begin{bmatrix}1 & -\gamma_1 & -\gamma_2 \\ 0 & 1 & -\gamma_3\end{bmatrix}$$
and
$$A'y_t = \begin{bmatrix}u_{1t} \\ u_{2t}\end{bmatrix}.$$
3.1.2 MA Representations

In univariate time series analysis, we often work with differenced data if the original
data are I(1): if $x_t$ is I(1), we may take the difference and specify an
AR(p) process for $\Delta x_t$. However, we cannot do this for a cointegrated system. Assume that $\Delta y_t$
is I(0), and let $\delta = E(\Delta y_t)$. Define
$$u_t = \Delta y_t - \delta. \tag{9}$$
Then $u_t$ is a stationary process by assumption. Suppose that $u_t$ has the Wold decomposition
$u_t = \Psi(L)\varepsilon_t$, where $\varepsilon_t$ is a vector white noise, and let $\Psi(1) = I_k + \Psi_1 + \Psi_2 + \ldots$.
The difference equation (9) together with the BN decomposition of $y_t$ gives
$$y_t = y_0 + \delta t + \Psi(1)(\varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_t) + \eta_t - \eta_0.$$
Premultiplying by $A'$,
$$A'y_t = A'y_0 + A'\delta t + A'\Psi(1)\sum_{i=1}^t \varepsilon_i + A'\eta_t - A'\eta_0.$$
For the left-hand side to be stationary, we need $A'\delta = 0$ and $A'\Psi(1) = 0$. Note that $A'\Psi(1) = 0$
implies that $|\Psi(z)| = 0$ at $z = 1$. This in turn means that $\Psi(L)$ is
noninvertible. Thus a cointegrated system can never be represented by a finite-order VAR in the
differenced data $\Delta y_t$. Intuitively, this is because in the dynamics of the system the levels of the
variables matter.
3.1.3 Triangular Representation

Obviously, the cointegrating matrix is not unique for a cointegrated system. Therefore researchers
can choose to use a representation that is convenient for their problems.
Phillips (1991) suggested that the $h \times k$ cointegrating matrix $A$ be transformed as
$$A' = [I_h \quad -\Gamma'],$$
where $\Gamma'$ is a matrix of size $h \times g$, with $g = k - h$. Define $z_t$ as
$$z_t = A'y_t.$$
Correspondingly, partition $y_t = (y_{1t}', y_{2t}')'$; then we could represent $y_{1t}$ and $y_{2t}$ separately,
$$y_{1t} = \Gamma'y_{2t} + z_t$$
and
$$\Delta y_{2t} = \delta_2 + v_{2t},$$
where $\delta_2$ and $v_{2t}$ are the last $g$ elements of $\delta$ and $u_t$. I will show how this works with our example.
In our example,
$$A' = \begin{bmatrix}1 & -\gamma_1 & -\gamma_2 \\ 0 & 1 & -\gamma_3\end{bmatrix}.$$
Transform $A'$ to the form
$$A' = \begin{bmatrix}1 & 0 & -(\gamma_2 + \gamma_1\gamma_3) \\ 0 & 1 & -\gamma_3\end{bmatrix}.$$
Then
$$z_t = A'y_t = \begin{bmatrix}u_{1t} + \gamma_1u_{2t} \\ u_{2t}\end{bmatrix},$$
$$y_{1t} = \begin{bmatrix}x_{1t} \\ x_{2t}\end{bmatrix} = \begin{bmatrix}\gamma_2 + \gamma_1\gamma_3 \\ \gamma_3\end{bmatrix}x_{3t} + z_t, \qquad \Delta y_{2t} = \Delta x_{3t} = u_{3t}.$$
The cointegrating relationships are very clear from this representation. There are also other
representations that are convenient in particular problems (reading: pages 578-582).
3.2 Cointegration Tests

In our discussion of spurious regression, we learned that given two I(1) processes, even if they are
independent, the OLS estimator may turn out to be significant according to the regular test statistics,
such as the $t$ and $F$ tests. Therefore, we should be cautious when running regressions between nonstationary
time series variables. However, if we know that two or more I(1) series are cointegrated, then it is
valid to apply our linear regression techniques. Therefore, testing whether a system of nonstationary
processes is cointegrated becomes critical in multivariate nonstationary time series studies. In the
test for cointegration, we let the null be no cointegration among the elements of a $(k \times 1)$ vector $y_t$;
rejection of the null is then taken as evidence of cointegration.
3.2.1
If the cointegrating vector is unknown, then we can first test whether each series is I(1), then estimate the cointegrating vector using OLS, and finally test the null hypothesis of no cointegration, which is equivalent to testing that the residual û_t is I(1). With the OLS estimates (α̂, γ̂), if û_t is I(0), then the vector is cointegrated; if û_t is I(1), then the regression is spurious. The following proposition (Proposition 19.2 in Hamilton) summarizes the asymptotic results in this approach (for simplicity, we let y_{2t} be a scalar).
Proposition 1 Suppose

y_{1t} = α + γ y_{2t} + u_{1t},
Δy_{2t} = u_{2t},                                            (10)
u_t = (u_{1t}, u_{2t})′ = C(L) ε_t,

where ε_t is an i.i.d. vector with mean zero, finite fourth moments, and positive definite variance-covariance matrix E(ε_t ε_t′) = P P′. Further suppose that {s C_s} is absolutely summable and that the rows of C(1) are linearly independent. Let α̂_n and γ̂_n be the OLS estimates

( α̂_n, γ̂_n )′ = [ n, Σ y_{2t}; Σ y_{2t}, Σ y_{2t}² ]^{−1} ( Σ y_{1t}, Σ y_{2t} y_{1t} )′.      (11)

Partition C(1)P as

C(1)P = [ λ_1′; λ_2′ ],

then

( n^{1/2}(α̂_n − α), n(γ̂_n − γ) )′ →_d [ 1, λ_2′ ∫_0^1 W(r) dr; λ_2′ ∫_0^1 W(r) dr, λ_2′ ( ∫_0^1 W(r) W(r)′ dr ) λ_2 ]^{−1} ( h_1, h_2 )′,

where W(r) is a two-dimensional standard Brownian motion and

h_1 = λ_1′ W(1),
h_2 = λ_2′ [ ∫_0^1 W(r) dW(r)′ ] λ_1 + Σ_{v=0}^{∞} E(u_{2t} u_{1,t+v}).
To understand the results, consider the simple example in which u_{1t} and u_{2s} are uncorrelated,

C(L) = I_2 + [ 0, 0; 0, cL ],    E(ε_t ε_t′) = [ σ_1², 0; 0, σ_2² ].

Then

C(1)P = [ σ_1, 0; 0, (1 + c)σ_2 ],

so that

λ_1′ = [ σ_1, 0 ],    λ_2′ = [ 0, (1 + c)σ_2 ].

Hence

h_1 = σ_1 W_1(1),
h_2 = (1 + c) σ_1 σ_2 ∫_0^1 W_2(r) dW_1(r).
The proof of this proposition can be found on pages 618-619 in Hamilton. Basically, the proof uses the results from Proposition 18.1 on multivariate unit root processes.

Note that in this example we assume u_{1t} and u_{2s} are uncorrelated, so Σ_{v=0}^{∞} E(u_{2t} u_{1,t+v}) = 0. In the general case, u_{1t} and u_{2s} can be correlated and therefore induce bias in the estimates; however, this bias in γ̂_n is O_p(n^{−1}). To correct this bias caused by correlation between u_1 and u_2, we can add leads and lags to the regression. Define û_{1t} as the residual from a linear projection of u_{1t} on {u_{2,t−p}, . . . , u_{2,t−1}, u_{2t}, u_{2,t+1}, . . . , u_{2,t+p}},
u_{1t} = Σ_{s=−p}^{p} β_s′ u_{2,t−s} + û_{1t},

so that û_{1t} is uncorrelated with u_{2t}. The regression (10) can then be rewritten as

y_{1t} = α + γ y_{2t} + Σ_{s=−p}^{p} β_s′ u_{2,t−s} + û_{1t}.      (12)
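A sketch of regression (12) on simulated data. In this assumed setup u_{1t} is correlated with u_{2t} only contemporaneously, so p = 2 leads and lags are more than enough; the point is just to show how the regressor matrix with leads and lags of Δy_{2t} = u_{2t} is assembled:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p = 3000, 2

# u1 correlated with u2: u1t = 0.8*u2t + e_t.
u2 = rng.standard_normal(T)
e = rng.standard_normal(T)
u1 = 0.8 * u2 + e
y2 = np.cumsum(u2)
y1 = 1.0 + 2.0 * y2 + u1
dy2 = np.diff(y2)              # dy2[k] = u2[k+1]

ts = np.arange(p + 1, T - p)   # t for which all leads and lags exist
cols = [np.ones(len(ts)), y2[ts]]
for s in range(-p, p + 1):
    cols.append(dy2[ts - s - 1])   # Delta y2_{t-s}
Z = np.column_stack(cols)
b = np.linalg.lstsq(Z, y1[ts], rcond=None)[0]
print(b[1], b[2 + p])   # gamma_hat near 2; coefficient on Delta y2_t near 0.8
```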
Still with our 2-variable model, suppose that there is a time trend in y_{2t}:

y_{1t} = α + γ y_{2t} + u_{1t},
y_{2t} = δ + y_{2,t−1} + u_{2t},                     (13)

so that

y_{2t} = δ t + Σ_{s=1}^{t} u_{2s}

is asymptotically dominated by the deterministic time trend δt. Then the OLS estimates α̂ and γ̂ in (13) have the same limit distribution as in a regression of an I(1) series on a constant and a time trend.
If y_{1t} also contains a deterministic time trend,

y_{1t} = α + δ_1 t + γ y_{2t} + u_{1t},

similar results hold.

3.3 Hypothesis testing about the cointegrating vector

Consider the system

y_{1t} = α + γ y_{2t} + u_{1t},                     (14)
y_{2t} = y_{2,t−1} + u_{2t},                        (15)

where y_{1t}, y_{2t} are I(1) while u_{1t}, u_{2t} are i.i.d. normal sequences that are independent of each other:

( u_{1t}, u_{2t} )′ ~ i.i.d. N( (0, 0)′, [ σ_1², 0; 0, σ_2² ] ).
And suppose we consider a null hypothesis about the cointegrating vector, say,

R_1 α + R_2 γ = r.

It turns out that the correct approach is just to estimate (14) with OLS and use standard t or F statistics to test any hypothesis about the cointegrating vector. No special procedures or unusual critical values are needed.
There are a few key features of financial data. First, the variance seems to vary over time, and one large movement tends to be followed by another large movement. In other words, large movements tend to cluster. This can be seen from Figure 1.
[Figure 1: S&P 500 daily returns, 1996-2002.]
Second, the distributions of financial data have heavy tails (heavier than Gaussian). For the
same data described above, I plot the empirical density and the normal density (with mean zero
and standard deviation equal to the standard deviation of the data) in Figure 2.
[Figure 2: empirical density versus normal density for the left tail and the right tail of the S&P 500 return.]
1.1 ARCH

Consider a process x_t,

x_t = Σ_{k=1}^{p} φ_k x_{t−k} + ε_t,                     (1)

and suppose that the squared innovation, ε_t², follows an autoregressive process of order m,

ε_t² = c + Σ_{i=1}^{m} α_i ε_{t−i}² + u_t,

where u_t ~ WN(0, σ_u²). Then we say ε_t follows an ARCH(m) process.
Note: first, we must have that σ_t² > 0. A sufficient condition is that c ≥ 0 and α_i ≥ 0 for i = 1, . . . , m.

Second, we are modeling a time-varying conditional variance for x_t, but we'd still like to restrict our discussion to a covariance-stationary process; therefore, we want the unconditional variance of ε_t to be finite. This requires that the roots of

1 − α_1 z − · · · − α_m z^m = 0

lie outside the unit circle. Combining this with the condition that all α_i ≥ 0, it suffices to impose the following condition on the coefficients:

Σ_{i=1}^{m} α_i < 1.

The unconditional variance of ε_t is then

σ² = E(ε_t²) = c / (1 − Σ_{i=1}^{m} α_i).
Equivalently, we can write ε_t = √(h_t) v_t, where v_t is i.i.d. with mean zero and unit variance, and

h_t = c + Σ_{i=1}^{m} α_i ε_{t−i}².

For example, for an ARCH(1) process,

h_t = c + α ε_{t−1}².
When v_t is i.i.d. N(0, 1), the moments of the ARCH(1) process are

E(ε_t) = 0,
E(ε_t²) = E(h_t v_t²) = σ² = c / (1 − α),
E(ε_t⁴) = 3 E(h_t²) = 3 (c² + 2cασ²) / (1 − 3α²),

so the kurtosis is

ks = E(ε_t⁴) / (E(ε_t²))² = 3 (1 − α²) / (1 − 3α²) > 3    for α² < 1/3.

That is, the ARCH(1) process has heavier tails than the normal distribution.
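These moment results are easy to check by simulation. The sketch below uses assumed values c = 1 and α = 0.3 (so α² < 1/3 and the fourth moment exists) and compares the sample variance and kurtosis of a simulated ARCH(1) series with c/(1 − α) and 3(1 − α²)/(1 − 3α²):

```python
import numpy as np

rng = np.random.default_rng(3)
T, c, a = 200_000, 1.0, 0.3    # a**2 < 1/3, so the fourth moment exists

v = rng.standard_normal(T)
eps = np.zeros(T)
h = np.zeros(T)
h[0] = c / (1 - a)             # start at the unconditional variance
eps[0] = np.sqrt(h[0]) * v[0]
for t in range(1, T):
    h[t] = c + a * eps[t - 1] ** 2     # ARCH(1) conditional variance
    eps[t] = np.sqrt(h[t]) * v[t]

var = eps.var()
kurt = (eps**4).mean() / var**2
print(var)    # near c/(1 - a) = 1.43
print(kurt)   # above 3; theory gives 3*(1 - a**2)/(1 - 3*a**2), about 3.74
```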
1.2 GARCH

A natural generalization of ARCH is to let the conditional variance depend on its own lags as well:

σ_t² = c + Σ_{i=1}^{p} δ_i σ_{t−i}² + Σ_{i=1}^{q} α_i ε_{t−i}²,

or, written in terms of ε_t² (with u_t = ε_t² − σ_t²),

ε_t² = c + Σ_{i=1}^{p} δ_i σ_{t−i}² + Σ_{i=1}^{q} α_i ε_{t−i}² + u_t.      (2)
We use GARCH(p, q) to denote such a process. A sufficient condition for σ_t² > 0 and for the process to be stationary is that δ_i ≥ 0 for i = 1, . . . , p, α_j ≥ 0 for j = 1, . . . , q, and

Σ_{i=1}^{p} δ_i + Σ_{j=1}^{q} α_j < 1.

The unconditional variance of ε_t is then

σ² = E(ε_t²) = c / (1 − Σ_{i=1}^{p} δ_i − Σ_{j=1}^{q} α_j).
For the S&P 500 data we have displayed, we estimate a GARCH(1, 1) process for the return r_t, and the estimates we obtained are

ε_t² = 0.0021² + 0.876 σ_{t−1}² + 0.097 ε_{t−1}² + û_t.

The implied unconditional variance is

σ² = E(ε_t²) = 0.0021² / (1 − 0.876 − 0.097) = 0.0127².

Figure 3 plots the estimated σ_t² (solid line in the lower graph) and the unconditional variance (dashed line in the lower graph).
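The arithmetic for the implied unconditional variance, together with the σ_t² recursion, can be sketched as follows. The returns fed to the recursion here are simulated placeholders, not the actual S&P 500 data:

```python
import numpy as np

# Unconditional variance implied by the estimates quoted in the text:
c, delta, alpha = 0.0021**2, 0.876, 0.097
sigma2 = c / (1 - delta - alpha)
print(np.sqrt(sigma2))   # about 0.0127

# GARCH(1,1) recursion for the conditional variance, given returns r:
def garch_sigma2(r, c, delta, alpha):
    """s2[t] = c + delta*s2[t-1] + alpha*r[t-1]**2, started at the
    unconditional variance."""
    s2 = np.empty(len(r))
    s2[0] = c / (1 - delta - alpha)
    for t in range(1, len(r)):
        s2[t] = c + delta * s2[t - 1] + alpha * r[t - 1] ** 2
    return s2

rng = np.random.default_rng(4)
r = 0.0127 * rng.standard_normal(500)   # placeholder returns
s2 = garch_sigma2(r, c, delta, alpha)
print(s2.mean())
```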
1.3
Multivariate GARCH
Let x_t now be a k × 1 vector following

x_t = Σ_{i=1}^{h} Φ_i x_{t−i} + ε_t,

and model each element σ_{ij,t} of the conditional variance-covariance matrix of ε_t with its own GARCH specification,

σ_{ij,t}² = c_{ij} + Σ_{l=1}^{p} δ_{ij,l} σ_{ij,t−l}² + Σ_{k=1}^{q} α_{ij,k} ε_{ij,t−k}² + u_{ij,t}.
[Figure 3: index returns (upper panel); estimated conditional variance and unconditional variance (lower panel), 1996-2002.]
The problem with this approach is that the number of parameters may become too large when k is large. For instance, even if we assume a GARCH(1, 1) process, when k = 10 we need to estimate 3 · 10 · 11/2 = 165 parameters.

To solve this problem, we can impose some structure on the conditional variance-covariance matrix of ε_t. For instance, Bollerslev (1990) suggested that the conditional correlations be constant over time. Then σ_{ij,t} = ρ_{ij} σ_{i,t} σ_{j,t}, with only one parameter, ρ_{ij}, instead of c_{ij}, δ_{ij} and α_{ij} for each covariance.
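The parameter counts can be reproduced directly. The count for the constant-correlation model below is one way to tally Bollerslev's specification (a GARCH(1, 1) for each of the k variances plus k(k − 1)/2 correlations); it is an illustration, not a count taken from the text:

```python
def n_params_full(k, p=1, q=1):
    # each of the k*(k+1)//2 series sigma_{ij,t} has 1 + p + q parameters
    return k * (k + 1) // 2 * (1 + p + q)

def n_params_ccc(k, p=1, q=1):
    # constant conditional correlation: a GARCH(p, q) for each of the k
    # variances, plus k*(k-1)//2 correlation parameters rho_ij
    return k * (1 + p + q) + k * (k - 1) // 2

print(n_params_full(10))   # 165, as in the text
print(n_params_ccc(10))    # 75
```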
1.4 Other GARCH models

We will briefly introduce a few other members of the GARCH model family.
IGARCH In a GARCH(p, q) model, when the coefficients satisfy

Σ_{i=1}^{p} δ_i + Σ_{j=1}^{q} α_j = 1,

Engle and Bollerslev (1986) referred to it as an integrated GARCH (IGARCH) process. In this case, the unconditional variance of ε_t is infinite, so the process is no longer covariance (weakly) stationary, but it is still strictly stationary.
EGARCH Recall that we let the innovation take the form ε_t = √(h_t) v_t, with v_t i.i.d. with mean zero and unit variance. Nelson (1991) proposed the following specification for h_t:

log h_t = c + Σ_{i=1}^{∞} π_i ( |v_{t−i}| − E|v_{t−i}| + θ v_{t−i} ).

The parameter π_i captures the effect of the deviation of |v_{t−i}| from its expectation. A more interesting element of the EGARCH specification is the parameter θ.
We have discussed two features of financial data: dependence in volatility and heavy tails. There is another feature of financial return data: negative skewness. The normal distribution has zero skewness, but for the data set we have used as an example, the empirical skewness is -0.1806. In other words, in financial return data, negative shocks tend to be associated with larger volatility than positive shocks. The parameter θ in the EGARCH model can capture this effect: if θ = 0, then the volatilities following positive and negative shocks are symmetric; if θ < 0, then negative shocks tend to have a larger effect on the volatility.
There are still other models that belong to the GARCH family: GARCH with thresholds (Zakoian 1990), GARCH with regime switching (Cai 1994), etc.
Readings: Hamilton Ch 21
2 Regime-Switching Models

2.1 Markov chains
Let s_t be a random variable that takes only integer values. Suppose the probability that s_t takes a particular value j depends on the past only through the most recent value s_{t−1}:

P{ s_t = j | s_{t−1} = i, s_{t−2} = k, . . . } = P{ s_t = j | s_{t−1} = i } = P_{ij}.

Such a process is called a Markov chain, and P_{ij} is called a transition probability: the probability that state i will be followed by state j. Suppose there are N states; then we must have

Σ_{j=1}^{N} P_{ij} = 1.
For example, suppose there is a squirrel, who may stay inside a house (in the roof) or stay in the tree (the tree by the house). We can specify the transition matrix (P′) of this squirrel as:

                    House (t)   Tree (t)
  House (t + 1)       0.7         0.1
  Tree (t + 1)        0.3         0.9
To study forecasts of a Markov chain, using our two-state example, we can assign the integers 1 and 2 to the two states and define the random vector ξ_t, with ξ_t = (1, 0)′ when s_t = 1 and ξ_t = (0, 1)′ when s_t = 2. Then

E(ξ_{t+1} | ξ_t, ξ_{t−1}, . . .) = P′ ξ_t,                     (3)

so we can write ξ_{t+1} = P′ ξ_t + v_{t+1}, where

v_{t+1} = ξ_{t+1} − E(ξ_{t+1} | ξ_t, ξ_{t−1}, . . .).

Iterating, the m-step-ahead forecast is

E(ξ_{t+m} | ξ_t, ξ_{t−1}, . . .) = (P′)^m ξ_t.                  (4)
Now, suppose p_{11} = 1 instead of 0.7; then the squirrel will stay in the roof forever. In this case, the state House is an absorbing state, and the Markov chain is reducible. On the other hand, if a Markov chain is not reducible, we say it is irreducible. For our two-state example, this requires that P_{11} < 1 and P_{22} < 1.
For a transition matrix P, suppose that one of the eigenvalues is unity and that all other eigenvalues of P are inside the unit circle. Then the Markov chain is said to be ergodic. The eigenvector π corresponding to the unit eigenvalue gives the ergodic probabilities (after being rescaled so that its elements sum to unity, 1′π = 1). It can be shown that (Hamilton, page 681)

lim_{m→∞} P^m = 1 π′.

Hence

E(ξ_{t+m} | ξ_t, ξ_{t−1}, . . .) = (P′)^m ξ_t → π 1′ ξ_t = π.

So the forecast of ξ_{t+m} converges to π no matter what ξ_t is, and we can see that π gives the unconditional probabilities for the process (while the matrix P gives the conditional probabilities).

In our example of the squirrel, we can compute that π = (0.25, 0.75)′; that is, the squirrel stays in the house about one fourth of the time.
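The ergodic probabilities can be computed as the unit-eigenvalue eigenvector of P′, rescaled so that its elements sum to one:

```python
import numpy as np

Pt = np.array([[0.7, 0.1],    # P' from the squirrel example
               [0.3, 0.9]])

# Eigenvector of P' for the eigenvalue closest to 1, rescaled to sum to one.
w, V = np.linalg.eig(Pt)
pi = V[:, np.argmin(np.abs(w - 1))].real
pi = pi / pi.sum()
print(pi)   # [0.25, 0.75]
```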
In general, for a two-state Markov chain to be ergodic, besides the conditions for irreducibility, P_{11} < 1 and P_{22} < 1, we also require P_{11} + P_{22} > 0, which means at least one of these two probabilities is positive. If both probabilities are zero, then in our example the squirrel jumps from the house to the tree and from the tree to the house deterministically, so at time t + m the position of the squirrel depends on its position at time t. If the squirrel is in the house at time t and m is an even number, we know that the squirrel is in the house at time t + m. Hence, no matter how large m is, we can always tell where the squirrel is given its position at time t.
2.2
Let st be a Markov chain and there are N possible states. Let xt denote another sequence, and
the distribution of xt at time t depends on st . For example, suppose there are two states and st
take values 1 or 2. When st = 1, xt equals 0 with probability 0.9 and equals 1 with probabilty
0.1; while when st = 2, xt equals 0 with probability 0.1 and equals 1 with proababilty 0.9. Further
assume that we could not observe st and we can only observe xt , this is a simple example of a
hidden Markov chain. (draw a picture here)
xt can also be drawn from a continuous distribution, such as a normal distribution. For example,
when st = 1, xt is drawn from N (0, 1), and when st = 2, xt is drawn from N (2, 4). We write the
density of xt conditional on st as follows
f(x_t | s_t = 1) = (1/√(2π)) exp( −x_t² / 2 ),
f(x_t | s_t = 2) = (1/(2√(2π))) exp( −(x_t − 2)² / 8 ).
To compute the unconditional distribution, we need to know the distribution of s_t. For example, if s_t is i.i.d. with P(s_t = 1) = 1/3, then the unconditional distribution of x_t is

f(x_t) = (1/3) f(x_t | s_t = 1) + (2/3) f(x_t | s_t = 2).

Figure 4 plots the density of this mixture distribution as well as the two normal densities.
[Figure 4: density of the mixture distribution, together with the N(0, 1) and N(2, 4) densities.]
Now, although we cannot observe s_t, we can make an inference about s_t based on x_t. In our example,

P(s_t = 1 | x_t) = p(x_t, s_t = 1) / f(x_t) = (1/3) f(x_t | s_t = 1) / f(x_t).      (5)
Similarly, we can write this for P(s_t = 2 | x_t). From this expression, we see that two factors jointly determine this probability: one is the unconditional probability of s_t, and the other is the probability that each component generates x_t. Consider some numerical examples. Suppose we observe x_t = 3; then we know that N(2, 4) is more likely to generate this observation, and also that the unconditional probability of s_t = 2 is larger. Hence we believe that s_t is much more likely to be 2. Using (5), we can compute that P(s_t = 1 | x_t = 3) ≈ 0.01, which supports this reasoning. On the other hand, if we observe x_t = 1, then things are not that clear. Although s_t has a larger unconditional probability of being 2, from Figure 4 we can see that the N(0, 1) density at x_t = 1 is larger than the N(2, 4) density. Using (5), we can compute that the probability that s_t = 1 conditional on x_t = 1 is about 0.41.
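Equation (5) can be evaluated directly. A small sketch for the N(0, 1) / N(2, 4) mixture with weight 1/3 on the first component:

```python
import math

def phi(x, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_state1(x, p1=1/3):
    """P(s_t = 1 | x_t) from equation (5)."""
    num = p1 * phi(x, 0.0, 1.0)
    den = num + (1 - p1) * phi(x, 2.0, 4.0)
    return num / den

print(p_state1(3.0))   # small: x_t = 3 almost surely comes from regime 2
print(p_state1(1.0))   # close to one half: the regimes are hard to tell apart
```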
Above is a simple example in which we assume that we know all the coefficients; it largely illustrates how to work with an i.i.d. mixture model. In general, suppose there are N states (N individual distributions in the mixture), and assume that x_t is drawn from N(μ_i, σ_i²) when s_t = i, so that

f(x_t | s_t = i; μ_i, σ_i²) = (1/√(2π σ_i²)) exp( −(x_t − μ_i)² / (2σ_i²) ).      (6)

Let π_i denote the probability that s_t = i, and let θ = (π_1, . . . , π_N, σ_1², . . . , σ_N², μ_1, . . . , μ_N). Then the joint probability density for x_t and s_t = i is

p(x_t, s_t = i; θ) = π_i f(x_t | s_t = i; μ_i, σ_i²),

and the density of x_t is

f(x_t; θ) = Σ_{i=1}^{N} π_i f(x_t | s_t = i; μ_i, σ_i²).

The log likelihood of the sample is

L(θ) = Σ_{t=1}^{T} log f(x_t; θ).

We can then solve for the maximum likelihood estimator of θ under the restrictions that π_i ≥ 0 and Σ_{i=1}^{N} π_i = 1.
Note that to maximize this function, we first take the sum over the different components and then take the log; hence it is not possible to solve analytically for θ as a function of the data. In empirical studies, the MLE of mixture models is computed using the EM algorithm, where E stands for expectation and M for maximization. This is an iterative method, and the likelihood is guaranteed to increase in each iteration.
The MLE estimates for this system can be shown to satisfy (pages 699-701 in Hamilton)

μ̂_i = Σ_{t=1}^{T} x_t P(s_t = i | x_t; θ̂) / Σ_{t=1}^{T} P(s_t = i | x_t; θ̂),

σ̂_i² = Σ_{t=1}^{T} (x_t − μ̂_i)² P(s_t = i | x_t; θ̂) / Σ_{t=1}^{T} P(s_t = i | x_t; θ̂),

π̂_i = (1/T) Σ_{t=1}^{T} P(s_t = i | x_t; θ̂).
The EM algorithm was originally designed for estimation with missing data. In a mixture model, if we knew which regime (state) each observation x_t was drawn from, then the problem would be much easier: μ̂_i and σ̂_i² would just be the mean and variance computed using the data from regime i, and π̂_i would just be the proportion of data from regime i. Since we don't know this information, we use an iterative algorithm.

We can start with an arbitrary value for θ, denoted θ_0, and plug θ_0 into the right-hand side of the above equations to obtain a new estimate, θ_1. We continue this iteration and stop when θ_m and θ_{m+1} are close.
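A compact sketch of this EM iteration for a two-component Gaussian mixture, on data simulated with the same parameters as the example above (the starting values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data: s_t i.i.d. with P(s=1) = 1/3, x|s=1 ~ N(0,1), x|s=2 ~ N(2,4).
T = 4000
s = rng.random(T) < 1 / 3
x = np.where(s, rng.normal(0.0, 1.0, T), rng.normal(2.0, 2.0, T))

def em_step(x, pi, mu, var):
    # E step: w[t, i] = P(s_t = i | x_t; theta)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    w = pi * dens
    w /= w.sum(axis=1, keepdims=True)
    # M step: the weighted means, variances, and proportions from the text
    pi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0)
    return pi, mu, var

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 3.0]), np.array([1.0, 1.0])
for _ in range(200):
    pi, mu, var = em_step(x, pi, mu, var)
print(pi, mu, var)   # close to (1/3, 2/3), (0, 2), (1, 4)
```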
2.3 Markov-switching models

Next, we apply the hidden Markov model to time series, which allows for time-varying parameters. The idea is that under different regimes, the parameters, which represent levels or relationships, may be different. For simplicity, we assume that there are two regimes, s_t = 1 or s_t = 2, and let P denote the transition matrix. We continue to assume that we cannot observe s_t and can only observe y_t, whose process is specified as

y_t = c_{s_t} + φ y_{t−1} + u_t,                     (7)

where u_t ~ i.i.d. N(0, σ²).
Collect the conditional densities of y_t under the two regimes in the vector

η_t = [ f(y_t | s_t = 1, y_{t−1}; θ);  f(y_t | s_t = 2, y_{t−1}; θ) ]
    = [ (1/√(2πσ²)) exp( −(y_t − c_1 − φ y_{t−1})² / (2σ²) );
        (1/√(2πσ²)) exp( −(y_t − c_2 − φ y_{t−1})² / (2σ²) ) ].      (8)
To analyze this system, let's first assume that we know all the parameters θ = (c_1, c_2, φ, σ², p_{11}, p_{22}). Unlike in the i.i.d. case, we can now draw inferences about s_t based on all the observations up to time t. Let Y_t denote the observations up to time t; then we can collect the conditional probabilities about s_t at time t in the vector

ξ̂_{t|t} = [ P(s_t = 1 | Y_t; θ);  P(s_t = 2 | Y_t; θ) ].
Let's first consider how to compute the density of y_t conditional on Y_{t−1}. If we take the element-by-element product of ξ̂_{t|t−1} and η_t, written (ξ̂_{t|t−1} ⊙ η_t), we get

p(y_t, s_t = 1 | Y_{t−1}; θ) = P(s_t = 1 | Y_{t−1}; θ) f(y_t | s_t = 1, y_{t−1}; θ),
p(y_t, s_t = 2 | Y_{t−1}; θ) = P(s_t = 2 | Y_{t−1}; θ) f(y_t | s_t = 2, y_{t−1}; θ).

Summing the two elements gives the conditional density of y_t,

f(y_t | Y_{t−1}; θ) = 1′ (ξ̂_{t|t−1} ⊙ η_t),                     (9)

and the sample log likelihood is

L(θ) = Σ_{t=1}^{T} log f(y_t | Y_{t−1}; θ).                     (10)
To derive a rule for updating the forecast and the optimal inference about s_t, note that if we divide each element of (ξ̂_{t|t−1} ⊙ η_t) by f(y_t | Y_{t−1}; θ) = 1′(ξ̂_{t|t−1} ⊙ η_t), we have

p(y_t, s_t = i | Y_{t−1}; θ) / [1′(ξ̂_{t|t−1} ⊙ η_t)] = p(y_t, s_t = i | Y_{t−1}; θ) / f(y_t | Y_{t−1}; θ)
    = P(s_t = i | y_t, Y_{t−1}; θ) = P(s_t = i | Y_t; θ).

Doing this for each element of the vector, we obtain

ξ̂_{t|t} = (ξ̂_{t|t−1} ⊙ η_t) / [1′(ξ̂_{t|t−1} ⊙ η_t)],                     (11)

and the forecast for the next period is

ξ̂_{t+1|t} = P′ ξ̂_{t|t}.                     (12)

These two equations, (11) and (12), compose an iterative algorithm to compute the optimal inference for s_t. The iteration starts from ξ̂_{1|0}, which can be specified in several ways (see page 693 in Hamilton).
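The filter defined by (11) and (12) is only a few lines of code. Below is a sketch on data simulated from a two-regime version of (7); the parameter values are assumptions chosen for the illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulate the two-regime AR(1) of equation (7): y_t = c_{s_t} + phi*y_{t-1} + u_t
T, phi, sig = 400, 0.5, 1.0
c = np.array([0.0, 4.0])                 # c_1, c_2
P = np.array([[0.95, 0.05],              # P[i, j] = P(s_{t+1} = j | s_t = i)
              [0.05, 0.95]])
s = np.zeros(T, dtype=int)
y = np.zeros(T)
for t in range(1, T):
    s[t] = rng.choice(2, p=P[s[t - 1]])
    y[t] = c[s[t]] + phi * y[t - 1] + sig * rng.standard_normal()

# Hamilton filter: iterate equations (11) and (12).
xi = np.array([0.5, 0.5])                # xi_{1|0}
loglik = 0.0
xi_path = np.zeros((T, 2))
for t in range(1, T):
    eta = np.exp(-(y[t] - c - phi * y[t - 1]) ** 2 / (2 * sig**2)) \
          / np.sqrt(2 * np.pi * sig**2)  # densities under each regime
    joint = xi * eta                     # xi_{t|t-1} (element-wise *) eta_t
    f = joint.sum()                      # f(y_t | Y_{t-1})
    loglik += np.log(f)
    xi_t = joint / f                     # equation (11): xi_{t|t}
    xi = P.T @ xi_t                      # equation (12): xi_{t+1|t}
    xi_path[t] = xi_t
print(loglik)
```

With regimes this well separated, the filtered probabilities track the true states closely.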
So far, when we draw inferences about s_t, we base them on information up to time t. However, as we obtain more information and look back, we may have different ideas about what happened at time t. Such an inference, say P(s_t = i | Y_τ; θ) for τ > t, is called the smoothed inference.

Above, we assumed that we know the parameter θ. To estimate θ, we can find the estimator that maximizes the likelihood (10), using numerical optimization techniques.