In this part, we will talk about estimation. Our focus will almost exclusively be on
the maximum likelihood method.
We have worked with many distributions so far: calculated their expectations and variances, derived their moment generating functions, and so on.
Importantly, the setting was such that we knew which distribution we were considering AND we had full knowledge of the parameter values of that distribution. Or, to put it more precisely, we never contemplated the possibility that they might not be known.
Then, there are two implicit assumptions:
1 We know the distribution.
2 We know the parameters of the distribution.
In real life, this is rarely the case. We will first relax the second assumption and later on dispense with the first.
The treatment will sacrifice formality and focus instead on ideas. References for more formal treatments will be provided at the end of this set of slides.
Now, let's assume we have a random sample consisting of X1, X2, ..., Xn from the density f_X(x | θ0).
We would like to determine the value of θ 0 , which is unknown.
We could use an estimator.
An estimator is some function of the data,

θ̂n = W(X1, ..., Xn). (1)

The index n underlines the fact that the particular value of the estimate depends on the sample (and, so, on its size). Note that usually n is dropped and simply θ̂ is used.
Note the difference between the estimator and the estimate. The estimator is a concept while the estimate is the value of the estimator for a given sample. So, if the estimator is W(X1, ..., Xn), then the estimate for a particular realisation of X1, ..., Xn is given by W(x1, ..., xn).
Now, although the definition given in (1) implies that any function of the data could be a valid estimator, we usually look for those that have desirable properties.
In other words, an estimator is a statistic (it cannot depend on θ or any other unknown parameters) that has desirable properties.
We have actually introduced one of these desirable properties: consistency. Others
are unbiasedness, minimum mean squared error, minimum variance etc.
Let Θ be the parameter space for θ. An estimator θ̂ of θ0 is a minimum mean squared error estimator if, for every θ0 ∈ Θ,

θ̂ = arg min_{θ̃} E[(θ̃ − θ0)²],

where the minimisation is over estimators θ̃. An estimator θ̂ is unbiased if

E[θ̂] = θ0.
You will learn more about these in your future econometrics courses.
Let us first dissect the notation. Suppose we are dealing with some generic distribution such that

F_Y(y; θ), θ ∈ Θ.
F is the cdf, Y is the random variable and y is a particular realisation of Y .
θ is a vector which contains the distribution parameters. This is generally known as
the parameter vector.
The parameter vector takes on values on a set, Θ, known as the parameter space.
For example, for a normal random variable, θ = (µ, σ²)′. Keep in mind the distinction:
θ : population,
θ̂ : sample.
The maximum likelihood method is a very popular and powerful method for estimating θ when the underlying distribution function, F_Y, is known (or when one believes that one actually knows the underlying distribution).
The likelihood function is simply the joint density, viewed as a function of the parameters:

f_Y(y; θ) = L(θ; y).
[Figure: histogram of the data, with the candidate densities (µ = 0, σ² = 1) and (µ = 0, σ² = 4) overlaid.]
Figure: We have 10,000 iid observations from the true distribution N(0, σ²). Two possible values for σ². Which one is the correct one? Can we use the data to make a decision?
The dataset would preferably consist of many observations on the same random variable. This ensures that we have sufficient information to estimate θ. Consider some simple examples.
Example: Let Yi be an iid random sequence where i = 1, ..., n. Let also Yi ∼ N(µ, σ²), where Θ = {(µ, σ²) : −∞ < µ < ∞, σ² > 0} gives the parameter space. Then, thanks to the independence assumption, the joint likelihood function is given by

f_Y(y; θ) = ∏_{i=1}^n f_{Y_i}(yi; θ),

where y = (y1, ..., yn).
Notice that the parameter vector, θ, is common to all variables.
Example (linear regression): yi = xi′β + ui, where ui is independent of xi. Here, β = (0.3, 0.4)′, σ² = 1 and xi = (x_{i1}, x_{i2})′.
Example (ARCH): the conditional variance follows

σ²t = ω + α y²_{t−1}.
For dependent data, the joint density can be factorised into a product of conditionals:

f_{Y1,...,YT} = f_{Y2,...,YT | Y1} f_{Y1}
             = f_{Y3,...,YT | Y1,Y2} f_{Y2|Y1} f_{Y1}
             = f_{Y4,...,YT | Y1,Y2,Y3} f_{Y3|Y1,Y2} f_{Y2|Y1} f_{Y1}
             ⋮
             = ∏_{t=1}^T f_{Yt | Yt−1,...,Y1}.
The ML estimator θ̂ solves the first-order condition

∂ log L(θ; y)/∂θ |θ=θ̂ = 0.

The second-order condition for a maximum is

∂² log L(θ; y)/∂θ∂θ′ |θ=θ̂ < 0,

i.e. the Hessian evaluated at θ̂ is negative definite.
Then,

ℓ(θ; y) = log L(θ; y) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (yi − µ)².

Obviously, θ = (µ, σ²)′. Let's find the ML estimators.
Now,

∂ℓ(θ; y)/∂µ |µ=µ̂, σ²=σ̂² = (1/σ̂²) ∑_{i=1}^n (yi − µ̂) = 0,

and

∂ℓ(θ; y)/∂σ² |µ=µ̂, σ²=σ̂² = −n/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^n (yi − µ̂)² = 0.

Solving these two equations gives

µ̂ = ȳ = (1/n) ∑_{i=1}^n yi   and   σ̂² = (1/n) ∑_{i=1}^n (yi − µ̂)² = (1/n) ∑_{i=1}^n (yi − ȳ)².

Therefore, θ̂ = (ȳ, (1/n) ∑_{i=1}^n (yi − ȳ)²)′.
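As a quick numerical sanity check (a sketch, not part of the slides; the simulated data and starting values are arbitrary), the closed-form ML estimators derived above can be compared against a direct numerical maximisation of the Gaussian log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)  # illustrative simulated sample

# Closed-form ML estimators derived above: sample mean and *biased* variance.
mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

# Numerical check: maximise the log-likelihood directly.
def neg_loglik(theta):
    mu, log_s2 = theta            # optimise log(sigma^2) to keep the variance positive
    s2 = np.exp(log_s2)
    n = y.size
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_s2
            + ((y - mu) ** 2).sum() / (2 * s2))

res = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
mu_num, s2_num = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma2_hat)   # closed form
print(mu_num, s2_num)       # numerical maximiser agrees
```

The two pairs of numbers should match to several decimal places.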
In the next few slides, we will cover some important common properties of likelihood
functions.
In this discussion, we will assume that the data generating process and the chosen underlying distribution are the same:

g_Y(y) = f_Y(y; θ0).

This assumption is crucial. Relaxing it is possible; we will do this later. For the time being, we will stick to the simpler case.
Now,

E_f[ ∂ log L(θ; y)/∂θ |θ=θ0 ] = ∫ [ ∂L(θ; y)/∂θ · (1/L(θ; y)) ] |θ=θ0 f(y; θ0) dy
                              = ∫ ∂L(θ; y)/∂θ |θ=θ0 dy        (since L(θ0; y) = f(y; θ0))
                              = ∫ ∂f(y; θ)/∂θ |θ=θ0 dy
                              = ∂/∂θ [ ∫ f(y; θ) dy ] |θ=θ0
                              = ∂/∂θ [1] |θ=θ0
                              = 0,

where we implicitly assumed that the order of integration and differentiation can be exchanged. An aside: this requires that the range of y does not depend on θ.
Hence, the expectation of the first-order condition, evaluated at the true parameter value, is zero!
(Bilkent) ECON509 This Version: 16 December 2013 21 / 73
Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator
Information Matrix Equality:

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = −Cov_f( ∂ log L(θ; y)/∂θ |θ=θ0 ),

where, as before, the expectation and the covariance are taken with respect to f_Y(y; θ0).
Proof: Now, one can show that

∂² log L(θ; y)/∂θ∂θ′ = [1/L(θ; y)] ∂²L(θ; y)/∂θ∂θ′ − [1/L(θ; y)²] [∂L(θ; y)/∂θ] [∂L(θ; y)/∂θ′].
Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = ∫ [ (1/L(θ; y)) ∂²L(θ; y)/∂θ∂θ′ ] |θ=θ0 f(y; θ0) dy
                                  − ∫ [ (1/L(θ; y)²) (∂L(θ; y)/∂θ)(∂L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy.

For the first term on the right-hand side,

∫ [ (1/L(θ; y)) ∂²L(θ; y)/∂θ∂θ′ ] |θ=θ0 f(y; θ0) dy = ∫ ∂²f(y; θ)/∂θ∂θ′ |θ=θ0 dy
                                                    = ∂²/∂θ∂θ′ [ ∫ f(y; θ) dy ] |θ=θ0
                                                    = ∂²/∂θ∂θ′ [1] |θ=θ0
                                                    = 0.

These hold if, of course, we can exchange the order of integration and differentiation, which is implicitly assumed here.
Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = −∫ [ (1/L(θ; y)²) (∂L(θ; y)/∂θ)(∂L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy
                                  = −∫ [ (∂ log L(θ; y)/∂θ)(∂ log L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy
                                  = −E_f[ (∂ log L(θ; y)/∂θ)(∂ log L(θ; y)/∂θ′) |θ=θ0 ]
                                  = −Cov_f( ∂ log L(θ; y)/∂θ |θ=θ0 ),

since the score, ∂ log L(θ; y)/∂θ |θ=θ0, has zero mean.
Cramér-Rao Lower Bound: for any unbiased estimator θ̃ of θ0,

Var_f(θ̃) ≥ I⁻¹,

in the sense that the difference between the two matrices is non-negative definite.
Proof: For simplicity of exposition we focus on the univariate case, where θ is a scalar. Since θ̃ is unbiased for θ0, we have

E_f[θ̃] = ∫ θ̃ f(y; θ0) dy = θ0.

Differentiating both sides with respect to θ (and exchanging differentiation and integration) gives

1 = ∫ θ̃ ∂f(y; θ)/∂θ |θ=θ0 dy = E_f[ θ̃ ∂ log f(y; θ)/∂θ |θ=θ0 ].

Now consider the covariance between θ̃ and ∂ log f(y; θ)/∂θ |θ=θ0. Since the score has zero mean, and Cov(X, Y) = E[XY] whenever E[Y] = 0, we obtain

1 = E_f[ θ̃ ∂ log f(y; θ)/∂θ |θ=θ0 ] = Cov_f( θ̃, ∂ log f(y; θ)/∂θ |θ=θ0 ).

Remember that by the Cauchy-Schwarz Inequality, for any two random variables X and Y,

Cov_f(X, Y)² ≤ Var_f(X) Var_f(Y).

Then,

Cov_f( θ̃, ∂ log f(y; θ)/∂θ |θ=θ0 )² = 1² ≤ Var_f(θ̃) Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ).
This gives

Var_f(θ̃) ≥ 1 / Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ).

Example: Let Yi be iid N(θ, 1). Then

ℓ(θ; y) = −(n/2) log 2π − (1/2) ∑_{i=1}^n (yi − θ)².
Setting the derivative ∂ℓ(θ; y)/∂θ = ∑_{i=1}^n (yi − θ) to zero at θ = θ̂ implies that

∑_{i=1}^n yi − ∑_{i=1}^n θ̂ = 0,

and, so,

θ̂ = (1/n) ∑_{i=1}^n yi = ȳ.
In addition,

Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ) = Var_f( ∑_{i=1}^n (yi − θ0) ) = ∑_{i=1}^n Var_f(yi) = n,

since Var_f(yi) = 1 for each i.
Therefore, as far as this problem is concerned, the Cramér-Rao bound for any unbiased estimator θ̃ is given by

Var_f(θ̃) ≥ 1/n.
Now, the variance of the ML estimator is very easy to find:

Var(θ̂) = Var( (1/n) ∑_{i=1}^n yi ) = (1/n²) ∑_{i=1}^n Var(yi) = (1/n²) n = 1/n.

But this is the same as the Cramér-Rao bound. Hence, the ML estimator in this particular case is an efficient estimator.
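A quick Monte Carlo check of this efficiency claim (a sketch under the model Yi ∼ N(θ0, 1); the sample size, number of replications, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, theta0 = 50, 20000, 0.0

# Draw `reps` samples of size n from N(theta0, 1) and record the ML
# estimate (the sample mean) for each replication.
samples = rng.normal(theta0, 1.0, size=(reps, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.var())  # should be close to the Cramer-Rao bound 1/n
print(1 / n)
```

The simulated variance of θ̂ across replications should sit right at 1/n = 0.02.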
Remember that we restrict ourselves to the case where our random sequence
Y1 , ..., Yn is iid. Let y = (y1 , ..., yn ) .
Let f_{Y_i}(yi; θ) = L(θ; yi) be the pdf (or the likelihood function) for Yi. Then, the joint pdf or the joint likelihood function is given by

L(θ; y) = ∏_{i=1}^n L(θ; yi).
Now, suppose that log L(θ; y1), log L(θ; y2), ... is an iid sequence where E|log L(θ; yi)| < ∞ for all i.
Then, by the strong Law of Large Numbers,

(1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)].
Let θ̂ be the maximum likelihood estimator of the true parameter value θ0. Can we use the previous convergence result to argue that we will also have

θ̂ →a.s. θ0?
Suppose, further, that the convergence holds uniformly over the parameter space:

sup_{θ∈Θ} | (1/n) ∑_{i=1}^n log L(θ; yi) − E_{f(y|θ0)}[log L(θ; yi)] | →a.s. 0.
Moreover, by definition,

θ̂ = arg max_{θ∈Θ} log L(θ; y).

Then, since (1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)] uniformly for all θ, we can intuitively argue that the argument that maximises (1/n) ∑_{i=1}^n log L(θ; yi) will also converge to the argument that maximises E_{f(y|θ0)}[log L(θ; yi)]. In other words,

(1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)] uniformly for all θ,

so

θ̂ = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n log L(θ; yi) →a.s. arg max_{θ∈Θ} E_{f(y|θ0)}[log L(θ; y)] = θ0. (3)
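A tiny simulation (purely illustrative; the N(θ0, 1) model, sample sizes, and seed are assumptions of this example) shows the consistency result in action, since in that model the ML estimator is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(6)
theta0 = 1.0

errors = []
for n in [10, 100, 1000, 10000]:
    y = rng.normal(theta0, 1.0, size=n)
    theta_hat = y.mean()            # the ML estimator in the N(theta, 1) model
    errors.append(abs(theta_hat - theta0))

print(errors)  # the estimation error tends to shrink as n grows
```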
Now, suppose that we have already proved that θ̂ →a.s. θ0 (which implies that θ̂ →p θ0). How do we show the asymptotic normality of θ̂?
We start with a first-order expansion of the score function about θ̂ = θ0. Our attention will again be on the univariate case, where θ is a scalar.
For the sake of notational simplicity, whenever we write

(1/n) ∑_{i=1}^n ∂ log L(θ̄; yi)/∂θ, etc.,

we will mean

(1/n) ∑_{i=1}^n ∂ log L(θ; yi)/∂θ |θ=θ̄.

Throughout, we also assume that the order of integration and differentiation can be interchanged and that {∂ log L(θ; yi)/∂θ}_{i=1,...,n} and {∂² log L(θ; yi)/∂θ²}_{i=1,...,n} are both iid sequences.
We will this time use a mean value expansion. This is almost the same as the Taylor expansion. The only difference is that, instead of including a remainder term, the final term in the expansion is evaluated at the so-called mean value.
Hence, a kth order mean value expansion of some function f(x) about x = x0 is given by

f(x) = f(x0) + f⁽¹⁾(x0)(x − x0) + ... + [1/(k−1)!] f⁽ᵏ⁻¹⁾(x0)(x − x0)^{k−1} + (1/k!) f⁽ᵏ⁾(x̃)(x − x0)^k,

where x̃ ∈ [min(x, x0), max(x, x0)].
Then, we have

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ] (θ̂ − θ0),

where θ̃ lies between θ̂ and θ0.
Now, since θ̂ →a.s. θ0 and since θ̃ is always between θ̂ and θ0, we have θ̃ →a.s. θ0 as well.
Remember that, by definition,

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = 0.
Then, we have

0 = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ] (θ̂ − θ0).
Let's try to prove that there indeed is a CLT for the highlighted term on the previous slide.
Remember what we need to have for the existence of a CLT: an iid sequence with finite mean and variance.
Firstly, by assumption,

{ ∂ log L(θ0; yi)/∂θ }_{i=1,...,n}

is an iid sequence.
Also, by the unbiasedness of the score and by a moment assumption we made previously,

E[ ∂ log L(θ0; yi)/∂θ ] = 0 < ∞   and   0 < Var[ ∂ log L(θ0; yi)/∂θ ] < ∞.

Note that the second result above follows from

E[ (∂ log L(θ0; yi)/∂θ)² ] = Var[ ∂ log L(θ0; yi)/∂θ ].

Define

I = Var[ ∂ log L(θ0; yi)/∂θ ].
Then, we have the following Central Limit Theorem result:

√n [ (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ − E[∂ log L(θ0; yi)/∂θ] ] / √I →d N(0, 1),

where the expectation term is zero. Equivalently,

(1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ →d N(0, I).
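A small simulation illustrates this CLT for the score (illustrative assumptions: the N(θ0, 1) model, for which the score of a single observation is simply yi − θ0 and I = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta0 = 200, 10000, 1.0

y = rng.normal(theta0, 1.0, size=(reps, n))
score = y - theta0                   # d/d(theta) log L(theta0; y_i) in the N(theta, 1) model
z = score.sum(axis=1) / np.sqrt(n)   # (1/sqrt(n)) * sum of scores, per replication

print(z.mean(), z.var())  # approximately 0 and I = 1
```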
Similarly, by assumption,

{ ∂² log L(θ; yi)/∂θ² }_{i=1,...,n}

is iid while

E| ∂² log L(θ; yi)/∂θ² | < ∞.

Then, under some other standard assumptions, one can show that

(1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² − (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ² →p 0.

Note that the main tricky point here is that the arguments of the two functions are different (θ̃ vs θ0). Nevertheless, this type of result is pretty standard.
Combining the expansion with these convergence results,

√n (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ]⁻¹ (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ
             =d −[ (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ² ]⁻¹ (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ →d N(0, J⁻²I),

where =d stands for "equal in distribution."
The term

I = Var[ ∂ log L(θ0; yi)/∂θ ]

is generally referred to as the expected (Fisher) information, or more precisely, the Fisher Information.
Moreover,

∂² log L(θ0; yi)/∂θ²

is generally known as the Hessian, with the associated expected Hessian given by

J = E[ ∂² log L(θ0; yi)/∂θ² ].
The quantities I and J can easily be consistently estimated by using their sample counterparts; that is,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ0; yi)/∂θ ]²   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ².

Of course, when θ0 is unknown (as is always the case), one uses the estimated parameter value, θ̂. Therefore,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ̂; yi)/∂θ ]²   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ².
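Under the sign convention above (J is the expected Hessian, without a minus sign), the information matrix equality reads I = −J. This is easy to check numerically; the exponential distribution used below is not from the slides, it is just a convenient scalar example where both estimates have simple closed forms:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=20000)   # Exponential sample, rate lambda0 = 0.5

lam_hat = 1 / y.mean()                       # ML estimator of the rate

# log f(y; lam) = log(lam) - lam * y
score = 1 / lam_hat - y                      # d/d(lam) log f, evaluated at lam_hat
I_hat = (score ** 2).mean()                  # outer-product estimate of I
J_hat = -1 / lam_hat ** 2                    # Hessian d^2/d(lam)^2 log f = -1/lam^2

print(I_hat, -J_hat)   # information matrix equality: I ≈ -J
```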
Remember that

I⁻¹ = [ Var( ∂ log L(θ0; yi)/∂θ ) ]⁻¹

is the Cramér-Rao lower bound. Hence, this confirms that, in this given framework, the ML estimator's asymptotic variance has the minimum variance property.
Before we move on, let's do the derivation of the asymptotic distribution for the case where θ is a (p × 1) vector rather than a scalar.
Again, we start with a Taylor expansion, but note that this time, we are dealing with vectors.

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0). (4)

Here, the left-hand side, the first term on the right and (θ̂ − θ0) are (p × 1), while the term in square brackets is (p × p).
Here,

∂² log L(θ̃; yi)/∂θ∂θ′ =
⎡ ∂² log L(θ̃; yi)/∂θ1∂θ1   ∂² log L(θ̃; yi)/∂θ1∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θ1∂θp ⎤
⎢ ∂² log L(θ̃; yi)/∂θ2∂θ1   ∂² log L(θ̃; yi)/∂θ2∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θ2∂θp ⎥
⎢           ⋮                         ⋮               ⋱             ⋮            ⎥
⎣ ∂² log L(θ̃; yi)/∂θp∂θ1   ∂² log L(θ̃; yi)/∂θp∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θp∂θp ⎦

A technical aside: the parameter θ̃ appearing in each entry in the above matrix is understood to (possibly) differ from entry to entry.
All that we had assumed for the scalar terms in the scalar case are now assumed to hold entry by entry for all matrices.
Now, rearranging (4) gives

0 = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0)

⇒ −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0) = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ

⇒ (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ]⁻¹ (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ,
and

J = E[ ∂² log L(θ0; Yi)/∂θ∂θ′ ], a (p × p) matrix.

Here, the score is the (p × 1) vector

∂ log L(θ0; yi)/∂θ = ( ∂ log L(θ0; yi)/∂θ1 , ∂ log L(θ0; yi)/∂θ2 , ... , ∂ log L(θ0; yi)/∂θp )′,

and ∂ log L(θ0; yi)/∂θ′ is simply the transpose of this vector.
The sample counterparts are now

Î = (1/n) ∑_{i=1}^n [∂ log L(θ̂; yi)/∂θ] [∂ log L(θ̂; yi)/∂θ′]   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ∂θ′.
Then,

√n (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ]⁻¹ ( (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ ),

where the bracketed factor →p J and the second factor →d N(0, I).
Hence, √n(θ̂ − θ0) →d N(0, J⁻¹IJ⁻¹). Also note that, this time, we have J⁻¹IJ⁻¹ as the asymptotic covariance matrix, rather than J⁻²I, which is the equivalent of J⁻¹IJ⁻¹ in the scalar case.
Up to now, we have assumed that we know the true distribution that generates the observed data and tried to estimate the true parameter vector, θ0. In reality, the data generating process would rarely (if at all) be known by the researcher.
To make this point clear, suppose that the data generating process is given by the
cdf
G Y (y ) ,
and by its associated density function, gY (y ).
Do we know what the data generating process is? We might have some idea about
it but usually the short answer is “no.”
So what do we do then? When we model data, we believe (hope?) that the distribution function we have chosen is indeed the data generating process. Let's define this chosen distribution as

F_Y(y; θ),

with its associated density function, f_Y(y; θ).
Usually,

G_Y(y) ≠ F_Y(y; θ).

But this is not the end of the story.
When the chosen likelihood function is not the true data generating process, maximum likelihood estimation is known as quasi or pseudo maximum likelihood estimation.
We will only outline the main ideas without getting into formal details.
Under standard assumptions,

(1/n) ∑_{i=1}^n log L(θ; yi) →p ∫ log f(y; θ) g(y) dy = E_g[log f(y; θ)],

so the quasi ML estimator θ̂ converges to the maximiser of E_g[log f(y; θ)], say θ*, and √n(θ̂ − θ*) is asymptotically normal with covariance matrix J_g⁻¹ I_g J_g⁻¹, where

I_g = E_g[ (∂ log L(θ*; Yi)/∂θ)(∂ log L(θ*; Yi)/∂θ′) ]   and   J_g = E_g[ ∂² log L(θ*; Yi)/∂θ∂θ′ ].
These can consistently be estimated by

Î_g = (1/n) ∑_{i=1}^n (∂ log L(θ̂; yi)/∂θ)(∂ log L(θ̂; yi)/∂θ′)   and   Ĵ_g = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ∂θ′,

in the sense that

Ĵ_g⁻¹ Î_g Ĵ_g⁻¹ →p J_g⁻¹ I_g J_g⁻¹.
All these results can be proved by using the same arguments as for the case where the likelihood function is identical to the data generating process. The only difference is that this time the expansion should be about θ̂ = θ*.
Clearly, the main ideas are the same as before. What changes is the interpretation of
what θ̂ converges to.
Quasi Maximum Likelihood Estimation
It is crucial to underline that the three nice properties we have proved do not
necessarily hold for quasi maximum likelihood estimation.
In particular, generally,

I_g ≠ J_g.

Therefore, the asymptotic variance matrix does not simplify to I_g⁻¹ anymore.
The term J_g⁻¹ I_g J_g⁻¹ is known as the sandwich matrix. Likewise, Ĵ_g⁻¹ Î_g Ĵ_g⁻¹ is generally called the sandwich estimator.
Remember that all these results are for the iid case. As we relax the iid assumption,
we may have to deal with further issues, especially in terms of consistent estimation
of the sandwich matrix. We will not deal with these.
In many cases, convergence to the true parameter value is still possible, even when
the chosen density function is not the correct one.
Example: We suppose that Y1 , ..., Yn is an iid sequence from N (µ, σ2 ). However,
the truth is that although Yi are iid, the distribution is not Normal. In addition, the
true mean and variance are given by µ0 and σ20 , respectively.
Let θ̂ = (µ̂, σ̂²)′ and θ0 = (µ0, σ0²)′. Now, in order to ensure θ̂ →p θ0, we need to have

E_g[ ∂ log L(θ; Yi)/∂θ |θ=θ0 ] = 0.
Then, we have

E_g[ ∂ log L(θ; Yi)/∂σ² |θ=θ0 ] = −n/(2σ0²) + (1/(2σ0⁴)) E_g[ ∑_{i=1}^n (Yi − µ0)² ].

Now,

E_g[(Yi − µ0)²] = σ0²,

by definition, so

E_g[ ∂ log L(θ; Yi)/∂σ² |θ=θ0 ] = −n/(2σ0²) + (1/(2σ0⁴)) n σ0² = 0.

Moreover,

E_g[ ∂ log L(θ; Yi)/∂µ |θ=θ0 ] = (1/σ0²) E_g[ ∑_{i=1}^n (Yi − µ0) ] = 0.
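The point of this example can be checked with a short simulation (illustrative, not from the slides): fit the Gaussian quasi-likelihood to data that are genuinely non-normal, here uniform, and observe that the QML estimators (which are the sample mean and biased sample variance, by the same algebra as before) still recover µ0 and σ0²:

```python
import numpy as np

rng = np.random.default_rng(4)
# True DGP: iid Uniform(0, 1), which is NOT normal. mu0 = 0.5, sigma0^2 = 1/12.
y = rng.uniform(0, 1, size=100000)

# Gaussian quasi-ML estimators: sample mean and (biased) sample variance.
mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

print(mu_hat, sigma2_hat)  # approximately 0.5 and 1/12 ≈ 0.0833
```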
[Figure: S&P500 daily returns (%) and squared daily returns (%).]
Figure: Daily returns and squared returns on S&P500 index from 2 February 2001 to 16 January 2008.
Engle’s (1982) idea: there is no predictability in asset returns, but there is some
predictability in the asset return volatility.
High volatility periods are clustered together. Likewise, low volatility periods are
clustered together.
Here is the model. Let rt be the return on some asset at time t, t = 1, ..., T. Let F_{t−1} be the information set at time t − 1 (e.g. all stock returns, volatilities etc. up to and including period t − 1).
Suppose for simplicity that E[rt | F_{t−1}] = 0 for all t. This is not a crazy assumption for stock returns. Indeed, they simply fluctuate around zero.
The ARCH model is then given by

εt | F_{t−1} ∼ N(0, σ²t),
σ²t = ω + α ε²_{t−1},
ω > 0, α ≥ 0.
His doctoral student Tim Bollerslev came up with an extension in 1986, which proved to be one of the most popular models in econometrics: the Generalised ARCH (GARCH) model given by

σ²t = ω + α ε²_{t−1} + β σ²_{t−1}.

This paper is among the most cited papers published in the Journal of Econometrics.
This is not a small difference. If anything, it has been more successful empirically.
In 2003, Robert Engle shared the Nobel prize with Sir Clive Granger.
The idea is simple: today's volatility is affected by (i) yesterday's shock (ε²_{t−1}) and (ii) yesterday's volatility (σ²_{t−1}).
This model is usually estimated using the ML method.
MLE in Action: Financial Volatility Estimation
Under the GARCH model,

rt | F_{t−1} ∼ N(0, σ²t).

Then,

L(ω, α, β; rt | F_{t−1}) = [1/√(2πσ²t)] exp[ −(1/2)(rt/σt)² ],

and

∏_{t=2}^T L(ω, α, β; rt | F_{t−1}) = [ 1 / √( (2π)^{T−1} ∏_{t=2}^T σ²t ) ] exp[ −(1/2) ∑_{t=2}^T (rt/σt)² ].
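This likelihood can be maximised numerically. The sketch below is not code from the course; the simulated data, the starting values, the initialisation of σ²1 at the unconditional variance, and conditioning on the first observation are all assumptions of this example:

```python
import numpy as np
from math import log, pi, sqrt
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Simulate T observations from a GARCH(1,1) with illustrative true values.
T, omega0, alpha0, beta0 = 3000, 0.1, 0.1, 0.8
r = np.empty(T)
s2 = omega0 / (1 - alpha0 - beta0)          # start at the unconditional variance
for t in range(T):
    r[t] = sqrt(s2) * rng.standard_normal()
    s2 = omega0 + alpha0 * r[t] ** 2 + beta0 * s2

def neg_loglik(params):
    """Negative Gaussian log-likelihood, conditioning on the first observation."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                        # crude way to impose the constraints
    s2 = omega / (1 - alpha - beta)          # assumed initial conditional variance
    nll = 0.0
    for t in range(1, T):
        s2 = omega + alpha * r[t - 1] ** 2 + beta * s2
        nll += 0.5 * (log(2 * pi) + log(s2) + r[t] ** 2 / s2)
    return nll

res = minimize(neg_loglik, x0=[0.05, 0.05, 0.85], method="Nelder-Mead")
omega_hat, alpha_hat, beta_hat = res.x
print(omega_hat, alpha_hat, beta_hat)
```

With a few thousand observations the estimates should land reasonably close to the true (ω, α, β).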
[Figure: estimated conditional variance (%) over 2001-2008.]
Before we finish this part, let's look at the basic consistency proof for maximum likelihood (and related) estimators.
Now, as before, let for some iid sequence Yi , i = 1, ..., n, the log-likelihood function
be given by
`(θ; yi ),
where θ is a possibly vector-valued parameter.
Remember that the maximum likelihood estimator that we considered so far is given by

θ̂ = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n ℓ(θ; yi), where Θ is the parameter space.
What is the basic proof for θ̂ →p θ0?
The following is a slightly less difficult version of Theorem 2.1 (and of its Proof) from Newey and McFadden (1994, pp. 2121-2122).
Let Q̂n(θ) = (1/n) ∑_{i=1}^n ℓ(θ; yi) and Q0(θ) = E[ℓ(θ; yi)].
Theorem 2.1 (based on Newey and McFadden, 1994): If there is a function Q0(θ) such that (i) for each η > 0,

Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) > 0,

and (ii)

sup_{θ∈Θ} | Q̂n(θ) − Q0(θ) | →p 0,

then

θ̂ →p θ0.
Another way of stating Assumption (i) is to say that θ0 is uniquely identifiable.
Proof: For any ε > 0, we can show that with probability approaching one (w.p.a.1),

(i) Q̂n(θ̂) ≥ Q̂n(θ0),   (ii) Q0(θ̂) > Q̂n(θ̂) − ε/2,   (iii) Q̂n(θ0) > Q0(θ0) − ε/2.

How? (i) follows from the fact that θ̂ is the maximiser of Q̂n(θ); (ii) and (iii) follow from the uniform convergence result, which implies that w.p.a.1

sup_{θ∈Θ} | Q̂n(θ) − Q0(θ) | < ε/2.

Then, w.p.a.1,

Q0(θ̂) > Q̂n(θ̂) − ε/2 ≥ Q̂n(θ0) − ε/2 > Q0(θ0) − ε.

Therefore, for any ε > 0, Q0(θ̂) > Q0(θ0) − ε, w.p.a.1.
Which ε to choose? Actually, let our choice for ε be such that

ε = Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) > 0.

Then, w.p.a.1,

Q0(θ̂) > Q0(θ0) − [ Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) ] = sup_{{θ : ||θ−θ0|| > η}} Q0(θ).

Therefore, w.p.a.1, θ̂ ∉ {θ : ||θ−θ0|| > η} or, equivalently, θ̂ ∈ {θ : ||θ−θ0|| ≤ η}.
Now, this is all valid for any ε > 0. ε can be arbitrarily close to zero, which requires η to be arbitrarily close to zero, which in turn implies that θ̂ will be arbitrarily close to θ0, w.p.a.1. In other words,

θ̂ →p θ0.
In the original version of the Theorem, Assumption (i) is replaced by the following
assumptions:
1 Q0 (θ ) is uniquely maximised at θ 0 ,
2 Θ is compact,
3 Q0 (θ ) is continuous.