In this part, we will talk about estimation. Our focus will almost exclusively be on
the maximum likelihood method.
We have worked with many distributions so far: calculated their expectations and variances, derived their moment generating functions, and so on.
Importantly, the setting was such that we knew which distribution we were considering AND we had full knowledge of the parameter values of that distribution. Or, to put it more precisely, we never contemplated the possibility that they might not be known.
Then, there are two implicit assumptions:
1 We know the distribution.
2 We know the parameters of the distribution.
In real life, this is rarely the case. We will first relax the second assumption and later on dispense with the first.
The treatment will sacrifice formality and focus instead on ideas. References for more formal treatments will be provided at the end of this set of slides.
Now, let's assume we have a random sample consisting of X1, X2, ..., Xn from the density f_X(x | θ0).
We would like to determine the value of θ 0 , which is unknown.
We could use an estimator.
An estimator is some function of the data,

θ̂n = W(X1, ..., Xn). (1)

The index n underlines the fact that the particular value of the estimate depends on the sample (and, so, on its size). Note that usually n is dropped and simply θ̂ is used.
Note the difference between the estimator and the estimate. The estimator is a concept while the estimate is the value of the estimator for a given sample. So, if the estimator is W(X1, ..., Xn), then the estimate for a particular realisation of X1, ..., Xn is given by W(x1, ..., xn).
Now, although the definition given in (1) implies that any function of the data could be a valid estimator, we usually look for those that have desirable properties.
In other words, an estimator is a statistic (it cannot depend on θ or any other unknown parameters) that has desirable properties.
We have actually introduced one of these desirable properties: consistency. Others
are unbiasedness, minimum mean squared error, minimum variance etc.
Let Θ be the parameter space for θ. An estimator θ̂ of θ0 is a minimum mean squared error estimator if, for every θ0 ∈ Θ,

θ̂ = arg min_{θ̃} E[(θ̃ − θ0)²],

where the minimisation is over estimators θ̃. An estimator θ̂ is unbiased if

E[θ̂] = θ0.
You will learn more about these in your future econometrics courses.
Let us first dissect the notation. Suppose we are dealing with some generic distribution such that

F_Y(y; θ), θ ∈ Θ.
F is the cdf, Y is the random variable and y is a particular realisation of Y .
θ is a vector which contains the distribution parameters. This is generally known as
the parameter vector.
The parameter vector takes on values on a set, Θ, known as the parameter space.
For example, for a normal random variable, θ = (µ, σ²)′. Keep in mind the distinction:
θ : population,
θ̂ : sample.
The maximum likelihood method is a very popular and powerful method for estimating θ when the underlying distribution function, F_Y, is known (or when one believes that one actually knows the underlying distribution).
The likelihood function is simply the joint density, viewed as a function of the parameters:

f_Y(y; θ) = L(θ; y).
[Figure: histogram of the data, with the candidate densities (µ = 0, σ² = 1) and (µ = 0, σ² = 4) overlaid.]
Figure: We have 10,000 iid observations from the true distribution N(0, σ²). Two possible values for σ². Which one is the correct one? Can we use the data to make a decision?
The dataset would preferably consist of many observations on the same random variable. This ensures that we have sufficient information to estimate θ. Consider some simple examples.
Example: Let Yi be an iid random sequence where i = 1, ..., n. Let also Yi ∼ N(µ, σ²), where Θ = {(µ, σ²) : −∞ < µ < ∞, σ² > 0} gives the parameter space. Then, thanks to the independence assumption, the joint likelihood function is given by

f_Y(y; θ) = ∏_{i=1}^n f_{Y_i}(yi; θ),

where y = (y1, ..., yn).
Notice that the parameter vector, θ, is common to all variables.
Example (linear regression): yi = xi′β + ui, where ui is independent of xi. Here, β = (0.3, 0.4)′, σ² = 1 and xi = (x_{i1}, x_{i2})′.
Example (ARCH): the conditional variance follows

σ²t = ω + α y²_{t−1}.
For dependent data, the joint density can be factorised into a product of conditionals:

f_{Y1,...,YT} = f_{Y2,...,YT | Y1} f_{Y1}
             = f_{Y3,...,YT | Y1,Y2} f_{Y2|Y1} f_{Y1}
             = f_{Y4,...,YT | Y1,Y2,Y3} f_{Y3|Y1,Y2} f_{Y2|Y1} f_{Y1}
             ⋮
             = ∏_{t=1}^T f_{Yt | Yt−1,...,Y1}.
The ML estimator θ̂ solves the first-order condition

∂ log L(θ; y)/∂θ |θ=θ̂ = 0.

The second-order condition for a maximum is

∂² log L(θ; y)/∂θ∂θ′ |θ=θ̂ < 0,

i.e. the Hessian evaluated at θ̂ is negative definite.
Then,

ℓ(θ; y) = log L(θ; y) = −(n/2) log 2π − (n/2) log σ² − (1/(2σ²)) ∑_{i=1}^n (yi − µ)².

Obviously, θ = (µ, σ²)′. Let's find the ML estimators.
Now,

∂ℓ(θ; y)/∂µ |µ=µ̂, σ²=σ̂² = (1/σ̂²) ∑_{i=1}^n (yi − µ̂) = 0,

and

∂ℓ(θ; y)/∂σ² |µ=µ̂, σ²=σ̂² = −n/(2σ̂²) + (1/(2σ̂⁴)) ∑_{i=1}^n (yi − µ̂)² = 0.

Solving these two equations gives

µ̂ = ȳ = (1/n) ∑_{i=1}^n yi   and   σ̂² = (1/n) ∑_{i=1}^n (yi − µ̂)² = (1/n) ∑_{i=1}^n (yi − ȳ)².

Therefore, θ̂ = (ȳ, (1/n) ∑_{i=1}^n (yi − ȳ)²)′.
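As a quick numerical sanity check (a sketch, not part of the slides; the simulated data and starting values are arbitrary), the closed-form ML estimators derived above can be compared against a direct numerical maximisation of the Gaussian log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)  # illustrative simulated sample

# Closed-form ML estimators derived above: sample mean and *biased* variance.
mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

# Numerical check: maximise the log-likelihood directly.
def neg_loglik(theta):
    mu, log_s2 = theta            # optimise log(sigma^2) to keep the variance positive
    s2 = np.exp(log_s2)
    n = y.size
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_s2
            + ((y - mu) ** 2).sum() / (2 * s2))

res = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
mu_num, s2_num = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma2_hat)   # closed form
print(mu_num, s2_num)       # numerical maximiser agrees
```

The two pairs of numbers should match to several decimal places.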
In the next few slides, we will cover some important common properties of likelihood
functions.
In this discussion, we will assume that the data generating process and the chosen underlying distribution are the same:

g_Y(y) = f_Y(y; θ0).

This assumption is crucial. Relaxing it is possible; we will do this later. For the time being, we will stick to the simpler case.
Now,

E_f[ ∂ log L(θ; y)/∂θ |θ=θ0 ] = ∫ [ ∂L(θ; y)/∂θ · (1/L(θ; y)) ] |θ=θ0 f(y; θ0) dy
                              = ∫ ∂L(θ; y)/∂θ |θ=θ0 dy        (since L(θ0; y) = f(y; θ0))
                              = ∫ ∂f(y; θ)/∂θ |θ=θ0 dy
                              = ∂/∂θ [ ∫ f(y; θ) dy ] |θ=θ0
                              = ∂/∂θ [1] |θ=θ0
                              = 0,

where we implicitly assumed that the order of integration and differentiation can be exchanged. An aside: this requires that the range of y does not depend on θ.
Hence, the expectation of the first-order condition, evaluated at the true parameter value, is zero!
(Bilkent) ECON509 This Version: 16 December 2013 21 / 73
Maximum Likelihood Estimation
Properties of the Maximum Likelihood Estimator
Information Matrix Equality:

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = −Cov_f( ∂ log L(θ; y)/∂θ |θ=θ0 ),

where, as before, the expectation and the covariance are taken with respect to f_Y(y; θ0).
Proof: Now, one can show that

∂² log L(θ; y)/∂θ∂θ′ = [1/L(θ; y)] ∂²L(θ; y)/∂θ∂θ′ − [1/L(θ; y)²] [∂L(θ; y)/∂θ] [∂L(θ; y)/∂θ′].
Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = ∫ [ (1/L(θ; y)) ∂²L(θ; y)/∂θ∂θ′ ] |θ=θ0 f(y; θ0) dy
                                  − ∫ [ (1/L(θ; y)²) (∂L(θ; y)/∂θ)(∂L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy.

For the first term on the right-hand side,

∫ [ (1/L(θ; y)) ∂²L(θ; y)/∂θ∂θ′ ] |θ=θ0 f(y; θ0) dy = ∫ ∂²f(y; θ)/∂θ∂θ′ |θ=θ0 dy
                                                    = ∂²/∂θ∂θ′ [ ∫ f(y; θ) dy ] |θ=θ0
                                                    = ∂²/∂θ∂θ′ [1] |θ=θ0
                                                    = 0.

These hold if, of course, we can exchange the order of integration and differentiation, which is implicitly assumed here.
Then,

E_f[ ∂² log L(θ; y)/∂θ∂θ′ |θ=θ0 ] = −∫ [ (1/L(θ; y)²) (∂L(θ; y)/∂θ)(∂L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy
                                  = −∫ [ (∂ log L(θ; y)/∂θ)(∂ log L(θ; y)/∂θ′) ] |θ=θ0 f(y; θ0) dy
                                  = −E_f[ (∂ log L(θ; y)/∂θ)(∂ log L(θ; y)/∂θ′) |θ=θ0 ]
                                  = −Cov_f( ∂ log L(θ; y)/∂θ |θ=θ0 ),

since the score, ∂ log L(θ; y)/∂θ |θ=θ0, has zero mean.
Cramér-Rao Lower Bound: for any unbiased estimator θ̃ of θ0,

Var_f(θ̃) ≥ I⁻¹,

in the sense that the difference between the two matrices is non-negative definite.
Proof: For simplicity of exposition we focus on the univariate case, where θ is a scalar. Since θ̃ is unbiased for θ0, we have

E_f[θ̃] = ∫ θ̃ f(y; θ0) dy = θ0.

Differentiating both sides with respect to θ (and exchanging differentiation and integration) gives

1 = ∫ θ̃ ∂f(y; θ)/∂θ |θ=θ0 dy = E_f[ θ̃ ∂ log f(y; θ)/∂θ |θ=θ0 ].

Now consider the covariance between θ̃ and ∂ log f(y; θ)/∂θ |θ=θ0. Since the score has zero mean, and Cov(X, Y) = E[XY] whenever E[Y] = 0, we obtain

1 = E_f[ θ̃ ∂ log f(y; θ)/∂θ |θ=θ0 ] = Cov_f( θ̃, ∂ log f(y; θ)/∂θ |θ=θ0 ).

Remember that by the Cauchy-Schwarz Inequality, for any two random variables X and Y,

Cov_f(X, Y)² ≤ Var_f(X) Var_f(Y).

Then,

Cov_f( θ̃, ∂ log f(y; θ)/∂θ |θ=θ0 )² = 1² ≤ Var_f(θ̃) Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ).
This gives

Var_f(θ̃) ≥ 1 / Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ).

Example: Let Yi be iid N(θ, 1). Then

ℓ(θ; y) = −(n/2) log 2π − (1/2) ∑_{i=1}^n (yi − θ)².
Setting the derivative ∂ℓ(θ; y)/∂θ = ∑_{i=1}^n (yi − θ) to zero at θ = θ̂ implies that

∑_{i=1}^n yi − ∑_{i=1}^n θ̂ = 0,

and, so,

θ̂ = (1/n) ∑_{i=1}^n yi = ȳ.
In addition,

Var_f( ∂ log f(y; θ)/∂θ |θ=θ0 ) = Var_f( ∑_{i=1}^n (yi − θ0) ) = ∑_{i=1}^n Var_f(yi) = n,

since Var_f(yi) = 1 for each i.
Therefore, as far as this problem is concerned, the Cramér-Rao bound for any unbiased estimator θ̃ is given by

Var_f(θ̃) ≥ 1/n.
Now, the variance of the ML estimator is very easy to find:

Var(θ̂) = Var( (1/n) ∑_{i=1}^n yi ) = (1/n²) ∑_{i=1}^n Var(yi) = (1/n²) n = 1/n.

But this is the same as the Cramér-Rao bound. Hence, the ML estimator in this particular case is an efficient estimator.
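A quick Monte Carlo check of this efficiency claim (a sketch under the model Yi ∼ N(θ0, 1); the sample size, number of replications, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, theta0 = 50, 20000, 0.0

# Draw `reps` samples of size n from N(theta0, 1) and record the ML
# estimate (the sample mean) for each replication.
samples = rng.normal(theta0, 1.0, size=(reps, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.var())  # should be close to the Cramer-Rao bound 1/n
print(1 / n)
```

The simulated variance of θ̂ across replications should sit right at 1/n = 0.02.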
Remember that we restrict ourselves to the case where our random sequence
Y1 , ..., Yn is iid. Let y = (y1 , ..., yn ) .
Let f_{Y_i}(yi; θ) = L(θ; yi) be the pdf (or the likelihood function) for Yi. Then, the joint pdf or the joint likelihood function is given by

L(θ; y) = ∏_{i=1}^n L(θ; yi).
Now, suppose that log L(θ; y1), log L(θ; y2), ... is an iid sequence where E|log L(θ; yi)| < ∞ for all i.
Then, by the strong Law of Large Numbers,

(1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)].
Let θ̂ be the maximum likelihood estimator of the true parameter value θ0. Can we use the previous convergence result to argue that we will also have

θ̂ →a.s. θ0?
Suppose, further, that the convergence holds uniformly over the parameter space:

sup_{θ∈Θ} | (1/n) ∑_{i=1}^n log L(θ; yi) − E_{f(y|θ0)}[log L(θ; yi)] | →a.s. 0.
Moreover, by definition,

θ̂ = arg max_{θ∈Θ} log L(θ; y).

Then, since (1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)] uniformly for all θ, we can intuitively argue that the argument that maximises (1/n) ∑_{i=1}^n log L(θ; yi) will also converge to the argument that maximises E_{f(y|θ0)}[log L(θ; yi)]. In other words,

(1/n) ∑_{i=1}^n log L(θ; yi) →a.s. E_{f(y|θ0)}[log L(θ; yi)] uniformly for all θ,

so

θ̂ = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n log L(θ; yi) →a.s. arg max_{θ∈Θ} E_{f(y|θ0)}[log L(θ; y)] = θ0. (3)
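A tiny simulation (purely illustrative; the N(θ0, 1) model, sample sizes, and seed are assumptions of this example) shows the consistency result in action, since in that model the ML estimator is the sample mean:

```python
import numpy as np

rng = np.random.default_rng(6)
theta0 = 1.0

errors = []
for n in [10, 100, 1000, 10000]:
    y = rng.normal(theta0, 1.0, size=n)
    theta_hat = y.mean()            # the ML estimator in the N(theta, 1) model
    errors.append(abs(theta_hat - theta0))

print(errors)  # the estimation error tends to shrink as n grows
```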
Now, suppose that we have already proved that θ̂ →a.s. θ0 (which implies that θ̂ →p θ0). How do we show the asymptotic normality of θ̂?
We start with a first-order expansion of the score function about θ̂ = θ0. Our attention will again be on the univariate case, where θ is a scalar.
For the sake of notational simplicity, whenever we write

(1/n) ∑_{i=1}^n ∂ log L(θ̄; yi)/∂θ, etc.,

we will mean

(1/n) ∑_{i=1}^n ∂ log L(θ; yi)/∂θ |θ=θ̄.

Throughout, we also assume that the order of integration and differentiation can be interchanged and that {∂ log L(θ; yi)/∂θ}_{i=1,...,n} and {∂² log L(θ; yi)/∂θ²}_{i=1,...,n} are both iid sequences.
We will this time use a mean value expansion. This is almost the same as the Taylor expansion. The only difference is that, instead of including a remainder term, the final term in the expansion is evaluated at the so-called mean value.
Hence, a kth order mean value expansion of some function f(x) about x = x0 is given by

f(x) = f(x0) + f⁽¹⁾(x0)(x − x0) + ... + [1/(k−1)!] f⁽ᵏ⁻¹⁾(x0)(x − x0)^{k−1} + (1/k!) f⁽ᵏ⁾(x̃)(x − x0)^k,

where x̃ ∈ [min(x, x0), max(x, x0)].
Then, we have

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ] (θ̂ − θ0),

where θ̃ lies between θ̂ and θ0.
Now, since θ̂ →a.s. θ0 and since θ̃ is always between θ̂ and θ0, we have θ̃ →a.s. θ0 as well.
Remember that, by definition,

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = 0.
Then, we have

0 = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ] (θ̂ − θ0).
Let's try to prove that there indeed is a CLT for the highlighted term on the previous slide.
Remember what we need to have for the existence of a CLT: an iid sequence with finite mean and variance.
Firstly, by assumption,

{ ∂ log L(θ0; yi)/∂θ }_{i=1,...,n}

is an iid sequence.
Also, by the unbiasedness of the score and by a moment assumption we made previously,

E[ ∂ log L(θ0; yi)/∂θ ] = 0 < ∞   and   0 < Var[ ∂ log L(θ0; yi)/∂θ ] < ∞.

Note that the second result above follows from

E[ (∂ log L(θ0; yi)/∂θ)² ] = Var[ ∂ log L(θ0; yi)/∂θ ].

Define

I = Var[ ∂ log L(θ0; yi)/∂θ ].
Then, we have the following Central Limit Theorem result:

√n [ (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ − E[∂ log L(θ0; yi)/∂θ] ] / √I →d N(0, 1),

where the expectation term is zero. Equivalently,

(1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ →d N(0, I).
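A small simulation illustrates this CLT for the score (illustrative assumptions: the N(θ0, 1) model, for which the score of a single observation is simply yi − θ0 and I = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta0 = 200, 10000, 1.0

y = rng.normal(theta0, 1.0, size=(reps, n))
score = y - theta0                   # d/d(theta) log L(theta0; y_i) in the N(theta, 1) model
z = score.sum(axis=1) / np.sqrt(n)   # (1/sqrt(n)) * sum of scores, per replication

print(z.mean(), z.var())  # approximately 0 and I = 1
```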
Similarly, by assumption,

{ ∂² log L(θ; yi)/∂θ² }_{i=1,...,n}

is iid while

E| ∂² log L(θ; yi)/∂θ² | < ∞.

Then, under some other standard assumptions, one can show that

(1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² − (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ² →p 0.

Note that the main tricky point here is that the arguments of the two functions are different (θ̃ vs θ0). Nevertheless, this type of result is pretty standard.
Combining the expansion with these convergence results,

√n (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ² ]⁻¹ (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ
             =d −[ (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ² ]⁻¹ (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ →d N(0, J⁻²I),

where =d stands for "equal in distribution."
The term

I = Var[ ∂ log L(θ0; yi)/∂θ ]

is generally referred to as the expected (Fisher) information, or more precisely, the Fisher Information.
Moreover,

∂² log L(θ0; yi)/∂θ²

is generally known as the Hessian, with the associated expected Hessian given by

J = E[ ∂² log L(θ0; yi)/∂θ² ].
The quantities I and J can easily be consistently estimated by using their sample counterparts; that is,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ0; yi)/∂θ ]²   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ0; yi)/∂θ².

Of course, when θ0 is unknown (as is always the case), one uses the estimated parameter value, θ̂. Therefore,

Î = (1/n) ∑_{i=1}^n [ ∂ log L(θ̂; yi)/∂θ ]²   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ².
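Under the sign convention above (J is the expected Hessian, without a minus sign), the information matrix equality reads I = −J. This is easy to check numerically; the exponential distribution used below is not from the slides, it is just a convenient scalar example where both estimates have simple closed forms:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.exponential(scale=2.0, size=20000)   # Exponential sample, rate lambda0 = 0.5

lam_hat = 1 / y.mean()                       # ML estimator of the rate

# log f(y; lam) = log(lam) - lam * y
score = 1 / lam_hat - y                      # d/d(lam) log f, evaluated at lam_hat
I_hat = (score ** 2).mean()                  # outer-product estimate of I
J_hat = -1 / lam_hat ** 2                    # Hessian d^2/d(lam)^2 log f = -1/lam^2

print(I_hat, -J_hat)   # information matrix equality: I ≈ -J
```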
Remember that

I⁻¹ = [ Var( ∂ log L(θ0; yi)/∂θ ) ]⁻¹

is the Cramér-Rao lower bound. Hence, this confirms that, in this given framework, the ML estimator's asymptotic variance has the minimum variance property.
Before we move on, let's do the derivation of the asymptotic distribution for the case where θ is a (p × 1) vector rather than a scalar.
Again, we start with a Taylor expansion, but note that this time, we are dealing with vectors.

(1/n) ∑_{i=1}^n ∂ log L(θ̂; yi)/∂θ = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0). (4)

Here, the left-hand side, the first term on the right and (θ̂ − θ0) are (p × 1), while the term in square brackets is (p × p).
Here,

∂² log L(θ̃; yi)/∂θ∂θ′ =
⎡ ∂² log L(θ̃; yi)/∂θ1∂θ1   ∂² log L(θ̃; yi)/∂θ1∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θ1∂θp ⎤
⎢ ∂² log L(θ̃; yi)/∂θ2∂θ1   ∂² log L(θ̃; yi)/∂θ2∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θ2∂θp ⎥
⎢           ⋮                         ⋮               ⋱             ⋮            ⎥
⎣ ∂² log L(θ̃; yi)/∂θp∂θ1   ∂² log L(θ̃; yi)/∂θp∂θ2   ⋯   ∂² log L(θ̃; yi)/∂θp∂θp ⎦

A technical aside: the parameter θ̃ appearing in each entry in the above matrix is understood to (possibly) differ from entry to entry.
All that we had assumed for the scalar terms in the scalar case are now assumed to hold entry by entry for all matrices.
Now, rearranging (4) gives

0 = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ + [ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0)

⇒ −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ] (θ̂ − θ0) = (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ

⇒ (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ]⁻¹ (1/n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ,
and

J = E[ ∂² log L(θ0; Yi)/∂θ∂θ′ ], a (p × p) matrix.

Here, the score is the (p × 1) vector

∂ log L(θ0; yi)/∂θ = ( ∂ log L(θ0; yi)/∂θ1 , ∂ log L(θ0; yi)/∂θ2 , ... , ∂ log L(θ0; yi)/∂θp )′,

and ∂ log L(θ0; yi)/∂θ′ is simply the transpose of this vector.
The sample counterparts are now

Î = (1/n) ∑_{i=1}^n [∂ log L(θ̂; yi)/∂θ] [∂ log L(θ̂; yi)/∂θ′]   and   Ĵ = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ∂θ′.
Then,

√n (θ̂ − θ0) = −[ (1/n) ∑_{i=1}^n ∂² log L(θ̃; yi)/∂θ∂θ′ ]⁻¹ ( (1/√n) ∑_{i=1}^n ∂ log L(θ0; yi)/∂θ ),

where the bracketed factor →p J and the second factor →d N(0, I).
Hence, √n(θ̂ − θ0) →d N(0, J⁻¹IJ⁻¹). Also note that, this time, we have J⁻¹IJ⁻¹ as the asymptotic covariance matrix, rather than J⁻²I, which is the equivalent of J⁻¹IJ⁻¹ in the scalar case.
Up to now, we have assumed that we know the true distribution that generates the observed data and tried to estimate the true parameter vector, θ0. In reality, the data generating process would rarely (if at all) be known by the researcher.
To make this point clear, suppose that the data generating process is given by the
cdf
G Y (y ) ,
and by its associated density function, gY (y ).
Do we know what the data generating process is? We might have some idea about
it but usually the short answer is “no.”
So what do we do then? When we model data, we believe (hope?) that the distribution function we have chosen is indeed the data generating process. Let's define this chosen distribution as

F_Y(y; θ),

with its associated density function, f_Y(y; θ).
Usually,

G_Y(y) ≠ F_Y(y; θ).

But this is not the end of the story.
When the chosen likelihood function is not the true data generating process, maximum likelihood estimation is known as quasi or pseudo maximum likelihood estimation.
We will only outline the main ideas without getting into formal details.
Under standard assumptions,

(1/n) ∑_{i=1}^n log L(θ; yi) →p ∫ log f(y; θ) g(y) dy = E_g[log f(y; θ)],

so the quasi ML estimator θ̂ converges to the maximiser of E_g[log f(y; θ)], say θ*, and √n(θ̂ − θ*) is asymptotically normal with covariance matrix J_g⁻¹ I_g J_g⁻¹, where

I_g = E_g[ (∂ log L(θ*; Yi)/∂θ)(∂ log L(θ*; Yi)/∂θ′) ]   and   J_g = E_g[ ∂² log L(θ*; Yi)/∂θ∂θ′ ].
These can consistently be estimated by

Î_g = (1/n) ∑_{i=1}^n (∂ log L(θ̂; yi)/∂θ)(∂ log L(θ̂; yi)/∂θ′)   and   Ĵ_g = (1/n) ∑_{i=1}^n ∂² log L(θ̂; yi)/∂θ∂θ′,

in the sense that

Ĵ_g⁻¹ Î_g Ĵ_g⁻¹ →p J_g⁻¹ I_g J_g⁻¹.
All these results can be proved by using the same arguments as for the case where the likelihood function is identical to the data generating process. The only difference is that this time the expansion should be about θ̂ = θ*.
Clearly, the main ideas are the same as before. What changes is the interpretation of
what θ̂ converges to.
Quasi Maximum Likelihood Estimation
It is crucial to underline that the three nice properties we have proved do not
necessarily hold for quasi maximum likelihood estimation.
In particular, generally,

I_g ≠ J_g.

Therefore, the asymptotic variance matrix does not simplify to I_g⁻¹ anymore.
The term J_g⁻¹ I_g J_g⁻¹ is known as the sandwich matrix. Likewise, Ĵ_g⁻¹ Î_g Ĵ_g⁻¹ is generally called the sandwich estimator.
Remember that all these results are for the iid case. As we relax the iid assumption,
we may have to deal with further issues, especially in terms of consistent estimation
of the sandwich matrix. We will not deal with these.
In many cases, convergence to the true parameter value is still possible, even when
the chosen density function is not the correct one.
Example: We suppose that Y1 , ..., Yn is an iid sequence from N (µ, σ2 ). However,
the truth is that although Yi are iid, the distribution is not Normal. In addition, the
true mean and variance are given by µ0 and σ20 , respectively.
Let θ̂ = (µ̂, σ̂²)′ and θ0 = (µ0, σ0²)′. Now, in order to ensure θ̂ →p θ0, we need to have

E_g[ ∂ log L(θ; Yi)/∂θ |θ=θ0 ] = 0.
Then, we have

E_g[ ∂ log L(θ; Yi)/∂σ² |θ=θ0 ] = −n/(2σ0²) + (1/(2σ0⁴)) E_g[ ∑_{i=1}^n (Yi − µ0)² ].

Now,

E_g[(Yi − µ0)²] = σ0²,

by definition, so

E_g[ ∂ log L(θ; Yi)/∂σ² |θ=θ0 ] = −n/(2σ0²) + (1/(2σ0⁴)) n σ0² = 0.

Moreover,

E_g[ ∂ log L(θ; Yi)/∂µ |θ=θ0 ] = (1/σ0²) E_g[ ∑_{i=1}^n (Yi − µ0) ] = 0.
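The point of this example can be checked with a short simulation (illustrative, not from the slides): fit the Gaussian quasi-likelihood to data that are genuinely non-normal, here uniform, and observe that the QML estimators (which are the sample mean and biased sample variance, by the same algebra as before) still recover µ0 and σ0²:

```python
import numpy as np

rng = np.random.default_rng(4)
# True DGP: iid Uniform(0, 1), which is NOT normal. mu0 = 0.5, sigma0^2 = 1/12.
y = rng.uniform(0, 1, size=100000)

# Gaussian quasi-ML estimators: sample mean and (biased) sample variance.
mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

print(mu_hat, sigma2_hat)  # approximately 0.5 and 1/12 ≈ 0.0833
```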
[Figure: S&P500 daily returns (%) and squared daily returns (%).]
Figure: Daily returns and squared returns on S&P500 index from 2 February 2001 to 16 January 2008.
Engle’s (1982) idea: there is no predictability in asset returns, but there is some
predictability in the asset return volatility.
High volatility periods are clustered together. Likewise, low volatility periods are
clustered together.
Here is the model. Let rt be the return on some asset at time t, t = 1, ..., T. Let F_{t−1} be the information set at time t − 1 (e.g. all stock returns, volatilities etc. up to and including period t − 1).
Suppose for simplicity that E[rt | F_{t−1}] = 0 for all t. This is not a crazy assumption for stock returns. Indeed, they simply fluctuate around zero.
The ARCH model is then given by

εt | F_{t−1} ∼ N(0, σ²t),
σ²t = ω + α ε²_{t−1},
ω > 0, α ≥ 0.
His doctoral student Tim Bollerslev came up with an extension in 1986, which proved to be one of the most popular models in econometrics: the Generalised ARCH (GARCH) model given by

σ²t = ω + α ε²_{t−1} + β σ²_{t−1}.

This paper is among the most cited papers published in the Journal of Econometrics.
This is not a small difference. If anything, it has been more successful empirically.
In 2003, Robert Engle shared the Nobel prize with Sir Clive Granger.
The idea is simple: today's volatility is affected by (i) yesterday's shock (ε²_{t−1}) and (ii) yesterday's volatility (σ²_{t−1}).
This model is usually estimated using the ML method.
MLE in Action: Financial Volatility Estimation
Under the GARCH model,

rt | F_{t−1} ∼ N(0, σ²t).

Then,

L(ω, α, β; rt | F_{t−1}) = [1/√(2πσ²t)] exp[ −(1/2)(rt/σt)² ],

and

∏_{t=2}^T L(ω, α, β; rt | F_{t−1}) = [ 1 / √( (2π)^{T−1} ∏_{t=2}^T σ²t ) ] exp[ −(1/2) ∑_{t=2}^T (rt/σt)² ].
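This likelihood can be maximised numerically. The sketch below is not code from the course; the simulated data, the starting values, the initialisation of σ²1 at the unconditional variance, and conditioning on the first observation are all assumptions of this example:

```python
import numpy as np
from math import log, pi, sqrt
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Simulate T observations from a GARCH(1,1) with illustrative true values.
T, omega0, alpha0, beta0 = 3000, 0.1, 0.1, 0.8
r = np.empty(T)
s2 = omega0 / (1 - alpha0 - beta0)          # start at the unconditional variance
for t in range(T):
    r[t] = sqrt(s2) * rng.standard_normal()
    s2 = omega0 + alpha0 * r[t] ** 2 + beta0 * s2

def neg_loglik(params):
    """Negative Gaussian log-likelihood, conditioning on the first observation."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf                        # crude way to impose the constraints
    s2 = omega / (1 - alpha - beta)          # assumed initial conditional variance
    nll = 0.0
    for t in range(1, T):
        s2 = omega + alpha * r[t - 1] ** 2 + beta * s2
        nll += 0.5 * (log(2 * pi) + log(s2) + r[t] ** 2 / s2)
    return nll

res = minimize(neg_loglik, x0=[0.05, 0.05, 0.85], method="Nelder-Mead")
omega_hat, alpha_hat, beta_hat = res.x
print(omega_hat, alpha_hat, beta_hat)
```

With a few thousand observations the estimates should land reasonably close to the true (ω, α, β).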
[Figure: estimated conditional variance (%) over 2001-2008.]
Before we finish this part, let's look at the basic consistency proof for maximum likelihood (and related) estimators.
Now, as before, let for some iid sequence Yi , i = 1, ..., n, the log-likelihood function
be given by
`(θ; yi ),
where θ is a possibly vector-valued parameter.
Remember that the maximum likelihood estimator that we considered so far is given by

θ̂ = arg max_{θ∈Θ} (1/n) ∑_{i=1}^n ℓ(θ; yi), where Θ is the parameter space.
What is the basic proof for θ̂ →p θ0?
The following is a slightly less difficult version of Theorem 2.1 (and of its Proof) from Newey and McFadden (1994, pp. 2121-2122).
Let Q̂n(θ) = (1/n) ∑_{i=1}^n ℓ(θ; yi) and Q0(θ) = E[ℓ(θ; yi)].
Theorem 2.1 (based on Newey and McFadden, 1994): If there is a function Q0(θ) such that (i) for each η > 0,

Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) > 0,

and (ii)

sup_{θ∈Θ} | Q̂n(θ) − Q0(θ) | →p 0,

then

θ̂ →p θ0.
Another way of stating Assumption (i) is to say that θ0 is uniquely identifiable.
Proof: For any ε > 0, we can show that with probability approaching one (w.p.a.1),

(i) Q̂n(θ̂) ≥ Q̂n(θ0),   (ii) Q0(θ̂) > Q̂n(θ̂) − ε/2,   (iii) Q̂n(θ0) > Q0(θ0) − ε/2.

How? (i) follows from the fact that θ̂ is the maximiser of Q̂n(θ); (ii) and (iii) follow from the uniform convergence result, which implies that w.p.a.1

sup_{θ∈Θ} | Q̂n(θ) − Q0(θ) | < ε/2.

Then, w.p.a.1,

Q0(θ̂) > Q̂n(θ̂) − ε/2 ≥ Q̂n(θ0) − ε/2 > Q0(θ0) − ε.

Therefore, for any ε > 0, Q0(θ̂) > Q0(θ0) − ε, w.p.a.1.
Which ε to choose? Actually, let our choice for ε be such that

ε = Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) > 0.

Then, w.p.a.1,

Q0(θ̂) > Q0(θ0) − [ Q0(θ0) − sup_{{θ : ||θ−θ0|| > η}} Q0(θ) ] = sup_{{θ : ||θ−θ0|| > η}} Q0(θ).

Therefore, w.p.a.1, θ̂ ∉ {θ : ||θ−θ0|| > η} or, equivalently, θ̂ ∈ {θ : ||θ−θ0|| ≤ η}.
Now, this is all valid for any ε > 0. ε can be arbitrarily close to zero, which requires η to be arbitrarily close to zero, which in turn implies that θ̂ will be arbitrarily close to θ0, w.p.a.1. In other words,

θ̂ →p θ0.
In the original version of the Theorem, Assumption (i) is replaced by the following
assumptions:
1 Q0 (θ ) is uniquely maximised at θ 0 ,
2 Θ is compact,
3 Q0 (θ ) is continuous.