Lecture 7: Asymptotic Behavior of Estimators
We now turn our attention to the limiting behavior of estimators as the sample size tends
to infinity. This is referred to as the asymptotic behavior of an estimator.
7.1 Consistency
$$\operatorname{Var}_\theta(\hat\theta_n^{\mathrm{MLE}}) = \frac{\theta(1-\theta)}{n} \to 0.$$
Now consider the Bayes estimator θ̂n^B = (X + a)/(n + a + b), which is the posterior mean of θ under a Beta(a, b) prior. Then
$$\mathrm{E}_\theta(\hat\theta_n^B) = \frac{n\theta + a}{n + a + b} \to \theta, \qquad \operatorname{Var}_\theta(\hat\theta_n^B) = \frac{n\theta(1-\theta)}{(n + a + b)^2} \to 0,$$
For now, it will suffice to briefly summarize these regularity conditions as follows:
- The data X = (X1, …, Xn) is an iid sample with likelihood L_x(θ) = ∏_{i=1}^n L_{x_i}(θ).
- The parameter space Θ is an open subset of R. (Note that Θ = R is allowed.)
- The set 𝒳 = {x1 ∈ R : L_{x_1}(θ) > 0} (called the support) does not depend on θ.
- If L_{x_1}(θ1) = L_{x_1}(θ2) for all x1 ∈ 𝒳 (except possibly for some set of values with probability zero), then θ1 = θ2.
- The likelihood L_{x_1}(θ) must satisfy certain smoothness conditions as a function of θ.
Practically speaking, the most commonly encountered violation of these conditions is a situation where the set 𝒳 = {x1 ∈ R : L_{x_1}(θ) > 0} depends on θ (for example, if the distribution of the data is uniform over some interval that depends on θ). However, if all conditions are satisfied, then we have the following result.
Theorem 7.1.5. Let θ̂n be the maximum likelihood estimator of θ based on the sample Xn = (X1, …, Xn). Then under the regularity conditions of Section 7.4, θ̂n is a consistent estimator of θ.
Proof. The proof is beyond the scope of this course.
Note: The result of Theorem 7.1.5 actually follows as a corollary of Theorem 7.2.4,
which will be presented in the next section. However, this logical implication is useless
for actually proving Theorem 7.1.5 because the result of Theorem 7.1.5 will be used in
the proof of Theorem 7.2.4.
Recall that there are some settings in which the maximum likelihood estimator for particular
data values (i.e., the maximum likelihood estimate) may not exist or be unique. The result
of Theorem 7.1.5 implicitly includes the result that the maximum likelihood estimate exists
and is unique for all sufficiently large n.
Note: Again, this conclusion about existence and uniqueness of the MLE for all sufficiently large n is a result of Theorem 7.1.5 that automatically holds when the regularity
conditions are satisfied. In other words, it is not an additional assumption.
7.2 Asymptotic Distribution of the MLE
Earlier in the course, we derived the asymptotic distribution of certain estimators by various
ad hoc approaches. In particular, we have often exploited the central limit theorem and delta
method for estimators that can be expressed as some type of average from an iid sample.
However, we now develop a much more general theory to address the asymptotic distribution
of the maximum likelihood estimator.
Restrictions and Extensions
Some regularity conditions will still be required (see Section 7.4), so the results developed
in this section will not apply to all maximum likelihood estimators. On the other hand, this
basic approach is general enough to be extended to other M-estimators as well, though such
extensions are beyond the scope of this course.
Derivatives of the Log-Likelihood
Throughout the lecture, we will write derivatives of the likelihood and log-likelihood as
$$L'_X(\theta) = \frac{\partial}{\partial\theta}\,L_X(\theta), \qquad L''_X(\theta) = \frac{\partial^2}{\partial\theta^2}\,L_X(\theta), \qquad \ell'_X(\theta) = \frac{\partial}{\partial\theta}\,\ell_X(\theta), \qquad \ell''_X(\theta) = \frac{\partial^2}{\partial\theta^2}\,\ell_X(\theta).$$
Lemma 7.2.1. Under the regularity conditions of Section 7.4, E_θ[ℓ'_X(θ)] = 0, and
$$I_n(\theta) = \operatorname{Var}_\theta[\ell'_X(\theta)] = -\mathrm{E}_\theta[\ell''_X(\theta)] = -n\,\mathrm{E}_\theta[\ell''_{X_1}(\theta)].$$
Proof. See the proof of Theorem 8.8.1 of DeGroot & Schervish for all but the last equality. For the last equality, simply note that ℓ''_X(θ) = ∑_{i=1}^n ℓ''_{X_i}(θ), and the terms in the sum are identically distributed.
Lemma 7.2.1 provides the formula by which we usually calculate the Fisher information in practice, for two reasons:
- The second derivative may have fewer terms to consider than the first derivative.
- We only need to find a single expectation, as opposed to finding a variance or the expectation of a squared quantity.
When Lemma 7.2.1 applies, the quantity −E_θ[ℓ''_{X_1}(θ)] can be called I_1(θ), the information per observation.
Example 7.2.2: Let X1, …, Xn ~ iid Poisson(λ), where λ > 0 is unknown. Suppose we want to find the Fisher information In(λ). We find
$$\ell'_{X_1}(\lambda) = \frac{\partial}{\partial\lambda}\,[-\lambda + X_1 \log\lambda - \log(X_1!)] = -\left(1 - \frac{X_1}{\lambda}\right) = \frac{X_1 - \lambda}{\lambda},$$
$$\ell''_{X_1}(\lambda) = \frac{\partial}{\partial\lambda}\left(-1 + \frac{X_1}{\lambda}\right) = -\frac{X_1}{\lambda^2},$$
so that
$$I_1(\lambda) = \operatorname{Var}_\lambda[\ell'_{X_1}(\lambda)] = \frac{\mathrm{E}_\lambda[(X_1 - \lambda)^2]}{\lambda^2} = -\mathrm{E}_\lambda[\ell''_{X_1}(\lambda)] = \frac{\mathrm{E}_\lambda(X_1)}{\lambda^2} = \frac{1}{\lambda},$$
and hence In(λ) = n I_1(λ) = n/λ.
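The two information identities in Lemma 7.2.1 are easy to check numerically. The following is a minimal simulation sketch (not part of the notes; the true mean, sample size, and seed are arbitrary choices) comparing Var[ℓ'_{X_1}(λ)] and −E[ℓ''_{X_1}(λ)] to I_1(λ) = 1/λ for the Poisson model:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.5                        # arbitrary true Poisson mean
x = rng.poisson(lam, size=200_000)

# Per-observation score and second derivative for the Poisson model:
#   l'(lam) = -1 + x/lam,   l''(lam) = -x/lam**2
score = -1.0 + x / lam
second = -x / lam**2

# Both expressions from Lemma 7.2.1 should approximate I_1(lam) = 1/lam = 0.4
print(np.var(score))     # Var[l'(lam)]
print(-np.mean(second))  # -E[l''(lam)]
```

Both printed values should agree with 1/λ up to Monte Carlo error, and the average score should be near zero, illustrating E_λ[ℓ'_{X_1}(λ)] = 0.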
Theorem 7.2.4. Let θ̂n be the maximum likelihood estimator of θ based on the sample Xn = (X1, …, Xn). Then under the regularity conditions of Section 7.4,
$$\sqrt{n}\,(\hat\theta_n - \theta) \to_D N\!\left[0, \frac{1}{I_1(\theta)}\right].$$
Proof. A fully rigorous proof is beyond the scope of this course, but we can still provide the basic idea. Begin with a Taylor expansion of ℓ'_{X_n}(θ̂n) around θ:
$$\ell'_{X_n}(\hat\theta_n) = \ell'_{X_n}(\theta) + (\hat\theta_n - \theta)\,\ell''_{X_n}(\theta) + \cdots,$$
where we can ignore the higher-order terms. (The justification of this claim is where most of the regularity conditions are used.) Now observe that the left-hand side is zero, so (neglecting the higher-order terms)
$$\sqrt{n}\,(\hat\theta_n - \theta) \approx \sqrt{n}\left[-\frac{\ell'_{X_n}(\theta)}{\ell''_{X_n}(\theta)}\right] = \frac{\sqrt{n}\left[\tfrac{1}{n}\,\ell'_{X_n}(\theta) - 0\right]}{-\tfrac{1}{n}\,\ell''_{X_n}(\theta)}.$$
By Lemma 7.2.1, the numerator is √n times the centered average of iid terms with mean 0 and variance I_1(θ), so it converges in distribution to N[0, I_1(θ)] by the central limit theorem; the denominator converges in probability to I_1(θ) by the law of large numbers. Thus, by Slutsky's theorem,
$$\sqrt{n}\,(\hat\theta_n - \theta) \to_D N\!\left[0, \frac{I_1(\theta)}{[I_1(\theta)]^2}\right] = N\!\left[0, \frac{1}{I_1(\theta)}\right].$$
As you might expect, we will exploit these asymptotic results heavily for inference procedures
as the course continues.
Example 7.2.5: Let X1, …, Xn ~ iid Poisson(λ), where λ > 0 is unknown. The MLE λ̂n = X̄n lies in the parameter space (0, ∞) unless every observation equals zero. Since
$$P\!\left(\bigcup_{n=1}^{\infty}\{X_n > 0\}\right) = 1 - \prod_{n=1}^{\infty} P(X_n = 0) = 1 - \lim_{n\to\infty}\exp(-n\lambda) = 1 - 0 = 1,$$
the MLE exists for all sufficiently large n with probability 1. Then by Theorem 7.2.4 and Example 7.2.2,
$$\sqrt{n}\,(\hat\lambda_n - \lambda) \to_D N(0, \lambda),$$
since 1/I_1(λ) = λ. (Of course, we could have obtained the same result by the central limit theorem.)
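As a quick illustration (not part of the notes; the parameter values and seed are arbitrary), we can simulate many replicate samples and check that √n(λ̂n − λ) has mean near 0 and variance near λ, as Theorem 7.2.4 predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 400, 20_000

# One MLE (the sample mean) per replicate sample of size n
mle = rng.poisson(lam, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (mle - lam)

# Theorem 7.2.4 predicts z is approximately N(0, 1/I_1(lam)) = N(0, lam)
print(z.mean(), z.var())
```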
Define the observed Fisher information Jn = −ℓ''_{X_n}(θ̂n). Then n⁻¹ Jn →_P I_1(θ), so
$$\sqrt{J_n}\,(\hat\theta_n - \theta) = \sqrt{\frac{n^{-1} J_n}{I_1(\theta)}}\;\sqrt{n\,I_1(\theta)}\,(\hat\theta_n - \theta) \to_D N(0, 1)$$
by Theorem 7.1.5 and Slutsky's theorem. Thus, if n is large, then the distribution of θ̂n is approximately normal with mean θ and variance 1/Jn. Notice that Jn is simply a second derivative evaluated at a particular point, which is usually straightforward to calculate numerically. (This approach is often used by statistical software to calculate approximate variances of maximum likelihood estimators.)
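The numerical recipe just described can be sketched as follows (an illustrative example, not from the notes: the data, the seed, and the finite-difference step h are arbitrary choices). For the Poisson model the exact answer 1/Jn = λ̂n/n is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 3.0, 500
x = rng.poisson(lam, n)

def loglik(t):
    # Poisson log-likelihood, dropping the additive -log(x!) terms
    return -n * t + x.sum() * np.log(t)

mle = x.mean()
h = 1e-4
# Observed information J_n = -l''(mle), via a central second difference
Jn = -(loglik(mle + h) - 2 * loglik(mle) + loglik(mle - h)) / h**2

print(1 / Jn)   # approximate variance of the MLE
print(mle / n)  # exact value of 1/J_n for the Poisson model
```

The finite-difference approximation agrees with the closed-form value to many digits, which is why software can use this device even when no closed form is available.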
Asymptotic Distribution of Functions of MLEs
Recall that if θ̂ is an MLE of θ, then g(θ̂) is an MLE of g(θ). If g is continuously differentiable at the true value θ, then we can combine the delta method with Theorem 7.2.4 to obtain
$$\sqrt{n}\,[g(\hat\theta_n) - g(\theta)] \to_D N\!\left[0, \frac{[g'(\theta)]^2}{I_1(\theta)}\right].$$
Example 7.2.6: Let X1, …, Xn ~ iid Poisson(λ), where λ > 0 is unknown. Note that each observation Xi has variance λ and hence standard deviation λ^{1/2}. The MLE of λ^{1/2} is
$$\hat\lambda_n^{1/2} = (\bar X_n)^{1/2},$$
so we may apply the delta method with g(λ) = λ^{1/2} to the result of Example 7.2.5 to obtain
$$\sqrt{n}\,(\hat\lambda_n^{1/2} - \lambda^{1/2}) \to_D N\!\left[0, \lambda\left(\frac{1}{2\lambda^{1/2}}\right)^{\!2}\right] = N\!\left(0, \frac{1}{4}\right).$$
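A small simulation (illustrative, not from the notes; λ, n, and the seed are arbitrary) shows this variance-stabilizing effect: the scaled error of √(X̄n) has variance near 1/4 no matter which λ generated the data.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 4.0, 400, 20_000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (np.sqrt(xbar) - np.sqrt(lam))

# Delta method: limiting variance is lam * (1/(2*sqrt(lam)))**2 = 1/4,
# regardless of the value of lam
print(z.var())
```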
Now suppose the parameter θ = (θ1, …, θp)^T is p-dimensional. Then the first derivative of the log-likelihood is the gradient vector
$$\ell'_X(\theta) = \left(\frac{\partial}{\partial\theta_1}\,\ell_X(\theta), \ldots, \frac{\partial}{\partial\theta_p}\,\ell_X(\theta)\right)^{\!T}.$$
Next, under certain generalized versions of the regularity conditions of Section 7.4, the Fisher
information (now a p p matrix) can be calculated as
$$I(\theta) = \operatorname{Var}_\theta[\ell'_X(\theta)] = -\mathrm{E}_\theta[\ell''_X(\theta)] = -n\,\mathrm{E}_\theta[\ell''_{X_1}(\theta)] = n\,I_1(\theta),$$
where
$$\ell''_X(\theta) = \begin{pmatrix} \dfrac{\partial^2}{\partial\theta_1^2}\,\ell_X(\theta) & \cdots & \dfrac{\partial^2}{\partial\theta_1\,\partial\theta_p}\,\ell_X(\theta) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2}{\partial\theta_p\,\partial\theta_1}\,\ell_X(\theta) & \cdots & \dfrac{\partial^2}{\partial\theta_p^2}\,\ell_X(\theta) \end{pmatrix}$$
is the matrix of second partial derivatives, sometimes called the Hessian. Then, again under suitable generalizations of the regularity conditions of Section 7.4, the MLE θ̂n satisfies
$$\sqrt{n}\,(\hat\theta_n - \theta) \to_D N_p\{0_p,\,[I_1(\theta)]^{-1}\}.$$
Note: You might notice that even if we are willing to overlook the regularity conditions,
there is still a problem with the statement above: we have not defined convergence in
distribution for random vectors. Thus, we will have to settle for understanding the
result above at a more conceptual and imprecise level.
Again, the material here is presented only to provide some basic exposure to these more
advanced multivariate concepts. We will not discuss them rigorously or in any further detail.
7.3 Asymptotic Efficiency
In some situations it may be possible to find the asymptotic distribution of estimators other than the MLE. Many sensible estimators θ̃n of θ exhibit distributional convergence results similar to that of the MLE θ̂n. Specifically, we often obtain a result of the form
$$\sqrt{n}\,(\tilde\theta_n - \theta) \to_D N[0, v(\theta)] \tag{7.3.1}$$
for some function v(θ) called the asymptotic variance.
Note: The factor of √n in the convergence result above can introduce confusion over the term "asymptotic variance." Although we call v(θ) the asymptotic variance, the approximate distribution of θ̃n for large n is normal with mean θ and variance v(θ)/n.
The asymptotic variance provides another way to compare and evaluate estimators. Among estimators with convergence results of the form (7.3.1), we would usually prefer the estimator with the smallest asymptotic variance v(θ).
Asymptotic Relative Efficiency
We quantify this type of asymptotic variance comparison through a quantity called the asymptotic relative efficiency (ARE) of one estimator compared to another. If θ̂n^(1) and θ̂n^(2) are estimators of θ such that
$$\sqrt{n}\,[\hat\theta_n^{(1)} - \theta] \to_D N[0, v^{(1)}(\theta)], \qquad \sqrt{n}\,[\hat\theta_n^{(2)} - \theta] \to_D N[0, v^{(2)}(\theta)],$$
then the asymptotic relative efficiency of θ̂n^(1) compared to θ̂n^(2) is
$$\mathrm{ARE}\,[\hat\theta_n^{(1)}, \hat\theta_n^{(2)}] = \frac{1/v^{(1)}(\theta)}{1/v^{(2)}(\theta)} = \frac{v^{(2)}(\theta)}{v^{(1)}(\theta)}.$$
Example 7.3.1: Let X1, …, Xn ~ iid Poisson(λ), where λ > 0 is unknown, and suppose we wish to estimate θ = exp(−λ) = P_λ(X1 = 0). The MLE is θ̂n = exp(−X̄n), which has asymptotic variance v^{θ̂}(λ) = λ exp(−2λ) by the delta method, while the sample proportion of zeros θ̃n has asymptotic variance v^{θ̃}(λ) = exp(−λ)[1 − exp(−λ)] by the central limit theorem. Then
$$\mathrm{ARE}\,(\tilde\theta_n, \hat\theta_n) = \frac{1/\{\exp(-\lambda)[1 - \exp(-\lambda)]\}}{1/[\lambda \exp(-2\lambda)]} = \frac{\lambda\exp(-\lambda)}{1 - \exp(-\lambda)} = \frac{\lambda}{\exp(\lambda) - 1}.$$
Note that ARE(θ̃n, θ̂n) < 1 for all λ > 0, so θ̃n has larger asymptotic variance than θ̂n for all parameter values. In fact, if λ is even moderately large, then the advantage of the MLE can be quite substantial, e.g., ARE(θ̃n, θ̂n) ≈ 1/67 if λ = 6.
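The ARE calculation for this Poisson example can be checked by simulation; the following sketch (not from the notes; λ, n, and the seed are arbitrary choices) estimates θ = exp(−λ) both ways and compares the ratio of Monte Carlo variances to λ/(e^λ − 1):

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n, reps = 2.0, 500, 10_000

x = rng.poisson(lam, size=(reps, n))
mle = np.exp(-x.mean(axis=1))   # theta-hat: plug-in MLE of exp(-lam)
prop = (x == 0).mean(axis=1)    # theta-tilde: sample proportion of zeros

# The ratio of Monte Carlo variances approximates ARE = lam/(exp(lam) - 1)
print(mle.var() / prop.var())
print(lam / (np.exp(lam) - 1))
```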
Asymptotic relative efficiency can be interpreted in terms of sample sizes. Suppose that ARE[θ̂n^(1), θ̂n^(2)] = 3. Then the distribution of θ̂^(1) based on a sample of size n has approximately the same variance as the distribution of θ̂^(2) based on a sample of size 3n. In other words, an estimator that is three times as efficient as another (based on ARE) needs a sample size only a third as large as the other estimator in order to achieve approximately the same precision.
Asymptotic Efficiency
We might wish to go a step further than simply comparing two estimators using asymptotic
relative efficiency. Specifically, we might ask whether there exists an estimator that minimizes
the asymptotic variance. The following theorem suggests an answer.
Theorem 7.3.2. Let θ̃n be an estimator of θ such that (7.3.1) holds for some v(θ). Then under the regularity conditions of Section 7.4, v(θ) ≥ [I_1(θ)]⁻¹.
Proof. The proof is beyond the scope of this course.
An estimator θ̃n for which (7.3.1) holds with v(θ) = [I_1(θ)]⁻¹ (i.e., an estimator that attains the bound specified by Theorem 7.3.2) is called asymptotically efficient. The following result is then immediate from Theorem 7.2.4.
Corollary 7.3.3. Let θ̂n be the maximum likelihood estimator of θ based on the sample Xn = (X1, …, Xn). Then under the regularity conditions of Section 7.4, the estimator θ̂n is asymptotically efficient.
Estimators that are close enough to the MLE as n → ∞ can also be asymptotically efficient. In particular, Bayes estimators are often asymptotically efficient as well.
Example 7.3.4: Let X1, …, Xn ~ iid Poisson(λ), where λ > 0 is unknown. It can be shown that the posterior mean of λ under a Gamma(a, b) prior is
$$\hat\lambda_n^B = \frac{a + \sum_{i=1}^n X_i}{b + n} = \left(\frac{n}{b+n}\right)\bar X_n + \left(\frac{b}{b+n}\right)\frac{a}{b}.$$
Then
$$\sqrt{n}\,(\hat\lambda_n^B - \lambda) = \sqrt{n}\left[\left(\frac{n}{b+n}\right)\bar X_n + \left(\frac{b}{b+n}\right)\frac{a}{b}\right] - \sqrt{n}\left[\left(\frac{n}{b+n}\right)\lambda + \left(\frac{b}{b+n}\right)\lambda\right]$$
$$= \left(\frac{n}{b+n}\right)\sqrt{n}\,(\bar X_n - \lambda) + \sqrt{n}\left(\frac{b}{b+n}\right)\left(\frac{a}{b} - \lambda\right) \to_D N\!\left[0, \frac{1}{I_1(\lambda)}\right] = N\{0,\,[I_1(\lambda)]^{-1}\},$$
since n/(b + n) → 1, √n(X̄n − λ) →_D N(0, λ) = N(0, 1/I_1(λ)), and √n · b/(b + n) → 0. Thus, the Bayes estimator λ̂n^B is also asymptotically efficient.
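A small simulation (an illustrative sketch, not from the notes; the true mean λ, the prior parameters a and b, and the seed are arbitrary choices) confirms that the √n-scaled errors of both the MLE and the Gamma-posterior mean have variance near 1/I_1(λ) = λ:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, a, b = 2.0, 3.0, 1.5    # true mean; Gamma(a, b) prior (illustrative)
n, reps = 1_000, 5_000

x = rng.poisson(lam, size=(reps, n))
mle = x.mean(axis=1)                   # MLE: sample mean
bayes = (a + x.sum(axis=1)) / (b + n)  # posterior mean under Gamma(a, b)

# Both sqrt(n)-scaled errors should have variance near 1/I_1(lam) = lam
print(np.var(np.sqrt(n) * (mle - lam)))
print(np.var(np.sqrt(n) * (bayes - lam)))
```

For small n the Bayes estimator's scaled variance is shrunk by the factor [n/(b + n)]², but the gap vanishes as n grows, matching the asymptotic-efficiency conclusion.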
- The occasional very large values tend to cause X̄n to also take very large values, and thus θ̂n = exp(−X̄n) tends to underestimate θ quite badly.
- On the other hand, θ̃n still performs reasonably well as an estimator of θ.
The asymptotic relative efficiency as calculated in Example 7.3.1 tells us that the efficiency of θ̃n relative to θ̂n drops off rapidly as λ increases. Thus, the larger the true value of λ is, the more a switch from θ̂n to θ̃n hurts us if it turns out that all of the assumptions of the model are correct after all, i.e., if it actually is true that X1, …, Xn ~ iid Poisson(λ).
7.4 Regularity Conditions
For example, one smoothness condition requires that derivatives up to third order can be interchanged with integration (in the continuous case) or summation (in the discrete case):
$$\int_{\mathcal{X}^n} \frac{d^3}{d\theta^3}\,L_x(\theta)\,dx = \frac{d^3}{d\theta^3}\int_{\mathcal{X}^n} L_x(\theta)\,dx \qquad\text{or}\qquad \sum_{x\in\mathcal{X}^n} \frac{d^3}{d\theta^3}\,L_x(\theta) = \frac{d^3}{d\theta^3}\sum_{x\in\mathcal{X}^n} L_x(\theta).$$
Note that the right-hand side of either equation is equal to (d³/dθ³)(1) = 0.
For any θ0 ∈ Θ, there exist ε_{θ0} > 0 and a function M_{θ0}(x1) such that
$$|\ell'''_{x_1}(\theta)| \le M_{\theta_0}(x_1) \quad \text{for all } \theta \text{ with } |\theta - \theta_0| < \varepsilon_{\theta_0}.$$